Empirical Semantics
1. Empirical Semantics
modelling knowledge as it is,
not as it should be
Frank van Harmelen
Vrije Universiteit Amsterdam
Creative Commons License
CC BY 3.0:
Allowed to copy, redistribute
remix & transform
But must attribute
Many thanks to all at
KR&R@VU: Wouter Beek, Joe
Raad, Peter Bloem, Stefan
Schlobach, Zhisheng Huang,
and many others over the years
2. The ‘K’ in ‘Semantic Web’
stands for ‘Knowledge’
6. OWL Semantics fits on one A4
• The world consists of
– Objects (“individuals”)
– Sets of objects (“types”)
– Pairs of objects (“relations”)
• The world can be described by operations on
these sets: 𝑇1 ∪ 𝑇2, 𝑇1 ∩ 𝑇2, 𝑇1 T2
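The set-based reading above can be mimicked with plain Python sets. This is only an illustrative sketch: the individuals, types and the relation below are hypothetical names chosen for the example, not part of OWL itself.

```python
# Hypothetical individuals, invented for illustration.
individuals = {"aspirin", "ibuprofen", "headache"}

# Types are sets of individuals.
drug = {"aspirin", "ibuprofen"}
analgesic = {"aspirin"}
symptom = {"headache"}

# Relations are sets of pairs of individuals.
treats = {("aspirin", "headache")}

# Describing the world by operations on these sets.
drug_or_symptom = drug | symptom   # union of two types
analgesic_drugs = drug & analgesic  # intersection of two types

print(sorted(drug_or_symptom))
print(sorted(analgesic_drugs))
```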
12. LOD Laundromat
Beek & Rietveld et al. 2014,
LOD laundromat: a uniform way of
publishing other people's dirty data
http://lodlaundromat.org/pdf/lodlaundry.pdf
HDT
Fernández & Martínez-Prieto &
Gutiérrez, 2013, Binary RDF
representation for publication and
exchange (HDT)
LDF
Verborgh & Vander Sande et al.
2014, Web-Scale Querying through
Linked Data Fragments
13. LOD-a-lot
1 file
28,362,198,927 unique triples
>650K data documents
LDF queries in real time
Surprisingly efficient
524 GB of disk space
16 GB of RAM
Only 144 secs loading time
Only €305 hardware cost
Meta-Data for a lot of LOD
http://www.semantic-web-journal.net/content/meta-data-lot-lod-2
http://lod-a-lot.lod.labs.vu.nl/
15. owl:sameAs is not optional
But in practice
it’s broken under
the formal semantics
16. Meet our observatory:
http://SameAs.cc
• 559 million owl:sameAs statements
(we created an HDT file in 4 hours on 1 CPU core:
4.5 GB + 2.2 GB index)
• 50 million equivalence classes after inference
(5 hours on 2 CPU cores; only 9.3 GB of disk (!) with RocksDB)
17. The largest equivalence class has 177,749 entities
and contains:
• Albert Einstein
• all countries of the world
• the empty string
Formal Semantics says:
This is obviously broken…
Refl: ∀𝑥: (𝑥 = 𝑥)
Symm: ∀𝑥, 𝑦: (𝑥 = 𝑦) → (𝑦 = 𝑥)
Trans: ∀𝑥, 𝑦, 𝑧: (𝑥 = 𝑦 ∧ 𝑦 = 𝑧) → (𝑥 = 𝑧)
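To see how one wrong link can merge unrelated things under these three axioms, here is a minimal union-find sketch in Python. It is not the implementation behind the numbers above; the function name and the example identifiers are hypothetical.

```python
from collections import defaultdict


def equivalence_classes(same_as_pairs):
    """Group terms under the reflexive, symmetric, transitive closure."""
    parent = {}

    def find(term):
        # Refl: a term seen for the first time is its own representative.
        parent.setdefault(term, term)
        while parent[term] != term:
            parent[term] = parent[parent[term]]  # path halving
            term = parent[term]
        return term

    for left, right in same_as_pairs:
        # Symm + Trans: merging representatives ignores direction
        # and chains links together.
        root_left, root_right = find(left), find(right)
        if root_left != root_right:
            parent[root_left] = root_right

    classes = defaultdict(set)
    for term in list(parent):
        classes[find(term)].add(term)
    return list(classes.values())


# Hypothetical statements; the second one is a single erroneous link.
statements = [
    ("ex:Albert_Einstein", "ex:A_Einstein"),
    ("ex:A_Einstein", "ex:Germany"),
    ("ex:Germany", "ex:Deutschland"),
    ("ex:Paris", "ex:Paris_France"),
]
for cls in equivalence_classes(statements):
    print(sorted(cls))
```

In this toy input, the one erroneous statement is enough to put a person and a country in the same class.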
21. Debugging identity by community detection
Community 0
1. dbpedia.org/resource/B_hussein_obama
2. dbpedia.org/resource/Barack_H_Obama,_Jr
3. dbpedia.org/resource/Barak_hussein_obama
4. dbpedia.org/resource/President_Barack
5. dbpedia.org/resource/Senator_Barack_Obama
6. dbpedia.org/resource/Obama
…
99. dbpedia.org/resource/Hussein_Obama
Community 3
1. dbpedia.org/resource/Presidency_of_Barack_Obama
2. dbpedia.org/resource/Barack_Obama_Administration
3. dbpedia.org/resource/Barack_Obama_Cabinet
4. dbpedia.org/resource/Obama_White_House
5. dbpedia.org/resource/Obama_regime
6. dbpedia.org/resource/America_under_Obama
…
52. dbpedia.org/resource/Presidential_transition_of_Barack_Obama
Communities correspond to roles:
- Person
- Senator
- President
- Government
22. Message from Empirical Semantics
It’s not the users that got owl:sameAs wrong,
It’s the formal semantics that got reality wrong
Challenge:
What alternative semantic model of equality
would fit the empirically observed usage better?
23. Insights from
Empirical Semantics:
2. Meaningful names
Steven de Rooij, Peter Bloem, Wouter Beek (ISWC 2016)
http://www.cs.vu.nl/~frankh/postscript/ISWC2016.pdf
24. Symbols or words?
(or: blasphemy for logicians)
Formal Semantics says:
Symbol names are supposed to be meaningless
[Figure: graph with nodes Aspirin, headache, analgesic, pain, drug and symptom, and two edges labelled "treats"]
25. Measure mutual information content
between URL-string and semantics
E(x) = length of an efficient encoding of x
If x ≈ y then E(x+y) ≈ E(x) else E(x+y) ≈ E(x)+E(y)
Mutual information content
M(x,y) = E(x) + E(y) - E(x+y)
Take x = symbol name of x as a string
Take 𝑦1 = types of x (≈ semantics of x)
Take 𝑦2 = properties of x (≈ semantics of x)
Calculate M(x, 𝑦1) and M(x, 𝑦2) for all symbols
in 600k datasets
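A minimal sketch of this measure, assuming zlib compression as a stand-in for the efficient encoding E (the slides do not say which encoder was used) and reading x+y as string concatenation. The URL and the two description strings are hypothetical.

```python
import zlib


def encoded_length(text: str) -> int:
    """E(x): byte length of the zlib-compressed UTF-8 text (an assumed encoder)."""
    return len(zlib.compress(text.encode("utf-8"), 9))


def mutual_information(x: str, y: str) -> int:
    """M(x, y) = E(x) + E(y) - E(x+y), with '+' as concatenation."""
    return encoded_length(x) + encoded_length(y) - encoded_length(x + y)


# Hypothetical symbol name and two candidate descriptions of its semantics.
name = "http://example.org/resource/Intensive_Care_Patient_Registration"
related_types = "Intensive Care Patient Registration"
unrelated_types = "qzxv wkjb mfgh"

print(mutual_information(name, related_types))
print(mutual_information(name, unrelated_types))
```

When the name shares substrings with the description, the concatenation compresses better than the two parts separately, so M comes out larger. A general-purpose compressor on short strings is a rough estimate at best.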
26. But URL-strings do encode meaning!
[Figure: fraction of datasets with redundancy for types/predicates at significance level > 0.99; panels: Properties, Types]
BTW, this is 600,000 datapoints (RDF docs)
27. Message from Empirical Semantics
Users shouldn’t stop using meaningful names,
Formal semantics should capture their meaning
Challenge:
What alternative semantic models
could capture meaningful names?
29. Knowledge will be inconsistent
Because of:
• Homonyms
• Different ontological models
• Migration from legacy data
• Integration of multiple sources
• …
30. Inconsistency through migration
DICE terminology,
in daily use at Amsterdam Medical Centre
for registration of Intensive Care patients
• Brain ⊑ CentralNervousSystem
• Brain ⊑ BodyPart
• CentralNervousSystem ⊑ NervousSystem
• BodyPart ⊑ ¬NervousSystem
31. Inconsistency through automated learning
• Reservoir ⊑ Lake
• Lake ⊑ WaterRegion
• Reservoir ⊑ HydrographicStructure
• HydrographicStructure ⊑ Facility
• Disjoint(WaterRegion, Facility)
100% expert agreement
on this disjointness…
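The clash can be made concrete with a small Python sketch that treats each bullet as a subclass axiom, follows the subclass links upwards, and flags any class that ends up below two disjoint classes. This is a toy check for this one pattern, not a description logic reasoner; the function names are hypothetical.

```python
# Subclass axioms from the slide: class -> its direct superclasses.
SUBCLASS_OF = {
    "Reservoir": {"Lake", "HydrographicStructure"},
    "Lake": {"WaterRegion"},
    "HydrographicStructure": {"Facility"},
}
DISJOINT = [("WaterRegion", "Facility")]


def superclasses(cls, subclass_of):
    """All direct and indirect superclasses of cls."""
    seen = set()
    stack = [cls]
    while stack:
        current = stack.pop()
        for parent in subclass_of.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def unsatisfiable_classes(subclass_of, disjoint):
    """Map each class that sits below two disjoint classes to the clashing pairs."""
    classes = set(subclass_of) | {p for ps in subclass_of.values() for p in ps}
    result = {}
    for cls in classes:
        ancestors = superclasses(cls, subclass_of) | {cls}
        clashes = [(a, b) for a, b in disjoint if a in ancestors and b in ancestors]
        if clashes:
            result[cls] = clashes
    return result


print(unsatisfiable_classes(SUBCLASS_OF, DISJOINT))
```

Here only Reservoir is flagged: it can have no instances, and asserting even one reservoir makes the whole knowledge base inconsistent.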
Inconsistency through merging
SUMO(1000) + CYC(1.6M) → 6000 inconsistencies…
34. Reservoir ⊑ Lake
Lake ⊑ WaterRegion
Reservoir ⊑ HydrographicStructure
HydrographicStructure ⊑ Facility
Disjoint(WaterRegion, Facility)
Google Distance as selection function in
local consistency reasoning (ISWC 2008)
Formal Semantics says:
this isn’t supposed to work!
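For reference, a sketch of the standard Normalized Google Distance, assuming that is the measure meant by "Google Distance" here. The hit counts in the example are made up, and how the selection function uses the distance is not shown on the slide.

```python
import math


def normalized_google_distance(fx, fy, fxy, n):
    """NGD from hit counts: fx, fy for each term, fxy for both, n pages in total."""
    if fxy == 0:
        return math.inf  # the terms never co-occur
    log_fx, log_fy, log_fxy = math.log(fx), math.log(fy), math.log(fxy)
    return (max(log_fx, log_fy) - log_fxy) / (math.log(n) - min(log_fx, log_fy))


# Hypothetical counts: names that co-occur often come out closer.
print(normalized_google_distance(1000, 2000, 500, 10**6))
print(normalized_google_distance(1000, 2000, 5, 10**6))
```

The distance depends only on how often the class names co-occur in text, which is exactly the information that the formal semantics treats as meaningless.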
35. Insight from
Empirical Semantics
Users shouldn’t stop using meaningful names,
Formal semantics should capture their meaning
Challenge:
What alternative semantic models
would capture meaningful names?
41. Message from
Empirical Semantics
None of these patterns have any semantic impact
(you can’t even detect them under the traditional semantics)
Challenge:
What alternative semantic models would
take such different patterns into account?
43. So what #1 (pragmatic)
• We now have larger KBs than ever before
• We now have the instruments
to observe and analyse these very large KBs
• We can use these insights for better tools:
– query & inference
– publish & maintain
– visualise & explain
– …
44. So what #2 (pretentious)
My secret hope is that this will help us
to understand the patterns of knowledge:
Not a prescriptive theory of
what knowledge should be,
But a descriptive theory of
what knowledge is actually like