We now have larger Knowledge Bases than ever before (10 billion facts is now a small number).
We now have the instruments to observe and analyse these very large Knowledge Bases.
We can use these insights for better tools for querying, inferencing, publishing, maintaining, visualising and explaining.
1. Creative Commons CC BY 3.0:
allowed to share & remix
(also commercial)
but must attribute
Frank van Harmelen
The empirical turn
in
Knowledge Representation
Contributions from many people
in the KR&R group over many years.
And thanks to NWO for
a €750k TOP grant for this
4. KR metrics
in the pre-empirical era
KR = logic
• Show small examples
• Prove properties
(expressivity, complexity)
• Give algorithms
(sound, complete)
KR = engineering
• Build applications
• Show high performance
• Show low engineering
costs
5. BUT AN EXPERIMENT
IN THE PAST 10 YEARS
MADE IT POSSIBLE
TO DO SOMETHING VERY DIFFERENT:
OBSERVE HOW
KNOWLEDGE REPRESENTATIONS BEHAVE
AT VERY LARGE SCALE
6.
7. Rest of the talk
• Which KR’s were part of the experiment?
• How much of it was there to observe?
• How did we manage to observe it?
• What did we learn from observing it?
22. LOD Laundromat
Beek & Rietveld et al. 2014,
LOD laundromat: a uniform way of
publishing other people's dirty data
http://lodlaundromat.org/pdf/lodlaundry.pdf
HDT
Fernández & Martínez-Prieto &
Gutiérrez, 2013, Binary RDF
representation for publication and
exchange (HDT)
LDF
Verborgh & Vander Sande et al.
2014, Web-Scale Querying through
Linked Data Fragments
24. Surprisingly efficient
1 file
28,362,198,927 unique triples
>650K data documents
524 GB of disk space
16 GB of RAM
Only €305 hardware cost
Meta-Data for a lot of LOD
http://www.semantic-web-journal.net/content/meta-data-lot-lod-2
28. Identity clusters
LOD-a-lot File
http://lod-a-lot.lod.labs.vu.nl
[Fernández 2017]
558 million owl:sameAs statements (309 million distinct terms)
≈ 4 hours
1. Extracting all owl:sameAs statements on the LOD
HDT File
(4.5 GB)
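A minimal sketch of the extraction step, under a loud assumption: the slide works on an HDT file, whereas this example filters plain N-Triples lines using only the standard library. The function name `extract_same_as` and the example IRIs are hypothetical, and the pattern only handles IRI subjects and objects (no literals or blank nodes).

```python
import re

OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"

# Matches "<subject> <owl:sameAs> <object> ." where both ends are IRIs.
# Assumption: input is N-Triples text, not the HDT file used in the talk.
_SAME_AS_LINE = re.compile(
    r"^\s*<([^>]*)>\s+<" + re.escape(OWL_SAME_AS) + r">\s+<([^>]*)>\s*\.\s*$"
)


def extract_same_as(lines):
    """Yield (subject, object) pairs for every owl:sameAs statement."""
    for line in lines:
        match = _SAME_AS_LINE.match(line)
        if match:
            yield match.group(1), match.group(2)


# Hypothetical two-line input: only the first line is an identity statement.
sample = [
    "<http://example.org/x> <http://www.w3.org/2002/07/owl#sameAs> <http://example.org/y> .",
    '<http://example.org/x> <http://www.w3.org/2000/01/rdf-schema#label> "x" .',
]
pairs = list(extract_same_as(sample))
```

Streaming line by line keeps memory use flat, which matters when the input holds hundreds of millions of statements.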
29. HDT File
(4.5 GB)
Identity Closure 1, Identity Closure 2, Identity Closure 89 387 082…
- The largest Identity Closure contains 177 794 terms
(contains all the countries in the world, Albert Einstein, « empty string », etc.)
- The smallest Identity Closure contains 2 terms
x owl:sameAs y
z owl:sameAs y
Identity Closure: {x, y, z}
2. Generating the Identity Closure
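One way to sketch the closure step is a union-find pass over the owl:sameAs pairs, treating each identity closure as a connected component. This is an assumption about how such a closure could be computed, not necessarily the method used for the numbers on these slides; the function name `identity_closures` is hypothetical.

```python
from collections import defaultdict


def identity_closures(same_as_pairs):
    """Group terms into identity closures (connected components of owl:sameAs)."""
    parent = {}

    def find(term):
        # Register unseen terms as their own representative.
        parent.setdefault(term, term)
        root = term
        while parent[root] != root:
            root = parent[root]
        # Path compression: point every visited term directly at the root.
        while parent[term] != root:
            parent[term], term = root, parent[term]
        return root

    for left, right in same_as_pairs:
        root_left, root_right = find(left), find(right)
        if root_left != root_right:
            parent[root_right] = root_left

    closures = defaultdict(set)
    for term in list(parent):
        closures[find(term)].add(term)
    return list(closures.values())


# The example from the slide, plus one unrelated pair (a, b).
closures = identity_closures([("x", "y"), ("z", "y"), ("a", "b")])
```

With the slide's two statements, x, y and z end up in one closure; the extra pair forms a second closure of the minimum size, 2 terms.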
30.
31. Identity Closure « Cities »
3. Detecting Communities (using the Louvain Algorithm)
This network (i.e. identity closure) has a community structure, as it can be grouped into
different sets of nodes, with each set of nodes being densely connected internally.
Goal: find (and later evaluate) the most “suspicious” identity links (i.e. the links
between different communities)
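The "suspicious link" idea can be sketched as follows, assuming the community assignment has already been produced by some Louvain implementation. The Louvain step itself is not shown here; `community_of`, the term names and the function `suspicious_links` are hypothetical illustrations.

```python
def suspicious_links(same_as_pairs, community_of):
    """Return the identity links whose two ends lie in different communities."""
    return [
        (left, right)
        for left, right in same_as_pairs
        if community_of[left] != community_of[right]
    ]


# Hypothetical closure with two densely connected groups and one bridging link.
links = [("a1", "a2"), ("a2", "a3"), ("a1", "a3"), ("a3", "b1"), ("b1", "b2")]

# Hypothetical output of a community detection run (term -> community id).
community_of = {"a1": 0, "a2": 0, "a3": 0, "b1": 1, "b2": 1}

flagged = suspicious_links(links, community_of)
```

Only the bridging link between the two groups is flagged; links inside a community are left alone.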
32. 4. Application: debugging identity statements
Identity closure
containing the term
“dbpedia.org/page/Barack_Obama”
This Identity Closure contains 388 terms
(i.e. 387 distinct terms are owl:sameAs this term)
95 communities detected
largest community = 99 terms
34. Symbols or words?
Steven de Rooij Peter Bloem Wouter Beek (ISWC 2016)
http://www.cs.vu.nl/~frankh/postscript/ISWC2016.pdf
35. Symbols or words?
Symbol names are supposed to be meaningless
[Diagram: example graph with nodes Aspirin, headache, analgesic, pain, drug and symptom, connected by "treats" edges]
36. Measure mutual information content
between string and semantics of a symbol
E(x) = length of an efficient encoding of x
Mutual information content
M(x, y) = E(x) + E(y) - E(x, y)
Take x = the name of a symbol, as a string
Take y₁ = {types of x} ≈ semantics of x
Take y₂ = {properties of x} ≈ semantics of x
Calculate M(x, y₁) and M(x, y₂) for all symbols
in 600k datasets
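A rough sketch of the measure above, with two loud assumptions: zlib's compressed size stands in for the efficient encoding E (the encoding used in the ISWC 2016 paper may well differ), and the joint encoding E(x, y) is approximated by compressing the concatenation of x and y. The function names and the example IRIs are hypothetical.

```python
import zlib


def code_length(data: bytes) -> int:
    """E(x): size in bytes of a zlib-compressed encoding (a stand-in for E)."""
    return len(zlib.compress(data, 9))


def mutual_information(x: bytes, y: bytes) -> int:
    """M(x, y) = E(x) + E(y) - E(x, y), with E(x, y) taken as E(x + y)."""
    return code_length(x) + code_length(y) - code_length(x + y)


# Hypothetical symbol name and type set, encoded as bytes.
name = b"http://example.org/drug/Aspirin"
types = b"http://example.org/class/Drug http://example.org/class/Analgesic"
score = mutual_information(name, types)
```

A higher score means that knowing the name makes the semantics cheaper to encode, i.e. the name is not meaningless. On strings this short, compressor overhead dominates, so the score is only indicative.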
37. But variables do encode meaning!
Fraction of datasets with redundancy for types/predicates
at significance level > 0.99
BTW, this is 600,000 data points (RDF documents)
44. • We now have larger KB’s than ever before
• We now have the instruments
to observe and analyse these very large KB’s
• We can use these insights for better tools:
– query & inference
– publish & maintain
– visualise & explain
– …
45. But my secret hope is that this will help us
to understand the patterns of knowledge:
AI as a computational theory of knowledge