Beyond Linked Data –
Exploiting Entity-Centric Knowledge on the Web
Stefan Dietze
L3S Research Center, Hannover, Germany
- Linked Data on the Web (LDOW2017), WWW2017 -
05/04/17 1Stefan Dietze
Research areas
 Web science, Information Retrieval, Semantic Web, Social Web
Analytics, Knowledge Discovery, Human Computation
 Interdisciplinary application areas: digital humanities,
TEL/education, Web archiving, mobility, ...
Some projects
Research @ L3S
 See also: http://www.l3s.de
Acknowledgements: team
 Pavlos Fafalios (L3S)
 Besnik Fetahu (L3S)
 Elena Demidova (L3S)
 Ujwal Gadiraju (L3S)
 Eelco Herder (L3S)
 Ivana Marenzi (L3S)
 Nicolas Tempelmeier (L3S)
 Ran Yu (L3S)
 Nilamadhaba Mohapatra (L3S, IIT India)
 Bernardo Pereira Nunes (L3S, PUC Rio de Janeiro)
 Mathieu d’Aquin (The Open University, UK)
 Mohamed Ben Ellefi (LIRMM, France)
 Davide Taibi (CNR, Italy)
 Konstantin Todorov (LIRMM, France)
 ...
Back in September 2016
A new look at the semantic web. Abraham
Bernstein, James Hendler, Natalya Noy,
Communications of the ACM, Vol. 59 No. 9, Pages 35-
37, September 2016
Retrieval, Crawling and Fusion of Entity-centric Data
on the Web, Dietze, S., In: Calì A., Gorgan D.,
Ugarte M. (eds) Semantic Keyword-Based Search on
Structured Data Sources. KEYSTONE 2016. LNCS, Vol
10151. Springer, 2017.
Overview
I – Challenges
II – Enabling discovery & search in Linked Data & Knowledge Graphs
 Dataset recommendation
 Dataset profiling
 Entity retrieval
III – Beyond Linked Data – exploiting embedded Web semantics
 Web markup as emerging data source
 Case studies
 Data fusion for entity reconciliation (and retrieval)
IV – Wrap-up
Other emerging forms of
semantics/structured data on
the Web („Future“)
Dealing with heterogeneity &
shortcomings („Present“)
Data accessibility & quality?
SPARQL endpoint availability over time [Buil-Aranda et al 2013]
Accessibility of (linked) datasets?
 Less than 50% of all SPARQL endpoints actually responsive at given point of time [Buil-Aranda2013]
 “THE” SPARQL protocol? No, but variants, subsets and local restrictions
Semantics, links, quality?
 …data accuracy (eg DBpedia)? [Paulheim2013]
 …schema compliance & evolution [HoganJWS2012]
 …vocabulary reuse? [D’AquinWebSci13]
Assessing the Educational Linked Data Landscape, D’Aquin, M., Adamou, A.,
Dietze, S., ACM Web Science 2013 (WebSci2013), Paris, France, May 2013.
Type Inference on Noisy RDF Data, Paulheim H., Bizer, C. Semantic Web – ISWC
2013, Lecture Notes in Computer Science Volume 8218, 2013, pp 510-525
An empirical survey of Linked Data conformance. Hogan, A., Umbrich, J., Harth,
A., Cyganiak, R., Polleres, A., Decker., S., Journal of Web Semantics 14, 2012
SPARQL Web-Querying Infrastructure: Ready for Action?, Carlos Buil-Aranda,
Aidan Hogan, Jürgen Umbrich, Pierre-Yves Vandenbussche, International
Semantic Web Conference 2013 (ISWC2013).
Co-occurrence of types
(in 146 datasets: 144 vocabularies, 588 overlapping types, 719 predicates)
Assessing the Educational Linked Data Landscape,
D’Aquin, M., Adamou, A., Dietze, S., ACM Web
Science 2013 (WebSci2013), Paris, May 2013.
po:Programme
yov:Video
?
bibo:Book
Vocabulary reuse/linking?
typeX
typeX
Co-occurrence after mapping
(201 frequently occurring types, mapped into 79 types)
bibo:Film
bibo:Document
po:Programme
bibo:Book
foaf:Document
yov:Video
typeX
“Completeness” ?
 Example: varying completeness of “book” (“movie”) entity
descriptions
 Missing facts: 49.8% (37.1%) in DBpedia, 63.8% (23.3%) in
Freebase and 60.9 % (40%) in Wikidata
(varies heavily across attributes)
Yu, R., Fetahu, B., Gadiraju, U., Dietze, S., FuseM:
Query-Centric Data Fusion on Structured Web
Markup, ICDE2017.
Yu, R., Fetahu, B., Gadiraju, U., Lehmberg, O., Ritze,
D., Dietze, S., KnowMore - Knowledge Base
Augmentation with Structured Web Markup,
Semantic Web Journal 2017, under review.
Consistency?
Analyzing Relative Incompleteness of Movie
Descriptions in the Web of Data: A Case Study,
Yuan, W., Demidova, E., Dietze, S., Zhu, X.,
ISWC2014
Challenge for search/retrieval – heterogeneity of datasets & entities
Discovery of suitable (1) datasets & (2) entities:
 Quality? Currentness, dynamics, accessibility/reliability,
data quantity & quality?
 Topics/scope? Datasets/entities useful & trustworthy for
topic XY?
 Types? Datasets/entities about statistics, organisations,
videos, slides, publications etc?
II – Enabling discovery & search in Linked Data & Knowledge Graphs
Dataset recommendation I
S
Linkset1
Linkset2
Approach
 Given dataset s, ranking datasets from D
according to probability score (di, t) to
contain linking candidates (entities)
 Features:
 Approach 1: vocabulary overlap
 Approach 2: existing links (SNA)
 Linking candidates likely if datasets share
common (a) schema elements, or (b) links
(friend of a friend)
Conclusions
 Roughly 50% MAP for both approaches
 Simplistic approach (!)
Lopes, G.R., Paes Leme, L. A., Nunes, B.P., Casanova,
M.A., Dietze, S., Two approaches to the dataset
interlinking recommendation problem, 15th
International Conference on Web Information System
Engineering (WISE 2014), Thessaloniki, Greece.
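Approach 1 (vocabulary overlap) can be illustrated with a minimal sketch. It assumes Jaccard similarity over schema terms as a stand-in for the probability score used in the paper; the function names and the term sets are hypothetical, only the dataset names are taken from the ranking on the slide.

```python
from typing import Dict, List, Set, Tuple


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity of two term sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def rank_candidates(source_vocab: Set[str],
                    candidates: Dict[str, Set[str]]) -> List[Tuple[str, float]]:
    """Rank candidate datasets by schema-term overlap with the source dataset."""
    scored = [(name, jaccard(source_vocab, vocab))
              for name, vocab in candidates.items()]
    # Highest overlap first; ties are broken alphabetically for a stable order.
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))


# Invented vocabulary profiles, for illustration only.
source = {"foaf:Person", "dc:title", "bibo:Article"}
candidates = {
    "DBLP": {"foaf:Person", "dc:title", "bibo:Article", "dc:creator"},
    "IBM": {"foaf:Person", "foaf:name"},
    "Roma": {"geo:lat", "geo:long"},
}
ranking = rank_candidates(source, candidates)
```

Datasets that share no schema element with the source end up at the bottom of the ranking, which is the intuition behind "linking candidates likely if datasets share common schema elements".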
Rank
1 DBLP
2 ACM
3 OAI
4 CiteSeer
5 IBM
6 Roma
7 IEEE
8 Ulm
9 Pisa
?
?
Goal: finding candidate datasets, e.g. for entity retrieval
or interlinking tasks (eg enrichment)
Dataset recommendation II
Ben Ellefi, M., Bellahsene, Z., Dietze, S., Todorov, K.,
Intension-based Dataset Recommendation for Data
Linking, 13th Extended Semantic Web Conference
(ESWC2016), Heraklion, Crete, May, 2016, ESWC2016
L. Han, A. L. Kashyap, T. Finin, J. Mayfield, and J. Weese, "UMBC EBIQUITY-CORE: Semantic textual similarity systems", in Proc. of *SEM, Association for Computational Linguistics, 2013.
[Figure labels: Preprocessing, Datasets ranking, Datasets filtering]
Dataset recommendation II: results
Data & ground truth
 Experiments on (responsive) datasets
from LOD Cloud (http://datahub.io)
 Concept profiles from
http://lov.okfn.org
 Ground truth: existing links from VOID
profiles of datasets
(issue: not always representative for
actual linksets)
Results
 MAP for different similarity thresholds
from step 2 max. 54% (UMBC@0.7)
 Recall 100% below indicated similarity
(clustering) thresholds
Ben Ellefi, M., Bellahsene, Z., Dietze, S., Todorov, K.,
Intension-based Dataset Recommendation for Data
Linking, 13th Extended Semantic Web Conference
(ESWC2016), Heraklion, Crete, May, 2016, ESWC2016
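Both recommendation approaches are evaluated with mean average precision (MAP). As a reminder of what that number measures, here is a minimal sketch of the standard definition; the source datasets, rankings and ground-truth links in the example are invented.

```python
from typing import Dict, List, Set


def average_precision(ranked: List[str], relevant: Set[str]) -> float:
    """Average of the precision values at each rank where a relevant item appears."""
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for position, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            precision_sum += hits / position
    return precision_sum / len(relevant)


def mean_average_precision(rankings: Dict[str, List[str]],
                           ground_truth: Dict[str, Set[str]]) -> float:
    """Mean of the average precision over all source datasets."""
    if not rankings:
        return 0.0
    total = sum(average_precision(ranked, ground_truth.get(source, set()))
                for source, ranked in rankings.items())
    return total / len(rankings)


# Invented example: two source datasets with recommended target datasets.
rankings = {"s1": ["a", "b", "c"], "s2": ["x", "y"]}
ground_truth = {"s1": {"a", "c"}, "s2": {"y"}}
map_score = mean_average_precision(rankings, ground_truth)
```

In this toy case the average precision is 5/6 for s1 and 1/2 for s2, giving a MAP of 2/3.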
Dataset search through dataset cataloging & profiling
Dataset
Catalog/Registry
http://data.linkededucation.org/linkedup/catalog/
 LinkedUp project (FP7 project: L3S, OU, OKFN, Elsevier, Exact Learning solutions)
 LinkedUp Catalog: largest collection of LD of educationally relevant resources (approx. 50 Datasets)
 Original datasets published with key content providers, automatically extracted metadata
LinkedUp Catalog: dataset index & registry, federated search
 “Federated queries” through schema mappings [WebSci13]
 Dataset accessibility
 Linking & topic profiling
Schema/Types
http://data.linkededucation.org/linkedup/catalog/
LinkedUp Catalog: dataset index & registry, federated search
 “Federated queries” through schema mappings [WebSci13]
 Dataset accessibility
 Linking & topic profiling [ESWC14]
Dataset topic
profiles
http://data.linkededucation.org/linkedup/catalog/
db:Biology
db:Cell biology
Dataset
Catalog/Registry
yov:Video
<yo:Video …>
<dc:title>Lecture 29 –
Stem Cells</dc:title>
…
</yo:Video…>
Yovisto Video
 Extraction of representative (DBpedia) categories („topic profile“) for arbitrary datasets ?
 Technically trivial through established NER/NED approaches, but scalability issues
(recall: LOD Cloud 1000+ datasets with <100 billion RDF statements)
 Efficient approach: sampling & ranking for balance between scalability and precision /recall
Scalable profiling of datasets
A Scalable Approach for Efficiently Generating
Structured Dataset Topic Profiles, Fetahu, B.,
Dietze, S., Nunes, B. P., Casanova, M. A., Nejdl,
W., 11th Extended Semantic Web Conference
(ESWC2014), Crete, Greece, (2014).
db:Cell
(Biology)
Efficient dataset profiling
1. Sampling of resources
(random sampling, weighted sampling, resource
centrality sampling)
2. Entity- & topic-extraction (NER via DBpedia Spotlight,
category mapping & -expansion)
3. Normalisation & ranking (graph-based models such as
PageRank with Priors, HITS with Priors & K-Step Markov)
 Result: weighted dataset-topic profile graph
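Steps 1 and 3 can be sketched as follows, assuming plain random sampling and a simple power-iteration variant of PageRank with Priors. This is an illustrative sketch rather than the implementation from the paper: the graph, the node labels and the parameter `beta` are assumptions, and step 2 (entity and topic extraction via DBpedia Spotlight) is replaced by hand-written edges.

```python
import random
from typing import Dict, List


def sample_resources(resources: List[str], fraction: float, seed: int = 0) -> List[str]:
    """Step 1 (random sampling variant): keep a fixed fraction of the resources."""
    rng = random.Random(seed)
    size = max(1, round(len(resources) * fraction))
    return rng.sample(resources, size)


def pagerank_with_priors(graph: Dict[str, List[str]], priors: Dict[str, float],
                         beta: float = 0.3, iterations: int = 50) -> Dict[str, float]:
    """Step 3: with probability beta the random walk jumps back to the prior nodes."""
    nodes = set(graph) | {t for targets in graph.values() for t in targets}
    total = sum(priors.values())
    prior = {n: priors.get(n, 0.0) / total for n in nodes}
    rank = dict(prior)
    for _ in range(iterations):
        nxt = {n: beta * prior[n] for n in nodes}
        for node in nodes:
            targets = graph.get(node, [])
            if targets:
                share = (1 - beta) * rank[node] / len(targets)
                for target in targets:
                    nxt[target] += share
            else:
                # Dangling node: hand its mass back to the prior distribution.
                for n in nodes:
                    nxt[n] += (1 - beta) * rank[node] * prior[n]
        rank = nxt
    return rank


# Hypothetical profile graph: dataset -> sampled resources -> DBpedia categories.
graph = {
    "dataset": ["r1", "r2", "r3"],
    "r1": ["db:Biology"],
    "r2": ["db:Biology", "db:Cell_biology"],
    "r3": ["db:Biology"],
}
scores = pagerank_with_priors(graph, priors={"dataset": 1.0})
```

The category scores can then be read as edge weights of the dataset-topic profile graph; here `db:Biology` outranks `db:Cell_biology` because more sampled resources point to it.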
A Scalable Approach for Efficiently Generating
Structured Dataset Topic Profiles, Fetahu, B.,
Dietze, S., Nunes, B. P., Casanova, M. A., Nejdl,
W., 11th Extended Semantic Web Conference
(ESWC2014), Crete, Greece, (2014).
Search & exploration of datasets through topic profiles
 Applied to entire LOD cloud/graph
 Visual exploration of extracted RDF dataset profiles
(datasets, topics, relationships)
 Evaluation results: K-Step Markov (10% sampling size)
outperforms baselines (LDA, tf/idf on entire datasets)
http://data-observatory.org/lod-profiles/
Search: entity retrieval on large LD crawls?
 How to efficiently retrieve (related) entities/resources for given entity-seeking (keyword) query?
 State of the art: BM25F on inverted entity index (Blanco et al, ISWC2011)
 Challenges/observations:
 Explicit entity links (owl:sameAs etc) are sparse yet important to facilitate state of the art methods
 Query type affinity?
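For readers unfamiliar with the BM25F baseline, the following is a minimal sketch of the commonly cited formulation: field-weighted, length-normalised term frequencies are combined first and saturated afterwards. The field names, weights and toy entities are assumptions and do not reproduce the setup of Blanco et al.

```python
import math
from typing import Dict, List

Entity = Dict[str, List[str]]  # field name -> list of tokens


def bm25f_score(query: List[str], entity: Entity, corpus: List[Entity],
                weights: Dict[str, float], k1: float = 1.2, b: float = 0.75) -> float:
    """Score one entity for a keyword query with a simple BM25F variant."""
    n_docs = len(corpus)
    avg_len = {f: sum(len(d.get(f, [])) for d in corpus) / n_docs for f in weights}
    score = 0.0
    for term in query:
        # Document frequency: entities that mention the term in any weighted field.
        df = sum(1 for d in corpus if any(term in d.get(f, []) for f in weights))
        if df == 0:
            continue
        idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
        pseudo_tf = 0.0
        for field, weight in weights.items():
            tokens = entity.get(field, [])
            tf = tokens.count(term)
            if tf == 0 or avg_len[field] == 0:
                continue
            length_norm = 1 - b + b * len(tokens) / avg_len[field]
            pseudo_tf += weight * tf / length_norm
        score += idf * pseudo_tf / (k1 + pseudo_tf)
    return score


# Invented toy index with two fields per entity.
e1 = {"name": ["tim", "berners", "lee"], "description": ["inventor", "of", "the", "web"]}
e2 = {"name": ["world", "wide", "web"], "description": ["invented", "by", "tim", "berners", "lee"]}
e3 = {"name": ["forrest", "gump"], "description": ["movie"]}
corpus = [e1, e2, e3]
weights = {"name": 2.0, "description": 1.0}
query = ["tim", "berners", "lee"]
scores = [bm25f_score(query, e, corpus, weights) for e in corpus]
```

Because the `name` field carries a higher weight, the entity whose name matches the query scores above the one that only mentions it in its description.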
??
Large dataset/crawl
e.g. LinkedUp dataset graph, BTC2014, Dynamic LD Observatory
entities related to <Tim Berners Lee>
?
BTC2014
DyLDO
Entity retrieval: approach
(I) Offline processing (clustering to address link sparsity)
1. Feature vectors (lexical and structural features)
2. Bucketing: per type (LSH algorithm)
3. Clustering: X-means & Spectral clustering per bucket
Improving Entity Retrieval on Structured Data,
Fetahu, B., Gadiraju, U., Dietze, S., 14th
International Semantic Web Conference
(ISWC2015), Bethlehem, US, (2015).
(II) Online processing (retrieval)
1. Retrieval & expansion:
a) BM25F results
b) expansion from clusters (related entities)
2. Re-Ranking
(context terms & query type affinity)
Dataset
 BTC2014 (4 billion entities)
 92 SemSearch queries
Methods
 Our approaches: XM: Xmeans, SP: Spectral
 Baselines B: BM25F, S1: Tonon et al [SIGIR12]
Conclusions
 XM & SP outperform baselines
 Clustering to remedy link sparsity
(yet extensive offline processing required)
 Relevance to query more important than
relevance to BM25F results
Entity retrieval: evaluation
Improving Entity Retrieval on Structured Data,
Fetahu, B., Gadiraju, U., Dietze, S., 14th
International Semantic Web Conference
(ISWC2015), Bethlehem, US, (2015).
PROFILES2017 - Profiling & search of Linked Data
https://profiles2017.wordpress.com/
• Probably co-located with ISWC2017 (Vienna)
• Submissions due 21 June
III – Beyond Linked Data – exploiting embedded Web semantics
 Linked Data: approx.
1000+ datasets & 100 billion statements
 Open Data: XXX datasets
Web semantics & entity-centric Web data
 Web (of documents):
approx. 46.000.000.000.000 (46 trillion)
Web pages indexed by Google
 Other forms of Web semantics
and entity-centric knowledge?
 Dynamics?
 Quality?
 Accessibility?
 Scale?
 Embedded markup (RDFa, Microdata, Microformats) for
interpretation of Web documents (search, retrieval)
 Arbitrary vocabularies; schema.org used at scale:
(700 classes, 1000 predicates)
 Adoption on the Web: 26 %
(2014 Google study of 12 bn Web pages)
 “Web Data Commons” (Meusel & Paulheim [ISWC2014])
• Markup from Common Crawl (3.2 billion pages):
44 billion RDF quads (2016)
• Markup in 38% of pages in 2016
 Same order of magnitude as “the Web” (!)
Embedded Web page markup & schema.org
<div itemscope itemtype="http://schema.org/Movie">
<h1 itemprop="name">Forrest Gump</h1>
<span>Actor: <span itemprop="actor">Tom Hanks</span></span>
<span itemprop="genre">Drama</span>
...
</div>
RDF statements
node1 actor _node-x
node1 actor Robin Wright
node1 genre Comedy
node2 actor T. Hanks
node2 distributed by Paramount Pic.
node3 actor Tom Cruise
node3 distributed by Paramount Pic.
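How such markup turns into statements can be sketched with Python's standard `html.parser`. This is a deliberately small illustration and not a conformant Microdata processor: it assumes flat items (no nested `itemscope`), no void elements such as `<meta>`, and it invents the node identifiers (`node1`, ...) and the `type` predicate label.

```python
from html.parser import HTMLParser
from typing import List, Optional, Tuple


class MicrodataExtractor(HTMLParser):
    """Collects (subject, property, value) triples from flat Microdata markup."""

    def __init__(self) -> None:
        super().__init__()
        self.triples: List[Tuple[str, str, str]] = []
        # One entry per open element: (itemprop name or None, collected text parts).
        self._open: List[Tuple[Optional[str], List[str]]] = []
        self._items = 0
        self._subject = ""

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if "itemscope" in attributes:
            # Each itemscope element starts a new entity description.
            self._items += 1
            self._subject = f"node{self._items}"
            if "itemtype" in attributes:
                self.triples.append((self._subject, "type", attributes["itemtype"]))
        self._open.append((attributes.get("itemprop"), []))

    def handle_data(self, data):
        # Only text directly inside an itemprop element is kept.
        if self._open and self._open[-1][0] is not None:
            self._open[-1][1].append(data)

    def handle_endtag(self, tag):
        if not self._open:
            return
        prop, parts = self._open.pop()
        if prop is not None:
            self.triples.append((self._subject, prop, "".join(parts).strip()))


MARKUP = """
<div itemscope itemtype="http://schema.org/Movie">
  <h1 itemprop="name">Forrest Gump</h1>
  <span>Actor: <span itemprop="actor">Tom Hanks</span></span>
  <span itemprop="genre">Drama</span>
</div>
"""

parser = MicrodataExtractor()
parser.feed(MARKUP)
parser.close()
triples = parser.triples
```

All extracted values are plain strings, which already hints at the "strings, not things" problem discussed later: the actor is a literal, not a linked entity.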
http://webdatacommons.org
 schema:Product instances in WDC2015
 Facts: 1.414.937.431
(= 302.246.120 instances, i.e. products)
 Providers (distinct Pay Level Domains, PLDs): 93.705
 Power law distribution of terms across PLDs
 Top 10 PLDs
 Top provider ? (company)
Example: embedded Web markup about „products“
PLD                           # Resources
www.crateandbarrel.com        33.517.936
www.bentgate.com              17.215.499
www.aliexpress.com             9.621.943
www.ebay.com.au                8.861.308
us.fotolia.com                 7.939.982
www.ebay.co.uk                 6.556.820
www.competitivecyclist.com     6.214.500
www.maxstudio.com              6.075.626
approx. 35 million resources
[Figure: # entities and # statements per PLD (ranked), count on log scale]
Study on sample Web crawl (WDC2015)
 Metadata about scholarly articles (e.g.
s:ScholarlyArticle): 6.793.764 quads, 1.184.623
entities, 429 distinct predicates
(in WDC and for 1 type alone)
 Top 5 domains: Springer, MDPI, BMJ,
mendeley.com, Biodiversitylibrary.org
Domains, topics, disciplines?
 Life Sciences and Computer Science predominant
 Top-10 article titles
 Noise
Example: markup of bibliographic resources
Sahoo, P., Gadiraju, U., Yu, R., Saha, S., Dietze, S.,
Analysing Structured Scholarly Data embedded in Web
Pages, SAVE-SD2016, co-located with WWW2016
Example: markup of learning resources on the Web
 “Learning Resources Metadata Initiative (LRMI)”:
schema.org vocabulary for annotation of learning
resources
 Developed through DCMI Task Force on LRMI
 Approx. 5000 PLDs (incl. subdomains) in CC
 LRMI adoption (WDC) [WWW17]:
 2015: 44,108,511 quads
 2014: 30,599,024 quads
 2013: 10.636873 quads
Dietze, S., Taibi, D., Yu, R., Barker, P., d’Aquin, M., Analysing and
Improving embedded Markup of Learning Resources on the
Web, 26th International World Wide Web Conference
(WWW2017), Digital Learning track, Perth, April 2017.
 Frequent errors and unintended use (e.g. porn)
7xxxtube.com
1amateurporntube.com
virtualpornstars.com
sunriseseniorliving.com
simplyfinance.co.uk
menslifestyles.com
audiobooks.com
simplypsychology.org
helles-koepfchen.de
Entity retrieval on Web markup: state of the art
 Glimmer
(http://glimmer.research.yahoo.com)
 Entity retrieval on WDC dataset
[Blanco, Mika & Vigna, ISWC2011]
 BM25F retrieval model on WDC index
Web markup: challenges
Characteristics: Example
Coreferences: 18.000 results for <„Iphone 6“, type, s:Product> (8,6 quads on average) in CommonCrawl
Redundancy: <s, schema:name, „Iphone 6“> occurring 1000 times in CC
Lack of links: Largely unlinked entity descriptions
Errors (typos & schema violations, see Meusel et al [ESWC2015]):
 Wrong namespaces, such as http://schma.org
 Undefined types & predicates: 9,7 %, less common than in LOD
 Confusion of datatype and object properties: <s1, s:publisher, „Springer“>, 24,35 % object property issues vs 8 % in LOD
 Data property range violations: e.g. literals vs numbers (12,6 % vs 4,6 % in LOD)
 Using markup as knowledge graph, similar to Linked Data?
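Namespace typos such as http://schma.org can often be repaired by fuzzy matching against a list of known vocabularies. The sketch below uses `difflib` from the Python standard library; the namespace list, the function name and the cutoff of 0.9 are assumptions chosen for illustration, not the cleansing heuristics of Meusel et al.

```python
import difflib
from typing import Optional

# Assumed whitelist of well-known vocabulary namespaces.
KNOWN_NAMESPACES = [
    "http://schema.org/",
    "http://xmlns.com/foaf/0.1/",
    "http://purl.org/dc/terms/",
]


def repair_namespace(namespace: str, cutoff: float = 0.9) -> Optional[str]:
    """Return the closest known namespace, or None if nothing is similar enough."""
    candidate = namespace if namespace.endswith("/") else namespace + "/"
    matches = difflib.get_close_matches(candidate, KNOWN_NAMESPACES, n=1, cutoff=cutoff)
    return matches[0] if matches else None


fixed = repair_namespace("http://schma.org")
unknown = repair_namespace("http://example.com/vocab/")
```

A high cutoff keeps the repair conservative: the misspelled schema.org namespace is mapped back, while an unrelated namespace is left unresolved rather than forced onto a known vocabulary.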
A Survey on Challenges for Entity Retrieval in Markup
Data, Yu, R., Gadiraju, U., Fetahu, B., Dietze, S., 15th
International Semantic Web Conference (ISWC2016),
Kobe, Japan (2016).
“Strings, not things”
 Bias towards datatype properties / using any
property as such (!)
 Numbers from LRMI2015 markup corpus:
o 46 million “transversal” quads (i.e. excluding
hierarchical statements such as rdf:type)
o 64 % are actual datatype properties yet 97%
refer to literals (up from 70% in 2013)
 Challenges
o Markup data = flat entity descriptions
(=> fairly unconnected graph)
o Data reuse requires identity resolution
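The share of literal objects quoted above is straightforward to measure on a quad dump. A minimal sketch, assuming N-Quads-style terms where literals are written in double quotes; the four quads are invented examples and not taken from the LRMI corpus.

```python
from typing import Iterable, Tuple

Quad = Tuple[str, str, str, str]  # subject, predicate, object, source page


def is_literal(obj: str) -> bool:
    """N-Quads convention: literal objects start with a double quote."""
    return obj.startswith('"')


def literal_share(quads: Iterable[Quad]) -> float:
    """Fraction of quads whose object is a literal rather than a resource."""
    quads = list(quads)
    if not quads:
        return 0.0
    return sum(is_literal(obj) for _, _, obj, _ in quads) / len(quads)


# Invented sample quads (rdf:type statements are assumed to be filtered out already).
sample = [
    ("_:s1", "s:publisher", '"Springer"', "<http://example.org/a>"),
    ("_:s1", "s:name", '"Stem Cells"', "<http://example.org/a>"),
    ("_:s2", "s:author", '"J. Doe"', "<http://example.org/b>"),
    ("_:s2", "s:publisher", "<http://example.org/org/1>", "<http://example.org/b>"),
]
share = literal_share(sample)
```

Comparing this share with the fraction of predicates that the schema actually defines as datatype properties exposes the gap between intended and actual use.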
 Obtaining consolidated & verified entity description/facts (or
graph) for a given resource/entity from Web markup?
 Aiding tasks: such as document annotation, augmentation
or enrichment of existing data- or knowledge bases/graphs
Entity retrieval & reconciliation on markup
Query
iPhone 6, type:(Product)
Entity Description
brand Apple Inc.
weight 129
date 30.09.2015
manufacturer Foxconn
Storage 16 GB
<e1, s:name, „Iphone 6“>
<e2, s:brand, „Apple Inc.“>
<e3, s:brand, „Apple“> <e4, s:weight, 127>
<e5, s:releaseDate, „1.12.1972“>
Web (crawl)
(e.g. Common Crawl/WDC, focused crawl)
Yu, R., Fetahu, B., Gadiraju, U., Dietze, S., FuseM:
Query-Centric Data Fusion on Structured Web
Markup, ICDE2017.
Yu, R., Fetahu, B., Gadiraju, U., Lehmberg, O.,
Ritze, D., Dietze, S., KnowMore - Knowledge Base
Augmentation with Structured Web Markup,
Semantic Web Journal 2017, under review.
FuseM: query-centric data fusion on Web markup
 Entity matching: BM25 entity retrieval model on markup index (Common Crawl) & similarity-based matching
 Data fusion: ML classifier (SVM, kNN, RandomForest), 3 feature categories (relevance, authority, clustering)
1. Matching
2. Fact selection
New Queries
Foxconn, type:(Organization)
Cupertino, type:(City)
Apple Inc., type:(Organization)
(supervised SVM classifier)
Entity Description
brand Apple Inc.
weight 129
date 30.09.2015
manufacturer Foxconn
Storage 16 GB
Query
iPhone 6, type:(Product)
Candidate Facts
node1 brand _node-x
node1 brand Apple Inc.
node1 weight 129
node2 weight 172
node2 manufacturer Foxconn
node3 releasedate 01.12.1972
node3 manufacturer Foxconn
Web page
markup
Web (crawl)
approx. 125.000 facts for „iPhone6“
Yu, R., Fetahu, B., Gadiraju, U., Dietze, S., FuseM:
Query-Centric Data Fusion on Structured Web
Markup, ICDE2017.
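To make the fact selection step more tangible, here is a deliberately simplified stand-in: instead of the supervised classifier used in FuseM, it keeps, per predicate, the value that most candidate facts agree on (a crude proxy for the clustering features). The function name, the `min_support` threshold and the candidate facts are assumptions for illustration only.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Tuple

Fact = Tuple[str, str, str]  # (node, predicate, value)


def select_facts(candidates: Iterable[Fact], min_support: int = 2) -> Dict[str, str]:
    """Keep, per predicate, the normalised value with the most supporting facts."""
    votes: Dict[str, Counter] = defaultdict(Counter)
    for _node, predicate, value in candidates:
        votes[predicate][value.strip().lower()] += 1
    fused: Dict[str, str] = {}
    for predicate, counter in votes.items():
        value, support = counter.most_common(1)[0]
        # Values backed by too few sources are treated as unverified and dropped.
        if support >= min_support:
            fused[predicate] = value
    return fused


# Invented candidate facts in the style of the slide.
candidates = [
    ("node1", "brand", "Apple Inc."),
    ("node2", "brand", "apple inc."),
    ("node3", "brand", "Apple"),
    ("node2", "manufacturer", "Foxconn"),
    ("node3", "manufacturer", "Foxconn"),
    ("node3", "releasedate", "01.12.1972"),
]
fused = select_facts(candidates)
```

Simple voting cannot break ties (e.g. two conflicting weights with one source each), which is one reason a learned model with relevance and authority features is used in the actual approach.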
FuseM classifier: features
Evaluation & results: data fusion performance
Setup
 Dataset: Products, Movies, Books
(approx. 3 billion facts) from Common
Crawl / WDC
 Baselines:
 BM25: top-k diverse facts via BM25
(Glimmer)
 CBFS: clustering-based approach
[ESWC2015]
 PreRecCorr: “Fusing data with
correlations” [Pochampally et al.,
ACM SIGMOD 2014]
 10-fold cross validation
Results
 FuseM beats baselines in both tasks
(strong variance of baselines across
tasks)
 All feature categories contribute
Query-centric data fusion (precision)
Query-independent data fusion (P/R/F1)
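The P/R/F1 figures compare the set of selected facts against a set of facts judged correct. A minimal sketch of the standard definitions; the fact identifiers below are invented.

```python
from typing import Set, Tuple


def precision_recall_f1(selected: Set[str], correct: Set[str]) -> Tuple[float, float, float]:
    """Precision, recall and F1 of a selected fact set against the correct facts."""
    true_positives = len(selected & correct)
    precision = true_positives / len(selected) if selected else 0.0
    recall = true_positives / len(correct) if correct else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Invented example: 4 selected facts, 3 of them among the 6 correct ones.
selected = {"f1", "f2", "f3", "f9"}
correct = {"f1", "f2", "f3", "f4", "f5", "f6"}
precision, recall, f1 = precision_recall_f1(selected, correct)
```

Here precision is 0.75, recall is 0.5 and F1 is 0.6.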
Results: example of fused entity description
 Data fusion result for book „Brideshead Revisited“ (20 distinct facts)
New facts (compared to DBpedia):
• 60% - 70% of all facts for books & movies
new (across all KBs)
• 100% new for products
(„long tail entities“ not existing in KBs yet)
New facts and attributes
Results: KB augmentation
 Augmentation of 15 properties of
books (& movies) in three KBs
 DB: DBpedia
 FB: Freebase
 WD: Wikidata
 Augmentation performance: % of filled
slots (or „knowledge gaps“) in KB
 Performance varies heavily (yet some
attributes completed to 100%)
KBA result for entities of type „Book“
Yu, R., Fetahu, B., Gadiraju, U., Lehmberg, O.,
Ritze, D., Dietze, S., KnowMore - Knowledge Base
Augmentation with Structured Web Markup,
Semantic Web Journal 2017, under review.
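The augmentation metric (% of filled slots) can be sketched as the share of KB gaps for which the fused markup provides a value. The property names and the placeholder values are hypothetical, and the sketch ignores whether a filled value is actually correct.

```python
from typing import Dict, List, Optional


def filled_slot_ratio(kb_entity: Dict[str, Optional[str]],
                      fused: Dict[str, str],
                      properties: List[str]) -> float:
    """Share of empty KB slots (knowledge gaps) that the fused markup can fill."""
    gaps = [p for p in properties if not kb_entity.get(p)]
    if not gaps:
        return 0.0
    filled = [p for p in gaps if fused.get(p)]
    return len(filled) / len(gaps)


# Hypothetical book entity: the KB only knows the author.
properties = ["author", "isbn", "publisher", "numberOfPages"]
kb_entity = {"author": "Evelyn Waugh"}
fused = {"isbn": "placeholder-isbn", "publisher": "placeholder-publisher"}
ratio = filled_slot_ratio(kb_entity, fused, properties)
```

With three gaps in the KB entry and two of them covered by markup, the ratio is 2/3; computed per attribute across many entities, the same measure shows why performance "varies heavily".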
Linked Data & knowledge graphs
Conclusions & outlook
 Retrieval/search of Linked Data hindered by
heterogeneity, quality, dynamics etc
 Dealing with diversity & heterogeneity
o Profiling & recommendation: dataset search &
recommendation
o Entity retrieval & clustering: entity search
 New forms of (structured) Web data:
Web markup (schema.org et al.) & tables
o Convergence of structured and unstructured Web
(e.g. Voldemort KG, Tonon et al., ISWC2016)
o Scale and dynamics (!)
o Potential to augment existing knowledge graphs
(e.g. Google KG or Microsoft Satori)
o Potential training data for NED, entity interlinking
and other entity-centric tasks (e.g. OKE Challenge)
Entity
node1 name
Molecular structure of
nucleic acids
node1 author James D. Watson
node1 publisher Nature
node1 datePublished 1956
node1 datePublished 1953
Entity
node2 name Francis Crick
node2 name Cricks
node2 born 1916
Contact & resources
@stefandietze
http://stefandietze.net
More on Web markup: talk on
Wednesday, 11:00, WWW2017/Digital
Learning track
Embedded data/markup/tables
Unstructured (Web) data/docs
Linked Data & knowledge graphs

More Related Content

What's hot

euclid_linkedup WWW tutorial (Besnik Fetahu)
euclid_linkedup WWW tutorial (Besnik Fetahu)euclid_linkedup WWW tutorial (Besnik Fetahu)
euclid_linkedup WWW tutorial (Besnik Fetahu)Besnik Fetahu
 
Extracting Relevant Questions to an RDF Dataset Using Formal Concept Analysis
Extracting Relevant Questions to an RDF Dataset Using Formal Concept AnalysisExtracting Relevant Questions to an RDF Dataset Using Formal Concept Analysis
Extracting Relevant Questions to an RDF Dataset Using Formal Concept AnalysisMathieu d'Aquin
 
DMPTool: Data Management Made Easier at CNI 2012
DMPTool: Data Management Made Easier at CNI 2012DMPTool: Data Management Made Easier at CNI 2012
DMPTool: Data Management Made Easier at CNI 2012Carly Strasser
 
Doing Clever Things with the Semantic Web
Doing Clever Things with the Semantic WebDoing Clever Things with the Semantic Web
Doing Clever Things with the Semantic WebMathieu d'Aquin
 
Turning Data into Knowledge (KESW2014 Keynote)
Turning Data into Knowledge (KESW2014 Keynote)Turning Data into Knowledge (KESW2014 Keynote)
Turning Data into Knowledge (KESW2014 Keynote)Stefan Dietze
 
Semantic Web / Linked Data Technologies
Semantic Web / Linked Data TechnologiesSemantic Web / Linked Data Technologies
Semantic Web / Linked Data TechnologiesMathieu d'Aquin
 
A structured catalog of open educational datasets
A structured catalog of open educational datasetsA structured catalog of open educational datasets
A structured catalog of open educational datasetsStefan Dietze
 
What Are Links in Linked Open Data? A Characterization and Evaluation of Link...
What Are Links in Linked Open Data? A Characterization and Evaluation of Link...What Are Links in Linked Open Data? A Characterization and Evaluation of Link...
What Are Links in Linked Open Data? A Characterization and Evaluation of Link...Armin Haller
 
LinkedUp - Linked Data & Education
LinkedUp - Linked Data & EducationLinkedUp - Linked Data & Education
LinkedUp - Linked Data & EducationStefan Dietze
 
LUCERO - Building the Open University Web of Linked Data
LUCERO - Building the Open University Web of Linked DataLUCERO - Building the Open University Web of Linked Data
LUCERO - Building the Open University Web of Linked DataMathieu d'Aquin
 
Semantic Web, Linked Data and Education: A Perfect Fit?
Semantic Web, Linked Data and Education: A Perfect Fit?Semantic Web, Linked Data and Education: A Perfect Fit?
Semantic Web, Linked Data and Education: A Perfect Fit?Mathieu d'Aquin
 
Interpreting Data Mining Results with Linked Data for Learning Analytics
Interpreting Data Mining Results with Linked Data for Learning AnalyticsInterpreting Data Mining Results with Linked Data for Learning Analytics
Interpreting Data Mining Results with Linked Data for Learning AnalyticsMathieu d'Aquin
 
Presentation of LUCERO at EURECOM
Presentation of LUCERO at EURECOMPresentation of LUCERO at EURECOM
Presentation of LUCERO at EURECOMMathieu d'Aquin
 
Working with Social Media Data: Ethics & good practice around collecting, usi...
Working with Social Media Data: Ethics & good practice around collecting, usi...Working with Social Media Data: Ethics & good practice around collecting, usi...
Working with Social Media Data: Ethics & good practice around collecting, usi...Nicola Osborne
 
ESWC2015 opening ceremony
ESWC2015 opening ceremonyESWC2015 opening ceremony
ESWC2015 opening ceremonyFabien Gandon
 
Data Science, Data Curation, and Human-Data Interaction
Data Science, Data Curation, and Human-Data InteractionData Science, Data Curation, and Human-Data Interaction
Data Science, Data Curation, and Human-Data InteractionUniversity of Washington
 

What's hot (20)

euclid_linkedup WWW tutorial (Besnik Fetahu)
euclid_linkedup WWW tutorial (Besnik Fetahu)euclid_linkedup WWW tutorial (Besnik Fetahu)
euclid_linkedup WWW tutorial (Besnik Fetahu)
 
DataUp at ACRL 2013
DataUp at ACRL 2013DataUp at ACRL 2013
DataUp at ACRL 2013
 
Extracting Relevant Questions to an RDF Dataset Using Formal Concept Analysis
Extracting Relevant Questions to an RDF Dataset Using Formal Concept AnalysisExtracting Relevant Questions to an RDF Dataset Using Formal Concept Analysis
Extracting Relevant Questions to an RDF Dataset Using Formal Concept Analysis
 
DMPTool: Data Management Made Easier at CNI 2012
DMPTool: Data Management Made Easier at CNI 2012DMPTool: Data Management Made Easier at CNI 2012
DMPTool: Data Management Made Easier at CNI 2012
 
Doing Clever Things with the Semantic Web
Doing Clever Things with the Semantic WebDoing Clever Things with the Semantic Web
Doing Clever Things with the Semantic Web
 
Turning Data into Knowledge (KESW2014 Keynote)
Turning Data into Knowledge (KESW2014 Keynote)Turning Data into Knowledge (KESW2014 Keynote)
Turning Data into Knowledge (KESW2014 Keynote)
 
Semantic Web / Linked Data Technologies
Semantic Web / Linked Data TechnologiesSemantic Web / Linked Data Technologies
Semantic Web / Linked Data Technologies
 
A structured catalog of open educational datasets
A structured catalog of open educational datasetsA structured catalog of open educational datasets
A structured catalog of open educational datasets
 
What Are Links in Linked Open Data? A Characterization and Evaluation of Link...
What Are Links in Linked Open Data? A Characterization and Evaluation of Link...What Are Links in Linked Open Data? A Characterization and Evaluation of Link...
What Are Links in Linked Open Data? A Characterization and Evaluation of Link...
 
LinkedUp - Linked Data & Education
LinkedUp - Linked Data & EducationLinkedUp - Linked Data & Education
LinkedUp - Linked Data & Education
 
LUCERO - Building the Open University Web of Linked Data
LUCERO - Building the Open University Web of Linked DataLUCERO - Building the Open University Web of Linked Data
LUCERO - Building the Open University Web of Linked Data
 
Alamw15 VIVO
Alamw15 VIVOAlamw15 VIVO
Alamw15 VIVO
 
Semantic Web, Linked Data and Education: A Perfect Fit?
Semantic Web, Linked Data and Education: A Perfect Fit?Semantic Web, Linked Data and Education: A Perfect Fit?
Semantic Web, Linked Data and Education: A Perfect Fit?
 
Sanderson Shout It Out: LOUD
Sanderson Shout It Out: LOUDSanderson Shout It Out: LOUD
Sanderson Shout It Out: LOUD
 
Interpreting Data Mining Results with Linked Data for Learning Analytics
Interpreting Data Mining Results with Linked Data for Learning AnalyticsInterpreting Data Mining Results with Linked Data for Learning Analytics
Interpreting Data Mining Results with Linked Data for Learning Analytics
 
Science Data, Responsibly
Science Data, ResponsiblyScience Data, Responsibly
Science Data, Responsibly
 
Presentation of LUCERO at EURECOM
Presentation of LUCERO at EURECOMPresentation of LUCERO at EURECOM
Presentation of LUCERO at EURECOM
 
Working with Social Media Data: Ethics & good practice around collecting, usi...
Working with Social Media Data: Ethics & good practice around collecting, usi...Working with Social Media Data: Ethics & good practice around collecting, usi...
Working with Social Media Data: Ethics & good practice around collecting, usi...
 
ESWC2015 opening ceremony
ESWC2015 opening ceremonyESWC2015 opening ceremony
ESWC2015 opening ceremony
 
Data Science, Data Curation, and Human-Data Interaction
Data Science, Data Curation, and Human-Data InteractionData Science, Data Curation, and Human-Data Interaction
Data Science, Data Curation, and Human-Data Interaction
 

Beyond Linked Data - Exploiting Entity-Centric Knowledge on the Web

  • 1. Beyond Linked Data – Exploiting Entity-Centric Knowledge on the Web. Stefan Dietze, L3S Research Center, Hannover, Germany. Linked Data on the Web (LDOW2017), WWW2017, 05/04/17.
  • 2. Research @ L3S
      - Research areas: Web science, Information Retrieval, Semantic Web, Social Web Analytics, Knowledge Discovery, Human Computation
      - Interdisciplinary application areas: digital humanities, TEL/education, Web archiving, mobility, ...
      - Some projects; see also: http://www.l3s.de
  • 3. Acknowledgements: team
      - Pavlos Fafalios (L3S)
      - Besnik Fetahu (L3S)
      - Elena Demidova (L3S)
      - Ujwal Gadiraju (L3S)
      - Eelco Herder (L3S)
      - Ivana Marenzi (L3S)
      - Nicolas Tempelmeier (L3S)
      - Ran Yu (L3S)
      - Nilamadhaba Mohapatra (L3S, IIT India)
      - Bernardo Pereira Nunes (L3S, PUC Rio de Janeiro)
      - Mathieu d'Aquin (The Open University, UK)
      - Mohamed Ben Ellefi (LIRMM, France)
      - Davide Taibi (CNR, Italy)
      - Konstantin Todorov (LIRMM, France)
      - ...
  • 4. Back in September 2016
      - A new look at the semantic web. Abraham Bernstein, James Hendler, Natalya Noy, Communications of the ACM, Vol. 59 No. 9, Pages 35-37, September 2016.
      - Retrieval, Crawling and Fusion of Entity-centric Data on the Web, Dietze, S., In: Calì A., Gorgan D., Ugarte M. (eds) Semantic Keyword-Based Search on Structured Data Sources. KEYSTONE 2016. LNCS, Vol 10151. Springer, 2017.
  • 5. Overview
      - I – Challenges
      - II – Enabling discovery & search in Linked Data & Knowledge Graphs. Dealing with heterogeneity & shortcomings ("Present")
          - Dataset recommendation
          - Dataset profiling
          - Entity retrieval
      - III – Beyond Linked Data – exploiting embedded Web semantics. Other emerging forms of semantics/structured data on the Web ("Future")
          - Web markup as emerging data source
          - Case studies
          - Data fusion for entity reconciliation (and retrieval)
      - IV – Wrap-up
  • 6. Data accessibility & quality?
      - Accessibility of (linked) datasets?
          - Less than 50% of all SPARQL endpoints are actually responsive at any given point in time [Buil-Aranda2013] (figure: SPARQL endpoint availability over time)
          - "THE" SPARQL protocol? No, but variants, subsets and local restrictions
      - Semantics, links, quality?
          - Data accuracy (e.g. DBpedia)? [Paulheim2013]
          - Schema compliance & evolution [HoganJWS2012]
          - Vocabulary reuse? [D'AquinWebSci13]
      - References
          - Assessing the Educational Linked Data Landscape, D'Aquin, M., Adamou, A., Dietze, S., ACM Web Science 2013 (WebSci2013), Paris, France, May 2013.
          - Type Inference on Noisy RDF Data, Paulheim, H., Bizer, C., Semantic Web – ISWC 2013, Lecture Notes in Computer Science Volume 8218, 2013, pp 510-525.
          - An empirical survey of Linked Data conformance, Hogan, A., Umbrich, J., Harth, A., Cyganiak, R., Polleres, A., Decker, S., Journal of Web Semantics 14, 2012.
          - SPARQL Web-Querying Infrastructure: Ready for Action?, Carlos Buil-Aranda, Aidan Hogan, Jürgen Umbrich, Pierre-Yves Vandenbussche, International Semantic Web Conference 2013 (ISWC2013).
  • 7. Vocabulary reuse/linking?
      - Co-occurrence of types (in 146 datasets: 144 vocabularies, 588 overlapping types, 719 predicates)
      - Example types shown: po:Programme, yov:Video, bibo:Book
      - Assessing the Educational Linked Data Landscape, D'Aquin, M., Adamou, A., Dietze, S., ACM Web Science 2013 (WebSci2013), Paris, May 2013.
  • 8. Vocabulary reuse/linking?
      - Co-occurrence of types (in 146 datasets: 144 vocabularies, 588 overlapping types, 719 predicates)
      - Co-occurrence after mapping (201 frequently occurring types, mapped into 79 types)
      - Example types shown: bibo:Film, bibo:Document, po:Programme, bibo:Book, foaf:Document, yov:Video, typeX
      - Assessing the Educational Linked Data Landscape, D'Aquin, M., Adamou, A., Dietze, S., ACM Web Science 2013 (WebSci2013), Paris, May 2013.
  • 9. "Completeness"?
      - Example: varying completeness of "book" ("movie") entity descriptions
      - Missing facts for books (movies): 49.8% (37.1%) in DBpedia, 63.8% (23.3%) in Freebase and 60.9% (40%) in Wikidata (varies heavily across attributes)
      - Yu, R., Fetahu, B., Gadiraju, U., Dietze, S., FuseM: Query-Centric Data Fusion on Structured Web Markup, ICDE2017.
      - Yu, R., Fetahu, B., Gadiraju, U., Lehmberg, O., Ritze, D., Dietze, S., KnowMore - Knowledge Base Augmentation with Structured Web Markup, Semantic Web Journal 2017, under review.
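
The "missing facts" percentages on this slide can be read as the share of expected attribute values that are absent from the entity descriptions. The following is a minimal sketch of that ratio only; it is not the measurement procedure of the cited papers, and the attribute names and sample records are hypothetical.

```python
from typing import Any, Dict, Iterable, List


def is_missing(value: Any) -> bool:
    """Treat None, empty strings and empty lists as a missing fact."""
    return value is None or value == "" or value == []


def missing_fact_rate(entities: List[Dict[str, Any]], attributes: Iterable[str]) -> float:
    """Fraction of (entity, attribute) pairs that have no value."""
    attrs = list(attributes)
    total = len(entities) * len(attrs)
    if total == 0:
        return 0.0
    missing = sum(1 for e in entities for a in attrs if is_missing(e.get(a)))
    return missing / total


# Hypothetical book descriptions; real knowledge-base records would be larger.
books = [
    {"title": "Book A", "author": "Author X", "isbn": None},
    {"title": "Book B"},
]
rate = missing_fact_rate(books, ["title", "author", "isbn"])
print(f"missing facts: {rate:.1%}")
```

Computing the rate per attribute instead of over all attributes would expose the variation across attributes that the slide mentions.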
  • 10. Consistency?
      - Analyzing Relative Incompleteness of Movie Descriptions in the Web of Data: A Case Study, Yuan, W., Demidova, E., Dietze, S., Zhu, X., ISWC2014.
  • 11. Challenge for search/retrieval – heterogeneity of datasets & entities
      - Discovery of suitable (1) datasets & (2) entities:
          - Quality? Currentness, dynamics, accessibility/reliability, data quantity & quality?
          - Topics/scope? Datasets/entities useful & trustworthy for topic XY?
          - Types? Datasets/entities about statistics, organisations, videos, slides, publications etc.?
  • 12. Overview
      - I – Challenges
      - II – Enabling discovery & search in Linked Data & Knowledge Graphs. Dealing with heterogeneity & shortcomings ("Now")
          - Dataset recommendation
          - Dataset profiling
          - Entity retrieval
      - III – Beyond Linked Data – exploiting embedded Web semantics. Other emerging forms of semantics/structured data on the Web ("Future")
          - Web markup as emerging data source
          - Case studies
          - Data fusion for entity reconciliation (and retrieval)
      - IV – Wrap-up
  • 13. Dataset recommendation I
      - Goal: finding candidate datasets, e.g. for entity retrieval or interlinking tasks (e.g. enrichment)
      - Approach
          - Given dataset s, rank the datasets from D according to a probability score (di, t) of containing linking candidates (entities)
          - Features: approach 1 uses vocabulary overlap; approach 2 uses existing links (SNA)
          - Linking candidates are likely if datasets share common (a) schema elements, or (b) links (friend of a friend)
      - Conclusions
          - Roughly 50% MAP for both approaches
          - Simplistic approach (!)
      - Example ranking shown: 1 DBLP, 2 ACM, 3 OAI, 4 CiteSeer, 5 IBM, 6 Roma, 7 IEEE, 8 Ulm, 9 Pisa
      - Lopes, G.R., Paes Leme, L. A., Nunes, B.P., Casanova, M.A., Dietze, S., Two approaches to the dataset interlinking recommendation problem, 15th International Conference on Web Information System Engineering (WISE 2014), Thessaloniki, Greece.
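
The vocabulary-overlap idea (approach 1) can be illustrated with a small ranking sketch. The slide does not give the scoring function, so Jaccard similarity over sets of schema terms is used here purely as a stand-in assumption; the dataset names and vocabularies below are hypothetical.

```python
from typing import Dict, List, Set, Tuple


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity of two term sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def rank_by_vocabulary_overlap(
    source_vocab: Set[str], candidates: Dict[str, Set[str]]
) -> List[Tuple[str, float]]:
    """Rank candidate datasets by schema-term overlap with the source dataset.

    Ties are broken alphabetically so the ordering is deterministic.
    """
    scored = [(name, jaccard(source_vocab, vocab)) for name, vocab in candidates.items()]
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))


# Hypothetical schema terms used by a source dataset and three candidates.
source = {"foaf:Person", "bibo:Article", "dc:title"}
candidates = {
    "dataset_a": {"foaf:Person", "bibo:Article", "dc:creator"},
    "dataset_b": {"geo:lat", "geo:long"},
    "dataset_c": {"foaf:Person", "bibo:Article", "dc:title"},
}
for name, score in rank_by_vocabulary_overlap(source, candidates):
    print(f"{name}: {score:.2f}")
```

A link-based variant (approach 2) would replace the term sets with the sets of datasets each candidate already links to.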
  • 14. Dataset recommendation II
      - Pipeline steps shown: preprocessing, datasets filtering, datasets ranking
      - Ben Ellefi, M., Bellahsene, Z., Dietze, S., Todorov, K., Intension-based Dataset Recommendation for Data Linking, 13th Extended Semantic Web Conference (ESWC2016), Heraklion, Crete, May 2016.
      - L. Han, A. L. Kashyap, T. Finin, J. Mayfield, and J. Weese, "UMBC EBIQUITY-CORE: Semantic textual similarity systems", in Proc. of the *SEM, Association for Computational Linguistics, 2013.
  • 15. Dataset recommendation II: results
      - Data & ground truth
          - Experiments on (responsive) datasets from LOD Cloud (http://datahub.io)
          - Concept profiles from http://lov.okfn.org
          - Ground truth: existing links from VOID profiles of datasets (issue: not always representative for actual linksets)
      - Results
          - MAP for different similarity thresholds from step 2: max. 54% (UMBC@0.7)
          - Recall 100% below indicated similarity (clustering) thresholds
      - Ben Ellefi, M., Bellahsene, Z., Dietze, S., Todorov, K., Intension-based Dataset Recommendation for Data Linking, 13th Extended Semantic Web Conference (ESWC2016), Heraklion, Crete, May 2016.
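
Both recommendation slides report mean average precision (MAP). As a reminder of what that figure measures, here is a minimal sketch of the standard MAP computation over ranked recommendation lists; the rankings and relevance sets are hypothetical and unrelated to the reported experiments, and each ranking is assumed to contain no duplicates.

```python
from typing import Iterable, List, Sequence, Set, Tuple


def average_precision(ranked: Sequence[str], relevant: Set[str]) -> float:
    """Mean of precision@k over the ranks k at which a relevant item appears."""
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for k, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            precision_sum += hits / k
    return precision_sum / len(relevant)


def mean_average_precision(runs: Iterable[Tuple[Sequence[str], Set[str]]]) -> float:
    """Average of the per-query average precision values."""
    runs = list(runs)
    if not runs:
        return 0.0
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


# Two hypothetical source datasets with ranked candidates and ground-truth links.
runs: List[Tuple[Sequence[str], Set[str]]] = [
    (["a", "b", "c", "d"], {"a", "c"}),
    (["x", "y"], {"y"}),
]
print(f"MAP = {mean_average_precision(runs):.3f}")
```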
  • 16. Dataset search through dataset cataloging & profiling Dataset Catalog/Registry http://data.linkededucation.org/linkedup/catalog/  LinkedUp project (FP7 project: L3S, OU, OKFN, Elsevier, Exact Learning solutions)  LinkedUp Catalog: largest collection of LD of educationally relevant resources (approx. 50 Datasets)  Original datasets published with key content providers, automatically extracted metadata 05/04/17 17Stefan Dietze
  • 17. 05/04/17 18Stefan Dietze LinkedUp Catalog: dataset index & registry, federated search  “Federated queries” through schema mappings [WebSci13]  Dataset accessibility  Linking & topic profiling Schema/Types http://data.linkededucation.org/linkedup/catalog/
  • 18. 05/04/17 19Stefan Dietze LinkedUp Catalog: dataset index & registry, federated search  “Federated queries” through schema mappings [WebSci13]  Dataset accessibility  Linking & topic profiling [ESWC14] Dataset topic profiles http://data.linkededucation.org/linkedup/catalog/
  • 19. Scalable profiling of datasets. Extraction of representative (DBpedia) categories ("topic profile") for arbitrary datasets? Technically trivial through established NER/NED approaches, but with scalability issues (recall: LOD Cloud with 1000+ datasets and <100 billion RDF statements). Efficient approach: sampling & ranking to balance scalability against precision/recall. Example: a Yovisto video resource in the dataset catalog/registry (yov:Video), <yo:Video …> <dc:title>Lecture 29 – Stem Cells</dc:title> … </yo:Video…>, is mapped to db:Biology, db:Cell biology and db:Cell (Biology). A Scalable Approach for Efficiently Generating Structured Dataset Topic Profiles, Fetahu, B., Dietze, S., Nunes, B. P., Casanova, M. A., Nejdl, W., 11th Extended Semantic Web Conference (ESWC2014), Crete, Greece, (2014).
  • 20. Efficient dataset profiling 1. Sampling of resources (random sampling, weighted sampling, resource centrality sampling) 2. Entity- & topic-extraction (NER via DBpedia Spotlight, category mapping & -expansion) 3. Normalisation & ranking (graph-based models such as PageRank with Priors, HITS with Priors & K-Step Markov)  Result: weighted dataset-topic profile graph 05/04/17 22Stefan Dietze A Scalable Approach for Efficiently Generating Structured Dataset Topic Profiles, Fetahu, B., Dietze, S., Nunes, B. P., Casanova, M. A., Nejdl, W., 11th Extended Semantic Web Conference (ESWC2014), Crete, Greece, (2014).
  • 21. Search & exploration of datasets through topic profiles  Applied to entire LOD cloud/graph  Visual exploration of extracted RDF dataset profiles (datasets, topics, relationships)  Evaluation results: K-Step Markov (10% sampling size) outperforms baselines (LDA, tf/idf on entire datasets) http://data-observatory.org/lod-profiles/ 05/04/17 23Stefan Dietze
  • 22. Search: entity retrieval on large LD crawls? How to efficiently retrieve (related) entities/resources for a given entity-seeking (keyword) query, e.g. entities related to <Tim Berners Lee>? State of the art: BM25F on an inverted entity index (Blanco et al., ISWC2011). Challenges/observations: explicit entity links (owl:sameAs etc.) are sparse yet important to facilitate state-of-the-art methods; query type affinity? Large datasets/crawls: e.g. the LinkedUp dataset graph, BTC2014, the Dynamic LD Observatory (DyLDO).
  • 23. Entity retrieval: approach. (I) Offline processing (clustering to address link sparsity): 1. Feature vectors (lexical and structural features); 2. Bucketing: per type (LSH algorithm); 3. Clustering: X-means & spectral clustering per bucket. (II) Online processing (retrieval): 1. Retrieval & expansion: a) BM25F results, b) expansion from clusters (related entities); 2. Re-ranking (context terms & query type affinity). Improving Entity Retrieval on Structured Data, Fetahu, B., Gadiraju, U., Dietze, S., 14th International Semantic Web Conference (ISWC2015), Bethlehem, US, (2015).
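The online step (retrieve, then expand from precomputed clusters) can be sketched as follows. This is a simplified illustration under stated assumptions: a term-overlap score stands in for BM25F, the clusters are given by hand rather than computed with X-means or spectral clustering, the re-ranking step is omitted, and all entity identifiers and labels are hypothetical.

```python
def base_retrieve(query: str, index: dict, k: int = 2) -> list:
    """Stand-in for BM25F: score entities by the number of shared query terms."""
    terms = set(query.lower().split())
    scored = [(eid, len(terms & set(text.lower().split()))) for eid, text in index.items()]
    hits = [(eid, score) for eid, score in scored if score > 0]
    hits.sort(key=lambda pair: (-pair[1], pair[0]))
    return [eid for eid, _ in hits[:k]]


def expand_with_clusters(results: list, cluster_of: dict, members: dict) -> list:
    """Append entities that share an offline cluster with any retrieved entity."""
    expanded = list(results)
    for eid in results:
        for other in members[cluster_of[eid]]:
            if other not in expanded:
                expanded.append(other)
    return expanded


# Hypothetical entity index and hand-made clusters.
index = {
    "e1": "Tim Berners-Lee inventor of the Web",
    "e2": "Berners-Lee, Tim",
    "e3": "TimBL W3C director",
    "e4": "Forrest Gump movie",
}
cluster_of = {"e1": 0, "e2": 0, "e3": 0, "e4": 1}
members = {0: ["e1", "e2", "e3"], 1: ["e4"]}

results = base_retrieve("tim berners-lee", index)
expanded = expand_with_clusters(results, cluster_of, members)
print(results, expanded)
```

Here "e3" shares no query term with the keyword query and is only reached through its cluster, which is the effect the clustering is meant to have when explicit links such as owl:sameAs are missing.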
  • 24. Entity retrieval: evaluation. Dataset: BTC2014 (4 billion entities); 92 SemSearch queries. Methods: our approaches, XM: X-means, SP: spectral; baselines, B: BM25F, S1: Tonon et al. [SIGIR12]. Conclusions: XM & SP outperform the baselines; clustering remedies link sparsity (yet extensive offline processing is required); relevance to the query is more important than relevance to the BM25F results. Improving Entity Retrieval on Structured Data, Fetahu, B., Gadiraju, U., Dietze, S., 14th International Semantic Web Conference (ISWC2015), Bethlehem, US, (2015).
  • 25. PROFILES2017 - Profiling & search of Linked Data 05/04/17 27Stefan Dietze https://profiles2017.wordpress.com/ • Probably co-located with ISWC2017 (Vienna) • Submissions due 21 June
  • 26. Overview. I – Challenges. II – Enabling discovery & search in Linked Data & Knowledge Graphs: dataset recommendation; dataset profiling; entity retrieval (dealing with heterogeneity & shortcomings, "Present"). III – Beyond Linked Data – exploiting embedded Web semantics: Web markup as emerging data source; case studies; data fusion for entity reconciliation (and retrieval) (other emerging forms of structured data on the Web, "Future"?). IV – Wrap-up.
  • 27. Web semantics & entity-centric Web data. Linked Data: approx. 1000+ datasets & 100 billion statements. Open Data: XXX datasets. Web (of documents): approx. 46,000,000,000,000 (46 trillion) Web pages indexed by Google. Other forms of Web semantics and entity-centric knowledge? Dynamics? Quality? Accessibility? Scale?
  • 28. Embedded Web page markup & schema.org. Embedded markup (RDFa, Microdata, Microformats) for interpretation of Web documents (search, retrieval). Arbitrary vocabularies; schema.org used at scale (700 classes, 1000 predicates). Adoption on the Web: 26% (2014 Google study of 12 bn Web pages). "Web Data Commons" (Meusel & Paulheim [ISWC2014], http://webdatacommons.org): markup from Common Crawl (3.2 billion pages) amounts to 44 billion RDF quads (2016); markup in 38% of pages in 2016. Same order of magnitude as "the Web" (!). Example markup: <div itemscope itemtype="http://schema.org/Movie"> <h1 itemprop="name">Forrest Gump</h1> <span>Actor: <span itemprop="actor">Tom Hanks</span></span> <span itemprop="genre">Drama</span> ... </div> RDF statements: node1 actor _node-x; node1 actor Robin Wright; node1 genre Comedy; node2 actor T. Hanks; node2 distributed by Paramount Pic.; node3 actor Tom Cruise; node3 distributed by Paramount Pic.
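To make the step from embedded markup to statements concrete, here is a minimal sketch that pulls itemprop values out of a Microdata snippet using Python's standard html.parser. It is an illustration only: it assumes a single flat item, well-formed nesting and no void elements such as <br>, and the subject label "node1" is a placeholder; it is not how Web Data Commons extracts its quads.

```python
from html.parser import HTMLParser


class MicrodataParser(HTMLParser):
    """Collects (subject, property, value) statements from one flat Microdata item."""

    def __init__(self):
        super().__init__()
        self.item_type = None
        self.statements = []
        self._depth = 0
        self._capture = None  # (property name, depth where it opened, text parts)

    def handle_starttag(self, tag, attrs):
        self._depth += 1
        attributes = dict(attrs)
        if "itemtype" in attributes:
            self.item_type = attributes["itemtype"]
        if "itemprop" in attributes and self._capture is None:
            self._capture = (attributes["itemprop"], self._depth, [])

    def handle_data(self, data):
        if self._capture is not None:
            self._capture[2].append(data)

    def handle_endtag(self, tag):
        # Close the capture only when the element that opened it ends.
        if self._capture is not None and self._capture[1] == self._depth:
            prop, _, parts = self._capture
            self.statements.append(("node1", prop, "".join(parts).strip()))
            self._capture = None
        self._depth -= 1


SNIPPET = """
<div itemscope itemtype="http://schema.org/Movie">
  <h1 itemprop="name">Forrest Gump</h1>
  <span>Actor: <span itemprop="actor">Tom Hanks</span></span>
  <span itemprop="genre">Drama</span>
</div>
"""

parser = MicrodataParser()
parser.feed(SNIPPET)
for statement in parser.statements:
    print(statement)
```

All values come out as plain strings, which already hints at the "strings, not things" issue discussed later in the talk.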
  • 29. Example: embedded Web markup about "products". schema:Product instances in WDC2015. Facts: 1,414,937,431 (= 302,246,120 instances, i.e. products). Providers (distinct pay-level domains, PLDs): 93,705. Power-law distribution of terms across PLDs. Top 10 PLDs (PLD, # resources): www.crateandbarrel.com 33,517,936; www.bentgate.com 17,215,499; www.aliexpress.com 9,621,943; www.ebay.com.au 8,861,308; us.fotolia.com 7,939,982; www.ebay.co.uk 6,556,820; www.competitivecyclist.com 6,214,500; www.maxstudio.com 6,075,626. Top provider (company)? Approx. 35 million resources.
  • 30. Example: markup of bibliographic resources. Study on a sample Web crawl (WDC2015). Metadata about scholarly articles (e.g. s:ScholarlyArticle): 6,793,764 quads, 1,184,623 entities, 429 distinct predicates (in WDC and for one type alone). Top 5 domains: Springer, MDPI, BMJ, mendeley.com, Biodiversitylibrary.org. Domains, topics, disciplines? Life Sciences and Computer Science predominant. Top-10 article titles. Noise. [Figure: number of entities and statements per PLD (ranked), count on a log scale.] Sahoo, P., Gadiraju, U., Yu, R., Saha, S., Dietze, S., Analysing Structured Scholarly Data embedded in Web Pages, SAVE-SD2016, co-located with WWW2016.
  • 31. Example: markup of learning resources on the Web. "Learning Resource Metadata Initiative (LRMI)": schema.org vocabulary for annotation of learning resources. Developed through the DCMI Task Force on LRMI. Approx. 5000 PLDs (incl. subdomains) in CC. LRMI adoption (WDC) [WWW17]: 2015: 44,108,511 quads; 2014: 30,599,024 quads; 2013: 10,636,873 quads. Dietze, S., Taibi, D., Yu, R., Barker, P., d’Aquin, M., Analysing and Improving embedded Markup of Learning Resources on the Web, 26th International World Wide Web Conference (WWW2017), Digital Learning track, Perth, April 2017.
  • 32. Example: markup of learning resources on the Web. "Learning Resource Metadata Initiative (LRMI)": schema.org vocabulary for annotation of learning resources. Developed through the DCMI Task Force on LRMI. Approx. 5000 PLDs (incl. subdomains) in CC. LRMI adoption (WDC) [WWW17]: 2015: 44,108,511 quads; 2014: 30,599,024 quads; 2013: 10,636,873 quads. Frequent errors and unintended use (e.g. porn). PLDs shown: 7xxxtube.com, 1amateurporntube.com, virtualpornstars.com, sunriseseniorliving.com, simplyfinance.co.uk, menslifestyles.com, audiobooks.com, simplypsychology.org, helles-koepfchen.de. Dietze, S., Taibi, D., Yu, R., Barker, P., d’Aquin, M., Analysing and Improving embedded Markup of Learning Resources on the Web, 26th International World Wide Web Conference (WWW2017), Digital Learning track, Perth, April 2017.
  • 33. 05/04/17 35Stefan Dietze Entity retrieval on Web markup: state of the art  Glimmer (http://glimmer.research.yahoo.com)  Entity retrieval on WDC dataset [Blanco, Mika & Vigna, ISWC2011]  BM25F retrieval model on WDC index
  • 34. Web markup: challenges. Characteristics and examples. Coreferences: 18,000 results for <"Iphone 6", type, s:Product> (8.6 quads on average) in Common Crawl. Redundancy: <s, schema:name, "Iphone 6"> occurring 1000 times in CC. Lack of links: largely unlinked entity descriptions. Errors (typos & schema violations, see Meusel et al. [ESWC2015]): wrong namespaces, such as http://schma.org; undefined types & predicates: 9.7%, less common than in LOD; confusion of datatype and object properties, e.g. <s1, s:publisher, "Springer">: 24.35% object property issues vs 8% in LOD; data property range violations, e.g. literals vs numbers: 12.6% vs 4.6% in LOD. Using markup as a knowledge graph, similar to Linked Data? A Survey on Challenges for Entity Retrieval in Markup Data, Yu, R., Gadiraju, U., Fetahu, B., Dietze, S., 15th International Semantic Web Conference (ISWC2016), Kobe, Japan (2016). "Strings, not things": bias towards datatype properties / using any property as such (!). Numbers from the LRMI2015 markup corpus: 46 million "transversal" quads (i.e. excluding hierarchical statements such as rdfs:typeOf); 64% are actual datatype properties, yet 97% refer to literals (up from 70% in 2013). Challenges: markup data = flat entity descriptions (=> fairly unconnected graph); data reuse requires identity resolution.
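Two of the checks behind these numbers, the share of statements whose object is a literal and the detection of misspelled namespaces, can be sketched on a toy set of quads. The quad layout (subject, predicate, (kind, value), source) and the sample data are assumptions made for illustration and do not come from the LRMI corpus.

```python
from collections import Counter

EXPECTED_NS = "http://schema.org/"


def profile_quads(quads: list) -> tuple:
    """Return the share of literal objects and the predicates outside the expected namespace."""
    kinds = Counter()
    bad_predicates = []
    for subject, predicate, obj, source in quads:
        kinds[obj[0]] += 1
        if not predicate.startswith(EXPECTED_NS):
            bad_predicates.append(predicate)
    literal_share = kinds["literal"] / len(quads) if quads else 0.0
    return literal_share, bad_predicates


# Hypothetical quads; objects are tagged as "literal" or "iri".
quads = [
    ("_:e1", "http://schema.org/name", ("literal", "Iphone 6"), "shop-a.example"),
    ("_:e1", "http://schema.org/publisher", ("literal", "Springer"), "shop-a.example"),
    ("_:e2", "http://schma.org/name", ("literal", "Iphone 6"), "shop-b.example"),
    ("_:e2", "http://schema.org/brand", ("iri", "http://example.org/apple"), "shop-b.example"),
]
literal_share, bad_predicates = profile_quads(quads)
print(literal_share, bad_predicates)
```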
  • 35. Entity retrieval & reconciliation on markup. Obtaining a consolidated & verified entity description/facts (or graph) for a given resource/entity from Web markup? Aiding tasks such as document annotation, augmentation or enrichment of existing data- or knowledge bases/graphs. Example: query "iPhone 6, type:(Product)" against a Web crawl (e.g. Common Crawl/WDC, focused crawl) containing markup such as <e1, s:name, "Iphone 6">, <e2, s:brand, "Apple Inc.">, <e3, s:brand, "Apple">, <e4, s:weight, 127>, <e5, s:releaseDate, "1.12.1972">. Entity description: brand: Apple Inc.; weight: 129; date: 30.09.2015; manufacturer: Foxconn; storage: 16 GB. Yu, R., Fetahu, B., Gadiraju, U., Dietze, S., FuseM: Query-Centric Data Fusion on Structured Web Markup, ICDE2017. Yu, R., Fetahu, B., Gadiraju, U., Lehmberg, O., Ritze, D., Dietze, S., KnowMore - Knowledge Base Augmentation with Structured Web Markup, Semantic Web Journal 2017, under review.
  • 36. FuseM: query-centric data fusion on Web markup. 1. Matching (entity matching): BM25 entity retrieval model on a markup index (Common Crawl) & similarity-based matching. 2. Fact selection (data fusion): ML classifier (SVM, kNN, RandomForest; a supervised SVM classifier in the example) with 3 feature categories (relevance, authority, clustering). Example: query "iPhone 6, type:(Product)"; approx. 125,000 facts for "iPhone6" in Web page markup (Web crawl). Candidate facts: node1 brand _node-x; node1 brand Apple Inc.; node1 weight 129; node2 weight 172; node2 manufacturer Foxconn; node3 releasedate 01.12.1972; node3 manufacturer Foxconn. Entity description: brand: Apple Inc.; weight: 129; date: 30.09.2015; manufacturer: Foxconn; storage: 16 GB. New queries: Foxconn, type:(Organization); Cupertino, type:(City); Apple Inc., type:(Organization). Yu, R., Fetahu, B., Gadiraju, U., Dietze, S., FuseM: Query-Centric Data Fusion on Structured Web Markup, ICDE2017.
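To illustrate what fact selection has to achieve, the sketch below fuses candidate facts by simple majority voting over normalised values. This is deliberately not FuseM's method, which uses a supervised classifier over relevance, authority and clustering features; voting is only a much simpler stand-in, and the candidate facts and the min_support parameter are hypothetical.

```python
from collections import Counter, defaultdict


def fuse_by_vote(facts: list, min_support: int = 2) -> dict:
    """Keep, per predicate, the most frequent normalised value if enough sources agree."""
    votes = defaultdict(Counter)
    for node, predicate, value in facts:
        votes[predicate][value.strip().lower()] += 1
    fused = {}
    for predicate, counter in votes.items():
        value, support = counter.most_common(1)[0]
        if support >= min_support:
            fused[predicate] = value
    return fused


# Hypothetical candidate facts for the query "iPhone 6, type:(Product)".
candidate_facts = [
    ("node1", "brand", "Apple Inc."),
    ("node2", "brand", "apple inc."),
    ("node3", "brand", "Apple"),
    ("node1", "weight", "129"),
    ("node2", "weight", "172"),
    ("node4", "weight", "129"),
    ("node3", "releasedate", "01.12.1972"),
    ("node2", "manufacturer", "Foxconn"),
    ("node3", "manufacturer", "Foxconn"),
]
fused = fuse_by_vote(candidate_facts)
print(fused)
```

Voting alone cannot tell a frequently copied error from a correct value, which is one reason a feature-based classifier is attractive for this task.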
  • 38. Evaluation & results: data fusion performance. Setup: dataset: Products, Movies, Books (approx. 3 billion facts) from Common Crawl / WDC; baselines: BM25: top-k diverse facts via BM25 (Glimmer); CBFS: clustering-based approach [ESWC2015]; PreRecCorr: "Fusing data with correlations" [Pochampally et al., ACM SIGMOD 2014]; 10-fold cross-validation. Results: FuseM beats the baselines in both tasks (strong variance of baselines across tasks); all feature categories contribute. [Figures: query-centric data fusion (precision); query-independent data fusion (P/R/F1).]
  • 39. Results: example of fused entity description. Data fusion result for the book "Brideshead Revisited" (20 distinct facts). New facts (compared to DBpedia): 60%-70% of all facts for books & movies are new (across all KBs); 100% are new for products ("long tail entities" not yet existing in KBs). [Figure: new facts and attributes.]
  • 40. Results: KB augmentation. Augmentation of 15 properties of books (& movies) in three KBs: DB: DBpedia; FB: Freebase; WD: Wikidata. Augmentation performance: % of filled slots (or "knowledge gaps") in the KB. Performance varies heavily (yet some attributes are completed to 100%). [Figure: KBA result for entities of type "Book".] Yu, R., Fetahu, B., Gadiraju, U., Lehmberg, O., Ritze, D., Dietze, S., KnowMore - Knowledge Base Augmentation with Structured Web Markup, Semantic Web Journal 2017, under review.
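The augmentation metric, the percentage of knowledge gaps that fused markup facts can fill, can be computed as in the following sketch. The KB records, property names and fused values are hypothetical placeholders, and a missing slot is modelled as None.

```python
def augmentation_rate(kb: dict, fused: dict, properties: list) -> float:
    """Share of empty KB slots (value None or absent) for which fused markup offers a value."""
    gaps = filled = 0
    for entity, record in kb.items():
        for prop in properties:
            if record.get(prop) is None:
                gaps += 1
                if fused.get(entity, {}).get(prop) is not None:
                    filled += 1
    return filled / gaps if gaps else 0.0


# Hypothetical KB excerpt with gaps, and hypothetical fused facts from markup.
kb = {
    "book1": {"author": "Author A", "isbn": None, "publisher": None},
    "book2": {"author": None, "isbn": None, "publisher": "Publisher X"},
}
fused_facts = {
    "book1": {"isbn": "isbn-1"},
    "book2": {"author": "Author B"},
}
rate = augmentation_rate(kb, fused_facts, ["author", "isbn", "publisher"])
print(f"{rate:.0%} of gaps filled")
```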
  • 41. Linked Data & knowledge graphs Conclusions & outlook 05/04/17 45Stefan Dietze  Retrieval/search of Linked Data hindered by heterogeneity, quality, dynamics etc  Dealing with diversity & heterogeneity o Profiling & recommendation: dataset search & recommendation o Entity retrieval & clustering: entity search
  • 42. Entity node1 name Molecular structure of nucleic acids node1 author James D. Watson node1 publisher Nature node1 datePublished 1956 node1 datePublished 1953 Entity node2 name Francis Crick node2 name Cricks node2 born 1916 Embedded data/markup/tables Unstructured (Web) data/docs Linked Data & knowledge graphs Conclusions & outlook 05/04/17 46Stefan Dietze  Retrieval/search of Linked Data hindered by heterogeneity, quality, dynamics etc  Dealing with diversity & heterogeneity o Profiling & recommendation: dataset search & recommendation o Entity retrieval & clustering: entity search  New forms of (structured) Web data: Web markup (schema.org et al.) & tables o Convergence of structured and unstructured Web (e.g. Voldemort KG, Tonon et al., ISWC2016) o Scale and dynamics (!) o Potential to augment existing knowledge graphs (e.g. Google KG or Microsoft Satori) o Potential training data for NED, entity interlinking and other entity-centric tasks (e.g. OKE Challenge)
  • 43. Contact & resources. @stefandietze, http://stefandietze.net. More on Web markup: talk on Wednesday, 11:00, WWW2017/Digital Learning track. [Figure: example entities drawn from embedded data/markup/tables, unstructured (Web) data/docs, and Linked Data & knowledge graphs: node1 name Molecular structure of nucleic acids; node1 author James D. Watson; node1 publisher Nature; node1 datePublished 1956; node1 datePublished 1953; node2 name Francis Crick; node2 name Cricks; node2 born 1916.]

Editor's Notes

  1. Definition 2.1. ith-Order Value Inconsistency (Dx, Dy, P) between the pair of datasets Dx, Dy with respect to the ith-order single-value property P is the proportion of the equivalent entities in Dx and Dy having contradicting values in P. Definition 2.2. ith-Order Value Incompleteness (Dx, Dy, P) between the pair of datasets Dx, Dy with respect to an ith-order multi-value property P is the proportion of entities in Dx and Dy having different values in P. ISSUES: "different" can mean incorrectness as well as incompleteness.
  2. Filtering: identifying the cluster of datasets which are similar to Ds (two metrics: LSA-based, WordNet-based), threshold theta. Ranking: cosine between profiles. Experimentally better results than using the ranks from the filtering step.
  3. Evaluation: MAP for different similarity thresholds (theta) from the filtering step. When explaining: why is MAP decreasing with higher similarity thresholds? "For the given intervals [0, 0.7], [0, 0.8] and [0, 0.9], with respect to the used measures, we have 100% recall --> all datasets considered as true are present in the recommendation list."
  4. Random sampling: randomly selects resource instances from $R_i \in D_i$ for further analysis in the profiling pipeline. Weighted sampling: weighs each resource as the ratio of the number of datatype properties used to define a resource over the maximum number of datatype properties over all resources for a specific dataset. The weight for $r_k$ is computed by $w_k = |f(r_k)| / \max\{|f(r_j)|\}$ ($r_j \in R_i$, $j = 1, \ldots, n$), where $f(r_k)$ represents the datatype properties of resource $r_k$. An instance is included in a sample if, for a number $p$ randomly generated from a uniform distribution, its weight satisfies $w_k > (1 - p)$. Such a strategy ensures that resources that carry more information (having more literal values) have higher chances of being included earlier at low cut-offs of analysed samples. Resource centrality sampling: weighs each resource as the ratio of the number of resource types used to describe a particular resource ($V'_k \subseteq V_k$) divided by the total number of resource types in a dataset. The weight is defined by $c_k = |C'_k| / |C|$ with $C'_k = C \cap V'_k$. Similarly to weighted sampling, for a randomly generated number $p$, $r_k$ is included in the sample if $c_k > (1 - p)$. The main motivation behind computing the centrality of a resource is that important concepts in a dataset tend to be more structured and linked to other concepts. (Fig. 1: Processing pipeline for generating structured profiles of Linked Data graphs.)
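The weighted sampling rule described in this note can be sketched directly from its definition: each resource gets a weight equal to its number of datatype properties divided by the maximum over the dataset, and it is sampled when the weight exceeds 1 - p for a uniformly drawn p. The resource identifiers and property lists below are hypothetical, and the seeded generator is only there to make the run repeatable.

```python
import random


def sampling_weights(datatype_props: dict) -> dict:
    """w_k = |f(r_k)| / max_j |f(r_j)|, with all weights 0.0 if no resource has properties."""
    max_count = max((len(props) for props in datatype_props.values()), default=0)
    if max_count == 0:
        return {resource: 0.0 for resource in datatype_props}
    return {resource: len(props) / max_count for resource, props in datatype_props.items()}


def weighted_sample(datatype_props: dict, rng: random.Random) -> list:
    """Include a resource if its weight exceeds 1 - p, with p drawn uniformly per resource."""
    weights = sampling_weights(datatype_props)
    return [resource for resource in datatype_props if weights[resource] > 1 - rng.random()]


# Hypothetical resources with their datatype properties.
resources = {
    "r1": ["dc:title", "dc:description", "dc:date", "dc:creator"],
    "r2": ["dc:title", "dc:date"],
    "r3": [],
}
weights = sampling_weights(resources)
sample = weighted_sample(resources, random.Random(42))
print(weights, sample)
```

A resource without any datatype property has weight 0 and is never selected under this rule, while the resource with the most properties is selected almost surely; resource centrality sampling follows the same pattern with the type-based weight in place of the property-based one.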
  5. The underlying assumption is that very specific and targeted seed lists will require different crawling and relevance computation methods than very broad and unspecific seed lists. http://www.visualdataweb.org/relfinder/demo.swf?obj1=TWluaW9ucyAoZmlsbSl8aHR0cDovL2RicGVkaWEub3JnL3Jlc291cmNlL01pbmlvbnNfKGZpbG0p&amp;obj2=U2FuZHJhIEJ1bGxvY2t8aHR0cDovL2RicGVkaWEub3JnL3Jlc291cmNlL1NhbmRyYV9CdWxsb2Nr&amp;obj3=Sm9uIEhhbW18aHR0cDovL2RicGVkaWEub3JnL3Jlc291cmNlL0pvbl9IYW1t&amp;obj4=TWljaGFlbCBLZWF0b258aHR0cDovL2RicGVkaWEub3JnL3Jlc291cmNlL01pY2hhZWxfS2VhdG9u&amp;obj5=QWxsaXNvbiBKYW5uZXl8aHR0cDovL2RicGVkaWEub3JnL3Jlc291cmNlL0FsbGlzb25fSmFubmV5&amp;obj6=RGVzcGljYWJsZSBNZSAyfGh0dHA6Ly9kYnBlZGlhLm9yZy9yZXNvdXJjZS9EZXNwaWNhYmxlX01lXzI=&amp;obj7=U3RldmUgQ29vZ2FufGh0dHA6Ly9kYnBlZGlhLm9yZy9yZXNvdXJjZS9TdGV2ZV9Db29nYW4=&amp;obj8=R2VvZmZyZXkgUnVzaHxodHRwOi8vZGJwZWRpYS5vcmcvcmVzb3VyY2UvR2VvZmZyZXlfUnVzaA==&amp;name=REJwZWRpYSAobWlycm9yKQ==&amp;abbreviation=ZGJw&amp;description=TGlua2VkIERhdGEgdmVyc2lvbiBvZiBXaWtpcGVkaWEu&amp;endpointURI=aHR0cDovL2RicGVkaWEuaW50ZXJhY3RpdmVzeXN0ZW1zLmluZm8=&amp;dontAppendSPARQL=ZmFsc2U=&amp;defaultGraphURI=aHR0cDovL2RicGVkaWEub3Jn&amp;isVirtuoso=dHJ1ZQ==&amp;useProxy=ZmFsc2U=&amp;method=UE9TVA==&amp;autocompleteLanguage=ZW4=&amp;autocompleteURIs=aHR0cDovL3d3dy53My5vcmcvMjAwMC8wMS9yZGYtc2NoZW1hI2xhYmVs&amp;ignoredProperties=aHR0cDovL3d3dy53My5vcmcvMTk5OS8wMi8yMi1yZGYtc3ludGF4LW5zI3R5cGUsaHR0cDovL3d3dy53My5vcmcvMjAwNC8wMi9za29zL2NvcmUjc3ViamVjdCxodHRwOi8vZGJwZWRpYS5vcmcvcHJvcGVydHkvd2lraVBhZ2VVc2VzVGVtcGxhdGUsaHR0cDovL2RicGVkaWEub3JnL3Byb3BlcnR5L3dvcmRuZXRfdHlwZSxodHRwOi8vZGJwZWRpYS5vcmcvcHJvcGVydHkvd2lraWxpbmssaHR0cDovL2RicGVkaWEub3JnL29udG9sb2d5L3dpa2lQYWdlV2lraUxpbmssaHR0cDovL3d3dy53My5vcmcvMjAwMi8wNy9vd2wjc2FtZUFzLGh0dHA6Ly9wdXJsLm9yZy9kYy90ZXJtcy9zdWJqZWN0&amp;abstractURIs=aHR0cDovL2RicGVkaWEub3JnL29udG9sb2d5L2Fic3RyYWN0&amp;imageURIs=aHR0cDovL2RicGVkaWEub3JnL29udG9sb2d5L3RodW1ibmFpbCxodHRwOi8veG1sbnMuY29tL2ZvYWYvMC4xL2RlcGljdGlvbg==&amp;linkURIs=aH
R0cDovL3B1cmwub3JnL29udG9sb2d5L21vL3dpa2lwZWRpYSxodHRwOi8veG1sbnMuY29tL2ZvYWYvMC4xL2hvbWVwYWdlLGh0dHA6Ly94bWxucy5jb20vZm9hZi8wLjEvcGFnZQ==&amp;maxRelationLegth=Mg==
  6. As shown in our experimental evaluation, specific entities within a seed list strongly reflect the crawl intent. In the seed list {Pulp Fiction, Film, Entertainment}, the most specific entity, 'Pulp Fiction', reflects the most specific crawl intent, whereas the entities 'Film' and 'Entertainment' provide contextual information, namely that 'Pulp Fiction' is a movie. Motivated by this, we assume that the relevance of specific candidate entities depends on the seed entity they are related to. For example, candidate entities similar to the entity 'Pulp Fiction' will be ranked higher than entities that are similar to other seed entities.
  7. The average improvement across different NDCG levels is 1.6% at depth 2 and 4.3% at depth 3, suggesting a positive effect of the attrition factor for the cases of our seed lists. On the other hand, the coherence of the seed list appears to have no significant impact on the suitability of a particular configuration. Given the significantly increased runtime when crawling beyond hop 2, a crawl depth of 2 seems to provide optimal efficiency, and it is not advisable to crawl to a higher distance.
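As the improvements here are reported in terms of NDCG, a short sketch of the metric may be useful. It assumes linear gains and computes the ideal ordering from the graded relevance scores of the evaluated list itself; the relevance grades are hypothetical and do not come from the crawling experiments.

```python
import math


def dcg(gains: list, k: int) -> float:
    """Discounted cumulative gain over the top-k positions, using linear gains."""
    return sum(gain / math.log2(position + 1)
               for position, gain in enumerate(gains[:k], start=1))


def ndcg(gains: list, k: int) -> float:
    """DCG normalised by the DCG of the same gains sorted in descending order."""
    ideal = dcg(sorted(gains, reverse=True), k)
    return dcg(gains, k) / ideal if ideal > 0 else 0.0


# Hypothetical graded relevance of crawled entities, in ranked order.
ranked_gains = [1, 3]
print(round(ndcg(ranked_gains, k=2), 4))
```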
  9. This is due to the fact that high-coherence seed lists have a more specific crawl intent, leading to narrow and often small result sets, and hence also a limited ground truth, while the low-coherence lists have a much broader crawl intent as well as relevant entity set. This is reflected in our ground truth: the average number of entities labeled as related (score ≥ 3) is 208 for low-coherence seed lists and 145 for high-coherence seed lists. Meanwhile, the narrow search intent also causes more disagreement among crowdsourcing workers when generating the ground truth, which makes the results for high-coherence seed lists less consensual. Another difficulty faced when evaluating the crawling task is the highly heterogeneous and varied nature of the possible result sets, originating from a highly heterogeneous Linked Data graph.