Retrieval, Crawling and Fusion of
Entity-centric Data on the Web
Stefan Dietze
L3S Research Center, Hannover, Germany
- Keynote at 2nd International Keystone Conference, IKC2016 -
09/09/16, Stefan Dietze
Research areas
 Web science, Information Retrieval, Semantic Web, Social Web
Analytics, Knowledge Discovery, Human Computation
 Interdisciplinary application areas: digital humanities,
TEL/education, Web archiving, mobility
Some projects
L3S Research Center
 See also: http://www.l3s.de
Acknowledgements: team
 Pavlos Fafalios (L3S)
 Besnik Fetahu (L3S)
 Ujwal Gadiraju (L3S)
 Eelco Herder (L3S)
 Ivana Marenzi (L3S)
 Ran Yu (L3S)
 Pracheta Sahoo (L3S, IIT India)
 Bernardo Pereira Nunes (L3S, PUC Rio de Janeiro)
 Mathieu d’Aquin (The Open University, UK)
 Mohamed Ben Ellefi (LIRMM, France)
 Davide Taibi (CNR, Italy)
 Konstantin Todorov (LIRMM, France)
 ...
Structured (linked) data on the Web: state of affairs
SPARQL endpoint availability over time [Buil-Aranda et al 2013]
Accessibility of datasets?
 Less than 50% of all SPARQL endpoints are actually responsive at any given point in time [Buil-Aranda2013]
 “The” SPARQL protocol? No: many variants & subsets exist
Semantics, links, quality?
 …data accuracy (e.g. DBpedia)? [Paulheim2013]
 …vocabulary reuse? [D’AquinWebSci13]
 …schema compliance (RDFS, schemas) [HoganJWS2012]
Assessing the Educational Linked Data Landscape, D’Aquin, M., Adamou, A.,
Dietze, S., ACM Web Science 2013 (WebSci2013), Paris, France, May 2013.
Type Inference on Noisy RDF Data, Paulheim H., Bizer, C. Semantic Web – ISWC
2013, Lecture Notes in Computer Science Volume 8218, 2013, pp 510-525
An empirical survey of Linked Data conformance. Hogan, A., Umbrich, J., Harth,
A., Cyganiak, R., Polleres, A., Decker, S., Journal of Web Semantics 14, 2012
SPARQL Web-Querying Infrastructure: Ready for Action?, Carlos Buil-Aranda,
Aidan Hogan, Jürgen Umbrich, Pierre-Yves Vandenbussche, International
Semantic Web Conference 2013 (ISWC2013).
Data quality and consistency
Analyzing Relative Incompleteness of Movie Descriptions
in the Web of Data: A Case Study, Yuan, W., Demidova, E.,
Dietze, S., Zhu, X., International Semantic Web Conference
2014 (ISWC2014)
Challenge for search/retrieval – heterogeneity of datasets & entities
Discovery of suitable (1) datasets & (2) entities matching:
 Quality? Currentness, dynamics, accessibility/reliability,
data quantity & quality?
 Topics/scope? Datasets/entities useful & trustworthy for
topic XY?
 Types? Datasets/entities about statistics, organisations,
videos, slides, publications etc?
Overview
I – Challenges
II – Enabling discovery & search in Linked Data & Knowledge Graphs
 Dataset recommendation
 Dataset profiling
 Entity retrieval
III – Beyond Linked Data – exploiting embedded Web semantics
 Web markup as emerging data source
 Case studies
 Data fusion for entity reconciliation (and retrieval)
IV – Wrap-up
Dealing with diversity
and heterogeneity
Other emerging forms of
structured data on the Web?
Dataset recommendation I
Approach
 Given dataset s, ranking datasets from D
according to probability score (di, t) to
contain linking candidates (entities)
 Features:
 Approach 1: vocabulary overlap
 Approach 2: existing links (SNA)
 Linking candidates likely if datasets share
common (a) schema elements, or (b) links
(friend of a friend)
Conclusions
 Roughly 50% MAP for both approaches
 Simplistic approach (!)
Lopes, G.R., Paes Leme, L. A., Nunes, B.P., Casanova, M.A.,
Dietze, S., Two approaches to the dataset interlinking
recommendation problem, 15th International Conference on
Web Information System Engineering (WISE 2014),
Thessaloniki, Greece.
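Approach 1 (vocabulary overlap) can be illustrated with a minimal sketch. It assumes each dataset is summarised by the set of schema terms it uses and lets Jaccard similarity stand in for the probability score; the scoring in the cited paper may differ. The dataset names and term sets below are hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity of two term sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def rank_by_vocabulary_overlap(source_terms, candidates):
    """Rank candidate datasets by schema-term overlap with the source dataset."""
    scored = [(name, jaccard(source_terms, terms)) for name, terms in candidates.items()]
    # Highest overlap first; ties are broken alphabetically for a stable order.
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))


# Hypothetical vocabularies, for illustration only.
source = {"foaf:Person", "dcterms:title", "bibo:Article"}
candidates = {
    "dataset_a": {"foaf:Person", "dcterms:title", "bibo:Article", "bibo:Journal"},
    "dataset_b": {"foaf:Person", "geo:lat"},
    "dataset_c": {"schema:Movie"},
}
ranking = rank_by_vocabulary_overlap(source, candidates)
```

Datasets sharing more schema elements with the source end up higher in `ranking`; Approach 2 would replace the term sets with existing links between datasets.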
Example ranking of candidate datasets
Rank Dataset
1 DBLP
2 ACM
3 OAI
4 CiteSeer
5 IBM
6 Roma
7 IEEE
8 Ulm
9 Pisa
Goal: finding candidate datasets, e.g. for entity retrieval
or interlinking tasks (e.g. enrichment)
Ben Ellefi, M., Bellahsene, Z., Dietze, S., Todorov, K.,
Intension-based Dataset Recommendation for Data
Linking, 13th Extended Semantic Web Conference
(ESWC2016), Heraklion, Crete, May 2016.
Dataset recommendation II
L. Han, A. L. Kashyap, T. Finin, J. Mayfield, and J. Weese, "UMBC_EBIQUITY-CORE: Semantic textual similarity systems", in Proc. of *SEM, Association for Computational Linguistics, 2013.
Pipeline stages (figure): Preprocessing, Datasets filtering, Datasets ranking
Dataset recommendation II: results
Data & ground truth
 Experiments on (responsive) datasets
from LOD Cloud (http://datahub.io)
 Concept profiles from
http://lov.okfn.org
 Ground truth: existing links from VoID
profiles of datasets
(issue: not always representative of
actual linksets)
Results
 MAP for different similarity thresholds
from step 2 max. 54%
 Recall 100% below indicated similarity
(clustering) thresholds
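For reference, the MAP figures quoted here follow the standard definition of mean average precision. The sketch below computes it for ranked dataset recommendations against a set of truly linked datasets; the example ranking and relevance judgements are hypothetical and are not taken from the experiments.

```python
def average_precision(ranked, relevant):
    """Average precision of one ranked list against a set of relevant items."""
    hits = 0
    precision_sum = 0.0
    for position, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            precision_sum += hits / position  # precision at this cut-off
    return precision_sum / len(relevant) if relevant else 0.0


def mean_average_precision(runs):
    """Mean of the average precision over (ranked, relevant) pairs."""
    if not runs:
        return 0.0
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


# Hypothetical example: two source datasets with their recommendations.
runs = [
    (["DBLP", "ACM", "OAI", "CiteSeer"], {"DBLP", "OAI"}),
    (["IEEE", "Ulm", "Pisa"], {"Pisa"}),
]
map_score = mean_average_precision(runs)
```

A recall of 100% means every dataset in the ground-truth set appears somewhere in the ranking; MAP additionally rewards placing those datasets near the top.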
Dataset search through dataset cataloging & profiling
Dataset
Catalog/Registry
http://data.linkededucation.org/linkedup/catalog/
 LinkedUp project (FP7 project: L3S, OU, OKFN, Elsevier, Exact Learning Solutions)
 LinkedUp Catalog: largest collection of Linked Data (LD) about educationally relevant resources (approx. 50 datasets)
 Original datasets published with key content providers, automatically extracted metadata
LinkedUp Catalog: dataset index & registry, federated search
 “Federated queries” through schema mappings [WebSci13]
 Dataset accessibility
 Linking & topic profiling [ESWC14]
Figure: dataset topic profiles in the Dataset Catalog/Registry
Example resource: Yovisto Video (yov:Video)
<yo:Video …>
<dc:title>Lecture 29 – Stem Cells</dc:title>
…
</yo:Video…>
Dataset topic profile: db:Biology, db:Cell biology
Scalable profiling of datasets
 Extraction of representative (DBpedia) categories (“topic profile”) for arbitrary datasets?
 Technically trivial through established NER/NED approaches, but scalability issues
(recall: LOD Cloud 1000+ datasets with <100 billion RDF statements)
 Efficient approach: sampling & ranking for balance between scalability and precision/recall
A Scalable Approach for Efficiently Generating
Structured Dataset Topic Profiles, Fetahu, B.,
Dietze, S., Nunes, B. P., Casanova, M. A., Nejdl, W.,
11th Extended Semantic Web Conference
(ESWC2014), Crete, Greece, (2014).
db:Cell
(Biology)
Efficient dataset profiling
1. Sampling of resources
(random sampling, weighted sampling, resource
centrality sampling)
2. Entity- & topic-extraction (NER via DBpedia Spotlight,
category mapping & -expansion)
3. Normalisation & ranking (graph-based models such as
PageRank with Priors, HITS with Priors & K-Step Markov)
 Result: weighted dataset-topic profile graph
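The three steps above can be sketched roughly as follows. This is a simplified illustration, not the authors' implementation: step 2 (entity and topic extraction) is assumed to have already produced a small graph, and only weighted sampling and a basic PageRank with priors are shown. All node names, weights and the `beta` back-probability value are hypothetical.

```python
import random


def weighted_sample(resources, weights, k, seed=0):
    """Step 1: draw k distinct resources, favouring larger weights."""
    rng = random.Random(seed)
    pool = list(zip(resources, weights))
    chosen = []
    for _ in range(min(k, len(pool))):
        threshold = rng.uniform(0, sum(w for _, w in pool))
        acc = 0.0
        index = len(pool) - 1  # fallback guards against rounding at the upper end
        for i, (_, w) in enumerate(pool):
            acc += w
            if threshold <= acc:
                index = i
                break
        chosen.append(pool.pop(index)[0])
    return chosen


def pagerank_with_priors(edges, priors, beta=0.15, iterations=50):
    """Step 3: rank nodes; with probability beta the walk jumps back to the priors."""
    nodes = set(edges) | {v for vs in edges.values() for v in vs} | set(priors)
    rank = {n: priors.get(n, 0.0) for n in nodes}
    for _ in range(iterations):
        nxt = {n: beta * priors.get(n, 0.0) for n in nodes}
        for u in nodes:
            out = edges.get(u, [])
            if out:
                share = (1 - beta) * rank[u] / len(out)
                for v in out:
                    nxt[v] += share
            else:
                # Dangling node: hand its mass back to the prior distribution.
                for n, p in priors.items():
                    nxt[n] += (1 - beta) * rank[u] * p
        rank = nxt
    return rank


# Hypothetical graph: dataset -> sampled entities -> DBpedia categories.
sampled = weighted_sample(["r1", "r2", "r3", "r4"], [4.0, 3.0, 2.0, 1.0], k=2)
edges = {
    "dataset": ["e:Stem_cell", "e:Cell_division"],
    "e:Stem_cell": ["db:Cell biology", "db:Biology"],
    "e:Cell_division": ["db:Cell biology"],
}
profile = pagerank_with_priors(edges, priors={"dataset": 1.0})
```

The scores of the category nodes in `profile` play the role of the weighted dataset-topic profile; HITS with Priors or K-Step Markov would replace `pagerank_with_priors` in the same position.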
Search & exploration of datasets through topic profiles
 Applied to entire LOD cloud/graph
 Visual exploration of extracted RDF dataset profiles
(datasets, topics, relationships)
 Evaluation results: K-Step Markov (10% sampling size)
outperforms baselines (LDA, tf/idf on entire datasets)
http://data-observatory.org/lod-profiles/
Search: entity retrieval on large structured datasets?
 How to efficiently retrieve (related) entities/resources for given entity-seeking (keyword) query?
 State of the art: BM25F on inverted entity index (Blanco et al, ISWC2011)
 Challenges/observations:
 Explicit entity links (owl:sameAs etc) are sparse yet important to facilitate state of the art methods
 Query type affinity?
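As background for the BM25F baseline, here is a minimal sketch of the commonly used simple BM25F formulation: per-field term frequencies are length-normalised, weighted and summed before saturation. The field names, weights and parameter values are assumptions for illustration, not the configuration of Blanco et al.

```python
import math
from collections import Counter


def build_stats(entities, fields):
    """Collection statistics: entity count, average field lengths, document frequencies."""
    n = len(entities)
    avg_len = {f: sum(len(e.get(f, [])) for e in entities) / n for f in fields}
    df = Counter()
    for e in entities:
        terms = set()
        for f in fields:
            terms.update(e.get(f, []))
        df.update(terms)
    return {"n": n, "avg_len": avg_len, "df": df}


def bm25f_score(query_terms, entity, stats, field_weights, k1=1.2, b=0.75):
    """Score one entity (dict: field -> list of tokens) for a keyword query."""
    score = 0.0
    for term in query_terms:
        pseudo_tf = 0.0
        for field, weight in field_weights.items():
            tokens = entity.get(field, [])
            tf = tokens.count(term)
            if tf == 0:
                continue
            # Length normalisation per field, then field weighting.
            norm = 1 - b + b * len(tokens) / stats["avg_len"][field]
            pseudo_tf += weight * tf / norm
        if pseudo_tf == 0.0:
            continue
        df = stats["df"].get(term, 0)
        idf = math.log(1 + (stats["n"] - df + 0.5) / (df + 0.5))
        score += idf * pseudo_tf / (k1 + pseudo_tf)
    return score


# Hypothetical entity descriptions with two fields.
e1 = {"label": ["tim", "berners", "lee"], "comment": ["inventor", "of", "the", "web"]}
e2 = {"label": ["web", "history"], "comment": ["tim", "berners", "lee", "invented", "it"]}
weights = {"label": 3.0, "comment": 1.0}
stats = build_stats([e1, e2], list(weights))
query = ["tim", "berners", "lee"]
s1 = bm25f_score(query, e1, stats, weights)
s2 = bm25f_score(query, e2, stats, weights)
```

With the label field weighted higher, the entity whose label matches the query scores above the one that only mentions the terms in its comment.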
Figure: query for entities related to <Tim Berners Lee> against a large dataset/crawl, e.g. LinkedUp dataset graph, BTC2014, Dynamic LD Observatory (DyLDO)
Entity retrieval: approach
(I) Offline processing (clustering to address link sparsity)
1. Feature vectors (lexical and structural features)
2. Bucketing: per type (LSH algorithm)
3. Clustering: X-means & Spectral clustering per bucket
Improving Entity Retrieval on Structured Data,
Fetahu, B., Gadiraju, U., Dietze, S., 14th International
Semantic Web Conference (ISWC2015), Bethlehem,
US, (2015).
(II) Online processing (retrieval)
1. Retrieval & expansion:
a) BM25F results
b) expansion from clusters (related entities)
2. Re-Ranking
(context terms & query type affinity)
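The online phase can be illustrated with a small sketch. It assumes the offline clustering has already assigned every entity to a cluster, and it replaces the paper's re-ranking features (context terms and query type affinity) with a plain query-term overlap score, so it shows the control flow only. All identifiers and data are hypothetical.

```python
def expand_and_rerank(query_terms, bm25f_hits, cluster_of, descriptions, limit=10):
    """Expand BM25F hits with cluster members, then re-rank by query-term overlap."""
    # Invert the entity -> cluster assignment produced offline.
    members = {}
    for entity, cluster in cluster_of.items():
        members.setdefault(cluster, []).append(entity)

    # 1b) Expansion: add entities sharing a cluster with any BM25F hit.
    candidates = list(bm25f_hits)
    for hit in bm25f_hits:
        for other in members.get(cluster_of.get(hit), []):
            if other not in candidates:
                candidates.append(other)

    # 2) Re-ranking: relevance to the query (simplified stand-in feature).
    query = set(query_terms)
    if not query:
        return candidates[:limit]

    def relevance(entity):
        return len(query & set(descriptions.get(entity, []))) / len(query)

    # sorted() is stable, so BM25F order is kept among equally relevant entities.
    return sorted(candidates, key=relevance, reverse=True)[:limit]


# Hypothetical data.
cluster_of = {"e1": 0, "e2": 0, "e3": 1}
descriptions = {
    "e1": ["tim", "berners", "lee"],
    "e2": ["tim", "berners-lee", "w3c"],
    "e3": ["web"],
}
results = expand_and_rerank(["tim", "berners", "lee"], ["e1"], cluster_of, descriptions)
```

Here `e2` is returned although BM25F did not retrieve it, because it shares a cluster with `e1`; this is how clustering compensates for missing explicit links.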
Entity retrieval: evaluation
Dataset
 BTC2014 (4 billion entities)
 92 SemSearch queries
Methods
 Our approaches: XM: X-means, SP: Spectral
 Baselines: B: BM25F, S1: Tonon et al [SIGIR12]
Conclusions
 XM & SP outperform baselines
 Clustering to remedy link sparsity
 Relevance to query more important than
relevance to BM25F results
Overview
I – Challenges
II – Enabling discovery & search in Linked Data & Knowledge Graphs
 Dataset recommendation
 Dataset profiling
 Entity retrieval
III – Beyond Linked Data – exploiting embedded Web semantics
 Web markup as emerging data source
 Case studies
 Data fusion for entity reconciliation (and retrieval)
IV – Wrap-up
Dealing with diversity
and dynamics
Other emerging forms of
structured data on the Web?
Semantics (structured data) on the Web?
 Linked Data: approx. 1000 datasets & 100 billion statements
- different order of magnitude wrt scale & dynamics
vs
 The Web: approx. 46.000.000.000.000 (46 trillion) Web pages indexed
by Google
 Other “semantics” (structured facts) on the Web?
Embedded semantics: Web page markup & schema.org
 Embedded markup (RDFa, Microdata, Microformats) for
interpretation of Web documents (search, retrieval)
 Arbitrary vocabularies; schema.org used at scale
(700 classes, 1000 predicates)
 Adoption on the Web: 26 %
(2014 Google study of 12 bn Web pages)
 “Web Data Commons” (Meusel & Paulheim [ISWC2014])
• Markup from Common Crawl (2.2 billion pages):
17 billion RDF quads
• Markup in 26% of pages, 14% of PLDs in 2013
(increase from 6% in 2011)
 Same order of magnitude as “the Web”
<div itemscope itemtype="http://schema.org/Movie">
<h1 itemprop="name">Forrest Gump</h1>
<span>Actor: <span itemprop="actor">Tom Hanks</span></span>
<span itemprop="genre">Drama</span>
...
</div>
RDF statements
node1 actor _node-x
node1 actor Robin Wright
node1 genre Comedy
node2 actor T. Hanks
node2 distributed by Paramount Pic.
node3 actor Tom Cruise
node3 distributed by Paramount Pic.
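How such statements are derived from Microdata can be sketched with Python's standard `html.parser`. This is a deliberately limited illustration: it handles only flat items with text-valued `itemprop` elements, ignores nested items, `content`/`href` values and void elements, and uses made-up node identifiers (`node1`, `node2`, ...). Real extractors such as those behind Web Data Commons are far more complete.

```python
from html.parser import HTMLParser


class MicrodataParser(HTMLParser):
    """Collect (node, property, value) statements from flat Microdata markup."""

    def __init__(self):
        super().__init__()
        self.statements = []
        self._item_count = 0
        self._item = None   # identifier of the item currently being read
        self._prop = None   # itemprop whose text is being collected
        self._text = []
        self._depth = 0     # nesting depth inside the itemprop element

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if self._prop is not None:
            self._depth += 1  # child element inside a property value
            return
        if "itemscope" in attrs:
            self._item_count += 1
            self._item = f"node{self._item_count}"
            if "itemtype" in attrs:
                self.statements.append((self._item, "type", attrs["itemtype"]))
        elif "itemprop" in attrs and self._item is not None:
            self._prop = attrs["itemprop"]
            self._text = []
            self._depth = 0

    def handle_data(self, data):
        if self._prop is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if self._prop is None:
            return
        if self._depth > 0:
            self._depth -= 1
            return
        value = "".join(self._text).strip()
        self.statements.append((self._item, self._prop, value))
        self._prop = None


html_snippet = """
<div itemscope itemtype="http://schema.org/Movie">
<h1 itemprop="name">Forrest Gump</h1>
<span>Actor: <span itemprop="actor">Tom Hanks</span></span>
<span itemprop="genre">Drama</span>
</div>
"""
parser = MicrodataParser()
parser.feed(html_snippet)
statements = parser.statements
```

Applied to many pages, each page yields its own nodes, which is why the same movie can appear under several node identifiers with conflicting values, as in the statements listed above.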
Example: embedded Web markup data about “products”
 schema:Product instances in Web Data Commons
 Facts: 1.414.937.431
(= 302.246.120 instances, i.e. products)
 Providers (distinct Pay Level Domains, PLDs): 93.705
 Power law distribution of terms across PLDs
 Top 10 PLDs
 Top provider? (company)
PLD                          # Resources
www.crateandbarrel.com       33.517.936
www.bentgate.com             17.215.499
www.aliexpress.com            9.621.943
www.ebay.com.au               8.861.308
us.fotolia.com                7.939.982
www.ebay.co.uk                6.556.820
www.competitivecyclist.com    6.214.500
www.maxstudio.com             6.075.626
Figure: # entities and # statements per PLD (ranked), count on log scale; annotation: approx. 35 million resources
Example: Web markup of bibliographic resources
Study on sample Web crawl (WDC)
 Metadata about scholarly articles (e.g.
s:ScholarlyArticle): 6.793.764 quads, 1.184.623
entities, 429 distinct predicates
(in WDC and for 1 type alone)
 Top domains: Springer, MDPI, BMJ,
diabetesjournals.org, mendeley.com,
biodiversitylibrary.org
Domains, topics, disciplines?
 Life Sciences and Computer Science predominant
 Top-10 article titles
 Most important publishers/journals, libraries
represented
Sahoo, P., Gadiraju, U., Yu, R., Saha, S., Dietze, S., Analysing
Structured Scholarly Data embedded in Web Pages, Semantics,
Analytics, Visualisation: Enhancing Scholarly Data (SAVE-SD2016),
co-located with the 25th International World Wide Web Conference,
Montreal, Canada, April 11, 2016
Example: entity markup of learning resources on the Web
 “Learning Resource Metadata Initiative (LRMI)”:
schema.org vocabulary for annotation of learning
resources (informal, formal, etc)
 Approx. 5000 PLDs in “Common Crawl”
 LRMI adoption on the Web (WDC) [LILE16]:
 2014: 30.599.024 quads, 4.182.541 resources
 2013: 10.636.873 quads, 1.461.093 resources
Figure: power law distribution across providers (4805 providers/PLDs)
Taibi, D., Dietze, S., Towards embedded markup of learning resources
on the Web: a quantitative Analysis of LRMI Terms Usage, in
Companion Publication of the IW3C2 WWW 2016 Conference, IW3C2
2016, Montreal, Canada, April 11, 2016
Entity retrieval on Web markup: state of the art
 Glimmer
(http://glimmer.research.yahoo.com)
 Entity retrieval on WDC dataset
[Blanco, Mika & Vigna, ISWC2011]
 BM25F retrieval model on WDC index
Entity retrieval on Web markup: challenges
Characteristics and examples
 Coreferences: 18.000 results for <“Iphone 6”, type, s:Product>
(8,6 quads on average) in Common Crawl
 Redundancy: <s, schema:name, “Iphone 6”> occurring 1000 times in Common Crawl (CC)
 Lack of links: largely unlinked entity descriptions
 Errors (typos & schema violations, see Meusel et al [ESWC2015]):
o Wrong namespaces, such as http://schma.org
o Undefined types & predicates: 9,7 %, less common than in LOD
o Confusion of datatype and object properties:
<s1, s:publisher, “Springer”>, 24,35 % object property issues vs 8 % in LOD
o Data property range violations: e.g. literals vs numbers
(12,6 % vs 4,6 % in LOD)
 Using markup as (highly distributed) knowledge graph?
A Survey on Challenges for Entity Retrieval
in Markup Data, Yu, R., Gadiraju, U., Fetahu,
B., Dietze, S., 15th International Semantic Web
Conference (ISWC2016), Kobe, Japan (2016).
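One of the error classes listed on this slide, wrong namespaces such as http://schma.org, lends itself to a small check. The sketch below compares a term's namespace against a list of known namespaces using the standard library's `difflib`; the list of known namespaces and the similarity cutoff are assumptions, and this is not the method used in the cited surveys.

```python
import difflib
from urllib.parse import urlsplit

# Assumed whitelist; a real check would cover many more vocabularies.
KNOWN_NAMESPACES = ["http://schema.org/", "https://schema.org/"]


def check_namespace(term_uri, known=KNOWN_NAMESPACES, cutoff=0.8):
    """Classify the namespace of a term URI as ok, probable typo, or unknown."""
    parts = urlsplit(term_uri)
    namespace = f"{parts.scheme}://{parts.netloc}/"
    if namespace in known:
        return ("ok", namespace)
    # Closest known namespace by difflib's similarity ratio, if any passes the cutoff.
    close = difflib.get_close_matches(namespace, known, n=1, cutoff=cutoff)
    if close:
        return ("probable typo", close[0])
    return ("unknown", None)


valid = check_namespace("http://schema.org/Product")
typo = check_namespace("http://schma.org/Product")
other = check_namespace("http://example.com/Thing")
```

A repair step could rewrite terms flagged as probable typos to the suggested namespace before indexing, which recovers facts that would otherwise be lost to retrieval.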
Entity retrieval & reconciliation on markup
 Obtaining consolidated entity description/facts (or graph) for a
given resource/entity from Web markup?
 Aiding tasks such as document annotation, augmentation
or semantic enrichment of existing data- or knowledge bases
Figure: from query to consolidated entity description
Query: iPhone 6, type:(Product)
Web (crawl) (e.g. Common Crawl/WDC, focused crawl):
<e1, s:name, “Iphone 6”>
<e2, s:brand, “Apple Inc.”>
<e3, s:brand, “Apple”>
<e4, s:weight, 127>
<e5, s:releaseDate, “1.12.1972”>
Entity Description:
brand Apple Inc.
weight 129
date 30.09.2015
manufacturer Foxconn
Storage 16 GB
Yu, R., Gadiraju, U., Zhu, X., Fetahu, B., S. Dietze, Entity
summarisation on structured web markup. In The
Semantic Web: ESWC 2016 Satellite Events. Springer, 2016.
Yu, R., Gadiraju, U., Zhu, X., Fetahu, B., S. Dietze, Fact
Selection for data fusion on structured web markup.
ICDE2017, IEEE International Conference on Data
Engineering, in progress.
A supervised approach for data fusion on markup
 Fact/entity retrieval: BM25 entity retrieval model on markup index (Common Crawl)
 Fact selection/data fusion: ML classifier (SVM), using 3 feature categories (relevance, authority, clustering)
 Experiments on Common Crawl: products, movies, books (approx. 3 billion facts)
Figure: two-step pipeline
Query: iPhone 6, type:(Product)
1. Retrieval: Web (crawl), Web page markup; approx. 125.000 facts for “iPhone6”
Candidate Facts
node1 brand _node-x
node1 brand Apple Inc.
node1 weight 129
node2 weight 172
node2 manufacturer Foxconn
node3 releasedate 01.12.1972
node3 manufacturer Foxconn
2. Fact selection (supervised SVM classifier)
Entity Description
brand Apple Inc.
weight 129
date 30.09.2015
manufacturer Foxconn
Storage 16 GB
New Queries
Foxconn, type:(Organization)
Cupertino, type:(City)
Apple Inc., type:(Organization)
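To make the fact selection step concrete, the sketch below fuses the candidate facts from the figure into one value per predicate. Note the substitution: instead of the supervised SVM classifier with relevance, authority and clustering features, it uses a plain frequency vote, which is closer in spirit to a heuristic baseline. It is meant only to show the input and output of the step.

```python
from collections import Counter, defaultdict


def fuse_by_majority(candidate_facts):
    """Pick one value per predicate by frequency, skipping blank-node values."""
    votes = defaultdict(Counter)
    for _subject, predicate, value in candidate_facts:
        if value.startswith("_"):  # unresolved blank node such as "_node-x"
            continue
        votes[predicate][value] += 1
    # most_common(1) yields the most frequent value for each predicate.
    return {p: counts.most_common(1)[0][0] for p, counts in votes.items()}


# Candidate facts as listed on the slide (values kept as strings).
candidate_facts = [
    ("node1", "brand", "_node-x"),
    ("node1", "brand", "Apple Inc."),
    ("node1", "weight", "129"),
    ("node2", "weight", "172"),
    ("node2", "manufacturer", "Foxconn"),
    ("node3", "releasedate", "01.12.1972"),
    ("node3", "manufacturer", "Foxconn"),
]
description = fuse_by_majority(candidate_facts)
```

A frequency vote cannot tell that the release date 01.12.1972 is implausible or decide between the two conflicting weights; handling such cases is what the learned classifier and its authority and relevance features are for.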
Evaluation & results (1/2)
Evaluation setup
 Comparison with baselines:
 BM25: Top-k distinct facts via BM25
 CBFS: clustering/heuristics-based
approach
 Expert-labeled ground truth
Results
 Supervised learning approach (SumSVM,
SumDIV) outperforms baselines
 Strong variance of results across query
sets (for baselines, not our approach)
 Strongest performance considering all
feature sets
Precision results
Evaluation & results (2/2): markup for KB augmentation?
 Comparison of obtained facts with existing
knowledge bases (DBpedia)
o “existing”: fact already in DBpedia
o “new”: fact not existing in DBpedia
(e.g. a book’s releaseDate in Wiki/DBpedia)
o “new-p”: property not existing in DBpedia
(e.g. a book’s release countries)
 60-70% new facts for books & movies
 100% new facts for queried products
(apparently not existing in DBpedia)
 Vast potential for KB augmentation (!)
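The three categories can be expressed as a small classification routine. This is one possible reading of the definitions above, assuming the knowledge base is available as a set of (subject, predicate, value) triples plus the set of predicates it knows; the book facts and predicate names are hypothetical.

```python
from collections import Counter


def classify_fact(fact, kb_facts, kb_predicates):
    """Label a fused fact as 'existing', 'new' or 'new-p' relative to a knowledge base."""
    _subject, predicate, _value = fact
    if predicate not in kb_predicates:
        return "new-p"      # the KB has no such property at all
    if fact in kb_facts:
        return "existing"   # the exact fact is already in the KB
    return "new"            # known property, but this fact is missing


# Hypothetical knowledge base content and fused facts for one book.
kb_facts = {("book1", "author", "Jane Doe")}
kb_predicates = {"author", "releaseDate"}
fused = [
    ("book1", "author", "Jane Doe"),
    ("book1", "releaseDate", "2001"),
    ("book1", "releaseCountry", "Germany"),
]
labels = [classify_fact(f, kb_facts, kb_predicates) for f in fused]
share_new = Counter(labels)["new"] / len(labels)
```

Figures such as the 60-70% of new facts reported above correspond to shares like `share_new`, computed over the facts selected for each queried entity.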
Conclusions & outlook
 Retrieval/search of Linked Data hindered by
heterogeneity, quality, dynamics etc
 Dealing with diversity & heterogeneity
o Profiling & recommendation: dataset search &
recommendation
o Entity retrieval & clustering: entity search
 New forms of (structured) Web data:
Web markup (schema.org et al)
o Convergence of structured and unstructured
Web
o Scale and dynamics (!)
o Potential to augment existing knowledge
graphs
o Potential training data for NED, entity
interlinking and similar entity-centric problems
Figure: example entity descriptions spanning embedded data/markup, unstructured (Web) data/docs, and Linked Data & knowledge graphs
Entity
node1 name Molecular structure of nucleic acids
node1 author James D. Watson
node1 publisher Nature
node1 datePublished 1956
node1 datePublished 1953
Entity
node2 name Francis Crick
node2 name Cricks
node2 born 1916
Thank you!
http://stefandietze.net
@stefandietze

A Framework Concept for Profiling Researchers on Twitter using the Web of DataA Framework Concept for Profiling Researchers on Twitter using the Web of Data
A Framework Concept for Profiling Researchers on Twitter using the Web of DataLaurens De Vocht
 
lodlam summit session browsable linked data
lodlam summit session browsable linked datalodlam summit session browsable linked data
lodlam summit session browsable linked dataEnno Meijers
 
Presentation of LUCERO at EURECOM
Presentation of LUCERO at EURECOMPresentation of LUCERO at EURECOM
Presentation of LUCERO at EURECOMMathieu d'Aquin
 

Retrieval, Crawling and Fusion of Entity-centric Data on the Web

  • 1. Retrieval, Crawling and Fusion of Entity-centric Data on the Web. Stefan Dietze, L3S Research Center, Hannover, Germany. Keynote at the 2nd International Keystone Conference (IKC2016), 09/09/16.
  • 2. L3S Research Center
      - Research areas: Web science, Information Retrieval, Semantic Web, Social Web Analytics, Knowledge Discovery, Human Computation
      - Interdisciplinary application areas: digital humanities, TEL/education, Web archiving, mobility
      - Some projects [figure]
      - See also: http://www.l3s.de
  • 3. Acknowledgements: team
      - Pavlos Fafalios (L3S)
      - Besnik Fetahu (L3S)
      - Ujwal Gadiraju (L3S)
      - Eelco Herder (L3S)
      - Ivana Marenzi (L3S)
      - Ran Yu (L3S)
      - Pracheta Sahoo (L3S, IIT India)
      - Bernardo Pereira Nunes (L3S, PUC Rio de Janeiro)
      - Mathieu d'Aquin (The Open University, UK)
      - Mohamed Ben Ellefi (LIRMM, France)
      - Davide Taibi (CNR, Italy)
      - Konstantin Todorov (LIRMM, France)
      - ...
  • 4. Structured (linked) data on the Web: state of affairs
      - Figure: SPARQL endpoint availability over time [Buil-Aranda et al 2013]
      - Accessibility of datasets?
          - Less than 50% of all SPARQL endpoints are actually responsive at a given point in time [Buil-Aranda2013]
          - "THE" SPARQL protocol? No, but many variants & subsets
      - Semantics, links, quality?
          - …data accuracy (e.g. DBpedia)? [Paulheim2013]
          - …vocabulary reuse? [D'AquinWebSci13]
          - …schema compliance (RDFS, schemas) [HoganJWS2012]
      - References:
          - Assessing the Educational Linked Data Landscape, D'Aquin, M., Adamou, A., Dietze, S., ACM Web Science 2013 (WebSci2013), Paris, France, May 2013.
          - Type Inference on Noisy RDF Data, Paulheim, H., Bizer, C., Semantic Web – ISWC 2013, Lecture Notes in Computer Science Volume 8218, 2013, pp 510-525.
          - An empirical survey of Linked Data conformance, Hogan, A., Umbrich, J., Harth, A., Cyganiak, R., Polleres, A., Decker, S., Journal of Web Semantics 14, 2012.
          - SPARQL Web-Querying Infrastructure: Ready for Action?, Carlos Buil-Aranda, Aidan Hogan, Jürgen Umbrich, Pierre-Yves Vandenbussche, International Semantic Web Conference 2013 (ISWC2013).
  • 5. Data quality and consistency
      - Reference: Analyzing Relative Incompleteness of Movie Descriptions in the Web of Data: A Case Study, Yuan, W., Demidova, E., Dietze, S., Zhu, X., International Semantic Web Conference 2014 (ISWC2014).
  • 6. Challenge for search/retrieval – heterogeneity of datasets & entities
      - Discovery of suitable (1) datasets & (2) entities matching:
          - Quality? Currentness, dynamics, accessibility/reliability, data quantity & quality?
          - Topics/scope? Datasets/entities useful & trustworthy for topic XY?
          - Types? Datasets/entities about statistics, organisations, videos, slides, publications etc.?
  • 7. Overview
      - I – Challenges (callout: dealing with diversity and heterogeneity)
      - II – Enabling discovery & search in Linked Data & Knowledge Graphs
          - Dataset recommendation
          - Dataset profiling
          - Entity retrieval
      - III – Beyond Linked Data – exploiting embedded Web semantics
          - Web markup as emerging data source
          - Case studies
          - Data fusion for entity reconciliation (and retrieval)
      - IV – Wrap-up
  • 8. Overview (same outline as slide 7), with one further callout: "Other emerging forms of structured data on the Web?"
  • 9. Dataset recommendation I
      - Goal: finding candidate datasets, e.g. for entity retrieval or interlinking tasks (e.g. enrichment)
      - Approach: given dataset s, rank datasets from D according to probability score (di, t) to contain linking candidates (entities)
      - Features:
          - Approach 1: vocabulary overlap
          - Approach 2: existing links (SNA)
      - Linking candidates are likely if datasets share common (a) schema elements or (b) links (friend of a friend)
      - Example ranking (figure): 1 DBLP, 2 ACM, 3 OAI, 4 CiteSeer, 5 IBM, 6 Roma, 7 IEEE, 8 Ulm, 9 Pisa
      - Conclusions:
          - Roughly 50% MAP for both approaches
          - Simplistic approach (!)
      - Reference: Lopes, G.R., Paes Leme, L. A., Nunes, B.P., Casanova, M.A., Dietze, S., Two approaches to the dataset interlinking recommendation problem, 15th International Conference on Web Information System Engineering (WISE 2014), Thessaloniki, Greece.
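The vocabulary-overlap idea on this slide can be illustrated with a minimal Python sketch. It is not the probabilistic score from the cited paper: plain Jaccard overlap is swapped in as a simple stand-in, and the function names and toy vocabularies are hypothetical.

```python
def jaccard(a, b):
    """Jaccard overlap of two sets of vocabulary terms."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def rank_candidates(source_vocab, candidates):
    """Rank candidate datasets by vocabulary overlap with the source dataset.

    `candidates` maps a dataset name to the set of schema terms it uses.
    Returns (name, score) pairs, best first; ties are broken by name.
    """
    scored = [(name, jaccard(source_vocab, vocab))
              for name, vocab in candidates.items()]
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))


# Toy data: hypothetical vocabularies, not taken from the paper.
source = {"foaf:Person", "dc:title", "bibo:Article"}
candidates = {
    "DBLP": {"foaf:Person", "dc:title", "bibo:Article", "dc:creator"},
    "ACM": {"dc:title", "bibo:Article"},
    "Geo": {"geo:lat", "geo:long"},
}
ranking = rank_candidates(source, candidates)
for name, score in ranking:
    print(f"{name}: {score:.2f}")
```

Datasets that share more schema terms with the source rank higher, which mirrors the intuition that shared schema elements make linking candidates more likely.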
  • 10. Dataset recommendation II
      - Pipeline stages (figure): preprocessing, datasets filtering, datasets ranking
      - References:
          - Ben Ellefi, M., Bellahsene, Z., Dietze, S., Todorov, K., Intension-based Dataset Recommendation for Data Linking, 13th Extended Semantic Web Conference (ESWC2016), Heraklion, Crete, May 2016.
          - L. Han, A. L. Kashyap, T. Finin, J. Mayfield, and J. Weese, "UMBC EBIQUITY-CORE: Semantic textual similarity systems", in Proc. of the *SEM, Association for Computational Linguistics, 2013.
  • 11. Dataset recommendation II: results
      - Data & ground truth:
          - Experiments on (responsive) datasets from LOD Cloud (http://datahub.io)
          - Concept profiles from http://lov.okfn.org
          - Ground truth: existing links from VoID profiles of datasets (issue: not always representative for actual linksets)
      - Results:
          - MAP for different similarity thresholds from step 2: max. 54%
          - Recall 100% below indicated similarity (clustering) thresholds
      - Reference: Ben Ellefi, M., Bellahsene, Z., Dietze, S., Todorov, K., Intension-based Dataset Recommendation for Data Linking, 13th Extended Semantic Web Conference (ESWC2016), Heraklion, Crete, May 2016.
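Since both recommendation slides report results as MAP (mean average precision), a short Python sketch of the standard metric may help in reading those figures. The example rankings and relevance sets are made up for illustration and are unrelated to the reported experiments.

```python
def average_precision(ranked, relevant):
    """Average precision of one ranked list against a set of relevant items."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for position, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            total += hits / position  # precision at this cut-off
    return total / len(relevant)


def mean_average_precision(runs):
    """Mean of average precision over (ranked, relevant) pairs."""
    if not runs:
        return 0.0
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


# Hypothetical recommendations for two source datasets.
runs = [
    (["a", "b", "c", "d"], {"a", "c"}),  # AP = (1/1 + 2/3) / 2
    (["x", "y"], {"y"}),                 # AP = 1/2
]
map_score = mean_average_precision(runs)
print(round(map_score, 4))
```

Here each run is one source dataset: the ranked list is the recommended datasets and the relevant set is the ground-truth linked datasets.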
  • 12. Dataset search through dataset cataloging & profiling
      - Dataset Catalog/Registry: http://data.linkededucation.org/linkedup/catalog/
      - LinkedUp project (FP7 project: L3S, OU, OKFN, Elsevier, Exact Learning Solutions)
      - LinkedUp Catalog: largest collection of LD of educationally relevant resources (approx. 50 datasets)
      - Original datasets published with key content providers, automatically extracted metadata
  • 13. LinkedUp Catalog: dataset index & registry, federated search (http://data.linkededucation.org/linkedup/catalog/)
      - "Federated queries" through schema mappings [WebSci13]
      - Dataset accessibility
      - Linking & topic profiling
      - Figure: schema/types
  • 14. LinkedUp Catalog: dataset index & registry, federated search (http://data.linkededucation.org/linkedup/catalog/)
      - "Federated queries" through schema mappings [WebSci13]
      - Dataset accessibility
      - Linking & topic profiling [ESWC14]
      - Figure: dataset topic profiles
  • 15. Scalable profiling of datasets
      - Extraction of representative (DBpedia) categories ("topic profile") for arbitrary datasets
      - Technically trivial through established NER/NED approaches, but scalability issues (recall: LOD Cloud 1000+ datasets with <100 billion RDF statements)
      - Efficient approach: sampling & ranking for balance between scalability and precision/recall
      - Figure labels: Dataset Catalog/Registry; Yovisto Video (yov:Video): <yo:Video …> <dc:title>Lecture 29 – Stem Cells</dc:title> … </yo:Video…>; db:Cell (Biology); db:Cell biology; db:Biology
      - Reference: A Scalable Approach for Efficiently Generating Structured Dataset Topic Profiles, Fetahu, B., Dietze, S., Nunes, B. P., Casanova, M. A., Nejdl, W., 11th Extended Semantic Web Conference (ESWC2014), Crete, Greece, (2014).
  • 16. Efficient dataset profiling: 1. Sampling of resources (random sampling, weighted sampling, resource centrality sampling); 2. Entity & topic extraction (NER via DBpedia Spotlight, category mapping & expansion); 3. Normalisation & ranking (graph-based models such as PageRank with Priors, HITS with Priors & K-Step Markov). Result: weighted dataset-topic profile graph. A Scalable Approach for Efficiently Generating Structured Dataset Topic Profiles, Fetahu, B., Dietze, S., Nunes, B. P., Casanova, M. A., Nejdl, W., 11th Extended Semantic Web Conference (ESWC2014), Crete, Greece, (2014).
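Steps 1 and 3 of the pipeline above can be sketched as follows. This is a simplified illustration, not the authors' implementation: the sampling is plain uniform sampling, the ranking is a small prior-biased PageRank in the spirit of "PageRank with Priors", and the graph, the node names and the `beta` parameter are assumptions made for the example. Step 2 (NER via DBpedia Spotlight) is omitted because it needs an external service.

```python
import random
from collections import defaultdict


def sample_resources(resources, ratio, seed=0):
    """Step 1 (simplified): uniform random sample of a dataset's resources."""
    rng = random.Random(seed)
    k = max(1, int(len(resources) * ratio))
    return rng.sample(resources, k)


def pagerank_with_priors(edges, priors, beta=0.15, iterations=50):
    """Step 3 (simplified): PageRank whose restart distribution is `priors`.

    edges  -- iterable of (source, target) pairs
    priors -- dict node -> non-negative prior weight (must not sum to zero)
    beta   -- probability of jumping back to the prior distribution
    """
    total = sum(priors.values())
    if total <= 0:
        raise ValueError("priors must contain at least one positive weight")
    nodes = set(priors)
    out = defaultdict(list)
    for source, target in edges:
        nodes.update((source, target))
        out[source].append(target)
    prior = {n: priors.get(n, 0.0) / total for n in nodes}
    score = dict(prior)
    for _ in range(iterations):
        nxt = {n: beta * prior[n] for n in nodes}
        for n in nodes:
            targets = out[n]
            if targets:
                share = (1 - beta) * score[n] / len(targets)
                for t in targets:
                    nxt[t] += share
            else:
                # Dangling node: hand its mass back to the prior distribution.
                for m in nodes:
                    nxt[m] += (1 - beta) * score[n] * prior[m]
        score = nxt
    return score


# Hypothetical toy graph: one dataset node linked to extracted categories.
resources = [f"resource_{i}" for i in range(20)]
sample = sample_resources(resources, ratio=0.1)
edges = [("dataset", "Biology"), ("dataset", "Cell_biology"),
         ("Cell_biology", "Biology")]
scores = pagerank_with_priors(edges, priors={"dataset": 1.0})
```

Putting all prior weight on the dataset node makes the scores read as "importance of a topic for this dataset", which is the role the weighted dataset-topic profile graph plays on the slide.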
  • 17. Search & exploration of datasets through topic profiles (http://data-observatory.org/lod-profiles/). Applied to the entire LOD cloud/graph; visual exploration of extracted RDF dataset profiles (datasets, topics, relationships). Evaluation results: K-Step Markov (10% sampling size) outperforms the baselines (LDA, tf-idf on entire datasets).
  • 18. Search: entity retrieval on large structured datasets? How to efficiently retrieve (related) entities/resources for a given entity-seeking (keyword) query? State of the art: BM25F on an inverted entity index (Blanco et al, ISWC2011). Challenges/observations: explicit entity links (owl:sameAs etc) are sparse, yet important to facilitate state-of-the-art methods; query type affinity? [Figure: a query for entities related to <Tim Berners Lee> over a large dataset/crawl, e.g. the LinkedUp dataset graph, BTC2014 or the Dynamic LD Observatory (DyLDO)]
  • 19. Entity retrieval: approach. (I) Offline processing (clustering to address link sparsity): 1. Feature vectors (lexical and structural features); 2. Bucketing: per type (LSH algorithm); 3. Clustering: X-means & spectral clustering per bucket. (II) Online processing (retrieval): 1. Retrieval & expansion: a) BM25F results, b) expansion from clusters (related entities); 2. Re-ranking (context terms & query type affinity). Improving Entity Retrieval on Structured Data, Fetahu, B., Gadiraju, U., Dietze, S., 14th International Semantic Web Conference (ISWC2015), Bethlehem, US, (2015).
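The online phase (expansion from clusters, then re-ranking) can be sketched under simplifying assumptions: the baseline result list and the cluster assignment are taken as given, and the re-ranking uses plain token overlap (Jaccard) with the query instead of the context-term and query-type-affinity features of the paper. All entity identifiers, labels and function names below are hypothetical.

```python
import re


def tokens(text):
    """Lower-cased word tokens of a label or query."""
    return set(re.findall(r"\w+", text.lower()))


def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def expand_and_rerank(query, base_results, cluster_of, labels, top_k=10):
    """Add cluster neighbours of the baseline hits, then re-rank by query overlap.

    base_results -- entity ids returned by a baseline retrieval model
    cluster_of   -- dict entity id -> cluster id (from the offline phase)
    labels       -- dict entity id -> textual label
    """
    hit_clusters = {cluster_of[e] for e in base_results if e in cluster_of}
    candidates = list(base_results)
    for entity, cluster in cluster_of.items():
        if cluster in hit_clusters and entity not in candidates:
            candidates.append(entity)
    q = tokens(query)
    ranked = sorted(candidates,
                    key=lambda e: jaccard(q, tokens(labels.get(e, ""))),
                    reverse=True)
    return ranked[:top_k]


# Hypothetical toy index: e2 is a co-reference of e1 without an explicit link.
labels = {"e1": "Tim Berners-Lee", "e2": "Tim Berners Lee inventor",
          "e3": "World Wide Web Consortium", "e4": "Tim Cook"}
cluster_of = {"e1": 0, "e2": 0, "e3": 1, "e4": 2}
results = expand_and_rerank("tim berners lee", ["e1", "e4"], cluster_of, labels)
```

Here the expansion step recovers `e2`, which the baseline missed, and the re-ranking by relevance to the query places it above the weaker baseline hit `e4`.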
  • 20. Entity retrieval: evaluation. Dataset: BTC2014 (4 billion entities), 92 SemSearch queries. Methods: our approaches (XM: X-means, SP: spectral clustering); baselines (B: BM25F, S1: Tonon et al [SIGIR12]). Conclusions: XM & SP outperform the baselines; clustering remedies link sparsity; relevance to the query is more important than relevance to the BM25F results. Improving Entity Retrieval on Structured Data, Fetahu, B., Gadiraju, U., Dietze, S., 14th International Semantic Web Conference (ISWC2015), Bethlehem, US, (2015).
  • 21. Overview. I – Challenges. II – Enabling discovery & search in Linked Data & Knowledge Graphs: dataset recommendation; dataset profiling; entity retrieval. III – Beyond Linked Data – exploiting embedded Web semantics: Web markup as emerging data source; case studies; data fusion for entity reconciliation (and retrieval). III Wrap-up. [Callouts: Dealing with diversity and dynamics; Other emerging forms of structured data on the Web?]
  • 22. Semantics (structured data) on the Web? Linked Data: approx. 1000 datasets & 100 billion statements; a different order of magnitude wrt scale & dynamics vs the Web: approx. 46.000.000.000.000 (46 trillion) Web pages indexed by Google. Other “semantics” (structured facts) on the Web?
  • 23. Embedded semantics: Web page markup & schema.org. Embedded markup (RDFa, Microdata, Microformats) for interpretation of Web documents (search, retrieval). Arbitrary vocabularies; schema.org used at scale (700 classes, 1000 predicates). Adoption on the Web: 26% (2014 Google study of 12 bn Web pages). “Web Data Commons” (Meusel & Paulheim [ISWC2014]): markup from Common Crawl (2.2 billion pages): 17 billion RDF quads; markup in 26% of pages and 14% of PLDs in 2013 (increase from 6% in 2011). Same order of magnitude as “the Web”. Example markup: <div itemscope itemtype="http://schema.org/Movie"> <h1 itemprop="name">Forrest Gump</h1> <span>Actor: <span itemprop="actor">Tom Hanks</span> <span itemprop="genre">Drama</span> ... </div> RDF statements: node1 actor _node-x; node1 actor Robin Wright; node1 genre Comedy; node2 actor T. Hanks; node2 distributed by Paramount Pic.; node3 actor Tom Cruise; node3 distributed by Paramount Pic.
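To make the step from page markup to statements concrete, here is a minimal sketch that pulls flat `itemprop` values out of a Microdata snippet using only Python's standard `html.parser`. It is an illustration under strong assumptions: it handles a single, non-nested item, reads only text content (not `content`, `href` or `datetime` attributes), and the class name `MicrodataExtractor` is made up for this example. The snippet is a well-formed variant of the one on the slide.

```python
from html.parser import HTMLParser


class MicrodataExtractor(HTMLParser):
    """Collects (property, text) pairs from a flat Microdata item."""

    def __init__(self):
        super().__init__()
        self.item_type = None
        self.statements = []
        self._stack = []  # one (tag, itemprop or None, text buffer) per open tag

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "itemscope" in attrs and self.item_type is None:
            self.item_type = attrs.get("itemtype")
        self._stack.append((tag, attrs.get("itemprop"), []))

    def handle_data(self, data):
        # Text belongs to the innermost open element that carries an itemprop.
        for _tag, prop, buffer in reversed(self._stack):
            if prop is not None:
                buffer.append(data)
                break

    def handle_endtag(self, tag):
        # Pop until the matching open tag, emitting any finished properties.
        while self._stack:
            open_tag, prop, buffer = self._stack.pop()
            if prop is not None:
                self.statements.append((prop, "".join(buffer).strip()))
            if open_tag == tag:
                break


html = """
<div itemscope itemtype="http://schema.org/Movie">
  <h1 itemprop="name">Forrest Gump</h1>
  <span>Actor: <span itemprop="actor">Tom Hanks</span></span>
  <span itemprop="genre">Drama</span>
</div>
"""
extractor = MicrodataExtractor()
extractor.feed(html)
extractor.close()
```

Each extracted pair corresponds to one statement about an anonymous node, which is why markup crawled at scale yields many unlinked, partly redundant descriptions of the same real-world entity.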
  • 24. Example: embedded Web markup data about “products”. schema:Product instances in Web Data Commons. Facts: 1.414.937.431 (= 302.246.120 instances, i.e. products). Providers (distinct Pay Level Domains, PLDs): 93.705. Power-law distribution of terms across PLDs. Top 10 PLDs (PLD: # resources): www.crateandbarrel.com: 33.517.936; www.bentgate.com: 17.215.499; www.aliexpress.com: 9.621.943; www.ebay.com.au: 8.861.308; us.fotolia.com: 7.939.982; www.ebay.co.uk: 6.556.820; www.competitivecyclist.com: 6.214.500; www.maxstudio.com: 6.075.626. Top provider? (company) [Callout: approx. 35 million resources]
  • 25. Example: Web markup of bibliographic resources. Study on a sample Web crawl (WDC). Metadata about scholarly articles (e.g. s:ScholarlyArticle): 6.793.764 quads, 1.184.623 entities, 429 distinct predicates (in WDC and for 1 type alone). Top 5 domains: Springer, MDPI, BMJ, diabetesjournals.org, mendeley.com, Biodiversitylibrary.org. Domains, topics, disciplines? Life Sciences and Computer Science predominant; top-10 article titles; most important publishers/journals and libraries represented. [Figure: # entities and # statements per PLD; x-axis: PLD (ranked), y-axis: count (log)] Sahoo, P., Gadiraju, U., Yu, R., Saha, S., Dietze, S., Analysing Structured Scholarly Data embedded in Web Pages, Semantics, Analytics, Visualisation: Enhancing Scholarly Data (SAVE-SD2016), co-located with the 25th International World Wide Web Conference, Montreal, Canada, April 11, 2016.
  • 26. Example: entity markup of learning resources on the Web. “Learning Resource Metadata Initiative (LRMI)”: schema.org vocabulary for annotation of learning resources (informal, formal, etc). Approx. 5000 PLDs in “Common Crawl”. LRMI adoption on the Web (WDC) [LILE16]: 2014: 30.599.024 quads, 4.182.541 resources; 2013: 10.636.873 quads, 1.461.093 resources. [Figure: power-law distribution across providers; 4805 providers/PLDs] Taibi, D., Dietze, S., Towards embedded markup of learning resources on the Web: a quantitative Analysis of LRMI Terms Usage, in Companion Publication of the IW3C2 WWW 2016 Conference, IW3C2 2016, Montreal, Canada, April 11, 2016.
  • 27. Entity retrieval on Web markup: state of the art. Glimmer (http://glimmer.research.yahoo.com): entity retrieval on the WDC dataset [Blanco, Mika & Vigna, ISWC2011]; BM25F retrieval model on a WDC index.
  • 28. Entity retrieval on Web markup: challenges (characteristic: example). Coreferences: 18.000 results for <“Iphone 6”, type, s:Product> (8,6 quads on average) in Common Crawl (CC). Redundancy: <s, schema:name, “Iphone 6”> occurring 1000 times in CC. Lack of links: largely unlinked entity descriptions. Errors (typos & schema violations, see Meusel et al [ESWC2015]): wrong namespaces, such as http://schma.org; undefined types & predicates: 9,7%, less common than in LOD; confusion of datatype and object properties, e.g. <s1, s:publisher, “Springer”>: 24,35% object property issues vs 8% in LOD; data property range violations, e.g. literals vs numbers: 12,6% vs 4,6% in LOD. Using markup as a (highly distributed) knowledge graph? A Survey on Challenges for Entity Retrieval in Markup Data, Yu, R., Gadiraju, U., Fetahu, B., Dietze, S., 15th International Semantic Web Conference (ISWC2016), Kobe, Japan (2016).
  • 29. Entity retrieval & reconciliation on markup. Obtaining a consolidated entity description/facts (or graph) for a given resource/entity from Web markup? Aiding tasks such as document annotation, augmentation or semantic enrichment of existing data- or knowledge bases. [Figure: the query “iPhone 6, type:(Product)” is run against a Web crawl (e.g. Common Crawl/WDC, focused crawl) containing markup such as <e1, s:name, “Iphone 6”>, <e2, s:brand, “Apple Inc.”>, <e3, s:brand, “Apple”>, <e4, s:weight, 127>, <e5, s:releaseDate, “1.12.1972”>; the result is an entity description: brand: Apple Inc.; weight: 129; date: 30.09.2015; manufacturer: Foxconn; storage: 16 GB] Yu, R., Gadiraju, U., Zhu, X., Fetahu, B., Dietze, S., Entity summarisation on structured web markup. In The Semantic Web: ESWC 2016 Satellite Events. Springer, 2016. Yu, R., Gadiraju, U., Zhu, X., Fetahu, B., Dietze, S., Fact Selection for data fusion on structured web markup. ICDE2017, IEEE International Conference on Data Engineering, in progress.
  • 30. A supervised approach for data fusion on markup. 1. Retrieval (fact/entity retrieval): BM25 entity retrieval model on a markup index (Common Crawl). 2. Fact selection/data fusion: ML classifier (SVM), using 3 feature categories (relevance, authority, clustering). Experiments on Common Crawl: products, movies, books (approx. 3 billion facts). [Figure: the query “iPhone 6, type:(Product)” retrieves Web page markup from the Web (crawl), approx. 125.000 facts for “iPhone6”; candidate facts: node1 brand _node-x; node1 brand Apple Inc.; node1 weight 129; node2 weight 172; node2 manufacturer Foxconn; node3 releasedate 01.12.1972; node3 manufacturer Foxconn; fact selection (supervised SVM classifier) yields the entity description: brand: Apple Inc.; weight: 129; date: 30.09.2015; manufacturer: Foxconn; storage: 16 GB; new queries: Foxconn, type:(Organization); Cupertino, type:(City); Apple Inc., type:(Organization)]
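The fact selection step can be illustrated with a deliberately simple stand-in. Instead of the trained SVM from the slide, the sketch below scores each candidate value with a hand-weighted sum of two features, support (how many sources state it) and authority (a per-source score), and keeps the best value per predicate. The weights, the source names and the authority scores are assumptions for illustration only; the actual approach learns the decision from labeled data over relevance, authority and clustering features.

```python
from collections import defaultdict


def fuse(candidates, authority, w_support=1.0, w_authority=1.0):
    """Pick one value per predicate from conflicting candidate facts.

    candidates -- iterable of (source, predicate, value) triples
    authority  -- dict source -> score in [0, 1] (hypothetical feature)
    The linear score is a hand-set stand-in for a trained classifier.
    """
    support = defaultdict(set)  # (predicate, value) -> sources stating it
    for source, predicate, value in candidates:
        support[(predicate, value)].add(source)

    best = {}
    for (predicate, value), sources in support.items():
        score = (w_support * len(sources)
                 + w_authority * max(authority.get(s, 0.0) for s in sources))
        if predicate not in best or score > best[predicate][1]:
            best[predicate] = (value, score)
    return {predicate: value for predicate, (value, _score) in best.items()}


# Hypothetical candidate facts for the query "iPhone 6, type:(Product)".
candidates = [
    ("shop-a.example", "brand", "Apple Inc."),
    ("shop-b.example", "brand", "Apple Inc."),
    ("shop-c.example", "brand", "Apple"),
    ("shop-a.example", "weight", "129"),
    ("shop-c.example", "weight", "172"),
    ("shop-b.example", "manufacturer", "Foxconn"),
    ("shop-c.example", "manufacturer", "Foxconn"),
]
authority = {"shop-a.example": 0.9, "shop-b.example": 0.6, "shop-c.example": 0.2}
description = fuse(candidates, authority)
```

Keeping exactly one value per predicate is itself a simplification: multi-valued properties such as actors would need a threshold on the score rather than an argmax.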
  • 31. Evaluation & results (1/2). Evaluation setup: comparison with baselines (BM25: top-k distinct facts via BM25; CBFS: clustering/heuristics-based approach); expert-labeled ground truth. Results: the supervised learning approach (SumSVM, SumDIV) outperforms the baselines; strong variance of results across query sets (for the baselines, not our approach); strongest performance when considering all feature sets. [Figure: precision results]
  • 32. Evaluation & results (2/2): markup for KB augmentation? Comparison of obtained facts with existing knowledge bases (DBpedia): “existing”: fact already in DBpedia; “new”: fact not existing in DBpedia (eg a book's releaseDate in Wiki/DBpedia); “new-p”: property not existing in DBpedia (eg a book's release countries). 60-70% new facts for books & movies; 100% new facts for queried products (apparently not existing in DBpedia). Vast potential for KB augmentation (!)
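The three categories can be expressed as a small classification routine. The sketch assumes one reading of the slide: "new-p" means the predicate is unknown to the knowledge base schema altogether, while "new" means the predicate is known but this particular fact is absent. The toy KB content and all names are hypothetical, and exact string matching stands in for the value matching a real comparison with DBpedia would need.

```python
def classify_fact(fact, kb_facts, kb_predicates):
    """Label a fused fact relative to a knowledge base.

    fact          -- (entity, predicate, value) triple
    kb_facts      -- set of triples already in the KB
    kb_predicates -- set of predicates the KB schema knows
    """
    _entity, predicate, _value = fact
    if predicate not in kb_predicates:
        return "new-p"
    if fact in kb_facts:
        return "existing"
    return "new"


def share_of_new(facts, kb_facts, kb_predicates):
    """Fraction of facts that would add information to the KB."""
    if not facts:
        return 0.0
    labels = [classify_fact(f, kb_facts, kb_predicates) for f in facts]
    return sum(label != "existing" for label in labels) / len(labels)


# Hypothetical toy knowledge base and fused facts about one book.
kb_predicates = {"author", "releaseDate"}
kb_facts = {("book1", "author", "Jane Doe")}
fused = [
    ("book1", "author", "Jane Doe"),
    ("book1", "releaseDate", "2001"),
    ("book1", "releaseCountry", "Germany"),
]
labels = [classify_fact(f, kb_facts, kb_predicates) for f in fused]
new_share = share_of_new(fused, kb_facts, kb_predicates)
```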
  • 33. Conclusions & outlook. Retrieval/search of Linked Data hindered by heterogeneity, quality, dynamics etc. Dealing with diversity & heterogeneity: profiling & recommendation (dataset search & recommendation); entity retrieval & clustering (entity search). [Figure: Linked Data & knowledge graphs]
  • 34. Conclusions & outlook. Retrieval/search of Linked Data hindered by heterogeneity, quality, dynamics etc. Dealing with diversity & heterogeneity: profiling & recommendation (dataset search & recommendation); entity retrieval & clustering (entity search). New forms of (structured) Web data: Web markup (schema.org et al): convergence of structured and unstructured Web; scale and dynamics (!); potential to augment existing knowledge graphs; potential training data for NED, entity interlinking and similar entity-centric problems. [Figure: embedded data/markup, unstructured (Web) data/docs, Linked Data & knowledge graphs; example entities: node1 (name: Molecular structure of nucleic acids; author: James D. Watson; publisher: Nature; datePublished: 1956; datePublished: 1953) and node2 (name: Francis Crick; name: Cricks; born: 1916)]
  • 35. Thank you! http://stefandietze.net @stefandietze [Figure: embedded data/markup, unstructured (Web) data/docs, Linked Data & knowledge graphs; example entities: node1 (name: Molecular structure of nucleic acids; author: James D. Watson; publisher: Nature; datePublished: 1956; datePublished: 1953) and node2 (name: Francis Crick; name: Cricks; born: 1916)]