Mining Historical Data for DBpedia via Temporal Tagging of Wikipedia Infoboxes

Norman Weisenburger, Volha Bryl, Simone Paolo Ponzetto
Data and Web Science Research Group
University of Mannheim
Germany
NLP & DBpedia @ ISWC, Riva del Garda, Italy, October 20, 2014

Outline
1. State of art: Temporally annotated data in DBpedia and LOD
2. Temporally annotated data extraction pipeline
3. Company Dataset
• Statistics
• Comparison with other KBs
4. Ongoing and future work

Why we need historical LOD
• “Historical data” == any data that is or can be temporally annotated
• population of a city, revenue of a company, current club for a football player
• Why we need such data
• Allows having a more precise description of an entity
• Enables LOD-based data mining for trend prediction
• Availability of temporally annotated data on the Web of Data
• Poor and scarce
• Examples can be found in Freebase, Wikidata, YAGO, …
• Temporally annotated facts or – not so frequently – time series
• Some exceptionally good examples follow…

Temporally annotated data: Examples
Apple Inc. in Wikidata
http://www.wikidata.org/wiki/Q312

Temporally annotated data: Examples
Apple Inc. in Freebase
http://www.freebase.com/m/0k8z

Temporally annotated data in DBpedia
• DBpedia's main source of knowledge is Wikipedia infoboxes
• Temporal (time-dependent) infobox attributes
1. Not at all temporally annotated
2. Temporally annotated, annotation is modeled as a separate attribute
3. Temporally annotated, annotation is a part of an attribute value
• Often only the latest value is present
• When new value is available, the old one is overwritten

Our focus: case 3, temporal annotation is a part of an attribute value

2. Temporally annotated, annotation is modeled as a separate attribute
• Often lost during DBpedia data extraction
• E.g. no connection between populationTotal and populationAsOf properties
• Ends up in DBpedia only if an intermediate node mapping is defined in the mapping wiki

3. Temporally annotated, annotation is a part of an attribute value
• Annotation is lost during extraction
• In most cases value is regularly overwritten

Idea: go back in time
• Properties of interest
• Temporally annotated, annotation is a part of an attribute value
• Use case: Business and Financial Data (Companies)
• Key observations
• Attribute values are often temporally annotated
• If the annotation is part of the attribute value, the DBpedia extraction framework ignores it
• Attribute values are regularly overwritten by Wikipedia editors, but the trace remains in Wikipedia revision history
• DBpedia data extraction process is run on one (e.g. the latest) dump only
• Proposed solution
• Run extraction on (part of) revision history
• Add a temporal tagger to the process

Extraction pipeline
1. Select and download Wikipedia revisions
2. Extract temporal facts
3. Merge facts
• Code available at https://github.com/normalerweise/mte

Extraction pipeline
1. Select and download Wikipedia revisions
• Select 4 revisions per year (1st, 2nd, 3rd quartile and the last revision)
• Use MediaWiki API to download the revisions
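
The selection step above can be sketched roughly as follows. This is a minimal illustration, not the authors' code: it assumes that "quartile" refers to positions in the chronologically sorted list of a year's revisions, and the helper names `select_revisions` and `revision_query_params` are hypothetical. The parameter dictionary uses standard MediaWiki `action=query`, `prop=revisions` options; no network call is made here.

```python
from collections import defaultdict
from datetime import datetime


def select_revisions(revisions):
    """Pick up to 4 revisions per year: 1st, 2nd, 3rd quartile and the last.

    `revisions` is a list of (revision_id, iso_timestamp) pairs.
    Returns {year: [revision_id, ...]} in chronological order.
    """
    by_year = defaultdict(list)
    for rev_id, ts in revisions:
        year = datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ").year
        by_year[year].append((ts, rev_id))

    selected = {}
    for year, revs in by_year.items():
        revs.sort()  # ISO 8601 timestamps sort chronologically as strings
        n = len(revs)
        # Assumption: quartiles are taken over the revision count of the
        # year, not over calendar time. A set removes duplicate positions
        # when a year has fewer than four revisions.
        positions = {n // 4, n // 2, (3 * n) // 4, n - 1}
        selected[year] = [revs[i][1] for i in sorted(positions)]
    return selected


def revision_query_params(title, year):
    """Query parameters for listing one year's revisions of a page."""
    return {
        "action": "query",
        "prop": "revisions",
        "titles": title,
        "rvprop": "ids|timestamp",
        "rvlimit": "max",
        "rvdir": "newer",  # oldest first, so rvstart is the earlier bound
        "rvstart": f"{year}-01-01T00:00:00Z",
        "rvend": f"{year}-12-31T23:59:59Z",
        "format": "json",
    }
```

Years with fewer than four revisions simply yield fewer selected revisions.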

Extraction pipeline
2. Extract temporal facts
• Parse each infobox attribute twice
• For a value: Mapping Extractor of the DBpedia Extraction Framework
• For time validity (point or interval): HeidelTime
• HeidelTime is a multilingual cross-domain rule-based temporal tagger
• Developed at the University of Heidelberg
• http://dbs.ifi.uni-heidelberg.de/index.php?id=129

Extraction pipeline
{{ Infobox company
| name = Netflix, Inc.
| revenue = US$4.37 billion (''FY 2013'')
...
<Netflix, revenue, 4.37E9, usDollar, 2013, 610604061>
• The last element of the tuple, 610604061, is the Revision ID
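
The "parse twice" idea can be illustrated with a toy sketch. The two regular expressions below are only stand-ins, introduced here as assumptions: the actual pipeline uses the Mapping Extractor of the DBpedia Extraction Framework for the value and HeidelTime for the time validity, and neither is called in this example. All function names are hypothetical.

```python
import re

# Toy stand-ins (assumptions): the real pipeline relies on the DBpedia
# Mapping Extractor for values and on HeidelTime for temporal expressions.
SCALES = {"thousand": 1e3, "million": 1e6, "billion": 1e9}
VALUE_RE = re.compile(
    r"US\$\s*(\d+(?:,\d{3})*(?:\.\d+)?)\s*(thousand|million|billion)?", re.I
)
YEAR_RE = re.compile(r"\b(?:FY\s*)?((?:19|20)\d{2})\b")


def parse_value(raw):
    """First pass: numeric value and unit, or None if nothing matches."""
    m = VALUE_RE.search(raw)
    if not m:
        return None
    number = float(m.group(1).replace(",", ""))
    scale = SCALES.get((m.group(2) or "").lower(), 1)
    return number * scale, "usDollar"


def parse_year(raw):
    """Second pass: a year mentioned in the same attribute value."""
    m = YEAR_RE.search(raw)
    return int(m.group(1)) if m else None


def extract_fact(subject, prop, raw, revision_id):
    """Combine both passes into one temporally annotated tuple."""
    value = parse_value(raw)
    year = parse_year(raw)
    if value is None or year is None:
        return None  # keep only facts that carry a temporal annotation
    return (subject, prop, value[0], value[1], year, revision_id)
```

Both passes read the same raw attribute string, which is why a fact is kept only when each of them finds something.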

Extraction pipeline
3. Merge facts
• Group triples by subject, property, temporal validity, value
• In case of value conflicts, select the most frequent value
• In case of ties, select the most recent value
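
A minimal sketch of the merge rule described above, under two assumptions that are not stated on the slides: facts are 6-tuples shaped like the examples in this deck, and a larger revision ID means a more recent revision. Grouping by (subject, property, validity) and counting values inside each group is equivalent to grouping by all four fields; `merge_facts` is a hypothetical name.

```python
from collections import Counter, defaultdict


def merge_facts(facts):
    """Merge facts given as (subject, property, value, unit, validity, revision_id)."""
    groups = defaultdict(list)
    for subject, prop, value, unit, validity, rev_id in facts:
        groups[(subject, prop, validity)].append((value, unit, rev_id))

    merged = []
    for (subject, prop, validity), candidates in groups.items():
        # How often each (value, unit) was seen across the selected revisions
        counts = Counter((value, unit) for value, unit, _ in candidates)
        # Newest revision in which each (value, unit) was seen
        # (assumption: revision IDs grow over time)
        latest = {}
        for value, unit, rev_id in candidates:
            key = (value, unit)
            latest[key] = max(rev_id, latest.get(key, rev_id))
        # Most frequent value wins; ties go to the most recently seen value
        best = max(counts, key=lambda k: (counts[k], latest[k]))
        merged.append((subject, prop, best[0], best[1], validity, latest[best]))
    return merged
```

The result holds one fact per subject, property and temporal validity, together with the newest revision in which the chosen value appeared.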

Extraction pipeline
<Netflix, operatingIncome, 1.28E8, usDollar, 2008, 352001234>

Data model
• Our choice for RDF representation
• Singleton property approach
• Vinh Nguyen, Olivier Bodenreider, and Amit Sheth. Don’t like RDF reification? Making statements about statements using singleton property, WWW 2014
• Motivation: performance in terms of #triples, query size and execution time
• Main idea: unique URI for each predicate instance
<Netflix, revenue#uniqueId, 4.37E9>
<revenue#uniqueId, singletonPropertyOf, revenue>
<revenue#uniqueId, date, 2013>
<revenue#uniqueId, sourceRevision, 610604061>
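
To make the pattern concrete, the sketch below generates the four triples for one fact and reads the values back by date. It uses plain Python tuples instead of an RDF library; the helper names and the way the unique ID is passed in are assumptions for illustration only.

```python
def singleton_triples(subject, prop, value, date, revision_id, unique_id):
    """Build the singleton-property triples for one temporally annotated fact."""
    sp = f"{prop}#{unique_id}"  # unique predicate instance for this fact
    return [
        (subject, sp, value),
        (sp, "singletonPropertyOf", prop),
        (sp, "date", date),
        (sp, "sourceRevision", revision_id),
    ]


def values_by_date(triples, subject, prop):
    """Return {date: value} for one subject and one generic property."""
    # Predicate instances that are singleton properties of `prop`
    instances = {s for s, p, o in triples if p == "singletonPropertyOf" and o == prop}
    # Date attached to each of those predicate instances
    dates = {s: o for s, p, o in triples if p == "date" and s in instances}
    # Facts of `subject` that use one of the predicate instances
    return {dates[p]: o for s, p, o in triples if s == subject and p in dates}
```

Each fact costs four triples in this layout: one for the statement itself and three for its metadata.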

Company dataset
• Dataset available at http://tiny.cc/tmpcompany
• Started from DBpedia resources of type dbpedia-owl:Company and yago:Company108058098
• 51,214 companies; for 18,489 of them at least one fact is extracted for one of the following properties
• assets
• equity
• netIncome
• numberOfEmployees
• operatingIncome
• revenue

Company dataset vs other KBs
• 10 random companies with well-maintained infoboxes
• Manually mapped ontology properties
• YAGO2
• 0 triples for these companies for hasNumberOfPeople and hasRevenue
• Freebase
• 201 vs 58 triples (chart legend: Our dataset, Freebase)

Evaluation
• Evaluating the precision
• (preliminary, not in the paper)
• 100 random tuples, 2 properties, so far only one annotator
• Precision: 75% for numberOfEmployees and 78% for revenue
• Errors are caused by parsing: the DBpedia extraction framework is always tuned to work with the latest Wikipedia version
• After fixing some errors: 97% for numberOfEmployees and 92% for revenue

Ongoing and future work
• Ongoing: extracting missing attributes from Wikipedia article texts
• Company dataset is used for distant supervision
• Anticipating some questions
• Yes, we tried the approach for another domain: American football
• Yes, making the data available through an endpoint is on our todo list
