DBpedia is the Linked Data version of Wikipedia. Starting in 2007, several DBpedia dumps have been made available for download. In 2010, the Research Library at the Los Alamos National Laboratory used these dumps to deploy a Memento-compliant DBpedia Archive, in order to demonstrate the applicability and appeal of accessing temporal versions of Linked Data sets using the Memento “Time Travel for the Web” protocol. The archive supported datetime negotiation to access various temporal versions of RDF descriptions of DBpedia subject URIs.
In a recent collaboration with the iMinds Group of Ghent University, the DBpedia Archive received a major overhaul. The initial MongoDB storage approach, which could not handle the increasingly large DBpedia dumps, was replaced by HDT, the Binary RDF Representation for Publication and Exchange. In addition to the existing subject URI access point, Triple Pattern Fragments access, as proposed by the Linked Data Fragments project, was added. This allows datetime negotiation for URIs that identify RDF triples that match subject/predicate/object patterns. To add this capability, native Memento support was added to the Linked Data Fragments Server of Ghent University.
In this talk, we will include a brief refresher of Memento, and will cover Linked Data Fragments, Triple Pattern Fragments, and HDT in more detail. We will share lessons learned from this effort and demo the new DBpedia Archive, which, at this point, holds over 5 billion RDF triples.
DBpedia Archive using Memento, Triple Pattern Fragments, and HDT
1. Herbert Van de Sompel & Miel Vander Sande
CNI Spring Meeting, San Antonio, TX, April 5 2016
Access to DBpedia Versions using Memento and Triple Pattern Fragments
Herbert Van de Sompel, @hvdsomp, Los Alamos National Laboratory
Miel Vander Sande, @Miel_vds, Ghent University
Acknowledgments: Lyudmila Balakireva, Harihar Shankar, Ruben Verborgh
Outline
• Prelude: Memento and Linked Data
• First Generation DBpedia Archive
• Devising Affordable/Useful Linked Data Archives
• Intermezzo: Triple Pattern Fragments (TPF)
• Intermezzo: Binary RDF Representation (HDT)
• Devising Affordable/Useful Linked Data Archives
• Second Generation DBpedia Archive
• Try this At Home
Memento Framework
Memento LDOW 2010 Submission
Herbert Van de Sompel et al. (2010) An HTTP-Based Versioning Mechanism for Linked Data
http://arxiv.org/abs/1003.3661
Memento and Linked Data
Time-Series Analysis across DBpedia Versions
Data collected through “follow your nose” HTTP Navigation
First Generation DBpedia Archive: Storage
Characteristics
• upload software: custom
• upload time: ~ 24 hours per version
• storage software: MongoDB
• storage space: 383 GB for 10 versions
• DBpedia versions: 10 versions, 2.0 through 3.9
• number of triples: ~ 3 billion
First Generation DBpedia Archive: Subject-URI Access
http://dbpedia.mementodepot.org/memento/2009052/http://dbpedia.org/page/Oaxaca
Characteristics
• TimeGate software: custom
• access type: Subject URI & datetime
• external integration: current DBpedia
clients
• all clients: direct access to Memento Subject-URI
• Memento clients: datetime negotiation with Subject-URI
DBpedia Archive @ LANL Since 2010
• Access based on Subject-URI (DBpedia Topic URI) only
• MongoDB storage
• A blob per Subject-URI per version
• Dynamically transformed to other RDF serializations
• No updates since version 3.9 (2013) of DBpedia as a result of scalability problems
Affordable & Useful Linked Data Archives
• A Linked Data Archive consists of temporal snapshots of one or more Linked Data sets, whereby each temporal snapshot reflects the state of a Linked Data set at a specific moment or interval in time.
• How to make Linked Data Archives accessible in a manner that is affordable/sustainable for the publisher and useful for the consumer?
Linked Data Archive: Characteristics
General Characteristics (assessed for Publisher and Consumer)
• Availability
• Bandwidth
• Cost
Functionality
• Interface Expressiveness
• LOD Integration
• Memento Support
• Cross Time/Data
Verdict:
• Publication perspective: $$$$
• Access perspective: ++++
Linked Data Publishing
• The typical ways of publishing Linked Data on the Web:
• Subject URI access
• Data dump
• SPARQL endpoint
Let’s consider these from the perspective of Linked Data Archives, i.e. archival storage and access
Linked Data Archive with Subject-URI Access
• For each temporal snapshot of a Linked Data set, and for each Subject in that snapshot, publish an RDF description (of the Subject) at a URI that is specific per snapshot/subject
Linked Data Archive with Subject-URI Access: Characteristics
General Characteristics (Publisher / Consumer)
• Availability: rather high / rather high
• Bandwidth: ~ description / ~ description
• Cost: rather low / rather high
Functionality
• Interface Expressiveness: rather low
• LOD Integration: yes
• Memento Support: possible
• Cross Time/Data: follow your nose
Verdict:
• Publication perspective: $$$$
• Access perspective: ++++
Linked Data Archive Using Dumps
• Renders each temporal snapshot of a Linked Data set as a data dump that places all temporal dataset triples (as they were at a specific moment in time) into one or more files
Linked Data Archive Using Dumps: Characteristics
General Characteristics (Publisher / Consumer)
• Availability: high / high
• Bandwidth: high / high
• Cost: low / high
Functionality
• Interface Expressiveness: download dataset
• LOD Integration: no
• Memento Support: not possible
• Cross Time/Data: download various datasets
Verdict:
• Publication perspective: $$$$
• Access perspective: ++++
Linked Data Archive with SPARQL Endpoint(s)
• For each temporal snapshot of a Linked Data set, supports arbitrary SPARQL queries.
• Different architectural set-ups possible; no standard approach
Linked Data Archive Using SPARQL Endpoint(s): Characteristics
General Characteristics (Publisher / Consumer)
• Availability: problematic / problematic
• Bandwidth: ~ query / ~ query
• Cost: high / low
Functionality
• Interface Expressiveness: highly expressive
• LOD Integration: no
• Memento Support: hard
• Cross Time/Data: custom distributed queries
Verdict:
• Publication perspective: $$$$
• Access perspective: ++++
Affordable & Useful Linked Data Archives
Linked Data Archive Type (Publishing / Consuming)
• Data Dump: $$$$ / ++++
• SPARQL Endpoint(s): $$$$ / ++++
• Subject URI Access: $$$$ / ++++
Linked Data Fragments (Ghent U)
• Every Linked Data interface offers specific fragments of a Linked Data set
• A fragment is described by
• Selector: what questions can I ask?
• Controls: how do I get more fragments?
• Metadata: helpful information for consumption?
• Each interface type comes with tradeoffs
• cf. the analysis thus far
http://linkeddatafragments.org
Verborgh, R. et al. (2014) Querying Datasets on the Web with High Availability. ISWC 2014
http://ruben.verborgh.org/publications/verborgh_iswc_2014/
Triple Pattern Fragments (Ghent U)
• Triple Pattern Fragments is a new interface with a different set of tradeoffs that are attractive from an archival perspective
http://www.hydra-cg.com/spec/latest/triple-pattern-fragments/
• Allows querying a Linked Data set according to
?Subject ?Predicate ?Object
patterns
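As a rough illustration of what a ?Subject ?Predicate ?Object pattern selects, the following minimal sketch matches patterns against a tiny in-memory list of triples. The sample triples and the `match_pattern` helper are hypothetical and are not part of any Triple Pattern Fragments software; a real server answers the same kind of question over HTTP.

```python
from typing import List, Optional, Tuple

Triple = Tuple[str, str, str]

# Hypothetical sample data, written with prefixes for readability.
TRIPLES: List[Triple] = [
    ("dbr:Person_A", "dbo:birthPlace", "dbr:Ghent"),
    ("dbr:Person_B", "dbo:birthPlace", "dbr:Ghent"),
    ("dbr:Person_B", "dbo:deathPlace", "dbr:Brussels"),
]


def match_pattern(
    triples: List[Triple],
    subject: Optional[str] = None,
    predicate: Optional[str] = None,
    obj: Optional[str] = None,
) -> List[Triple]:
    """Return the triples that match the pattern; None acts as a wildcard."""
    pattern = (subject, predicate, obj)
    return [
        triple
        for triple in triples
        if all(want is None or want == got for want, got in zip(pattern, triple))
    ]


# "Who was born in Ghent?" corresponds to the pattern (?s, birthPlace, Ghent).
print(match_pattern(TRIPLES, predicate="dbo:birthPlace", obj="dbr:Ghent"))
```

Leaving all three positions unbound returns the whole data set, which is why such an interface can also stand in for a dump.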
Controls: Responses provide navigational help for clients
• Based on the emerging Hydra vocabulary for self-describing Hypermedia-Driven Web APIs
Metadata: dataset info, estimated count (to aid client applications)
http://www.hydra-cg.com/spec/latest/core/
Binary RDF Representation for Publication and Exchange (HDT)
http://www.w3.org/Submission/HDT/
• Header-Dictionary-Triples (HDT) is a compact, binary representation of RDF datasets.
• Able to represent massive data sets
• Dictionary/Triples structure achieves
• rapid search for ?subject ?predicate ?object pattern
• high compression rates
• Header provides metadata about the dataset
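The Dictionary/Triples idea can be sketched in a few lines: every RDF term is stored once in a dictionary and triples become small tuples of integer IDs. This is a toy illustration only; the actual HDT format uses compressed dictionary sections and bitmap-encoded triples, and the `encode` and `search` helpers below are hypothetical names, not the HDT API.

```python
from typing import Dict, List, Optional, Tuple

Triple = Tuple[str, str, str]
IdTriple = Tuple[int, ...]


def encode(triples: List[Triple]) -> Tuple[Dict[str, int], List[IdTriple]]:
    """Store each term once in a dictionary and keep triples as sorted ID tuples."""
    term_to_id: Dict[str, int] = {}
    encoded: List[IdTriple] = []
    for triple in triples:
        ids = []
        for term in triple:
            if term not in term_to_id:
                term_to_id[term] = len(term_to_id) + 1
            ids.append(term_to_id[term])
        encoded.append(tuple(ids))
    return term_to_id, sorted(encoded)


def search(
    term_to_id: Dict[str, int],
    encoded: List[IdTriple],
    subject: Optional[str] = None,
    predicate: Optional[str] = None,
    obj: Optional[str] = None,
) -> List[Triple]:
    """Answer an ?s ?p ?o pattern by comparing integer IDs instead of strings."""
    wanted: List[Optional[int]] = []
    for term in (subject, predicate, obj):
        if term is None:
            wanted.append(None)
        elif term in term_to_id:
            wanted.append(term_to_id[term])
        else:
            return []  # a term that is not in the dictionary cannot match
    id_to_term = {identifier: term for term, identifier in term_to_id.items()}
    return [
        tuple(id_to_term[i] for i in ids)
        for ids in encoded
        if all(want is None or want == got for want, got in zip(wanted, ids))
    ]


# Hypothetical sample data.
sample = [
    ("dbr:Ghent", "dbo:country", "dbr:Belgium"),
    ("dbr:Bruges", "dbo:country", "dbr:Belgium"),
    ("dbr:Ghent", "rdf:type", "dbo:City"),
]
dictionary, id_triples = encode(sample)
print(len(dictionary), "distinct terms for", len(id_triples), "triples")
print(search(dictionary, id_triples, obj="dbr:Belgium"))
```

Because long URIs repeat across many triples, storing each one once is where much of the compression comes from in this simplified picture.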
HDT Linked Data Archive with TPF Support
• For each temporal snapshot of a Linked Data set, generate an HDT
serialization that provides access according to
?subject ?predicate ?object
patterns
Linked Data Archive with ?s?p?o Access: Characteristics
General Characteristics (Publisher / Consumer)
• Availability: high / high
• Bandwidth: ~ query / ~ query
• Cost: low / medium
Functionality
• Interface Expressiveness: better than subject-URI only
• LOD Integration: yes
• Memento Support: possible
• Cross Time/Data: follow your nose
Verdict:
• Publication perspective: $$$$
• Access perspective: ++++
Affordable & Useful Linked Data Archives
Linked Data Archive Type (Publishing / Consuming)
• Data Dump: $$$$ / ++++
• SPARQL Endpoint(s): $$$$ / ++++
• Subject URI Access: $$$$ / ++++
• HDT & TPF: $$$$ / ++++
Second Generation DBpedia Archive: Storage
Characteristics
• upload software: HDT-CPP
• upload time: ~ 4 hours per version
• storage software: HDT binary files
• storage space: 70 GB for 12 versions
• DBpedia versions: 12 versions, 2.0 through 2015
• number of triples: ~ 5 billion
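A quick back-of-the-envelope comparison of the two generations, using only the figures quoted on the storage slides and reading the sizes as gigabytes. The per-version values are rough averages, since individual DBpedia versions differ in size; the variable names are illustrative.

```python
# Figures quoted on the two storage slides (sizes read as gigabytes).
first_generation = {"storage_gb": 383, "versions": 10, "upload_hours": 24}
second_generation = {"storage_gb": 70, "versions": 12, "upload_hours": 4}


def gb_per_version(archive: dict) -> float:
    """Average storage per DBpedia version."""
    return archive["storage_gb"] / archive["versions"]


storage_factor = gb_per_version(first_generation) / gb_per_version(second_generation)
upload_factor = first_generation["upload_hours"] / second_generation["upload_hours"]

print(f"MongoDB: {gb_per_version(first_generation):.1f} GB per version")
print(f"HDT:     {gb_per_version(second_generation):.1f} GB per version")
print(f"Average storage per version is about {storage_factor:.1f} times smaller")
print(f"Upload time per version is about {upload_factor:.0f} times shorter")
```

On these numbers the HDT archive needs about 6 GB per version on average against about 38 GB for MongoDB, while holding more versions and more triples.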
Second Generation DBpedia Archive: ?s?p?o Query-URI Access
http://fragments.mementodepot.org/dbpedia_3_8?subject=&predicate=http://dbpedia.org/ontology/birthPlace&object=http://dbpedia.org/resource/Ghent
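A Query-URI of this shape can be assembled from its three pattern components. The sketch below uses Python's standard `urllib.parse.urlencode`; the `fragment_url` helper is a hypothetical name. Passing `safe=":/"` keeps colons and slashes unescaped so the result looks like the URI on the slide, whereas a client would normally percent-encode them.

```python
from urllib.parse import urlencode

# Base URI of one archived DBpedia version, as shown on the slide.
BASE = "http://fragments.mementodepot.org/dbpedia_3_8"


def fragment_url(base: str, subject: str = "", predicate: str = "", obj: str = "") -> str:
    """Build a ?s?p?o Query-URI; an empty string leaves that position unbound."""
    query = urlencode(
        {"subject": subject, "predicate": predicate, "object": obj},
        safe=":/",  # keep ':' and '/' readable, as on the slide
    )
    return f"{base}?{query}"


print(
    fragment_url(
        BASE,
        predicate="http://dbpedia.org/ontology/birthPlace",
        obj="http://dbpedia.org/resource/Ghent",
    )
)
```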
?s?p?o Query-URI Access
• TimeGate URI: http://fragments.mementodepot.org/timegate/dbpedia?subject={DBpediaURI}&predicate={DBpediaURI}&object={DBpediaURI}
• Example: http://fragments.mementodepot.org/timegate/dbpedia?subject=&predicate=&object=http://dbpedia.org/resource/Ghent
• TimeMap URI: not supported
• Memento URI: http://fragments.mementodepot.org/{DBpediaVersion}?subject={DBpediaURI}&predicate={DBpediaURI}&object={DBpediaURI}
• Example: http://fragments.mementodepot.org/dbpedia_3_0?subject=&predicate=&object=http://dbpedia.org/resource/Ghent
• Further info: http://mementoweb.org/depot/native/fragments/
Try it with Memento for Chrome – http://bit.ly/memento-for-chrome
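Outside a browser, datetime negotiation amounts to sending the Memento `Accept-Datetime` request header to the TimeGate URI. The sketch below only prepares such a request with the Python standard library and does not send it; the `timegate_request` helper is a hypothetical name, and whether the service is still reachable at this address is not something this example assumes.

```python
from datetime import datetime, timezone
from email.utils import format_datetime
from urllib.parse import urlencode
from urllib.request import Request

# TimeGate URI for ?s?p?o access, as listed on the slide.
TIMEGATE = "http://fragments.mementodepot.org/timegate/dbpedia"


def timegate_request(when: datetime, subject: str = "", predicate: str = "", obj: str = "") -> Request:
    """Prepare a datetime-negotiation request; `when` must be timezone-aware."""
    query = urlencode({"subject": subject, "predicate": predicate, "object": obj}, safe=":/")
    request = Request(f"{TIMEGATE}?{query}")
    # Accept-Datetime carries an HTTP date expressed in GMT.
    http_date = format_datetime(when.astimezone(timezone.utc), usegmt=True)
    request.add_header("Accept-Datetime", http_date)
    return request


request = timegate_request(
    datetime(2008, 1, 3, tzinfo=timezone.utc),
    obj="http://dbpedia.org/resource/Ghent",
)
print(request.full_url)
# urllib stores header names capitalized, hence "Accept-datetime" here.
print(request.get_header("Accept-datetime"))
```

Passing the prepared request to `urllib.request.urlopen` would perform the negotiation; a Memento-compliant TimeGate typically answers with a redirect to the Memento URI of the best-matching DBpedia version.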
Second Generation DBpedia Archive: Subject-URI Access
Subject-URI Access
• TimeGate URI: http://dbpedia.mementodepot.org/timegate/{DBpediaURI}
• Example: http://dbpedia.mementodepot.org/timegate/http://dbpedia.org/data/Ghent
• TimeMap URI: http://dbpedia.mementodepot.org/timemap/link/{DBpediaURI}
• Example: http://dbpedia.mementodepot.org/timemap/link/http://dbpedia.org/data/Ghent
• Memento URI: http://dbpedia.mementodepot.org/{yyyymmdd}/{DBpediaURI}
• Example: http://dbpedia.mementodepot.org/20080103/http://dbpedia.org/data/Ghent
• Further info: http://mementoweb.org/depot/native/dbpedia/
Try it with Memento for Chrome – http://bit.ly/memento-for-chrome
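The three URI templates can be expanded mechanically, and Memento responses point from one resource to the others through `Link` headers. The sketch below expands the templates and parses a hand-written `Link` header; the header value is a hypothetical example constructed for illustration, not a captured response, and the regular expression is a simplification that expects `rel` to be the first parameter.

```python
import re
from datetime import date

ARCHIVE = "http://dbpedia.mementodepot.org"


def timegate_uri(dbpedia_uri: str) -> str:
    return f"{ARCHIVE}/timegate/{dbpedia_uri}"


def timemap_uri(dbpedia_uri: str) -> str:
    return f"{ARCHIVE}/timemap/link/{dbpedia_uri}"


def memento_uri(day: date, dbpedia_uri: str) -> str:
    # {yyyymmdd} in the template is the archival date of the version.
    return f"{ARCHIVE}/{day:%Y%m%d}/{dbpedia_uri}"


# Simplified: matches '<target>; rel="..."' and ignores further parameters.
LINK = re.compile(r'<([^>]*)>\s*;\s*rel="([^"]*)"')


def links_by_rel(link_header: str) -> dict:
    """Map each relation type to the first target URI that carries it."""
    links = {}
    for target, rels in LINK.findall(link_header):
        for rel in rels.split():
            links.setdefault(rel, target)
    return links


subject = "http://dbpedia.org/data/Ghent"
# Hypothetical Link header of the kind a Memento might return.
example_header = (
    f'<{subject}>; rel="original", '
    f'<{timegate_uri(subject)}>; rel="timegate", '
    f'<{memento_uri(date(2008, 1, 3), subject)}>; rel="memento"; '
    'datetime="Thu, 03 Jan 2008 00:00:00 GMT"'
)
print(links_by_rel(example_header))
```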
Second Generation DBpedia Archive: Access
Characteristics
• TimeGate software: ① node.js LDF server 2.0.0; ② LDF js client
• access type: ① ?s?p?o Query-URI & datetime; ② Subject-URI & datetime
• external integration: ① DBpedia LDF server; ② current DBpedia
clients
• all clients: direct access to Mementos of Subject-URI and ?s?p?o Query-URI
• Memento clients: datetime negotiation with Subject-URI and ?s?p?o Query-URI
Building a Linked Data Archive
• Convert the archival data set(s) to HDT using HDT-CPP
HDT Software (C++)
https://github.com/rdfhdt/hdt-cpp
• input data requires cleaning before processing, especially regarding URI characters
• DBpedia data not clean
• DBpedia v3.5 was not successfully processed
• No meaningful error messages to help locate problems
• memory intensive
• Kyoto Cabinet was used to optimize storage requirement and speed during processing
• Java version exists but has memory problems
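Since the converter gives little help in locating bad input, one option is a pre-flight check that flags suspicious URIs before conversion. The sketch below is a minimal, assumed approach rather than the cleaning procedure used for the archive: it scans N-Triples lines for characters that the N-Triples grammar does not allow unescaped inside `<...>`. The `suspicious_iris` helper is a hypothetical name, and literals that themselves contain angle brackets would confuse this simple scan.

```python
import re

# Everything between '<' and the next '>' is treated as a candidate IRI.
IRI = re.compile(r"<([^>]*)>")
# \uXXXX and \UXXXXXXXX escapes are legal inside an IRI, so drop them first.
UCHAR = re.compile(r"\\u[0-9A-Fa-f]{4}|\\U[0-9A-Fa-f]{8}")
# Not allowed unescaped: control characters, space, and < > " { } | ^ ` \
ILLEGAL = re.compile(r'[\x00-\x20<>"{}|^`\\]')


def suspicious_iris(line: str) -> list:
    """Return the IRIs on one N-Triples line that contain illegal characters."""
    return [iri for iri in IRI.findall(line) if ILLEGAL.search(UCHAR.sub("", iri))]


# Hypothetical input lines.
clean = "<http://dbpedia.org/resource/Ghent> <http://dbpedia.org/ontology/country> <http://dbpedia.org/resource/Belgium> ."
dirty = "<http://dbpedia.org/resource/Some thing> <http://example.org/p> <http://example.org/a|b> ."

for number, line in enumerate([clean, dirty], start=1):
    for iri in suspicious_iris(line):
        print(f"line {number}: suspicious IRI {iri!r}")
```

Reporting the line number alongside each finding compensates for the missing error messages mentioned above.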
Building a Linked Data Archive
• Convert the archival data set(s) to HDT using HDT-CPP
• Download the Linked Data Fragments Server code
Linked Data Fragment Server (Node.js)
https://github.com/LinkedDataFragments/Server.js
• provides ?s?p?o access to local and/or remote Linked Data sets
• supports HDT, Turtle files, N-Triples files, JSON-LD files, SPARQL endpoints, in-memory store, and BlazeGraph Linked Data sets
• version 2.0.0 (released March 31 2016) has built-in Memento support
Building a Linked Data Archive
• Convert the archival data set(s) to HDT using HDT-CPP
• Download the Linked Data Fragments Server code
• Create the JSON config file for Memento
Linked Data Fragment Server, Memento Configuration
https://github.com/LinkedDataFragments/Server.js/wiki/Configuring-Memento
• declare archival data set(s)
• add datetime ranges for the archival data set(s)
• add a TimeGate
• list the archival data set(s) for which the TimeGate should support datetime negotiation
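What the datetime ranges are for can be shown with a small selection routine: given a requested datetime, a TimeGate picks the archival data set whose range covers it. The sketch below is a conceptual illustration in Python, not the server's code or its configuration format. The data set names follow the pattern seen in the archive's URIs, but the date ranges and the fallback rule are assumptions made for the example.

```python
from datetime import datetime, timezone
from typing import Dict, Tuple

Interval = Tuple[datetime, datetime]


def utc(year: int, month: int, day: int) -> datetime:
    return datetime(year, month, day, tzinfo=timezone.utc)


# Hypothetical datetime ranges; the real ranges live in the server's JSON config.
RANGES: Dict[str, Interval] = {
    "dbpedia_3_0": (utc(2008, 2, 1), utc(2008, 8, 1)),
    "dbpedia_3_8": (utc(2012, 8, 1), utc(2013, 9, 1)),
}


def select_version(ranges: Dict[str, Interval], requested: datetime) -> str:
    """Pick the data set whose range covers the requested datetime.

    Assumed fallback: outside every range, use the most recent data set that
    started before the request, or the oldest one if none did.
    """
    ordered = sorted(ranges.items(), key=lambda item: item[1][0])
    for name, (start, end) in ordered:
        if start <= requested < end:
            return name
    earlier = [name for name, (start, _end) in ordered if start <= requested]
    return earlier[-1] if earlier else ordered[0][0]


print(select_version(RANGES, utc(2008, 5, 1)))
print(select_version(RANGES, utc(2012, 12, 25)))
```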
Building a Linked Data Archive
• Convert the archival data set(s) to HDT using HDT-CPP
• Download the Linked Data Fragments Server code
• Create the JSON config file for Memento
• Run the server
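The steps above might translate into command lines along the following lines. This is a sketch under assumptions: it only assembles and prints the commands without running them, the file names are placeholders, and the exact invocations of `rdf2hdt` (from HDT-CPP) and `ldf-server` should be checked against the documentation of the versions in use. The JSON config file is written by hand and therefore has no command here.

```python
import shlex
from typing import List


def archive_commands(dump: str, hdt: str, config: str, port: int = 5000, workers: int = 4) -> List[List[str]]:
    """Assemble (but do not run) one command per automatable step."""
    return [
        ["rdf2hdt", dump, hdt],                           # convert the dump to HDT
        ["npm", "install", "-g", "ldf-server"],           # obtain the server code
        ["ldf-server", config, str(port), str(workers)],  # run the server
    ]


# Placeholder file names for one DBpedia version.
for command in archive_commands("dbpedia_3_8.nt", "dbpedia_3_8.hdt", "config.json"):
    print(shlex.join(command))
```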