Web Data Management with RDF

1. Web Data Management in RDF Age M. Tamer Ozsu University of Waterloo David R. Cheriton School of Computer Science PKU/2014-08-28 1

2. Acknowledgements This presentation draws upon collaborative research and discussions with the following colleagues (in alphabetical order) Gunes Aluc, University of Waterloo Khuzaima Daudjee, University of Waterloo Olaf Hartig, University of Waterloo Lei Chen, Hong Kong University of Science Technology Lei Zou, Peking University PKU/2014-08-28 2

3. Web Data Management A long term research interest in the DB community 2000 2004 2011 2011 PKU/2014-08-28 3

4. Interest Due to Properties of Web Data Lack of a schema Data is at best semi-structured Missing data, additional attributes, similar data but not identical Volatility Changes frequently May conform to one schema now, but not later Scale Does it make sense to talk about a schema for Web? How do you capture everything? Querying diculty What is the user language? What are the primitives? Arent search engines or metasearch engines sucient? PKU/2014-08-28 4

5. More Recent Approaches to Web Querying Fusion Tables Users contribute data in spreadsheet, CVS, KML format Possible joins between multiple data sets Extensive visualization PKU/2014-08-28 8

6. More Recent Approaches to Web Querying Fusion Tables Users contribute data in spreadsheet, CVS, KML format Possible joins between multiple data sets Extensive visualization XML Data exchange language Primarily tree based structure list title=MOVIES film titleThe Shining/title directorStanley Kubrick/director actorJack Nicholson/actor /film film titleSpartacus/title directorStanley Kubrick/director /film film titleThe Passenger/title actorJack Nicholson/actor /film ... /list root

7. lm title The Shining director Stanley Kubrick

8. lm actor ... Jack Nicholson

9. lm title The Passenger actor Jack Nicholson PKU/2014-08-28 8

10. More Recent Approaches to Web Querying Fusion Tables Users contribute data in spreadsheet, CVS, KML format Possible joins between multiple data sets Extensive visualization XML Data exchange language Primarily tree based structure RDF (Resource Description Framework) SPARQL W3C recommendation Simple, self-descriptive model Building block of semantic web Linked Open Data (LOD) PKU/2014-08-28 8

11. RDF and Semantic Web RDF is a language for the conceptual modeling of information about resources (web resources in our context) A building block of semantic web Facilitates exchange of information Search engine results can be more focused and structured Facilitates data integration (mashes) Machine understandable Understand the information on the web and the interrelationships among them PKU/2014-08-28 9

12. RDF Uses Yago and DBpedia extract facts from Wikipedia represent as RDF ! structural queries Communities build RDF data E.g., biologists: Bio2RDF and Uniprot RDF Web data integration Linked Open Data Cloud . . . PKU/2014-08-28 10

13. RDF Data Volumes . . . . . . are growing { and fast Linked data cloud currently consists of 325 datasets with 25B triples Size almost doubling every year PKU/2014-08-28 11

14. RDF Data Volumes . . . . . . are growing { and fast Linked data cloud currently consists of 325 datasets with 25B triples Size almost doubling every year As of March 2009 LinkedCT Reactome Taxonomy KEGG GeneID PubMed Pfam UniProt OMIM PDB BBC Later + TOTP riese Symbol ChEBI Daily Med Disea-some CAS HGNC Inter Pro Drug Bank UniParc UniRef ProDom PROSITE Gene Ontology Homolo Gene Pub Chem MGI UniSTS GEO Species Jamendo BBC Programm es Music-brainz Magna-tune Surge Radio MySpace Wrapper Audio- Scrobbler Linked MDB BBC John Peel BBC Playcount Data Gov- Track US Census Data Geo-names lingvoj World Fact-book Euro-stat IRIT Toulouse SW Conference Corpus RDF Book Mashup Project Guten-berg DBLP Hannover DBLP Berlin LAAS-CNRS Buda-pest BME IEEE IBM Resex Pisa New-castle RAE 2001 CiteSeer ACM DBLP RKB Explorer eprints LIBRIS Semantic Web.org Eurécom ECS South-ampton SIOC Revyu Sites Doap-space Flickr exporter FOAF profiles flickr wrappr Crunch Base Sem- Web- Central Open- Guides Wiki-company QDOS Pub Guide Open Calais RDF ohloh W3C WordNet Open Cyc UMBEL Yago DBpedia Freebase Virtuoso Sponger March '09: 89 datasets Linking Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch. http://lod-cloud.net/ PKU/2014-08-28 11

15. RDF Data Volumes . . . . . . are growing { and fast Linked data cloud currently consists of 325 datasets with 25B triples Size almost doubling every year New-castle User-generated content As of September 2010 Audio-scrobbler (DBTune) Music Brainz (zitgist) P20 YAGO Chronic-ling America World Fact-book (FUB) Geo Names Moseley Folk WordNet (VUA) WordNet (W3C) VIVO UF VIVO Indiana VIVO Cornell VIAF URI Burner Sussex Reading Lists Plymouth Reading Lists UMBEL UK Post-codes legislation .gov.uk Uberblic UB Mann-heim TWC LOGD GTAA BBC Program mes Twarql transport data.gov .uk totl.net Tele-graphis TCM Gene DIT Taxon Concept The Open Library (Talis) t4gm Surge Radio RAMEAU SH STW statistics data.gov .uk St. Andrews Resource Lists ECS South-ampton EPrints Semantic Crunch Base semantic web.org Semantic XBRL SW Dog Food rdfabout US SEC Wiki UN/ LOCODE Ulm ECS (RKB Explorer) Roma RISKS RESEX RAE2001 Pisa OS OAI NSF LAAS KISTI JISC IRIT IEEE IBM Eurécom ERA ePrints dotAC DEPLOY DBLP (RKB Explorer) Course-ware CORDIS CiteSeer Budapest ACM riese Revyu research data.gov .uk reference data.gov .uk Recht-spraak. nl RDF ohloh Last.FM (rdfize) RDF Book Mashup PSH lingvoj Product DB Poké-pédia PBAC Ord-nance Survey Openly Local The Open Library Open Cyc Open Calais OpenEI New York Times NTU Resource Lists NDL subjects MARC Codes List Man-chester Reading Lists Lotico The London Gazette LOIUS lobid Resources lobid Organi-sations Linked MDB Linked LCCN Linked GeoData Linked CT Linked Open Numbers LIBRIS Lexvo LCSH DBLP (L3S) Linked Sensor Data (Kno.e.sis) Good-win Family Jamendo iServe NSZL Catalog GovTrack GESIS Geo Species Geo Linked Data (es) STITCH Project Guten-berg (FUB) SIDER Medi Care Euro-stat (FUB) Drug Bank Disea-some DBLP (FU Berlin) Daily Med Freebase flickr wrappr Fishes of Texas FanHubz Event- Media EUTC Produc-tions Eurostat EUNIS ESD stan-dards Popula-tion (En- AKTing) NHS (EnAKTing) Mortality (En- AKTing) Energy (En- AKTing) CO2 (En- AKTing) education data.gov .uk ECS South-ampton Gem. Norm-datei data dcs MySpace (DBTune) Music Brainz (DBTune) Magna-tune John Peel (DB Tune) classical (DB Tune) Last.fm Artists (DBTune) DB Tropes dbpedia lite DBpedia Pokedex Airports NASA (Data Incu-bator) Music Brainz (Data Incubator) Discogs (Data In-cubator) Climbing Linked Data for Intervals Cornetto Chem2 Bio2RDF biz. data. gov.uk UniRef UniSTS Uni Path-way Taxo-nomy UniParc UniProt SGD Reactome PubMed Pub Chem Pfam PDB PRO-SITE ProDom OMIM OBO MGI KEGG Reaction KEGG Drug KEGG Pathway KEGG Glycan KEGG Enzyme KEGG Cpd InterPro Homolo Gene HGNC Gene Ontology GeneID Gen Bank ChEBI CAS Affy-metrix BibBase BBC Wildlife Finder BBC Music rdfabout US Census Media Geographic Publications Government Cross-domain Life sciences September '10: 203 datasets Linking Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch. http://lod-cloud.net/ PKU/2014-08-28 11

16. RDF Data Volumes . . . . . . are growing { and fast Linked data cloud currently consists of 325 datasets with 25B triples Size almost doubling every year RESEX IBM User-generated content As of September 2011 Audio Scrobbler (DBTune) Music Brainz (zitgist) P20 Turismo de Zaragoza yovisto Yahoo! Geo Planet World Fact-book Moseley Folk YAGO El Viajero Tourism BBC Program mes BBC Geo Names WordNet (VUA) WordNet (W3C) VIVO UF Calames VIVO Indiana VIVO Cornell VIAF URI Burner Sussex Reading Lists Plymouth Reading Lists Source Code Ecosystem Linked Data UniProt PubMed UniRef UMBEL UK Post-codes legislation data.gov.uk Uberblic UB Mann-heim TWC LOGD Twarql transport data.gov. uk Traffic Scotland theses. fr Thesau-rus W totl.net Tele-graphis Semantic Tweet TCM Gene DIT Taxon Concept Open Library (Talis) tags2con delicious t4gm info Swedish Open Cultural Heritage Surge Radio Sudoc STW RAMEAU SH statistics data.gov. uk St. Andrews Resource Lists ECS South-ampton EPrints SSW Thesaur us Linked User Feedback gnoss Greek DBpedia Smart Link Slideshare 2RDF semantic web.org GovTrack Semantic XBRL SW Dog Food US SEC (rdfabout) Sears Scotland Pupils Exams Scotland Geo-graphy Scholaro-meter WordNet (RKB Explorer) Wiki UN/ LOCODE Ulm ECS (RKB Explorer) Roma RISKS RAE2001 Pisa OS OAI NSF New-castle LAAS KISTI JISC IRIT IEEE Eurécom ERA ePrints dotAC DEPLOY DBLP (RKB Explorer) Crime Reports UK Course-ware CORDIS (RKB Explorer) CiteSeer Budapest ACM riese Revyu research data.gov. Ren. uk Energy Genera-tors reference data.gov. uk Recht-spraak. nl RDF ohloh Last.FM (rdfize) RDF Book Mashup Rådata nå! PSH Product Types Ontology Product DB PBAC Poké-pédia patents data.go v.uk Ox Points Ord-nance Survey Openly Local Open Library Open Cyc Open Corpo-rates Open Calais OpenEI Open Election Data Project Open Data Thesau-rus Ontos News Portal OGOLOD Ocean Drilling Codices Janus AMP New York Times NVD ntnusc NTU Resource Lists Norwe-gian MeSH NDL subjects ndlna my Experi-ment Italian Museums medu-cator MARC Codes List Man-chester Reading Lists Lotico Weather Stations London Gazette LOIUS Linked Open Colors lobid Resources lobid Organi-sations LEM Linked MDB LinkedL CCN Linked GeoData LinkedCT LOV Linked Open Numbers LODE Eurostat (Ontology Central) Linked EDGAR (Ontology Central) Linked Crunch-base lingvoj Lichfield Spen-ding LIBRIS Lexvo LCSH DBLP (L3S) Linked Sensor Data (Kno.e.sis) Klapp-stuhl-club Good-win Family National Radio-activity JP Jamendo (DBtune) Italian public schools ISTAT Immi-gration iServe IdRef Sudoc NSZL Catalog Hellenic PD Hellenic FBD Piedmont Accomo-dations GovWILD Google Art wrapper GESIS GeoWord Net Geo Species Geo Linked Data GEMET GTAA STITCH SIDER Project Guten-berg Medi Care Euro-stat (FUB) EURES Drug Bank Disea-some DBLP (FU Berlin) Daily Med CORDIS (FUB) Freebase flickr wrappr Fishes of Texas Finnish Munici-palities ChEMBL FanHubz Event Media EUTC Produc-tions Eurostat Europeana EUNIS EU Insti-tutions ESD stan-dards EARTh Enipedia Popula-tion (En- AKTing) NHS (En- AKTing) Mortality (En- AKTing) Energy (En- AKTing) Crime (En- AKTing) CO2 Emission (En- AKTing) EEA SISVU educatio n.data.g ov.uk ECS South-ampton ECCO-TCP GND Didactal ia DDC Deutsche Bio-graphie data dcs Music Brainz (DBTune) Magna-tune John Peel (DBTune) Classical (DB Tune) Last.FM artists (DBTune) DB Tropes Portu-guese DBpedia dbpedia lite DBpedia data-open-ac- uk SMC Journals Pokedex Airports NASA (Data Incu-bator) Music Brainz (Data Incubator) Metoffice Weather Forecasts Discogs (Data Incubator) Climbing data.gov.uk intervals Data Gov.ie data bnf.fr Cornetto reegle Chronic-ling America Chem2 Bio2RDF business data.gov. uk Bricklink Brazilian Poli-ticians BNB UniSTS UniPath way UniParc Taxono my UniProt (Bio2RDF) SGD Reactome Pub Chem PRO-SITE ProDom Pfam PDB OMIM MGI KEGG Reaction KEGG Pathway KEGG Glycan KEGG Enzyme KEGG Drug KEGG Com-pound InterPro Homolo Gene HGNC Gene Ontology GeneID Affy-metrix bible ontology BibBase FTS BBC Wildlife Finder Music Alpine Ski Austria LOCAH Amster-dam Museum AGROV OC AEMET US Census (rdfabout) Media Geographic Publications Government Cross-domain Life sciences September '11: 295 datasets, 25B triples Linking Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch. http://lod-cloud.net/ PKU/2014-08-28 11

17. RDF Data Volumes . . . . . . are growing { and fast Linked data cloud currently consists of 325 datasets with 25B triples Size almost doubling every year April '14: 1091 datasets, ??? triples Max Schmachtenberg, Christian Bizer, and Heiko Paulheim: Adoption of Linked Data Best Practices in Dierent Topical Domains. In Proc. ISWC, 2014. PKU/2014-08-28 11

18. Closer Look PKU/2014-08-28 12

19. Globally Distributed Network of Data PKU/2014-08-28 13

20. Three Approaches Data warehousing Consolidate data in a repository and query it SPARQL federation Leverage query services provided by data publishers Live Linked Data querying Navigate through LOD by looking up URIs at query execution time PKU/2014-08-28 14

21. Outline 1 LOD and RDF Introduction 2 Data Warehousing Approach Relational Approaches Graph-Based Approaches 3 SPARQL Federation Approach Distributed RDF Processing SPARQL Endpoint Federation 4 Live Querying Approach Traversal-based approaches Index-based approaches Hybrid approaches 5 Conclusions PKU/2014-08-28 15

23. Traditional Hypertext-based Web Access IMDb World Book Data exposed to the Web via HTML PKU/2014-08-28 17

24. Linked Data Publishing Principles (http://...linkedmdb.../Shining,releaseDate, 23 May 1980) (http://...linkedmdb.../Shining,

25. lmLocation, http://cia.../UK) (http://...linkedmdb.../29704,actedIn, http://...linkedmdb.../Shining) IMDb World Book ... (http://cia.../UK, hasPopulation, 63230000) ... Shining UK Data model: RDF Global identi

26. er: URI Access mechanism: HTTP Connection: data links PKU/2014-08-28 18

27. RDF Introduction Everything is an uniquely named resource http://data.linkedmdb.org/resource/actor/JN29704 PKU/2014-08-28 19

28. RDF Introduction Everything is an uniquely named resource Pre

29. xes can be used to shorten the names xmlns:y=http://data.linkedmdb.org/resource/actor/ y:JN29704 PKU/2014-08-28 19

31. xes can be used to shorten the names Properties of resources can be de

32. ned xmlns:y=http://data.linkedmdb.org/resource/actor/ y:JN29704 y:JN29704:hasName Jack Nicholson y:JN29704:BornOnDate 1937-04-22 PKU/2014-08-28 19

35. ned Relationships with other resources can be de

36. ned xmlns:y=http://data.linkedmdb.org/resource/actor/ y:JN29704 y:JN29704:hasName Jack Nicholson y:JN29704:BornOnDate 1937-04-22 JN29704:movieActor y:TS2014 y:TS2014:title The Shining y:TS2014:releaseDate 1980-05-23 PKU/2014-08-28 19

39. ned Relationships with other resources can be de

40. ned Resource descriptions can be contributed by dierent people/groups and can be located anywhere in the web Integrated web database xmlns:y=http://data.linkedmdb.org/resource/actor/ y:JN29704 y:JN29704:hasName Jack Nicholson y:JN29704:BornOnDate 1937-04-22 JN29704:movieActor y:TS2014 y:TS2014:title The Shining y:TS2014:releaseDate 1980-05-23 PKU/2014-08-28 19

41. RDF Data Model Triple: Subject, Predicate (Property), Object (s; p; o) Subject: the entity that is described (URI or blank node) Predicate: a feature of the entity (URI) Object: value of the feature (URI, blank node or literal) (s; p; o) 2 (U [ B) U (U [ B [ L) Set of RDF triples is called an RDF graph U Predicate Subject Object U B U B L U: set of URIs B: set of blank nodes L: set of literals Subject Predicate Object http://...imdb.../

42. lm/2014 rdfs:label The Shining http://...imdb.../

43. lm/2014 movie:releaseDate 1980-05-23 http://...imdb.../29704 movie:actor name Jack Nicholson : : : : : : : : : PKU/2014-08-28 20

44. RDF Example Instance Pre

45. xes: mdb=http://data.linkedmdb.org/resource/; geo=http://sws.geonames.org/ bm=http://wifo5-03.informatik.uni-mannheim.de/bookmashup/ lexvo=http://lexvo.org/id/;wp=http://en.wikipedia.org/wiki/ Subject Predicate Object mdb:

46. lm/2014 rdfs:label The Shining mdb:

47. lm/2014 movie:initial release date 1980-05-23' mdb:

48. lm/2014 movie:director mdb:director/8476 mdb:

49. lm/2014 movie:actor mdb:actor/29704 mdb:

50. lm/2014 movie:actor mdb: actor/30013 mdb:

51. lm/2014 movie:music contributor mdb: music contributor/4110 mdb:

52. lm/2014 foaf:based near geo:2635167 mdb:

53. lm/2014 movie:relatedBook bm:0743424425 mdb:

54. lm/2014 movie:language lexvo:iso639-3/eng mdb:director/8476 movie:director name Stanley Kubrick mdb:

56. lm/2685 rdfs:label A Clockwork Orange mdb:

58. lm/424 rdfs:label Spartacus mdb:actor/29704 movie:actor name Jack Nicholson mdb:

60. lm/1267 rdfs:label The Last Tycoon mdb:

62. lm/3418 rdfs:label The Passenger geo:2635167 gn:name United Kingdom geo:2635167 gn:population 62348447 geo:2635167 gn:wikipediaArticle wp:United Kingdom bm:books/0743424425 dc:creator bm:persons/Stephen+King bm:books/0743424425 rev:rating 4.7 bm:books/0743424425 scom:hasOer bm:oers/0743424425amazonOer lexvo:iso639-3/eng rdfs:label English lexvo:iso639-3/eng lvont:usedIn lexvo:iso3166/CA lexvo:iso639-3/eng lvont:usesScript lexvo:script/Latn URI Literal URI PKU/2014-08-28 21

63. RDF Graph United Kingdom gn:name The Passenger refs:label 62348447 gn:population mdb:

64. lm/2014 movie:initial release date 1980-05-23 bm:oers/0743424425amazonOer The Shining refs:label bm:books/0743424425 4.7 rev:rating geo:2635167 The Last Tycoon refs:label movie:actor movie:actor mdb:actor/29704 movie:actor name Jack Nicholson mdb:

65. lm/3418 mdb:

66. lm/1267 mdb:director/8476 Stanley Kubrick movie:director name mdb:

67. lm/2685 refs:label A Clockwork Orange mdb:

68. lm/424 refs:label Spartacus mdb:actor/30013 movie:relatedBook scam:hasOer foaf:based near movie:actor movie:director movie:actor movie:director movie:director PKU/2014-08-28 22

69. Linked Data Model [Hartig, 2012] Web Document Given a countably in

70. nite set D (documents), a Web of Linked Data is a tuple W = (D; adoc; data) where: I D D, I adoc is a partial mapping from URIs to D, and I data is a total mapping from D to

71. nite sets of RDF triples. PKU/2014-08-28 23

72. Linked Data Model [Hartig, 2012] Web Document Given a countably in

73. nite set D (documents), a Web of Linked Data is a tuple W = (D; adoc; data) where: I D D, I adoc is a partial mapping from URIs to D, and I data is a total mapping from D to

74. nite sets of RDF triples. Web of Linked Data A Web of Linked Data W = (D; adoc; data) contains a data link from document d 2 D to document d0 2 D if there exists a URI u such that: I u is mentioned in an RDF triple t 2 data(d), and I d0 = adoc(u). PKU/2014-08-28 23

75. RDF Query Model { SPARQL Query Model - SPARQL Protocol and RDF Query Language Given U (set of URIs), L (set of literals), and V (set of variables), a SPARQL expression is de

76. ned recursively: an atomic triple pattern, which is an element of (U [ V) (U [ V) (U [ V [ L) ?x rdfs:label The Shining P FILTER R, where P is a graph pattern expression and R is a built-in SPARQL condition (i.e., analogous to a SQL predicate) ?x rev:rating ?p FILTER(?p 3.0) P1 AND/OPT/UNION P2, where P1 and P2 are graph pattern expressions Example: SELECT ?name WHERE f ?m r d f s : l a b e l ?name . ?m movie : d i r e c t o r ?d . ?d movie : d i r e c t o r n ame S t a n l e y Kubr i ck . ?m movie : r e l a t e dBo o k ?b . ?b r e v : r a t i n g ? r . FILTER(? r 4 . 0 ) g PKU/2014-08-28 24

77. SPARQL Queries SELECT ?name WHERE f ?m r d f s : l a b e l ?name . ?m movie : d i r e c t o r ?d . ?d movie : d i r e c t o r n ame S t a n l e y Kubr i ck . ?m movie : r e l a t e dBo o k ?b . ?b r e v : r a t i n g ? r . FILTER(? r 4 . 0 ) g FILTER(?r 4.0) ?m ?d movie:director ?name rdfs:label ?b movie:relatedBook Stanley Kubrick movie:director name ?r rev:rating PKU/2014-08-28 25

79. Nave Triple Store Design SELECT ?name WHERE f ?m r d f s : l a b e l ?name . ?m movie : d i r e c t o r ?d . ?d movie : d i r e c t o r n ame S t a n l e y Kubr i ck . ?m movie : r e l a t e dBo o k ?b . ?b r e v : r a t i n g ? r . FILTER(? r 4 . 0 ) g Subject Property Object mdb:

81. lm/2014 movie:initial release date 1980-05-23 mdb:

96. lm/3418 rdfs:label The Passenger geo:2635167 gn:name United Kingdom geo:2635167 gn:population 62348447 geo:2635167 gn:wikipediaArticle wp:United Kingdom bm:books/0743424425 dc:creator bm:persons/Stephen+King bm:books/0743424425 rev:rating 4.7 bm:books/0743424425 scom:hasOer bm:oers/0743424425amazonOer lexvo:iso639-3/eng rdfs:label English lexvo:iso639-3/eng lvont:usedIn lexvo:iso3166/CA lexvo:iso639-3/eng lvont:usesScript lexvo:script/Latn PKU/2014-08-28 27

114. lm/3418 rdfs:label The Passenger geo:2635167 gn:name United Kingdom geo:2635167 gn:population 62348447 geo:2635167 gn:wikipediaArticle wp:United Kingdom bm:books/0743424425 dc:creator bm:persons/Stephen+King bm:books/0743424425 rev:rating 4.7 bm:books/0743424425 scom:hasOer bm:oers/0743424425amazonOer lexvo:iso639-3/eng rdfs:label English lexvo:iso639-3/eng lvont:usedIn lexvo:iso3166/CA lexvo:iso639-3/eng lvont:usesScript lexvo:script/Latn SELECT T1 . o b j e c t FROM T as T1 , T as T2 , T as T3 , T as T4 , T as T5 WHERE T1 . p= r d f s : l a b e l AND T2 . p=movie : r e l a t e dBo o k AND T3 . p=movie : d i r e c t o r AND T4 . p= r e v : r a t i n g AND T5 . p=movie : d i r e c t o r n ame AND T1 . s=T2 . s AND T1 . s=T3 . s AND T2 . o=T4 . s AND T3 . o=T5 . s AND T4 . o 4 . 0 AND T5 . o= S t a n l e y Kubr i ck PKU/2014-08-28 27

132. lm/3418 rdfs:label The Passenger geo:2635167 gn:name United Kingdom geo:2635167 gn:population 62348447 geo:2635167 gn:wikipediaArticle wp:United Kingdom bm:books/0743424425 dc:creator bm:persons/Stephen+King bm:books/0743424425 rev:rating 4.7 bm:books/0743424425 scom:hasOer bm:oers/0743424425amazonOer lexvo:iso639-3/eng rdfs:label English lexvo:iso639-3/eng lvont:usedIn lexvo:iso3166/CA lexvo:iso639-3/eng lvont:usesScript lexvo:script/Latn Easy to implement but too many self-joins! SELECT T1 . o b j e c t FROM T as T1 , T as T2 , T as T3 , T as T4 , T as T5 WHERE T1 . p= r d f s : l a b e l AND T2 . p=movie : r e l a t e dBo o k AND T3 . p=movie : d i r e c t o r AND T4 . p= r e v : r a t i n g AND T5 . p=movie : d i r e c t o r n ame AND T1 . s=T2 . s AND T1 . s=T3 . s AND T2 . o=T4 . s AND T3 . o=T5 . s AND T4 . o 4 . 0 AND T5 . o= S t a n l e y Kubr i ck PKU/2014-08-28 27

133. Exhaustive Indexing RDF-3X [Neumann and Weikum, 2008, 2009], Hexastore [Weiss et al., 2008] Strings are mapped to ids using a mapping table Original triple table Subject Property Object mdb:

135. lm/2014 movie:initial release date 1980-05-23 mdb:director/8476 movie:director name Stanley Kubrick mdb:

136. lm/2685 movie:director mdb:director/8476 Encoded triple table Subject Property Object 0 1 2 0 3 4 5 6 7 8 9 5 Mapping table ID Value 0 mdb:

137. lm/2014 1 rdfs:label 2 The Shining 3 movie:initial release date 4 1980-05-23 5 mdb:director/8476 6 movie:director name 7 Stanley Kubrick 8 mdb:

138. lm/2685 9 movie:director PKU/2014-08-28 28

139. Exhaustive Indexing RDF-3X [Neumann and Weikum, 2008, 2009], Hexastore [Weiss et al., 2008] Strings are mapped to ids using a mapping table Triples are indexed in a clustered B+ tree in lexicographic order Subject Property Object 0 1 2 0 3 4 5 6 7 8 9 5 ... ... ... B+ tree Easy querying through mapping table PKU/2014-08-28 28

140. Exhaustive Indexing RDF-3X [Neumann and Weikum, 2008, 2009], Hexastore [Weiss et al., 2008] Strings are mapped to ids using a mapping table Triples are indexed in a clustered B+ tree in lexicographic order Create indexes for permutations of the three columns: SPO, SOP, PSO, POS, OPS, OSP Subject Property Object 0 1 2 0 3 4 5 6 7 8 9 5 ... ... ... B+ tree Easy querying through mapping table PKU/2014-08-28 28

141. Exhaustive Indexing{Query Execution Each triple pattern can be answered by a range query Joins between triple patterns computed using merge join Join order is easy due to extensive indexing Subject Property Object 0 1 2 0 3 4 5 6 7 8 9 5 ... ... ... ID Value 0 mdb:

143. lm/2685 9 movie:director PKU/2014-08-28 29

146. lm/2685 9 movie:director Advantages I Eliminates some of the joins { they become range queries I Merge join is easy and fast PKU/2014-08-28 29

149. lm/2685 9 movie:director Advantages I Eliminates some of the joins { they become range queries I Merge join is easy and fast Disadvantages I Space usage PKU/2014-08-28 29

150. Property Tables Grouping by entities; Jena [Wilkinson, 2006], DB2-RDF [Bornea et al., 2013] Clustered property table: group together the properties that tend to occur in the same (or similar) subjects Property-class table: cluster the subjects with the same type of property into one property table Subject Property Object mdb:

154. lm/2685 rdfs:label A Clockwork Orange mdb:actor/29704 movie:actor name Jack Nicholson : : : : : : : : : Subject refs:label movie:director mob:

155. lm/2014 The Shining mob:director/8476 mob:

156. lm/2685 The Clockwork Orange mob:director/8476 Subject movie:actor name mdb:actor Jack Nicholson PKU/2014-08-28 30

163. lm/2685 The Clockwork Orange mob:director/8476 Subject movie:actor name mdb:actor Jack Nicholson Advantages I Fewer joins I If the data is structured, we have a relational system { similar to normalized relations PKU/2014-08-28 30

170. lm/2685 The Clockwork Orange mob:director/8476 Subject movie:actor name mdb:actor Jack Nicholson Advantages I Fewer joins I If the data is structured, we have a relational system { similar to normalized relations Disadvantages I Potentially a lot of NULLs I Clustering is not trivial I Multi-valued properties are complicated PKU/2014-08-28 30

171. Binary Tables Grouping by properties: For each property, build a two-column table, containing both subject and object, ordered by subjects [Abadi et al., 2007, 2009] Also called vertical partitioned tables n two column tables (n is the number of unique properties in the data) Subject Property Object mdb:

175. lm/2685 rdfs:label A Clockwork Orange mdb:actor/29704 movie:actor name Jack Nicholson : : : : : : : : : movie:director Subject Object mdb:

176. lm/2014 mdb:director/8476 mdb:

177. lm/2685 mdb:director/8476 refs:label Subject Object mob:

178. lm/2014 The Shining mob:

179. lm/2685 The Clockwork Orange movie:actor name Subject Object mdb:actor/29704 Jack Nicholson PKU/2014-08-28 31

188. lm/2685 The Clockwork Orange movie:actor name Subject Object mdb:actor/29704 Jack Nicholson Advantages I Supports multi-valued properties I No NULLs I No clustering I Read only needed attributes (i.e. less I/O) I Good performance for subject-subject joins PKU/2014-08-28 31

197. lm/2685 The Clockwork Orange movie:actor name Subject Object mdb:actor/29704 Jack Nicholson Advantages I Supports multi-valued properties I No NULLs I No clustering I Read only needed attributes (i.e. less I/O) I Good performance for subject-subject joins Disadvantages I Not useful for subject-object joins I Expensive inserts PKU/2014-08-28 31

198. Graph-based Approach Answering SPARQL query subgraph matching gStore [Zou et al., 2011, 2014] FILTER(?r 4.0) ?m ?d movie:director ?name rdfs:label ?b movie:relatedBook Stanley Kubrick movie:director name ?r rev:rating Subgraph Matching United Kingdom gn:name The Passenger refs:label 62348447 gn:population mdb:

200. lm/3418 mdb:

203. lm/424 refs:label Spartacus mdb:actor/30013 movie:relatedBook scam:hasOer foaf:based near movie:actor movie:director movie:actor movie:director movie:director PKU/2014-08-28 32

206. lm/3418 mdb:

209. lm/424 refs:label Spartacus mdb:actor/30013 movie:relatedBook scam:hasOer foaf:based near movie:actor movie:director movie:actor movie:director movie:director Advantages I Maintains the graph structure I Full set of queries can be handled PKU/2014-08-28 32

212. lm/3418 mdb:

215. lm/424 refs:label Spartacus mdb:actor/30013 movie:relatedBook scam:hasOer foaf:based near movie:actor movie:director movie:actor movie:director movie:director Advantages I Maintains the graph structure I Full set of queries can be handled Disadvantages I Graph pattern matching is expensive PKU/2014-08-28 32

216. gStore General Approach: Work directly on the RDF graph and the SPARQL query graph Use a signature-based encoding of each entity and class vertex to speed up matching Filter-and-evaluate Use a false positive algorithm to prune nodes and obtain a set of candidates; then do more detailed evaluation on those Use an index (VS-tree) over the data signature graph (has light maintenance load) for ecient pruning PKU/2014-08-28 33

217. 1. Encode Q and G to Get Signature Graphs Query signature graph Q 00010 0100 0000 1000 0000 0000 0100 10000 Data signature graph G 0010 1000 0100 0001 00001 1000 0001 00010 0000 0100 10000 0000 1000 10000 0000 0010 10000 0000 1001 00100 1001 1000 01000 0001 0001 01000 0100 1000 01000 0001 0100 01000 PKU/2014-08-28 34

218. 2. Filter-and-Evaluate Query signature graph Q 00010 0100 0000 1000 0000 0000 0100 10000 Data signature graph G 0010 1000 0100 0001 00001 1000 0001 00010 0000 0100 10000 0000 1000 10000 0000 0010 10000 0000 1001 00100 1001 1000 01000 0001 0001 01000 0100 1000 01000 0001 0100 01000 Find matches of Q over signature graph G Verify each match in RDF graph G PKU/2014-08-28 35

219. How to Generate Candidate List Two step process: 1. For each node of Q get lists of nodes in G that include that node. 2. Do a multi-way join to get the candidate list PKU/2014-08-28 36

220. How to Generate Candidate List Two step process: 1. For each node of Q get lists of nodes in G that include that node. 2. Do a multi-way join to get the candidate list Alternatives: PKU/2014-08-28 36

221. How to Generate Candidate List Two step process: 1. For each node of Q get lists of nodes in G that include that node. 2. Do a multi-way join to get the candidate list Alternatives: Sequential scan of G Both steps are inecient PKU/2014-08-28 36

222. How to Generate Candidate List Two step process: 1. For each node of Q get lists of nodes in G that include that node. 2. Do a multi-way join to get the candidate list Alternatives: Sequential scan of G Both steps are inecient Use S-trees Height-balanced tree over signatures Run an inclusion query for each node of Q and get lists of nodes in G that include that node. Given query signature q and a set of data signatures S,

223. nd all data signatures si 2 S where qsi = q Does not support second step { expensive PKU/2014-08-28 36

224. How to Generate Candidate List Two step process: 1. For each node of Q get lists of nodes in G that include that node. 2. Do a multi-way join to get the candidate list Alternatives: Sequential scan of G Both steps are inecient Use S-trees Height-balanced tree over signatures Run an inclusion query for each node of Q and get lists of nodes in G that include that node. Given query signature q and a set of data signatures S,

225. nd all data signatures si 2 S where qsi = q Does not support second step { expensive VS-tree (and VS-tree) Multi-resolution summary graph based on S-tree Supports both steps eciently Grouping by vertices PKU/2014-08-28 36

226. S-tree Solution 1111 1111 00010 0110 1111 1101 1101 0000 1110 0110 1001 1100 1001 1001 1101 G1 G2 0000 1000 001 0010 1000 0000 0100 0000 0010 1000 0001 0100 0001 008 0100 1000 0000 1001 1001 1000 0001 0001 0001 0100 005 004 006 002 003 007 011 009 010 d1 1 d2 1 d2 2 d3 1 d3 2 d3 3 d3 4 G3 0100 0000 1000 0000 0000 0100 10000 PKU/2014-08-28 37

227. S-tree Solution 1111 1111 00010 0110 1111 1101 1101 0000 1110 0110 1001 1100 1001 1001 1101 G1 G2 0000 1000 001 0010 1000 0000 0100 0000 0010 1000 0001 0100 0001 008 0100 1000 0000 1001 1001 1000 0001 0001 0001 0100 005 004 006 002 003 007 011 009 010 d1 1 d2 1 d2 2 d3 1 d3 2 d3 3 d3 4 G3 0100 0000 1000 0000 0000 0100 10000 PKU/2014-08-28 37

228. S-tree Solution 10000 002 1111 1111 00010 0110 1111 1101 1101 0000 1110 0110 1001 1100 1001 1001 1101 G1 G2 0000 1000 001 0010 1000 0000 0100 0000 0010 1000 0001 0100 0001 008 0100 1000 0000 1001 1001 1000 0001 0001 0001 0100 005 004 006 002 003 007 011 009 010 d1 1 d2 1 d2 2 d3 1 d3 2 d3 3 d3 4 G3 0100 0000 1000 0000 0000 0100 011 PKU/2014-08-28 37

229. S-tree Solution 10000 002 1111 1111 00010 0110 1111 1101 1101 0000 1110 0110 1001 1100 1001 1001 1101 G1 G2 0000 1000 001 0010 1000 0000 0100 0000 0010 1000 0001 0100 0001 008 0100 1000 0000 1001 1001 1000 0001 0001 0001 0100 005 004 006 002 003 007 011 009 010 d1 1 d2 1 d2 2 d3 1 d3 2 d3 3 d3 4 G3 0100 0000 1000 0000 0000 0100 011 003 008 PKU/2014-08-28 37

230. S-tree Solution 10000 002 1111 1111 00010 0110 1111 1101 1101 0000 1110 0110 1001 1100 1001 1001 1101 G1 G2 0000 1000 001 0010 1000 0000 0100 0000 0010 1000 0001 0100 0001 008 0100 1000 0000 1001 1001 1000 0001 0001 0001 0100 005 004 006 002 003 007 011 009 010 d1 1 d2 1 d2 2 d3 1 d3 2 d3 3 d3 4 G3 0100 0000 1000 0000 0000 0100 011 003 008 004 009 PKU/2014-08-28 37

231. S-tree Solution 10000 002 1111 1111 00010 004 009 on on 0110 1111 1101 1101 0000 1110 0110 1001 1100 1001 1001 1101 G1 G2 0000 1000 001 0010 1000 0000 0100 0000 0010 1000 0001 0100 0001 008 0100 1000 0000 1001 1001 1000 0001 0001 0001 0100 005 004 006 002 003 007 011 009 010 d1 1 d2 1 d2 2 d3 1 d3 2 d3 3 d3 4 G3 0100 0000 1000 0000 0000 0100 011 003 008 PKU/2014-08-28 37

232. S-tree Solution 10000 002 1111 1111 00010 004 009 on on 0110 1111 1101 1101 0000 1110 0110 1001 1100 1001 1001 1101 G1 G2 0000 1000 001 0010 1000 0000 0100 0000 0010 1000 0001 0100 0001 008 0100 1000 0000 1001 1001 1000 0001 0001 0001 0100 005 004 006 002 003 007 011 009 010 d1 1 d2 1 d2 2 d3 1 d3 2 d3 3 d3 4 G3 0100 0000 1000 0000 0000 0100 011 003 008 Possibly large join space! PKU/2014-08-28 37

233. VS-tree 1111 1111 10001 01100 0110 1111 1101 1101 10000 0000 1110 0110 1001 1100 1001 1001 1101 G1 G2 0000 1000 001 0010 1000 0000 0100 0000 0010 1000 0001 0100 0001 008 01000 0100 1000 0000 1001 1001 1000 01000 0001 0001 0001 0100 005 004 006 002 003 007 011 009 010 d1 1 d2 1 d2 2 d3 1 d3 2 d3 3 d3 4 G3 11101 10010 10000 00001 01100 00010 01000 01000 10000 10000 10000 10000 00010 00100 01000 01000 Super edge PKU/2014-08-28 38

234. Pruning with VS-Tree 1111 1111 10001 01100 0110 1111 1101 1101 10000 0000 1110 0110 1001 1100 1001 1001 1101 G1 G2 0000 1000 001 0010 1000 0000 0100 0000 0010 1000 0001 0100 0001 008 01000 0100 1000 0000 1001 1001 1000 01000 0001 0001 0001 0100 005 004 006 002 003 007 011 009 010 d1 1 d2 1 d2 2 d3 1 d3 2 d3 3 d3 4 G3 11101 10010 10000 00001 01100 00010 01000 01000 10000 10000 10000 10000 00010 00100 01000 01000 00010 0100 0000 1000 0000 0000 0100 10000 PKU/2014-08-28 39

235. Pruning with VS-Tree 1111 1111 10001 01100 0110 1111 1101 1101 10000 0000 1110 0110 1001 1100 1001 1001 1101 G1 G2 0000 1000 001 0010 1000 0000 0100 0000 0010 1000 0001 0100 0001 008 01000 0100 1000 0000 1001 1001 1000 01000 0001 0001 0001 0100 005 004 006 002 003 007 011 009 010 d1 1 d2 1 d2 2 d3 1 d3 2 d3 3 d3 4 G3 11101 10010 10000 00001 01100 00010 01000 01000 10000 10000 10000 10000 00010 00100 01000 01000 00010 0100 0000 1000 0000 0000 0100 10000 PKU/2014-08-28 39

236. Pruning with VS-Tree 1111 1111 10001 01100 0110 1111 1101 1101 10000 0000 1110 0110 1001 1100 1001 1001 1101 G1 G2 0000 1000 001 0010 1000 0000 0100 0000 0010 1000 0001 0100 0001 008 01000 0100 1000 0000 1001 1001 1000 01000 0001 0001 0001 0100 005 004 006 002 003 007 011 009 010 d1 1 d2 1 d2 2 d3 1 d3 2 d3 3 d3 4 G3 11101 10010 10000 00001 01100 00010 01000 01000 10000 10000 10000 10000 00010 00100 01000 01000 00010 0100 0000 1000 0000 0000 0100 10000 d3 2 d3 3 00010 10000 d3 3 d3 4 d3 1 d3 4 G3 01000 PKU/2014-08-28 39

237. Pruning with VS-Tree 1111 1111 10001 01100 0110 1111 1101 1101 10000 0000 1110 0110 1001 1100 1001 1001 1101 G1 G2 0000 1000 001 0010 1000 0000 0100 0000 0010 1000 0001 0100 0001 008 01000 0100 1000 0000 1001 1001 1000 01000 0001 0001 0001 0100 005 004 006 002 003 007 011 009 010 d1 1 d2 1 d2 2 d3 1 d3 2 d3 3 d3 4 G3 11101 10010 10000 00001 01100 00010 01000 01000 10000 10000 10000 10000 00010 00100 01000 01000 00010 0100 0000 1000 0000 0000 0100 10000 d3 2 d3 3 00010 10000 d3 3 d3 4 d3 1 d3 4 G3 01000 003 008 002 011 004 009 on on PKU/2014-08-28 39

238. Pruning with VS-Tree 1111 1111 10001 01100 0110 1111 1101 1101 10000 0000 1110 0110 1001 1100 1001 1001 1101 G1 G2 0000 1000 001 0010 1000 0000 0100 0000 0010 1000 0001 0100 0001 008 01000 0100 1000 0000 1001 1001 1000 01000 0001 0001 0001 0100 005 004 006 002 003 007 011 009 010 d1 1 d2 1 d2 2 d3 1 d3 2 d3 3 d3 4 G3 11101 10010 10000 00001 01100 00010 01000 01000 10000 10000 10000 10000 00010 00100 01000 01000 00010 0100 0000 1000 0000 0000 0100 10000 d3 2 d3 3 00010 10000 d3 3 d3 4 d3 1 d3 4 G3 01000 003 008 002 011 004 009 on on PKU/2014-08-28 39

239. Pruning with VS-Tree 1111 1111 10001 01100 0110 1111 1101 1101 10000 0000 1110 0110 1001 1100 1001 1001 1101 G1 G2 0000 1000 001 0010 1000 0000 0100 0000 0010 1000 0001 0100 0001 008 01000 0100 1000 0000 1001 1001 1000 01000 0001 0001 0001 0100 005 004 006 002 003 007 011 009 010 d1 1 d2 1 d2 2 d3 1 d3 2 d3 3 d3 4 G3 11101 10010 10000 00001 01100 00010 01000 01000 10000 10000 10000 10000 00010 00100 01000 01000 00010 0100 0000 1000 0000 0000 0100 10000 d3 2 d3 3 00010 10000 d3 3 d3 4 d3 1 d3 4 G3 01000 003 008 002 011 004 009 on on Reduced join space! PKU/2014-08-28 39

240. Adaptivity to Workload Web applications that are supported by RDF data management systems are far more varied than conventional relational applications Data that are being handled are far more heterogeneous SPARQL is far more exible in how triple patterns (i.e., the atomic query unit) can be combined An experiment [Aluc et al., 2014] RDF-3X VOS (6.1) VOS (7.1) MonetDB 4Store % queries for which tested system is fastest 20.9 0.0 22.6 56.5 0.0 Total workload execution time (hours) 27.1 20.9 20.8 38.6 72.2 Mean (per query) execution time (sec- onds) 7.8 6.0 6.0 11.1 20.7 PKU/2014-08-28 40

241. Adaptivity to Workload Web applications that are supported by RDF data management systems are far more varied than conventional relational applications Data that are being handled are far more heterogeneous SPARQL is far more exible in how triple patterns (i.e., the atomic query unit) can be combined An experiment [Aluc et al., 2014] Summary of Experiments RDF-3X VOS (6.1) VOS (7.1) MonetDB 4Store % I queries No single for which system is a sole 20.9 winner across 0.0 all queries 22.6 56.5 0.0 tested I system is No single system is the sole loser across all queries, either fastest I There can be 2{5 orders of magnitude dierence in the performance (i.e., query Total workload execution time (hours) 27.1 20.9 20.8 38.6 72.2 execution time) between the best and the worst system for a given query I The winner in one query may timeout in another I Performance dierence widens as dataset size increases Mean (per query) execution time (sec- onds) 7.8 6.0 6.0 11.1 20.7 PKU/2014-08-28 40

242. Group-by-Query Approach hasPost Tamer Post23571 Olaf taggedIn UWaterloo worksAt hasPost taggedIn hasPost Tamer Post23 Bob likes UWaterloo worksAt Post2 hasPost favourites hasPost Tamer Post23 Bob taggedIn UWaterloo worksAt Post2 PKU/2014-08-28 41

243. Challenges Group-by-query clusters (a) do not have

244. xed size, (b) contain same set of attributes 1. Workload time analysis Type-A, robust Type-C, robust Type-A, adaptable Type-B, adaptable Type-B, adaptable Type-B, adaptable Type-B, adaptable Type-C, adaptable PKU/2014-08-28 42

246. xed size, (b) contain same set of attributes 1. Workload time analysis 2. Updating the physical layout Cache Storage System Hash Function evict @t1 function adapts Hash Function @tk PKU/2014-08-28 42

248. xed size, (b) contain same set of attributes 1. Workload time analysis 2. Updating the physical layout 3. Partial indexing Index { { { { { { { { { { Cache Storage System Hash Function evict @t1 function adapts Hash Function @tk SPARQL Query Engine PKU/2014-08-28 42

249. chameleon-db Prototype system [Aluc et al., 2013] 35,000 lines of code in C++ and growing Structural Index ... Vertex Index Spill Index Storage System Cluster Index Storage Advisor Query Engine Plan Generation Evaluation PKU/2014-08-28 43

250. Some Open Problems Scalability of the solutions to very large datasets Maintenance of auxiliary data structures in dynamic environments Adaptive systems to handle varying and time-changing workloads Uncertain RDF data processing Keyword search over RDF data Query processing over incomplete RDF data PKU/2014-08-28 44

252. Remember the Environment Distributed environment Some of the data sites can process SPARQL queries { SPARQL endpoints Not all data sites can process queries PKU/2014-08-28 46

253. Remember the Environment Distributed environment Some of the data sites can process SPARQL queries { SPARQL endpoints Not all data sites can process queries Alternatives PKU/2014-08-28 46

254. Remember the Environment Distributed environment Some of the data sites can process SPARQL queries { SPARQL endpoints Not all data sites can process queries Alternatives Data re-distribution + query decomposition PKU/2014-08-28 46

255. Remember the Environment Distributed environment Some of the data sites can process SPARQL queries { SPARQL endpoints Not all data sites can process queries Alternatives Data re-distribution + query decomposition SPARQL federation: just process at SPARQL endpoints PKU/2014-08-28 46

256. Remember the Environment Distributed environment Some of the data sites can process SPARQL queries { SPARQL endpoints Not all data sites can process queries Alternatives Data re-distribution + query decomposition SPARQL federation: just process at SPARQL endpoints Live querying (see next section) PKU/2014-08-28 46

257. Distributed RDF Processing [Kaoudi and Manolescu, 2014] Data partitioning approaches RDF data warehouse is partitioned and distributed RDF data D = fD1; : : : ;Dng Allocate each Di to a site PKU/2014-08-28 47

258. Distributed RDF Processing [Kaoudi and Manolescu, 2014] Data partitioning approaches RDF data warehouse is partitioned and distributed RDF data D = fD1; : : : ;Dng Allocate each Di to a site Partitioning alternatives Table-based (e.g., [Husain et al., 2011]) Graph-based (e.g., [Huang et al., 2011; Zhang et al., 2013]) Unit-based (e.g., [Gurajada et al., 2014; Lee and Liu, 2013]) PKU/2014-08-28 47

259. Distributed RDF Processing [Kaoudi and Manolescu, 2014] Data partitioning approaches RDF data warehouse is partitioned and distributed RDF data D = fD1; : : : ;Dng Allocate each Di to a site Partitioning alternatives Table-based (e.g., [Husain et al., 2011]) Graph-based (e.g., [Huang et al., 2011; Zhang et al., 2013]) Unit-based (e.g., [Gurajada et al., 2014; Lee and Liu, 2013]) SPARQL query decomposed Q = fQ1; : : : ;Qkg Distributed execution of fQ1; : : : ;Qkg over fD1; : : : ;Dng PKU/2014-08-28 47

260. Distributed RDF Processing [Kaoudi and Manolescu, 2014] Data partitioning approaches RDF data warehouse is partitioned and distributed RDF data D = fD1; : : : ;Dng Allocate each Di to a site Partitioning alternatives Table-based (e.g., [Husain et al., 2011]) Graph-based (e.g., [Huang et al., 2011; Zhang et al., 2013]) Unit-based (e.g., [Gurajada et al., 2014; Lee and Liu, 2013]) SPARQL query decomposed Q = fQ1; : : : ;Qkg Distributed execution of fQ1; : : : ;Qkg over fD1; : : : ;Dng I High performance I Great for parallelizing centralized RDF data I May not be possible to re-partition and re-allocate Web data (i.e., LOD) PKU/2014-08-28 47

261. Distributed RDF Processing { 2 Data summary-based approaches Build summaries (index) for the distributed RDF datasets (e.g., [Atre et al., 2010; Prasser et al., 2012]) PKU/2014-08-28 48

262. Distributed RDF Processing { 2 Data summary-based approaches Build summaries (index) for the distributed RDF datasets (e.g., [Atre et al., 2010; Prasser et al., 2012]) SPARQL query Q = fQ1; : : : ;Qkg Distributed execution of fQ1; : : : ;Qkg using the data summary PKU/2014-08-28 48

263. Distributed RDF Processing { 2 Data summary-based approaches Build summaries (index) for the distributed RDF datasets (e.g., [Atre et al., 2010; Prasser et al., 2012]) SPARQL query Q = fQ1; : : : ;Qkg Distributed execution of fQ1; : : : ;Qkg using the data summary I No data re-partitioning and re-allocation I Have to scan the data at each site I Index over distributed data with maintenance concerns PKU/2014-08-28 48

264. SPARQL Endpoint Federation Consider only the SPARQL endpoints for query execution No data re-partitioning/re-distribution Consider D = D1 [ D2 [ : : : [ Dn; Di : SPARQL endpoint PKU/2014-08-28 49

265. SPARQL Endpoint Federation Consider only the SPARQL endpoints for query execution No data re-partitioning/re-distribution Consider D = D1 [ D2 [ : : : [ Dn; Di : SPARQL endpoint Alternatives SPARQL query decomposed Q = fQ1; : : : ;Qkg and executed over fD1; : : : ;Dng { DARQ, FedX [Schwarte et al., 2011], SPLENDID [Gorlitz and Staab, 2011], ANAPSID [Acosta et al., 2011] Partial query evaluation { Distributed gStore [Peng et al., 2014] PKU/2014-08-28 49

266. SPARQL Endpoint Federation Consider only the SPARQL endpoints for query execution No data re-partitioning/re-distribution Consider D = D1 [ D2 [ : : : [ Dn; Di : SPARQL endpoint Alternatives SPARQL query decomposed Q = fQ1; : : : ;Qkg and executed over fD1; : : : ;Dng { DARQ, FedX [Schwarte et al., 2011], SPLENDID [Gorlitz and Staab, 2011], ANAPSID [Acosta et al., 2011] Partial query evaluation { Distributed gStore [Peng et al., 2014] Partial evaluation I Given function f (s; d) and part of its input s, perform f 's computation that only depends on s to get f 0(d) I Compute f 0(d) when d becomes available I Applied to, e.g., XML [Buneman et al., 2006] PKU/2014-08-28 49

267. Distributed SPARQL Using Partial Query Evaluation Two steps: 1. Evaluate a query at each site to

268. nd local matches Query is the function and each Di is the known input Inner match or local partial match D1 D2 D3 D4 PKU/2014-08-28 50

269. Distributed SPARQL Using Partial Query Evaluation Two steps: 1. Evaluate a query at each site to

270. nd local matches Query is the function and each Di is the known input Inner match or local partial match 2. Assemble the partial matches to get

271. nal result Crossing match Centralized assembly Distributed assembly D1 D2 D3 D4 Crossing match PKU/2014-08-28 50

272. Some Open Problems Handling data at non-SPARQL endpoint sites Modi

273. cation to SPARQL endpoints (for partial query evaluation) Heterogeneous use of vocabularies (use of ontologies) PKU/2014-08-28 51

275. Live Query Processing Not all data resides at SPARQL endpoints Freshness of access to data important Potentially countably in

276. nite data sources Live querying On-line execution Only rely on linked data principles Alternatives Traversal-based approaches Index-based approaches Hybrid approaches PKU/2014-08-28 53

277. SPARQL Query Semantics in Live Querying Full-web semantics Scope of evaluating a SPARQL expression is all Linked Data Query result completeness cannot be guaranteed by any (terminating) execution PKU/2014-08-28 54

278. SPARQL Query Semantics in Live Querying Full-web semantics Scope of evaluating a SPARQL expression is all Linked Data Query result completeness cannot be guaranteed by any (terminating) execution Reachability-based query semantics Query consists of a SPARQL expression, a set of seed URIs S, and a reachability condition c Scope: all data along paths of data links that satisfy the condition Computationally feasible PKU/2014-08-28 54

279. Traversal Approaches Discover relevant URIs recursively by traversing (speci

280. c) data links at query execution runtime [Hartig, 2013; Ladwig and Tran, 2011] Implements reachability-based query semantics Start from a set of seed URIs Recursively follow and discover new URIs Important issue is selection of seed URIs Retrieved data serves to discover new URIs and to construct result PKU/2014-08-28 55

282. c) data links at query execution runtime [Hartig, 2013; Ladwig and Tran, 2011] Implements reachability-based query semantics Advantages Easy to implement. No data structure to maintain. Start from a set of seed URIs Recursively follow and discover new URIs Important issue is selection of seed URIs Retrieved data serves to discover new URIs and to construct result PKU/2014-08-28 55

284. c) data links at query execution runtime [Hartig, 2013; Ladwig and Tran, 2011] Implements reachability-based query semantics Advantages Easy to implement. No data structure to maintain. Start from a set of seed URIs Recursively follow and discover new URIs Important issue is selection of seed URIs Retrieved data serves to discover new URIs and to construct result Disadvantages Possibilities for parallelized data retrieval are limited Repeated data retrieval introduces signi

285. cant query latency. PKU/2014-08-28 55

286. Traversal Optimization Dynamic query execution [Hartig and Ozsu, 2014] Data Retrieval ...lookup queue... Output PKU/2014-08-28 56

287. Traversal Optimization Dynamic query execution [Hartig and Ozsu, 2014] Prioritization of URIs { a number of alternatives Non-adaptive Adaptive, Local processing aware Adaptive, Local processing agnostic Intermediate solution driven Solution-aware graph-based Hybrid graph-based Purely graph-based PKU/2014-08-28 56

288. Index Approaches Use pre-populated index to determine relevant URIs (and to avoid as many irrelevant ones as possible) Dierent index keys possible; e.g., triple patterns [Umbrich et al., 2011] Index entries a set of URIs Indexed URIs may appear multiple times (i.e., associated with multiple index keys) Each URI in such an entry may be paired with a cardinality (utilized for source ranking) Key: tp Entry: furi1; uri2; ; uring GET urii PKU/2014-08-28 57

289. Index Approaches Use pre-populated index to determine relevant URIs (and to avoid as many irrelevant ones as possible) Dierent index keys possible; e.g., triple patterns [Umbrich et al., 2011] Index entries a set of URIs Indexed URIs may appear multiple times (i.e., associated with multiple index keys) Each URI in such an entry may be paired with a cardinality (utilized for source ranking) Advantages Data retrieval can be fully parallelized Reduces the impact of data retrieval on query execution time Key: tp Entry: furi1; uri2; ; uring GET urii PKU/2014-08-28 57

290. Index Approaches Use pre-populated index to determine relevant URIs (and to avoid as many irrelevant ones as possible) Dierent index keys possible; e.g., triple patterns [Umbrich et al., 2011] Index entries a set of URIs Indexed URIs may appear multiple times (i.e., associated with multiple index keys) Each URI in such an entry may be paired with a cardinality (utilized for source ranking) Advantages Data retrieval can be fully parallelized Reduces the impact of data retrieval on query execution time Key: tp Entry: furi1; uri2; ; uring Disadvantages Querying can only start after index GET construction urii Depends on what has been selected for the index Freshness may be an issue Index maintenance PKU/2014-08-28 57

291. Hybrid Approach Perform a traversal-based execution using a prioritized list of URIs to look up [Ladwig and Tran, 2010] Initial seed from the pre-populated index Non-seed URIs are ranked by a function based on information in the index New discovered URIs that are not in the index are ranked according to number of referring documents PKU/2014-08-28 58

292. Some Open Problems Optimize queries by using statistics collected during earlier query executions Heterogeneous use of vocabularies (use of ontologies) Combine SPARQL federation to leverage SPARQL endpoint functionality PKU/2014-08-28 59

294. Conclusions RDF and Linked Object Data seem to have considerable promise for Web data management 2014 2011 PKU/2014-08-28 61

295. Conclusions RDF and Linked Object Data seem to have considerable promise for Web data management More work needs to be done Query semantics Adaptive system design Optimizations { both in data warehousing and distributed environments Live querying requires signi

296. cant thought to reduce latency 2014 2011 PKU/2014-08-28 61

297. Conclusions What I did not talk about: Not much on general distributed/parallel processing Not much on SPARQL semantics Nothing about RDFS { no schema stu Nothing about entailment regimes 0 ) no reasoning PKU/2014-08-28 62

298. Thank you! Research supported by PKU/2014-08-28 63

299. References I Abadi, D. J., Marcus, A., Madden, S., and Hollenbach, K. (2009). SW-Store: a vertically partitioned DBMS for semantic web data management. VLDB J., 18(2):385{406. Abadi, D. J., Marcus, A., Madden, S. R., and Hollenbach, K. (2007). Scalable semantic web data management using vertical partitioning. In Proc. 33rd Int. Conf. on Very Large Data Bases, pages 411{422. Abiteboul, S., Quass, D., McHugh, J., Widom, J., and Wiener, J. (1997). The Lorel query language for semistructured data. Int. J. Digit. Libr., 1(1):68{88. Acosta, M., Vidal, M.-E., Lampo, T., Castillo, J., and Ruckhaus, E. (2011). ANAPSID: an adaptive query processing engine for SPARQL endpoints. In Proc. 10th Int. Semantic Web Conf., pages 18{34. Aluc, G., Hartig, O., Ozsu, M. T., and Daudjee, K. (2014). Diversi

300. ed stress testing of RDF data management systems. In Proc. 13th Int. Semantic Web Conf. Forthcoming. Aluc, G., Ozsu, M. T., Daudjee, K., and Hartig, O. (2013). chameleon-db: a workload-aware robust RDF data management system. Technical Report CS-2013-10, University of Waterloo. PKU/2014-08-28 64

301. References II Arocena, G. and Mendelzon, A. (1998). Weboql: Restructuring documents, databases and webs. In Proc. 14th Int. Conf. on Data Engineering, pages 24{33. Atre, M., Chaoji, V., Zaki, M. J., and Hendler, J. A. (2010). Matrix bit loaded: A scalable lightweight join query processor for rdf data. In Proc. 19th Int. World Wide Web Conf., pages 41{50. Bornea, M. A., Dolby, J., Kementsietsidis, A., Srinivas, K., Dantressangle, P., Udrea, O., and Bhattacharjee, B. (2013). Building an ecient RDF store over a relational database. In Proc. ACM SIGMOD Int. Conf. on Management of Data, pages 121{132. Buneman, P., Cong, G., Fan, W., and Kementsietsidis, A. (2006). Using partial evaluation in distributed query evaluation. In Proc. 32nd Int. Conf. on Very Large Data Bases, pages 211{222. Buneman, P., Davidson, S., Hillebrand, G. G., and Suciu, D. (1996). A query language and optimization techniques for unstructured data. In Proc. ACM SIGMOD Int. Conf. on Management of Data, pages 505{516. Fernandez, M., Florescu, D., and Levy, A. (1997). A query language for a web-site management system. ACM SIGMOD Rec., 26(3):4{11. PKU/2014-08-28 65

302. References III Gorlitz, O. and Staab, S. (2011). SPLENDID: SPARQL endpoint federation exploiting VOID descriptions. In Proc. 2nd Int. Workshop on Consuming Linked Data. Gurajada, S., Seufert, S., Miliaraki, I., and Theobald, M. (2014). TriAD: A distributed shared-nothing RDF engine based on asynchronous message passing. In Proc. ACM SIGMOD Int. Conf. on Management of Data, pages 289{300. Hartig, O. (2012). SPARQL for a web of linked data: Semantics and computability. In Proc. 9th Extended Semantic Web Conf., pages 8{23. Hartig, O. (2013). SQUIN: a traversal based query execution system for the web of linked data. In Proc. ACM SIGMOD Int. Conf. on Management of Data, pages 1081{1084. Hartig, O. and Ozsu, M. T. (2014). Optimizing response time of traversal-based query optimization. In preparation. Huang, J., Abadi, D. J., and Ren, K. (2011). Scalable SPARQL querying of large RDF graphs. Proc. VLDB Endowment, 4(11):1123{1134. PKU/2014-08-28 66

303. References IV Husain, M. F., McGlothlin, J., Masud, M. M., Khan, L. R., and Thuraisingham, B. (2011). Heuristics-based query processing for large RDF graphs using cloud computing. IEEE Trans. Knowl. and Data Eng., 23(9):1312{1327. Kaoudi, Z. and Manolescu, I. (2014). RDF in the clouds: A survey. VLDB J. Forthcoming. Konopnicki, D. and Shmueli, O. (1995). W3QS: A query system for the World Wide Web. In Proc. 21th Int. Conf. on Very Large Data Bases, pages 54{65. Ladwig, G. and Tran, T. (2010). Linked data query processing strategies. In Proc. 9th Int. Semantic Web Conf., pages 453{469. Ladwig, G. and Tran, T. (2011). SIHJoin: Querying remote and local linked data. In Proc. 8th Extended Semantic Web Conf., pages 139{153. Lakshmanan, L. V. S., Sadri, F., and Subramanian, I. N. (1996). A declarative language for querying and restructuring the Web. In Proc. 6th Int. Workshop on Research Issues on Data Eng., pages 12{21. Lee, K. and Liu, L. (2013). Scaling queries over big rdf graphs with semantic hash partitioning. Proc. VLDB Endowment, 6(14):1894{1905. PKU/2014-08-28 67

304. References V Mendelzon, A. O., Mihaila, G. A., and Milo, T. (1997). Querying the World Wide Web. Int. J. Digit. Libr., 1(1):54{67. Neumann, T. and Weikum, G. (2008). RDF-3X: a RISC-style engine for RDF. Proc. VLDB Endowment, 1(1):647{659. Neumann, T. and Weikum, G. (2009). The RDF-3X engine for scalable management of RDF data. VLDB J., 19(1):91{113. Papakonstantinou, Y., Garcia-Molina, H., and Widom, J. (1995). Object exchange across heterogeneous information sources. In Proc. 11th Int. Conf. on Data Engineering, pages 251{260. Peng, P., Zou, L., Ozsu, M. T., Chen, L., and Zhao, D. (2014). Processing SPARQL queries over linked data { a distributed graph-based approach. In submitted for publication. Prasser, F., Kemper, A., and Kuhn, K. A. (2012). Ecient distributed query processing for autonomous rdf databases. In Proc. 15th Int. Conf. on Extending Database Technology, pages 372{383. Schwarte, A., Haase, P., Hose, K., Schenkel, R., and Schmidt, M. (2011). Fedx: A federation layer for distributed query processing on linked open data. In Proc. 8th Extended Semantic Web Conf., pages 481{486. PKU/2014-08-28 68

305. References VI Umbrich, J., Hose, K., Karnstedt, M., Harth, A., and Polleres, A. (2011). Comparing data summaries for processing live queries over linked data. World Wide Web J., 14(5-6):495{544. Weiss, C., Karras, P., and Bernstein, A. (2008). Hexastore: sextuple indexing for semantic web data management. Proc. VLDB Endowment, 1(1):1008{1019. Wilkinson, K. (2006). Jena property table implementation. Technical Report HPL-2006-140, HP Laboratories Palo Alto. Zhang, X., Chen, L., Tong, Y., and Wang, M. (2013). EAGRE: Towards scalable I/O ecient SPARQL query evaluation on the cloud. In Proc. 29th Int. Conf. on Data Engineering, pages 565{576. Zou, L., Mo, J., Chen, L., Ozsu, M. T., and Zhao, D. (2011). gStore: answering SPARQL queries via subgraph matching. Proc. VLDB Endowment, 4(8):482{493. Zou, L., Ozsu, M. T., Chen, L., Shen, X., Huang, R., and Zhao, D. (2014). gStore: A graph-based SPARQL query engine. VLDB J., 23(4):565{590. PKU/2014-08-28 69

Web Data Management with RDF

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Similar to Web Data Management with RDF

Similar to Web Data Management with RDF (20)

Recently uploaded

Recently uploaded (20)

Web Data Management with RDF