Web Data Management with RDF
1. Web Data Management in RDF Age
M. Tamer Ozsu
University of Waterloo
David R. Cheriton School of Computer Science
PKU/2014-08-28 1
2. Acknowledgements
This presentation draws upon collaborative research and
discussions with the following colleagues (in alphabetical order)
Gunes Aluc, University of Waterloo
Khuzaima Daudjee, University of Waterloo
Olaf Hartig, University of Waterloo
Lei Chen, Hong Kong University of Science and Technology
Lei Zou, Peking University
3. Web Data Management
A long term research interest in the DB community
2000 2004
2011 2011
4. Interest Due to Properties of Web Data
Lack of a schema
Data is at best semi-structured
Missing data, additional attributes, similar data but not
identical
Volatility
Changes frequently
May conform to one schema now, but not later
Scale
Does it make sense to talk about a schema for Web?
How do you capture everything?
Querying difficulty
What is the user language?
What are the primitives?
Aren't search engines or metasearch engines sufficient?
6. More Recent Approaches to Web Querying
Fusion Tables
Users contribute data in spreadsheet, CSV, KML format
Possible joins between multiple data sets
Extensive visualization
XML
Data exchange language
Primarily tree based structure
<list title="MOVIES">
  <film>
    <title>The Shining</title>
    <director>Stanley Kubrick</director>
    <actor>Jack Nicholson</actor>
  </film>
  <film>
    <title>Spartacus</title>
    <director>Stanley Kubrick</director>
  </film>
  <film>
    <title>The Passenger</title>
    <actor>Jack Nicholson</actor>
  </film>
  ...
</list>
[Figure: tree view of the XML document: root, film, title (The Passenger), actor (Jack Nicholson)]
RDF (Resource Description Framework) & SPARQL
W3C recommendation
Simple, self-descriptive model
Building block of semantic web & Linked Open Data (LOD)
11. RDF and Semantic Web
RDF is a language for the conceptual modeling of information
about resources (web resources in our context)
A building block of semantic web
Facilitates exchange of information
Search engine results can be more focused and structured
Facilitates data integration (mashups)
Machine understandable
Understand the information on the web and the
interrelationships among them
12. RDF Uses
Yago and DBpedia extract facts from Wikipedia & represent
them as RDF -> structural queries
Communities build RDF data
E.g., biologists: Bio2RDF and Uniprot RDF
Web data integration
Linked Open Data Cloud
. . .
13. RDF Data Volumes . . .
. . . are growing, and fast
Linked data cloud currently consists of 325 datasets with
25B triples
Size almost doubling every year
[Figures: Linking Open Data cloud diagrams, by Richard Cyganiak and Anja Jentzsch, http://lod-cloud.net/. Datasets are grouped into Media, Geographic, Publications, User-generated content, Government, Cross-domain, and Life sciences.]
March '09: 89 datasets
September '10: 203 datasets
September '11: 295 datasets, 25B triples
April '14: 1091 datasets, ??? triples
Max Schmachtenberg, Christian Bizer, and Heiko Paulheim: Adoption of Linked
Data Best Practices in Different Topical Domains. In Proc. ISWC, 2014.
20. Three Approaches
Data warehousing
Consolidate data in a repository and query it
SPARQL federation
Leverage query services provided by data publishers
Live Linked Data querying
Navigate through LOD by looking up URIs at query execution
time
40. ... defined
Resource descriptions can be contributed by different
people/groups and can be located anywhere in the web
Integrated web database
xmlns:y=http://data.linkedmdb.org/resource/actor/
y:JN29704
y:JN29704:hasName Jack Nicholson
y:JN29704:BornOnDate 1937-04-22
JN29704:movieActor
y:TS2014
y:TS2014:title The Shining
y:TS2014:releaseDate 1980-05-23
41. RDF Data Model
Triple: Subject, Predicate (Property), Object $(s, p, o)$
  Subject: the entity that is described (URI or blank node)
  Predicate: a feature of the entity (URI)
  Object: value of the feature (URI, blank node or literal)
$(s, p, o) \in (U \cup B) \times U \times (U \cup B \cup L)$
Set of RDF triples is called an RDF graph
U: set of URIs
B: set of blank nodes
L: set of literals
[Figure: a Subject vertex (U or B) connected by a Predicate edge (U) to an Object vertex (U, B, or L)]
Subject                 Predicate            Object
http://...imdb.../...   ...                  ...
[...]film/3418          rdfs:label           The Passenger
geo:2635167             gn:name              United Kingdom
geo:2635167             gn:population        62348447
geo:2635167             gn:wikipediaArticle  wp:United Kingdom
bm:books/0743424425     dc:creator           bm:persons/Stephen+King
bm:books/0743424425     rev:rating           4.7
bm:books/0743424425     scom:hasOffer        bm:offers/0743424425amazonOffer
lexvo:iso639-3/eng      rdfs:label           English
lexvo:iso639-3/eng      lvont:usedIn         lexvo:iso3166/CA
lexvo:iso639-3/eng      lvont:usesScript     lexvo:script/Latn
(Subjects and predicates are URIs; objects are URIs or literals.)
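The typing constraint on triples can be checked mechanically. The following is a minimal Python sketch, not part of the original slides; the class names `URI`, `BlankNode`, `Literal` and the helper `is_valid_triple` are illustrative assumptions, and the sample triples are taken from the table above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class URI:
    value: str


@dataclass(frozen=True)
class BlankNode:
    label: str


@dataclass(frozen=True)
class Literal:
    value: str


def is_valid_triple(s, p, o):
    """Check that (s, p, o) is in (U ∪ B) × U × (U ∪ B ∪ L)."""
    return (isinstance(s, (URI, BlankNode))
            and isinstance(p, URI)
            and isinstance(o, (URI, BlankNode, Literal)))


uk = URI("geo:2635167")
# An RDF graph is simply a set of triples.
graph = {
    (uk, URI("gn:name"), Literal("United Kingdom")),
    (uk, URI("gn:population"), Literal("62348447")),
    (uk, URI("gn:wikipediaArticle"), URI("wp:United Kingdom")),
}

all_valid = all(is_valid_triple(*t) for t in graph)
# A literal is not allowed in the subject position.
literal_subject_ok = is_valid_triple(Literal("x"), URI("gn:name"), uk)
```

Here `all_valid` is true for the three sample triples, while a triple with a literal subject is rejected.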
63. RDF Graph
[Figure: example RDF graph. Vertices shown include mdb:film/2014 (movie:initial_release_date 1980-05-23), geo:2635167 (gn:name United Kingdom, gn:population 62348447), bm:books/0743424425 (rev:rating 4.7), bm:offers/0743424425amazonOffer, and mdb:actor/29704 (movie:actor_name Jack Nicholson); rdfs:label values shown include The Shining, The Passenger, and The Last Tycoon; movie:actor edges connect films to mdb:actor/29704.]
70. ... finite set $\mathcal{D}$ (documents), a Web of Linked
Data is a tuple $W = (D, adoc, data)$ where:
  - $D \subseteq \mathcal{D}$,
  - $adoc$ is a partial mapping from URIs to $D$, and
  - $data$ is a total mapping from $D$ to finite sets of RDF triples.
Web of Linked Data
A Web of Linked Data $W = (D, adoc, data)$
contains a data link from document $d \in D$ to
document $d' \in D$ if there exists a URI $u$ such
that:
  - $u$ is mentioned in an RDF triple $t \in data(d)$, and
  - $d' = adoc(u)$.
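The data-link definition above can be sketched in a few lines of Python. This is an illustrative sketch only: documents are plain strings, `adoc` is a dictionary (so it is naturally partial), and the document names and triples are made-up toy data rather than content from the slides.

```python
def data_links(D, adoc, data):
    """Return all pairs (d, d2) such that some URI u mentioned in a triple
    of data[d] satisfies adoc[u] == d2."""
    links = set()
    for d in D:
        for triple in data[d]:
            for term in triple:
                target = adoc.get(term)  # adoc is partial: may be undefined
                if target is not None:
                    links.add((d, target))
    return links


# Toy Web of Linked Data with two documents (hypothetical names).
D = {"doc_film", "doc_actor"}
adoc = {
    "mdb:film/2014": "doc_film",
    "mdb:actor/29704": "doc_actor",
}
data = {
    "doc_film": {("mdb:film/2014", "movie:actor", "mdb:actor/29704")},
    "doc_actor": {("mdb:actor/29704", "movie:actor_name", "Jack Nicholson")},
}

links = data_links(D, adoc, data)
```

In this toy example the film document links to the actor document because it mentions a URI that `adoc` resolves there; the reverse link does not exist. Note that the definition also admits links from a document to itself.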
75. RDF Query Model - SPARQL
Query Model - SPARQL Protocol and RDF Query Language
Given U (set of URIs), L (set of literals), and V (set of
variables), a SPARQL expression is defined recursively:
  an atomic triple pattern, which is an element of
  $(U \cup V) \times (U \cup V) \times (U \cup V \cup L)$
    ?x rdfs:label "The Shining"
  P FILTER R, where P is a graph pattern expression and R is a
  built-in SPARQL condition (i.e., analogous to a SQL predicate)
    ?x rev:rating ?p FILTER(?p > 3.0)
  P1 AND/OPT/UNION P2, where P1 and P2 are graph
  pattern expressions
Example:
SELECT ?name
WHERE {
  ?m rdfs:label ?name . ?m movie:director ?d .
  ?d movie:director_name "Stanley Kubrick" .
  ?m movie:relatedBook ?b . ?b rev:rating ?r .
  FILTER(?r > 4.0)
}
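To make the recursive definition concrete, here is a small Python sketch that evaluates a conjunction (AND) of triple patterns followed by a FILTER over an in-memory list of triples. It is a naive nested-loop evaluator written for illustration, not how any of the systems discussed here work; the function names and the toy triples (including the director and related-book edges of mdb:film/2014) are assumptions.

```python
def is_var(term):
    """Variables are strings starting with '?', as in SPARQL."""
    return isinstance(term, str) and term.startswith("?")


def match_pattern(pattern, triple, binding):
    """Extend `binding` so that `pattern` matches `triple`, or return None."""
    extended = dict(binding)
    for p_term, t_term in zip(pattern, triple):
        if is_var(p_term):
            # Bind the variable, or check consistency with an earlier binding.
            if extended.setdefault(p_term, t_term) != t_term:
                return None
        elif p_term != t_term:
            return None
    return extended


def evaluate(patterns, graph, condition=lambda b: True):
    """Evaluate P1 AND P2 AND ... and then apply the FILTER condition."""
    bindings = [{}]
    for pattern in patterns:
        next_bindings = []
        for b in bindings:
            for triple in graph:
                extended = match_pattern(pattern, triple, b)
                if extended is not None:
                    next_bindings.append(extended)
        bindings = next_bindings
    return [b for b in bindings if condition(b)]


# Toy data (illustrative assumption).
graph = [
    ("mdb:film/2014", "rdfs:label", "The Shining"),
    ("mdb:film/2014", "movie:director", "mdb:director/8476"),
    ("mdb:director/8476", "movie:director_name", "Stanley Kubrick"),
    ("mdb:film/2014", "movie:relatedBook", "bm:books/0743424425"),
    ("bm:books/0743424425", "rev:rating", 4.7),
    ("mdb:film/3418", "rdfs:label", "The Passenger"),
]

query = [
    ("?m", "rdfs:label", "?name"),
    ("?m", "movie:director", "?d"),
    ("?d", "movie:director_name", "Stanley Kubrick"),
    ("?m", "movie:relatedBook", "?b"),
    ("?b", "rev:rating", "?r"),
]

results = evaluate(query, graph, condition=lambda b: b["?r"] > 4.0)
names = [b["?name"] for b in results]
```

On this toy graph the only binding for `?name` is "The Shining"; "The Passenger" is dropped because it has no matching director triple.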
77. SPARQL Queries
[Figure: query graph of the example query: ?m -rdfs:label-> ?name; ?m -movie:director-> ?d; ?d -movie:director_name-> Stanley Kubrick; ?m -movie:relatedBook-> ?b; ?b -rev:rating-> ?r; FILTER(?r > 4.0)]
79. Naive Triple Store Design
SELECT ?name
WHERE {
  ?m rdfs:label ?name . ?m movie:director ?d .
  ?d movie:director_name "Stanley Kubrick" .
  ?m movie:relatedBook ?b . ?b rev:rating ?r .
  FILTER(?r > 4.0)
}
Triple table T:
Subject                 Property             Object
mdb:[...]               ...                  ...
[...]film/3418          rdfs:label           The Passenger
geo:2635167             gn:name              United Kingdom
geo:2635167             gn:population        62348447
geo:2635167             gn:wikipediaArticle  wp:United Kingdom
bm:books/0743424425     dc:creator           bm:persons/Stephen+King
bm:books/0743424425     rev:rating           4.7
bm:books/0743424425     scom:hasOffer        bm:offers/0743424425amazonOffer
lexvo:iso639-3/eng      rdfs:label           English
lexvo:iso639-3/eng      lvont:usedIn         lexvo:iso3166/CA
lexvo:iso639-3/eng      lvont:usesScript     lexvo:script/Latn
The query translated to SQL over T (columns s, p, o):
SELECT T1.o
FROM T AS T1, T AS T2, T AS T3,
     T AS T4, T AS T5
WHERE T1.p = 'rdfs:label'
  AND T2.p = 'movie:relatedBook'
  AND T3.p = 'movie:director'
  AND T4.p = 'rev:rating'
  AND T5.p = 'movie:director_name'
  AND T1.s = T2.s
  AND T1.s = T3.s
  AND T2.o = T4.s
  AND T3.o = T5.s
  AND T4.o > 4.0
  AND T5.o = 'Stanley Kubrick'
Easy to implement but too many self-joins!
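The five-way self-join can be tried out directly. The sketch below uses Python's built-in `sqlite3` module with a single three-column table `T`; the rows are toy data assumed for illustration (the slides' full table is not recoverable), and objects are stored as text, so the rating is cast before comparison.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE T (s TEXT, p TEXT, o TEXT)")
conn.executemany(
    "INSERT INTO T VALUES (?, ?, ?)",
    [
        # Toy triples (illustrative assumption).
        ("mdb:film/2014", "rdfs:label", "The Shining"),
        ("mdb:film/2014", "movie:director", "mdb:director/8476"),
        ("mdb:director/8476", "movie:director_name", "Stanley Kubrick"),
        ("mdb:film/2014", "movie:relatedBook", "bm:books/0743424425"),
        ("bm:books/0743424425", "rev:rating", "4.7"),
        ("mdb:film/3418", "rdfs:label", "The Passenger"),
    ],
)

# One alias of T per triple pattern: five patterns, five-way self-join.
sql = """
SELECT T1.o
FROM T AS T1, T AS T2, T AS T3, T AS T4, T AS T5
WHERE T1.p = 'rdfs:label'
  AND T2.p = 'movie:relatedBook'
  AND T3.p = 'movie:director'
  AND T4.p = 'rev:rating'
  AND T5.p = 'movie:director_name'
  AND T1.s = T2.s
  AND T1.s = T3.s
  AND T2.o = T4.s
  AND T3.o = T5.s
  AND CAST(T4.o AS REAL) > 4.0
  AND T5.o = 'Stanley Kubrick'
"""
rows = conn.execute(sql).fetchall()
```

Each additional triple pattern in the SPARQL query adds another alias of `T` to the FROM clause, which is where the self-join cost comes from.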
133. Exhaustive Indexing
RDF-3X [Neumann and Weikum, 2008, 2009], Hexastore
[Weiss et al., 2008]
Strings are mapped to ids using a mapping table
Triples are indexed in a clustered B+ tree in lexicographic
order
Create indexes for permutations of the three columns: SPO,
SOP, PSO, POS, OPS, OSP
Subject  Property  Object
0        1         2
0        3         4
5        6         7
8        9         5
...      ...       ...
[Figure: the original triple table is encoded through the mapping table and stored in a B+ tree; easy querying through mapping table]
141. Exhaustive Indexing - Query Execution
Each triple pattern can be answered by a range query
Joins between triple patterns computed using merge join
Join order is easy due to extensive indexing
Subject  Property  Object
0        1         2
0        3         4
5        6         7
8        9         5
...      ...       ...
ID  Value
0   mdb:film/2014
1   rdfs:label
2   The Shining
3   movie:initial_release_date
4   1980-05-23
5   mdb:director/8476
6   movie:director_name
7   Stanley Kubrick
8   mdb:film/2685
9   movie:director
Advantages
  - Eliminates some of the joins - they become range queries
  - Merge join is easy and fast
Disadvantages
  - Space usage
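The mapping table and the six permutation indexes can be imitated in a few lines of Python. This is a sketch under simplifying assumptions: sorted Python lists stand in for clustered B+ trees, ids are assigned in order of first appearance, and the helper `range_lookup` is a hypothetical name. It reproduces the encoded table shown above.

```python
from bisect import bisect_left, bisect_right
from itertools import permutations

triples = [
    ("mdb:film/2014", "rdfs:label", "The Shining"),
    ("mdb:film/2014", "movie:initial_release_date", "1980-05-23"),
    ("mdb:director/8476", "movie:director_name", "Stanley Kubrick"),
    ("mdb:film/2685", "movie:director", "mdb:director/8476"),
]

# Mapping table: string -> integer id, assigned in order of first appearance.
ids = {}
for triple in triples:
    for term in triple:
        ids.setdefault(term, len(ids))

encoded = [tuple(ids[term] for term in triple) for triple in triples]

# One sorted list per column permutation (SPO, SOP, PSO, POS, OSP, OPS).
POSITIONS = {"S": 0, "P": 1, "O": 2}
indexes = {}
for order in permutations("SPO"):
    name = "".join(order)
    indexes[name] = sorted(
        tuple(t[POSITIONS[c]] for c in order) for t in encoded
    )


def range_lookup(index_name, prefix):
    """Answer a triple pattern whose bound positions form `prefix`
    with a single range scan over the chosen index."""
    index = indexes[index_name]
    lo = bisect_left(index, prefix)
    upper = prefix + (float("inf"),) * (3 - len(prefix))
    hi = bisect_right(index, upper)
    return index[lo:hi]


# Pattern (film/2014, ?p, ?o): subject bound, so use SPO.
by_subject = range_lookup("SPO", (ids["mdb:film/2014"],))
# Pattern (?s, movie:director, director/8476): use POS.
by_pred_obj = range_lookup(
    "POS", (ids["movie:director"], ids["mdb:director/8476"])
)
```

Whatever combination of positions is bound in a triple pattern, one of the six orderings has those positions as a prefix, so the pattern becomes one contiguous range. The price is storing every triple six times, which is the space disadvantage noted above.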
150. Property Tables
Grouping by entities; Jena [Wilkinson, 2006], DB2-RDF
[Bornea et al., 2013]
Clustered property table: group together the properties that
tend to occur in the same (or similar) subjects
Property-class table: cluster the subjects with the same type
of property into one property table
[Tables, partly lost in extraction: the triple table (Subject, Property, Object) is split into a film property table, which contains the row (mdb:film/2685, The Clockwork Orange, mdb:director/8476), and an actor property table with columns Subject and movie:actor_name, which contains the row (mdb:actor, Jack Nicholson)]
Advantages
  - Fewer joins
  - If the data is structured, we have a relational system, similar
    to normalized relations
Disadvantages
  - Potentially a lot of NULLs
  - Clustering is not trivial
  - Multi-valued properties are complicated
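A rough Python sketch of the idea follows. It is not the Jena or DB2-RDF layout; the function `build_property_table`, the chosen property group, and the toy triples are assumptions made for illustration. It shows both the benefit (one wide row per subject, so no join is needed to fetch several properties) and the NULL problem.

```python
from collections import defaultdict

triples = [
    ("mdb:film/2014", "rdfs:label", "The Shining"),
    ("mdb:film/2014", "movie:initial_release_date", "1980-05-23"),
    ("mdb:film/2685", "rdfs:label", "The Clockwork Orange"),
    ("mdb:film/2685", "movie:director", "mdb:director/8476"),
    ("mdb:actor/29704", "movie:actor_name", "Jack Nicholson"),
]


def build_property_table(triples, properties):
    """Group the chosen properties into one wide row per subject.

    Missing values stay None (NULL). Triples with other properties are
    returned as leftovers. A second value for the same (subject, property)
    would overwrite the first, which is why multi-valued properties are
    awkward in this layout.
    """
    rows = defaultdict(lambda: dict.fromkeys(properties))
    leftovers = []
    for s, p, o in triples:
        if p in properties:
            rows[s][p] = o
        else:
            leftovers.append((s, p, o))
    return dict(rows), leftovers


film_props = ["rdfs:label", "movie:initial_release_date", "movie:director"]
film_table, rest = build_property_table(triples, film_props)

null_count = sum(
    value is None for row in film_table.values() for value in row.values()
)
```

With only two films, two of the six cells are already NULL, because neither film has all three properties.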
171. Binary Tables
Grouping by properties: For each property, build a two-column
table, containing both subject and object, ordered by subjects
[Abadi et al., 2007, 2009]
Also called vertically partitioned tables
n two-column tables (n is the number of unique properties in
the data)
[Tables, partly lost in extraction: the triple table (Subject, Property, Object) is split into one (Subject, Object) table per property; for example, a label table containing a row for [...]film/2685 with value The Clockwork Orange, and a movie:actor_name table containing the row (mdb:actor/29704, Jack Nicholson)]
Advantages
  - Supports multi-valued properties
  - No NULLs
  - No clustering
  - Read only needed attributes (i.e. less I/O)
  - Good performance for subject-subject joins
Disadvantages
  - Not useful for subject-object joins
  - Expensive inserts
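The sketch below illustrates vertical partitioning and why subject-subject joins are cheap: every binary table is sorted by subject, so two of them can be combined with a linear merge join. The function names and toy triples are assumptions for illustration; real systems in this family are column stores, not Python lists.

```python
from collections import defaultdict


def vertical_partition(triples):
    """Build one (subject, object) table per property, sorted by subject."""
    tables = defaultdict(list)
    for s, p, o in triples:
        tables[p].append((s, o))
    return {p: sorted(rows) for p, rows in tables.items()}


def merge_join_on_subject(left, right):
    """Merge join of two subject-sorted binary tables.

    Returns (subject, left_object, right_object) tuples.
    """
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i][0] < right[j][0]:
            i += 1
        elif left[i][0] > right[j][0]:
            j += 1
        else:
            s = left[i][0]
            # Find the run of equal subjects on both sides.
            i_end = i
            while i_end < len(left) and left[i_end][0] == s:
                i_end += 1
            j_end = j
            while j_end < len(right) and right[j_end][0] == s:
                j_end += 1
            for _, lo in left[i:i_end]:
                for _, ro in right[j:j_end]:
                    out.append((s, lo, ro))
            i, j = i_end, j_end
    return out


triples = [
    ("mdb:film/2014", "rdfs:label", "The Shining"),
    ("mdb:film/2685", "rdfs:label", "The Clockwork Orange"),
    ("mdb:film/2685", "movie:director", "mdb:director/8476"),
    ("mdb:actor/29704", "movie:actor_name", "Jack Nicholson"),
]

tables = vertical_partition(triples)
joined = merge_join_on_subject(tables["rdfs:label"], tables["movie:director"])
```

A subject-object join (for example, following `movie:director` to the director's name) cannot use this merge directly, because the left table is sorted by subject rather than by object; that is the first disadvantage listed above.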
198. Graph-based Approach
Answering SPARQL query ≡ subgraph matching
gStore [Zou et al., 2011, 2014]
[Figure: the query graph of the example query (?m, ?d, ?name, ?b, ?r, with FILTER(?r > 4.0)) is matched against the RDF data graph by subgraph matching]
Advantages
  - Maintains the graph structure
  - Full set of queries can be handled
Disadvantages
  - Graph pattern matching is expensive
216. gStore
General Approach:
Work directly on the RDF graph and the SPARQL query graph
Use a signature-based encoding of each entity and class vertex
to speed up matching
Filter-and-evaluate
Use a false positive algorithm to prune nodes and obtain a set
of candidates; then do more detailed evaluation on those
Use an index (VS-tree) over the data signature graph (has
light maintenance load) for efficient pruning
217. 1. Encode Q and G to Get Signature Graphs
[Figure: query signature graph Q* and data signature graph G*. Each vertex carries a bit-string signature (e.g., 0010 1000) and each edge a bit-string label (e.g., 00010).]
218. 2. Filter-and-Evaluate
[Figure: query signature graph Q* and data signature graph G*, as on the previous slide]
Find matches of Q* over signature graph G*
Verify each match in RDF graph G
219. How to Generate Candidate List
Two step process:
1. For each node of Q* get lists of nodes in G* that include that
node.
2. Do a multi-way join to get the candidate list
Alternatives:
Sequential scan of G*
  Both steps are inefficient
Use S-trees
  Height-balanced tree over signatures
  Run an inclusion query for each node of Q* and get lists of
  nodes in G* that include that node.
  Given query signature $q$ and a set of data signatures $S$,
  find all data signatures $s_i \in S$ where $q \,\&\, s_i = q$
  Does not support second step - expensive
VS-tree (and VS*-tree)
  Multi-resolution summary graph based on S-tree
  Supports both steps efficiently
  Grouping by vertices
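The inclusion test at the heart of the filter step is a bitwise check. The Python sketch below is a simplified stand-in, not gStore's actual encoding: the hash (sum of bytes modulo the signature width), the 16-bit width, and the use of adjacent property names as the only signature input are all assumptions chosen to keep the example deterministic. It performs the sequential-scan alternative; an S-tree or VS-tree would replace the loop in `candidates` with an index traversal.

```python
def signature(labels, bits=16):
    """OR together one hashed bit per label (Bloom-filter-like encoding)."""
    sig = 0
    for label in labels:
        sig |= 1 << (sum(label.encode()) % bits)
    return sig


def candidates(query_sig, data_sigs):
    """Filter step: keep vertices s_i with (q & s_i) == q.

    May return false positives, never false negatives.
    """
    return [v for v, s in data_sigs.items() if (query_sig & s) == query_sig]


def verify(query_labels, vertex, adjacency):
    """Evaluate step: check the real labels to discard false positives."""
    return set(query_labels) <= set(adjacency[vertex])


# Outgoing property labels per data vertex (toy data, an assumption).
adjacency = {
    "mdb:film/2014": ["rdfs:label", "movie:director", "movie:relatedBook"],
    "mdb:film/3418": ["rdfs:label"],
    "bm:books/0743424425": ["rev:rating", "dc:creator"],
}
data_sigs = {v: signature(labels) for v, labels in adjacency.items()}

# Query vertex ?m needs a label, a director, and a related book.
query_labels = ["rdfs:label", "movie:director", "movie:relatedBook"]
q = signature(query_labels)

filtered = candidates(q, data_sigs)
matches = [v for v in filtered if verify(query_labels, v, adjacency)]
```

Because distinct labels can hash to the same bit, a vertex may pass the bitwise test without really having all required properties, which is why the verification step against the RDF graph is still needed.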
240. Adaptivity to Workload
Web applications that are supported by RDF data
management systems are far more varied than conventional
relational applications
Data that are being handled are far more heterogeneous
SPARQL is far more flexible in how triple patterns (i.e., the
atomic query unit) can be combined
An experiment [Aluc et al., 2014]
                                              RDF-3X  VOS (6.1)  VOS (7.1)  MonetDB  4Store
% queries for which tested system is fastest    20.9        0.0       22.6     56.5     0.0
Total workload execution time (hours)           27.1       20.9       20.8     38.6    72.2
Mean (per query) execution time (seconds)        7.8        6.0        6.0     11.1    20.7
Summary of Experiments
  - No single system is a sole winner across all queries
  - No single system is the sole loser across all queries, either
  - There can be 2-5 orders of magnitude difference in the performance (i.e., query
    execution time) between the best and the worst system for a given query
  - The winner in one query may timeout in another
  - Performance difference widens as dataset size increases
246. ... fixed size, (b) contain
same set of attributes
1. Workload time analysis
2. Updating the physical layout
3. Partial indexing
[Figure: SPARQL query engine, index, cache, and storage system; the hash function at time t1 adapts into the hash function at time tk, and entries are evicted from the cache]
249. chameleon-db
Prototype system [Aluc et al., 2013]
35,000 lines of code in C++ and growing
[Figure: chameleon-db architecture, showing the Query Engine (Plan Generation, Evaluation), the Storage System, the Storage Advisor, and the Structural Index, Vertex Index, Spill Index, and Cluster Index]
250. Some Open Problems
Scalability of the solutions to very large datasets
Maintenance of auxiliary data structures in dynamic
environments
Adaptive systems to handle varying and time-changing
workloads
Uncertain RDF data processing
Keyword search over RDF data
Query processing over incomplete RDF data
252. Remember the Environment
Distributed environment
Some of the data sites can
process SPARQL queries {
SPARQL endpoints
Not all data sites can
process queries
PKU/2014-08-28 46
253. Remember the Environment
Distributed environment
Some of the data sites can
process SPARQL queries {
SPARQL endpoints
Not all data sites can
process queries
Alternatives
PKU/2014-08-28 46
254. Remember the Environment
Distributed environment
Some of the data sites can
process SPARQL queries {
SPARQL endpoints
Not all data sites can
process queries
Alternatives
Data re-distribution +
query decomposition
PKU/2014-08-28 46
255. Remember the Environment
Distributed environment
Some of the data sites can
process SPARQL queries {
SPARQL endpoints
Not all data sites can
process queries
Alternatives
Data re-distribution +
query decomposition
SPARQL federation: just
process at SPARQL
endpoints
PKU/2014-08-28 46
256. Remember the Environment
Distributed environment
Some of the data sites can process SPARQL queries: SPARQL endpoints
Not all data sites can process queries
Alternatives
Data re-distribution + query decomposition
SPARQL federation: just process at SPARQL endpoints
Live querying (see next section)
260. Distributed RDF Processing [Kaoudi and Manolescu, 2014]
Data partitioning approaches
RDF data warehouse is partitioned and distributed
RDF data $D = \{D_1, \ldots, D_n\}$
Allocate each $D_i$ to a site
Partitioning alternatives
Table-based (e.g., [Husain et al., 2011])
Graph-based (e.g., [Huang et al., 2011; Zhang et al., 2013])
Unit-based (e.g., [Gurajada et al., 2014; Lee and Liu, 2013])
SPARQL query decomposed $Q = \{Q_1, \ldots, Q_k\}$
Distributed execution of $\{Q_1, \ldots, Q_k\}$ over $\{D_1, \ldots, D_n\}$
- High performance
- Great for parallelizing centralized RDF data
- May not be possible to re-partition and re-allocate Web data (i.e., LOD)
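The partition-and-allocate idea can be sketched in a few lines. This is a minimal illustration only: it hash-partitions triples by subject, which is a simple scheme chosen here for brevity and is not the table-, graph-, or unit-based method of any cited system. The names `site_of`, `partition_by_subject`, and `sites_for_pattern` are hypothetical.

```python
import zlib
from collections import defaultdict


def site_of(term, n_sites):
    # Deterministic hash; Python's built-in hash() is salted per process.
    return zlib.crc32(term.encode("utf-8")) % n_sites


def partition_by_subject(triples, n_sites):
    """Split D into n fragments; triples sharing a subject are co-located."""
    fragments = defaultdict(list)
    for s, p, o in triples:
        fragments[site_of(s, n_sites)].append((s, p, o))
    return [fragments[i] for i in range(n_sites)]


def sites_for_pattern(pattern, n_sites):
    """Sites that may hold matches for a triple pattern (None = variable)."""
    s, _, _ = pattern
    if s is not None:
        return [site_of(s, n_sites)]  # bound subject: one site suffices
    return list(range(n_sites))       # otherwise every site must be asked


triples = [
    ("ex:film1", "ex:director", "ex:kubrick"),
    ("ex:film1", "ex:actor", "ex:nicholson"),
    ("ex:film2", "ex:director", "ex:kubrick"),
]
fragments = partition_by_subject(triples, n_sites=3)
```

A pattern with a bound subject is routed to a single site, while a pattern with a variable subject has to be sent to all sites. That trade-off is one reason the choice of partitioning scheme matters.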
263. Distributed RDF Processing - 2
Data summary-based approaches
Build summaries (index) for the distributed RDF datasets
(e.g., [Atre et al., 2010; Prasser et al., 2012])
SPARQL query $Q = \{Q_1, \ldots, Q_k\}$
Distributed execution of $\{Q_1, \ldots, Q_k\}$ using the data summary
- No data re-partitioning and re-allocation
- Have to scan the data at each site
- Index over distributed data with maintenance concerns
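A rough sketch of how a data summary can drive source selection is given below. It assumes a deliberately simple summary that maps each predicate to the sites mentioning it; this structure, and the names `build_summary` and `select_sources`, are assumptions made for illustration and do not reproduce the summaries used in the cited work.

```python
from collections import defaultdict


def build_summary(datasets):
    """Map each predicate to the set of sites whose data mentions it."""
    summary = defaultdict(set)
    for site, triples in datasets.items():
        for _, p, _ in triples:
            summary[p].add(site)
    return summary


def select_sources(summary, triple_patterns):
    """For each triple pattern, list candidate sites using only the summary."""
    all_sites = set().union(*summary.values())
    plan = {}
    for tp in triple_patterns:
        p = tp[1]
        # A variable predicate (written "?x") cannot be pruned by this summary.
        candidates = all_sites if p.startswith("?") else summary.get(p, set())
        plan[tp] = sorted(candidates)
    return plan


datasets = {
    "site1": [("ex:a", "ex:knows", "ex:b")],
    "site2": [("ex:b", "ex:name", "Bob")],
    "site3": [("ex:c", "ex:knows", "ex:a")],
}
summary = build_summary(datasets)
plan = select_sources(summary, [("?x", "ex:knows", "?y"), ("?y", "ex:name", "?n")])
```

The data stay where they are; only the summary is centralized. The summary must be refreshed whenever a site's data change, which is the maintenance concern noted above.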
266. SPARQL Endpoint Federation
Consider only the SPARQL endpoints for query execution
No data re-partitioning/re-distribution
Consider $D = D_1 \cup D_2 \cup \ldots \cup D_n$; $D_i$: SPARQL endpoint
Alternatives
SPARQL query decomposed $Q = \{Q_1, \ldots, Q_k\}$ and executed
over $\{D_1, \ldots, D_n\}$: DARQ, FedX [Schwarte et al., 2011],
SPLENDID [Gorlitz and Staab, 2011], ANAPSID [Acosta
et al., 2011]
Partial query evaluation: Distributed gStore [Peng et al.,
2014]
Partial evaluation
- Given function $f(s, d)$ and part of its input $s$, perform $f$'s
  computation that only depends on $s$ to get $f'(d)$
- Compute $f'(d)$ when $d$ becomes available
- Applied to, e.g., XML [Buneman et al., 2006]
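Partial evaluation in general can be illustrated with a toy function. The function `f` below is invented purely for illustration and has nothing to do with SPARQL; the point is only that the work depending on `s` is done once, and the residual function f'(d) is applied later when `d` arrives.

```python
def f(s, d):
    """Full computation: needs both inputs."""
    return sum(x * x for x in s) + len(s) * d


def partially_evaluate(s):
    """Do the work that depends only on s now; return the residual f'(d)."""
    sum_sq = sum(x * x for x in s)  # computed once, before d is known
    n = len(s)

    def f_prime(d):
        return sum_sq + n * d       # only the d-dependent part remains

    return f_prime


f_prime = partially_evaluate([1, 2, 3])
```

For any later `d`, `f_prime(d)` gives the same value as `f([1, 2, 3], d)` without re-reading `s`.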
269. Distributed SPARQL Using Partial Query Evaluation
Two steps:
1. Evaluate a query at each site to find local matches
Query is the function and each $D_i$ is the known input
Inner match or local partial match
2. Assemble the partial matches to get final result
Crossing match
Centralized assembly
Distributed assembly
[Figure: data graph split into fragments D1, D2, D3, D4, with a crossing match spanning fragments]
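The two steps can be sketched in a heavily simplified form with centralized assembly. This sketch makes several assumptions. A "local partial match" is reduced to the binding of a single triple pattern, which is much cruder than the notion used in Distributed gStore. Patterns have a constant predicate and variable subject and object. The names `local_matches` and `assemble` are hypothetical.

```python
def local_matches(fragment, query):
    """Step 1 (at one site): bindings for each triple pattern, found locally."""
    result = []
    for idx, (qs, qp, qo) in enumerate(query):
        for s, p, o in fragment:
            if p == qp:
                result.append((idx, {qs: s, qo: o}))
    return result


def assemble(per_site, n_patterns):
    """Step 2 (centralized): join partial matches that agree on shared variables."""
    results = []

    def extend(idx, binding, sites):
        if idx == n_patterns:
            # A result built at one site is an inner match; otherwise it crosses sites.
            kind = "inner" if len(sites) == 1 else "crossing"
            results.append((kind, binding))
            return
        for site, matches in per_site.items():
            for m_idx, m in matches:
                compatible = all(binding.get(v, val) == val for v, val in m.items())
                if m_idx == idx and compatible:
                    extend(idx + 1, {**binding, **m}, sites | {site})

    extend(0, {}, frozenset())
    return results


query = [("?x", "p", "?y"), ("?y", "q", "?z")]
fragments = {
    "D1": [("a", "p", "b"), ("b", "q", "c"), ("a", "p", "d")],
    "D2": [("d", "q", "e")],
}
per_site = {site: local_matches(frag, query) for site, frag in fragments.items()}
results = assemble(per_site, len(query))
```

In this toy data the path a-b-c lies entirely in D1 (an inner match), whereas a-d-e needs one triple from each fragment (a crossing match).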
275. Live Query Processing
Not all data resides at SPARQL endpoints
Freshness of access to data important
Potentially countably infinite data sources
Live querying
On-line execution
Only rely on linked data principles
Alternatives
Traversal-based approaches
Index-based approaches
Hybrid approaches
278. SPARQL Query Semantics in Live Querying
Full-web semantics
Scope of evaluating a SPARQL expression is all Linked Data
Query result completeness cannot be guaranteed by any
(terminating) execution
Reachability-based query semantics
Query consists of a SPARQL expression, a set of seed URIs S,
and a reachability condition c
Scope: all data along paths of data links that satisfy the
condition
Computationally feasible
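Reachability-based scoping can be sketched as a bounded graph traversal. In this minimal illustration the Web is replaced by an in-memory dictionary from URI to triples, the reachability condition c is modelled as a Python predicate over (triple, URI), and only subject and object positions are followed. All names here (`traverse`, `follow_all`, `follow_none`, the `http://ex/` URIs) are hypothetical, and evaluating the SPARQL expression over the collected triples is left out.

```python
from collections import deque


def is_uri(term):
    return term.startswith("http://")


def follow_all(triple, uri):
    return True


def follow_none(triple, uri):
    return False


def traverse(web, seeds, condition, max_lookups=1000):
    """Collect the triples reachable from the seed URIs under `condition`."""
    queue, seen, data = deque(seeds), set(seeds), set()
    lookups = 0
    while queue and lookups < max_lookups:
        uri = queue.popleft()
        lookups += 1
        for triple in web.get(uri, []):  # stands in for an HTTP lookup
            data.add(triple)
            for term in (triple[0], triple[2]):
                # Follow a data link only if it is new and the condition allows it.
                if is_uri(term) and term not in seen and condition(triple, term):
                    seen.add(term)
                    queue.append(term)
    return data


web = {
    "http://ex/a": [("http://ex/a", "knows", "http://ex/b"),
                    ("http://ex/a", "name", "Alice")],
    "http://ex/b": [("http://ex/b", "knows", "http://ex/c")],
    "http://ex/c": [("http://ex/c", "name", "Carol")],
}
scope_all = traverse(web, ["http://ex/a"], follow_all)
scope_seed_only = traverse(web, ["http://ex/a"], follow_none)
```

With `follow_none` the scope is just the seed document. With `follow_all` it grows to everything reachable, which is why the choice of seeds and condition determines both cost and completeness.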
284. [...] (specific) data links
at query execution runtime [Hartig,
2013; Ladwig and Tran, 2011]
Implements reachability-based query semantics
Start from a set of seed URIs
Recursively follow and discover new URIs
Important issue is selection of seed URIs
Retrieved data serves to discover new URIs and to construct result
Advantages
Easy to implement.
No data structure to maintain.
Disadvantages
Possibilities for parallelized data retrieval are limited
Repeated data retrieval introduces significant [...]
286. Traversal Optimization
Dynamic query execution [Hartig and Ozsu, 2014]
[Figure: Data Retrieval, lookup queue, Output]
Prioritization of URIs: a number of alternatives
Non-adaptive
Adaptive, local processing aware
Adaptive, local processing agnostic
Approaches: intermediate solution driven; solution-aware graph-based; hybrid graph-based; purely graph-based
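A prioritized lookup queue might be sketched as follows. The example shows only a non-adaptive policy, in which a URI's score is fixed when it is discovered (here, its discovery depth, which gives breadth-first order). The adaptive variants listed above would re-score queued URIs during execution, which this sketch does not attempt. The class name `LookupQueue` and its methods are hypothetical.

```python
import heapq
import itertools


class LookupQueue:
    """Priority queue of URIs awaiting lookup; lower score = looked up sooner."""

    def __init__(self):
        self._heap = []
        self._seen = set()
        self._counter = itertools.count()  # tie-breaker: FIFO among equal scores

    def push(self, uri, score):
        if uri not in self._seen:          # each URI is queued at most once
            self._seen.add(uri)
            heapq.heappush(self._heap, (score, next(self._counter), uri))

    def pop(self):
        return heapq.heappop(self._heap)[2]

    def __len__(self):
        return len(self._heap)


queue = LookupQueue()
queue.push("http://ex/a", 0)
queue.push("http://ex/c", 2)
queue.push("http://ex/b", 1)
queue.push("http://ex/a", 5)  # ignored: already queued
order = [queue.pop() for _ in range(len(queue))]
```

Swapping the scoring function changes the traversal order without touching the retrieval loop, which is what makes it convenient to compare prioritization alternatives.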
290. Index Approaches
Use pre-populated index to determine relevant URIs (and to
avoid as many irrelevant ones as possible)
Different index keys possible; e.g., triple patterns [Umbrich
et al., 2011]
Index entries: a set of URIs
Indexed URIs may appear multiple times (i.e., associated with
multiple index keys)
Each URI in such an entry may be paired with a cardinality
(utilized for source ranking)
Key: $tp$; Entry: $\{uri_1, uri_2, \ldots, uri_n\}$; retrieval: GET $uri_i$
Advantages
Data retrieval can be fully parallelized
Reduces the impact of data retrieval on query execution time
Disadvantages
Querying can only start after index construction
Depends on what has been selected for the index
Freshness may be an issue
Index maintenance
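The index-then-fetch pattern might look roughly like this. The sketch assumes an index keyed by triple pattern whose entries pair each URI with a cardinality. Ranking by summed cardinality is one simple choice made for illustration, and `fetch` is a stand-in for an HTTP GET. The index contents and the names `rank_sources` and `fetch_all` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical index: triple-pattern key -> {URI: cardinality}.
index = {
    ("?s", "foaf:knows", "?o"): {"http://ex/a": 12, "http://ex/b": 3},
    ("?s", "foaf:name", "?o"): {"http://ex/a": 1, "http://ex/c": 7},
}


def rank_sources(index, triple_patterns, top_k=None):
    """Sum cardinalities per URI over all query patterns; highest first."""
    scores = {}
    for tp in triple_patterns:
        for uri, card in index.get(tp, {}).items():
            scores[uri] = scores.get(uri, 0) + card
    ranked = sorted(scores, key=lambda u: (-scores[u], u))
    return ranked[:top_k] if top_k else ranked


def fetch_all(uris, fetch, workers=8):
    """All relevant URIs are known up front, so lookups can run in parallel."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(zip(uris, pool.map(fetch, uris)))


query = [("?s", "foaf:knows", "?o"), ("?s", "foaf:name", "?o")]
ranked = rank_sources(index, query)
documents = fetch_all(ranked, fetch=lambda uri: f"<contents of {uri}>")
```

Unlike traversal, nothing here is discovered at runtime: a URI missing from the index is never retrieved, which is the dependence on index content and freshness listed among the disadvantages.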
291. Hybrid Approach
Perform a traversal-based execution using a prioritized list of
URIs to look up [Ladwig and Tran, 2010]
Initial seed from the pre-populated index
Non-seed URIs are ranked by a function based on information
in the index
Newly discovered URIs that are not in the index are ranked
according to the number of referring documents
292. Some Open Problems
Optimize queries by using statistics collected during earlier
query executions
Heterogeneous use of vocabularies (use of ontologies)
Combine with SPARQL federation to leverage SPARQL endpoint
functionality
295. Conclusions
RDF and Linked Open Data seem to have considerable
promise for Web data management
More work needs to be done
Query semantics
Adaptive system design
Optimizations, both in data warehousing and distributed
environments
Live querying requires significant [...]
297. Conclusions
What I did not talk about:
Not much on general distributed/parallel processing
Not much on SPARQL semantics
Nothing about RDFS - no schema stuff
Nothing about entailment regimes 0 ⇒ no reasoning
299. References I
Abadi, D. J., Marcus, A., Madden, S., and Hollenbach, K. (2009). SW-Store: a
vertically partitioned DBMS for semantic web data management. VLDB J.,
18(2):385-406.
Abadi, D. J., Marcus, A., Madden, S. R., and Hollenbach, K. (2007). Scalable
semantic web data management using vertical partitioning. In Proc. 33rd
Int. Conf. on Very Large Data Bases, pages 411-422.
Abiteboul, S., Quass, D., McHugh, J., Widom, J., and Wiener, J. (1997). The
Lorel query language for semistructured data. Int. J. Digit. Libr., 1(1):68-88.
Acosta, M., Vidal, M.-E., Lampo, T., Castillo, J., and Ruckhaus, E. (2011).
ANAPSID: an adaptive query processing engine for SPARQL endpoints. In
Proc. 10th Int. Semantic Web Conf., pages 18-34.
Aluc, G., Hartig, O., Ozsu, M. T., and Daudjee, K. (2014). Diversified stress
testing of RDF data management systems. In Proc. 13th Int. Semantic Web
Conf. Forthcoming.
Aluc, G., Ozsu, M. T., Daudjee, K., and Hartig, O. (2013). chameleon-db: a
workload-aware robust RDF data management system. Technical Report
CS-2013-10, University of Waterloo.
301. References II
Arocena, G. and Mendelzon, A. (1998). WebOQL: Restructuring documents,
databases and webs. In Proc. 14th Int. Conf. on Data Engineering, pages
24-33.
Atre, M., Chaoji, V., Zaki, M. J., and Hendler, J. A. (2010). Matrix bit
loaded: A scalable lightweight join query processor for RDF data. In Proc.
19th Int. World Wide Web Conf., pages 41-50.
Bornea, M. A., Dolby, J., Kementsietsidis, A., Srinivas, K., Dantressangle, P.,
Udrea, O., and Bhattacharjee, B. (2013). Building an efficient RDF store
over a relational database. In Proc. ACM SIGMOD Int. Conf. on
Management of Data, pages 121-132.
Buneman, P., Cong, G., Fan, W., and Kementsietsidis, A. (2006). Using partial
evaluation in distributed query evaluation. In Proc. 32nd Int. Conf. on Very
Large Data Bases, pages 211-222.
Buneman, P., Davidson, S., Hillebrand, G. G., and Suciu, D. (1996). A query
language and optimization techniques for unstructured data. In Proc. ACM
SIGMOD Int. Conf. on Management of Data, pages 505-516.
Fernandez, M., Florescu, D., and Levy, A. (1997). A query language for a
web-site management system. ACM SIGMOD Rec., 26(3):4-11.
302. References III
Gorlitz, O. and Staab, S. (2011). SPLENDID: SPARQL endpoint federation
exploiting VOID descriptions. In Proc. 2nd Int. Workshop on Consuming
Linked Data.
Gurajada, S., Seufert, S., Miliaraki, I., and Theobald, M. (2014). TriAD: A
distributed shared-nothing RDF engine based on asynchronous message
passing. In Proc. ACM SIGMOD Int. Conf. on Management of Data, pages
289-300.
Hartig, O. (2012). SPARQL for a web of linked data: Semantics and
computability. In Proc. 9th Extended Semantic Web Conf., pages 8-23.
Hartig, O. (2013). SQUIN: a traversal based query execution system for the
web of linked data. In Proc. ACM SIGMOD Int. Conf. on Management of
Data, pages 1081-1084.
Hartig, O. and Ozsu, M. T. (2014). Optimizing response time of
traversal-based query optimization. In preparation.
Huang, J., Abadi, D. J., and Ren, K. (2011). Scalable SPARQL querying of
large RDF graphs. Proc. VLDB Endowment, 4(11):1123-1134.
303. References IV
Husain, M. F., McGlothlin, J., Masud, M. M., Khan, L. R., and Thuraisingham,
B. (2011). Heuristics-based query processing for large RDF graphs using
cloud computing. IEEE Trans. Knowl. and Data Eng., 23(9):1312-1327.
Kaoudi, Z. and Manolescu, I. (2014). RDF in the clouds: A survey. VLDB J.
Forthcoming.
Konopnicki, D. and Shmueli, O. (1995). W3QS: A query system for the World
Wide Web. In Proc. 21st Int. Conf. on Very Large Data Bases, pages 54-65.
Ladwig, G. and Tran, T. (2010). Linked data query processing strategies. In
Proc. 9th Int. Semantic Web Conf., pages 453-469.
Ladwig, G. and Tran, T. (2011). SIHJoin: Querying remote and local linked
data. In Proc. 8th Extended Semantic Web Conf., pages 139-153.
Lakshmanan, L. V. S., Sadri, F., and Subramanian, I. N. (1996). A declarative
language for querying and restructuring the Web. In Proc. 6th Int.
Workshop on Research Issues on Data Eng., pages 12-21.
Lee, K. and Liu, L. (2013). Scaling queries over big RDF graphs with semantic
hash partitioning. Proc. VLDB Endowment, 6(14):1894-1905.
304. References V
Mendelzon, A. O., Mihaila, G. A., and Milo, T. (1997). Querying the World
Wide Web. Int. J. Digit. Libr., 1(1):54-67.
Neumann, T. and Weikum, G. (2008). RDF-3X: a RISC-style engine for RDF.
Proc. VLDB Endowment, 1(1):647-659.
Neumann, T. and Weikum, G. (2009). The RDF-3X engine for scalable
management of RDF data. VLDB J., 19(1):91-113.
Papakonstantinou, Y., Garcia-Molina, H., and Widom, J. (1995). Object
exchange across heterogeneous information sources. In Proc. 11th Int. Conf.
on Data Engineering, pages 251-260.
Peng, P., Zou, L., Ozsu, M. T., Chen, L., and Zhao, D. (2014). Processing
SPARQL queries over linked data - a distributed graph-based approach.
Submitted for publication.
Prasser, F., Kemper, A., and Kuhn, K. A. (2012). Efficient distributed query
processing for autonomous RDF databases. In Proc. 15th Int. Conf. on
Extending Database Technology, pages 372-383.
Schwarte, A., Haase, P., Hose, K., Schenkel, R., and Schmidt, M. (2011).
FedX: A federation layer for distributed query processing on linked open
data. In Proc. 8th Extended Semantic Web Conf., pages 481-486.
305. References VI
Umbrich, J., Hose, K., Karnstedt, M., Harth, A., and Polleres, A. (2011).
Comparing data summaries for processing live queries over linked data.
World Wide Web J., 14(5-6):495-544.
Weiss, C., Karras, P., and Bernstein, A. (2008). Hexastore: sextuple indexing
for semantic web data management. Proc. VLDB Endowment,
1(1):1008-1019.
Wilkinson, K. (2006). Jena property table implementation. Technical Report
HPL-2006-140, HP Laboratories Palo Alto.
Zhang, X., Chen, L., Tong, Y., and Wang, M. (2013). EAGRE: Towards
scalable I/O efficient SPARQL query evaluation on the cloud. In Proc. 29th
Int. Conf. on Data Engineering, pages 565-576.
Zou, L., Mo, J., Chen, L., Ozsu, M. T., and Zhao, D. (2011). gStore:
answering SPARQL queries via subgraph matching. Proc. VLDB
Endowment, 4(8):482-493.
Zou, L., Ozsu, M. T., Chen, L., Shen, X., Huang, R., and Zhao, D. (2014).
gStore: A graph-based SPARQL query engine. VLDB J., 23(4):565-590.