6. RDF: Resource Description Framework
• Resource
– Generalization of “Web resource”
– A thing that can be identified (but not necessarily
retrieved) on the Web
• Description
– A resource is described with statements that
specify the properties and property values of the
resource
• Statement (aka Triple)
– subject: identifies the resource
– predicate: identifies a property of the resource
– object: identifies the value of that property
7. Everything can be described with (loads of) triples...
[Diagram: A Triple: Subject (resource) → Property (resource) → Object (resource or literal value)]
9. An RDF graph can be serialized in several ways
• RDF/XML: the W3C’s official format
– XML is well established: good for application developers
– very verbose, not very “readable”
– e.g. uniprot.org/uniprot/P00750.rdf
• N-Triples
– good for loading into triple stores
– e.g. uniprot.org/uniprot/P00750.nt
• Turtle ⟵ most examples will use this
– good for reading by humans
– e.g. uniprot.org/uniprot/P00750.ttl
• JSON-LD
– easy for javascript/websites
• ....
• Conversion 100% lossless
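As a rough illustration of why a line-based format such as N-Triples is easy to load, the sketch below writes one statement per line. The Python representation (tuples of tagged terms), the helper names and the example IRIs are assumptions made for illustration; they are not part of any RDF library, and literal escaping is simplified.

```python
# A minimal sketch: a triple is (subject, predicate, object), where each term
# is tagged as ("iri", value) or ("literal", value). Helper names are hypothetical.

def format_term(term):
    kind, value = term
    if kind == "iri":
        return "<" + value + ">"
    # Plain literal: escape backslashes and double quotes only (simplified).
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return '"' + escaped + '"'

def to_ntriples(triples):
    # One statement per line, each terminated by " ."
    return "\n".join(
        " ".join(format_term(t) for t in triple) + " ." for triple in triples
    )

# Hypothetical example IRIs (example.org), not real UniProt identifiers.
graph = [
    (("iri", "http://example.org/taxon/9993"),
     ("iri", "http://example.org/vocab/scientificName"),
     ("literal", "Marmota marmota")),
]

print(to_ntriples(graph))
```

Because every line is a complete statement, a loader can process such a file line by line without keeping any parser state.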
11. RDF identifies resources with URIs
[Diagram: A Triple: “UniProt.rdf What and why” → presented by → expasy.org/people/Jerven_Tjalling.Bolleman.htm (a URI)]
12. Multiple URIs may identify the same thing
[Diagram: A Triple: expasy.org/people/Jerven_Tjalling.Bolleman.htm → owl:sameAs → ch.linkedin.com/in/jervenbolleman]
13. The life sciences have an identity problem...
• www.genenames.org/data/hgnc_data.php?hgnc_id=9993
– RGS11: regulator of G-protein signaling 11
• http://www.uniprot.org/taxonomy/9993
– European alpine marmot
• ...
What is “9993”?
15. The solution: URIs
• In RDF statements:
– subject and predicates must be URIs
– objects may be URIs or literal values
• Advantages:
– No risk of “name clashes” when integrating data from
different sources
– Different people can make statements about the same
resource:
Distributed annotation at a global scale!
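The name-clash point can be sketched in a few lines. The two records come from the “9993” slide; the dictionaries and the `qualify` helper are hypothetical and only illustrate how a bare identifier collides when merged while a full URI does not.

```python
# Two sources that both use the bare identifier "9993".
hgnc = {"9993": "RGS11: regulator of G-protein signaling 11"}
taxonomy = {"9993": "European alpine marmot"}

# Merging on bare identifiers silently overwrites one record with the other.
merged_by_id = {**hgnc, **taxonomy}

# Qualifying each identifier with its source's URI prefix keeps both records.
# The prefixes are the ones shown on the slide; qualify() is a made-up helper.
def qualify(prefix, records):
    return {prefix + key: value for key, value in records.items()}

merged_by_uri = {
    **qualify("www.genenames.org/data/hgnc_data.php?hgnc_id=", hgnc),
    **qualify("http://www.uniprot.org/taxonomy/", taxonomy),
}

print(len(merged_by_id), len(merged_by_uri))
```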
22. RDF What? Quick recap
• RDF describes data with statements (aka triples)
– statement = subject + predicate + object
– related statements form a directed graph
• RDF uses URIs to identify things:
– subject and predicates must be URIs
– objects may be URIs or literal values
• Multiple serialisation formats that are 99.999999%
automatically convertible
23. Why RDF? Isn’t there a simpler solution?
[Outline graphic: What? · Why? · SPARQL? · Examples]
24. A very simple example: FASTA
• Why does everyone in the sequence world use
FASTA?
– The smallest common denominator
– You can put in the header what you like and I can
choose to ignore it
• BUT: You only get a sequence...
>Who|cares_about:this?
THISISWHATWEWANT
26. A simple example: GFF
• Some people want to exchange more than
sequences, and invented GFF:
• BUT: What do the columns mean?
– Originally, an exchange format for sequence
feature descriptions, later also used for other
annotations
– 3 versions known (to me ;)
– Not extendable without prior agreement of all
users
SEQ1 EMBL atg 103 105 . + 0
SEQ1 EMBL exon 103 172 . + 0
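To make the “what do the columns mean?” problem concrete, here is a small sketch that attaches names to the eight columns of the sample lines. The column names follow the early GFF specification and are an assumption here, since the slide does not name them; `parse_gff_line` is a hypothetical helper.

```python
# Column names as in the early GFF specification (assumed; not on the slide).
GFF_COLUMNS = ("seqname", "source", "feature", "start", "end",
               "score", "strand", "frame")

def parse_gff_line(line):
    # Real GFF is tab-separated; the slide's sample is split on whitespace.
    fields = line.split()
    record = dict(zip(GFF_COLUMNS, fields))
    record["start"] = int(record["start"])
    record["end"] = int(record["end"])
    return record

record = parse_gff_line("SEQ1 EMBL exon 103 172 . + 0")
print(record["feature"], record["start"], record["end"])
```

The meaning of each column lives in the parser, not in the file, which is why every consumer has to agree on it in advance.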
28. A proper solution: XML
• There is a world beyond sequences and
bioinformatics!
• XML is an IT-industry standard
– Datatypes
– Multiple namespaces
– Schemas
• BUT:
– Hierarchical data model
– Schemas are closed to extension
29. XML represents data as a tree
• XML datatypes
– Multiple namespaces
– XML Schema is closed to extension
• Tree format
[Diagram: a tree rooted at “entry”, with nodes “active”, “Proton acceptor”, “196”, “EC” and “2.7.11.-”]
30. No XML standard for other relationships
prizes:a case study
• XML datatypes
– Multiple namespaces
– XML Schema is closed to extension
• Tree format
[Diagram: a tree rooted at “entry”, with nodes “active”, “Proton acceptor”, “196”, “EC” and “2.7.11.-”]
31. Our data is a graph!
[Diagram: the nodes “entry”, “active”, “Proton acceptor”, “196”, “EC” and “2.7.11.-” connected as a graph]
32. RDF advantages
• W3C standard
• Can be serialized as XML or JSON
• i.e. most benefits of XML or JSON
• Generic graph structure
• URIs as a standard way to identify resources and
their properties
– data integration without name clashes
– distributed annotation
– normalization
• Extensible!
33. RDF is extensible
• Anyone can say Anything about Anything
– You can say something about my data
• RDF extensions remain compatible
• RDF encourages data and schema reuse
Interactions.ttl:
@prefix prot: <purl.uniprot.org/uniprot/> .
@prefix intact: <fake.ebi.ac.uk/intact/example> .
prot:P32234 intact:interacts_with prot:Q9VGZ4 .
prot:P32234 intact:interacts_with prot:P25724 .
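One way to picture “anyone can say anything about anything” is that merging two sets of statements is just a set union. The sketch below assumes triples are plain tuples of prefixed names; the interaction statements are the ones on this slide, while `ex:discussed_in` and `ex:paper1` are made-up names for a hypothetical third party.

```python
# Triples as plain tuples of prefixed names; a graph is a set of triples.
interactions = {
    ("prot:P32234", "intact:interacts_with", "prot:Q9VGZ4"),
    ("prot:P32234", "intact:interacts_with", "prot:P25724"),
}

# A hypothetical third party says something else about the same protein.
# "ex:discussed_in" and "ex:paper1" are invented for illustration.
third_party = {
    ("prot:P32234", "ex:discussed_in", "ex:paper1"),
}

# Merging two graphs is a set union: no schema migration is needed.
merged = interactions | third_party

about_p32234 = sorted(t for t in merged if t[0] == "prot:P32234")
print(len(about_p32234))
```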
34. RDF data model is simple
• Everything can be said with triples
• Generic triple stores
– low maintenance data integration
• SPARQL
– SQL for RDF
– XPath for RDF
– Regular expressions for RDF
35. Comparison
                    Flat file   XML   RDF
Standard            NO          YES   YES
Scalable            NO          YES   YES +
Extendable          NO          NO    YES
Generic data model  NO          NO    YES
37. Most common failure in RDF world: Philosophy over pragmatism
1. Be honest about your data
• What you have, not what you want
2. Change the concept, change the IRI
• One concept can be referred to by multiple IRIs
3. Better to “do” than to “debate”
38. Model real data, not the “real world”
• Describe records that relate to real world things
• Acknowledge that they are records
• Model measurements before “facts”
44. OWL: Web Ontology Language
• Will be presented in detail during the week
• Logical meaning added to RDF statements, which tools can use
• Classifies existing data or infers new data
• Very powerful and useful
49. W3C workgroup in progress
• Data Shapes
• You don’t want to know how the sausage is made…
• Vendors looking forward to implementing it
• Currently not that bad, could be better
• First Working Draft
51. Why provide a public SPARQL endpoint
• A 10-person wet laboratory cannot afford:
– to host its own in-house database holding all, or even a fraction, of life science data
– not to have access to, and use, existing life science information
52. Not CPU time... but brain time
The right kind of optimisation
53. Why provide a public SPARQL endpoint
• Classical SQL can be provided on the web
– Is not practical
– No federation
– Poor standards conformance
• Local SQL is expensive
• Local JSON is no better
• Nor is local XML
56. Why provide a public SPARQL endpoint
• Document-centric REST is not enough
– Swiss-Prot available as REST
– (over e-mail !!) since 1986
– expasy.ch since 1993
– www.uniprot.org since 2002
• Most users use a GUI, not a CLI
• Developers build GUIs on a CLI
62. Real users
Mix between hard analytics and super specific
Estimate somewhere between:
300 - 1000 real humans per month
We know they are real because they take
holidays ;)
72. Turtle is the RDF serialization aligned with SPARQL
• Shorthand to avoid typing so much
– ‘.’ (dot) ends the statement
– ‘;’ (semicolon) repeats the subject
– ‘,’ (comma) repeats the subject and predicate
• Prefix
– The part before ‘:’ is an abbreviation of a URI
73. Why don’t these queries work elsewhere?
• PREFIX
– On the web you often have to add these
– But some can be preconfigured
PREFIX :<http://purl.uniprot.org/core/>
SELECT ?x
FROM <http://purl.uniprot.org/taxonomy/>
WHERE {?x a :Taxon}
74. a = rdf:type = <http://www.w3.org/1999/02/22-rdf-syntax-ns#type>
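A small sketch of what prefix expansion amounts to, assuming a simple lookup table. The `expand` helper is hypothetical and ignores full IRIs in angle brackets, variables and literals; the two namespace IRIs are the ones shown on the slides.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

# Prefix table; the empty prefix corresponds to
# "PREFIX :<http://purl.uniprot.org/core/>".
PREFIXES = {
    "": "http://purl.uniprot.org/core/",
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
}

def expand(name, prefixes=PREFIXES):
    # The keyword "a" is shorthand for rdf:type.
    if name == "a":
        return RDF_TYPE
    prefix, _, local = name.partition(":")
    return prefixes[prefix] + local

print(expand(":Taxon"))
print(expand("a") == expand("rdf:type"))
```

A query that relies on a prefix the endpoint has not preconfigured would fail at the lookup step, which is why the PREFIX lines often have to be added by hand.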
75. <9993> rdf:type up:Taxon ;
    up:rank up:Species ;
    up:reviewed true ;
    up:mnemonic "MARMR" ;
    up:scientificName "Marmota marmota" ;
    up:commonName "Alpine marmot" ;
    up:otherName "European marmot" ;
    rdfs:subClassOf <9992> ;
    skos:narrowerTransitive <9994> .
rdfs:subClassOf
taxon:9994 is a more specific
classification than
76. <9993> rdf:type up:Taxon ;
    up:rank up:Species ;
    up:reviewed true ;
    up:mnemonic "MARMR" ;
    up:scientificName "Marmota marmota" ;
    up:commonName "Alpine marmot" ;
    up:otherName "European marmot" ;
    rdfs:subClassOf <9992> ;
    skos:narrowerTransitive <9994> .
rank => “The level, for nomenclatural
purposes, of a taxon in a taxonomic
hierarchy”
77. Let’s learn SPARQL
• Queries over RDF data.
– Four basic types
• SELECT
– Returns “tab delimited” results
• CONSTRUCT
– Makes new triples
• DESCRIBE
– Returns all triples mentioning a resource
• ASK
– Returns true or false
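To give a feel for how the query forms differ, here is a toy matcher for a single triple pattern over an in-memory list. It is only a sketch: the function names are hypothetical, prefixed names are kept as plain strings, and real SPARQL engines do far more. SELECT, CONSTRUCT and SPARQL's fourth form, ASK, are shown; DESCRIBE is left out because its result is implementation-defined.

```python
# Tiny in-memory "graph"; the data mirrors the marmot example.
TRIPLES = [
    ("taxon:9993", "rdf:type", "up:Taxon"),
    ("taxon:9993", "up:scientificName", "Marmota marmota"),
    ("taxon:9992", "rdf:type", "up:Taxon"),
]

def match(pattern, triples):
    """Yield one binding (dict) per triple matching a single triple pattern."""
    for triple in triples:
        binding = {}
        for term, value in zip(pattern, triple):
            if term.startswith("?"):
                # A repeated variable must bind to the same value.
                if binding.setdefault(term, value) != value:
                    break
            elif term != value:
                break
        else:
            yield binding

def select(variables, pattern, triples):
    # SELECT: a table with one row per solution.
    return [tuple(b[v] for v in variables) for b in match(pattern, triples)]

def ask(pattern, triples):
    # ASK: true if at least one solution exists.
    return any(True for _ in match(pattern, triples))

def construct(template, pattern, triples):
    # CONSTRUCT: build new triples by filling a template from each solution.
    return [tuple(b.get(t, t) for t in template) for b in match(pattern, triples)]

rows = select(["?x"], ("?x", "rdf:type", "up:Taxon"), TRIPLES)
print(rows)
```

For instance, `ask(("taxon:9993", "up:scientificName", "?n"), TRIPLES)` checks for existence only, while `construct` with a template such as `("?x", "ex:isA", "ex:Thing")` (made-up names) produces one new triple per solution.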
94. 5: Optional
• When values may be missing
– yet interesting when they are there
• Use as sub query
• Bound values from outside stay bound inside
– ?x ?y ?z . OPTIONAL {?x ?b ?c}
• ?x: same variable = same thing
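OPTIONAL behaves like a left join over solutions. The sketch below assumes solutions are dictionaries from variable names to values; the helper names and the sample solutions (including the name given to taxon:9992) are hypothetical.

```python
def compatible(a, b):
    # Two solutions are compatible if they agree on every shared variable.
    return all(b[k] == v for k, v in a.items() if k in b)

def left_join(required, optional):
    """Keep every required solution; extend it when an optional one fits."""
    results = []
    for left in required:
        extended = [{**left, **right} for right in optional if compatible(left, right)]
        # No compatible optional solution: keep the row, variables stay unbound.
        results.extend(extended or [left])
    return results

# Hypothetical solutions: every taxon has a name, only one has a common name.
required = [{"?x": "taxon:9993", "?name": "Marmota marmota"},
            {"?x": "taxon:9992", "?name": "Marmota"}]
optional = [{"?x": "taxon:9993", "?common": "Alpine marmot"}]

joined = left_join(required, optional)
for row in joined:
    print(row.get("?x"), row.get("?common"))
```

The shared variable `?x` is what ties the optional part to the outer pattern: the same variable means the same thing on both sides.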
102. MINUS{} or FILTER (NOT EXISTS{})
• What’s the difference?
– MINUS subtracts results
– NOT EXISTS tests if the sub pattern is possible at all.
• Normally the faster option.
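The difference shows up most clearly when the inner pattern shares no variable with the outer one. The sketch below models solutions as dictionaries and follows the SPARQL 1.1 definitions in simplified form; the helper names and sample data are hypothetical.

```python
def compatible(a, b):
    # Two solutions are compatible if they agree on every shared variable.
    return all(b[k] == v for k, v in a.items() if k in b)

def minus(left, right):
    # MINUS drops a solution only if some right-hand solution is compatible
    # with it AND shares at least one variable with it.
    return [l for l in left
            if not any(compatible(l, r) and set(l) & set(r) for r in right)]

def filter_not_exists(left, right_for):
    # NOT EXISTS re-evaluates the inner pattern with the outer bindings
    # substituted; right_for(solution) returns the inner pattern's solutions.
    return [l for l in left if not right_for(l)]

left = [{"?x": "taxon:9993"}, {"?x": "taxon:9992"}]

# Inner pattern that shares no variable with the outer one but matches something.
unrelated = [{"?a": "taxon:1", "?b": "rdf:type", "?c": "up:Taxon"}]

kept_by_minus = minus(left, unrelated)
kept_by_not_exists = filter_not_exists(left, lambda solution: unrelated)
print(len(kept_by_minus), len(kept_by_not_exists))
```

With no shared variable, MINUS removes nothing, whereas NOT EXISTS removes everything because the inner pattern can be matched. When the two sides do share a variable, both forms usually give the same answer.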
107. FILTERS
• You just saw it twice
– Once in the !BOUND
– Once in the NOT EXISTS
• Filters a result set by possibly removing values
– FILTER does not add a value to the result
• Inside the same graph pattern, FILTERs are order independent.
141. HAVING
• FILTER for aggregates
• After the GROUP BY clause
– ... GROUP BY ?x HAVING (count(?y) > 2)
– ... GROUP BY ?x HAVING (min(?y) = 2)
– etc...
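The two HAVING clauses above can be mimicked in plain Python to show that the condition applies to whole groups rather than to individual rows. The sample solutions are invented for illustration.

```python
from collections import defaultdict

# Hypothetical solutions for a pattern binding ?x and ?y.
solutions = [("a", 1), ("a", 2), ("a", 3), ("b", 2), ("b", 5)]

# GROUP BY ?x
groups = defaultdict(list)
for x, y in solutions:
    groups[x].append(y)

# HAVING (count(?y) > 2): filter whole groups on an aggregate.
having_count = sorted(x for x, ys in groups.items() if len(ys) > 2)

# HAVING (min(?y) = 2)
having_min = sorted(x for x, ys in groups.items() if min(ys) == 2)

print(having_count, having_min)
```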
151. Examples
• Parameter lists are between ()
VALUES (?annotation) {
(core:Disease_Annotation)
(core:Disulfide_Bond_Annotation)
}
152. Examples
• UNDEF means no value at all
– not bound
VALUES (?annotation ?begin) {
(core:Disease_Annotation UNDEF)
(core:Disulfide_Bond_Annotation 2)
}
153. VALUES
• After declaring a set of values you can use them in your query.
SELECT ?comment WHERE {
  VALUES (?annotation ?begin) {
    (core:Disease_Annotation UNDEF)
    (core:Disulfide_Bond_Annotation 2)
  }
  ?annotation rdfs:comment ?comment .
}
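One way to read VALUES is as an inline table of solutions that is joined with the rest of the query, with UNDEF leaving a variable unbound. The sketch below assumes solutions are dictionaries; the helper names, the comment strings and `core:Sequence_Annotation` are hypothetical and used only for illustration.

```python
UNDEF = None  # stand-in for SPARQL's UNDEF: the variable is left unbound

def values(variables, rows):
    # Turn VALUES rows into solutions, omitting variables marked UNDEF.
    return [{v: x for v, x in zip(variables, row) if x is not UNDEF} for row in rows]

def compatible(a, b):
    # Two solutions are compatible if they agree on every shared variable.
    return all(b[k] == v for k, v in a.items() if k in b)

def join(left, right):
    return [{**l, **r} for l in left for r in right if compatible(l, r)]

inline = values(("?annotation", "?begin"), [
    ("core:Disease_Annotation", UNDEF),
    ("core:Disulfide_Bond_Annotation", 2),
])

# Hypothetical solutions of "?annotation rdfs:comment ?comment".
pattern = [
    {"?annotation": "core:Disease_Annotation", "?comment": "comment A"},
    {"?annotation": "core:Sequence_Annotation", "?comment": "comment B"},
]

result = join(inline, pattern)
print([s["?comment"] for s in result])
```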
154. SERVICE: Using other SPARQL endpoints
• SERVICE <URL of other endpoint>
– Runs a sub query on the other endpoint and merges it back into your query.
157. SERVICE
• Useful
– Quick experimenting with combining multiple data sources
– Quick for queries where not too much data is sent to the remote endpoint
• Slow
– When you ask for too much data
– Remote endpoint not resourced for your questions
158. SERVICE
• Slowly improving
• Theoretically unfixable
• Practically could be much better
• A 1000x speed-up is a small step away