FirstKeystoneSummerSchool–
MaltaJuly2015–P.Missier
Provenance and the W3C PROV model
(in the Big Data context)
Paolo Missier
School of Computing Science
Newcastle University, UK
Tutorial
First Keystone Summer School,
Malta, July 2015
Some of the slides courtesy of Luc Moreau – thanks!
Topical research dissemination events
Lecture goals and outline
• What is provenance, and why does it matter?
• Definitions and case studies
• The W3C PROV standard in a nutshell
• PROV-O: the Provenance Ontology and examples of its usage
• Provenance and Big Data: what’s the connection?
• Opportunities and challenges
• Provenance tools [from Southampton]
One recent book
http://www.morganclaypool.com/doi/abs/10.2200/S00528ED1V01Y201308WBE007
1- Reproducibility and dissemination in Science
Independent validation of scientific claims is a cornerstone of
experimental science
• Scientific claims are supported by experiments
• How do I express my “material and methods” so that you can
independently verify my results?
• How do I document my results to promote their understanding /
reuse?
Provenance is the equivalent of a logbook
• Capture all steps involved in the derivation of a
result
• Replay, validate the execution, compare it with
others
To what extent can these be formalised and automated in data-intensive
science?
2- Explaining the outcome of a complex decision process
• Which process was used to derive a diagnosis?
• How did the process use the input data?
• How were the steps configured?
• Which decisions were made by human experts (clinicians)?
[Figure: Clinical diagnosis of genetic diseases, downstream of an NGS pipeline, in three stages.
Variant filtering: MAF threshold; non-synonymous, stop/gain, frameshift; known polymorphisms; homo/heterozygous; pathogenicity predictors.
Variant scoping: user-supplied genes list, user-supplied disease keywords, user-defined preferred genes; HPO match, HPO to OMIM, OMIM match, OMIM to Gene, gene union, gene intersect; genes in scope; candidate variants; select variants in scope.
Variant classification: ClinVar lookup and OMIM on the annotated patient variants; RED: found, pathogenic; AMBER: not found or uncertain; GREEN: found, benign.]
3- Understanding the results of a computation
• Why has my [very complicated algorithm] produced this particular
result?
• Why is my predictive analytics model suggesting that it will rain
tomorrow?
• Why is this record part of the result of my database query?
• Database provenance
• Why is this record included in the result of my keyword search?
4- Content reuse on the Social Web
Open Data, Data Journalism
• A consume-select-curate-share workflow, not only professional
• Ethos: to expose the data and methods used to produce news items
• But: Data wrangling can introduce errors
• Is the data I am using valid? What is its primary source? What are the
transformation steps?
NowNews publishes an article based on
the latest employment data published by
GovStat
PolicyOrg compiles a report including
NowNews article
:L-Moreau a prov:Agent .
:original-slide a prov:Entity ;
prov:wasAttributedTo :L-Moreau .
:this-slide a prov:Entity ;
prov:wasDerivedFrom :original-slide .
What is provenance?
Oxford English Dictionary:
• the fact of coming from some particular source or quarter; origin, derivation
• the history or pedigree of a work of art, manuscript, rare book, etc.;
• a record of the passage of an item through its various owners
Magna Carta (‘the Great Charter’) was
agreed between King John and his barons
on 15 June 1215.
What is provenance?
Provenance refers to the sources of information, including entities
and processes, involved in producing or delivering an artifact (*)
Provenance is a description of how things came to be, and how
they came to be in the state they are in today (*)
Provenance is a record that describes the people, institutions,
entities, and activities, involved in producing, influencing, or
delivering a piece of data or a thing in the world
Provenance on the Web
Tim Berners-Lee’s “Oh Yeah” button:
• A browser button by which the user can express their uncertainty about a
document being displayed “so how do I know I can trust this information?”.
• Upon activation of the button, the software then retrieves metadata about the
document, listing assumptions on which trust can be based.
http://users.ugent.be/~tdenies/OhYeah/
Easy Access to Provenance: an Essential Step Towards Trust on the Web, Procs METHOD 2013: The 2nd IEEE
International Workshop on Methods for Establishing Trust with Open Data Held in conjunction with COMPSAC,
the IEEE Signature Conference on Computers, Software & Applications - July 22-26, 2013 - Kyoto, Japan
http://dx.doi.org/10.1109/COMPSACW.2013.29
Provenance in the Semantic Web Stack
Use cases on the Social Web
Open Data, Data Journalism
NowNews publishes an article based on the latest employment
data published by GovStat
PolicyOrg compiles a report including NowNews article
Bob: Journalist
Alice: Data Cruncher
Tom: Editor
Nick: Web Master
Derivation - Timeliness
Derivation:
• Charts, graphs and visualizations are all based on multiple data sets
• E.g. Bob’s article on employment that appeared in NowNews
• Which data was a figure based upon?
Is the report based on the most
up-to-date data?
Derivation - Trusted sources
Derivation:
• Is this content derived from data coming from a reliable source?
• The chart within Bob’s article is based on GovStat data
• However that information is hidden:
• the chart was produced by a complex process performed by Alice
Policy rule:
“data supplied by the government
is reliable”
Tracing the source of errors
Derivation, attribution:
• When did this error occur?
• Who was responsible for the chart?
Nick discovers an error in the
chart included in Bob’s article
now:employment-article-v1.html prov:wasAttributedTo nowpeople:Bob
Ensuring policy compliance
Process inspection:
• Which process steps led to publication?
• Was editorial check part of it?
Policy rule:
“posts are to be checked by an
editor prior to publication”
Ensuring credit and acknowledgement
NowNews relies on multiple
contributors
[Figure: employment-article-v1.html with agents Bob and David, linked by attribution (att) and delegation (del) edges]
Attribution and responsibility:
• How do we ensure that all relevant
contributors are acknowledged?
Reproducibility
Documenting the data generation process:
• How do we ensure that the figures can be reproduced using the new versions of the data?
NowNews must ensure that the article figures reflect the most recent data
[Figure: the data-crunching activity, associated with Alice (assoc), uses data-source-A and data-source-B (use) and generates employment-article-v1.html (gen); version labels 1.0 and 2.0 appear in the figure]
So, why does provenance matter?
• To establish quality, relevance, trust
• To track information attribution through complex transformations
• To enable process analysis for debugging, improvement, evolution
• To enable reproducibility of processes (e.g. in science, data journalism…)
See also:
ACM Journal of Data and Information Quality (JDIQ) - Special Issue on Provenance, Data
and Information Quality, Paolo Missier, Paolo Papotti, Eds. Volume 5 Issue 3, February 2015
DOI: 10.1145/2692312
http://dl.acm.org/citation.cfm?id=2700413
http://jdiq.acm.org/archive.cfm?id=2698232
The W3C Working Group on Provenance
2009-2010: W3C Incubator Group on provenance
Chair: Yolanda Gil, ISI, USC
Main output: “Provenance XG Final Report”
http://www.w3.org/2005/Incubator/prov/XGR-prov/
- provides an overview of the various existing approaches, vocabularies
- proposes the creation of a dedicated W3C Working Group
April 2011: W3C Working Group approved
Chairs: Luc Moreau, Paul Groth
April 2013: Proposed Recommendations finalised
prov-dm: Data Model
prov-o: OWL ontology, RDF encoding
prov-n: prov notation
prov-constraints
...plus a number of non-prescriptive Notes
http://www.w3.org/2011/prov/wiki/
PROV: scope and structure
source: http://www.w3.org/TR/prov-overview/
[Figure label: Recommendation track]
See also:
Moreau, Luc, and Paul Groth. “Provenance: An Introduction to PROV.” Synthesis Lectures
on the Semantic Web: Theory and Technology 3, no. 4 (September 15, 2013): 1–129.
doi:10.2200/S00528ED1V01Y201308WBE007.
PROV Core Elements (graph depiction)
An entity is a physical, digital, conceptual, or other kind of thing with some fixed
aspects; entities may be real or imaginary.
An activity is something that occurs over a period of time and acts upon or with entities; it
may include consuming, processing, transforming, ..., using, or generating entities.
An agent is something that bears some form of responsibility for an activity taking place,
for the existence of an entity, or for another agent's activity.
Generation, Usage
Generation is the completion of production of a new entity by an activity. This entity did not
exist before generation and becomes available for usage after this generation.
Usage is the beginning of utilizing an entity by an activity. Before usage, the activity had
not begun to utilize this entity
PROV is based on a notion of instantaneous events, that mark transitions in the world
- generation, usage (and others)
Ordering constraints amongst events:
“generation of e must precede each of its usages”
“a can only use / generate e after it has started and before it has ended”
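The two ordering constraints above can be sketched as a small check. This is an illustration only and not part of PROV: it assumes every event sits on a single numeric timeline (PROV itself makes no such assumption), and the tuple layout and the name `check_ordering` are hypothetical.

```python
def check_ordering(events):
    """Return violations of the two ordering constraints quoted above.

    Each event is a hypothetical tuple (kind, activity, entity, time) with
    kind in {"start", "end", "generation", "usage"}; entity is None for
    start/end events. A single numeric timeline is assumed for illustration.
    """
    starts = {a: t for kind, a, _, t in events if kind == "start"}
    ends = {a: t for kind, a, _, t in events if kind == "end"}
    generated = {e: t for kind, _, e, t in events if kind == "generation"}

    violations = []
    for kind, activity, entity, time in events:
        if kind not in ("generation", "usage"):
            continue
        # generation of e must precede each of its usages
        if kind == "usage" and entity in generated and time < generated[entity]:
            violations.append(f"{entity} used by {activity} before its generation")
        # a can only use / generate e after it has started and before it has ended
        if activity in starts and time < starts[activity]:
            violations.append(f"{kind} of {entity} before {activity} started")
        if activity in ends and time > ends[activity]:
            violations.append(f"{kind} of {entity} after {activity} ended")
    return violations


timeline = [
    ("start", "drafting", None, 0),
    ("generation", "drafting", "draftV1", 1),
    ("end", "drafting", None, 2),
    ("start", "commenting", None, 3),
    ("usage", "commenting", "draftV1", 4),
    ("end", "commenting", None, 5),
]
print(check_ordering(timeline))
```

Moving the usage of draftV1 to a time before its generation would report two violations: one against the generation and one against the start of commenting.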
Concepts and relations
Generation of “draft v1” expressed as relation:
wasGeneratedBy(“draft v1”, ...)
Usage of “draft v1” by “commenting” expressed as relation:
used(“commenting”, “draft v1”, ...)
PROV notation
document
prefix prov <http://www.w3.org/ns/prov#>
prefix ex <http://www.example.com/>
entity(ex:draftComments)
entity(ex:draftV1, [ ex:distr="internal", ex:status="draft" ])
entity(ex:paper1)
entity(ex:paper2)
activity(ex:commenting)
activity(ex:drafting)
wasGeneratedBy(ex:draftComments, ex:commenting, 2013-03-18T11:10:00)
used(ex:commenting, ex:draftV1, -)
wasGeneratedBy(ex:draftV1, ex:drafting, -)
used(ex:drafting, ex:paper1, -)
used(ex:drafting, ex:paper2, -)
endDocument
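To make the shape of such a document concrete, here is a minimal sketch that prints a few statements in PROV-N. The helper `to_provn` and its argument layout are hypothetical and cover only entities, activities, generation and usage; libraries such as ProvToolbox or ProvPy are the realistic choice for actual work.

```python
def to_provn(prefixes, entities, activities, generations, usages):
    """Serialize a few PROV statements to PROV-N text (hypothetical helper)."""

    def attrs(pairs):
        # Render optional attribute-value pairs as [k="v", ...]
        if not pairs:
            return ""
        return ", [" + ", ".join(f'{k}="{v}"' for k, v in pairs.items()) + "]"

    def time(t):
        # PROV-N writes "-" for an omitted optional argument
        return t if t is not None else "-"

    lines = ["document"]
    lines += [f"prefix {p} <{iri}>" for p, iri in prefixes.items()]
    lines += [f"entity({e}{attrs(a)})" for e, a in entities.items()]
    lines += [f"activity({a})" for a in activities]
    lines += [f"wasGeneratedBy({e}, {a}, {time(t)})" for e, a, t in generations]
    lines += [f"used({a}, {e}, {time(t)})" for a, e, t in usages]
    lines.append("endDocument")
    return "\n".join(lines)


doc = to_provn(
    prefixes={"ex": "http://www.example.com/"},
    entities={"ex:draftComments": {}, "ex:draftV1": {"ex:status": "draft"}},
    activities=["ex:commenting", "ex:drafting"],
    generations=[("ex:draftComments", "ex:commenting", "2013-03-18T11:10:00")],
    usages=[("ex:commenting", "ex:draftV1", None)],
)
print(doc)
```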
Same example: PROV-O notation (RDF/N3)
:draftComments a prov:Entity ;
:distr "internal"^^xsd:string ;
prov:wasGeneratedBy :commenting .
:commenting a prov:Activity ;
prov:used :draftV1 .
:draftV1 a prov:Entity ;
:distr "internal"^^xsd:string ;
:status "draft"^^xsd:string ;
:version "0.1"^^xsd:string ;
prov:wasGeneratedBy :drafting .
:drafting a prov:Activity ;
prov:used :paper1,
:paper2 .
:paper1 a prov:Entity,
"reference"^^xsd:string .
:paper2 a prov:Entity,
"reference"^^xsd:string .
Association, Attribution, Delegation: who did what?
An activity association is an assignment of responsibility to an agent for an activity,
indicating that the agent had a role in the activity.
Attribution is the ascribing of an entity to an agent.
entity(ex:draftComments, [ ex:distr="internal" ])
activity(ex:commenting)
agent(ex:Bob, [prov:type = "mainEditor"] )
agent(ex:Alice, [prov:type = "srEditor"])
wasAssociatedWith(ex:commenting, ex:Bob, -, [prov:role = "editor"])
actedOnBehalfOf(ex:Bob, ex:Alice)
wasAttributedTo(ex:draftComments, ex:Bob)
Same example: PROV-O notation (RDF/N3)
:Alice a prov:Agent,
"ex:chiefEditor";
:firstName "Alice";
:lastName "Cooper".
:Bob a prov:Agent,
"ex:seniorEditor";
:firstName "Robert";
:lastName "Thompson" ;
prov:actedOnBehalfOf :Alice .
:draftComments prov:wasAttributedTo :Bob .
:commenting a prov:Activity ;
prov:wasAssociatedWith :Bob .
Association and Attribution
Q.: what is the relationship between attribution and association?
This is defined as an inference rule in the PROV-CONSTRAINTS document:
IF wasAttributedTo(e, Ag)
THEN there exists an activity a such that
wasGeneratedBy(e, a, -) and wasAssociatedWith(a, Ag, -)
where entity(e), agent(Ag), activity(a)
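The link between attribution and association can be sketched as a rule application: whenever an entity is attributed to an agent, some activity must have generated the entity and been associated with that agent. The sketch below is an assumption-laden illustration, not the normative PROV-CONSTRAINTS algorithm: relations are plain Python tuples, the name `apply_attribution_inference` is hypothetical, and a placeholder identifier such as `_:a1` stands in for the unknown activity.

```python
import itertools


def apply_attribution_inference(attributed, generated, associated):
    """Add a placeholder activity for each attribution that lacks a witness.

    attributed: set of (entity, agent)
    generated:  set of (entity, activity)
    associated: set of (activity, agent)
    """
    generated, associated = set(generated), set(associated)
    fresh = itertools.count(1)
    for entity, agent in sorted(attributed):
        # Is there already an activity that generated the entity
        # and was associated with the agent?
        witnessed = any(
            ent == entity and (act, agent) in associated
            for ent, act in generated
        )
        if not witnessed:
            act = f"_:a{next(fresh)}"  # existential placeholder (hypothetical naming)
            generated.add((entity, act))
            associated.add((act, agent))
    return generated, associated


gen, assoc = apply_attribution_inference(
    attributed={("ex:draftComments", "ex:Bob")},
    generated=set(),
    associated=set(),
)
print(sorted(gen), sorted(assoc))
```

When the activity is already known (for example ex:commenting generated ex:draftComments and was associated with ex:Bob), the function leaves both sets unchanged.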
Communication amongst activities
Communication is the exchange of some unspecified entity by two
activities, one activity using some entity generated by the other.
activity(ex:commenting)
activity(ex:drafting)
wasInformedBy(ex:commenting, ex:drafting)
:drafting a prov:Activity .
:commenting a prov:Activity ;
prov:wasInformedBy :drafting .
Communication, generation, usage
activity(ex:commenting)
activity(ex:drafting)
entity(e)
wasInformedBy(ex:commenting, ex:drafting)
wasGeneratedBy(e,ex:drafting, -)
used(ex:commenting, e, -)
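Q.: what is the relationship between communication, generation, and usage?
These are Inferences 5 and 6 in the PROV-CONSTRAINTS document:
- wasInformedBy(a2, a1) implies that there exists an entity e such that wasGeneratedBy(e, a1, -) and used(a2, e, -)
- wasGeneratedBy(e, a1, -) and used(a2, e, -) imply wasInformedBy(a2, a1)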
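One direction of this relationship is easy to sketch: if one activity used an entity that another activity generated, the first was informed by the second. The function name `infer_communication` and the tuple encoding are hypothetical, chosen only for illustration.

```python
def infer_communication(generated, used):
    """Derive wasInformedBy pairs from generation and usage.

    generated: set of (entity, activity) pairs, as in wasGeneratedBy(e, a1, -)
    used:      set of (activity, entity) pairs, as in used(a2, e, -)
    Returns a set of (informed, informant) pairs, as in wasInformedBy(a2, a1).
    """
    return {
        (informed, informant)
        for entity, informant in generated
        for informed, used_entity in used
        if used_entity == entity
    }


informed_by = infer_communication(
    generated={("e", "ex:drafting")},
    used={("ex:commenting", "e")},
)
print(informed_by)
```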
Three Views of Provenance
Summary of the PROV Core model
Derivation amongst entities
A derivation is a transformation of an entity into another, an update of an entity
resulting in a new one, or the construction of a new entity based on a pre-existing
entity.
entity(ex:draftV1)
entity(ex:draftComments)
wasDerivedFrom(ex:draftComments, ex:draftV1)
Q.: what is the relationship between derivation, generation, and usage?
:draftComments a prov:Entity ;
prov:wasDerivedFrom :draftV1 .
:draftV1 a prov:Entity .
Provenance and Big Data: what’s the connection?
opportunities and challenges
Provenance {as,of} Big Data
1. BigProv: Provenance as big data
• High volume provenance
• What kind of analytics are interesting on big provenance?
2. Provenance of analytics processes
• “Prediction provenance”
• Train a model → provenance of the model as a record of the training
process and data involved
• Use the model to make predictions → provenance of the prediction
3. Provenance of a search
• What is the provenance of a keyword search?
• Why would it be interesting? What can we learn from it?
Recent research on Provenance as Big Data
Chen, Peng; Plale, Beth A., "Big Data Provenance Analysis and Visualization," 2015 15th IEEE/ACM International
Symposium on Cluster, Cloud and Grid Computing (CCGrid), pp. 797-800, 4-7 May 2015.
doi: 10.1109/CCGrid.2015.85
Chen, Peng; Plale, Beth A., "ProvErr: System Level Statistical Fault Diagnosis Using Dependency
Model," 2015 15th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid),
pp. 525-534, 4-7 May 2015.
doi: 10.1109/CCGrid.2015.86
Provenance Map Orbiter: Interactive Exploration of Large Provenance Graphs
Peter Macko and Margo Seltzer, Harvard University, Procs. TAPP’11, 2011, Crete, Greece
Provenance from Log Files: a BigData Problem, Devarshi Ghoshal and
Beth Plale, Procs. BigProv workshop, EDBT, Genova, Italy, 2013
Adam Bates, Kevin Butler and Thomas Moyer. Take Only What You Need: Leveraging Mandatory
Access Control Policy to Reduce Provenance Storage Costs. In Procs. TAPP’15 workshop,
Edinburgh, 2015
http://workshops.inf.ed.ac.uk/tapp2015/TAPP15_II_3.pdf
ProvGen
• A Provenance Generator tool for experimenting with provenance at scale
• Why generate synthetic provenance?
• Synthetic PROV graphs can be a valuable complement to emerging natural
provenance collections
• … provided their structural properties reflect specific provenance patterns
• control over their repetition and variability
• varying scales
• Useful for benchmarking emerging provenance management systems
• Useful to test analytics algorithms that operate on large provenance collections
[Figure: natural provenance collections (science datasets, git2PROV, mediaWiki history, retweet history) plotted by trace size against number of traces]
Firth, Hugo, and Paolo Missier. “ProvGen: Generating Synthetic PROV Graphs with Predictable Structure.” In
Procs. IPAW 2014 (Provenance and Annotations). Koln, Germany: Springer, 2014.
http://arxiv.org/pdf/1406.2495
What does ProvGen do?
• Accept a seed PROV graph
• Grow the graph
• Add nodes and relationships following the seed graph
structure
• … with constraints on how to grow
document
entity(e1, [type="Document",
version="original"])
entity(e2, [type="Document"])
entity(e3, [type="Document"])
activity(a1, [type="create"])
activity(a2, [type="edit"])
activity(a3, [type="edit"])
agent(ag, [type="Person"])
used(a2, e1)
used(a3, e2)
wasGeneratedBy(e2, a2, [fct="save"])
wasGeneratedBy(e1, a1, [fct="publish"])
wasGeneratedBy(e3, a3, [fct="save"])
wasAssociatedWith(a3, ag,
[role="contributor"])
wasAssociatedWith(a2, ag,
[role="contributor"])
wasAssociatedWith(a1, ag,
[role="creator"])
wasDerivedFrom(e2, e1)
wasDerivedFrom(e3, e2)
endDocument
[Figure: graph rendering of the seed document: activities a1 (type: create), a2 and a3 (type: edit); entities e1 (type: Document, version: original), e2 and e3 (type: Document); agent ag (type: Person); a plan node (type: prov:plan); edges labelled use, gen, der and assoc]
ProvGen constraints
an Entity must have relationship "WasDerivedFrom" exactly 2 times unless it has
property("version"="original");
the Entity(e1) must not have relationship "WasDerivedFrom" with the Entity(e2)
unless e1 has relationship "Used" with the Activity(a) and e2 has the
relationship "WasGeneratedBy" with the Activity(a);
an Entity must have relationship "WasGeneratedBy" exactly 1 times;
an Entity must have property("version"="original") with probability 0.05;
an Entity must have out degree at most 2;
an Activity must have relationship "Used" at most 1 times;
an Activity must have property("type"="create") with probability 0.01;
an Activity must have relationship "WasAssociatedWith" exactly 1 times;
an Activity must have relationship "Used" exactly 1 times unless it has
property("type"="create");
an Activity must have relationship "WasGeneratedBy" exactly 1 times;
an Agent must have relationship "WasAssociatedWith" with probability 0.1;
an Agent must have relationship "WasAssociatedWith" between 1, 120 times with
distribution gamma(1.3, 2.4);
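A cardinality rule of the form "an X must have relationship R exactly n times unless it has property P" can be checked against a graph with a few lines of code. The sketch below is not ProvGen's implementation: the function `check_cardinality`, the node and edge encoding, and the decision to count edges incident to a node in either direction are all assumptions made for illustration. The data is the seed graph from the previous slide.

```python
def check_cardinality(nodes, edges, kind, relationship, low, high, unless=None):
    """Return ids of nodes of the given kind that break a cardinality rule.

    nodes:  {node_id: (kind, properties)}
    edges:  list of (source, relationship, target)
    unless: optional (property, value) pair that exempts a node
    Assumption: an edge counts for a node if the node is its source or target.
    """
    violations = []
    for node_id, (node_kind, props) in nodes.items():
        if node_kind != kind:
            continue
        if unless is not None and props.get(unless[0]) == unless[1]:
            continue
        count = sum(
            1 for src, rel, dst in edges
            if rel == relationship and node_id in (src, dst)
        )
        if not low <= count <= high:
            violations.append(node_id)
    return violations


# The seed graph shown earlier, in this sketch's encoding
nodes = {
    "e1": ("Entity", {"version": "original"}),
    "e2": ("Entity", {}),
    "e3": ("Entity", {}),
    "a1": ("Activity", {"type": "create"}),
    "a2": ("Activity", {"type": "edit"}),
    "a3": ("Activity", {"type": "edit"}),
    "ag": ("Agent", {}),
}
edges = [
    ("a2", "Used", "e1"), ("a3", "Used", "e2"),
    ("e1", "WasGeneratedBy", "a1"), ("e2", "WasGeneratedBy", "a2"),
    ("e3", "WasGeneratedBy", "a3"),
    ("e2", "WasDerivedFrom", "e1"), ("e3", "WasDerivedFrom", "e2"),
]

print(check_cardinality(nodes, edges, "Entity", "WasDerivedFrom", 2, 2,
                        unless=("version", "original")))
```

Under these assumptions only e3 breaks the first rule, which is where a generator would have to keep growing the graph.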
Some test queries
Generated graph loaded to Neo4J GDBMS
Queries expressed using the Cypher graph query language
Transitive closure over Derivation:
Return all the derivation chains, along with their length
MATCH p = (a)-[:`WASDERIVEDFROM`*]->(b)
RETURN a, b, length(p)
Return the 50 longest derivation chains (longer than 10 steps)
MATCH p = (a)-[:`WASDERIVEDFROM`*]->(b)
WHERE length(p) > 10
RETURN a, b, length(p)
ORDER BY length(p) DESC LIMIT 50
All agents and their associated activities
MATCH (a)-[:`WASASSOCIATEDWITH`]->(b)
RETURN b AS Agent, a AS Activity
All agents who created new documents
MATCH (a {type:'create'})-[:`WASASSOCIATEDWITH`]->(b)
RETURN a, b LIMIT 25
All agents who edited a document that was derived from an original
MATCH (doc1 {version:'original'})<-[:`WASDERIVEDFROM`]-(doc2)
-[:`WASGENERATEDBY`]->(act)-[:`WASASSOCIATEDWITH`]->(agent)
RETURN agent LIMIT 25
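For readers without a graph database at hand, the transitive-closure query over derivation can be approximated in memory. This is a sketch under stated assumptions, not a replacement for the Cypher queries: the function `derivation_chains` is hypothetical, it takes wasDerivedFrom as (derived, source) pairs, and it enumerates every path, which is only practical on small graphs.

```python
def derivation_chains(derived_from):
    """List (start, end, length) for every wasDerivedFrom path.

    derived_from: iterable of (derived, source) pairs.
    Results are sorted longest chain first.
    """
    sources = {}
    for derived, source in derived_from:
        sources.setdefault(derived, []).append(source)

    chains = []

    def walk(start, node, length, on_path):
        for nxt in sources.get(node, []):
            if nxt in on_path:  # guard against cycles in invalid provenance
                continue
            chains.append((start, nxt, length + 1))
            walk(start, nxt, length + 1, on_path | {nxt})

    for start in sources:
        walk(start, start, 0, {start})
    return sorted(chains, key=lambda chain: -chain[2])


# Derivations from the ProvGen seed graph
chains = derivation_chains([("e2", "e1"), ("e3", "e2")])
print(chains[0])
```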
Provenance of Big Data
Provenance of analytics processes:
“Prediction provenance”
• Train a model → provenance of the model as a record of the training
process and data involved
• Use the model to make predictions → provenance of the prediction
Relations may be given identifiers
entity(ex:draftComments)
entity(ex:draftV1)
activity(ex:commenting)
wasGeneratedBy(gen1; ex:draftComments, ex:commenting, -)
used(use1; ex:commenting, ex:draftV1, -)
gen1 denotes a generation event
use1 denotes a usage event
Relation IDs make it possible to refer to relations in other relations
General derivation relation:
wasDerivedFrom(id; e2, e1, a, g2, u1, attrs)
Rendering N-ary relations in PROV-O
RDF is for binary relations; N-ary relations require reification
entity(ex:draftComments)
entity(ex:draftV1)
activity(ex:commenting)
wasGeneratedBy(gen1; ex:draftComments,
ex:commenting,
2013-03-18T10:00:01)
used(use1; ex:commenting, ex:draftV1, -)
:draftComments a prov:Entity ;
prov:qualifiedGeneration :gen1 .
:gen1 a prov:Generation ;
prov:activity :commenting ;
prov:atTime "2013-03-18T10:00:01+09:00"^^xsd:dateTime .
:commenting a prov:Activity ;
prov:qualifiedUsage :use1 .
:use1 a prov:Usage ;
:note "found comments useful" ;
prov:atTime "2013-03-21T10:00:01+09:00"^^xsd:dateTime ;
prov:entity :draftV1 .
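The reification step can be sketched as a function that turns one identified generation into binary triples. The function name `qualify_generation` and the tuple-of-strings triple encoding are hypothetical; the PROV-O terms used (prov:wasGeneratedBy, prov:qualifiedGeneration, prov:Generation, prov:activity, prov:atTime) are the ones appearing in the example above.

```python
def qualify_generation(gen_id, entity, activity, time=None):
    """Express wasGeneratedBy(gen_id; entity, activity, time) as triples."""
    triples = [
        # the plain binary relation, which cannot carry a time or an id
        (entity, "prov:wasGeneratedBy", activity),
        # the reified form: the generation becomes a resource of its own
        (entity, "prov:qualifiedGeneration", gen_id),
        (gen_id, "rdf:type", "prov:Generation"),
        (gen_id, "prov:activity", activity),
    ]
    if time is not None:
        triples.append((gen_id, "prov:atTime", f'"{time}"^^xsd:dateTime'))
    return triples


triples = qualify_generation(
    ":gen1", ":draftComments", ":commenting", "2013-03-18T10:00:01+09:00"
)
for subject, predicate, obj in triples:
    print(subject, predicate, obj, ".")
```

The extra node :gen1 is what lets a time, a role or any other attribute be attached to the relation itself.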
“Qualified relation” RDF pattern
Plans: why was something done?
Most relation types are binary, with both arguments drawn from {Entity, Activity, Agent}
Derivation is one exception:
wasDerivedFrom(id; e2, e1, a, g2, u1, attrs)
Two other notable exceptions:
- Associations with a plan
- Delegation with an activity scope
wasAssociatedWith(id; a, ag, pl, attrs)
A plan is an entity that represents a set of actions or steps
intended by one or more agents to achieve some goal
Association with a plan
A plan plays a role in an association
Plans are typed entities
activity(ex:_aProgramExecution, [ex:execTime="22.5sec"])
agent(ex:_aJVM, [prov:type = 'JVM-6.0'])
entity(ex:myCleverProgram,
[prov:type='prov:Plan', ex:label='Program 1'])
wasAssociatedWith(ex:_aProgramExecution, ex:_aJVM,
ex:myCleverProgram,
[prov:role='defaultRuntime',
ex:accessPath="webapp" ])
A plan is an entity having prov:type = 'prov:Plan'
Plan pattern as PROV-O
:_aProgramExecution a prov:Activity ;
:execTime "22.5sec" ;
prov:qualifiedAssociation [ a prov:Association ;
:accessPath "webapp" ;
prov:agent :_aJVM ;
prov:hadPlan :myCleverProgram ;
prov:hadRole "defaultRuntime" ] .
:_aJVM a prov:Agent, "Java-6.0" .
:myCleverProgram a prov:Entity, prov:Plan .
Delegation within an activity scope
Real-world artifacts vs provenance entities
ref: http://www.w3.org/2001/sw/wiki/PROV-FAQ#Examples_of_Provenance
“What do I know about the car I see in this Cambridge street today?”
• It was bought by Joe in 2011
• Joe drove it to Boston on March 16th, 2013. The car has now got 10,000 miles on it
• Joe drove it to Cambridge on March 18th, 2013.
“Same” car, but different provenance at each stage of its evolution
Alternate-specialization pattern
Two alternate entities present aspects of the same thing. These aspects may be the same or
different, and the alternate entities may or may not overlap in time.
An entity that is a specialization of another shares all aspects of the latter, and additionally
presents more specific aspects of the same thing as the latter.
...But, this is still that car!
Semantic notes:
1. Specialization implies alternate: IF specializationOf(e1,e2) THEN alternateOf(e1,e2).
2. Alternate is symmetric: IF alternateOf(e1,e2) THEN alternateOf(e2,e1)
3. Specialization is transitive: IF specializationOf(e1,e2) and specializationOf(e2,e3) THEN specializationOf(e1,e3).
[Figure annotations: “differing in their location”; “same owner, added location”]
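The three semantic notes can be applied mechanically to a set of statements. The sketch below implements only those three rules; the function name `close_alternates` and the entity names in the example are hypothetical, loosely following the car example.

```python
def close_alternates(specializations, alternates=()):
    """Apply the three semantic notes to sets of (e1, e2) pairs.

    Returns (specializationOf pairs, alternateOf pairs) after closure.
    """
    spec = set(specializations)
    # Note 3: specialization is transitive
    while True:
        new = {(a, d) for a, b in spec for c, d in spec if b == c} - spec
        if not new:
            break
        spec |= new
    # Note 1: specialization implies alternate
    alt = set(alternates) | spec
    # Note 2: alternate is symmetric
    alt |= {(b, a) for a, b in alt}
    return spec, alt


spec, alt = close_alternates(
    specializations={("car_in_cambridge", "joes_car"), ("joes_car", "car")},
    alternates={("car_in_boston", "car_in_cambridge")},
)
print(("car_in_cambridge", "car") in spec)
```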
Reserved attributes and types
A small set of reserved attributes, with some usage restrictions
Bundles, provenance of provenance
A bundle is a named set of provenance descriptions, and is itself an entity,
so allowing provenance of provenance to be expressed.
bundle pm:bundle1
entity(ex:draftComments)
entity(ex:draftV1)
activity(ex:commenting)
wasGeneratedBy(ex:draftComments, ex:commenting,-)
used(ex:commenting, ex:draftV1, -)
endBundle
...
entity(pm:bundle1, [ prov:type='prov:Bundle' ])
agent(ex:Bob)
wasGeneratedBy(pm:bundle1, -, 2013-03-20T10:30:00)
wasAttributedTo(pm:bundle1, ex:Bob)
Bundles in PROV-O
Bundle definition (an RDF named graph):
ex:bundle1 {
:draftComments a prov:Entity ;
:status "blah" ;
prov:wasGeneratedBy :commenting .
:commenting a prov:Activity ;
prov:used :draftV1 .
:draftV1 a prov:Entity .
}
Bundle usage:
ex:bundle1 a prov:Entity, prov:Bundle ;
prov:qualifiedGeneration [ a prov:Generation ;
prov:atTime "2013-03-20T10:30:00+09:00"^^xsd:dateTime ] ;
prov:wasAttributedTo :Bob .
PROV-DM relations at a glance
Component Structure for PROV
Core vs Extended
Time, Events
- Provenance statements are combined by different systems
- An application may not be able to align the times involved to a single
global timeline
Therefore, PROV minimizes assumptions about time
Instead, the PROV data model is implicitly based on a notion of
instantaneous events, that mark transitions in the world (*)
Events:
- activity start, activity end
- entity generation, entity usage, entity invalidation
wasStartedBy(id; a2, e, a1, t, attrs)
wasEndedBy(id; a2, e, a1, t, attrs)
(*) PROV-CONSTR http://www.w3.org/TR/prov-constraints/#events (non-normative)
From “scruffy” provenance to “valid” provenance
- Are all possible temporal partial orderings of events equally acceptable?
- How can we specify the set of all valid orderings?
More generally, how do we formally define what it means for a set of
provenance statements to be valid?
PROV defines a set of temporal constraints that ensure consistency
of a provenance graph
Summary
• Motivation for collecting provenance of data and information
• In Science
• In the Social Web
• The W3C PROV Recommendation (2013)
• PROV-DM: The PROV data model
• PROV-O: the Provenance Ontology
• (PROV-CONSTRAINTS)
• Provenance as Big Data
• High volume provenance
• Storage, analytics, visualisation
• Provenance of analytics
• How can I explain my predictions?
• The ProvGen tool
Selected bibliography
Moreau, Luc, Paolo Missier, Khalid Belhajjame, Reza B’Far, James Cheney, Sam Coppens, Stephen Cresswell,
et al. PROV-DM: The PROV Data Model. Edited by Luc Moreau and Paolo Missier, 2012.
http://www.w3.org/TR/prov-dm/
Cheney, James, Paolo Missier, and Luc Moreau. Constraints of the Provenance Data Model, 2012.
http://www.w3.org/TR/prov-constraints/
Moreau, Luc, Paul Groth, James Cheney, Timothy Lebo, and Simon Miles. “The Rationale of PROV.” Web
Semantics: Science, Services and Agents on the World Wide Web (April 2015).
doi:10.1016/j.websem.2015.04.001.
http://www.sciencedirect.com/science/article/pii/S1570826815000177
Marinho, Anderson, Leonardo Murta, Cláudia Werner, Vanessa Braganholo, Sérgio Manuel Serra da Cruz,
Eduardo Ogasawara, and Marta Mattoso. “ProvManager: a Provenance Management System for Scientific
Workflows.” Concurrency and Computation: Practice and Experience 24, no. 13 (2012): 1513–1530.
http://dx.doi.org/10.1002/cpe.1870.
ProvGen: generating synthetic PROV graphs with predictable structure.
Firth, H.; and Missier, P. In Procs. IPAW 2014 (Provenance and Annotations), Koln, Germany, 2014. Springer
http://arxiv.org/pdf/1406.2495
ProvAbs: model, policy, and tooling for abstracting PROV graphs.
Missier, P.; Bryans, J.; Gamble, C.; Curcin, V.; and Danger, R. In Procs. IPAW 2014 (Provenance and
Annotations), Koln, Germany, 2014. Springer http://arxiv.org/pdf/1406.1998
De Oliveira, Daniel, Vítor Silva, and Marta Mattoso. “How Much Domain Data Should Be in Provenance
Databases?” In 7th USENIX Workshop on the Theory and Practice of Provenance (TaPP 15). Edinburgh,
Scotland: USENIX Association, 2015.
https://www.usenix.org/conference/tapp15/workshop-program/presentation/de-oliveira
provenance.ecs.soton.ac.uk
ProvValidator
ProvTranslator
ProvStore
ProvToolbox
ProvPy

 
ContentMining and Copyright at CopyCamp2017
ContentMining and Copyright at CopyCamp2017ContentMining and Copyright at CopyCamp2017
ContentMining and Copyright at CopyCamp2017petermurrayrust
 
Mining and Understanding Activities and Resources on the Web
Mining and Understanding Activities and Resources on the WebMining and Understanding Activities and Resources on the Web
Mining and Understanding Activities and Resources on the WebStefan Dietze
 
Doing Clever Things with the Semantic Web
Doing Clever Things with the Semantic WebDoing Clever Things with the Semantic Web
Doing Clever Things with the Semantic WebMathieu d'Aquin
 
Demo: Profiling & Exploration of Linked Open Data
Demo: Profiling & Exploration of Linked Open DataDemo: Profiling & Exploration of Linked Open Data
Demo: Profiling & Exploration of Linked Open DataStefan Dietze
 

What's hot (20)

Keystone Summer School 2015: Mauro Dragoni, Ontologies For Information Retrieval
Keystone Summer School 2015: Mauro Dragoni, Ontologies For Information RetrievalKeystone Summer School 2015: Mauro Dragoni, Ontologies For Information Retrieval
Keystone Summer School 2015: Mauro Dragoni, Ontologies For Information Retrieval
 
WWW2013 Tutorial: Linked Data & Education
WWW2013 Tutorial: Linked Data & EducationWWW2013 Tutorial: Linked Data & Education
WWW2013 Tutorial: Linked Data & Education
 
Full Erdmann Ruttenberg Community Approaches to Open Data at Scale
Full Erdmann Ruttenberg Community Approaches to Open Data at ScaleFull Erdmann Ruttenberg Community Approaches to Open Data at Scale
Full Erdmann Ruttenberg Community Approaches to Open Data at Scale
 
LAK Dataset and Challenge (April 2013)
LAK Dataset and Challenge (April 2013)LAK Dataset and Challenge (April 2013)
LAK Dataset and Challenge (April 2013)
 
Ziegler Open Data in Special Collections Libraries
Ziegler Open Data in Special Collections LibrariesZiegler Open Data in Special Collections Libraries
Ziegler Open Data in Special Collections Libraries
 
VALA 2016 L-Plate session on Linked Open Data
VALA 2016 L-Plate session on Linked Open DataVALA 2016 L-Plate session on Linked Open Data
VALA 2016 L-Plate session on Linked Open Data
 
Learning Analytics & Linked Data – Opportunities, Challenges, Examples
Learning Analytics & Linked Data – Opportunities, Challenges, ExamplesLearning Analytics & Linked Data – Opportunities, Challenges, Examples
Learning Analytics & Linked Data – Opportunities, Challenges, Examples
 
Data hv seminar_thadthong_v05_slshr
Data hv seminar_thadthong_v05_slshrData hv seminar_thadthong_v05_slshr
Data hv seminar_thadthong_v05_slshr
 
Broad Data
Broad DataBroad Data
Broad Data
 
Open Educational Data - Datasets and APIs (Athens Green Hackathon 2012)
Open Educational Data - Datasets and APIs (Athens Green Hackathon 2012)Open Educational Data - Datasets and APIs (Athens Green Hackathon 2012)
Open Educational Data - Datasets and APIs (Athens Green Hackathon 2012)
 
The SFX Framework for Context-Sensitive Reference Linking
The SFX Framework for  Context-Sensitive Reference LinkingThe SFX Framework for  Context-Sensitive Reference Linking
The SFX Framework for Context-Sensitive Reference Linking
 
McGeary Data Curation Network: Developing and Scaling
McGeary Data Curation Network: Developing and ScalingMcGeary Data Curation Network: Developing and Scaling
McGeary Data Curation Network: Developing and Scaling
 
FAIR Signposting: A KISS Approach to a Burning Issue
FAIR Signposting: A KISS Approach to a Burning IssueFAIR Signposting: A KISS Approach to a Burning Issue
FAIR Signposting: A KISS Approach to a Burning Issue
 
WWW2014 Tutorial: Online Learning & Linked Data - Lessons Learned
WWW2014 Tutorial: Online Learning & Linked Data - Lessons LearnedWWW2014 Tutorial: Online Learning & Linked Data - Lessons Learned
WWW2014 Tutorial: Online Learning & Linked Data - Lessons Learned
 
ContentMining and Copyright at CopyCamp2017
ContentMining and Copyright at CopyCamp2017ContentMining and Copyright at CopyCamp2017
ContentMining and Copyright at CopyCamp2017
 
Mining and Understanding Activities and Resources on the Web
Mining and Understanding Activities and Resources on the WebMining and Understanding Activities and Resources on the Web
Mining and Understanding Activities and Resources on the Web
 
Doing Clever Things with the Semantic Web
Doing Clever Things with the Semantic WebDoing Clever Things with the Semantic Web
Doing Clever Things with the Semantic Web
 
Demo: Profiling & Exploration of Linked Open Data
Demo: Profiling & Exploration of Linked Open DataDemo: Profiling & Exploration of Linked Open Data
Demo: Profiling & Exploration of Linked Open Data
 
NISO Webinar: Taking Your Website Wherever You Go: Delivering Great User Expe...
NISO Webinar: Taking Your Website Wherever You Go: Delivering Great User Expe...NISO Webinar: Taking Your Website Wherever You Go: Delivering Great User Expe...
NISO Webinar: Taking Your Website Wherever You Go: Delivering Great User Expe...
 
Washington Linked Data Authority Service at University of Houston
Washington Linked Data Authority Service at University of HoustonWashington Linked Data Authority Service at University of Houston
Washington Linked Data Authority Service at University of Houston
 

Similar to Keystone summer school 2015 paolo-missier-provenance

Big data-and-creativity v.1
Big data-and-creativity v.1Big data-and-creativity v.1
Big data-and-creativity v.1Kim Flintoff
 
LinkedUp Open Education Panel session
LinkedUp Open Education Panel sessionLinkedUp Open Education Panel session
LinkedUp Open Education Panel sessionMarieke Guy
 
Carrying the Banner: Reinventing News on Your University Website
Carrying the Banner: Reinventing News on Your University WebsiteCarrying the Banner: Reinventing News on Your University Website
Carrying the Banner: Reinventing News on Your University WebsiteGeorgiana Cohen
 
Computational Verification Challenges in Social Media
Computational Verification Challenges in Social MediaComputational Verification Challenges in Social Media
Computational Verification Challenges in Social MediaSymeon Papadopoulos
 
ODI Node Vienna: Best Practise Beispiele für: Open Innovation mittels Open Data
ODI Node Vienna: Best Practise Beispiele für: Open Innovation mittels Open DataODI Node Vienna: Best Practise Beispiele für: Open Innovation mittels Open Data
ODI Node Vienna: Best Practise Beispiele für: Open Innovation mittels Open DataMartin Kaltenböck
 
Data Driven Journalism Links and Resources
Data Driven Journalism Links and Resources Data Driven Journalism Links and Resources
Data Driven Journalism Links and Resources Amy Weiss
 
Data Science For Social Good: Tackling the Challenge of Homelessness
Data Science For Social Good: Tackling the Challenge of HomelessnessData Science For Social Good: Tackling the Challenge of Homelessness
Data Science For Social Good: Tackling the Challenge of HomelessnessAnita Luthra
 
2013.07.22 Tom De Nies - METHOD 2013 - Easy Access to Provenance: an Essenti...
2013.07.22  Tom De Nies - METHOD 2013 - Easy Access to Provenance: an Essenti...2013.07.22  Tom De Nies - METHOD 2013 - Easy Access to Provenance: an Essenti...
2013.07.22 Tom De Nies - METHOD 2013 - Easy Access to Provenance: an Essenti...tdenies
 
Emerging Trends in Crisis Informatics
Emerging Trends in Crisis InformaticsEmerging Trends in Crisis Informatics
Emerging Trends in Crisis InformaticsAdam Papendieck
 
Lecture 4: How do we MINE, ANALYSE & VISUALISE the Social Web? (VU Amsterdam ...
Lecture 4: How do we MINE, ANALYSE & VISUALISE the Social Web? (VU Amsterdam ...Lecture 4: How do we MINE, ANALYSE & VISUALISE the Social Web? (VU Amsterdam ...
Lecture 4: How do we MINE, ANALYSE & VISUALISE the Social Web? (VU Amsterdam ...Lora Aroyo
 
Learn to speak open
Learn to speak openLearn to speak open
Learn to speak openLilian Juma
 
Safecast Report 2017 - Part 1 Safecast Project- Final
Safecast Report 2017 - Part 1 Safecast Project- FinalSafecast Report 2017 - Part 1 Safecast Project- Final
Safecast Report 2017 - Part 1 Safecast Project- FinalSafecast
 
ocTEL and Open Badges #altc
ocTEL and Open Badges #altcocTEL and Open Badges #altc
ocTEL and Open Badges #altcMartin Hawksey
 
Linas Eriksonas, Social networks of startup entrepreneurs:  the case of the s...
Linas Eriksonas, Social networks of startup entrepreneurs: the case of the s...Linas Eriksonas, Social networks of startup entrepreneurs: the case of the s...
Linas Eriksonas, Social networks of startup entrepreneurs:  the case of the s...Linas Eriksonas
 
Linas Eriksonas, Social networks of startup entrepreneurs:  the case of the s...
Linas Eriksonas, Social networks of startup entrepreneurs: the case of the s...Linas Eriksonas, Social networks of startup entrepreneurs: the case of the s...
Linas Eriksonas, Social networks of startup entrepreneurs:  the case of the s...Linas Eriksonas
 
The Age of Data Driven Science and Engineering
The Age of Data Driven Science and Engineering The Age of Data Driven Science and Engineering
The Age of Data Driven Science and Engineering Persontyle
 
Semantic Web in the Plateau of Productivity
Semantic Web in the Plateau of ProductivitySemantic Web in the Plateau of Productivity
Semantic Web in the Plateau of ProductivityIoannis Stavrakantonakis
 

Similar to Keystone summer school 2015 paolo-missier-provenance (20)

Processing Large Complex Data
Processing Large Complex DataProcessing Large Complex Data
Processing Large Complex Data
 
Big data-and-creativity v.1
Big data-and-creativity v.1Big data-and-creativity v.1
Big data-and-creativity v.1
 
LinkedUp Open Education Panel session
LinkedUp Open Education Panel sessionLinkedUp Open Education Panel session
LinkedUp Open Education Panel session
 
Carrying the Banner: Reinventing News on Your University Website
Carrying the Banner: Reinventing News on Your University WebsiteCarrying the Banner: Reinventing News on Your University Website
Carrying the Banner: Reinventing News on Your University Website
 
Computational Verification Challenges in Social Media
Computational Verification Challenges in Social MediaComputational Verification Challenges in Social Media
Computational Verification Challenges in Social Media
 
Open Goverment Data: Insights from the International Open Goverment Data Conf...
Open Goverment Data: Insights from the International Open Goverment Data Conf...Open Goverment Data: Insights from the International Open Goverment Data Conf...
Open Goverment Data: Insights from the International Open Goverment Data Conf...
 
ODI Node Vienna: Best Practise Beispiele für: Open Innovation mittels Open Data
ODI Node Vienna: Best Practise Beispiele für: Open Innovation mittels Open DataODI Node Vienna: Best Practise Beispiele für: Open Innovation mittels Open Data
ODI Node Vienna: Best Practise Beispiele für: Open Innovation mittels Open Data
 
Here Comes Everything
Here Comes EverythingHere Comes Everything
Here Comes Everything
 
Data Driven Journalism Links and Resources
Data Driven Journalism Links and Resources Data Driven Journalism Links and Resources
Data Driven Journalism Links and Resources
 
Data Science For Social Good: Tackling the Challenge of Homelessness
Data Science For Social Good: Tackling the Challenge of HomelessnessData Science For Social Good: Tackling the Challenge of Homelessness
Data Science For Social Good: Tackling the Challenge of Homelessness
 
2013.07.22 Tom De Nies - METHOD 2013 - Easy Access to Provenance: an Essenti...
2013.07.22  Tom De Nies - METHOD 2013 - Easy Access to Provenance: an Essenti...2013.07.22  Tom De Nies - METHOD 2013 - Easy Access to Provenance: an Essenti...
2013.07.22 Tom De Nies - METHOD 2013 - Easy Access to Provenance: an Essenti...
 
Emerging Trends in Crisis Informatics
Emerging Trends in Crisis InformaticsEmerging Trends in Crisis Informatics
Emerging Trends in Crisis Informatics
 
Lecture 4: How do we MINE, ANALYSE & VISUALISE the Social Web? (VU Amsterdam ...
Lecture 4: How do we MINE, ANALYSE & VISUALISE the Social Web? (VU Amsterdam ...Lecture 4: How do we MINE, ANALYSE & VISUALISE the Social Web? (VU Amsterdam ...
Lecture 4: How do we MINE, ANALYSE & VISUALISE the Social Web? (VU Amsterdam ...
 
Learn to speak open
Learn to speak openLearn to speak open
Learn to speak open
 
Safecast Report 2017 - Part 1 Safecast Project- Final
Safecast Report 2017 - Part 1 Safecast Project- FinalSafecast Report 2017 - Part 1 Safecast Project- Final
Safecast Report 2017 - Part 1 Safecast Project- Final
 
ocTEL and Open Badges #altc
ocTEL and Open Badges #altcocTEL and Open Badges #altc
ocTEL and Open Badges #altc
 
Linas Eriksonas, Social networks of startup entrepreneurs:  the case of the s...
Linas Eriksonas, Social networks of startup entrepreneurs: the case of the s...Linas Eriksonas, Social networks of startup entrepreneurs: the case of the s...
Linas Eriksonas, Social networks of startup entrepreneurs:  the case of the s...
 
Linas Eriksonas, Social networks of startup entrepreneurs:  the case of the s...
Linas Eriksonas, Social networks of startup entrepreneurs: the case of the s...Linas Eriksonas, Social networks of startup entrepreneurs: the case of the s...
Linas Eriksonas, Social networks of startup entrepreneurs:  the case of the s...
 
The Age of Data Driven Science and Engineering
The Age of Data Driven Science and Engineering The Age of Data Driven Science and Engineering
The Age of Data Driven Science and Engineering
 
Semantic Web in the Plateau of Productivity
Semantic Web in the Plateau of ProductivitySemantic Web in the Plateau of Productivity
Semantic Web in the Plateau of Productivity
 

More from Paolo Missier

Towards explanations for Data-Centric AI using provenance records
Towards explanations for Data-Centric AI using provenance recordsTowards explanations for Data-Centric AI using provenance records
Towards explanations for Data-Centric AI using provenance recordsPaolo Missier
 
Interpretable and robust hospital readmission predictions from Electronic Hea...
Interpretable and robust hospital readmission predictions from Electronic Hea...Interpretable and robust hospital readmission predictions from Electronic Hea...
Interpretable and robust hospital readmission predictions from Electronic Hea...Paolo Missier
 
Data-centric AI and the convergence of data and model engineering: opportunit...
Data-centric AI and the convergence of data and model engineering:opportunit...Data-centric AI and the convergence of data and model engineering:opportunit...
Data-centric AI and the convergence of data and model engineering: opportunit...Paolo Missier
 
Realising the potential of Health Data Science: opportunities and challenges ...
Realising the potential of Health Data Science:opportunities and challenges ...Realising the potential of Health Data Science:opportunities and challenges ...
Realising the potential of Health Data Science: opportunities and challenges ...Paolo Missier
 
Provenance Week 2023 talk on DP4DS (Data Provenance for Data Science)
Provenance Week 2023 talk on DP4DS (Data Provenance for Data Science)Provenance Week 2023 talk on DP4DS (Data Provenance for Data Science)
Provenance Week 2023 talk on DP4DS (Data Provenance for Data Science)Paolo Missier
 
A Data-centric perspective on Data-driven healthcare: a short overview
A Data-centric perspective on Data-driven healthcare: a short overviewA Data-centric perspective on Data-driven healthcare: a short overview
A Data-centric perspective on Data-driven healthcare: a short overviewPaolo Missier
 
Capturing and querying fine-grained provenance of preprocessing pipelines in ...
Capturing and querying fine-grained provenance of preprocessing pipelines in ...Capturing and querying fine-grained provenance of preprocessing pipelines in ...
Capturing and querying fine-grained provenance of preprocessing pipelines in ...Paolo Missier
 
Tracking trajectories of multiple long-term conditions using dynamic patient...
Tracking trajectories of  multiple long-term conditions using dynamic patient...Tracking trajectories of  multiple long-term conditions using dynamic patient...
Tracking trajectories of multiple long-term conditions using dynamic patient...Paolo Missier
 
Delivering on the promise of data-driven healthcare: trade-offs, challenges, ...
Delivering on the promise of data-driven healthcare: trade-offs, challenges, ...Delivering on the promise of data-driven healthcare: trade-offs, challenges, ...
Delivering on the promise of data-driven healthcare: trade-offs, challenges, ...Paolo Missier
 
Digital biomarkers for preventive personalised healthcare
Digital biomarkers for preventive personalised healthcareDigital biomarkers for preventive personalised healthcare
Digital biomarkers for preventive personalised healthcarePaolo Missier
 
Digital biomarkers for preventive personalised healthcare
Digital biomarkers for preventive personalised healthcareDigital biomarkers for preventive personalised healthcare
Digital biomarkers for preventive personalised healthcarePaolo Missier
 
Data Provenance for Data Science
Data Provenance for Data ScienceData Provenance for Data Science
Data Provenance for Data SciencePaolo Missier
 
Capturing and querying fine-grained provenance of preprocessing pipelines in ...
Capturing and querying fine-grained provenance of preprocessing pipelines in ...Capturing and querying fine-grained provenance of preprocessing pipelines in ...
Capturing and querying fine-grained provenance of preprocessing pipelines in ...Paolo Missier
 
Quo vadis, provenancer?  Cui prodest?  our own trajectory: provenance of data...
Quo vadis, provenancer? Cui prodest? our own trajectory: provenance of data...Quo vadis, provenancer? Cui prodest? our own trajectory: provenance of data...
Quo vadis, provenancer?  Cui prodest?  our own trajectory: provenance of data...Paolo Missier
 
Data Science for (Health) Science: tales from a challenging front line, and h...
Data Science for (Health) Science:tales from a challenging front line, and h...Data Science for (Health) Science:tales from a challenging front line, and h...
Data Science for (Health) Science: tales from a challenging front line, and h...Paolo Missier
 
Analytics of analytics pipelines: from optimising re-execution to general Dat...
Analytics of analytics pipelines:from optimising re-execution to general Dat...Analytics of analytics pipelines:from optimising re-execution to general Dat...
Analytics of analytics pipelines: from optimising re-execution to general Dat...Paolo Missier
 
ReComp: optimising the re-execution of analytics pipelines in response to cha...
ReComp: optimising the re-execution of analytics pipelines in response to cha...ReComp: optimising the re-execution of analytics pipelines in response to cha...
ReComp: optimising the re-execution of analytics pipelines in response to cha...Paolo Missier
 
ReComp, the complete story: an invited talk at Cardiff University
ReComp, the complete story:  an invited talk at Cardiff UniversityReComp, the complete story:  an invited talk at Cardiff University
ReComp, the complete story: an invited talk at Cardiff UniversityPaolo Missier
 
Efficient Re-computation of Big Data Analytics Processes in the Presence of C...
Efficient Re-computation of Big Data Analytics Processes in the Presence of C...Efficient Re-computation of Big Data Analytics Processes in the Presence of C...
Efficient Re-computation of Big Data Analytics Processes in the Presence of C...Paolo Missier
 
Decentralized, Trust-less Marketplace for Brokered IoT Data Trading using Blo...
Decentralized, Trust-less Marketplacefor Brokered IoT Data Tradingusing Blo...Decentralized, Trust-less Marketplacefor Brokered IoT Data Tradingusing Blo...
Decentralized, Trust-less Marketplace for Brokered IoT Data Trading using Blo...Paolo Missier
 

More from Paolo Missier (20)

Towards explanations for Data-Centric AI using provenance records
Towards explanations for Data-Centric AI using provenance recordsTowards explanations for Data-Centric AI using provenance records
Towards explanations for Data-Centric AI using provenance records
 
Interpretable and robust hospital readmission predictions from Electronic Hea...
Interpretable and robust hospital readmission predictions from Electronic Hea...Interpretable and robust hospital readmission predictions from Electronic Hea...
Interpretable and robust hospital readmission predictions from Electronic Hea...
 
Data-centric AI and the convergence of data and model engineering: opportunit...
Data-centric AI and the convergence of data and model engineering:opportunit...Data-centric AI and the convergence of data and model engineering:opportunit...
Data-centric AI and the convergence of data and model engineering: opportunit...
 
Realising the potential of Health Data Science: opportunities and challenges ...
Realising the potential of Health Data Science:opportunities and challenges ...Realising the potential of Health Data Science:opportunities and challenges ...
Realising the potential of Health Data Science: opportunities and challenges ...
 
Provenance Week 2023 talk on DP4DS (Data Provenance for Data Science)
Provenance Week 2023 talk on DP4DS (Data Provenance for Data Science)Provenance Week 2023 talk on DP4DS (Data Provenance for Data Science)
Provenance Week 2023 talk on DP4DS (Data Provenance for Data Science)
 
A Data-centric perspective on Data-driven healthcare: a short overview
A Data-centric perspective on Data-driven healthcare: a short overviewA Data-centric perspective on Data-driven healthcare: a short overview
A Data-centric perspective on Data-driven healthcare: a short overview
 
Capturing and querying fine-grained provenance of preprocessing pipelines in ...
Capturing and querying fine-grained provenance of preprocessing pipelines in ...Capturing and querying fine-grained provenance of preprocessing pipelines in ...
Capturing and querying fine-grained provenance of preprocessing pipelines in ...
 
Tracking trajectories of multiple long-term conditions using dynamic patient...
Tracking trajectories of  multiple long-term conditions using dynamic patient...Tracking trajectories of  multiple long-term conditions using dynamic patient...
Tracking trajectories of multiple long-term conditions using dynamic patient...
 
Delivering on the promise of data-driven healthcare: trade-offs, challenges, ...
Delivering on the promise of data-driven healthcare: trade-offs, challenges, ...Delivering on the promise of data-driven healthcare: trade-offs, challenges, ...
Delivering on the promise of data-driven healthcare: trade-offs, challenges, ...
 
Digital biomarkers for preventive personalised healthcare
Digital biomarkers for preventive personalised healthcareDigital biomarkers for preventive personalised healthcare
Digital biomarkers for preventive personalised healthcare
 
Digital biomarkers for preventive personalised healthcare
Digital biomarkers for preventive personalised healthcareDigital biomarkers for preventive personalised healthcare
Digital biomarkers for preventive personalised healthcare
 
Data Provenance for Data Science
Data Provenance for Data ScienceData Provenance for Data Science
Data Provenance for Data Science
 
Capturing and querying fine-grained provenance of preprocessing pipelines in ...
Capturing and querying fine-grained provenance of preprocessing pipelines in ...Capturing and querying fine-grained provenance of preprocessing pipelines in ...
Capturing and querying fine-grained provenance of preprocessing pipelines in ...
 
Quo vadis, provenancer?  Cui prodest?  our own trajectory: provenance of data...
Quo vadis, provenancer? Cui prodest? our own trajectory: provenance of data...Quo vadis, provenancer? Cui prodest? our own trajectory: provenance of data...
Quo vadis, provenancer?  Cui prodest?  our own trajectory: provenance of data...
 
Data Science for (Health) Science: tales from a challenging front line, and h...
Data Science for (Health) Science:tales from a challenging front line, and h...Data Science for (Health) Science:tales from a challenging front line, and h...
Data Science for (Health) Science: tales from a challenging front line, and h...
 
Analytics of analytics pipelines: from optimising re-execution to general Dat...
Analytics of analytics pipelines:from optimising re-execution to general Dat...Analytics of analytics pipelines:from optimising re-execution to general Dat...
Analytics of analytics pipelines: from optimising re-execution to general Dat...
 
ReComp: optimising the re-execution of analytics pipelines in response to cha...
ReComp: optimising the re-execution of analytics pipelines in response to cha...ReComp: optimising the re-execution of analytics pipelines in response to cha...
ReComp: optimising the re-execution of analytics pipelines in response to cha...
 
ReComp, the complete story: an invited talk at Cardiff University
ReComp, the complete story:  an invited talk at Cardiff UniversityReComp, the complete story:  an invited talk at Cardiff University
ReComp, the complete story: an invited talk at Cardiff University
 
Efficient Re-computation of Big Data Analytics Processes in the Presence of C...
Efficient Re-computation of Big Data Analytics Processes in the Presence of C...Efficient Re-computation of Big Data Analytics Processes in the Presence of C...
Efficient Re-computation of Big Data Analytics Processes in the Presence of C...
 
Decentralized, Trust-less Marketplace for Brokered IoT Data Trading using Blo...
Decentralized, Trust-less Marketplacefor Brokered IoT Data Tradingusing Blo...Decentralized, Trust-less Marketplacefor Brokered IoT Data Tradingusing Blo...
Decentralized, Trust-less Marketplace for Brokered IoT Data Trading using Blo...
 

Recently uploaded

Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfRankYa
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxhariprasad279825
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostZilliz
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
The Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdfThe Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdfSeasiaInfotech2
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr LapshynFwdays
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
Keystone summer school 2015 paolo-missier-provenance

  • 1. FirstKeystoneSummerSchool– MaltaJuly2015–P.Missier Provenance and the W3C PROV model (in the Big Data context)§ Paolo Missier School of Computing Science Newcastle University, UK Tutorial First Keystone Summer School, Malta, July 2015 Some of the slides courtesy of Luc Moreau – thanks!
  • 3. Lecture goals and outline • What is provenance, and why does it matter? • Definitions and case studies • The W3C PROV standard in a nutshell • PROV-O: the Provenance Ontology and examples of its usage • Provenance and Big Data: what’s the connection? • Opportunities and challenges • Provenance tools [from Southampton]
  • 5. 1- Reproducibility and dissemination in Science. Independent validation of scientific claims is a cornerstone of experimental science • Scientific claims are supported by experiments • How do I express my “material and methods” so that you can independently verify my results? • How do I document my results to promote their understanding and reuse? Provenance is the equivalent of a logbook • Capture all steps involved in the derivation of a result • Replay and validate the execution, compare it with others. To what extent can these be formalised and automated in data-intensive science?
  • 6. 2- Explaining the outcome of a complex decision process • Which process was used to derive a diagnosis? • How did the process use the input data? • How were the steps configured? • Which decisions were made by human experts (clinicians)? [Figure: NGS pipeline for the clinical diagnosis of genetic diseases. Variant filtering (MAF threshold; non-synonymous, stop/gain, frameshift; known polymorphisms; homo/heterozygous; pathogenicity predictors) yields candidate variants. Variant scoping (user-supplied disease keywords, HPO match, HPO to OMIM, OMIM match, OMIM to gene, gene union and intersection with the user-supplied genes list and user-defined preferred genes) selects the variants in scope. Variant classification (ClinVar and OMIM lookup) annotates the patient variants as RED: found, pathogenic; AMBER: not found or uncertain; GREEN: found, benign.]
  • 7. 3- Understanding the results of a computation • Why has my [very complicated algorithm] produced this particular result? • Why is my predictive analytics model suggesting that it will rain tomorrow? • Why is this record part of the result of my database query? • Database provenance • Why is this record included in the result of my keyword search?
  • 8. 4- Content reuse on the Social Web. Open Data, Data Journalism • A consume-select-curate-share workflow, not only professional • Ethos: to expose the data and methods used to produce news items • But: data wrangling can introduce errors • Is the data I am using valid? What is its primary source? What are the transformation steps? Example: NowNews publishes an article based on the latest employment data published by GovStat. PolicyOrg compiles a report including the NowNews article. :L-Moreau a Agent. :original-slide a Entity; wasAttributedTo L-Moreau. :this-slide a Entity; wasDerivedFrom original-slide
  • 9. What is provenance? Oxford English Dictionary: • the fact of coming from some particular source or quarter; origin, derivation • the history or pedigree of a work of art, manuscript, rare book, etc.; • a record of the passage of an item through its various owners. Example: Magna Carta (‘the Great Charter’) was agreed between King John and his barons on 15 June 1215.
  • 10. What is provenance? • Provenance refers to the sources of information, including entities and processes, involved in producing or delivering an artifact (*) • Provenance is a description of how things came to be, and how they came to be in the state they are in today (*) • Provenance is a record that describes the people, institutions, entities, and activities involved in producing, influencing, or delivering a piece of data or a thing in the world
  • 11. Provenance on the Web. Tim Berners-Lee’s “Oh Yeah” button: • A browser button by which the user can express their uncertainty about a document being displayed: “so how do I know I can trust this information?” • Upon activation of the button, the software retrieves metadata about the document, listing assumptions on which trust can be based. http://users.ugent.be/~tdenies/OhYeah/ Reference: Easy Access to Provenance: an Essential Step Towards Trust on the Web, Procs. METHOD 2013: The 2nd IEEE International Workshop on Methods for Establishing Trust with Open Data, held in conjunction with COMPSAC, the IEEE Signature Conference on Computers, Software & Applications, July 22-26, 2013, Kyoto, Japan. http://dx.doi.org/10.1109/COMPSACW.2013.29
  • 12. Provenance in the Semantic Web Stack :L-Moreau a Agent. :original-slide a Entity; wasAttributedTo L-Moreau. :this-slide a Entity; wasDerivedFrom original-slide
  • 13. Use cases on the Social Web. Open Data, Data Journalism. NowNews publishes an article based on the latest employment data published by GovStat. PolicyOrg compiles a report including the NowNews article. People involved: Bob: Journalist; Alice: Data Cruncher; Tom: Editor; Nick: Web Master.
  • 14. Derivation - Timeliness. Derivation: • Charts, graphs and visualizations are all based on multiple data sets • E.g. Bob’s article on employment that appeared in NowNews • Which data was a figure based upon? Is the report based on the most up-to-date data?
  • 15. Derivation - Trusted sources. Derivation: • Is this content derived from data coming from a reliable source? • The chart within Bob’s article is based on GovStat data • However, that information is hidden: the chart was produced by a complex process performed by Alice. Policy rule: “data supplied by the government is reliable”
  • 16. Tracing the source of errors. Derivation, attribution: • When did this error occur? • Who was responsible for the chart? Nick discovers an error in the chart included in Bob’s article. [Figure: now:employment-article-v1.html prov:wasAttributedTo nowpeople:Bob]
  • 17. Ensuring policy compliance. Process inspection: • Which process steps led to publication? • Was an editorial check part of it? Policy rule: “posts are to be checked by an editor prior to publication”
  • 18. Ensuring credit and acknowledgement. NowNews relies on multiple contributors (Bob: Journalist; Alice: Data Cruncher; Tom: Editor; Nick: Web Master). Attribution and responsibility: • How do we ensure that all relevant contributors are acknowledged? [Figure: employment-article-v1.html, Bob and David, linked by attribution (att) and delegation (del) edges]
  • 19. Reproducibility. Documenting the data generation process: • How do we ensure that the figures can be reproduced using the new versions of the data? NowNews must ensure that the article figures reflect the most recent data. [Figure: Alice is associated (assoc) with the data-crunching activity, which uses data-source-A and data-source-B (labelled version: 1.0 and version: 2.0) and generates (gen) employment-article-v1.html]
  • 20. So, why does provenance matter? • To establish quality, relevance, trust • To track information attribution through complex transformations • To enable process analysis for debugging, improvement, evolution • To enable reproducibility of processes (e.g. in science, data journalism…) See also: ACM Journal of Data and Information Quality (JDIQ) - Special Issue on Provenance, Data and Information Quality, Paolo Missier, Paolo Papotti, Eds. Volume 5 Issue 3, February 2015 DOI: 10.1145/2692312 http://dl.acm.org/citation.cfm?id=2700413 http://jdiq.acm.org/archive.cfm?id=2698232
  • 21. The W3C Working Group on Provenance. 2009-2010: W3C Incubator Group on provenance. Chair: Yolanda Gil, ISI, USC. Main output: “Provenance XG Final Report”, http://www.w3.org/2005/Incubator/prov/XGR-prov/ - provides an overview of the various existing approaches and vocabularies - proposes the creation of a dedicated W3C Working Group. April 2011: W3C Working Group approved. Chairs: Luc Moreau, Paul Groth. April 2013: Proposed Recommendations finalised: prov-dm (data model), prov-o (OWL ontology, RDF encoding), prov-n (PROV notation), prov-constraints, plus a number of non-prescriptive Notes. http://www.w3.org/2011/prov/wiki/
  • 22. PROV: scope and structure. Source: http://www.w3.org/TR/prov-overview/ (Recommendation track). See also: Moreau, Luc, and Paul Groth. “Provenance: An Introduction to PROV.” Synthesis Lectures on the Semantic Web: Theory and Technology 3, no. 4 (September 15, 2013): 1–129. doi:10.2200/S00528ED1V01Y201308WBE007.
  • 23. PROV Core Elements (graph depiction) • An entity is a physical, digital, conceptual, or other kind of thing with some fixed aspects; entities may be real or imaginary. • An activity is something that occurs over a period of time and acts upon or with entities; it may include consuming, processing, transforming, ..., using, or generating entities. • An agent is something that bears some form of responsibility for an activity taking place, for the existence of an entity, or for another agent's activity.
  • 24. Generation, Usage • Generation is the completion of production of a new entity by an activity. This entity did not exist before generation and becomes available for usage after this generation. • Usage is the beginning of utilizing an entity by an activity. Before usage, the activity had not begun to utilize this entity. PROV is based on a notion of instantaneous events that mark transitions in the world: generation, usage (and others). Ordering constraints amongst events: “the generation of e must precede each of its usages”; “a can only use or generate e after it has started and before it has ended”.
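The two ordering constraints quoted on this slide can be sketched as a small check. This is only an illustration under strong assumptions: the function name `ordering_violations` and the tuple layout are hypothetical, and all timestamps are assumed to be comparable on one clock, which PROV itself does not require.

```python
from datetime import datetime


def ordering_violations(generation, usages, activities):
    """Check the two ordering constraints quoted on the slide.

    generation: (activity_id, time) for the generation of one entity e
    usages:     list of (activity_id, time) for each usage of e
    activities: dict activity_id -> (start_time, end_time)
    Returns a list of human-readable violations (empty if none).
    """
    violations = []
    _, gen_time = generation

    # Constraint 1: generation of e must precede each of its usages.
    for act, t in usages:
        if t < gen_time:
            violations.append(f"usage by {act} precedes generation")

    # Constraint 2: an activity can only use or generate e
    # after it has started and before it has ended.
    for act, t in [generation, *usages]:
        start, end = activities[act]
        if not (start <= t <= end):
            violations.append(f"event of {act} falls outside its start/end")

    return violations


# Hypothetical timeline for the drafting/commenting example.
activities = {
    "drafting": (datetime(2013, 3, 18, 9, 0), datetime(2013, 3, 18, 10, 0)),
    "commenting": (datetime(2013, 3, 18, 10, 30), datetime(2013, 3, 18, 11, 10)),
}
generation = ("drafting", datetime(2013, 3, 18, 10, 0))
print(ordering_violations(
    generation, [("commenting", datetime(2013, 3, 18, 10, 30))], activities))
```

A real validator would reason over a partial order of events rather than wall-clock times, since provenance from different systems may not share a timeline.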
  • 25. Concepts and relations. Generation of “draft v1” expressed as a relation: wasGeneratedBy(“draft v1”, ...). Usage of “draft v1” by “commenting” expressed as a relation: used(“commenting”, “draft v1”, ...)
  • 26. PROV notation: document prefix prov <http://www.w3.org/ns/prov#> prefix ex <http://www.example.com/> entity(ex:draftComments) entity(ex:draftV1, [ ex:distr='internal', ex:status = "draft"]) entity(ex:paper1) entity(ex:paper2) activity(ex:commenting) activity(ex:drafting) wasGeneratedBy(ex:draftComments, ex:commenting, 2013-03-18T11:10:00) used(ex:commenting, ex:draftV1, -) wasGeneratedBy(ex:draftV1, ex:drafting, -) used(ex:drafting, ex:paper1, -) used(ex:drafting, ex:paper2, -) endDocument
  • 27. Same example, PROV-O notation (RDF/N3): :draftComments a prov:Entity ; :distr "internal"^^xsd:string ; prov:wasGeneratedBy :commenting . :commenting a prov:Activity ; prov:used :draftV1 . :draftV1 a prov:Entity ; :distr "internal"^^xsd:string ; :status "draft"^^xsd:string ; :version "0.1"^^xsd:string ; prov:wasGeneratedBy :drafting . :drafting a prov:Activity ; prov:used :paper1, :paper2 . :paper1 a prov:Entity, "reference"^^xsd:string . :paper2 a prov:Entity, "reference"^^xsd:string .
  • 28. Association, Attribution, Delegation: who did what? • An activity association is an assignment of responsibility to an agent for an activity, indicating that the agent had a role in the activity. • Attribution is the ascribing of an entity to an agent. entity(ex:draftComments, [ ex:distr='internal' ]) activity(ex:commenting) agent(ex:Bob, [prov:type = "mainEditor"] ) agent(ex:Alice, [prov:type = "srEditor"]) wasAssociatedWith(ex:commenting, ex:Bob, -, [prov:role = "editor"]) actedOnBehalfOf(ex:Bob, ex:Alice) wasAttributedTo(ex:draftComments, ex:Bob)
  • 29. Same example, PROV-O notation (RDF/N3): :Alice a prov:Agent, "ex:chiefEditor"; :firstName "Alice"; :lastName "Cooper". :Bob a prov:Agent, "ex:seniorEditor"; :firstName "Robert"; :lastName "Thompson"; prov:actedOnBehalfOf :Alice . :draftComments prov:wasAttributedTo :Bob . :drafting a prov:Activity ; prov:wasAssociatedWith :Bob .
  • 30. Association and Attribution. Q.: what is the relationship between attribution and association? This is defined as an inference rule in the PROV-CONSTR document. Given entity(e) and agent(Ag): IF wasAttributedTo(e, Ag) THEN there exists an activity(a) such that wasGeneratedBy(e, a, -) and wasAssociatedWith(a, Ag, -).
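A minimal sketch of how the attribution rule on this slide could be applied to a set of statements. The function name, the tuple encoding, and the `_:aN` placeholder identifiers are assumptions made for illustration; the rule only says that some generating activity associated with the agent exists, so a fresh placeholder is introduced when no recorded activity fits.

```python
from itertools import count


def apply_attribution_inference(attributed, generated, associated):
    """Apply: wasAttributedTo(e, ag) implies there is an activity a with
    wasGeneratedBy(e, a) and wasAssociatedWith(a, ag).

    attributed: iterable of (entity, agent)
    generated:  iterable of (entity, activity)
    associated: iterable of (activity, agent)
    Returns the (possibly extended) generated and associated sets.
    """
    generated = set(generated)
    associated = set(associated)
    fresh = count(1)  # numbering for hypothetical placeholder activities

    for entity, agent in attributed:
        # Activities already recorded that satisfy both conclusions.
        witnesses = [a for (e, a) in generated
                     if e == entity and (a, agent) in associated]
        if not witnesses:
            # No known witness: introduce an existential placeholder.
            # (A full implementation would later merge it with any other
            # generating activity of the same entity.)
            placeholder = f"_:a{next(fresh)}"
            generated.add((entity, placeholder))
            associated.add((placeholder, agent))

    return generated, associated


gen, assoc = apply_attribution_inference(
    attributed=[("draftComments", "Bob")],
    generated=[],
    associated=[],
)
print(sorted(gen), sorted(assoc))
```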
  • 31. Communication amongst activities. Communication is the exchange of some unspecified entity by two activities, one activity using some entity generated by the other. PROV-N: activity(ex:commenting) activity(ex:drafting) wasInformedBy(ex:commenting, ex:drafting) PROV-O: :drafting a prov:Activity . :commenting a prov:Activity ; prov:wasInformedBy :drafting .
  • 32. Communication, generation, usage. Q.: what is the relationship between communication, generation, and usage? These are inference rules 5 and 6 in the PROV-CONSTR document. activity(ex:commenting) activity(ex:drafting) entity(e) wasInformedBy(ex:commenting, ex:drafting) wasGeneratedBy(e, ex:drafting, -) used(ex:commenting, e, -)
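One direction of the relationship on this slide (an entity generated by one activity and used by another implies communication between them) can be sketched as a join over the two relations. The function name and tuple encoding are hypothetical; the opposite direction would need an existential placeholder entity and is not shown.

```python
def infer_communication(generated, used):
    """Derive wasInformedBy(a2, a1) whenever some entity e satisfies
    wasGeneratedBy(e, a1) and used(a2, e).

    generated: iterable of (entity, activity)
    used:      iterable of (activity, entity)
    Returns a set of (informed_activity, informant_activity) pairs.
    """
    used = list(used)
    return {
        (a2, a1)
        for (e, a1) in generated
        for (a2, e_used) in used
        if e == e_used
    }


print(infer_communication(
    generated=[("draftV1", "drafting")],
    used=[("commenting", "draftV1")],
))
```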
  • 35. Derivation amongst entities. A derivation is a transformation of an entity into another, an update of an entity resulting in a new one, or the construction of a new entity based on a pre-existing entity. PROV-N: entity(ex:draftV1) entity(ex:draftComments) wasDerivedFrom(ex:draftComments, ex:draftV1) PROV-O: :draftComments a prov:Entity ; prov:wasDerivedFrom :draftV1 . :draftV1 a prov:Entity . Q.: what is the relationship between derivation, generation, and usage?
  • 36. Provenance and Big Data: what’s the connection? Opportunities and challenges
  • 37. Provenance {as,of} Big Data 1. BigProv: Provenance as big data • High volume provenance • What kind of analytics are interesting on big provenance? 2. Provenance of analytics processes • “Prediction provenance” • Train a model → provenance of the model as a record of the training process and data involved • Use the model to make predictions → provenance of the prediction 3. Provenance of a search • What is the provenance of a keyword search? • Why would it be interesting? What can we learn from it?
  • 38. Recent research on Provenance as Big Data • Chen, Peng; Plale, Beth A., "Big Data Provenance Analysis and Visualization," 15th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid), pp. 797-800, 4-7 May 2015. doi: 10.1109/CCGrid.2015.85 • Chen, Peng; Plale, Beth A., "ProvErr: System Level Statistical Fault Diagnosis Using Dependency Model," 15th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid), pp. 525-534, 4-7 May 2015. doi: 10.1109/CCGrid.2015.86 • Peter Macko and Margo Seltzer, Harvard University. Provenance Map Orbiter: Interactive Exploration of Large Provenance Graphs. Procs. TAPP’11, 2011, Crete, Greece • Devarshi Ghoshal and Beth Plale. Provenance from Log Files: a BigData Problem. Procs. BigProv workshop, EDBT, Genova, Italy, 2013 • Adam Bates, Kevin Butler and Thomas Moyer. Take Only What You Need: Leveraging Mandatory Access Control Policy to Reduce Provenance Storage Costs. In Procs. TAPP’15 workshop, Edinburgh, 2015. http://workshops.inf.ed.ac.uk/tapp2015/TAPP15_II_3.pdf
  • 39. ProvGen • A Provenance Generator tool for experimenting with provenance at scale • Why generate synthetic provenance? • Synthetic PROV graphs can be a valuable complement to emerging natural provenance collections • … provided their structural properties reflect specific provenance patterns • control over their repetition and variability • varying scales • Useful for benchmarking emerging provenance management systems • Useful to test analytics algorithms that operate on large provenance collections. [Figure: natural provenance collections plotted by trace size against number of traces: science datasets, git2PROV, MediaWiki history, retweet history] Firth, Hugo, and Paolo Missier. “ProvGen: Generating Synthetic PROV Graphs with Predictable Structure.” In Procs. IPAW 2014 (Provenance and Annotations). Köln, Germany: Springer, 2014. http://arxiv.org/pdf/1406.2495
  • 40. What does ProvGen do? • Accept a seed PROV graph • Grow the graph • Add nodes and relationships following the seed graph structure • … with constraints on how to grow. Seed graph: document entity(e1, [type="Document", version="original"]) entity(e2, [type="Document"]) entity(e3, [type="Document"]) activity(a1, [type="create"]) activity(a2, [type="edit"]) activity(a3, [type="edit"]) agent(ag, [type="Person"]) used(a2, e1) used(a3, e2) wasGeneratedBy(e2, a2, [fct="save"]) wasGeneratedBy(e1, a1, [fct="publish"]) wasGeneratedBy(e3, a3, [fct="save"]) wasAssociatedWith(a3, ag, [role="contributor"]) wasAssociatedWith(a2, ag, [role="contributor"]) wasAssociatedWith(a1, ag, [role="creator"]) wasDerivedFrom(e2, e1) wasDerivedFrom(e3, e2) endDocument [Figure: graph depiction of the seed: activities a1 (create), a2 and a3 (edit); Document entities e1 (version: original), e2, e3 linked by use, gen and der edges; agent ag (Person) associated with each activity; a plan node of type prov:plan]
  • 41. ProvGen constraints: an Entity must have relationship "WasDerivedFrom" exactly 2 times unless it has property("version"="original"); the Entity(e1) must not have relationship "WasDerivedFrom" with the Entity(e2) unless e1 has relationship "Used" with the Activity(a) and e2 has the relationship "WasGeneratedBy" with the Activity(a); an Entity must have relationship "WasGeneratedBy" exactly 1 times; an Entity must have property("version"="original") with probability 0.05; an Entity must have out degree at most 2; an Activity must have relationship "Used" at most 1 times; an Activity must have property("type"="create") with probability 0.01; an Activity must have relationship "WasAssociatedWith" exactly 1 times; an Activity must have relationship "Used" exactly 1 times unless it has property("type"="create"); an Activity must have relationship "WasGeneratedBy" exactly 1 times; an Agent must have relationship "WasAssociatedWith" with probability 0.1; an Agent must have relationship "WasAssociatedWith" between 1, 120 times with distribution gamma(1.3, 2.4);
  • 42. Some test queries. The generated graph is loaded into the Neo4j graph DBMS; queries are expressed in the Cypher graph query language. • Transitive closure over derivation: return all the derivation chains, along with their length: MATCH (a)-[r:`WASDERIVEDFROM`*]->(b) RETURN a, b, length(r) • Return the top 50 derivation chains by length (among those longer than 10): MATCH (a)-[r:`WASDERIVEDFROM`*]->(b) WHERE length(r) > 10 RETURN a, b, length(r) ORDER BY length(r) DESC LIMIT 50 • All agents and their associated activities (the edge points from the activity to the agent): MATCH (a)-[:`WASASSOCIATEDWITH`]->(b) RETURN b AS Agent, a AS Activity • All agents who created new documents: MATCH (a{type:'create'})-[:`WASASSOCIATEDWITH`]->(b) RETURN a, b LIMIT 25 • All agents who edited a document that was derived from an original: MATCH (doc1{version:'original'})<-[:`WASDERIVEDFROM`]-(doc2)-[:`WASGENERATEDBY`]->(act)-[:`WASASSOCIATEDWITH`]->(agent) RETURN agent LIMIT 25
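For readers without a graph database at hand, the first query (all derivation chains with their length) can be approximated in plain Python. This is a sketch only: the function name and the pair encoding of wasDerivedFrom edges are assumptions, and the derivation graph is assumed to be acyclic, as a valid provenance graph would be.

```python
def derivation_chains(was_derived_from):
    """Enumerate every derivation path, like a variable-length path match.

    was_derived_from: iterable of (derived_entity, source_entity) pairs.
    Yields (start, end, length) for each path of one or more edges.
    Assumes the derivation graph has no cycles.
    """
    successors = {}
    for derived, source in was_derived_from:
        successors.setdefault(derived, []).append(source)

    def walk(node, start, length):
        # Depth-first expansion of every path leaving `start`.
        for nxt in successors.get(node, []):
            yield (start, nxt, length + 1)
            yield from walk(nxt, start, length + 1)

    for start in list(successors):
        yield from walk(start, start, 0)


# Derivation edges taken from the seed graph: e2 <- e1, e3 <- e2.
edges = [("e2", "e1"), ("e3", "e2")]
for chain in sorted(derivation_chains(edges), key=lambda c: -c[2]):
    print(chain)
```

Sorting by the third field in descending order and truncating the list mirrors the ORDER BY ... LIMIT form of the second query.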
  • 43. Provenance of Big Data. Provenance of analytics processes: “Prediction provenance” • Train a model → provenance of the model as a record of the training process and data involved • Use the model to make predictions → provenance of the prediction
  • 44. Relations may be given identifiers. entity(ex:draftComments) entity(ex:draftV1) activity(ex:commenting) wasGeneratedBy(gen1; ex:draftComments, ex:commenting, -) used(use1; ex:commenting, ex:draftV1, -) Here gen1 denotes a generation event and use1 denotes a usage event. Relation IDs make it possible to refer to relations in other relations. General derivation relation: wasDerivedFrom(id; e2, e1, a, g2, u1, attrs)
  • 45. Rendering N-ary relations in PROV-O. RDF is for binary relations; N-ary relations require reification. PROV-N: entity(ex:draftComments) entity(ex:draftV1) activity(ex:commenting) wasGeneratedBy(gen1; ex:draftComments, ex:commenting, 2013-03-18T10:00:01) used(use1; ex:commenting, ex:draftV1, -) PROV-O: :draftComments a prov:Entity ; prov:qualifiedGeneration :gen1 . :gen1 a prov:Generation ; prov:activity :commenting; prov:atTime "2013-03-18T10:00:01+09:00". :commenting a prov:Activity ; prov:qualifiedUsage :use1 . :use1 a prov:Usage ; :note "found comments useful"; prov:atTime "2013-03-21T10:00:01+09:00"; prov:entity :draftV1.
  • 46. “Qualified relation” RDF pattern: :draftComments a prov:Entity ; prov:qualifiedGeneration :gen1 . :gen1 a prov:Generation ; prov:activity :commenting; prov:atTime "2013-03-18T10:00:01+09:00". :commenting a prov:Activity ; prov:qualifiedUsage :use1 . :use1 a prov:Usage ; :note "found comments useful"; prov:atTime "2013-03-21T10:00:01+09:00"; prov:entity :draftV1.
  • 47. Plans: why was something done? Most relation types have two arguments, each of which is one of {Entity, Activity, Agent}. Derivation is one exception: wasDerivedFrom(id; e2, e1, a, g2, u1, attrs). Two other notable exceptions: - Association with a plan: wasAssociatedWith(id; a, ag, pl, attrs) - Delegation with an activity scope. A plan is an entity that represents a set of actions or steps intended by one or more agents to achieve some goal.
  • 48. Association with a plan. A plan plays a role in an association.
  • 49. Plans are typed entities. activity(ex:_aProgramExecution, [ex:execTime="22.5sec"]) agent(ex:_aJVM, [prov:type = 'JVM-6.0']) entity(ex:myCleverProgram, [prov:type='prov:Plan', ex:label='Program 1']) wasAssociatedWith(ex:_aProgramExecution, ex:_aJVM, ex:myCleverProgram, [prov:role='defaultRuntime', ex:accessPath="webapp" ]) A plan is an entity having prov:type = 'prov:Plan'.
  • 50. Plan pattern as PROV-O. PROV-O: :_aProgramExecution a prov:Activity ; :execTime "22.5sec"; prov:qualifiedAssociation [ a prov:Association ; :accessPath "webapp"; prov:agent :_aJVM ; prov:hadPlan :myCleverProgram ; prov:hadRole "defaultRuntime"] . :_aJVM a prov:Agent, "Java-6.0". :myCleverProgram a prov:Entity, prov:Plan. PROV-N: activity(ex:_aProgramExecution, [ex:execTime="22.5sec"]) agent(ex:_aJVM, [prov:type = 'JVM-6.0']) entity(ex:myCleverProgram, [prov:type='prov:Plan', ex:label='Program 1']) wasAssociatedWith(ex:_aProgramExecution, ex:_aJVM, ex:myCleverProgram, [prov:role='defaultRuntime', ex:accessPath='webapp' ])
  • 51. Plan pattern as PROV-O: :_aProgramExecution a prov:Activity ; :execTime "22.5sec"; prov:qualifiedAssociation [ a prov:Association ; :accessPath "webapp"; prov:agent :_aJVM ; prov:hadPlan :myCleverProgram ; prov:hadRole "defaultRuntime"] . :_aJVM a prov:Agent, "Java-6.0". :myCleverProgram a prov:Entity, prov:Plan.
  • 53. Real-world artifacts vs provenance entities. Ref: http://www.w3.org/2001/sw/wiki/PROV-FAQ#Examples_of_Provenance “What do I know about the car I see in this Cambridge street today?” • It was bought by Joe in 2011 • Joe drove it to Boston on March 16th, 2013. The car has now got 10,000 miles on it • Joe drove it to Cambridge on March 18th, 2013. “Same” car, but different provenance at each stage of its evolution.
  • 54. Alternate-specialization pattern. Two alternate entities present aspects of the same thing. These aspects may be the same or different, and the alternate entities may or may not overlap in time. An entity that is a specialization of another shares all aspects of the latter, and additionally presents more specific aspects of the same thing as the latter. [Figure: entities for the same car; alternates differ in their location, while a specialization keeps the same owner and adds a location.] ...But, this is still that car! Semantic notes: 1. Specialization implies alternate: IF specializationOf(e1,e2) THEN alternateOf(e1,e2). 2. Alternate is symmetric: IF alternateOf(e1,e2) THEN alternateOf(e2,e1). 3. Specialization is transitive: IF specializationOf(e1,e2) and specializationOf(e2,e3) THEN specializationOf(e1,e3).
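The three semantic notes on this slide can be sketched as a small closure computation. Only those three rules are implemented; the function name and the car-related identifiers are hypothetical, chosen to echo the example on the previous slide.

```python
def close_alternate_specialization(specializations, alternates=()):
    """Apply the three rules from the slide to sets of (e1, e2) pairs.

    Returns (specializationOf pairs, alternateOf pairs) after closure.
    """
    spec = set(specializations)

    # Rule 3: specialization is transitive (iterate to a fixpoint).
    changed = True
    while changed:
        changed = False
        for (a, b) in list(spec):
            for (c, d) in list(spec):
                if b == c and (a, d) not in spec:
                    spec.add((a, d))
                    changed = True

    # Rule 1: specialization implies alternate.
    alt = set(alternates) | spec
    # Rule 2: alternate is symmetric.
    alt |= {(b, a) for (a, b) in alt}
    return spec, alt


# Hypothetical entities: the car in Boston specializes Joe's car,
# which in turn specializes the car in general.
spec, alt = close_alternate_specialization(
    [("carInBoston", "joesCar"), ("joesCar", "car")]
)
print(sorted(spec))
print(sorted(alt))
```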
  • 55. Reserved attributes and types. A small set of reserved attributes, with some usage restrictions.
  • 56. Bundles, provenance of provenance. A bundle is a named set of provenance descriptions, and is itself an entity, so allowing provenance of provenance to be expressed. bundle pm:bundle1 entity(ex:draftComments) entity(ex:draftV1) activity(ex:commenting) wasGeneratedBy(ex:draftComments, ex:commenting,-) used(ex:commenting, ex:draftV1, -) endBundle ... entity(pm:bundle1, [ prov:type='prov:Bundle' ]) agent(ex:Bob) wasGeneratedBy(pm:bundle1, -, 2013-03-20T10:30:00) wasAttributedTo(pm:bundle1, ex:Bob)
  • 57. Bundles in PROV-O
       Bundle definition (an RDF named graph):
       ex:bundle1 {
           :draftComments a prov:Entity ;
               :status "blah" ;
               prov:wasGeneratedBy :commenting .
           :commenting a prov:Activity ;
               prov:used :draftV1 .
           :draftV1 a prov:Entity .
       }
       Bundle usage:
       ex:bundle1 a prov:Entity, prov:Bundle ;
           prov:qualifiedGeneration [ a prov:Generation ;
               prov:atTime "2013-03-20T10:30:00+09:00" ] ;
           prov:wasAttributedTo :Bob .
  • 61. Time, Events. Provenance statements are combined by different systems, and an application may not be able to align the times involved to a single global timeline. Therefore, PROV minimizes assumptions about time. Instead, the PROV data model is implicitly based on a notion of instantaneous events that mark transitions in the world (*). Events: activity start, activity end, entity generation, entity usage, entity invalidation. Activity start and end are written wasStartedBy(id; a2, e, a1, t, attrs) and wasEndedBy(id; a2, e, a1, t, attrs). (*) PROV-CONSTRAINTS, http://www.w3.org/TR/prov-constraints/#events (non-normative)
  • 62. From “scruffy” provenance to “valid” provenance. Are all possible temporal partial orderings of events equally acceptable? How can we specify the set of all valid orderings? More generally, how do we formally define what it means for a set of provenance statements to be valid? PROV defines a set of temporal constraints that ensure consistency of a provenance graph.
  • 63. Summary • Motivation for collecting provenance of data and information • In Science • In the Social Web • The W3C PROV Recommendation (2013) • PROV-DM: The PROV data model • PROV-O: the Provenance Ontology • (PROV-CONSTRAINTS) • Provenance as Big Data • High volume provenance • Storage, analytics, visualisation • Provenance of analytics • How can I explain my predictions? • The ProvGen tool
  • 64. Selected bibliography
       Moreau, Luc, Paolo Missier, Khalid Belhajjame, Reza B’Far, James Cheney, Sam Coppens, Stephen Cresswell, et al. PROV-DM: The PROV Data Model. Edited by Luc Moreau and Paolo Missier, 2012. http://www.w3.org/TR/prov-dm/
       Cheney, James, Paolo Missier, and Luc Moreau. Constraints of the Provenance Data Model, 2012. http://www.w3.org/TR/prov-constraints/
       Moreau, Luc, Paul Groth, James Cheney, Timothy Lebo, and Simon Miles. “The Rationale of PROV.” Web Semantics: Science, Services and Agents on the World Wide Web (April 2015). doi:10.1016/j.websem.2015.04.001. http://www.sciencedirect.com/science/article/pii/S1570826815000177
       Marinho, Anderson, Leonardo Murta, Cláudia Werner, Vanessa Braganholo, Sérgio Manuel Serra da Cruz, Eduardo Ogasawara, and Marta Mattoso. “ProvManager: a Provenance Management System for Scientific Workflows.” Concurrency and Computation: Practice and Experience 24, no. 13 (2012): 1513–1530. http://dx.doi.org/10.1002/cpe.1870
       Firth, H., and P. Missier. “ProvGen: generating synthetic PROV graphs with predictable structure.” In Procs. IPAW 2014 (Provenance and Annotations), Koln, Germany, 2014. Springer. http://arxiv.org/pdf/1406.2495
       Missier, P., J. Bryans, C. Gamble, V. Curcin, and R. Danger. “ProvAbs: model, policy, and tooling for abstracting PROV graphs.” In Procs. IPAW 2014 (Provenance and Annotations), Koln, Germany, 2014. Springer. http://arxiv.org/pdf/1406.1998
       De Oliveira, Daniel, Vítor Silva, and Marta Mattoso. “How Much Domain Data Should Be in Provenance Databases?” In 7th USENIX Workshop on the Theory and Practice of Provenance (TaPP 15). Edinburgh, Scotland: USENIX Association, 2015. https://www.usenix.org/conference/tapp15/workshop-program/presentation/de-oliveira
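
The three semantic notes on the alternate-specialization slide can be read as inference rules. The following is a minimal sketch of applying them, not part of the original slides: the function names and the entity identifiers (`ex:car`, `ex:joes_car`, and so on) are hypothetical, and only the three rules listed on the slide are implemented.

```python
from itertools import product


def close_specialization(spec):
    """Rule 3: transitive closure of specializationOf pairs (e1, e2)."""
    closed = set(spec)
    changed = True
    while changed:
        changed = False
        for (a, b), (c, d) in product(list(closed), repeat=2):
            if b == c and (a, d) not in closed:
                closed.add((a, d))
                changed = True
    return closed


def infer_alternates(spec, alt):
    """Apply rules 1-3 and return (specializationOf, alternateOf) pairs."""
    closed_spec = close_specialization(spec)
    alternates = set(alt) | closed_spec               # rule 1
    alternates |= {(b, a) for (a, b) in alternates}   # rule 2
    return closed_spec, alternates


# Hypothetical identifiers loosely modelled on the car example.
spec = {
    ("ex:joes_car_in_boston", "ex:joes_car"),
    ("ex:joes_car_in_cambridge", "ex:joes_car"),
    ("ex:joes_car", "ex:car"),
}
alt = {("ex:joes_car_in_boston", "ex:joes_car_in_cambridge")}

closed_spec, alternates = infer_alternates(spec, alt)
```

Here the car in Boston is inferred to be a specialization of `ex:car` through `ex:joes_car`, and every specialization pair also appears, in both directions, among the alternates.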

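The idea of temporal constraints over events ("From scruffy provenance to valid provenance") can be illustrated with a small check. This is only a sketch under stated assumptions: the event labels, the `ordering_violations` helper, and the timestamps are hypothetical, and it assumes the known timestamps are comparable on one clock, which PROV itself does not require. Events with unknown times are simply skipped, so the ordering remains partial.

```python
from datetime import datetime


def ordering_violations(times, precedes):
    """Return the (earlier, later) pairs contradicted by known timestamps."""
    violations = []
    for earlier, later in precedes:
        t1, t2 = times.get(earlier), times.get(later)
        # Only compare when both event times are known.
        if t1 is not None and t2 is not None and t1 > t2:
            violations.append((earlier, later))
    return violations


# Hypothetical event times for the commenting activity.
times = {
    "start(commenting)": datetime(2013, 3, 20, 9, 0),
    "usage(draftV1, commenting)": datetime(2013, 3, 20, 9, 30),
    "generation(draftComments, commenting)": datetime(2013, 3, 20, 10, 30),
    "end(commenting)": None,  # unknown
}

# Ordering requirements: an activity starts before it ends, and its
# usages and generations fall between its start and its end.
precedes = [
    ("start(commenting)", "end(commenting)"),
    ("start(commenting)", "usage(draftV1, commenting)"),
    ("usage(draftV1, commenting)", "end(commenting)"),
    ("start(commenting)", "generation(draftComments, commenting)"),
    ("generation(draftComments, commenting)", "end(commenting)"),
]

valid_result = ordering_violations(times, precedes)

# A usage recorded before the activity started contradicts the ordering.
bad_times = dict(times)
bad_times["usage(draftV1, commenting)"] = datetime(2013, 3, 20, 8, 0)
bad_result = ordering_violations(bad_times, precedes)
```

The first call finds no contradiction; the second reports the pair whose recorded times are out of order.
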
Editor's Notes

  1. We have seen some examples of the look and feel of e-SC. Now we briefly go over the architecture. SaaS – Science as a Service
  2. We have seen some examples of the look and feel of e-SC. Now we briefly go over the architecture. SaaS – Science as a Service
  3. For many of its articles, NowNews relies on the integration of multiple data sources. In order to ensure correct credit is given, NowNews wants to provide a central acknowledgments list that recognizes all the people and data sources that contribute to all the various articles and information that it publishes.
  4. We have seen some examples of the look and feel of e-SC. Now we briefly go over the architecture. SaaS – Science as a Service
  5. W3C Recommendation (REC) A W3C Recommendation is a specification or set of guidelines that, after extensive consensus-building, has received the endorsement of W3C Members and the Director. W3C recommends the wide deployment of its Recommendations. Note: W3C Recommendations are similar to the standards published by other organizations.
  6. Remark on PROV-AQ: nothing to do with querying, but a query model can be associated to each of the encodings. W3C Recommendation (REC): A W3C Recommendation is a specification or set of guidelines that, after extensive consensus-building, has received the endorsement of W3C Members and the Director. W3C recommends the wide deployment of its Recommendations. Note: W3C Recommendations are similar to the standards published by other organizations. Working Group Note: A Working Group Note is published by a chartered Working Group to indicate that work has ended on a particular topic. A Working Group may publish a Working Group Note with or without its prior publication as a Working Draft.
  7. Alice, a senior editor, produces draft V1 of a document after reading papers paper1 and paper2. V1 is for internal distribution only. Later, Bob, who is the main editor and works for Alice, comments on the draft, producing a new document, draft comments.
  8. duality between elements (generation) and relations (wasGeneratedBy)
  9. baseline-noAgents.provn
  10. baseline-noAgents-unqual.n3
  11. baseline-noAgents.provn. Agents are software, organization, person (non-normative). Distinguish between normative and non-normative parts of the PROV documents. Examples of association between an activity and an agent are: creation of a web page under the guidance of a designer; various forms of participation in a panel discussion, including audience member, panelist, or panel chair; a public event, sponsored by a company, and hosted by a museum.
  12. baseline-noAgents-unqual.n3. Agents are software, organization, person (non-normative). Distinguish between normative and non-normative parts of the PROV documents. Examples of association between an activity and an agent are: creation of a web page under the guidance of a designer; various forms of participation in a panel discussion, including audience member, panelist, or panel chair; a public event, sponsored by a company, and hosted by a museum.
  13. Agents are software, organization, person (non-normative). Distinguish between normative and non-normative parts of the PROV documents. Examples of association between an activity and an agent are: creation of a web page under the guidance of a designer; various forms of participation in a panel discussion, including audience member, panelist, or panel chair; a public event, sponsored by a company, and hosted by a museum.
  14. mention that derivation is missing -- this requires more insight into relation IDs
  15. Most relations admit optional arguments (e.g. time). First-class arguments may be optional, too: for instance, generation with an implicit activity. Often only some combinations of arguments are legal.
  16. A single (real-world) artifact may correspond to several entities in a provenance model that includes descriptions of that artifact. The choice of mapping is determined by which characteristics of the artifact are relevant for (a specific) provenance description of it. Whenever one of these attributes changes, a new entity is created. Example: the doc before and after editing. Some characteristics that are relevant for provenance have changed.
  17. These entities are, however, related. These relationships can be expressed in PROV.
  18. ... and I could have bundles that refer to other bundles...