Lecture on Provenance modelling, given at the first Keystone Summer School, Malta July 2015.
With thanks to Prof. Luc Moreau for contributing some of the slide material from his own tutorial
First Keystone Summer School – Malta, July 2015 – P. Missier
Provenance and the W3C PROV model
(in the Big Data context)
Paolo Missier
School of Computing Science
Newcastle University, UK
Tutorial
First Keystone Summer School,
Malta, July 2015
Some of the slides courtesy of Luc Moreau – thanks!
Lecture goals and outline
• What is provenance, and why does it matter?
• Definitions and case studies
• The W3C PROV standard in a nutshell
• PROV-O: the Provenance Ontology and examples of its usage
• Provenance and Big Data: what’s the connection?
• Opportunities and challenges
• Provenance tools [from Southampton]
1- Reproducibility and dissemination in Science
Independent validation of scientific claims is a cornerstone of
experimental science
• Scientific claims are supported by experiments
• How do I express my “materials and methods” so that you can
independently verify my results?
• How do I document my results to promote their understanding and
reuse?
Provenance is the equivalent of a logbook
• Capture all steps involved in the derivation of a
result
• Replay, validate the execution, compare it with
others
To what extent can these be formalised and automated in data-intensive science?
2- Explaining the outcome of a complex decision process
• Which process was used
to derive a diagnosis?
• How did the process use
the input data?
• How were the steps
configured?
• Which decisions were
made by human experts
(clinicians)?
[Figure: NGS pipeline for the clinical diagnosis of genetic diseases. Stages: Variant filtering (MAF threshold; non-synonymous, stop/gain, frameshift; known polymorphisms; homo/heterozygous; pathogenicity predictors), Variant Scoping (user-supplied disease keywords and genes list, user-defined preferred genes, HPO and OMIM matching, genes in scope), and Variant Classification via ClinVar and OMIM lookup (RED: found, pathogenic; AMBER: not found or uncertain; GREEN: found, benign).]
3- Understanding the results of a computation
• Why has my [very complicated algorithm] produced this particular
result?
• Why is my predictive analytics model suggesting that it will rain
tomorrow?
• Why is this record part of the result of my database query?
• Database provenance
• Why is this record included in the result of my keyword search?
4- Content reuse on the Social Web
Open Data, Data Journalism
• A consume-select-curate-share workflow, not only professional
• Ethos: to expose the data and methods used to produce news items
• But: Data wrangling can introduce errors
• Is the data I am using valid? What is its primary source? What are the
transformation steps?
NowNews publishes an article based on
the latest employment data published by
GovStat
PolicyOrg compiles a report including
NowNews article
:L-Moreau a Agent.
:original-slide a Entity;
wasAttributedTo L-Moreau.
:this-slide a Entity;
wasDerivedFrom original-slide
What is provenance?
Oxford English Dictionary:
• the fact of coming from some particular source or quarter; origin, derivation
• the history or pedigree of a work of art, manuscript, rare book, etc.;
• a record of the passage of an item through its various owners
Magna Carta (‘the Great Charter’) was
agreed between King John and his barons
on 15 June 1215.
What is provenance?
Provenance refers to the sources of information, including entities
and processes, involved in producing or delivering an artifact (*)
Provenance is a description of how things came to be, and how
they came to be in the state they are in today (*)
Provenance is a record that describes the people, institutions,
entities, and activities, involved in producing, influencing, or
delivering a piece of data or a thing in the world
Provenance on the Web
Tim Berners-Lee’s “Oh Yeah” button:
• A browser button by which the user can express their uncertainty about a
document being displayed “so how do I know I can trust this information?”.
• Upon activation of the button, the software then retrieves metadata about the
document, listing assumptions on which trust can be based.
http://users.ugent.be/~tdenies/OhYeah/
Easy Access to Provenance: an Essential Step Towards Trust on the Web, Procs METHOD 2013: The 2nd IEEE
International Workshop on Methods for Establishing Trust with Open Data Held in conjunction with COMPSAC,
the IEEE Signature Conference on Computers, Software & Applications - July 22-26, 2013 - Kyoto, Japan
http://dx.doi.org/10.1109/COMPSACW.2013.29
Use cases on the Social Web
Open Data, Data Journalism
NowNews publishes an article based on the latest employment
data published by GovStat
PolicyOrg compiles a report including NowNews article
Bob: Journalist
Alice: Data Cruncher
Tom: Editor
Nick: Web Master
Derivation - Timeliness
Derivation:
• Charts, graphs and visualizations are all based on multiple data sets
• E.g. Bob’s article on employment that appeared in NowNews
• Which data was a figure based upon?
Is the report based on the most
up-to-date data?
Derivation - Trusted sources
Derivation:
• Is this content derived from data coming from a reliable source?
• The chart within Bob’s article is based on GovStat data
• However that information is hidden:
• the chart was produced by a complex process performed by Alice
Policy rule:
“data supplied by the government
is reliable”
Tracing the source of errors
Derivation, attribution:
• When did this error occur?
• Who was responsible for the chart?
Nick discovers an error in the
chart included in Bob’s article
now:employment-article-v1.html prov:wasAttributedTo nowpeople:Bob
Ensuring policy compliance
Process inspection:
• Which process steps led to publication?
• Was editorial check part of it?
Policy rule:
“posts are to be checked by an
editor prior to publication”
Ensuring credit and acknowledgement
NowNews relies on multiple
contributors
[Figure: employment-article-v1.html, David and Bob, linked by attribution (att) and delegation (del) edges]
Attribution and responsibility:
• How do we ensure that all relevant
contributors are acknowledged?
Reproducibility
Documenting the data
generation process:
• How do we ensure that
the figures can be
reproduced using the
new versions of the
data?
NowNews must ensure that the
article figures reflect the most
recent data
[Figure: the data-crunching activity, associated (assoc) with Alice, uses (use) data-source-A and data-source-B, shown with labels version: 1.0 and version: 2.0, and generates (gen) employment-article-v1.html]
So, why does provenance matter?
• To establish quality, relevance, trust
• To track information attribution through complex transformations
• To enable process analysis for debugging, improvement, evolution
• To enable reproducibility of processes (e.g. in science, data journalism…)
See also:
ACM Journal of Data and Information Quality (JDIQ) - Special Issue on Provenance, Data
and Information Quality, Paolo Missier, Paolo Papotti, Eds. Volume 5 Issue 3, February 2015
DOI: 10.1145/2692312
http://dl.acm.org/citation.cfm?id=2700413
http://jdiq.acm.org/archive.cfm?id=2698232
The W3C Working Group on Provenance
2009-2010: W3C Incubator Group on provenance
Chair: Yolanda Gil, ISI, USC
Main output:
“Provenance XG Final Report”
http://www.w3.org/2005/Incubator/prov/XGR-prov/
- provides an overview of the various existing
approaches, vocabularies
- proposes the creation of a dedicated W3C Working
Group
April 2011: W3C Working Group approved
Chairs: Luc Moreau, Paul Groth
April 2013: Proposed Recommendations finalised
prov-dm: Data Model
prov-o: OWL ontology, RDF encoding
prov-n: prov notation
prov-constraints
...plus a number of non-prescriptive
Notes
http://www.w3.org/2011/prov/wiki/
PROV: scope and structure
source: http://www.w3.org/TR/prov-overview/
Recommendation
track
See also:
Moreau, Luc, and Paul Groth. “Provenance: An Introduction to PROV.” Synthesis Lectures
on the Semantic Web: Theory and Technology 3, no. 4 (September 15, 2013): 1–129.
doi:10.2200/S00528ED1V01Y201308WBE007.
PROV Core Elements (graph depiction)
An entity is a physical, digital, conceptual, or other kind of thing with some fixed
aspects; entities may be real or imaginary.
An activity is something that occurs over a period of time and acts upon or with entities; it
may include consuming, processing, transforming, ..., using, or generating entities.
An agent is something that bears some form of responsibility for an activity taking place,
for the existence of an entity, or for another agent's activity.
Generation, Usage
Generation is the completion of production of a new entity by an activity. This entity did not
exist before generation and becomes available for usage after this generation.
Usage is the beginning of utilizing an entity by an activity. Before usage, the activity had
not begun to utilize this entity
PROV is based on a notion of instantaneous events, that mark transitions in the world
- generation, usage (and others)
Ordering constraints amongst events:
“generation of e must precede each of its usages”
“a can only use / generate e after it has started and before it has ended”
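The two ordering constraints quoted above can be sketched as a small check over a list of events. This is only an illustration: it assumes every event carries an integer timestamp on one shared clock, which PROV itself does not require, and the event dictionaries, the helper `ev` and the function `ordering_violations` are hypothetical names, not part of any PROV tool.

```python
def ordering_violations(events):
    """Check two PROV-style ordering rules against a list of timestamped events.

    Each event is a dict with keys: kind ("start", "end", "generation", "usage"),
    activity, entity (None for start/end) and time (an integer on one shared clock).
    """
    start = {e["activity"]: e["time"] for e in events if e["kind"] == "start"}
    end = {e["activity"]: e["time"] for e in events if e["kind"] == "end"}
    generated = {e["entity"]: e["time"] for e in events if e["kind"] == "generation"}
    problems = []
    for e in events:
        if e["kind"] not in ("generation", "usage"):
            continue
        a, x, t = e["activity"], e["entity"], e["time"]
        # Rule 1: generation of an entity precedes each of its usages.
        if e["kind"] == "usage" and x in generated and t < generated[x]:
            problems.append(f"{x} used by {a} before it was generated")
        # Rule 2: an activity uses or generates only between its start and end.
        if a in start and t < start[a]:
            problems.append(f"{a} acts on {x} before it has started")
        if a in end and t > end[a]:
            problems.append(f"{a} acts on {x} after it has ended")
    return problems


def ev(kind, activity, entity, time):
    """Hypothetical helper that builds one event record."""
    return {"kind": kind, "activity": activity, "entity": entity, "time": time}


valid = [
    ev("start", "drafting", None, 1),
    ev("generation", "drafting", "draftV1", 2),
    ev("end", "drafting", None, 3),
    ev("start", "commenting", None, 4),
    ev("usage", "commenting", "draftV1", 5),
    ev("generation", "commenting", "draftComments", 6),
    ev("end", "commenting", None, 7),
]
# Same trace, but the usage of draftV1 is moved to time 1, before its generation.
broken = [dict(e, time=1) if e["kind"] == "usage" else e for e in valid]

print(ordering_violations(valid))
print(ordering_violations(broken))
```

The first trace yields no violations; the second breaks both rules, because the usage now precedes the generation of draftV1 and also the start of the commenting activity.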
Same example: PROV-O notation (RDF/N3)
:draftComments a prov:Entity ;
:distr "internal"^^xsd:string ;
prov:wasGeneratedBy :commenting .
:commenting a prov:Activity ;
prov:used :draftV1 .
:draftV1 a prov:Entity ;
:distr "internal"^^xsd:string ;
:status "draft"^^xsd:string ;
:version "0.1"^^xsd:string ;
prov:wasGeneratedBy :drafting .
:drafting a prov:Activity ;
prov:used :paper1,
:paper2 .
:paper1 a prov:Entity,
"reference"^^xsd:string .
:paper2 a prov:Entity,
"reference"^^xsd:string .
Association, Attribution, Delegation: who did what?
An activity association is an assignment of responsibility to an agent for an activity,
indicating that the agent had a role in the activity.
Attribution is the ascribing of an entity to an agent.
entity(ex:draftComments, [ ex:distr='internal' ])
activity(ex:commenting)
agent(ex:Bob, [prov:type = "mainEditor"] )
agent(ex:Alice, [prov:type = "srEditor"])
wasAssociatedWith(ex:commenting, ex:Bob, -, [prov:role = "editor"])
actedOnBehalfOf(ex:Bob, ex:Alice)
wasAttributedTo(ex:draftComments, ex:Bob)
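To answer "who did what?" for an entity, one can start from its attribution and then follow delegation links upward. The sketch below does this over plain dictionaries holding the statements from this slide. It assumes, purely for simplicity, that each entity is attributed to one agent and each agent acts on behalf of at most one other agent. `responsibility_chain` is a hypothetical helper name.

```python
# Statements from the example, stored as simple lookups.
attributed_to = {"ex:draftComments": "ex:Bob"}     # wasAttributedTo(entity, agent)
acted_on_behalf_of = {"ex:Bob": "ex:Alice"}        # actedOnBehalfOf(delegate, responsible)


def responsibility_chain(entity):
    """Return the attributed agent followed by the agents they act on behalf of."""
    chain = []
    agent = attributed_to.get(entity)
    # Stop at the top of the chain, or if a delegation cycle shows up.
    while agent is not None and agent not in chain:
        chain.append(agent)
        agent = acted_on_behalf_of.get(agent)
    return chain


print(responsibility_chain("ex:draftComments"))
```

For the example data this lists ex:Bob first and ex:Alice second, reflecting that Bob produced the comments while acting on behalf of Alice.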
Same example: PROV-O notation (RDF/N3)
:Alice a prov:Agent,
"ex:chiefEditor";
:firstName "Alice";
:lastName "Cooper".
:Bob a prov:Agent,
"ex:seniorEditor";
:firstName "Robert";
:lastName "Thompson";
prov:actedOnBehalfOf :Alice .
:draftComments prov:wasAttributedTo :Bob .
:commenting a prov:Activity ;
prov:wasAssociatedWith :Bob .
Derivation amongst entities
A derivation is a transformation of an entity into another, an update of an entity
resulting in a new one, or the construction of a new entity based on a pre-existing
entity.
entity(ex:draftV1)
entity(ex:draftComments)
wasDerivedFrom(ex:draftComments, ex:draftV1)
Q.: what is the relationship between derivation, generation, and usage?
:draftComments a prov:Entity ;
prov:wasDerivedFrom :draftV1 .
:draftV1 a prov:Entity .
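One way to explore the question above is to look for an activity that both used the source entity and generated the derived one. The sketch below performs that lookup on the running example. It is an illustration only: finding such an activity supports a derivation statement but does not by itself prove one, since an activity may use an entity without the output depending on it. The function name `activities_supporting` is hypothetical.

```python
# used(activity, entity) and wasGeneratedBy(entity, activity) from the running example.
used = {("ex:commenting", "ex:draftV1")}
was_generated_by = {("ex:draftComments", "ex:commenting")}


def activities_supporting(derived, source):
    """Activities that used `source` and generated `derived`."""
    return sorted(
        activity
        for (entity, activity) in was_generated_by
        if entity == derived and (activity, source) in used
    )


print(activities_supporting("ex:draftComments", "ex:draftV1"))
print(activities_supporting("ex:draftV1", "ex:draftComments"))
```

The first call finds ex:commenting; the second finds nothing, so the reverse derivation has no supporting activity in this trace.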
Provenance {as,of} Big Data
1. BigProv: Provenance as big data
• High volume provenance
• What kind of analytics are interesting on big provenance?
2. Provenance of analytics processes
• “Prediction provenance”
• Train a model → provenance of the model as a record of the training
process and data involved
• Use the model to make predictions → provenance of the prediction
3. Provenance of a search
• What is the provenance of a keyword search?
• Why would it be interesting? What can we learn from it?
Recent research on Provenance as Big Data
Chen, Peng; Plale, Beth A., "Big Data Provenance Analysis and Visualization," Cluster, Cloud and Grid
Computing (CCGrid), 2015 15th IEEE/ACM International Symposium on, pp. 797-800, 4-7
May 2015. doi: 10.1109/CCGrid.2015.85
Chen, Peng; Plale, Beth A., "ProvErr: System Level Statistical Fault Diagnosis Using Dependency
Model," Cluster, Cloud and Grid Computing (CCGrid), 2015 15th IEEE/ACM International Symposium
on, pp. 525-534, 4-7 May 2015.
doi: 10.1109/CCGrid.2015.86
Provenance Map Orbiter: Interactive Exploration of Large Provenance Graphs
Peter Macko and Margo Seltzer, Harvard University, Procs. TAPP’11, 2011, Crete, Greece
Provenance from Log Files: a BigData Problem, Devarshi Ghoshal and
Beth Plale, Procs. BigProv workshop, EDBT, Genova, Italy, 2013
Adam Bates, Kevin Butler and Thomas Moyer. Take Only What You Need: Leveraging Mandatory
Access Control Policy to Reduce Provenance Storage Costs. In Procs. TAPP’15 workshop,
Edinburgh, 2015
http://workshops.inf.ed.ac.uk/tapp2015/TAPP15_II_3.pdf
ProvGen
• A Provenance Generator tool for experimenting with provenance at scale
• Why generate synthetic provenance?
• Synthetic PROV graphs can be a valuable complement to emerging natural
provenance collections
• … provided their structural properties reflect specific provenance patterns
• control over their repetition and variability
• varying scales
• Useful for benchmarking emerging provenance management systems
• Useful to test analytics algorithms that operate on large provenance collections
[Figure: natural provenance collections (science datasets, git2PROV, mediaWiki history, retweet history) plotted by trace size against number of traces]
Firth, Hugo, and Paolo Missier. “ProvGen: Generating Synthetic PROV Graphs with Predictable Structure.” In
Procs. IPAW 2014 (Provenance and Annotations). Koln, Germany: Springer, 2014.
http://arxiv.org/pdf/1406.2495
What does ProvGen do?
• Accept a seed PROV graph
• Grow the graph
• Add nodes and relationships following the seed graph
structure
• … with constraints on how to grow
document
entity(e1, [type="Document",
version="original"])
entity(e2, [type="Document"])
entity(e3, [type="Document"])
activity(a1, [type="create"])
activity(a2, [type="edit"])
activity(a3, [type="edit"])
agent(ag, [type="Person"])
used(a2, e1)
used(a3, e2)
wasGeneratedBy(e2, a2, [fct="save"])
wasGeneratedBy(e1, a1, [fct="publish"])
wasGeneratedBy(e3, a3, [fct="save"])
wasAssociatedWith(a3, ag,
[role="contributor"])
wasAssociatedWith(a2, ag,
[role="contributor"])
wasAssociatedWith(a1, ag,
[role="creator"])
wasDerivedFrom(e2, e1)
wasDerivedFrom(e3, e2)
endDocument
[Figure: graph depiction of the seed document: activities a1 (type: create), a2 and a3 (type: edit); entities e1 (type: Document, version: original), e2 and e3 (type: Document); agent ag (type: Person); a plan node (type: prov:plan); edges labelled use, gen, der and assoc]
ProvGen constraints
an Entity must have relationship "WasDerivedFrom" exactly 2 times unless it has
property("version"="original");
the Entity(e1) must not have relationship "WasDerivedFrom" with the Entity(e2)
unless e1 has relationship "Used" with the Activity(a) and e2 has the
relationship "WasGeneratedBy" with the Activity(a);
an Entity must have relationship "WasGeneratedBy" exactly 1 times;
an Entity must have property("version"="original") with probability 0.05;
an Entity must have out degree at most 2;
an Activity must have relationship "Used" at most 1 times;
an Activity must have property("type"="create") with probability 0.01;
an Activity must have relationship "WasAssociatedWith" exactly 1 times;
an Activity must have relationship "Used" exactly 1 times unless it has
property("type"="create");
an Activity must have relationship "WasGeneratedBy" exactly 1 times;
an Agent must have relationship "WasAssociatedWith" with probability 0.1;
an Agent must have relationship "WasAssociatedWith" between 1, 120 times with
distribution gamma(1.3, 2.4);
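To get a feel for what the cardinality rules mean, the sketch below checks two of them against the seed graph shown earlier: "an Entity must have relationship WasGeneratedBy exactly 1 times" and "an Activity must have relationship Used at most 1 times". This is not how ProvGen itself evaluates constraints; it ignores the "unless" and probability clauses, and `count_violations` is a hypothetical helper.

```python
from collections import Counter

# Nodes and edges copied from the seed PROV document.
entities = ["e1", "e2", "e3"]
activities = ["a1", "a2", "a3"]
used = [("a2", "e1"), ("a3", "e2")]                            # (activity, entity)
was_generated_by = [("e2", "a2"), ("e1", "a1"), ("e3", "a3")]  # (entity, activity)


def count_violations(nodes, edges, low, high):
    """Nodes whose number of outgoing edges falls outside [low, high]."""
    out_degree = Counter(source for source, _ in edges)
    return [n for n in nodes if not low <= out_degree[n] <= high]


gen_violations = count_violations(entities, was_generated_by, 1, 1)
use_violations = count_violations(activities, used, 0, 1)
print(gen_violations, use_violations)
```

Both lists come back empty for the seed graph. Dropping the generation of e3 from `was_generated_by` would make e3 show up as a violation of the first rule.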
Some test queries
Generated graph loaded into the Neo4j graph DBMS
Queries expressed using the Cypher graph query language
Transitive closure over Derivation:
Return all the derivation chains, along with their length
MATCH (a)-[r:`WASDERIVEDFROM`*]->(b) RETURN a, b, length(r)
Return the 50 longest derivation chains among those longer than 10
MATCH (a)-[r:`WASDERIVEDFROM`*]->(b)
WHERE length(r) > 10
RETURN a, b, length(r)
ORDER BY length(r) DESC LIMIT 50
All agents and their associated activities
MATCH (a)-[:`WASASSOCIATEDWITH`]->(b)
RETURN b AS Agent, a AS Activity
All agents who created new documents
MATCH (a {type:'create'})-[:`WASASSOCIATEDWITH`]->(b)
RETURN a, b LIMIT 25
All agents who edited a document that was derived from an original
MATCH (doc1 {version:'original'})<-[:`WASDERIVEDFROM`]-(doc2)
-[:`WASGENERATEDBY`]->(act)-[:`WASASSOCIATEDWITH`]->(agent)
RETURN agent LIMIT 25
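For readers without a graph database to hand, the first two queries can be approximated in plain Python. The sketch below enumerates every path along wasDerivedFrom edges together with its length, then sorts by length. It is an assumption-laden stand-in for the Cypher variable-length match, not a description of how Neo4j evaluates it, and `derivation_chains` is a hypothetical name.

```python
def derivation_chains(derived_from):
    """Yield (start, end, length) for every path along wasDerivedFrom edges."""
    def walk(node, length, seen):
        for nxt in derived_from.get(node, []):
            if nxt in seen:          # guard against cycles in malformed input
                continue
            yield nxt, length + 1
            yield from walk(nxt, length + 1, seen | {nxt})

    for start in derived_from:
        for end, length in walk(start, 0, {start}):
            yield start, end, length


# Derivations from the ProvGen seed graph: e3 derives from e2, e2 from e1.
derived_from = {"e3": ["e2"], "e2": ["e1"]}
chains = sorted(derivation_chains(derived_from), key=lambda c: -c[2])
print(chains[:50])   # longest chains first, capped like LIMIT 50
```

On the seed graph this finds three chains, the longest being e3 to e1 with length 2.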
Provenance of Big Data
Provenance of analytics processes:
“Prediction provenance”
• Train a model → provenance of the model as a record of the training
process and data involved
• Use the model to make predictions → provenance of the prediction
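As a rough illustration of what "prediction provenance" could look like, the sketch below writes out PROV-N style statements for one training step and one prediction step. The identifiers (`ex:training`, `ex:predicting` and the entity names passed in) are invented for this example, and the function `prediction_provenance` is hypothetical; the lecture does not prescribe a particular encoding.

```python
def prediction_provenance(training_data, model, inputs, prediction):
    """Return PROV-N style statements linking a prediction back to training data."""
    return [
        f"entity({training_data})",
        f"entity({model})",
        f"entity({inputs})",
        f"entity({prediction})",
        "activity(ex:training)",
        "activity(ex:predicting)",
        # Training: uses the data, generates the model.
        f"used(ex:training, {training_data}, -)",
        f"wasGeneratedBy({model}, ex:training, -)",
        # Prediction: uses the model and new inputs, generates the prediction.
        f"used(ex:predicting, {model}, -)",
        f"used(ex:predicting, {inputs}, -)",
        f"wasGeneratedBy({prediction}, ex:predicting, -)",
    ]


statements = prediction_provenance(
    "ex:trainingSet", "ex:model", "ex:newInputs", "ex:prediction"
)
print("\n".join(statements))
```

Following the used and wasGeneratedBy statements backwards from the prediction leads to the model, and from the model to the training data.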
Relations may be given identifiers
entity(ex:draftComments)
entity(ex:draftV1)
activity(ex:commenting)
wasGeneratedBy(gen1; ex:draftComments, ex:commenting, -)
used(use1; ex:commenting, ex:draftV1, -)
gen1 denotes a generation event
use1 denotes a usage event
General derivation relation:
wasDerivedFrom(id; e2, e1, a, g2, u1, attrs)
Relation IDs make it possible to refer to relations in other relations
Rendering N-ary relations in PROV-O
RDF is for binary relations; N-ary relations require reification
entity(ex:draftComments)
entity(ex:draftV1)
activity(ex:commenting)
wasGeneratedBy(gen1; ex:draftComments,
ex:commenting,
2013-03-18T10:00:01)
used(use1; ex:commenting, ex:draftV1, -)
:draftComments a prov:Entity ;
prov:qualifiedGeneration :gen1 .
:gen1 a prov:Generation ;
prov:activity :commenting;
prov:atTime "2013-03-18T10:00:01+09:00".
:commenting a prov:Activity ;
prov:qualifiedUsage :use1 .
:use1 a prov:Usage ;
:note "found comments useful";
prov:atTime "2013-03-21T10:00:01+09:00";
prov:entity :draftV1.
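The reification step can be sketched as a function that expands one n-ary generation statement into binary triples, using the PROV-O terms that appear in the Turtle above (prov:qualifiedGeneration, prov:Generation, prov:activity, prov:atTime). The triples are plain Python tuples rather than objects from an RDF library, and `qualify_generation` is a hypothetical helper.

```python
def qualify_generation(gen_id, entity, activity, time=None):
    """Expand wasGeneratedBy(gen_id; entity, activity, time) into binary triples."""
    triples = [
        (entity, "prov:wasGeneratedBy", activity),      # unqualified binary shortcut
        (entity, "prov:qualifiedGeneration", gen_id),   # link to the reified relation
        (gen_id, "rdf:type", "prov:Generation"),
        (gen_id, "prov:activity", activity),
    ]
    if time is not None:
        # The optional time argument hangs off the reified node.
        triples.append((gen_id, "prov:atTime", time))
    return triples


triples = qualify_generation(
    ":gen1", ":draftComments", ":commenting", "2013-03-18T10:00:01"
)
for t in triples:
    print(t)
```

The extra node `:gen1` is what makes room for the time: a single binary triple between the entity and the activity has nowhere to put it.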
Plans: why was something done?
Most relation types have two arguments, each of which is an Entity, an Activity, or an Agent
Derivation is one exception:
wasDerivedFrom(id; e2, e1, a, g2, u1, attrs)
Two other notable exceptions:
- Associations with a plan
- Delegation with an activity scope
wasAssociatedWith(id; a, ag, pl, attrs)
A plan is an entity that represents a set of actions or steps
intended by one or more agents to achieve some goal
Real-world artifacts vs provenance entities
ref: http://www.w3.org/2001/sw/wiki/PROV-FAQ#Examples_of_Provenance
“What do I know about the car I see in this Cambridge street today?”
• It was bought by Joe in 2011
• Joe drove it to Boston on March 16th, 2013. The car has now got 10,000 miles on it
• Joe drove it to Cambridge on March 18th, 2013.
“Same” car, but different provenance at
each stage of its evolution
Alternate-specialization pattern
Two alternate entities present aspects of the same thing. These aspects may be the same or
different, and the alternate entities may or may not overlap in time.
An entity that is a specialization of another shares all aspects of the latter, and additionally
presents more specific aspects of the same thing as the latter.
...But, this is still that car!
Semantic notes:
1. Specialization implies alternate: IF specializationOf(e1,e2) THEN alternateOf(e1,e2).
2. Alternate is symmetric: IF alternateOf(e1,e2) THEN alternateOf(e2,e1)
3. Specialization is transitive: IF specializationOf(e1,e2) and specializationOf(e2,e3) THEN specializationOf(e1,e3).
[Figure labels: two alternates differing in their location; a specialization with the same owner and an added location]
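The three semantic notes can be applied mechanically. The sketch below closes a set of specializationOf pairs under transitivity (note 3), then derives alternateOf pairs from them (note 1) and makes the result symmetric (note 2). The entity names are invented for the car example and `close_specialization` is a hypothetical helper.

```python
def close_specialization(specialization_of, alternate_of=()):
    """Apply the three semantic notes and return (specializations, alternates)."""
    spec = set(specialization_of)
    changed = True
    while changed:                        # note 3: specialization is transitive
        changed = False
        for a, b in list(spec):
            for c, d in list(spec):
                if b == c and (a, d) not in spec:
                    spec.add((a, d))
                    changed = True
    alt = set(alternate_of) | spec        # note 1: specialization implies alternate
    alt |= {(b, a) for a, b in alt}       # note 2: alternate is symmetric
    return spec, alt


spec, alt = close_specialization({
    ("ex:carInBoston", "ex:joesCar"),     # the car at one stage of its journey
    ("ex:joesCar", "ex:car"),             # the car while owned by Joe
})
print(sorted(spec))
print(sorted(alt))
```

Starting from two specialization statements, the closure adds the transitive pair from ex:carInBoston to ex:car and yields six alternateOf pairs in total.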
Bundles, provenance of provenance
A bundle is a named set of provenance descriptions, and is itself an entity,
so allowing provenance of provenance to be expressed.
bundle pm:bundle1
entity(ex:draftComments)
entity(ex:draftV1)
activity(ex:commenting)
wasGeneratedBy(ex:draftComments, ex:commenting,-)
used(ex:commenting, ex:draftV1, -)
endBundle
...
entity(pm:bundle1, [ prov:type='prov:Bundle' ])
agent(ex:Bob)
wasGeneratedBy(pm:bundle1, -, 2013-03-20T10:30:00)
wasAttributedTo(pm:bundle1, ex:Bob)
Bundles in PROV-O
Bundle definition (an RDF named graph):
ex:bundle1 {
:draftComments a prov:Entity ;
:status "blah";
prov:wasGeneratedBy :commenting .
:commenting a prov:Activity ;
prov:used :draftV1 .
:draftV1 a prov:Entity .
}
Bundle usage:
ex:bundle1 a prov:Entity, prov:Bundle ;
prov:qualifiedGeneration [ a prov:Generation ;
prov:atTime "2013-03-20T10:30:00+09:00" ];
prov:wasAttributedTo :Bob .
Time, Events
- Provenance statements are combined by different systems
- An application may not be able to align the times involved to a single
global timeline
Therefore, PROV minimizes assumptions about time
Instead, the PROV data model is implicitly based on a notion of
instantaneous events, that mark transitions in the world (*)
(*) PROV-CONSTR http://www.w3.org/TR/prov-constraints/#events (non-normative)
Events:
- activity start, activity end,
- entity generation, entity usage, entity invalidation
wasStartedBy(id; a2, e, a1, t, attrs)
wasEndedBy(id; a2, e, a1, t, attrs)
From “scruffy” provenance to “valid” provenance
- Are all possible temporal partial orderings of events equally acceptable?
- How can we specify the set of all valid orderings?
More generally, how do we formally define what it means for a set of
provenance statements to be valid?
PROV defines a set of temporal constraints that ensure consistency
of a provenance graph
Summary
• Motivation for collecting provenance of data and information
• In Science
• In the Social Web
• The W3C PROV Recommendation (2013)
• PROV-DM: The PROV data model
• PROV-O: the Provenance Ontology
• (PROV-CONSTRAINTS)
• Provenance as Big Data
• High volume provenance
• Storage, analytics, visualisation
• Provenance of analytics
• How can I explain my predictions?
• The ProvGen tool
Selected bibliography
Moreau, Luc, Paolo Missier, Khalid Belhajjame, Reza B’Far, James Cheney, Sam Coppens, Stephen Cresswell,
et al. PROV-DM: The PROV Data Model. Edited by Luc Moreau and Paolo Missier, 2012.
http://www.w3.org/TR/prov-dm/
Cheney, James, Paolo Missier, and Luc Moreau. Constraints of the Provenance Data Model, 2012.
http://www.w3.org/TR/prov-constraints/
Moreau, Luc, Paul Groth, James Cheney, Timothy Lebo, and Simon Miles. “The Rationale of PROV.” Web
Semantics: Science, Services and Agents on the World Wide Web (April 2015).
doi:10.1016/j.websem.2015.04.001.
http://www.sciencedirect.com/science/article/pii/S1570826815000177
Marinho, Anderson, Leonardo Murta, Cláudia Werner, Vanessa Braganholo, Sérgio Manuel Serra da Cruz,
Eduardo Ogasawara, and Marta Mattoso. “ProvManager: a Provenance Management System for Scientific
Workflows.” Concurrency and Computation: Practice and Experience 24, no. 13 (2012): 1513–1530.
http://dx.doi.org/10.1002/cpe.1870.
ProvGen: generating synthetic PROV graphs with predictable structure.
Firth, H.; and Missier, P. In Procs. IPAW 2014 (Provenance and Annotations), Koln, Germany, 2014. Springer
http://arxiv.org/pdf/1406.2495
ProvAbs: model, policy, and tooling for abstracting PROV graphs.
Missier, P.; Bryans, J.; Gamble, C.; Curcin, V.; and Danger, R. In Procs. IPAW 2014 (Provenance and
Annotations), Koln, Germany, 2014. Springer http://arxiv.org/pdf/1406.1998
De Oliveira, Daniel, Vítor Silva, and Marta Mattoso. “How Much Domain Data Should Be in Provenance
Databases?” In 7th USENIX Workshop on the Theory and Practice of Provenance (TaPP 15). Edinburgh,
Scotland: USENIX Association, 2015.
https://www.usenix.org/conference/tapp15/workshop-program/presentation/de-oliveira
For many of its articles, NowNews relies on the integration of multiple data sources.
In order to ensure correct credit is given, NowNews wants to provide a central acknowledgments list that recognizes all the people and data sources that contribute to all the various articles and information that it publishes.
W3C Recommendation (REC)
A W3C Recommendation is a specification or set of guidelines that, after extensive consensus-building, has received the endorsement of W3C Members and the Director. W3C recommends the wide deployment of its Recommendations. Note: W3C Recommendations are similar to the standards published by other organizations.
remark on PROV-AQ: nothing to do with querying, but a query model can be associated to each of the encodings
Working Group Note
A Working Group Note is published by a chartered Working Group to indicate that work has ended on a particular topic. A Working Group may publish a Working Group Note with or without its prior publication as a Working Draft.
Alice, a senior editor, produces draft V1 of a document, after reading papers paper1 and paper2. v1 is for internal distribution only
Later, Bob, who is the main editor and works for Alice, comments on the draft, producing a new document, draft comments
duality between elements (generation) and relations (wasGeneratedBy)
baseline-noAgents.provn
baseline-noAgents-unqual.n3
agents are software, organization, person -- non-normative
distinguish between normative and non-normative parts of the PROV documents
Examples of association between an activity and an agent are:
creation of a web page under the guidance of a designer;
various forms of participation in a panel discussion, including audience member, panelist, or panel chair;
a public event, sponsored by a company, and hosted by a museum;
mention that derivation is missing -- this requires more insight into relation IDs
Most relations admit optional arguments (e.g. time)
First-class arguments may be optional, too. For instance, generation with implicit activity
Often only some combinations of arguments are legal
A single (real world) artifact may correspond to several entities in a provenance model that includes descriptions of such artifact.
The choice of mapping is determined by which characteristics of the artifact are relevant for (a specific) provenance description of it
Whenever one of these attributes changes, a new entity is created
ex.: the doc before and after editing. Some characteristics that are relevant for provenance have changed.
These entities are however related
These relationships can be expressed in PROV
... and I could have bundles that refer to other bundles...