SlideShare a Scribd company logo
1 of 39
Provenance for Reproducible Data Science
Andreas Schreiber
German Aerospace Center (DLR)
Cologne/Berlin, Germany
PyData Seattle 2017
> PyData Seattle 2017 > A. Schreiber • Provenance for Reproducible Data Science > 06.07.2017DLR.de • Chart 1
Topics
• Introduction to provenance and PROV
• Modelling provenance for data processing
• Python APIs for provenance recording
• Provenance recording for Jupyter notebooks
• Storing provenance in graph databases
• Analysis of provenance information
> PyData Seattle 2017 > A. Schreiber • Provenance for Reproducible Data Science > 06.07.2017DLR.de • Chart 2
Introduction
> PyData Seattle 2017 > A. Schreiber • Provenance for Reproducible Data Science > 06.07.2017DLR.de • Chart 3
Study Industrial Mathematics
(dt. Technomathematik)
Soldier (SIGINT)
Scientist (Grid Computing)
Introduction
> PyData Seattle 2017 > A. Schreiber • Provenance for Reproducible Data Science > 06.07.2017DLR.de • Chart 4
Co-Founder
Data Scientist
Patient
Simulation and Software Technology, Cologne/Berlin
Head of Intelligent and Distributed Systems department
Institute of Data Science, Jena
Head of Secure Software Engineering group
Reproducibility
Reproducibility in (data) science is based on
• Open Source Software
• Code Reviews
• Code Repositories
• Publications with code
• Container (Docker etc.)
• Workflows
• (Electronic) laboratory notebooks
• Open data formats
• Data management
• Metadata and Provenance
> PyData Seattle 2017 > A. Schreiber • Provenance for Reproducible Data Science > 06.07.2017DLR.de • Chart 5
Metadata and Provenance
Each data file in data processing has two parts
• Data: Actual measured, simulated, or generated data results
• Metadata: Information that describes or relates to the data
Metadata can include a large variety of information
• Geographic information that can be used to limit a spatial search
• Quality information (“Data is bad for some reason,” “Granule is cloud obscured”)
• Instrument configuration information (“Instrument in spectral zoom mode,” “Spacecraft
maneuver in progress”)
• Extra information about the data files themselves: file size, checksum for data integrity
verification
• Provenance information (Where did I get this file? How did it come to exist?)
> PyData Seattle 2017 > A. Schreiber • Provenance for Reproducible Data Science > 06.07.2017DLR.de • Chart 6
Provenance
Basics
• Provenance refers to the source of information and the process that led to its existence
• Provenance information is critical to users trying to understand where a particular data file
came from
Other and related terms
• Traceability
• Lineage
• Logging
• Monitoring
> PyData Seattle 2017 > A. Schreiber • Provenance for Reproducible Data Science > 06.07.2017DLR.de • Chart 7
Provenance Information
Capture, archive, and distribute provenance information, for example
• The source of all externally supplied data files
• The source of the algorithms used to transform the data within the system
• Algorithm Design Documents
• A complete description of the processing environment
• A complete description of the processing framework
• A record of each job’s execution
> PyData Seattle 2017 > A. Schreiber • Provenance for Reproducible Data Science > 06.07.2017DLR.de • Chart 8
Big Data Processing
> PyData Seattle 2017 > A. Schreiber • Provenance for Reproducible Data Science > 06.07.2017DLR.de • Chart 9
High-Performance Computing
• Explicit Control
Distributed Computing
• Implicit Control (via Graphs)
MPI Processes Threads Hadoop Spark Dask
Scale Up Scale Out
> PyData Seattle 2017 > A. Schreiber • Provenance for Reproducible Data Science > 06.07.2017DLR.de • Chart 10
Data Science Workflows
More Formal Definition of Provenance
Provenance is
information about entities, activities, and people
involved in
producing a piece of data or thing,
which can be used to form
assessments about its quality, reliability or trustworthiness.
PROV W3C Working Group
https://www.w3.org/TR/prov-overview
> PyData Seattle 2017 > A. Schreiber • Provenance for Reproducible Data Science > 06.07.2017DLR.de • Chart 11
W3C Specification „PROV“
• PROV-O, the PROV ontology, an OWL2 ontology allowing the mapping of the PROV data
model to RDF
• PROV-DM, the PROV data model for provenance
• PROV-N, a notation for provenance aimed at human consumption
• PROV-CONSTRAINTS, a set of constraints applying to the PROV data model
• PROV-XML, an XML schema for the PROV data model
• PROV-AQ, mechanisms for accessing and querying provenance
• PROV-DICTIONARY introduces a specific type of collection, consisting of key-entity pairs
• PROV-DC provides a mapping between PROV-O and Dublin Core Terms
• PROV-SEM, a declarative specification in terms of first-order logic of the PROV data model
• PROV-LINKS introduces a mechanism to link across bundles
> PyData Seattle 2017 > A. Schreiber • Provenance for Reproducible Data Science > 06.07.2017DLR.de • Chart 12
PROV Elements
Entities
• Physical, digital, conceptual, or other kinds of things
• For example, documents, web sites, graphics, or data sets
Activities
• Activities generate new entities or
make use of existing entities
• Activities could be actions or processes
Agents
• Agents takes a role in an activity and have
the responsibility for the activity
• For example, persons, pieces of software,
or organizations
> PyData Seattle 2017 > A. Schreiber • Provenance for Reproducible Data Science > 06.07.2017DLR.de • Chart 13
Activity
Entity
Agent
PROV Relations
> PyData Seattle 2017 > A. Schreiber • Provenance for Reproducible Data Science > 06.07.2017DLR.de • Chart 14
Activity
Entity
Agent
wasGeneratedBy
used
wasDerivedFrom
wasAttributedTo
wasAssociatedWith
Baking a Cake
> PyData Seattle 2017 > A. Schreiber • Provenance for Reproducible Data Science > 06.07.2017DLR.de • Chart 15
100 g
butter
bake
2
eggs
100 g
sugar
100 g
flour
cake
used
used
used
used
wasGeneratedBy
wasDerivedFrom
Textual Representations Visualizations
PROV Notations and Representations
• Formats: PROV-N, JSON, Turtle, XML, …
> PyData Seattle 2017 > A. Schreiber • Provenance for Reproducible Data Science > 06.07.2017DLR.de • Chart 16
document
prefix userdata http://software.dlr.de/qs/userdata/
. . .
wasDerivedFrom(userdata:weights, userdata:WeightReport.csv,
wasDerivedFrom(qs:graphic/weights, userdata:weights,
wasAssociatedWith(qs:graphic/weights, qs:user/onyame@gmail.com, -)
used(python_method:read_csv, library:pandas, -)
used(python_method:matplotlib_plot, userdata:weights, -)
used(python_method:matplotlib_plot, library:matplotlib, -)
used(python_method:read_csv, userdata:WeightReport.csv, -)
wasAttributedTo(userdata:WeightReport.csv, qs:user/onyame@gmail.com)
agent(qs:user/onyame@gmail.com, [prov:type="prov:Person"])
entity(library:pandas, [library:version="0.17.1"])
entity(userdata:WeightReport.csv)
entity(userdata:weights)
. . .
endDocument
Provenance Architecture
> PyData Seattle 2017 > A. Schreiber • Provenance for Reproducible Data Science > 06.07.2017DLR.de • Chart 17
Provenance
Store
Recording of Data Processing
Information
Application
Data (Results)
Storing and Retrieving Provenance
Some Storage Technologies
• Relational databases and SQL
• XML and Xpath
• RDF and SPARQL
• Graph databases and Gremlin/Cypher
Services
• REST APIs
• PROVSTORE
> PyData Seattle 2017 > A. Schreiber • Provenance for Reproducible Data Science > 06.07.2017DLR.de • Chart 18
ProvStore
University of Southampton
• RESTful web service
• storage and access of
provenance documents
• Public and private
documents
• Conversion to various
text formats
• Simple visualizations
• APIs
• Python
• jQuery
> PyData Seattle 2017 > A. Schreiber • Provenance for Reproducible Data Science > 06.07.2017DLR.de • Chart 19
https://provenance.ecs.soton.ac.uk/store/
Graphs
Provenance is a Directed Acyclic Graph (DAG)
> PyData Seattle 2017 > A. Schreiber • Provenance for Reproducible Data Science > 06.07.2017DLR.de • Chart 20
A
B
E
F
G
D
C
Graph Databases
Naturally, graph databases are a good
technology for storing (Provenance) graphs
Many graph databases are available
• Neo4J
• Titan
• ArangoDB
• ...
Query languages
• Cypher
• Gremlin (TinkerPop)
• GraphQL
> PyData Seattle 2017 > A. Schreiber • Provenance for Reproducible Data Science > 06.07.2017DLR.de • Chart 21
Neo4j
• Open-Source
• Implemented in Java
• Stores property graphs
(key-value-based, directed)
http://neo4j.com
> PyData Seattle 2017 > A. Schreiber • Provenance for Reproducible Data Science > 06.07.2017DLR.de • Chart 22
Storing Provenance in Graph Database
> PyData Seattle 2017 > A. Schreiber • Provenance for Reproducible Data Science > 06.07.2017DLR.de • Chart 23
Graph database Neo4j
MATCH (e:Entity)-[*]-(u:Agent) RETURN u
Trusted Provenance: Storing Provenance in a Blockchain
> PyData Seattle 2017 > A. Schreiber • Provenance for Reproducible Data Science > 06.07.2017DLR.de • Chart 24
PROV2BIGCHAINDB
https://github.com/DLR-SC/prov2bigchaindb
Gather or Generate Provenance
Depends on your application (tools, languages, etc.)
• Generation at run-time, compile-time, or retrospectively
Runtime
• Instrumentation of the application
• Cumbersome from software engineering perspective
• Combined with logging or with aspect-oriented approaches
Compile time
• Based on static code analysis (dependency analysis, program slicing, etc.)
Retrospectively
• Reconstructed from files or filesystem metadata
> PyData Seattle 2017 > A. Schreiber • Provenance for Reproducible Data Science > 06.07.2017DLR.de • Chart 25
Tools and Libraries for Python
Libraries for Python
• PROVPY
• PROVNEO4J
• PROV-DB-CONNECTOR
Other Tools
• NOWORKFLOW
• GIT2PROV
> PyData Seattle 2017 > A. Schreiber • Provenance for Reproducible Data Science > 06.07.2017DLR.de • Chart 26
Python Library ProvPy (PROV)
https://github.com/trungdong/prov
> PyData Seattle 2017 > A. Schreiber • Provenance for Reproducible Data Science > 06.07.2017DLR.de • Chart 27
from prov.model import ProvDocument
# Create a new provenance document
d1 = ProvDocument()
# Entity: now:employment-article-v1.html
e1 = d1.entity('now:employment-article-v1.html')
# Agent: nowpeople:Bob
d1.agent('nowpeople:Bob')
# Attributing the article to the agent
d1.wasAttributedTo(e1, 'nowpeople:Bob')
d1.entity('govftp:oesm11st.zip',
{'prov:label': 'employment-stats-2011',
'prov:type': 'void:Dataset'})
d1.wasDerivedFrom('now:employment-article-v1.html',
'govftp:oesm11st.zip')
# Adding an activity
d1.activity('is:writeArticle')
d1.used('is:writeArticle', 'govftp:oesm11st.zip')
d1.wasGeneratedBy('now:employment-article-v1.html', 'is:writeArticle')
Python Library ProvPy (PROV)
https://github.com/trungdong/prov
> PyData Seattle 2017 > A. Schreiber • Provenance for Reproducible Data Science > 06.07.2017DLR.de • Chart 28
PROVNEO4J – Storing PROV Documents in Neo4j
https://github.com/DLR-SC/provneo4j
> PyData Seattle 2017 > A. Schreiber • Provenance for Reproducible Data Science > 06.07.2017DLR.de • Chart 29
import provneo4j.api
provneo4j_api = provneo4j.api.Api(
base_url="http://localhost:7474/db/data",
username="neo4j", password="python")
provneo4j_api.document.create(prov_doc, name=”MyProv”)
PROVNEO4J – Storing PROV Documents in Neo4j
https://github.com/DLR-SC/provneo4j
> PyData Seattle 2017 > A. Schreiber • Provenance for Reproducible Data Science > 06.07.2017DLR.de • Chart 30
PROV-DB-CONNECTOR
https://github.com/DLR-SC/prov-db-connector
Successor of PROVNEO4J
• Connectors for Neo4j
implemented
• ArangoDB planned
• APIs in REST
implemented
• ZeroMQ & MQTT planned
> PyData Seattle 2017 > A. Schreiber • Provenance for Reproducible Data Science > 06.07.2017DLR.de • Chart 31
from prov.model import ProvDocument
from provdbconnector import ProvDb
from provdbconnector.db_adapters.in_memory import SimpleInMemoryAdapter
prov_api = ProvDb(adapter=SimpleInMemoryAdapter, auth_info=None)
# create the prov document
prov_document = ProvDocument()
prov_document.add_namespace("ex", "http://example.com")
prov_document.agent("ex:Bob")
prov_document.activity("ex:Alice")
prov_document.association("ex:Alice", "ex:Bob")
document_id = prov_api.save_document(prov_document)
print(prov_api.get_document_as_provn(document_id))
Provenance Instrumentation of TensorFlow
Provenance of TensorFlow workflows
• Tensor  PROV Entity
• Operations  PROV Activity
Example: MNIST with 400 training iterations
• 64581 database nodes
• 33549 Entities
• 31032 Activities
> PyData Seattle 2017 > A. Schreiber • Provenance for Reproducible Data Science > 06.07.2017DLR.de • Chart 32
Example Query
• Shortest paths from all tensors
in 400. iteration to init operation
> PyData Seattle 2017 > A. Schreiber • Provenance for Reproducible Data Science > 06.07.2017DLR.de • Chart 33
MATCH path=allShortestPaths((root)<-[*]-(n))
WHERE root.`tf:type`="tf:Session_init" and n.`tf:name` =~ ".*_400"
RETURN path
NOWORKFLOW – Provenance of Scripts
https://github.com/gems-uff/noworkflow
> PyData Seattle 2017 > A. Schreiber • Provenance for Reproducible Data Science > 06.07.2017DLR.de • Chart 34
Project
experiment.py
p12.dat
p13.dat
precipitation.py
p14.dat
out.png
$ now run -e Tracker experiment.py
GIT2PROV
http://git2prov.org
• Generate PROV documents from
git repositories
> PyData Seattle 2017 > A. Schreiber • Provenance for Reproducible Data Science > 06.07.2017DLR.de • Chart 35
GIT2PROV Example Output
https://provenance.ecs.soton.ac.uk/store/documents/116377/
> PyData Seattle 2017 > A. Schreiber • Provenance for Reproducible Data Science > 06.07.2017DLR.de • Chart 36
Provenance Visualization
Visualization of Provenance is an ongoing research topic
• Especially, for non-experts (“Provenance for people”)
• Example: PROV COMICS
> PyData Seattle 2017 > A. Schreiber • Provenance for Reproducible Data Science > 06.07.2017DLR.de • Chart 37
Key Messages and Summary
Recording the Provenance of Data Science workflows is important
• to understand where data came from
• to reproduce data processing steps or whole workflows
Use a standard for Provenance
• W3C standard PROV
• Mapping to (graph) databases, allows easy querying
• A standard allow interoperability and comparison
Recording Provenance is not hard
• APIs for Python
• Tools
> PyData Seattle 2017 > A. Schreiber • Provenance for Reproducible Data Science > 06.07.2017DLR.de • Chart 38
Activity
Entity
Agent
wasGeneratedBy
used
wasDerivedFrom
wasAttributedTo
wasAssociatedWith
> PyData Seattle 2017 > A. Schreiber • Provenance for Reproducible Data Science > 06.07.2017DLR.de • Chart 39
Thank You!
Questions?
Andreas.Schreiber@dlr.de
www.DLR.de/sc | @onyame

More Related Content

What's hot

A look inside pandas design and development
A look inside pandas design and developmentA look inside pandas design and development
A look inside pandas design and development
Wes McKinney
 
Start Getting Your Feet Wet in Open Source Machine and Deep Learning
Start Getting Your Feet Wet in Open Source Machine and Deep Learning Start Getting Your Feet Wet in Open Source Machine and Deep Learning
Start Getting Your Feet Wet in Open Source Machine and Deep Learning
Ian Gomez
 
Hadoop summit 2010 frameworks panel elephant bird
Hadoop summit 2010 frameworks panel elephant birdHadoop summit 2010 frameworks panel elephant bird
Hadoop summit 2010 frameworks panel elephant bird
Kevin Weil
 
How Google Does Big Data - DevNexus 2014
How Google Does Big Data - DevNexus 2014How Google Does Big Data - DevNexus 2014
How Google Does Big Data - DevNexus 2014
James Chittenden
 
Presentation Brucon - Anubisnetworks and PTCoresec
Presentation Brucon - Anubisnetworks and PTCoresecPresentation Brucon - Anubisnetworks and PTCoresec
Presentation Brucon - Anubisnetworks and PTCoresec
Tiago Henriques
 

What's hot (20)

Big Data in the Real World
Big Data in the Real WorldBig Data in the Real World
Big Data in the Real World
 
Graphs & Big Data - Philip Rathle and Andreas Kollegger @ Big Data Science Me...
Graphs & Big Data - Philip Rathle and Andreas Kollegger @ Big Data Science Me...Graphs & Big Data - Philip Rathle and Andreas Kollegger @ Big Data Science Me...
Graphs & Big Data - Philip Rathle and Andreas Kollegger @ Big Data Science Me...
 
A look inside pandas design and development
A look inside pandas design and developmentA look inside pandas design and development
A look inside pandas design and development
 
The architecture of data analytics PaaS on AWS
The architecture of data analytics PaaS on AWSThe architecture of data analytics PaaS on AWS
The architecture of data analytics PaaS on AWS
 
Integration of data ninja services with oracle spatial and graph
Integration of data ninja services with oracle spatial and graphIntegration of data ninja services with oracle spatial and graph
Integration of data ninja services with oracle spatial and graph
 
Start Getting Your Feet Wet in Open Source Machine and Deep Learning
Start Getting Your Feet Wet in Open Source Machine and Deep Learning Start Getting Your Feet Wet in Open Source Machine and Deep Learning
Start Getting Your Feet Wet in Open Source Machine and Deep Learning
 
Python at Warp Speed
Python at Warp SpeedPython at Warp Speed
Python at Warp Speed
 
Graph Analysis over JSON, Larus
Graph Analysis over JSON, LarusGraph Analysis over JSON, Larus
Graph Analysis over JSON, Larus
 
Big Data Use Cases
Big Data Use CasesBig Data Use Cases
Big Data Use Cases
 
Hadoop summit 2010 frameworks panel elephant bird
Hadoop summit 2010 frameworks panel elephant birdHadoop summit 2010 frameworks panel elephant bird
Hadoop summit 2010 frameworks panel elephant bird
 
Building Better Analytics Workflows (Strata-Hadoop World 2013)
Building Better Analytics Workflows (Strata-Hadoop World 2013)Building Better Analytics Workflows (Strata-Hadoop World 2013)
Building Better Analytics Workflows (Strata-Hadoop World 2013)
 
How Google Does Big Data - DevNexus 2014
How Google Does Big Data - DevNexus 2014How Google Does Big Data - DevNexus 2014
How Google Does Big Data - DevNexus 2014
 
MATLAB, netCDF, and OPeNDAP
MATLAB, netCDF, and OPeNDAPMATLAB, netCDF, and OPeNDAP
MATLAB, netCDF, and OPeNDAP
 
Hadoop in Validated Environment - Data Governance Initiative
Hadoop in Validated Environment - Data Governance InitiativeHadoop in Validated Environment - Data Governance Initiative
Hadoop in Validated Environment - Data Governance Initiative
 
Observe Changes of Taiwan Big Data Communities with Small Data
Observe Changes of Taiwan Big Data Communities with Small DataObserve Changes of Taiwan Big Data Communities with Small Data
Observe Changes of Taiwan Big Data Communities with Small Data
 
HPC-ABDS: The Case for an Integrating Apache Big Data Stack with HPC
HPC-ABDS: The Case for an Integrating Apache Big Data Stack with HPC HPC-ABDS: The Case for an Integrating Apache Big Data Stack with HPC
HPC-ABDS: The Case for an Integrating Apache Big Data Stack with HPC
 
ISAX
ISAXISAX
ISAX
 
Stacked Ensembles in H2O
Stacked Ensembles in H2OStacked Ensembles in H2O
Stacked Ensembles in H2O
 
Are we reaching a Data Science Singularity? How Cognitive Computing is emergi...
Are we reaching a Data Science Singularity? How Cognitive Computing is emergi...Are we reaching a Data Science Singularity? How Cognitive Computing is emergi...
Are we reaching a Data Science Singularity? How Cognitive Computing is emergi...
 
Presentation Brucon - Anubisnetworks and PTCoresec
Presentation Brucon - Anubisnetworks and PTCoresecPresentation Brucon - Anubisnetworks and PTCoresec
Presentation Brucon - Anubisnetworks and PTCoresec
 

Similar to Provenance for Reproducible Data Science

Data Preparation vs. Inline Data Wrangling in Data Science and Machine Learning
Data Preparation vs. Inline Data Wrangling in Data Science and Machine LearningData Preparation vs. Inline Data Wrangling in Data Science and Machine Learning
Data Preparation vs. Inline Data Wrangling in Data Science and Machine Learning
Kai Wähner
 
Bluegranite AA Webinar FINAL 28JUN16
Bluegranite AA Webinar FINAL 28JUN16Bluegranite AA Webinar FINAL 28JUN16
Bluegranite AA Webinar FINAL 28JUN16
Andy Lathrop
 

Similar to Provenance for Reproducible Data Science (20)

Provenance as a building block for an open science infrastructure
Provenance as a building block for an open science infrastructureProvenance as a building block for an open science infrastructure
Provenance as a building block for an open science infrastructure
 
BDW Chicago 2016 - Jim Scott, Director, Enterprise Strategy & Architecture - ...
BDW Chicago 2016 - Jim Scott, Director, Enterprise Strategy & Architecture - ...BDW Chicago 2016 - Jim Scott, Director, Enterprise Strategy & Architecture - ...
BDW Chicago 2016 - Jim Scott, Director, Enterprise Strategy & Architecture - ...
 
Neo4j in Depth
Neo4j in DepthNeo4j in Depth
Neo4j in Depth
 
Graph databases and the #panamapapers
Graph databases and the #panamapapersGraph databases and the #panamapapers
Graph databases and the #panamapapers
 
From Developer to Data Scientist
From Developer to Data ScientistFrom Developer to Data Scientist
From Developer to Data Scientist
 
Neo4j GraphTalk Oslo - Introduction to Graphs
Neo4j GraphTalk Oslo - Introduction to GraphsNeo4j GraphTalk Oslo - Introduction to Graphs
Neo4j GraphTalk Oslo - Introduction to Graphs
 
Introduction to R for data science
Introduction to R for data scienceIntroduction to R for data science
Introduction to R for data science
 
2017-01-08-scaling tribalknowledge
2017-01-08-scaling tribalknowledge2017-01-08-scaling tribalknowledge
2017-01-08-scaling tribalknowledge
 
Data Preparation vs. Inline Data Wrangling in Data Science and Machine Learning
Data Preparation vs. Inline Data Wrangling in Data Science and Machine LearningData Preparation vs. Inline Data Wrangling in Data Science and Machine Learning
Data Preparation vs. Inline Data Wrangling in Data Science and Machine Learning
 
An R primer for SQL folks
An R primer for SQL folksAn R primer for SQL folks
An R primer for SQL folks
 
Talavant Data Lake Analytics
Talavant Data Lake Analytics Talavant Data Lake Analytics
Talavant Data Lake Analytics
 
The Great Lakes: How to Approach a Big Data Implementation
The Great Lakes: How to Approach a Big Data ImplementationThe Great Lakes: How to Approach a Big Data Implementation
The Great Lakes: How to Approach a Big Data Implementation
 
Intro to Neo4j Webinar
Intro to Neo4j WebinarIntro to Neo4j Webinar
Intro to Neo4j Webinar
 
Big Data Analytics - Best of the Worst : Anti-patterns & Antidotes
Big Data Analytics - Best of the Worst : Anti-patterns & AntidotesBig Data Analytics - Best of the Worst : Anti-patterns & Antidotes
Big Data Analytics - Best of the Worst : Anti-patterns & Antidotes
 
Reproducible Science with Python
Reproducible Science with PythonReproducible Science with Python
Reproducible Science with Python
 
Oracle Stream Analytics - Simplifying Stream Processing
Oracle Stream Analytics - Simplifying Stream ProcessingOracle Stream Analytics - Simplifying Stream Processing
Oracle Stream Analytics - Simplifying Stream Processing
 
Paradigmas de procesamiento en Big Data: estado actual, tendencias y oportu...
Paradigmas de procesamiento en  Big Data: estado actual,  tendencias y oportu...Paradigmas de procesamiento en  Big Data: estado actual,  tendencias y oportu...
Paradigmas de procesamiento en Big Data: estado actual, tendencias y oportu...
 
Resume (kaushik shakkari)
Resume (kaushik shakkari)Resume (kaushik shakkari)
Resume (kaushik shakkari)
 
Bluegranite AA Webinar FINAL 28JUN16
Bluegranite AA Webinar FINAL 28JUN16Bluegranite AA Webinar FINAL 28JUN16
Bluegranite AA Webinar FINAL 28JUN16
 
Accelerating Data Lakes and Streams with Real-time Analytics
Accelerating Data Lakes and Streams with Real-time AnalyticsAccelerating Data Lakes and Streams with Real-time Analytics
Accelerating Data Lakes and Streams with Real-time Analytics
 

More from Andreas Schreiber

More from Andreas Schreiber (20)

Provenance-based Security Audits and its Application to COVID-19 Contact Trac...
Provenance-based Security Audits and its Application to COVID-19 Contact Trac...Provenance-based Security Audits and its Application to COVID-19 Contact Trac...
Provenance-based Security Audits and its Application to COVID-19 Contact Trac...
 
Visualization of Software Architectures in Virtual Reality and Augmented Reality
Visualization of Software Architectures in Virtual Reality and Augmented RealityVisualization of Software Architectures in Virtual Reality and Augmented Reality
Visualization of Software Architectures in Virtual Reality and Augmented Reality
 
Raising Awareness about Open Source Licensing at the German Aerospace Center
Raising Awareness about Open Source Licensing at the German Aerospace CenterRaising Awareness about Open Source Licensing at the German Aerospace Center
Raising Awareness about Open Source Licensing at the German Aerospace Center
 
Open Source Licensing for Rocket Scientists
Open Source Licensing for Rocket ScientistsOpen Source Licensing for Rocket Scientists
Open Source Licensing for Rocket Scientists
 
Interactive Visualization of Software Components with Virtual Reality Headsets
Interactive Visualization of Software Components with Virtual Reality HeadsetsInteractive Visualization of Software Components with Virtual Reality Headsets
Interactive Visualization of Software Components with Virtual Reality Headsets
 
Visualizing Provenance using Comics
Visualizing Provenance using ComicsVisualizing Provenance using Comics
Visualizing Provenance using Comics
 
Quantified Self Comics
Quantified Self ComicsQuantified Self Comics
Quantified Self Comics
 
Nachvollziehbarkeit mit Hinblick auf Privacy-Verletzungen
Nachvollziehbarkeit mit Hinblick auf Privacy-VerletzungenNachvollziehbarkeit mit Hinblick auf Privacy-Verletzungen
Nachvollziehbarkeit mit Hinblick auf Privacy-Verletzungen
 
A Provenance Model for Quantified Self Data
A Provenance Model for Quantified Self DataA Provenance Model for Quantified Self Data
A Provenance Model for Quantified Self Data
 
Open Source im DLR
Open Source im DLROpen Source im DLR
Open Source im DLR
 
Tracking after Stroke: Doctors, Dogs and All The Rest
Tracking after Stroke: Doctors, Dogs and All The RestTracking after Stroke: Doctors, Dogs and All The Rest
Tracking after Stroke: Doctors, Dogs and All The Rest
 
High Throughput Processing of Space Debris Data
High Throughput Processing of Space Debris DataHigh Throughput Processing of Space Debris Data
High Throughput Processing of Space Debris Data
 
Bericht von der QS15 Conference & Exposition
Bericht von der QS15 Conference & ExpositionBericht von der QS15 Conference & Exposition
Bericht von der QS15 Conference & Exposition
 
Telemedizin: Gesundheit, messbar für jedermann
Telemedizin: Gesundheit, messbar für jedermannTelemedizin: Gesundheit, messbar für jedermann
Telemedizin: Gesundheit, messbar für jedermann
 
Big Python
Big PythonBig Python
Big Python
 
Quantified Self mit Wearable Devices und Smartphone-Sensoren
Quantified Self mit Wearable Devices und Smartphone-SensorenQuantified Self mit Wearable Devices und Smartphone-Sensoren
Quantified Self mit Wearable Devices und Smartphone-Sensoren
 
Example Blood Pressure Report of BloodPressureCompanion
Example Blood Pressure Report of BloodPressureCompanionExample Blood Pressure Report of BloodPressureCompanion
Example Blood Pressure Report of BloodPressureCompanion
 
Beispiel-Blutdruckbericht des BlutdruckBegleiter
Beispiel-Blutdruckbericht des BlutdruckBegleiterBeispiel-Blutdruckbericht des BlutdruckBegleiter
Beispiel-Blutdruckbericht des BlutdruckBegleiter
 
Informatik für die Welt von Morgen
Informatik für die Welt von MorgenInformatik für die Welt von Morgen
Informatik für die Welt von Morgen
 
Python for High Performance and Scientific Computing
Python for High Performance and Scientific ComputingPython for High Performance and Scientific Computing
Python for High Performance and Scientific Computing
 

Recently uploaded

Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
nirzagarg
 
Reconciling Conflicting Data Curation Actions: Transparency Through Argument...
Reconciling Conflicting Data Curation Actions:  Transparency Through Argument...Reconciling Conflicting Data Curation Actions:  Transparency Through Argument...
Reconciling Conflicting Data Curation Actions: Transparency Through Argument...
Bertram Ludäscher
 
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
vexqp
 
Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...
Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...
Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...
HyderabadDolls
 
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
gajnagarg
 
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
nirzagarg
 
Jodhpur Park | Call Girls in Kolkata Phone No 8005736733 Elite Escort Service...
Jodhpur Park | Call Girls in Kolkata Phone No 8005736733 Elite Escort Service...Jodhpur Park | Call Girls in Kolkata Phone No 8005736733 Elite Escort Service...
Jodhpur Park | Call Girls in Kolkata Phone No 8005736733 Elite Escort Service...
HyderabadDolls
 
Top profile Call Girls In Latur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Latur [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Latur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Latur [ 7014168258 ] Call Me For Genuine Models We ...
gajnagarg
 
Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...
Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...
Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...
gajnagarg
 
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi ArabiaIn Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
ahmedjiabur940
 

Recently uploaded (20)

Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
 
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
 
Predicting HDB Resale Prices - Conducting Linear Regression Analysis With Orange
Predicting HDB Resale Prices - Conducting Linear Regression Analysis With OrangePredicting HDB Resale Prices - Conducting Linear Regression Analysis With Orange
Predicting HDB Resale Prices - Conducting Linear Regression Analysis With Orange
 
20240412-SmartCityIndex-2024-Full-Report.pdf
20240412-SmartCityIndex-2024-Full-Report.pdf20240412-SmartCityIndex-2024-Full-Report.pdf
20240412-SmartCityIndex-2024-Full-Report.pdf
 
Reconciling Conflicting Data Curation Actions: Transparency Through Argument...
Reconciling Conflicting Data Curation Actions:  Transparency Through Argument...Reconciling Conflicting Data Curation Actions:  Transparency Through Argument...
Reconciling Conflicting Data Curation Actions: Transparency Through Argument...
 
RESEARCH-FINAL-DEFENSE-PPT-TEMPLATE.pptx
RESEARCH-FINAL-DEFENSE-PPT-TEMPLATE.pptxRESEARCH-FINAL-DEFENSE-PPT-TEMPLATE.pptx
RESEARCH-FINAL-DEFENSE-PPT-TEMPLATE.pptx
 
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
 
Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...
Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...
Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...
 
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
 
Statistics notes ,it includes mean to index numbers
Statistics notes ,it includes mean to index numbersStatistics notes ,it includes mean to index numbers
Statistics notes ,it includes mean to index numbers
 
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
 
Digital Transformation Playbook by Graham Ware
Digital Transformation Playbook by Graham WareDigital Transformation Playbook by Graham Ware
Digital Transformation Playbook by Graham Ware
 
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
 
Jodhpur Park | Call Girls in Kolkata Phone No 8005736733 Elite Escort Service...
Jodhpur Park | Call Girls in Kolkata Phone No 8005736733 Elite Escort Service...Jodhpur Park | Call Girls in Kolkata Phone No 8005736733 Elite Escort Service...
Jodhpur Park | Call Girls in Kolkata Phone No 8005736733 Elite Escort Service...
 
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24  Building Real-Time Pipelines With FLaNKDATA SUMMIT 24  Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
 
Dubai Call Girls Peeing O525547819 Call Girls Dubai
Dubai Call Girls Peeing O525547819 Call Girls DubaiDubai Call Girls Peeing O525547819 Call Girls Dubai
Dubai Call Girls Peeing O525547819 Call Girls Dubai
 
Top profile Call Girls In Latur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Latur [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Latur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Latur [ 7014168258 ] Call Me For Genuine Models We ...
 
Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...
Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...
Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...
 
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi ArabiaIn Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
 
High Profile Call Girls Service in Jalore { 9332606886 } VVIP NISHA Call Girl...
High Profile Call Girls Service in Jalore { 9332606886 } VVIP NISHA Call Girl...High Profile Call Girls Service in Jalore { 9332606886 } VVIP NISHA Call Girl...
High Profile Call Girls Service in Jalore { 9332606886 } VVIP NISHA Call Girl...
 

Provenance for Reproducible Data Science

  • 1. Provenance for Reproducible Data Science Andreas Schreiber German Aerospace Center (DLR) Cologne/Berlin, Germany PyData Seattle 2017 > PyData Seattle 2017 > A. Schreiber • Provenance for Reproducible Data Science > 06.07.2017DLR.de • Chart 1
  • 2. Topics • Introduction to provenance and PROV • Modelling provenance for data processing • Python APIs for provenance recording • Provenance recording for Jupyter notebooks • Storing provenance in graph databases • Analysis of provenance information > PyData Seattle 2017 > A. Schreiber • Provenance for Reproducible Data Science > 06.07.2017DLR.de • Chart 2
  • 3. Introduction > PyData Seattle 2017 > A. Schreiber • Provenance for Reproducible Data Science > 06.07.2017DLR.de • Chart 3 Study Industrial Mathematics (dt. Technomathematik) Soldier (SIGINT) Scientist (Grid Computing)
  • 4. Introduction > PyData Seattle 2017 > A. Schreiber • Provenance for Reproducible Data Science > 06.07.2017DLR.de • Chart 4 Co-Founder Data Scientist Patient Simulation and Software Technology, Cologne/Berlin Head of Intelligent and Distributed Systems department Institute of Data Science, Jena Head of Secure Software Engineering group
  • 5. Reproducibility Reproducibility in (data) science is based on • Open Source Software • Code Reviews • Code Repositories • Publications with code • Container (Docker etc.) • Workflows • (Electronic) laboratory notebooks • Open data formats • Data management • Metadata and Provenance > PyData Seattle 2017 > A. Schreiber • Provenance for Reproducible Data Science > 06.07.2017DLR.de • Chart 5
  • 6. Metadata and Provenance Each data file in data processing has two parts • Data: Actual measured, simulated, or generated data results • Metadata: Information that describes or relates to the data Metadata can include a large variety of information • Geographic information that can be used to limit a spatial search • Quality information (“Data is bad for some reason,” “Granule is cloud obscured”) • Instrument configuration information (“Instrument in spectral zoom mode,” “Spacecraft maneuver in progress”) • Extra information about the data files themselves: file size, checksum for data integrity verification • Provenance information (Where did I get this file? How did it come to exist?) > PyData Seattle 2017 > A. Schreiber • Provenance for Reproducible Data Science > 06.07.2017DLR.de • Chart 6
  • 7. Provenance Basics • Provenance refers to the source of information and the process that led to its existence • Provenance information is critical to users trying to understand where a particular data file came from Other and related terms • Traceability • Lineage • Logging • Monitoring > PyData Seattle 2017 > A. Schreiber • Provenance for Reproducible Data Science > 06.07.2017DLR.de • Chart 7
  • 8. Provenance Information Capture, archive, and distribute provenance information, for example • The source of all externally supplied data files • The source of the algorithms used to transform the data within the system • Algorithm Design Documents • A complete description of the processing environment • A complete description of the processing framework • A record of each job’s execution > PyData Seattle 2017 > A. Schreiber • Provenance for Reproducible Data Science > 06.07.2017DLR.de • Chart 8
  • 9. Big Data Processing > PyData Seattle 2017 > A. Schreiber • Provenance for Reproducible Data Science > 06.07.2017DLR.de • Chart 9 High-Performance Computing • Explicit Control Distributed Computing • Implicit Control (via Graphs) MPI Processes Threads Hadoop Spark Dask Scale Up Scale Out
  • 10. > PyData Seattle 2017 > A. Schreiber • Provenance for Reproducible Data Science > 06.07.2017DLR.de • Chart 10 Data Science Workflows
  • 11. More Formal Definition of Provenance Provenance is information about entities, activities, and people involved in producing a piece of data or thing, which can be used to form assessments about its quality, reliability or trustworthiness. PROV W3C Working Group https://www.w3.org/TR/prov-overview > PyData Seattle 2017 > A. Schreiber • Provenance for Reproducible Data Science > 06.07.2017DLR.de • Chart 11
  • 12. W3C Specification „PROV“ • PROV-O, the PROV ontology, an OWL2 ontology allowing the mapping of the PROV data model to RDF • PROV-DM, the PROV data model for provenance • PROV-N, a notation for provenance aimed at human consumption • PROV-CONSTRAINTS, a set of constraints applying to the PROV data model • PROV-XML, an XML schema for the PROV data model • PROV-AQ, mechanisms for accessing and querying provenance • PROV-DICTIONARY introduces a specific type of collection, consisting of key-entity pairs • PROV-DC provides a mapping between PROV-O and Dublin Core Terms • PROV-SEM, a declarative specification in terms of first-order logic of the PROV data model • PROV-LINKS introduces a mechanism to link across bundles > PyData Seattle 2017 > A. Schreiber • Provenance for Reproducible Data Science > 06.07.2017DLR.de • Chart 12
  • 13. PROV Elements Entities • Physical, digital, conceptual, or other kinds of things • For example, documents, web sites, graphics, or data sets Activities • Activities generate new entities or make use of existing entities • Activities could be actions or processes Agents • Agents takes a role in an activity and have the responsibility for the activity • For example, persons, pieces of software, or organizations > PyData Seattle 2017 > A. Schreiber • Provenance for Reproducible Data Science > 06.07.2017DLR.de • Chart 13 Activity Entity Agent
  • 14. PROV Relations > PyData Seattle 2017 > A. Schreiber • Provenance for Reproducible Data Science > 06.07.2017DLR.de • Chart 14 Activity Entity Agent wasGeneratedBy used wasDerivedFrom wasAttributedTo wasAssociatedWith
  • 15. Baking a Cake > PyData Seattle 2017 > A. Schreiber • Provenance for Reproducible Data Science > 06.07.2017DLR.de • Chart 15 100 g butter bake 2 eggs 100 g sugar 100 g flour cake used used used used wasGeneratedBy wasDerivedFrom
  • 16. Textual Representations Visualizations PROV Notations and Representations • Formats: PROV-N, JSON, Turtle, XML, … > PyData Seattle 2017 > A. Schreiber • Provenance for Reproducible Data Science > 06.07.2017DLR.de • Chart 16 document prefix userdata http://software.dlr.de/qs/userdata/ . . . wasDerivedFrom(userdata:weights, userdata:WeightReport.csv, wasDerivedFrom(qs:graphic/weights, userdata:weights, wasAssociatedWith(qs:graphic/weights, qs:user/onyame@gmail.com, -) used(python_method:read_csv, library:pandas, -) used(python_method:matplotlib_plot, userdata:weights, -) used(python_method:matplotlib_plot, library:matplotlib, -) used(python_method:read_csv, userdata:WeightReport.csv, -) wasAttributedTo(userdata:WeightReport.csv, qs:user/onyame@gmail.com) agent(qs:user/onyame@gmail.com, [prov:type="prov:Person"]) entity(library:pandas, [library:version="0.17.1"]) entity(userdata:WeightReport.csv) entity(userdata:weights) . . . endDocument
  • 17. Provenance Architecture > PyData Seattle 2017 > A. Schreiber • Provenance for Reproducible Data Science > 06.07.2017DLR.de • Chart 17 Provenance Store Recording of Data Processing Information Application Data (Results)
  • 18. Storing and Retrieving Provenance Some Storage Technologies • Relational databases and SQL • XML and Xpath • RDF and SPARQL • Graph databases and Gremlin/Cypher Services • REST APIs • PROVSTORE > PyData Seattle 2017 > A. Schreiber • Provenance for Reproducible Data Science > 06.07.2017DLR.de • Chart 18
  • 19. ProvStore University of Southampton • RESTful web service • storage and access of provenance documents • Public and private documents • Conversion to various text formats • Simple visualizations • APIs • Python • jQuery > PyData Seattle 2017 > A. Schreiber • Provenance for Reproducible Data Science > 06.07.2017DLR.de • Chart 19 https://provenance.ecs.soton.ac.uk/store/
  • 20. Graphs Provenance is a Directed Acyclic Graph (DAG) > PyData Seattle 2017 > A. Schreiber • Provenance for Reproducible Data Science > 06.07.2017DLR.de • Chart 20 A B E F G D C
  • 21. Graph Databases Naturally, graph databases are a good technology for storing (Provenance) graphs Many graph databases are available • Neo4J • Titan • ArangoDB • ... Query languages • Cypher • Gremlin (TinkerPop) • GraphQL > PyData Seattle 2017 > A. Schreiber • Provenance for Reproducible Data Science > 06.07.2017DLR.de • Chart 21
  • 22. Neo4j • Open-Source • Implemented in Java • Stores property graphs (key-value-based, directed) http://neo4j.com > PyData Seattle 2017 > A. Schreiber • Provenance for Reproducible Data Science > 06.07.2017DLR.de • Chart 22
  • 23. Storing Provenance in Graph Database > PyData Seattle 2017 > A. Schreiber • Provenance for Reproducible Data Science > 06.07.2017DLR.de • Chart 23 Graph database Neo4j MATCH (e:Entity)-[*]-(u:Agent) RETURN u
  • 24. Trusted Provenance: Storing Provenance in a Blockchain > PyData Seattle 2017 > A. Schreiber • Provenance for Reproducible Data Science > 06.07.2017DLR.de • Chart 24 PROV2BIGCHAINDB https://github.com/DLR-SC/prov2bigchaindb
  • 25. Gather or Generate Provenance Depends on your application (tools, languages, etc.) • Generation at run-time, compile-time, or retrospectively Runtime • Instrumentation of the application • Cumbersome from software engineering perspective • Combined with logging or with aspect-oriented approaches Compile time • Based on static code analysis (dependency analysis, program slicing, etc.) Retrospectively • Reconstructed from files or filesystem metadata > PyData Seattle 2017 > A. Schreiber • Provenance for Reproducible Data Science > 06.07.2017DLR.de • Chart 25
  • 26. Tools and Libraries for Python Libraries for Python • PROVPY • PROVNEO4J • PROV-DB-CONNECTOR Other Tools • NOWORKFLOW • GIT2PROV > PyData Seattle 2017 > A. Schreiber • Provenance for Reproducible Data Science > 06.07.2017DLR.de • Chart 26
  • 27. Python Library ProvPy (PROV) https://github.com/trungdong/prov > PyData Seattle 2017 > A. Schreiber • Provenance for Reproducible Data Science > 06.07.2017DLR.de • Chart 27 from prov.model import ProvDocument # Create a new provenance document d1 = ProvDocument() # Entity: now:employment-article-v1.html e1 = d1.entity('now:employment-article-v1.html') # Agent: nowpeople:Bob d1.agent('nowpeople:Bob') # Attributing the article to the agent d1.wasAttributedTo(e1, 'nowpeople:Bob') d1.entity('govftp:oesm11st.zip', {'prov:label': 'employment-stats-2011', 'prov:type': 'void:Dataset'}) d1.wasDerivedFrom('now:employment-article-v1.html', 'govftp:oesm11st.zip') # Adding an activity d1.activity('is:writeArticle') d1.used('is:writeArticle', 'govftp:oesm11st.zip') d1.wasGeneratedBy('now:employment-article-v1.html', 'is:writeArticle')
  • 28. Python Library ProvPy (PROV) https://github.com/trungdong/prov > PyData Seattle 2017 > A. Schreiber • Provenance for Reproducible Data Science > 06.07.2017DLR.de • Chart 28
  • 29. PROVNEO4J – Storing PROV Documents in Neo4j https://github.com/DLR-SC/provneo4j > PyData Seattle 2017 > A. Schreiber • Provenance for Reproducible Data Science > 06.07.2017DLR.de • Chart 29 import provneo4j.api provneo4j_api = provneo4j.api.Api( base_url="http://localhost:7474/db/data", username="neo4j", password="python") provneo4j_api.document.create(prov_doc, name=”MyProv”)
  • 30. PROVNEO4J – Storing PROV Documents in Neo4j https://github.com/DLR-SC/provneo4j > PyData Seattle 2017 > A. Schreiber • Provenance for Reproducible Data Science > 06.07.2017DLR.de • Chart 30
  • 31. PROV-DB-CONNECTOR https://github.com/DLR-SC/prov-db-connector Successor of PROVNEO4J • Connectors for Neo4j implemented • ArangoDB planned • APIs in REST implemented • ZeroMQ & MQTT planned > PyData Seattle 2017 > A. Schreiber • Provenance for Reproducible Data Science > 06.07.2017DLR.de • Chart 31 from prov.model import ProvDocument from provdbconnector import ProvDb from provdbconnector.db_adapters.in_memory import SimpleInMemoryAdapter prov_api = ProvDb(adapter=SimpleInMemoryAdapter, auth_info=None) # create the prov document prov_document = ProvDocument() prov_document.add_namespace("ex", "http://example.com") prov_document.agent("ex:Bob") prov_document.activity("ex:Alice") prov_document.association("ex:Alice", "ex:Bob") document_id = prov_api.save_document(prov_document) print(prov_api.get_document_as_provn(document_id))
  • 32. Provenance Instrumentation of TensorFlow Provenance of TensorFlow workflows • Tensor  PROV Entity • Operations  PROV Activity Example: MNIST with 400 training iterations • 64581 database nodes • 33549 Entities • 31032 Activities > PyData Seattle 2017 > A. Schreiber • Provenance for Reproducible Data Science > 06.07.2017DLR.de • Chart 32
  • 33. Example Query • Shortest paths from all tensors in 400. iteration to init operation > PyData Seattle 2017 > A. Schreiber • Provenance for Reproducible Data Science > 06.07.2017DLR.de • Chart 33 MATCH path=allShortestPaths((root)<-[*]-(n)) WHERE root.`tf:type`="tf:Session_init" and n.`tf:name` =~ ".*_400" RETURN path
  • 34. NOWORKFLOW – Provenance of Scripts https://github.com/gems-uff/noworkflow > PyData Seattle 2017 > A. Schreiber • Provenance for Reproducible Data Science > 06.07.2017DLR.de • Chart 34 Project experiment.py p12.dat p13.dat precipitation.py p14.dat out.png $ now run -e Tracker experiment.py
  • 35. GIT2PROV http://git2prov.org • Generate PROV documents from git repositories > PyData Seattle 2017 > A. Schreiber • Provenance for Reproducible Data Science > 06.07.2017DLR.de • Chart 35
  • 36. GIT2PROV Example Output https://provenance.ecs.soton.ac.uk/store/documents/116377/ > PyData Seattle 2017 > A. Schreiber • Provenance for Reproducible Data Science > 06.07.2017DLR.de • Chart 36
  • 37. Provenance Visualization Visualization of Provenance is an ongoing research topic • Especially, for non-experts (“Provenance for people”) • Example: PROV COMICS > PyData Seattle 2017 > A. Schreiber • Provenance for Reproducible Data Science > 06.07.2017DLR.de • Chart 37
  • 38. Key Messages and Summary Recording the Provenance of Data Science workflows is important • to understand where data came from • to reproduce data processing steps or whole workflows Use a standard for Provenance • W3C standard PROV • Mapping to (graph) databases, allows easy querying • A standard allow interoperability and comparison Recording Provenance is not hard • APIs for Python • Tools > PyData Seattle 2017 > A. Schreiber • Provenance for Reproducible Data Science > 06.07.2017DLR.de • Chart 38 Activity Entity Agent wasGeneratedBy used wasDerivedFrom wasAttributedTo wasAssociatedWith
  • 39. > PyData Seattle 2017 > A. Schreiber • Provenance for Reproducible Data Science > 06.07.2017DLR.de • Chart 39 Thank You! Questions? Andreas.Schreiber@dlr.de www.DLR.de/sc | @onyame