Thinking about the need for deeper provenance for knowledge graphs but also using knowledge graphs to enrich provenance. Presented at https://seminariomirianandres.unirioja.es/sw19/
Thoughts on Knowledge Graphs & Deeper Provenance
1. Faculty of Science
Paul Groth | @pgroth | pgroth.com
Oct 29, 2019
Data Provenance Staff Week - Universidad de La Rioja
Thoughts on Knowledge Graphs & Deeper Provenance
2. OUTLINE
1. The problem
2. Knowledge Graphs
3. Transparency through data provenance
4. Assessment for human-machine teams
3. The making of data is important
“There is a major, largely unrealised potential to
merge and integrate the data from different
disciplines of science in order to reveal deep
patterns in the multi-facetted complexity that
underlies most of the domains of application that
are intrinsic to the major global challenges that
confront humanity.” – Grand Challenge for
Science
http://dataintegration.codata.org
Committee on Data of the
International Council for Science
(CODATA)
4. Software 2.0
https://link.medium.com/srrJhEl5bS
“In the 2.0 stack, the programming is done by
accumulating, massaging and cleaning datasets”
Figure 8 Data Science Surveys 2017 & 2018
The making of data is hard
6. NOT JUST DATA SCIENCE
Gregory, K., Groth, P., Cousijn, H., Scharnhorst, A., & Wyatt, S. (2019).
Searching Data: A Review of Observational Data Retrieval
Practices. Journal of the Association for Information Science and
Technology. doi:10.1002/asi.24165
Some observations from @gregory_km survey & interviews:
• The needs and behaviors of specific user groups (e.g. early
career researchers, policy makers, students) are not well
documented.
• Participants require details about data collection and handling.
• Reconstructing data tables from journal articles, using
general search engines, and making direct data requests are
common.
K Gregory, H Cousijn, P Groth, A Scharnhorst, S Wyatt (2018).
Understanding Data Retrieval Practices: A Social Informatics Perspective.
arXiv preprint arXiv:1801.04971
7. OUTLINE
1. The problem
2. Knowledge Graphs
3. Transparency through data provenance
4. Assessment for human-machine teams
8. Knowledge Graphs for Integration
Knowledge graphs are "graph structured knowledge bases (KBs) which store factual
information in form of relationships between entities" (Nickel et al. 2015).
Nickel, M., Murphy, K., Tresp, V., & Gabrilovich, E. (2015). A Review
of Relational Machine Learning for Knowledge Graphs, 1–18.
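To make the definition concrete, here is a minimal sketch of a knowledge graph as a set of (head, relation, tail) triples with a simple neighbourhood lookup. The entities and relations are invented for illustration:

```python
# A knowledge graph as (head, relation, tail) triples; entity and
# relation names here are made up for illustration.
kg = {
    ("Spock", "characterIn", "StarTrek"),
    ("StarTrek", "genre", "ScienceFiction"),
    ("LeonardNimoy", "played", "Spock"),
}

def objects(graph, head, relation):
    """Return all tail entities linked from `head` via `relation`."""
    return {t for (h, r, t) in graph if h == head and r == relation}

print(objects(kg, "Spock", "characterIn"))  # {'StarTrek'}
```

Real knowledge graphs add schemas, identifiers, and qualifiers on top, but the factual core is exactly this relational structure.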
10. Frank van Harmelen, Adoption of Knowledge Graphs: https://www.slideshare.net/Frank.van.Harmelen/adoption-of-knowledge-graphs-mid-2019
12. LARGE SCALE PIPELINES
[Pipeline diagram: content and a taxonomy feed triple extraction (open information extraction, entity resolution, concept resolution), yielding surface form and structured relations; matrix construction produces a universal schema, matrix factorization learns a factorization model, and matrix completion predicts relations that are curated into the knowledge graph. Scale: 14M SD articles, 475 M triples, 3.3 million surface form relations, 49 M structured relations, ~15k -> 1M entries.]
Paul Groth, Sujit Pal, Darin McBeath, Brad Allen, Ron Daniel
“Applying Universal Schemas for Domain Specific Ontology Expansion”
5th Workshop on Automated Knowledge Base Construction (AKBC) 2016
Michael Lauruhn, and Paul Groth. "Sources of Change for Modern
Knowledge Organization Systems." Knowledge Organization 43, no. 8
(2016).
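The universal-schema step can be sketched in miniature: put entity pairs in rows, relations (surface forms next to structured relations) in columns, factorize the observed cells, and use the learned factors to complete unobserved cells. The names, numbers, and the toy SGD trainer below are illustrative assumptions, not the pipeline's actual implementation:

```python
import math
import random

# Toy universal schema: rows are entity pairs, columns are relations
# (surface forms alongside structured relations).  We factorize the
# observed cells and predict an unobserved one via matrix completion.
pairs = ["(aspirin, pain)", "(ibuprofen, pain)", "(water, pain)"]
rels = ["X treats Y", "X relieves Y", "treats(X,Y)"]
obs = {(0, 0): 1, (0, 1): 1, (0, 2): 1,  # aspirin: all cells observed
       (1, 0): 1, (1, 1): 1,             # ibuprofen: structured cell unknown
       (2, 0): 0, (2, 1): 0, (2, 2): 0}  # water: explicit negatives

random.seed(0)
K = 2  # latent dimension
P = [[random.gauss(0, 0.1) for _ in range(K)] for _ in pairs]
R = [[random.gauss(0, 0.1) for _ in range(K)] for _ in rels]

def score(i, j):
    """Probability that relation j holds for entity pair i."""
    s = sum(P[i][k] * R[j][k] for k in range(K))
    return 1.0 / (1.0 + math.exp(-s))  # logistic link

# Plain SGD on logistic loss over the observed cells only.
for _ in range(2000):
    for (i, j), y in obs.items():
        g = score(i, j) - y
        for k in range(K):
            pk, rk = P[i][k], R[j][k]
            P[i][k] -= 0.1 * g * rk
            R[j][k] -= 0.1 * g * pk

# Matrix completion: does treats(X,Y) hold for (ibuprofen, pain)?
print(score(1, 2))
```

Because (ibuprofen, pain) shares its observed surface-form columns with (aspirin, pain), the completed cell should come out well above 0.5, which is the intuition behind predicting structured relations from surface forms.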
13. Bottlenecks
1. Manual effort
2. Difficulty in creating flexible, reusable workflows
3. Lack of transparency
Paul Groth, "The Knowledge-Remixing Bottleneck," IEEE Intelligent Systems, vol. 28, no. 5, pp. 44-48, Sept.-Oct. 2013. doi: 10.1109/MIS.2013.138
Paul Groth, "Transparency and Reliability in the Data Supply Chain," IEEE Internet Computing, vol. 17, no. 2, pp. 69-71, March-April 2013. doi: 10.1109/MIC.2013.41
14. OUTLINE
1. The problem
2. Knowledge Graphs
3. Transparency through data provenance
4. Assessment for human-machine teams
16. PROVENANCE
• Where and how was this data or document produced?
• Data Provenance is "a record that describes the people, institutions, entities, and activities involved in producing, influencing, or delivering a piece of data" – W3C Provenance Recommendation
• Central issues:
• Data workflows go beyond single systems
• How do you capture this information effectively?
• What functionality can the provenance support?
From: https://www.w3.org/TR/prov-primer/
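As a sketch of the PROV data model (borrowing the article/author flavour of the PROV primer, with invented identifiers), the core relations can be written down as plain triples and queried directly:

```python
# A tiny PROV-style record as plain triples (identifiers are invented).
# Entities: article, dataset; Activity: compose; Agent: derek.
prov = [
    ("ex:article", "prov:wasGeneratedBy",    "ex:compose"),
    ("ex:compose", "prov:used",              "ex:dataset"),
    ("ex:compose", "prov:wasAssociatedWith", "ex:derek"),
    ("ex:article", "prov:wasAttributedTo",   "ex:derek"),
    ("ex:article", "prov:wasDerivedFrom",    "ex:dataset"),
]

def who_produced(record, entity):
    """Follow wasAttributedTo: who is responsible for this data?"""
    return [o for (s, p, o) in record
            if s == entity and p == "prov:wasAttributedTo"]

print(who_produced(prov, "ex:article"))  # ['ex:derek']
```

In practice the same record would be serialized as PROV-O/RDF or PROV-N, but the queryable structure is this small set of typed relationships.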
17. DATA PROVENANCE INTEROPERABILITY
Moreau, Luc, Paul Groth, James Cheney, Timothy Lebo, and Simon Miles. "The
rationale of PROV." Web Semantics: Science, Services and Agents on the World Wide
Web 35 (2015): 235-257.
Luc Moreau and Paul Groth. "Provenance: an introduction to Prov." Synthesis Lectures
on the Semantic Web: Theory and Technology 3.4 (2013): 1-129.
Paul Groth, Yolanda Gil, James Cheney, and Simon Miles. "Requirements for
provenance on the web." International Journal of Digital Curation 7, no. 1 (2012): 39-56.
19.
• Select one of the activities in the PROV graph
• Entities and Activities are sized according to information flow
• Missing type information is automatically inferred
• Embed the generated visualisation in your own webpage
20. What to capture?
Simon Miles, Paul Groth, Steve Munroe, Luc Moreau. PrIMe: A methodology for developing provenance-aware applications. ACM Transactions on Software Engineering and Methodology, 20(3), 2011.
22. What if we missed something?
Disclosed provenance systems:
• Re-apply the methodology (e.g. PrIMe) and produce a new application version.
• Time consuming.
Observed provenance systems:
• Update the applied instrumentation.
• Instrumentation becomes progressively more intense.
Provenance is Post-Hoc
23. Re-execution
Common tactic in disclosed provenance:
• DB: Reenactment queries (Glavic ‘14)
• DistSys: Chimera (Foster ‘02), Hadoop (Logothetis ‘13), DistTape (Zhao ‘12)
• Workflows: Pegasus (Groth ‘09)
• PL: Slicing (Perera ‘12)
• Desktop: Excel (Asuncion ‘11)
Can we extend this idea to observed provenance systems?
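The record & replay idea behind extending re-execution can be sketched in a few lines: log every non-deterministic input during the live run, then feed the log back to reproduce the execution exactly, so that heavier instrumentation can be applied to the replay rather than the live run. All names here are invented:

```python
import random

def process(next_input):
    """Some computation driven by a non-deterministic input source."""
    total = 0
    for _ in range(3):
        total += next_input()
    return total

# Record phase: wrap the source so every value it yields is logged.
log = []
def recorded_source():
    value = random.randint(1, 100)  # the non-determinism
    log.append(value)
    return value

first_run = process(recorded_source)

# Replay phase: the log stands in for the source; no randomness left.
replay = iter(log)
second_run = process(lambda: next(replay))

assert first_run == second_run  # identical execution, now re-runnable
```

PANDA applies the same principle at the whole-system level: the "source" is every non-deterministic input to the virtual CPU and RAM.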
24. Faster Capture: Record & Replay
PROV 2R: Practical Provenance Analysis of Unstructured Processes
M Stamatogiannakis, E Athanasopoulos, H Bos, P Groth (2017)
ACM Transactions on Internet Technology (TOIT) 17 (4), 37
26. Prototype Implementation
• PANDA: an open-source Platform for Architecture-Neutral Dynamic Analysis (Dolan-Gavitt '14)
• Based on the QEMU virtualization platform.
27. Prototype Implementation (2/3)
• PANDA logs self-contained execution traces:
– An initial RAM snapshot.
– Non-deterministic inputs.
• Logging happens at virtual CPU I/O ports.
– Virtual device state is not logged, so we can't "go live".
[Diagram: PANDA wraps the virtual CPU and RAM; inputs, interrupts, and DMA are recorded. A PANDA execution trace = initial RAM snapshot + non-determinism log.]
28. Prototype Implementation (3/3)
• Analysis plugins:
– Read-only access to the VM state.
– Invoked per instruction, memory access, context switch, etc.
– Can be combined to implement complex functionality.
– OSI Linux, PROV-Tracer, ProcStrMatch, taint tracking
• Debian Linux guest.
• Provenance stored as PROV/RDF triples, queried with SPARQL.
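Once the provenance is in a triple store, typical queries walk derivation chains. In SPARQL 1.1 this is the property path `?x prov:wasDerivedFrom+ ?y`; the sketch below (with invented file names) is a plain-Python equivalent that computes everything a file was transitively derived from:

```python
# Plain-Python equivalent of the SPARQL property path
#   ?x prov:wasDerivedFrom+ ?y
# over a small provenance record (file names are invented).
triples = [
    ("report.pdf", "prov:wasDerivedFrom", "report.tex"),
    ("report.tex", "prov:wasDerivedFrom", "results.csv"),
    ("results.csv", "prov:wasDerivedFrom", "raw.log"),
]

def ancestors(graph, node):
    """All entities `node` was transitively derived from."""
    out, frontier = set(), {node}
    while frontier:
        step = {t for (h, r, t) in graph
                if h in frontier and r == "prov:wasDerivedFrom"}
        frontier = step - out
        out |= step
    return out

print(ancestors(triples, "report.pdf"))
# {'report.tex', 'results.csv', 'raw.log'}
```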
[Diagram: the PANDA execution trace is replayed with CPU/RAM state through analysis plugins (Plugin A, Plugin B, Plugin C), producing provenance in a triple store.]
[PROV core model diagram: Entity, Activity, and Agent nodes linked by used, wasGeneratedBy, wasDerivedFrom, wasInformedBy, wasAssociatedWith, wasAttributedTo, and actedOnBehalfOf; Activities carry startedAtTime/endedAtTime (xsd:dateTime).]
29. Enabling more detail
• Observed provenance systems treat programs as black-boxes
• Can't tell if an input file was actually used
• Can't quantify the influence of an input on an output
30. DATA TRACKER
• Captures high-fidelity provenance using taint tracking
• Key building blocks:
• libdft (Kemerlis '12) ➞ reusable taint-tracking framework
• Intel Pin (Luk '05) ➞ dynamic instrumentation framework
• Does not require modification of applications
• Does not require knowledge of application semantics
Stamatogiannakis, Manolis, Paul Groth, and Herbert Bos. "Looking inside the black-box:
capturing data provenance using dynamic instrumentation." In International Provenance and
Annotation Workshop (IPAW’14), pp. 155-167. 2014.
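The propagation rule at the heart of taint tracking is simple to illustrate: every value carries the set of input sources it depends on, and each operation unions the labels of its operands. Real systems like libdft/DataTracker do this per byte at the instruction level; the Python sketch below (with invented file names) only shows the rule, and in particular how it answers "was this input actually used?":

```python
# Toy taint propagation: values carry label sets; operations union
# the labels of their operands.  All names are illustrative.
class Tainted:
    def __init__(self, value, labels):
        self.value = value
        self.labels = frozenset(labels)

    def __add__(self, other):
        # The result depends on both operands' sources.
        return Tainted(self.value + other.value,
                       self.labels | other.labels)

a = Tainted(2, {"input_a.txt"})
b = Tainted(3, {"input_b.txt"})
c = Tainted(10, {"input_c.txt"})  # opened but never used

out = a + b
print(out.value, sorted(out.labels))
# 5 ['input_a.txt', 'input_b.txt']
```

Note that `input_c.txt` never appears in the output's labels: unlike black-box observation, taint tracking distinguishes files that were merely opened from files that actually influenced the result.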
31. Systems provenance
• Adam Bates, Dave Tian, Kevin R.B. Butler, and Thomas
Moyer. Trustworthy Whole-System Provenance for the
Linux Kernel. USENIX Security Symposium (SECURITY),
August 2015.
32. Is this enough?
• We can capture ever more provenance
• Still the question: what to capture?
• But is that enough?
33. OUTLINE
1. The problem
2. Knowledge Graphs
3. Transparency through data provenance
4. Assessment for human-machine teams
34. Peer Review At Scale – People
2016:
• 1.5 million papers submitted
• Published 420,000 articles
• 2,500 journals
• 20,000 "level 1" editors
• 60,000 editors
http://senseaboutscience.org/wp-content/uploads/2016/09/peer-review-the-nuts-and-bolts.pdf
36. Machines see things differently than people
From: Alain, G. and Bengio, Y. (2016). Understanding intermediate layers using linear classifier probes. arXiv:1610.01644v1.
Thanks Brad Allen
38. Models reuse data
From: Abu-El-Haija, S., Kothari, N., Lee, J., Natsev, P., Toderici, G., Varadarajan, B.
and Vijayanarasimhan, S. YouTube-8M: a large-scale video classification benchmark. arXiv:1609.08675.
40. People read what machines say
Gil, Y. and Garijo, D. Towards Automating Data Narratives. In Proceedings of the Twenty-Second ACM International Conference on Intelligent User Interfaces (IUI-17), Limassol, Cyprus, 2017.
41. Machines, People, Organizations
Lauruhn, Michael, and Paul Groth. "Sources of Change for Modern Knowledge Organization Systems." Knowledge Organization 43, no. 8 (2016).
42. Machines, People, Organizations
Groth, Paul, "The Knowledge-Remixing Bottleneck," IEEE Intelligent Systems, vol. 28, no. 5, pp. 44-48, Sept.-Oct. 2013. doi: 10.1109/MIS.2013.138
• 10 different extractors
• E.g. the mapping-based infobox extractor
• Uses a hand-built ontology based on the 350 most commonly used English language infoboxes
• Integrates with YAGO
• YAGO relies on Wikipedia + WordNet
• Upper ontology from WordNet, then a mapping to Wikipedia categories based on frequencies
• WordNet is built by psycholinguists
44. All source assessment
• People are sources too – need modelling and assessment
• Must take into account the entire provenance history including assessment structures
• Propagation but also discounting and elevation are needed for computation of assessment
• Not just explanation, but decisions
Ceolin, D., Groth, P., Maccatrozzo, V., Fokkink, W., Hage, W.R. Van and Nottamkandath, A.
2016. Combining User Reputation and Provenance Analysis for Trust Assessment. Journal of
Data and Information Quality. 7, 1–2 (Jan. 2016), 1–28. DOI:https://doi.org/10.1145/2818382.
Ceolin, D., Groth, P. and Hage, W.R. Van 2010. Calculating the Trust of Event Descriptions
using Provenance. Proceedings Of The SWPM 2010, Workshop At The 9th International
Semantic Web Conference, ISWC-2010 (Nov. 2010).
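One way to picture combining reputation and provenance: propagate trust from the responsible agents at the leaves of the provenance graph up through derivation steps, discounting a little per hop. The entities, reputations, and the 0.9 discount below are illustrative assumptions, not the published model:

```python
# Sketch: trust of an entity = discounted combination of the trust of
# the entities it was derived from; leaves take the reputation of the
# agent they are attributed to.  All names and numbers are invented.
reputation = {"sensor_a": 0.9, "crowd_worker": 0.6}
derived_from = {"fused_reading": ["sensor_reading", "annotation"],
                "sensor_reading": [],
                "annotation": []}
attributed_to = {"sensor_reading": "sensor_a",
                 "annotation": "crowd_worker"}
DISCOUNT = 0.9  # each derivation step erodes confidence slightly

def trust(entity):
    sources = derived_from[entity]
    if not sources:  # leaf: fall back to the agent's reputation
        return reputation[attributed_to[entity]]
    # Propagation with discounting: mean of inputs, scaled per hop.
    return DISCOUNT * sum(trust(s) for s in sources) / len(sources)

print(round(trust("fused_reading"), 3))
```

Elevation (e.g. boosting trust when independent sources agree) would add a second rule on top of this propagation; the point is that the computation runs over the provenance structure itself.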
45. Giant Global Provenance Graph?
Martin Fenner and Amir Aryani “Introducing the PID Graph” March 28, 2019
https://doi.org/10.5438/jwvf-8a66
P. Groth, H. Cousijn, T. Clark & C. Goble. FAIR data reuse – the path through data citation. Data
Intelligence 2(2020), 78–86. doi: 10.1162/dint_a_00030
46. Conclusion
• Data reuse through integration/munging/remixing is pervasive
• Knowledge graphs are common and complex
• Our information environments are heterogeneous, deep, intermixed and socially embedded
• Use provenance to help humans and machines perform assessments and make decisions
Contact:
Paul Groth | @pgroth | pgroth.com
A big problem for systems capturing provenance is deciding what to capture.
For disclosed provenance systems we can apply some methodology to decide what to capture.
Disclosed provenance methods require knowledge of application semantics and modification of the application.
On the other hand, observed provenance methods usually have a high false-positive ratio.
Execution Capture: happens in real time
Instrumentation: applied on the captured trace to generate provenance information
Analysis: the provenance information is explored using existing tools (e.g. SPARQL queries)
Selection: a subset of the execution trace is selected – we start again with more intensive instrumentation
We implemented our methodology using PANDA.
PANDA is based on QEMU.
Input includes both executed instructions and data.
RAM snapshot + ND log are enough to accurately replay the whole execution.
The ND log consists of inputs to CPU/RAM; other device state is not logged, so we can replay but cannot "go live" (i.e. resume execution).
Note: Technically, plugins can modify VM state. However this will eventually crash the execution as the trace will be out of sync with the replay state.
Plugins are implemented as dynamic libraries.
We focus on the highlighted plugins in this presentation.
We built a tool based on taint tracking to capture provenance. Our tool is called DataTracker and has two key building blocks.
We don’t start with a full formal definition but formalize over time from usage