Graph applications were once considered “exotic” and expensive. Until recently, few software engineers had much experience putting graphs to work. However, the use cases are now becoming more commonplace.
This talk explores a practical use case, one which addresses key issues of data governance and reproducible research, and depends on sophisticated use of graph technology.
Consider: some academic disciplines such as astronomy enjoy a wealth of data — mostly open data. Popular machine learning algorithms, open source Python libraries, and distributed systems all owe much to those disciplines and their history of big data.
Other disciplines require strong guarantees for privacy and security. Datasets used in social science research involve confidential details about human subjects: medical histories, wages, home addresses for family members, police records, etc.
Those cannot be shared openly, which impedes researchers from learning about related work by others. Reproducibility of research and the pace of science in general are limited. Nonetheless, social science research is vital for civil governance, especially for evidence-based policymaking (US federal law since 2018).
Even when data may be too sensitive to share openly, often the metadata can be shared. Constructing knowledge graphs of metadata about datasets, along with metadata about authors, their published research, methods used, data providers, data stewards, and so on, provides an effective means to tackle hard problems in data governance.
Knowledge graph work supports use cases such as entity linking, discovery and recommendations, axioms for inferring compliance, etc. This talk reviews the Rich Context AI competition and the related ADRF framework now used by more than 15 federal agencies in the US.
We’ll explore knowledge graph use cases, use of open standards and open source, and how this enhances reproducible research. Social science research for the public sector has much in common with data use in industry.
Issues of privacy, security, and compliance overlap, pointing toward what will be required of banks, media channels, etc., and what technologies apply. We’ll look at comparable work emerging in other parts of industry: open source projects, open standards emerging, and in particular a new set of features in Project Jupyter that support knowledge graphs about data governance.
3. Personal Background
• applied math, machine learning, distributed systems
• R&D for neural networks, incl. hardware (1986-1997)
• “guinea pig” for early cloud (2005-ff)
• led data teams in industry
• assisted popular open source projects: Spark, Jupyter, etc.
• development focus on natural language plus adjacent knowledge graph use cases
• since 2018, increasingly working at the intersection of public sector + enterprise + open source
4. Motivations
Not all that long ago, graph applications were considered
exotic and expensive.
Until recently, few software engineers had much experience
putting graphs to work; however, those use cases have now
become much more commonplace.
This talk explores a practical use case, one that addresses
key issues of data governance and reproducible research,
and depends on sophisticated use of graph technology.
First, some perspectives and industry analysis…
6. Perspectives
• the ubiquity of linked data
• the tyranny of “thinking relational”
• the primacy of working with graphs
(and their math analog, tensors)
• nouns vs. verbs vs. adjectives
(extreme nominalization)
• evolution of hardware, cloud,
and cluster topologies
• the power of graph embeddings
7. Historical Context
“Data Science: Past and Future”
Rev 2 (2019-05-24) slides
“What is Data Science?”
IBM Data Science Community (2019-03-04)
Just Enough Math
O’Reilly Media (2014)
• John Tukey: data analytics as an intrinsically empirical
and interdisciplinary field (1962)
• most popular data frameworks leveraged some graph
processing, albeit obscured, ad-hoc, clumsy…
• they did well, given the hardware available at the time
8. Beauty in sparsity…
SuiteSparse Matrix Collection:
a widely used set of sparse matrix
benchmarks collected from a wide
range of applications
sparse.tamu.edu/
…for when you really, really need
some interesting graph data
9. Theme 1: Stuffing graphs into matrices
algebraic graph theory allowed reuse of linear algebra implementations
For example, an undirected graph on vertices u, v, w, x with edges (u,v), (u,x), (v,w), (v,x), (w,x) has the adjacency matrix:

      u v w x
  u [ 0 1 0 1 ]
  v [ 1 0 1 1 ]
  w [ 0 1 0 1 ]
  x [ 1 1 1 0 ]
• e.g., transform graph to an adjacency matrix
• most will be relatively sparse
• use LINPACK, BLAS, or libraries built atop
• much to leverage: SVD, power method, QR decomp, etc.
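As a minimal sketch of that leverage (using NumPy/SciPy on the toy graph above; library choice is an assumption, not part of the talk), a graph becomes a sparse adjacency matrix and standard linear algebra does the rest:

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import eigsh

# sparse adjacency matrix for the toy graph on vertices u, v, w, x
# edges: (u,v), (u,x), (v,w), (v,x), (w,x), stored symmetrically
rows = [0, 0, 1, 1, 1, 2, 2, 3, 3, 3]
cols = [1, 3, 0, 2, 3, 1, 3, 0, 1, 2]
A = csr_matrix((np.ones(len(rows)), (rows, cols)), shape=(4, 4))

# power method: repeatedly apply A to estimate the leading eigenvector,
# the basis of eigenvector centrality (and, with damping, PageRank)
x = np.ones(4) / 4.0
for _ in range(100):
    x = A @ x
    x = x / np.linalg.norm(x)

# compare against a direct sparse eigensolver
vals, vecs = eigsh(A, k=1, which="LA")
print(x)                   # power-method estimate
print(np.abs(vecs[:, 0]))  # matches up to sign
```

The same sparse matrix can feed SVD, QR, and the other decompositions mentioned above without any graph-specific machinery.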
10. Theme 1: Stuffing graphs into matrices
for many real-world problems, the data are essentially graphs
1. real-world data
2. graph theory for representation
3. convert to sparse matrix for production
4. cost-effective parallel processing at scale
ergo: leverage low-dimensional structure in high-dimensional data
11. N Dims good, 2 Dims baa-d
However, complex graphs cannot be represented
as 2D matrices without serious information loss.
Ideally, tensors would be a better representation
to use for linear algebra libraries.
While tensor decomposition is a hard problem,
the general class of problems became much
more interesting after 2012…
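As a hedged illustration of why 2D loses information (the entities and relation types here are invented for the example), a labeled graph with multiple edge types can be kept as a 3-way adjacency tensor, one 2D slice per relation:

```python
import numpy as np

# hypothetical toy knowledge graph: 4 entities, 2 relation types
entities = ["paper", "dataset", "author", "agency"]
relations = ["cites", "uses"]

# T[r, i, j] == 1 means relation r holds from entity i to entity j
T = np.zeros((len(relations), len(entities), len(entities)))
T[0, 0, 1] = 1  # paper --cites--> dataset  (hypothetical edge)
T[1, 0, 1] = 1  # paper --uses--> dataset
T[1, 2, 1] = 1  # author --uses--> dataset

# flattening to a single 2D adjacency matrix (logical OR over relations)
# discards the edge labels -- the "serious information loss" above
A = (T.sum(axis=0) > 0).astype(int)
```

Tensor decompositions operate on T directly, keeping the relation dimension intact.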
12. N Dims good, 2 Dims baa-d
“The real problem is that programmers have spent far too
much time worrying about efficiency in the wrong place
and at the wrong times; premature optimization is the
root of all evil (or at least most of it) in programming.”
Don Knuth
13. Theme 2: Nouns, Verbs, Adjectives
Tracing back to the origins of relational databases,
Edgar Codd was furious about how badly SQL and RDBMS
had misinterpreted his mathematical modeling of relations.
Years of EDW practice reinforced an extreme nominalization,
with much of the data representation reduced to
dimensions, facts, indexes
14. Theme 2: Nouns, Verbs, Adjectives
a carry-over of extreme nominalization into graph DBs also
over-emphasizes the role of nodes and centrality for adjusting
the granularity of graph representations:
• discounts the importance of relations
• “mostly nouns, a few verbs, some adjectives”
• serious information loss
IMO: graph DB frameworks tend to err in this aspect,
both in terms of representation and algorithm support.
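One way to keep relations first-class (a hedged illustration in NetworkX, not a claim about any particular graph DB) is to reify each relation as its own node, so verbs can carry attributes rather than being mere labels:

```python
import networkx as nx

g = nx.DiGraph()

# edge-only modeling: the verb "funded" is just a label, and cannot
# itself be qualified, dated, or referenced by other statements
g.add_edge("Sloan", "ADRF", label="funded")

# reified modeling: the relation becomes a node, so "adjectives"
# attach to the verb itself (attribute values here are hypothetical)
g.add_node("funding-001", kind="funded", year=2018)
g.add_edge("Sloan", "funding-001", role="source")
g.add_edge("funding-001", "ADRF", role="target")
```

The second form trades some query convenience for a representation where relations are no longer second-class citizens.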
15. Part of a long-term narrative arc in IT…
• arguably, circa 2001 was the heyday of DW+BI – later
acting as an “embedded institution” w.r.t. data science
• Agile Manifesto became another “embedded institution”
• a generation of developers equated “database” with “relational”,
with a belief that legibility of systems == legibility of the data
• even so, first-movers collectively made a sudden turn
toward NoSQL, partly in reaction to RDBMS pricing
• see also:
“Statistical Modeling: The Two Cultures”
Leo Breiman, UC Berkeley (2001)
16. Adjusting data resolution in graphs
In contrast, consider:
“Extracting the multiscale backbone of complex weighted networks”
M. Ángeles Serrano, Marián Boguñá, Alessandro Vespignani
PNAS (2009-04-21)
Filtering large noisy graphs based on both
nodes and edges can be useful for automated
approaches in knowledge graph construction,
see: github.com/DerwenAI/disparity_filter
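A minimal sketch of that backbone extraction (a simplified take on Serrano et al., not the DerwenAI implementation): for each node, test whether an edge's normalized weight is statistically significant against a uniform null model, and keep only significant edges.

```python
import networkx as nx

def disparity_filter(g, alpha=0.05):
    """Return a multiscale backbone of a weighted, undirected graph."""
    backbone = nx.Graph()
    for u in g.nodes():
        k = g.degree(u)
        if k < 2:
            continue  # the null model needs degree >= 2
        strength = sum(d["weight"] for _, _, d in g.edges(u, data=True))
        for _, v, d in g.edges(u, data=True):
            p = d["weight"] / strength
            # p-value under the null model: alpha_ij = (1 - p)^(k - 1)
            if (1.0 - p) ** (k - 1) < alpha:
                backbone.add_edge(u, v, weight=d["weight"])
    return backbone

# toy example: a hub with one strong edge and many weak ones
g = nx.Graph()
g.add_edge("hub", "signal", weight=100.0)
for i in range(10):
    g.add_edge("hub", f"noise{i}", weight=1.0)

bb = disparity_filter(g, alpha=0.05)
print(bb.edges())  # only the strong ("hub", "signal") edge survives
```

Unlike a global weight threshold, this keeps locally significant edges at every scale, which is what makes it useful for denoising automatically constructed knowledge graphs.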
17. Theme 3: Hardware in perspective
An emerging trend disrupts the past 15-20 years
of software engineering practice:
hardware > software > process
Hardware is now evolving more rapidly than software,
which is evolving more rapidly than effective process
Moore’s Law is all but dead, although ironically
many inefficiencies had been based on it
See also: Pete Warden (2018) regarding
TensorFlow.js on low-power devices
18. Theme 3: Evolution of cloud patterns
UC Berkeley published a 2009 report
about early use cases for cloud
computing, which foresaw the shape of
industry deployments over much of the
next decade, and led directly to Apache
Mesos and Apache Spark
It’s fascinating to study the contrasts
between that 2009 report and its 2019
follow-up.
(minor footnote: vimeo.com/3616394)
19. Theme 3: Evolution of cloud patterns
Early cloud was intentionally “dumbed down”
to resemble popular virtualization software –
recognizable by IT staff – to support migration.
That approach is no longer needed.
Also, the physics + economics of cloud use
tend to imply fewer “framework” layers.
More contemporary patterns will force a
restructuring – for efficiency and security –
i.e., decoupling computation and storage.
20. Theme 3: Cluster topologies, by generation
Opinion: one problem with software/hardware interface for distributed
systems is that it’s taken decades to prioritize the need for handling
graphs/tensors directly within popular, accessible open source libraries,
without having some commercial database vendor intermediate.
[diagram: cluster topologies, by generation: 1990s, mid-2000s, current]
21. Theme 3: Cluster topologies, by generation
[diagram: cluster topologies, by generation: 1990s, mid-2000s, current; NB: graph]
see also: Jeff Dean (2013)
youtu.be/S9twUcX1Zp0
23. “Two Cultures” for AI
A more useful distinction:
• ML is about the tools and technologies
• AI is about use case impact on social systems
24. Industry surveys for AI and Cloud adoption
• “Three surveys of AI adoption reveal key advice
from more mature practices”
Ben Lorica, Paco Nathan; O’Reilly Media (2019-02-20)
• Episode 7, Domino: surveying “ABC” adoption in enterprise
(2019-03-03)
27. Trends: Knowledge Graphs
Mature practices show more
interest in use of knowledge
graphs than firms which are
still evaluating ML use cases.
28. Trends: an accelerating gap in AI funding
Note: firms with early advantage
are investing more, moving still
further away from the pack.
29. Overview of Data Governance
Paco Nathan @pacoid
Overview of Data Governance
derwen.ai/s/6fqt
[architecture diagram: “data gov” touchpoints appear throughout a typical stack: multiple clouds with security and compliance layers, databases, mobile devices, edge caches, web servers, DW + business analytics, durable stores, logs, models, data science workflows, cluster compute, external data sources, edge inference, and streaming data]
we noted a resurgence in data
governance – this report examines
key themes, vendors, issues, etc.
30. Unpacking AutoML
derwen.ai/s/yvkg
we noted an uptick in adoption for
a third aspect, co-evolving along
with DG and MLOps
[diagram: AutoML pipeline atop a data platform and use cases: data prep, feature engineering, feature selection, model selection, hyperparameter optimization, train models, evaluate results, integrate/deploy; with meta-learning and auto-scaling alongside]
31. Data Gov dovetails with MLOps and AutoML
[diagram: the same AutoML pipeline, annotated: data gov trends augment AutoML; data gov practices follow MLOps]
32. Emerging category: watch the “AI Natives”
Projects (mostly OSS) that leverage a knowledge graph
of metadata about datasets and their usage:
• Amundsen @ Lyft
data discovery and metadata
• Databook @ Uber
manage metadata about datasets (pending OSS)
• Marquez @ WeWork, Stitch Fix
collect, aggregate, visualize metadata
• DataHub @ LinkedIn
data discovery and lineage
• Metacat @ Netflix
data discovery, metadata service
• Dataportal @ Airbnb
integrated data-space (not OSS)
34. Administrative Data Research Facility
Coleridge Initiative
Julia Lane, et al. NYU Wagner
• FedRAMP-compliant ADRF framework on AWS GovCloud:
“public agency capacity to accelerate the effective use of
new datasets”
• for research projects using cross-agency sensitive data,
in US and EU (and UK) – now in use by 15+ agencies
• cited as the first federal example of Secure Access to
Confidential Data in the final report of the Commission
on Evidence-Based Policymaking
• augments Data Stewardship practices; collaboration
with Project Jupyter on the related data gov features
• funding by Schmidt Futures, Sloan, Overdeck
35. ADRF and Rich Context
Coleridge Initiative
Julia Lane, et al. NYU Wagner
• Rich Context: knowledge graph of metadata about datasets,
used for entity linking, link prediction, recommendations, etc.
• benefits: agencies, researchers, publishers, data stewards,
data providers – see white paper
• ongoing ML competition for linking research publications
with dataset attribution (first comp. won by Allen AI)
• see “Human-in-the-loop AI for scholarly infrastructure”
• upcoming book:
Rich Search and Discovery for Research Datasets: Building
the next generation of scholarly infrastructure
36. AI for Scholarly Infrastructure
[diagram: Rich Context overall scope, in three parts: (1) models infer links over a corpus of research pubs; (2) a leaderboard competition evals results, producing inferred linked data; (3) publisher use cases with HITL via RePEc, etc., where authors accept/reject links]
• collaboration with SAGE Pub, Digital Science,
RePEc, etc.; partnering with Bundesbank (EU)
• knowledge graph vocabulary integrates W3C
metadata standards: DCAT, PAV, DCMI, CITO,
FaBiO, FOAF, etc.
• data as a strategic asset: knowledge graph
produces an open corpus for the leaderboard
competition
• human-in-the-loop AI used to infer metadata
then confirm with authors via RePEc, etc.
• adjacent work: graph embedding, meta-learning,
persistent identifiers, reproducible research
39. Related work at Project Jupyter
Make datasets and projects top-level constructs,
support metadata exchange and privacy-preserving
telemetry from notebook usage (due Oct 2019):
• JupyterLab Commenting and real-time collab
similar to Google Docs
• JupyterLab Data Explorer: register datasets
within research projects
• JupyterLab Metadata Explorer: browse metadata
descriptions, get recommendations through
knowledge graph inference (via extension)
• Data Registry (original proposal)
• Telemetry (privacy-preserving, reports usage)
41. Active Learning as a data strategy
[diagram: active learning loop connecting Human Experts, ML Models, Customers, and Organizational Learning: experts decide about edge cases, providing examples; models focus experts (e.g., weak supervision); models act on decisions when possible, and explore uncertainty when needed; experts gain insights via model explanations, and learn through customer interactions; customers request sales, marketing, service, training; examples and actions flow into customer use cases]
derwen.ai/s/d8b7
teams of people + machines, leveraging the complementary strengths of both
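A minimal sketch of one step in that loop, uncertainty sampling (scikit-learn and the toy data are assumptions, not prescribed by the talk): the model routes its least-confident predictions to human experts for labeling.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# synthetic binary classification data standing in for a real corpus
X, y = make_classification(n_samples=500, n_features=5, random_state=42)

# start with a small labeled pool; the rest is treated as unlabeled
labeled = list(range(20))
unlabeled = list(range(20, 500))

model = LogisticRegression().fit(X[labeled], y[labeled])

# uncertainty = how close the positive-class probability is to 0.5
proba = model.predict_proba(X[unlabeled])[:, 1]
uncertainty = -np.abs(proba - 0.5)

# route the 10 most uncertain examples to human experts for labeling
query = [unlabeled[i] for i in np.argsort(uncertainty)[-10:]]
print(f"asking experts to label examples: {query}")
```

The expert labels then feed the next training round, so models explore uncertainty while experts decide the edge cases.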
42. Parting thought
In many ways, the industry is at a point with graph data,
particularly for knowledge graphs of metadata about
dataset usage, that resembles conditions immediately
before “Web 2.0” became big news.
The emerging category of “AI natives” projects
mentioned earlier could be parlayed into data utilities
more flexible than the AI services which the current
hyperscalers are fielding.
Watch this space.
43. Just Enough Math Rich Context Hylbert-Speys Themes + Confs
per Pacoid
publications, interviews, conference summaries…
https://derwen.ai/paco
@pacoid
Rev conf