The #knowledgegraph--smart data that can describe your business and its domains--is now eating software. We won't be able to scale AI or other emerging tech without knowledge graphs, because those techs all require a transformed data foundation, large-scale integration, and shared data infrastructure.
Key to knowledge graphs are #semantics, #graphdatabase technology and a Tinker Toy-style approach to adding the missing verbs (which provide connections and context) back into your data. A knowledge graph foundation provides a means of contextualizing business domains, your content and other data, for #AI at scale.
This is from a talk I gave at the Data Centric Design for SMART DATA & CONTENT Enthusiasts meetup on July 31, 2019 at PwC Chicago. Thanks to Mary Yurkovic and Matt Turner for a very fun event!
4. PwC | Data-centric design and the knowledge graph
In the mirrorworld, everything will have a paired twin.
Kevin Kelly in Wired, Feb 12, 2019
June 2019
5. PwC | Data-centric design and the knowledge graph
What AI needs versus what it has
• What it needs: Contextualized, disambiguated, highly relevant and specific integrated data, flowing to the point of need
• What it has: Single batch datasets cleaned up to be good enough by data scientists, who spend 80% of their time on cleanup
• What it needs: Knowledge engineers and many bold Data Visionaries in addition to big D Data Scientists, data-centric architects, pipeline engineers and specialists in many new data niches
• What it has: A growing group of tool users versed only in probability theory, neural networks, Python and R, including small D data scientists, engineers and architects, plus scads of entrenched application-centric developers
[Figure: Finance, Operations and Marketing data as input to a neural network (input layer, hidden layer 1, hidden layer 2, output layer) that produces the output]
6. PwC | Data-centric design and the knowledge graph
Consider how long it took to build out the world’s oil &
gas infrastructure.
Now think about where we are with traditional data
management:
• How do we free ourselves from legacy IT?
• How do we build sharable digital twins?
• How do we scale a shared data infrastructure?
The mirrorworld poses a
massive global data
infrastructure challenge
7. PwC | Data-centric design and the knowledge graph
Why treating smart data as a strategic asset is so critical right now
Challenge of the 2020s: Feeding your AIs enough
relevant, quality data
• Emerging tech often gets adopted just in pockets.
• That’s particularly the case with AI.
• Retraining, hiring new people, or buying more tools isn’t enough.
• Many never figure out how to take advantage of important AI-enabling tech. They’ll just use it in ad hoc projects or subscribe to AI-enhanced apps.
• But the impact on decision making will be minimal without an industrial-scale approach to data and flow.
Opportunity of the 2020s:
Pipelines, distribution networks and
volumes of quality, contextualized
smart data flowing to the point of
need
The challenge we face is the same
as the oil and gas industry faced in
the 1920s:
• Collecting enough raw material
• Refining and enriching it
• Distributing it to the places that
need it most
• Creating enough supply to
generate massive demand and
drive down the cost of AI
9. PwC | Data-centric design and the knowledge graph
Three steps to understanding smart data
Step I: A logical, unified model in data at the data layer clears a path to actionable
understanding
[Figure: smart data for decision making mapped to data maturity levels]
Smart data for decision making (top to bottom): Understanding, Knowledge, Interpretation, Contextualization, Recognition, Data collection
Data maturity levels:
5: Unified model
3 to 4: Competency with knowledge graphs
1 to 2: Struggles with basic entity resolution
Actionable; enables IT rationalization and AI at scale
10. PwC | Data-centric design and the knowledge graph
Three steps to understanding smart data
Step II: Smart data is all about verbs as data
connectors
Ian
knows
Mary
The verbs are the sticks in your Tinker Toy kit—they’re
what you can use to simplify connections radically and
scale three very important things:
1. Disambiguating entities (distinguishing between
people, places or things)
2. Creating a rich, relevant context
3. Connecting different contexts
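The "Ian knows Mary" statement above can be sketched as a subject-verb-object triple. The following is a minimal illustration in Python, not code from the talk: the `Triple` type, the helper names and every fact other than "Ian knows Mary" are assumptions made up for the example.

```python
from collections import namedtuple

# A statement is a subject, a verb (predicate) and an object.
Triple = namedtuple("Triple", ["subject", "verb", "obj"])

# Hypothetical facts; only "Ian knows Mary" comes from the slide.
triples = [
    Triple("Ian", "knows", "Mary"),
    Triple("Ian", "worksAt", "AcmeCorp"),
    Triple("Mary", "livesIn", "Chicago"),
    Triple("AcmeCorp", "locatedIn", "Chicago"),
]


def context_of(entity, facts):
    """Return every (verb, other entity) pair that touches the entity."""
    outgoing = [(t.verb, t.obj) for t in facts if t.subject == entity]
    incoming = [("inverse:" + t.verb, t.subject) for t in facts if t.obj == entity]
    return outgoing + incoming


def shared_neighbors(a, b, facts):
    """Entities linked to both a and b: one simple way to connect two contexts."""
    near_a = {other for _, other in context_of(a, facts)}
    near_b = {other for _, other in context_of(b, facts)}
    return near_a & near_b
```

Calling `context_of("Ian", triples)` lists the verbs that give Ian his context, and `shared_neighbors("Ian", "Chicago", triples)` shows how the same verbs link a person's context to a place's context.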
Step III: Using verbs as connectors, companies can scale knowledge graphs and accelerate their AI efforts
11. PwC | Data-centric design and the knowledge graph
Largest changes in market cap by global company, cross industry, 2018
Rank | Company name | Location | Industry | Change in market cap 2009 – 2018 ($bn) | Market cap 2018 ($bn)
1 | Apple | United States | Technology | 757 | 851
2 | Amazon.Com | United States | Consumer Services | 670 | 701
3 | Alphabet | United States | Technology | 609 | 719
4 | Microsoft Corp | United States | Technology | 540 | 703
5 | Tencent Holdings | China | Technology | 483 | 496
6 | Facebook | United States | Technology | 383 [1] | 464
7 | Berkshire Hathaway | United States | Financials | 358 | 492
8 | Alibaba | China | Consumer Services | 302 [1] | 470
9 | JPMorgan Chase | United States | Financials | 275 | 375
10 | Bank of America | United States | Financials | 263 | 307
[1] Change in market cap from IPO date
[2] Market cap at IPO date
Source: Bloomberg and PwC analysis
Chart callouts: Known knowledge graph builders; Operator of Taobao and AliBot; KG builder; Known KG builders
• Other major tech, FS and pharma cos. are also working on cross-enterprise knowledge graphs
• Many have cross-enterprise knowledge graph ambitions, but most are focused on a single use case
• S&P does cross-enterprise data management using relational tech
The most value-creating companies in the world are using knowledge graphs
12. PwC | Data-centric design and the knowledge graph
Why traditional data management doesn’t scale
1. Relational databases don’t treat relationship
data as a first-class citizen
2. As a result, most companies have buried or are
missing the relationship data they need for
contextualization
3. Tables alone don’t help you dynamically model
your data or share the models
4. Managing large numbers of tables soon gets
unwieldy
5. Limiting your database resources to tabular
methods ensures you won’t take full advantage
of today’s compute, networking and storage
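To make the first two points concrete, here is a small Python sketch under stated assumptions: the `employees` rows, the verb names and both helper functions are hypothetical. It shows a relationship buried in a foreign-key column being surfaced as an explicit, typed edge, after which any-to-any traversal becomes a generic operation.

```python
from collections import defaultdict, deque

# Relational style: the relationship is implicit in a foreign-key column.
employees = [
    {"id": 1, "name": "Ian", "manager_id": 2},
    {"id": 2, "name": "Mary", "manager_id": None},
]


def surface_relationships(rows):
    """Turn the buried manager_id column into explicit 'reportsTo' edges."""
    names = {row["id"]: row["name"] for row in rows}
    return [
        (row["name"], "reportsTo", names[row["manager_id"]])
        for row in rows
        if row["manager_id"] is not None
    ]


# Graph style: every relationship is its own record with a named verb.
edges = surface_relationships(employees) + [
    ("Mary", "worksOn", "ProjectX"),
    ("ProjectX", "uses", "GraphDB"),
]


def build_index(edge_list):
    """Index edges by source node for fast traversal."""
    index = defaultdict(list)
    for source, verb, target in edge_list:
        index[source].append((verb, target))
    return index


def find_path(index, start, goal):
    """Breadth-first search returning the chain of (source, verb, target) hops."""
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        node, path = queue.popleft()
        if node == goal:
            return path
        for verb, target in index.get(node, []):
            if target not in seen:
                seen.add(target)
                queue.append((target, path + [(node, verb, target)]))
    return None
```

With the edges above, `find_path(build_index(edges), "Ian", "GraphDB")` returns the three hops that connect a person to a technology, a question that would need several joins against separately designed tables in the relational layout.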
[Figure (PwC, 2016): data models arranged from relationship sparseness to relationship richness]
Relational: Row and column headers and up-front taxonomies (static, selective, fragmented, labor intensive)
Document: Nested, cumulative hierarchies (additive, index friendly, immutable, versioning possible)
Graph: Any-to-any relationships (more dynamic, more inclusive, more integrated, more machine assisted)
When overused, RDBMSes
perpetuate the provincial data
mentality of the 1980s, back
when computing didn’t scale
Lots of data is missing from relational
datasets—namely the contextual clues
needed for disambiguation via entity
resolution and, therefore, large-scale
integration
13. PwC | Data-centric design and the knowledge graph
The consequence of logic and data siloing – App-centric system-level complexity
and disconnectedness spinning out of control (Result – Table and Code sprawl)
[Figure: technology stacks by era, labeled Pre 1970, 1973-1990s, Early 1990s, Late 1990s, 2000s, 2010s and 2020s. The stacks grow from hardware, an OS, a DBMS and custom code, through a few OSes and DBMSes plus ERP+ suites, to lots of OSes with 1,000+ SQL/NoSQL DBs, then 5,000+ databases with componentized suites and a cloud layer, and finally 10,000+ DBs plus blockchains with a multicloud layer, suites as services and various SaaSes. Callout: Threat of more application-centric sprawl]
14. PwC | Data-centric design and the knowledge graph
Data-centric design at the micro level brings humans and machines together, with
the humans helping the machines build and scale relationship data
Relationship logic to be shared at scale needs to be created in human-machine feedback loops and embedded in a standard form at the data layer for full reuse, not trapped in app silos
Relationship-sparse, but highly articulated data context that humans need to help machines refine and enrich
Relationship-rich smart data that uses description or predicate logic to scale integration, context and interoperation
15. PwC | Data-centric design and the knowledge graph
The key opportunity – Large-scale integration and model-driven intelligence in
a de-siloed and de-duplicated way
Previously dominant
Rule-based systems (includes KR)
“Handcrafted knowledge” is the term DARPA uses; rule-based programming + procedure replication in process automation, + some knowledge representation (KR)
• Strong on logical reasoning in specific
concrete contexts
- Procedural + declarative programming +
set theory, etc.
- Deterministic
• Can’t learn or abstract
• Still exceptionally common and useful
On the rise and rapidly improving
Statistical machine learning
• Probabilistic
• From Bayesian algorithms to neural nets
(yes, deep learning also)
• Strong on perceiving and learning
(classifying, predicting)
• Weak on abstracting and reasoning
• Quite powerful in the aggregate but
individually (instance by instance) unreliable
• Can require lots of data
[Charts: each approach is rated on perceiving, learning, abstracting and reasoning]
Example (rule-based systems): Consumer tax software
Example (statistical machine learning): Facial recognition using deep learning/neural nets
John Launchbury of DARPA (https://www.youtube.com/watch?v=N2L8AqkEDLs), Estes Park Group and PwC research, 2017
Nascent, just beginning
Contextualized, model-driven approach
• Contextualized modeling approach allows efficiency, precision and certainty
• Combines power of deterministic,
probabilistic and description logic
• Allows explanations to be added
to decisions
• Accelerates the training process with the
help of specific, contextual human input
• Takes less data
Example: Explains first how handwritten letters are formed so machines can decide; less data needed, more transparency.
16. PwC | Data-centric design and the knowledge graph
Origins of data-centric thinking
Software Wasteland: How the Application-Centric Mindset is Hobbling our Enterprises
Dave McComb
The Data-Centric Manifesto
Principles
1. Data is a key asset of any organization.
2. The current enterprise software paradigm is
“Application-Centric.”
3. Hoarding data in proprietary and complex
apps is a mistake.
4. Most of the excessive cost and complexity in
Enterprise Apps stems from the relationship
of the apps to the data.
5. We are committed to reversing this trend.
6. We understand that there is money to be made in the application-centric paradigm.
http://datacentricmanifesto.org/principles/
Data-Centric Architecture Forum
Fort Collins, CO | February 3 – 5, 2020
In February 2019 we hosted the inaugural Data-Centric Conference, where we started a profound conversation about the exploding costs of enterprise systems, discussed strategies to reverse the application-centric mindset and committed to move the needle in the right direction by forging data-centric projects going forward. We are very pleased to announce we’ll do this again in February 2020 as the Data-Centric Architecture Forum. The theme of next year’s forum will be experience reports on attempting to implement portions of the architecture. Join us and our mission to get more people involved and skilled in data-centricity. Here’s a quick summary of our 2019 event to give you an idea of what to expect.
Hold the date, and save some money: Super Early Bird Discount of $300 off if you register by June 30, 2019!
https://www.semanticarts.com/dcc/
17. PwC | Data-centric design and the knowledge graph
The solution – Data-centric architecture reduces both application and
database sprawl
Trapped app code and databases
Application centric versus Data centric
Semantic model/rules
Data lake or hub
Applets
Applications for execution only
Models exposed with the data
18. PwC | Data-centric design and the knowledge graph
Rationalize – Identify and declare the few hundred business rules you need
as a model
“In every company I’ve ever studied, there are only a few hundred key concepts and relationships that the entire business runs on. Once you
understand that, you realize all of these millions of distinctions are just slight variations of those few hundred important things.”
--Dave McComb, author of Software Wasteland, quoted in Strategy + Business
See “Are You Spending Way Too Much on Software?” at
https://www.strategy-business.com/article/Are-You-Spending-Way-Too-Much-on-Software?
19. PwC | Data-centric design and the knowledge graph
Reuse – Call the model to reuse those rules whenever necessary
“You discover that many of the slight variations aren’t variations at all. They’re really the same things with different names, different structures,
or different labels. So it’s desirable to describe those few hundred concepts and relationships in the form of a declarative model that small
amounts of code refer to again and again.”
--Dave McComb (as previously cited)
See “Are You Spending Way Too Much on Software?” at
https://www.strategy-business.com/article/Are-You-Spending-Way-Too-Much-on-Software?
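One way to picture "a declarative model that small amounts of code refer to again and again" is the Python sketch below. It is an illustration only: the `MODEL` contents, the synonym lists and the function names are hypothetical and do not come from McComb or the talk.

```python
# Hypothetical declarative model: each canonical concept lists the labels
# that different applications use for the same thing.
MODEL = {
    "Customer": {"synonyms": {"customer", "client", "account_holder"}},
    "Product": {"synonyms": {"product", "item", "sku"}},
}


def build_lookup(model):
    """Flatten the model into a label -> canonical concept lookup."""
    return {
        label: concept
        for concept, spec in model.items()
        for label in spec["synonyms"]
    }


def canonicalize(record, lookup):
    """Rename a source record's fields to canonical concept names.

    Fields the model does not describe pass through unchanged.
    """
    return {lookup.get(key.lower(), key): value for key, value in record.items()}
```

The same few lines of code serve every source system. For example, `canonicalize({"Client": "Acme", "SKU": "X-1"}, build_lookup(MODEL))` yields a record keyed by `Customer` and `Product`, and supporting a new label means editing the model rather than writing more code.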
21. PwC | Data-centric design and the knowledge graph
State of the art knowledge graph – Blue Brain Nexus (1 of 2)
How do scientists record the provenance of, curate, share in open source and collaborate on what they’ve documented using 3D imaging techniques generated with the help of a supercomputer, such as the slices of a rat’s brain?
From the EPFL Blue Brain Portal Gallery, https://portal.bluebrain.epfl.ch/gallery-2/
22. PwC | Data-centric design and the knowledge graph
State of the art knowledge graph – Blue Brain Nexus (2 of 2)
Bogdan Roman, “Blue Brain Nexus Technical Introduction,” March 2018, https://www.slideshare.net/BogdanRoman1/bluebrain-nexus-technical-introduction-91266871
23. PwC | Data-centric design and the knowledge graph
Montefiore’s semantic data lake
Montefiore Health, Franz, Intel and PwC research, 2017
Various data sources,
some structured, some
not, now all part of
a knowledge graph with
a simple patient
care-centric ontology
Hadoop cluster with
high-performance
processors and memory
Scalable graph database
supporting open W3C
semantic standards
Standard open source
querying, ML and
analytics frameworks, API
accessibility
[Figure: architecture of the semantic data lake]
Data sources: HL7 feed, web services, EMR, LIMS, legacy, OMICs, CTMS, claims
Annotation engine
Storage: multiple HDFS/Hadoop nodes
Graph database: multiple AllegroGraph instances, fed by the SDL loader
Access: ML-LIB/R, SPARQL, Prolog, Spark, Java API
Doctors can query the graph or
harness ML + analytics and receive
answers from the system at the
point of care via their handhelds.
The system also acts as a giant
feedback-response or learning loop
which learns from the data collected
via user/system interactions.
24. PwC | Data-centric design and the knowledge graph
A semantic knowledge graph could enable the model-driven organization (a digital
twin) at the data layer
Step One: Model the relevant
elements of the organization, how
they relate to one another
and interoperate
Step Two: Embed the model where
it lives as machine-readable data
Step Three: Integrate the source
datasets as a target knowledge
graph with model-driven mappings
Step Four: Browse, query,
disambiguate, detect and discover
via the resulting knowledge graph
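The four steps above can be sketched in a few lines of Python. This is a toy illustration under loud assumptions: the model, the two source systems, the column names and the `integrate` and `query` helpers are all hypothetical, and a production system would more likely use a graph database and open W3C semantic standards than in-memory sets.

```python
# Steps One and Two: the model is itself machine-readable data.
MODEL = {
    "classes": {"Person", "Company"},
    "verbs": {"employs": ("Company", "Person")},  # verb -> (domain, range)
}

# Step Three: model-driven mappings from source columns to model terms.
MAPPINGS = {
    "hr_system": {"subject": "employer", "verb": "employs", "object": "employee"},
    "crm": {"subject": "org_name", "verb": "employs", "object": "contact"},
}

# Hypothetical source datasets with different column names.
SOURCES = {
    "hr_system": [{"employer": "Acme", "employee": "Ian"}],
    "crm": [{"org_name": "Acme", "contact": "Mary"}],
}


def integrate(sources, mappings, model):
    """Load every source into one set of triples, guided by the mappings."""
    graph = set()
    for name, rows in sources.items():
        mapping = mappings[name]
        verb = mapping["verb"]
        if verb not in model["verbs"]:
            raise ValueError(f"{verb!r} is not declared in the model")
        for row in rows:
            graph.add((row[mapping["subject"]], verb, row[mapping["object"]]))
    return graph


def query(graph, subject=None, verb=None, obj=None):
    """Step Four: match triples; None acts as a wildcard."""
    return sorted(
        t for t in graph
        if subject in (None, t[0]) and verb in (None, t[1]) and obj in (None, t[2])
    )
```

After `graph = integrate(SOURCES, MAPPINGS, MODEL)`, the call `query(graph, subject="Acme", verb="employs")` returns employees drawn from both source systems, even though neither system used the word "employs" in its own schema.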
[Figure: model of an organization as entities and relationships. Clearvision, 2019. Used with permission. https://virtualdutchman.com/2018/10/14/moving-to-a-model-based-enterprise-the-business-model/]
Entities: Company, Portfolio, Prog/proj, Work package, Strategy, Milestone, Risk, Role, Competence, Person, Process, Information, Technology, Capability
Relationships:
Company has portfolio
Company has prog/proj
Company has role
Company employs person
Portfolio has prog/proj
Portfolio has person
Prog/proj has parent prog/proj
Prog/proj has person
Prog/proj has role
Prog/proj has risk
Prog/proj has milestone
Prog/proj outputs work package
Prog/proj delivers strategy
Prog/proj delivers capability
Prog/proj supports process
Prog/proj creates information
Prog/proj uses information
Prog/proj creates technology
Prog/proj uses technology
Strategy has milestone
Work package has person
Work package needs competence
Role needs competence
Risk owned by person
Person identified risk
Person has competence
Person uses process
Person uses information
Person creates information
Person uses technology
Person uses capability
Capability enables process
Capability uses information
Capability uses technology
Process uses information
Information uses technology
Technology supports process
26. PwC | Data-centric design and the knowledge graph
Seven obstacles to semantics and knowledge graph adoption and ways around them
Obstacle to adoption | Nature of the problem | Ways to overcome
1. Tribalism | Each tribe works off on its own, rarely with other tribes | Encourage activist leadership and hire to emphasize the blended nature of the solution
2. Low awareness in the trenches | Few seem to acknowledge or care about what’s actually happening | Find those who want to learn and be inspired
3. Magic bullet mentality | Inflated, unrealistic expectations regarding “AI”, RPA, blockchain, etc. | Promote foxes (breadth) rather than hedgehogs (depth)
4. Indifference about the back end | While the front end seems always bright and shiny, few seem to care about the plumbing | Highlight the end user benefits the back end and a systems approach enables
5. Lack of university coursework | Few universities in the US seem to offer courses in semantics | Follow the European example
6. Misplaced belief in the centrality of the app layer | Shallow understanding of data + logic, declarative versus imperative programming, etc.; reinforcement of the status quo | Focus on less mature areas where alternative approaches are more likely to be accepted
7. Buy rather than build habit | Enthusiasm for the latest new products and services | Focus on the system rather than the piece parts
28. PwC | Data-centric design and the knowledge graph
Graphs (including hybrids) complete the picture of your transformed data lifecycle
and how it’s managed
29. PwC | Data-centric design and the knowledge graph
Bottom line – The 4D approach to insight
1. De-silo: Integrate all the relevant sources in a declarative fashion that enables reuse, cross-enterprise scalability, and continuous refinement.
2. Disambiguate: Triangulate using set theory and linguistic description logic in addition to statistical methods, enabling precise
entity resolution.
3. Detect: Uncover weaker signals by articulating the most relevant and distant relationships between entities, via richer contextualization.
4. Discover: Radically expand the ability to discover insights, moving beyond keywords to concepts.
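As a rough illustration of the "Disambiguate" step, the Python sketch below resolves an ambiguous mention by comparing sets of relationships. It swaps in Jaccard similarity as a simple stand-in for the set-theory triangulation the slide names; the candidate records, the threshold and the function names are assumptions made for the example, not the method described in the talk.

```python
def jaccard(a, b):
    """Overlap of two sets as a fraction of their union (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical candidate entities that share the same name.
candidates = {
    "john_smith_1": {("worksAt", "Acme"), ("livesIn", "Chicago"), ("knows", "Mary")},
    "john_smith_2": {("worksAt", "Globex"), ("livesIn", "Denver")},
}

# Relationships observed around an ambiguous mention of "John Smith".
mention = {("worksAt", "Acme"), ("knows", "Mary")}


def resolve(mention_context, candidate_contexts, threshold=0.5):
    """Pick the candidate whose relationships overlap most with the mention.

    Returns None when no candidate reaches the (assumed) threshold.
    """
    scored = {
        name: jaccard(mention_context, context)
        for name, context in candidate_contexts.items()
    }
    best = max(scored, key=scored.get)
    return best if scored[best] >= threshold else None
```

Here `resolve(mention, candidates)` selects `john_smith_1` because two of the three relationships in the union match, while a mention with no overlapping relationships resolves to `None` instead of guessing. Richer relationship data gives this kind of comparison more to work with, which is the point of the contextualization argument.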
Outlook and conclusion
Kevin Kelly's concept of the mirrorworld describes the future vision, which he says will take 25
years to materialize.
Poor data management is the main reason we're stuck at the starting gate with the mirrorworld.