Graph enhancements to AI and ML are changing the landscape of intelligent applications. In this session, we’ll focus on how using connected features can help improve the accuracy, precision, and recall of machine learning models. You’ll learn how graph algorithms can provide more predictive features as well as aid in feature selection to reduce overfitting. We’ll look at a link prediction example to predict collaboration with measurable improvement when including graph-based features.
4. Relationships:
Strongest Predictors of Behavior!
“Increasingly we're learning that you can make
better predictions about people by getting all the
information from their friends and their friends’
friends than you can from the information you
have about the person themselves”
James Fowler
David Burkus
Albert-Laszlo Barabasi
5. Native Graph Platforms are Designed for Connected Data
TRADITIONAL PLATFORMS: Store and retrieve data. Real time storage & retrieval.
BIG DATA TECHNOLOGY: Aggregate and filter data. Long running queries, aggregation & filtering.
NATIVE GRAPH PLATFORMS: Connections in data. Real-Time Connected Insights.
“Our Neo4j solution is literally thousands of times faster
than the prior MySQL solution, with queries that require
10-100 times less code”
Volker Pacher, Senior Developer
Max # of hops ~3
Millions
9. • Current data science models ignore network structure & complex relationships
• Graphs add highly predictive features to existing ML models
• Otherwise unattainable predictions based on relationships
Novel & More Accurate Predictions
with the Data You Already Have
Machine Learning Pipeline
12. Connection-related metrics about our graph, such
as the number of relationships going into or out of
nodes, a count of potential triangles, or neighbors in
common.
What Are Connected Features?
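As a minimal sketch of such connection-related metrics, assuming a small hypothetical directed graph held in a plain Python edge list (the node names and helper functions below are illustrative and not from the talk):

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical directed relationships (source, target); not data from the talk.
edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d"), ("d", "a")]

out_degree = defaultdict(int)   # relationships going out of each node
in_degree = defaultdict(int)    # relationships going into each node
neighbors = defaultdict(set)    # undirected view, used for triangles

for src, dst in edges:
    out_degree[src] += 1
    in_degree[dst] += 1
    neighbors[src].add(dst)
    neighbors[dst].add(src)


def triangle_count(node):
    """Count pairs of the node's neighbors that are themselves connected."""
    return sum(1 for u, v in combinations(sorted(neighbors[node]), 2)
               if v in neighbors[u])


def common_neighbors(u, v):
    """Neighbors shared by u and v, ignoring relationship direction."""
    return neighbors[u] & neighbors[v]


for node in sorted(neighbors):
    print(node, in_degree[node], out_degree[node], triangle_count(node))
```

Each printed row is one node with its in-degree, out-degree, and triangle count, which is the kind of per-node value that can later be attached to a training example.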
13. Deriving Connected Features
Query (e.g. Cypher): real-time, local decisioning and pattern matching (local patterns). You know what you’re looking for and are making a decision.
Graph Algorithms Libraries: global analysis and iterations (global computation). You’re learning the overall structure of a network, updating data, and predicting.
14. Connected Feature Engineering
Feature Engineering is how we combine and process the data to create new,
more meaningful features, such as clustering or connectivity metrics.
Add More Descriptive Features:
- Influence
- Relationships
- Communities
Extraction
15.
Graph Feature Categories & Algorithms
Pathfinding & Search: finds the optimal paths or evaluates route availability and quality
Centrality / Importance: determines the importance of distinct nodes in the network
Community Detection: detects group clustering or partition options
Heuristic Link Prediction: estimates the likelihood of nodes forming a relationship
Similarity: evaluates how alike nodes are
Embeddings: learned representations of connectivity or topology
17.
Can we infer new interactions in the future?
What unobserved facts are we missing?
18. 50+ years of biomedical data
integrated in a knowledge
graph
Predicting new uses for drugs
by using the graph structure to
create features for link
prediction
Example: het.io
20. Methods for Link Prediction
Algorithm Measures
Run targeted algorithms and score
outcomes
Set a threshold value used to predict a
link between nodes
Machine Learning
Use the measures as features to train an
ML model
Community
Detection
Link
Prediction
Similarity
1st Node | 2nd Node | Common Neighbors | Preferential Attachment | label
1 | 2 | 4 | 15 | 1
3 | 4 | 7 | 12 | 1
5 | 6 | 1 | 1 | 0
Centrality
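The two methods can be contrasted with a short sketch that reuses the rows of the example table. The threshold value and function name are hypothetical and chosen only for illustration; no classifier is trained here.

```python
# Rows mirror the example table:
# (node1, node2, common_neighbors, preferential_attachment, label)
rows = [
    (1, 2, 4, 15, 1),
    (3, 4, 7, 12, 1),
    (5, 6, 1, 1, 0),
]

# Hypothetical cut-off, chosen for illustration only.
COMMON_NEIGHBORS_THRESHOLD = 2


def predict_by_threshold(common_neighbors, threshold=COMMON_NEIGHBORS_THRESHOLD):
    """Algorithm-measure method: predict a link when the score clears the threshold."""
    return 1 if common_neighbors >= threshold else 0


predictions = [predict_by_threshold(row[2]) for row in rows]

# Machine-learning method: the same measures become a feature matrix X
# and a label vector y that a classifier can be trained on.
X = [[row[2], row[3]] for row in rows]
y = [row[4] for row in rows]

print(predictions, X, y)
```

With the first method the threshold has to be picked by hand for each measure; with the second, a model learns how to weigh several measures together.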
22. • Citation Network Dataset - Research Dataset
– “ArnetMiner: Extraction and Mining of Academic Social Networks”, by
J. Tang et al
– Used a subset with 52K papers, 80K authors, 140K author
relationships and 29K citation relationships
• Neo4j
– Create a co-authorship graph and connected feature engineering
• Spark and MLlib
– Train and test our model using a random forest classifier
Predicting Collaboration
with a Graph Enhanced ML Model
23. Our Link Prediction Workflow
- Extract Data & Store as Graph: import data, create co-author graph
- Explore, Clean, Modify
- Prepare for Machine Learning
- Train Models
- Evaluate Results
- Productionize
24.
25. Our Link Prediction Workflow
- Extract Data & Store as Graph: import data, create co-author graph
- Explore, Clean, Modify: identified sparse feature areas; feature engineering (new graphy features)
- Prepare for Machine Learning
- Train Models
- Evaluate Results
- Productionize
26. Graph Algorithms Used for
Feature Engineering (few examples)
Preferential Attachment scores a pair of nodes by multiplying
their neighbor counts: well-connected nodes tend to gain more connections
Common Neighbors counts the neighbors that two nodes
share (triadic closure)
Illustration be.amazd.com/link-prediction/
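A minimal sketch of these pair-level measures, using their usual set-based definitions on a hypothetical undirected co-authorship graph (the author names and adjacency map are invented for illustration):

```python
# Hypothetical undirected co-authorship graph as an adjacency map.
graph = {
    "ann": {"bob", "cat", "dan"},
    "bob": {"ann", "cat"},
    "cat": {"ann", "bob", "eve"},
    "dan": {"ann"},
    "eve": {"cat"},
}


def common_neighbors(g, x, y):
    """Number of neighbors shared by x and y."""
    return len(g[x] & g[y])


def preferential_attachment(g, x, y):
    """Product of the two nodes' neighbor counts."""
    return len(g[x]) * len(g[y])


def total_neighbors(g, x, y):
    """Number of distinct neighbors of x and y combined."""
    return len(g[x] | g[y])


for pair in [("bob", "dan"), ("ann", "cat"), ("dan", "eve")]:
    print(pair,
          common_neighbors(graph, *pair),
          preferential_attachment(graph, *pair),
          total_neighbors(graph, *pair))
```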
27. Graph Algorithms Used for
Feature Engineering (few examples)
Triangle counting and clustering coefficients measure the
density of connections around nodes
Louvain Modularity identifies interacting communities and
hierarchies
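The triangle and clustering-coefficient measures can be sketched as follows, assuming the standard local clustering coefficient 2T / (k(k-1)) for a node with k neighbors and T triangles. The graph and the `pair_features` helper are hypothetical; Louvain Modularity is not shown because it needs a full community-detection implementation.

```python
from itertools import combinations

# Hypothetical undirected co-authorship graph as an adjacency map.
graph = {
    "ann": {"bob", "cat", "dan"},
    "bob": {"ann", "cat"},
    "cat": {"ann", "bob", "eve"},
    "dan": {"ann"},
    "eve": {"cat"},
}


def triangles(g, node):
    """Number of triangles that pass through the node."""
    return sum(1 for u, v in combinations(g[node], 2) if v in g[u])


def clustering_coefficient(g, node):
    """Share of the node's neighbor pairs that are connected to each other."""
    k = len(g[node])
    if k < 2:
        return 0.0
    return 2 * triangles(g, node) / (k * (k - 1))


def pair_features(g, x, y):
    """Min and max of each node-level measure for a candidate pair."""
    t = sorted((triangles(g, x), triangles(g, y)))
    c = sorted((clustering_coefficient(g, x), clustering_coefficient(g, y)))
    return {"min_triangles": t[0], "max_triangles": t[1],
            "min_coefficient": c[0], "max_coefficient": c[1]}


print(pair_features(graph, "ann", "dan"))
```

Taking the min and max over the two nodes turns node-level measures into features of a pair, which is the unit a link prediction model classifies.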
28. Our Link Prediction Workflow
- Extract Data & Store as Graph: import data, create co-author graph
- Explore, Clean, Modify: identified sparse feature areas; feature engineering (new graphy features)
- Prepare for Machine Learning: train / test split; resample (downsampled for proportional representation)
- Train Models
- Evaluate Results
- Productionize
32. OMG I’m Good!
Data Leakage!
Graph metric computation for the train set
touches data from the test set.
Did you get really high accuracy on your first
run without tuning?
33. Train and Test Graphs: Time Based Split
Train (< 2006)
1st Node | 2nd Node | Common Neighbors | Preferential Attachment | label
1 | 2 | 4 | 15 | 1
3 | 4 | 7 | 12 | 1
5 | 6 | 1 | 1 | 0
Test (>= 2006)
1st Node | 2nd Node | Common Neighbors | Preferential Attachment | label
2 | 12 | 3 | 3 | 0
4 | 9 | 4 | 8 | 1
7 | 10 | 12 | 36 | 1
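A time-based split can be sketched as below. The co-authorship records are hypothetical, and this is only one way to keep train-side features away from later data; it is not necessarily the exact procedure used in the talk.

```python
from collections import defaultdict

# Hypothetical records: (author_a, author_b, year of first joint paper).
coauthorships = [
    ("ann", "bob", 2001), ("ann", "cat", 2003), ("bob", "cat", 2004),
    ("cat", "dan", 2005), ("ann", "dan", 2006), ("dan", "eve", 2007),
    ("bob", "dan", 2008),
]
SPLIT_YEAR = 2006


def build_graph(records):
    """Undirected adjacency map built from the given records only."""
    g = defaultdict(set)
    for a, b, _year in records:
        g[a].add(b)
        g[b].add(a)
    return g


def common_neighbors(g, x, y):
    return len(g.get(x, set()) & g.get(y, set()))


train_records = [r for r in coauthorships if r[2] < SPLIT_YEAR]
test_records = [r for r in coauthorships if r[2] >= SPLIT_YEAR]

train_graph = build_graph(train_records)
full_graph = build_graph(coauthorships)

# Computed on the train graph only: no later relationships are visible.
print(common_neighbors(train_graph, "ann", "dan"))
# Computed on the full graph: later relationships leak into the feature.
print(common_neighbors(full_graph, "ann", "dan"))
```

The two printed values differ for the same pair, which is the leakage effect in miniature: a feature computed over all years already reflects relationships the model is supposed to predict.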
36. There are significantly more negative examples than positive ones:
# negative examples = (# nodes)² - (# relationships) - (# nodes)
Class Imbalance
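To get a feel for the scale, the formula can be evaluated with the rounded dataset sizes quoted earlier (80K authors, 140K author relationships). The rounding is an assumption, so the result is an order-of-magnitude illustration rather than the exact count from the talk.

```python
# Rounded figures from the dataset description; treated here as assumptions.
nodes = 80_000
relationships = 140_000

# negative examples = (# nodes)^2 - (# relationships) - (# nodes)
negatives = nodes ** 2 - relationships - nodes
ratio = negatives / relationships

print(f"{negatives:,} negative examples")
print(f"about {ratio:,.0f} negatives per positive")
```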
37. A model could achieve very high accuracy simply by predicting that every pair of nodes is not linked.
Class Imbalance
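A short sketch of why accuracy misleads here, followed by one possible way to downsample. The 1:99 class ratio, the equal-size target, and the `downsample_negatives` helper are all assumptions made for illustration.

```python
import random

# Hypothetical labels with a 1:99 class ratio (1 = linked, 0 = not linked).
labels = [1] * 10 + [0] * 990

# A "model" that always predicts "not linked".
predictions = [0] * len(labels)
accuracy = sum(p == t for p, t in zip(predictions, labels)) / len(labels)
recall = sum(p == 1 and t == 1 for p, t in zip(predictions, labels)) / sum(labels)
print(accuracy, recall)


def downsample_negatives(examples, seed=42):
    """Keep all positives and an equally sized random sample of negatives.

    `examples` is a list of (features, label) tuples.
    """
    positives = [e for e in examples if e[1] == 1]
    negatives = [e for e in examples if e[1] == 0]
    rng = random.Random(seed)
    sampled = rng.sample(negatives, min(len(positives), len(negatives)))
    balanced = positives + sampled
    rng.shuffle(balanced)
    return balanced
```

The always-negative model scores high on accuracy while finding none of the actual links, which is why the later slides look at precision, recall, and the ROC curve.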
39. Our Link Prediction Workflow
- Extract Data & Store as Graph: import data, create co-author graph
- Explore, Clean, Modify: identified sparse feature areas; feature engineering (new graphy features)
- Prepare for Machine Learning: train / test split; resample (downsampled for proportional representation)
- Train Models: model selection (Random Forest, an ensemble method)
- Evaluate Results
- Productionize
41. Training Our Model
This is one decision tree in our
Random Forest used as a binary
classifier to learn how to classify a
pair: predicting either linked or not
linked.
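To illustrate how individual trees combine into a forest, here is a toy sketch in which three hand-written trees vote on a pair. The thresholds and feature names are hypothetical and nothing is learned; a real random forest (such as the Spark MLlib one used in the talk) learns its splits from bootstrapped samples and random feature subsets.

```python
# Each "tree" is a hand-written rule set returning 1 (linked) or 0 (not linked).
# All thresholds are hypothetical, for illustration only.

def tree_common_authors(f):
    return 1 if f["common_authors"] >= 2 else 0


def tree_pref_attachment(f):
    return 1 if f["pref_attachment"] >= 10 else 0


def tree_triangles(f):
    if f["min_triangles"] >= 1:
        return 1 if f["common_authors"] >= 1 else 0
    return 0


FOREST = [tree_common_authors, tree_pref_attachment, tree_triangles]


def predict(features):
    """Majority vote across the trees."""
    votes = sum(tree(features) for tree in FOREST)
    return 1 if votes * 2 > len(FOREST) else 0


pair_a = {"common_authors": 4, "pref_attachment": 15, "min_triangles": 0}
pair_b = {"common_authors": 1, "pref_attachment": 1, "min_triangles": 0}
print(predict(pair_a), predict(pair_b))
```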
42. 4 Models Trained
with Multiple Graph Features
Common Authors Model: Graph Features: Common Authors
“Graphy” Model: Graph Features: Preferential Attachment, Total Neighbors
Triangles Model: Graph Features: Min & Max Triangles, Min & Max Clustering Coefficient
Community Model: Graph Features: Label Propagation, Louvain Modularity
43. Our Link Prediction Workflow
- Extract Data & Store as Graph: import data, create co-author graph
- Explore, Clean, Modify: identified sparse feature areas; feature engineering (new graphy features)
- Prepare for Machine Learning: train / test split; resample (downsampled for proportional representation)
- Train Models: model selection (Random Forest, an ensemble method)
- Evaluate Results: precision, accuracy, recall; ROC curve & AUC
- Productionize
44. Measures
Accuracy: proportion of all predictions that are correct. Beware of skewed data!
Precision: proportion of positive predictions that are correct. Low score = more false positives.
Recall / True Positive Rate: proportion of actual positives that are correctly predicted. Low score = more false negatives.
False Positive Rate: proportion of actual negatives that are incorrectly predicted as positive.
ROC Curve & AUC: X-Y chart plotting the two rates above (TPR against FPR), summarized by the area under the curve.
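These measures can be computed directly from predictions, as in the sketch below. The labels and scores are made up, and the AUC uses the pairwise-ranking formulation (the share of positive/negative pairs in which the positive gets the higher score) rather than any particular library's implementation.

```python
def confusion_counts(y_true, y_pred):
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    return tp, fp, tn, fn


def measures(y_true, y_pred):
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,   # true positive rate
        "fpr": fp / (fp + tn) if fp + tn else 0.0,      # false positive rate
    }


def auc(y_true, scores):
    """Probability that a random positive outscores a random negative."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical labels, hard predictions, and model scores.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.3, 0.6, 0.2, 0.1, 0.4, 0.05]

print(measures(y_true, y_pred))
print(auc(y_true, scores))
```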
45. Result: First Model ROC & AUC
Problematic False Positives!
Common Authors
Model 1
47. Iteration & Tuning: Feature Influence
For feature importance, the Spark
random forest averages the
reduction in impurity across all
trees in the forest
Feature rankings are in comparison
to the group of features evaluated
Also try PageRank!
Try removing different features
(LabelPropagation)
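The idea of impurity reduction can be sketched with Gini impurity. This is a simplification: the `average_importance` helper and its inputs are hypothetical, and Spark's actual computation also weights each split by the number of instances reaching the node and normalizes per tree before averaging.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    total = len(labels)
    if total == 0:
        return 0.0
    return 1.0 - sum((c / total) ** 2 for c in Counter(labels).values())


def impurity_decrease(parent, left, right):
    """Impurity of the parent minus the size-weighted impurity of the children."""
    n = len(parent)
    weighted = len(left) / n * gini(left) + len(right) / n * gini(right)
    return gini(parent) - weighted


def average_importance(per_tree_decreases):
    """Average each feature's decrease across trees, then normalize to sum to 1.

    `per_tree_decreases` is a list of {feature_name: total_decrease} dicts.
    """
    features = sorted({f for tree in per_tree_decreases for f in tree})
    n_trees = len(per_tree_decreases)
    raw = {f: sum(tree.get(f, 0.0) for tree in per_tree_decreases) / n_trees
           for f in features}
    total = sum(raw.values())
    return {f: v / total for f, v in raw.items()} if total else raw


parent = [1, 1, 1, 0, 0, 0]
print(impurity_decrease(parent, [1, 1, 1], [0, 0, 0]))  # a perfect split
print(impurity_decrease(parent, [1, 1, 0], [1, 0, 0]))  # a weak split
```

Because the scores are normalized within the set of features being evaluated, adding or removing a feature changes every other feature's ranking value.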
48. Graph Machine Learning Workflow
- Extract Data & Store as Graph: data aggregation; create and store graphs
- Explore, Clean, Modify: identify uninteresting features; cleanse (outliers+); feature engineering/extraction
- Prepare for Machine Learning: train / test split; resample for meaningful representation (proportional, etc.)
- Train Models: model & variable selection; hyperparameter tuning; ensemble methods
- Evaluate Results: precision, accuracy, recall (ROC curve & AUC); SME review; cross-validation
- Productionize
51.
Connected Feature Extraction
Feature Extraction is how we change the shape or format of the data
to be usable in a machine learning pipeline. For example, from a graph, we
extract the relevant subset of the data into a tabular format for model
building.
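As a minimal sketch of this step, the snippet below turns a hypothetical in-memory graph into CSV rows, one per candidate pair. The column names, the graph, and the helper functions are assumptions for illustration; in the talk the graph lives in Neo4j.

```python
import csv
import io

# Hypothetical undirected co-authorship graph as an adjacency map.
graph = {
    "ann": {"bob", "cat", "dan"},
    "bob": {"ann", "cat"},
    "cat": {"ann", "bob", "eve"},
    "dan": {"ann"},
    "eve": {"cat"},
}
candidate_pairs = [("bob", "dan"), ("ann", "cat"), ("dan", "eve")]
FIELDS = ["node1", "node2", "common_neighbors", "preferential_attachment", "label"]


def extract_rows(g, pairs):
    """Yield one tabular row per candidate pair."""
    for x, y in pairs:
        yield {
            "node1": x,
            "node2": y,
            "common_neighbors": len(g[x] & g[y]),
            "preferential_attachment": len(g[x]) * len(g[y]),
            "label": int(y in g[x]),  # 1 if the pair is already linked
        }


def to_csv(rows):
    """Serialize the rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


table = to_csv(extract_rows(graph, candidate_pairs))
print(table)
```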
52. Connected Feature Selection
Feature Selection is how we reduce the number of features used in a model
to a relevant subset. This can be done algorithmically or based on domain
expertise, but the objective is to maximize the predictive power of your
model while minimizing overfitting.
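One simple algorithmic approach, shown here only as an illustration and not as the method used in the talk, is to rank features by the absolute Pearson correlation with the label and keep the top few. The feature values, labels, and the `select_features` helper are hypothetical.

```python
from statistics import mean, pstdev


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either series is constant."""
    mx, my = mean(xs), mean(ys)
    sx, sy = pstdev(xs), pstdev(ys)
    if sx == 0 or sy == 0:
        return 0.0
    cov = mean([(x - mx) * (y - my) for x, y in zip(xs, ys)])
    return cov / (sx * sy)


def select_features(columns, labels, k):
    """Return the k feature names most correlated (in absolute value) with the label."""
    ranked = sorted(columns,
                    key=lambda name: abs(pearson(columns[name], labels)),
                    reverse=True)
    return ranked[:k]


# Hypothetical feature columns for four candidate pairs.
columns = {
    "common_authors": [4, 7, 1, 0],
    "pref_attachment": [15, 12, 1, 2],
    "constant": [3, 3, 3, 3],  # carries no information
}
labels = [1, 1, 0, 0]

selected = select_features(columns, labels, k=2)
print(selected)
```

Correlation ranking ignores interactions between features, so in practice it is usually combined with model-based importance scores or domain review.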
53. 720+
7/10
12/25
8/10
53K+
100+
300+
450+
Adoption
Top Retail Firms
Top Financial Firms
Top Software Vendors
Customers Partners
• Creator of the Neo4j Graph Platform
• ~250 employees
• HQ in Silicon Valley, other offices include
London, Munich, Paris and Malmö Sweden
• $80M new funding led by Morgan Stanley &
One Peak. Total $160M from Fidelity,
Sunstone, Conor, Creandum, and
Greenbridge Capital
• 15M+ downloads & container pulls
• 325+ enterprise subscription customers,
over half of which have >$1B in revenue
Ecosystem
Startups in program
Enterprise customers
Partners
Meet up members
Events per year
Industry’s Largest Dedicated Investment in Graphs
Neo4j - The Graph Company
54.
Helping The World To Make Sense of Data
ICIJ used Neo4j to uncover the
world’s largest journalistic leak to
date, The Panama Papers
NASA uses Neo4j for a “Lessons
Learned” database to improve
effectiveness in search missions in
space
Neo4j is used to graph the human
body, map correlations, identify cause
& effect and search for the cure for
cancer
SAVING DEMOCRACY
MISSION TO
MARS
CURING CANCER
55. Graph and ML Algorithms in Neo4j
• Parallel Breadth First Search & DFS
• Shortest Path
• Single-Source Shortest Path
• All Pairs Shortest Path
• Minimum Spanning Tree
• A* Shortest Path
• Yen’s K Shortest Path
• K-Spanning Tree (MST)
• Random Walk
• Degree Centrality
• Closeness Centrality
• CC Variations: Harmonic, Dangalchev,
Wasserman & Faust
• Betweenness Centrality
• Approximate Betweenness Centrality
• PageRank
• Personalized PageRank
• ArticleRank
• Eigenvector Centrality
• Triangle Count
• Clustering Coefficients
• Connected Components (Union Find)
• Strongly Connected Components
• Label Propagation
• Louvain Modularity – 1 Step & Multi-Step
• Balanced Triad (identification)
• Euclidean Distance
• Cosine Similarity
• Jaccard Similarity
• Overlap Similarity
• Pearson Similarity
Pathfinding
& Search
Centrality /
Importance
Community
Detection
Similarity
neo4j.com/docs/
graph-algorithms/current/
Updated April 2019
Link
Prediction
• Adamic Adar
• Common Neighbors
• Preferential Attachment
• Resource Allocations
• Same Community
• Total Neighbors
57. Neo4j is an enterprise-grade native graph platform that enables you to:
• Store, reveal and query data relationships
• Traverse and analyze any levels of depth in real-time
• Add context and connect new data on the fly
Who We Are: Leader in Graph Innovations
• Performance
• ACID Transactions
• Schema-free Agility
• Graph Algorithms
Designed, built and tested natively
for graphs from the start for:
• Developer Productivity
• Hardware Efficiency
• Global Scale
• Graph Adoption
Graph
Transactions
Graph
Analytics
Data Integration
Development
& Admin
Analytics
Tooling
Drivers & APIs Discovery & Visualization
58.
• Record “Cyber Monday” sales
• About 35M daily transactions
• Each transaction is 3-22 hops
• Queries executed in 4ms or less
• Replaced IBM Websphere commerce
• 300M pricing operations per day
• 10x transaction throughput on half the
hardware compared to Oracle
• Replaced Oracle database
• Large postal service with over 500k
employees
• Neo4j routes 7M+ packages daily at peak,
with peaks of 5,000+ routing operations per
second.
Handling Large Graph Work Loads for Enterprises
Real-time promotion
recommendations
Marriott’s Real-time
Pricing Engine
Handling Package
Routing in Real-Time
59. Recommendations, Dynamic Pricing, IoT Applications, Fraud Detection
Real-Time Transaction Applications
Generate and
Protect Revenue
Customer
Engagement
Metadata and Advanced Analytics
Data Lake
Integration
Knowledge
Graphs for AI
Risk
Mitigation
Generate
Actionable Insights
Network
Management
Supply Chain
Efficiency
Identity and Access
Management
Internal Business Processes
Improve Efficiency
and Cut Costs
Graph Use Cases by Value Proposition
62. Collections-Focused
Multi-Model, Documents, Columns
& Simple Tables, Joins
Neo4j is designed for data relationships
Different Paradigms
NoSQL
Relational
DBMS
Neo4j Graph
Platform
Connections-Focused
Focused on
Data Relationships
Development Benefits
Easy model maintenance
Easy query
Deployment Benefits
Ultra high performance
Minimal resource usage
63. How Neo4j Fits — Common Architecture Patterns
From Disparate Silos
To Cross-Silo Connections
From Tabular Data
To Connected Data
From Data Lake Analytics
to Real-Time Operations
64. Cypher: Powerful & Expressive Query Language
MATCH (:Person {name: "Dan"})-[:MARRIED_TO]->(spouse)
RETURN spouse
MARRIED_TO
Dan Ann
NODE RELATIONSHIP TYPE
LABEL PROPERTY VARIABLE
67. Graphs Drive Innovation
Context Paths
Auto-Graphs
Graph Layers
1st Graph
Cross-
Connect
Cross-tech applications
Internet of Things
operations
Transparent Neural
Networks
Blockchain-managed
systems
Adjacent graph layers
inspire new innovations
Metadata / Risk
Management
Knowledge Graphs
AI- Powered Customer
Experiences
Connect unlike objects
such as people to products,
locations
Mobile app explosion
Recommendation engines
Fraud detectors
Desire for more context to
follow connections
Connects like objects
People, computer
networks, telco, etc
68. Business Problem
• Find relationships between people, accounts, shell companies
and offshore accounts
• Journalists are non-technical
• Biggest “Snowden-Style” document leak ever; 11.5 million
documents, 2.6TB of data
Solution and Benefits
• Pulitzer Prize winning investigation resulted in robust
coverage of fraud and corruption
• PM of Iceland & Pakistan resigned, exposed Putin, Prime
Ministers, gangsters, celebrities (Messi)
• Led to assassination of journalist in Malta
Background
• International Consortium of Investigative Journalists (ICIJ),
small team of data journalists
• International investigative team specializing in cross-border
crime, corruption and accountability of power
• Works regularly with leaks and large datasets
ICIJ Panama Papers INVESTIGATIVE JOURNALISM
Fraud Detection / Knowledge Graph
70. Background
• Personal shopping assistant
• Converses with buyer via text, picture and voice
to provide real-time recommendations
• Combines AI and natural language understanding
(NLU) in Neo4j Knowledge Graph
• First of many apps in eBay's AI Platform
Business Problem
• Improve personal context in online shopping
• Transform buyer-provided context into ideal
purchase recommendations over social platforms
• "Feels like talking to a friend"
Solution and Benefits
• 3 developers, 8M nodes, 20M relationships
• Needed high-performance traversals to respond
to live customer requests
• Easy to train new algorithms and grow model
• Generating revenue since launch
eBay for Google Assistant ONLINE RETAIL
Knowledge Graph powers Real-Time Recommendations
EE Customer since 2016 Q3
71. Background
• Over 7M citizens suffer from Diabetes
• Connecting over 400 researchers
• Incorporates over 50 databases, 100k’s of Excel
workbooks, 30 databases of biological samples
• Sought to examine disease from as many angles as
possible.
Business Problem
• Genes are connected by proteins or to metabolites,
and patients are connected with their diets, etc…
• Needed to improve the utilization of immensely
technical data
• Needed to cater to doctors and researchers with
simple navigation, communication and connections
of the graph.
Solution and Benefits
• Dr. Alexander Jarasch, Head of Bioinformatics and
Data Management
• Scientists can conduct parallel research without
asking the same questions or repeating tests
• Built views like a liver sample knowledge graph
DZD - German Center for Diabetes Research
Medical Genomic Research
EE Customer since 2016 Q4