Graph algorithms and features can improve machine learning predictions by adding relationship data. Connected feature engineering extracts metrics like common neighbors and clustering from graph networks to enrich existing ML models. A demo shows applying graph algorithms to book character data to predict new relationships, highlighting how graphs increase predictive power through network structure.
Improve ML Predictions using Graph Analytics (today!)
1. Improve ML Predictions Using Graph Algorithms (Today!)
Amy E. Hodler, Graph Analytics and AI Program Manager, Neo4j
San Francisco, May 2019
Amy.Hodler@neo4j.com | @amyhodler
5. Graphs Increase the Predictive Power of AI with the Data You Already Have
• Current data science models ignore network structure
• Graphs add highly predictive features to existing ML models
• Otherwise unattainable predictions based on relationships
(Diagram: machine learning pipeline)
6. Steps Forward in Graph Data Science
• Graph Persistence
• Knowledge Graphs
• Connected Feature Engineering
• Graph Native Learning
7. Graph Feature Engineering
Feature engineering is how we combine and process the data to create new, more meaningful features, such as clustering or connectivity metrics.
Add more descriptive features:
• Influence
• Relationships
• Communities
8. Graph Machine Learning Workflow
• Extract Data & Store as Graph: data aggregation; create and store graphs
• Explore, Clean, Modify: identify uninteresting features; cleanse (outliers, etc.)
• Prepare for Machine Learning: feature engineering/extraction; train/test split; resample for meaningful representation (proportional, etc.)
• Train Models: cross-validation; model & variable selection; hyperparameter tuning; ensemble methods
• Evaluate Results: precision, accuracy, recall (ROC curve & AUC); SME review
• Productionize
10. Can we infer new interactions in the future? What unobserved facts are we missing?
11. Methods for Link Prediction
Algorithm Measures: run targeted algorithms and score outcomes, then set a threshold value used to predict a link between nodes.
Machine Learning: use the measures as features to train an ML model.

1st Node | 2nd Node | Common Neighbors | Preferential Attachment | Label
1        | 2        | 4                | 15                      | 1
3        | 4        | 7                | 12                      | 1
5        | 6        | 1                | 1                       | 0
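Measures like those in the table can be computed directly from an adjacency structure. A minimal Python sketch on an invented toy graph (the node IDs and scores here are illustrative and do not reproduce the table's values):

```python
# Sketch: score candidate pairs with two link-prediction measures,
# assuming an undirected graph stored as a dict of neighbor sets.
graph = {
    1: {2, 3, 5, 7},
    2: {1, 3, 5},
    3: {1, 2, 4},
    4: {3},
    5: {1, 2},
    7: {1},
}

def common_neighbors(g, a, b):
    """Number of nodes adjacent to both a and b."""
    return len(g[a] & g[b])

def preferential_attachment(g, a, b):
    """Product of the two nodes' degrees ("rich get richer")."""
    return len(g[a]) * len(g[b])

# Feature rows for candidate pairs, analogous to the table above.
pairs = [(1, 2), (4, 5)]
rows = [(a, b, common_neighbors(graph, a, b),
         preferential_attachment(graph, a, b)) for a, b in pairs]
```

Either a threshold on a single score or a trained model over all the scores can then decide "linked" vs "not linked".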
12. Predicting Collaboration with a Graph-Enhanced ML Model
• Citation Network Dataset (research dataset)
– “ArnetMiner: Extraction and Mining of Academic Social Networks”, by J. Tang et al.
– Used a subset with 52K papers, 80K authors, 140K author relationships, and 29K citation relationships
• Neo4j
– Create a co-authorship graph and connected feature engineering
• Spark and MLlib
– Train and test our model using a random forest classifier
13. 4 Models: Multiple Graph Features
• Common Authors Model: common authors
• “Graphy” Model: preferential attachment; total neighbors
• Triangles Model: min & max triangles; min & max clustering coefficient
• Community Model: label propagation; Louvain modularity
Trial, trial, and error!
16. OMG I’m Good!
Did you get really high accuracy on your first run without tuning? Data leakage!
Graph metric computation for the train set touches data from the test set.
17. Train and Test Graphs: Time-Based Split

Train (< 2006):
1st Node | 2nd Node | Common Neighbors | Preferential Attachment | Label
1        | 2        | 4                | 15                      | 1
3        | 4        | 7                | 12                      | 1
5        | 6        | 1                | 1                       | 0

Test (>= 2006):
1st Node | 2nd Node | Common Neighbors | Preferential Attachment | Label
2        | 12       | 3                | 3                       | 0
4        | 9        | 4                | 8                       | 1
7        | 10       | 12               | 36                      | 1
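A time-based split like this can be sketched in a few lines, assuming each edge carries a year (the edge tuples below are invented for illustration):

```python
# Sketch: split collaboration edges into train/test by year so that
# train-set graph features never touch test-period data.
edges = [
    (1, 2, 2003), (3, 4, 2005), (5, 6, 2004),    # earlier collaborations
    (2, 12, 2007), (4, 9, 2006), (7, 10, 2008),  # later collaborations
]

CUTOFF = 2006
train_edges = [(a, b) for a, b, year in edges if year < CUTOFF]
test_edges = [(a, b) for a, b, year in edges if year >= CUTOFF]
```

The key discipline: graph measures for the train rows must be computed on the train-period graph only; the test-period graph is used only when scoring test rows.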
22. Training Our Model
This is one decision tree in our Random Forest, used as a binary classifier to learn how to classify a pair: predicting either linked or not linked.
25. Graph Feature Influence for Tuning
• For feature importance, the Spark random forest averages the reduction in impurity across all trees in the forest.
• Feature rankings are relative to the group of features evaluated.
• Also try PageRank! Try removing some features (e.g. Label Propagation).
26. Graph Feature Selection
Feature selection is how we reduce the number of features used in a model to a relevant subset. This can be done algorithmically or based on domain expertise, but the objective is to maximize the predictive power of your model while minimizing overfitting.
28. Graph Feature Categories & Algorithms
• Pathfinding & Search: finds the optimal paths or evaluates route availability and quality
• Centrality / Importance: determines the importance of distinct nodes in the network
• Community Detection: detects group clustering or partition options
• Heuristic Link Prediction: estimates the likelihood of nodes forming a relationship
• Similarity: evaluates how alike nodes are
• Embeddings: learned representations of connectivity or topology
29. Triangles and Clustering Coefficient
• Basic network analysis, e.g. “small-world” structures, stability
• ML features: tightness of groups; probability of links
The clustering coefficient is the probability that a node’s neighbors are connected.
(Diagram: two graphs where node u has Triangles = 2 but CC = 0.33 vs. CC = 0.2, depending on its degree.)
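Both measures are short to compute from an adjacency structure. A sketch on a toy graph chosen so that node u reproduces the Triangles = 2, CC = 0.33 case from the diagram:

```python
# Sketch: triangle count and local clustering coefficient for an
# undirected graph stored as a dict of neighbor sets (toy data).
graph = {
    "u": {"a", "b", "c", "d"},
    "a": {"u", "b"},
    "b": {"u", "a", "c"},
    "c": {"u", "b"},
    "d": {"u"},
}

def triangles(g, n):
    """Triangles through n: connected pairs among n's neighbors."""
    nbrs = list(g[n])
    return sum(1 for i in range(len(nbrs)) for j in range(i + 1, len(nbrs))
               if nbrs[j] in g[nbrs[i]])

def clustering_coefficient(g, n):
    """Fraction of possible neighbor pairs that are actually connected."""
    k = len(g[n])
    if k < 2:
        return 0.0
    return 2 * triangles(g, n) / (k * (k - 1))
```

Here u has four neighbors (six possible pairs), of which two pairs are connected, so CC = 2 * 2 / (4 * 3) = 0.33.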
30. Label Propagation
• Nodes adopt labels based on their neighbors to infer clusters
• Great for proposing an initial, large-scale clustering
• Graph ML feature: group membership (classification)
(Diagram: node labels updating over iterations)
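A rough sketch of synchronous label propagation on an invented six-node graph (two triangles joined by a bridge); production implementations such as Neo4j's add randomized update order and convergence checks:

```python
# Sketch: each node repeatedly adopts the most common label among its
# neighbors; ties broken by smallest label for determinism (toy data).
from collections import Counter

graph = {
    1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4},
    4: {3, 5, 6}, 5: {4, 6}, 6: {4, 5},
}

labels = {n: n for n in graph}     # start: every node is its own label
for _ in range(10):                # fixed iteration budget
    new = {}
    for n, nbrs in graph.items():
        counts = Counter(labels[m] for m in nbrs)
        best = max(counts.values())
        new[n] = min(l for l, c in counts.items() if c == best)
    labels = new
```

On this graph the labels settle into two communities, one per triangle; the resulting group membership is the categorical feature a downstream model consumes.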
31. PageRank
• Broad influence based on transitive relationships and the originating node’s influence
• “Golfing with the CEO”
• Graph ML features: score top influencers; rank influence; contextual ranking
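A toy power-iteration PageRank with damping 0.85; this is a sketch of the general algorithm on invented data, not any particular library's implementation:

```python
# Sketch: iterative PageRank over a dict of outgoing links.
# Each node spreads its current rank evenly across its out-links.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
d, n = 0.85, len(graph)

rank = {node: 1 / n for node in graph}
for _ in range(50):
    new = {node: (1 - d) / n for node in graph}
    for node, outs in graph.items():
        share = rank[node] / len(outs) if outs else 0
        for out in outs:
            new[out] += d * share
    rank = new
```

Node "c" collects rank from three in-links and ends up on top, while "d" (no in-links) keeps only the base (1 - d)/n share, illustrating the transitive-influence idea above.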
32. Jaccard Similarity
• Measure of proportional similarity between nodes
• Recommendation of similar items
• Graph ML feature: a coefficient representing the similarity of nodes, often used as part of link prediction
(Diagram: Venn diagrams of overlapping neighbor sets A and B)
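The coefficient is just intersection over union of the two nodes' neighbor sets; a minimal sketch with made-up neighbor sets:

```python
# Sketch: Jaccard similarity of two nodes' neighbor sets.
def jaccard(s, t):
    """|intersection| / |union|; 0.0 for two empty sets."""
    return len(s & t) / len(s | t) if s | t else 0.0

a = {"x", "y", "z"}   # neighbors of node A (toy data)
b = {"y", "z", "w"}   # neighbors of node B
sim = jaccard(a, b)   # 2 shared neighbors out of 4 distinct
```

Identical neighborhoods score 1.0, disjoint ones 0.0, making the value easy to use directly as a model feature.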
33. Preferential Attachment
• Measures closeness by multiplying the number of connections two nodes have
• “Rich get richer…”
• Graph ML feature: probability of relationships forming
Illustration: be.amazd.com/link-prediction/
34. Common Neighbors
• Based on the number of potential / closing triangles
• Two strangers with a lot of friends in common…
• Graph ML feature: probability of relationships forming
• Weight with Adamic-Adar
Illustration: be.amazd.com/link-prediction/
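The Adamic-Adar variant can be sketched as follows: each shared neighbor contributes 1 / log(degree), so a selective mutual contact counts more than a hub. The toy graph gives pair (a, b) one low-degree and one high-degree common neighbor:

```python
# Sketch: Common Neighbors weighted by Adamic-Adar (toy data).
import math

graph = {
    "a": {"x", "y"},
    "b": {"x", "y"},
    "x": {"a", "b"},                  # degree 2: strong signal
    "y": {"a", "b", "p", "q", "r"},   # degree 5: weaker signal
    "p": {"y"}, "q": {"y"}, "r": {"y"},
}

def adamic_adar(g, u, v):
    """Sum 1/log(degree) over the common neighbors of u and v."""
    return sum(1 / math.log(len(g[z])) for z in g[u] & g[v])
```

The plain common-neighbor count would treat x and y equally; here x contributes roughly 1.44 and the well-connected y only about 0.62.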
35. DEMO: Data
• Game of Thrones – Books: 800 nodes, 2,900 relationships (weighted by interactions)
• Game of Thrones – TV series: 400 nodes, 565,800 relationships
Graph built from Andrew Beveridge’s script-to-graph extraction.