Graph algorithms and features can improve machine learning predictions by adding relationship data. Connected feature engineering extracts metrics like common neighbors and clustering from graph networks to enrich existing ML models. A demo shows applying graph algorithms to book character data to predict new relationships, highlighting how graphs increase predictive power through network structure.
Improve ML Predictions using Graph Analytics (today!)
1. Improve ML Predictions Using Graph Algorithms (Today!)
Amy E. Hodler, Graph Analytics and AI Program Manager, Neo4j
San Francisco, May 2019
Amy.Hodler@neo4j.com | @amyhodler
5. Graphs Increase the Predictive Power of AI with the Data You Already Have
• Current data science models ignore network structure
• Graphs add highly predictive features to existing ML models
• Otherwise unattainable predictions based on relationships
(Diagram: machine learning pipeline)
6. Steps Forward in Graph Data Science
• Graph Persistence
• Knowledge Graphs
• Connected Feature Engineering
• Graph Native Learning
7. Graph Feature Engineering
Feature engineering is how we combine and process the data to create new, more meaningful features, such as clustering or connectivity metrics.
Add more descriptive features:
• Influence
• Relationships
• Communities
8. Graph Machine Learning Workflow
• Extract Data & Store as Graph: data aggregation; create and store graphs
• Explore, Clean, Modify: identify uninteresting features; cleanse (outliers, etc.)
• Prepare for Machine Learning: feature engineering/extraction; train/test split; resample for meaningful representation (proportional, etc.)
• Train Models: cross-validation; model & variable selection; hyperparameter tuning; ensemble methods
• Evaluate Results: precision, accuracy, recall (ROC curve & AUC); SME review
• Productionize
10. Can we infer new interactions in the future? What unobserved facts are we missing?
11. Methods for Link Prediction
Algorithm Measures: run targeted algorithms and score outcomes, then set a threshold value used to predict a link between nodes.
Machine Learning: use the measures as features to train an ML model.

1st Node | 2nd Node | Common Neighbors | Preferential Attachment | Label
1        | 2        | 4                | 15                      | 1
3        | 4        | 7                | 12                      | 1
5        | 6        | 1                | 1                       | 0
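Measures like those in the table can be computed directly from an adjacency structure. A minimal Python sketch on an invented toy graph (the node IDs and scores here are illustrative and do not reproduce the table's values):

```python
# Sketch: score candidate pairs with two link-prediction measures,
# assuming an undirected graph stored as a dict of neighbor sets.
graph = {
    1: {2, 3, 5, 7},
    2: {1, 3, 5},
    3: {1, 2, 4},
    4: {3},
    5: {1, 2},
    7: {1},
}

def common_neighbors(g, a, b):
    """Number of nodes adjacent to both a and b."""
    return len(g[a] & g[b])

def preferential_attachment(g, a, b):
    """Product of the two nodes' degrees ("rich get richer")."""
    return len(g[a]) * len(g[b])

# Feature rows for candidate pairs, analogous to the table above.
pairs = [(1, 2), (4, 5)]
rows = [(a, b, common_neighbors(graph, a, b),
         preferential_attachment(graph, a, b)) for a, b in pairs]
```

Either a threshold on a single score or a trained model over all the scores can then decide "linked" vs "not linked".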
12. Predicting Collaboration with a Graph-Enhanced ML Model
• Citation Network Dataset (research dataset)
– “ArnetMiner: Extraction and Mining of Academic Social Networks”, by J. Tang et al.
– Used a subset with 52K papers, 80K authors, 140K author relationships, and 29K citation relationships
• Neo4j
– Create a co-authorship graph and connected feature engineering
• Spark and MLlib
– Train and test our model using a random forest classifier
13. 4 Models: Multiple Graph Features
• Common Authors Model: common authors
• “Graphy” Model: preferential attachment; total neighbors
• Triangles Model: min & max triangles; min & max clustering coefficient
• Community Model: label propagation; Louvain modularity
Trial, trial, and error!
16. OMG I’m Good!
Did you get really high accuracy on your first run without tuning? Data leakage!
Graph metric computation for the train set touches data from the test set.
17. Train and Test Graphs: Time-Based Split

Train (< 2006):
1st Node | 2nd Node | Common Neighbors | Preferential Attachment | Label
1        | 2        | 4                | 15                      | 1
3        | 4        | 7                | 12                      | 1
5        | 6        | 1                | 1                       | 0

Test (>= 2006):
1st Node | 2nd Node | Common Neighbors | Preferential Attachment | Label
2        | 12       | 3                | 3                       | 0
4        | 9        | 4                | 8                       | 1
7        | 10       | 12               | 36                      | 1
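A time-based split like this can be sketched in a few lines, assuming each edge carries a year (the edge tuples below are invented for illustration):

```python
# Sketch: split collaboration edges into train/test by year so that
# train-set graph features never touch test-period data.
edges = [
    (1, 2, 2003), (3, 4, 2005), (5, 6, 2004),    # earlier collaborations
    (2, 12, 2007), (4, 9, 2006), (7, 10, 2008),  # later collaborations
]

CUTOFF = 2006
train_edges = [(a, b) for a, b, year in edges if year < CUTOFF]
test_edges = [(a, b) for a, b, year in edges if year >= CUTOFF]
```

The key discipline: graph measures for the train rows must be computed on the train-period graph only; the test-period graph is used only when scoring test rows.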
22. Training Our Model
This is one decision tree in our Random Forest, used as a binary classifier to learn how to classify a pair: predicting either linked or not linked.
25. Graph Feature Influence for Tuning
• For feature importance, the Spark random forest averages the reduction in impurity across all trees in the forest.
• Feature rankings are relative to the group of features evaluated.
• Also try PageRank! Try removing some features (e.g. Label Propagation).
26. Graph Feature Selection
Feature selection is how we reduce the number of features used in a model to a relevant subset. This can be done algorithmically or based on domain expertise, but the objective is to maximize the predictive power of your model while minimizing overfitting.
28. Graph Feature Categories & Algorithms
• Pathfinding & Search: finds the optimal paths or evaluates route availability and quality
• Centrality / Importance: determines the importance of distinct nodes in the network
• Community Detection: detects group clustering or partition options
• Heuristic Link Prediction: estimates the likelihood of nodes forming a relationship
• Similarity: evaluates how alike nodes are
• Embeddings: learned representations of connectivity or topology
29. Triangles and Clustering Coefficient
• Basic network analysis, e.g. “small-world” structures, stability
• ML features: tightness of groups; probability of links
The clustering coefficient is the probability that a node’s neighbors are connected.
(Diagram: two graphs where node u has Triangles = 2 but CC = 0.33 vs. CC = 0.2, depending on its degree.)
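Both measures are short to compute from an adjacency structure. A sketch on a toy graph chosen so that node u reproduces the Triangles = 2, CC = 0.33 case from the diagram:

```python
# Sketch: triangle count and local clustering coefficient for an
# undirected graph stored as a dict of neighbor sets (toy data).
graph = {
    "u": {"a", "b", "c", "d"},
    "a": {"u", "b"},
    "b": {"u", "a", "c"},
    "c": {"u", "b"},
    "d": {"u"},
}

def triangles(g, n):
    """Triangles through n: connected pairs among n's neighbors."""
    nbrs = list(g[n])
    return sum(1 for i in range(len(nbrs)) for j in range(i + 1, len(nbrs))
               if nbrs[j] in g[nbrs[i]])

def clustering_coefficient(g, n):
    """Fraction of possible neighbor pairs that are actually connected."""
    k = len(g[n])
    if k < 2:
        return 0.0
    return 2 * triangles(g, n) / (k * (k - 1))
```

Here u has four neighbors (six possible pairs), of which two pairs are connected, so CC = 2 * 2 / (4 * 3) = 0.33.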
30. Label Propagation
• Nodes adopt labels based on their neighbors to infer clusters
• Great for proposing an initial, large-scale clustering
• Graph ML feature: group membership (classification)
(Diagram: node labels updating over iterations)
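A rough sketch of synchronous label propagation on an invented six-node graph (two triangles joined by a bridge); production implementations such as Neo4j's add randomized update order and convergence checks:

```python
# Sketch: each node repeatedly adopts the most common label among its
# neighbors; ties broken by smallest label for determinism (toy data).
from collections import Counter

graph = {
    1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4},
    4: {3, 5, 6}, 5: {4, 6}, 6: {4, 5},
}

labels = {n: n for n in graph}     # start: every node is its own label
for _ in range(10):                # fixed iteration budget
    new = {}
    for n, nbrs in graph.items():
        counts = Counter(labels[m] for m in nbrs)
        best = max(counts.values())
        new[n] = min(l for l, c in counts.items() if c == best)
    labels = new
```

On this graph the labels settle into two communities, one per triangle; the resulting group membership is the categorical feature a downstream model consumes.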
31. PageRank
• Broad influence based on transitive relationships and the originating node’s influence
• “Golfing with the CEO”
• Graph ML features: score top influencers; rank influence; contextual ranking
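A toy power-iteration PageRank with damping 0.85; this is a sketch of the general algorithm on invented data, not any particular library's implementation:

```python
# Sketch: iterative PageRank over a dict of outgoing links.
# Each node spreads its current rank evenly across its out-links.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
d, n = 0.85, len(graph)

rank = {node: 1 / n for node in graph}
for _ in range(50):
    new = {node: (1 - d) / n for node in graph}
    for node, outs in graph.items():
        share = rank[node] / len(outs) if outs else 0
        for out in outs:
            new[out] += d * share
    rank = new
```

Node "c" collects rank from three in-links and ends up on top, while "d" (no in-links) keeps only the base (1 - d)/n share, illustrating the transitive-influence idea above.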
32. Jaccard Similarity
• Measure of proportional similarity between nodes
• Recommendation of similar items
• Graph ML feature: a coefficient representing the similarity of nodes, often used as part of link prediction
(Diagram: Venn diagrams of overlapping neighbor sets A and B)
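The coefficient is just intersection over union of the two nodes' neighbor sets; a minimal sketch with made-up neighbor sets:

```python
# Sketch: Jaccard similarity of two nodes' neighbor sets.
def jaccard(s, t):
    """|intersection| / |union|; 0.0 for two empty sets."""
    return len(s & t) / len(s | t) if s | t else 0.0

a = {"x", "y", "z"}   # neighbors of node A (toy data)
b = {"y", "z", "w"}   # neighbors of node B
sim = jaccard(a, b)   # 2 shared neighbors out of 4 distinct
```

Identical neighborhoods score 1.0, disjoint ones 0.0, making the value easy to use directly as a model feature.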
33. Preferential Attachment
• Measures closeness by multiplying the number of connections two nodes have
• “Rich get richer…”
• Graph ML feature: probability of relationships forming
Illustration: be.amazd.com/link-prediction/
34. Common Neighbors
• Based on the number of potential / closing triangles
• Two strangers with a lot of friends in common…
• Graph ML feature: probability of relationships forming
• Weight with Adamic-Adar
Illustration: be.amazd.com/link-prediction/
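The Adamic-Adar variant can be sketched as follows: each shared neighbor contributes 1 / log(degree), so a selective mutual contact counts more than a hub. The toy graph gives pair (a, b) one low-degree and one high-degree common neighbor:

```python
# Sketch: Common Neighbors weighted by Adamic-Adar (toy data).
import math

graph = {
    "a": {"x", "y"},
    "b": {"x", "y"},
    "x": {"a", "b"},                  # degree 2: strong signal
    "y": {"a", "b", "p", "q", "r"},   # degree 5: weaker signal
    "p": {"y"}, "q": {"y"}, "r": {"y"},
}

def adamic_adar(g, u, v):
    """Sum 1/log(degree) over the common neighbors of u and v."""
    return sum(1 / math.log(len(g[z])) for z in g[u] & g[v])
```

The plain common-neighbor count would treat x and y equally; here x contributes roughly 1.44 and the well-connected y only about 0.62.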
35. DEMO: Data
• Game of Thrones – Books: 800 nodes, 2,900 relationships (weighted by interactions)
• Game of Thrones – TV series: 400 nodes, 565,800 relationships
Graph built from Andrew Beveridge’s script-to-graph extraction.