Graph enhancements to AI and ML are changing the landscape of intelligent applications. In this webinar, we’ll focus on using graph feature engineering to improve the accuracy, precision, and recall of machine learning models. You’ll learn how graph algorithms can provide more predictive features as well as aid in feature selection to reduce overfitting. We’ll illustrate a link prediction workflow using Spark and Neo4j to predict collaboration and discuss our missteps and tips to get to measurable improvements.
3. Relationships: Strongest Predictors of Behavior!
“Increasingly we’re learning that you can make better predictions about people by getting all the information from their friends and their friends’ friends than you can from the information you have about the person themselves.”
— James Fowler
(Pictured: David Burkus, James Fowler, Albert-László Barabási)
4. • Graphs for Predictions
• Connected Features
• Link Prediction
• Neo4j + Spark Workflow

Amy E. Hodler
Graph Analytics & AI Program Manager, Neo4j
Amy.Hodler@neo4j.com · @amyhodler

Jennifer Reif
Labs Engineer, Neo4j
Jennifer.Reif@neo4j.com · @JMHReif
5. Native Graph Platforms Are Designed for Connected Data

Traditional platforms: store and retrieve data; real-time storage & retrieval
Big data technology: aggregate and filter data; long-running queries for aggregation & filtering
Native graph platforms: connections in data; real-time connected insights

“Our Neo4j solution is literally thousands of times faster than the prior MySQL solution, with queries that require 10-100 times less code”
— Volker Pacher, Senior Developer
Max # of hops: ~3 before vs. millions with the graph platform
7. Graph in AI Research Is Taking Off

[Chart: research papers on graph-related AI, 2010-2018, mentions in the Dimensions Knowledge System rising toward ~4,000 per year. Terms tracked: graph neural network, graph convolutional, graph embedding, graph learning, graph attention, graph kernel, graph completion]
8. Machine Learning Eats a Lot of Data

Machine learning uses algorithms to train software through specific examples and progressive improvements.
Algorithms iterate, continually adjusting to get closer to an objective, such as error reduction.
This learning requires feeding a lot of data to the model, enabling it to learn how to process and incorporate that information.
9. More Accurate Predictions with the Data You Already Have

• Many data science models ignore network structure & complex relationships
• Graphs add highly predictive features to existing ML models
• Relationships enable predictions that are otherwise unattainable

[Diagram: machine learning pipeline]
13. What Are Connected Features?

Connection-related metrics about our graph, such as the number of relationships going into or out of nodes, a count of potential triangles, or neighbors in common.
14. Deriving Connected Features

Local patterns — Query (e.g. Cypher): real-time, local decisioning and pattern matching. You know what you’re looking for and are making a decision.

Global computation — Graph algorithm libraries: global analysis and iterations. You’re learning the overall structure of a network, updating data, and predicting.
15. Graph Feature Engineering

Feature engineering is how we combine and process the data to create new, more meaningful features, such as clustering or connectivity metrics.

Add more descriptive features through extraction:
• Influence
• Relationships
• Communities
16. Graph Feature Categories & Algorithms

• Pathfinding & Search: finds the optimal paths or evaluates route availability and quality
• Centrality / Importance: determines the importance of distinct nodes in the network
• Community Detection: detects group clustering or partition options
• Heuristic Link Prediction: estimates the likelihood of nodes forming a relationship
• Similarity: evaluates how alike nodes are
• Embeddings: learned representations of connectivity or topology
18. Can we infer new interactions in the future? What unobserved facts are we missing?
19. Example: het.io

50+ years of biomedical data integrated in a knowledge graph.
Predicting new uses for drugs by using the graph structure to create features for link prediction.
21. Using Graph Algorithms

Explore, Plan, Measure: find significant patterns and plan for optimal structures; score outcomes and set a threshold value for a prediction.

Feature Engineering for Machine Learning: use the measures as features to train a model.

1st Node | 2nd Node | Common Neighbors | Preferential Attachment | label
1        | 2        | 4                | 15                      | 1
3        | 4        | 7                | 12                      | 1
5        | 6        | 1                | 1                       | 0
23. Predicting Collaboration with a Graph Enhanced ML Model

• Citation Network Dataset - research dataset
  – “ArnetMiner: Extraction and Mining of Academic Social Networks”, by J. Tang et al.
  – Used a subset with 52K papers, 80K authors, 140K author relationships, and 29K citation relationships
• Neo4j
  – Create a co-authorship graph and do connected feature engineering
• Spark and MLlib
  – Train and test our model using a random forest classifier
24. Our Link Prediction Workflow

Extract Data & Store as Graph → Explore, Clean, Modify → Prepare for Machine Learning → Train Models → Evaluate Results → Productionize
25. Our Link Prediction Workflow

Extract Data & Store as Graph → Explore, Clean, Modify → Prepare for Machine Learning → Train Models → Evaluate Results → Productionize

Extract Data & Store as Graph: import data; create co-author graph
27. Our Link Prediction Workflow

Extract Data & Store as Graph → Explore, Clean, Modify → Prepare for Machine Learning → Train Models → Evaluate Results → Productionize

Extract Data & Store as Graph: import data; create co-author graph
Explore, Clean, Modify: identify sparse feature areas; feature engineering: new graphy features
28. Graph Algorithms Used for Feature Engineering (a few examples)

Preferential Attachment multiplies the number of neighbors for a pair of nodes.
Common Neighbors counts the neighbors two nodes have in common (potential triadic closures).
Illustrations: be.amazd.com/link-prediction/
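As a rough sketch (not Neo4j's implementation), the two heuristics above can be computed directly from an adjacency list; the graph here is a made-up toy example:

```python
# Toy sketch of two link-prediction heuristics.
# Hypothetical adjacency list: node -> set of neighbors (undirected).
graph = {
    1: {2, 3, 4},
    2: {1, 3},
    3: {1, 2, 4},
    4: {1, 3},
}

def common_neighbors(g, a, b):
    """Number of neighbors shared by nodes a and b."""
    return len(g[a] & g[b])

def preferential_attachment(g, a, b):
    """Product of the neighbor counts of a and b."""
    return len(g[a]) * len(g[b])

print(common_neighbors(graph, 1, 3))         # nodes 2 and 4 are shared -> 2
print(preferential_attachment(graph, 1, 3))  # 3 * 3 = 9
```

Neo4j's algorithm library exposes equivalent scoring functions; this sketch only shows what the numbers in the feature tables mean.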
29. Graph Algorithms Used for Feature Engineering (a few examples)

Triangle counting and clustering coefficients measure the density of connections around nodes.
Louvain Modularity identifies interacting communities and hierarchies.
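A local clustering coefficient can likewise be sketched over the same kind of adjacency list (a toy illustration, not the library's algorithm):

```python
from itertools import combinations

# Hypothetical adjacency list: node -> set of neighbors (undirected).
graph = {
    1: {2, 3, 4},
    2: {1, 3},
    3: {1, 2, 4},
    4: {1, 3},
}

def local_clustering(g, node):
    """Fraction of a node's neighbor pairs that are themselves connected."""
    neighbors = g[node]
    if len(neighbors) < 2:
        return 0.0
    links = sum(1 for a, b in combinations(neighbors, 2) if b in g[a])
    possible = len(neighbors) * (len(neighbors) - 1) / 2
    return links / possible

# Node 1's neighbors {2, 3, 4} have 2 of 3 possible links between them.
print(local_clustering(graph, 1))  # 0.666...
```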
30. Our Link Prediction Workflow

Extract Data & Store as Graph → Explore, Clean, Modify → Prepare for Machine Learning → Train Models → Evaluate Results → Productionize

Extract Data & Store as Graph: import data; create co-author graph
Explore, Clean, Modify: identify sparse feature areas; feature engineering: new graphy features
Prepare for Machine Learning: train / test split; resample: downsampled for proportional representation
33. OMG I’m Good!

Did you get really high accuracy on your first run without tuning? Watch out for data leakage: graph metric computation for the train set touching data from the test set.
34. Train and Test Graphs: Time-Based Split

Train (< 2006):
1st Node | 2nd Node | Common Neighbors | Preferential Attachment | label
1        | 2        | 4                | 15                      | 1
3        | 4        | 7                | 12                      | 1
5        | 6        | 1                | 1                       | 0

Test (>= 2006):
1st Node | 2nd Node | Common Neighbors | Preferential Attachment | label
2        | 12       | 3                | 3                       | 0
4        | 9        | 4                | 8                       | 1
7        | 10       | 12               | 36                      | 1
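The time-based split can be sketched as partitioning relationships by year before any graph metrics are computed, so that train-set features never touch test-set data (the records and cutoff year here are illustrative assumptions):

```python
# Hypothetical co-authorship records: (author_a, author_b, year).
collaborations = [
    (1, 2, 2004), (3, 4, 2005), (5, 6, 2005),
    (2, 12, 2006), (4, 9, 2007), (7, 10, 2008),
]

CUTOFF = 2006  # train on earlier collaborations, test on later ones

train = [c for c in collaborations if c[2] < CUTOFF]
test = [c for c in collaborations if c[2] >= CUTOFF]

# Graph features for the train rows must be computed only from the
# train-side graph, otherwise information leaks in from the future.
print(len(train), len(test))  # 3 3
```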
39. Our Link Prediction Workflow

Extract Data & Store as Graph → Explore, Clean, Modify → Prepare for Machine Learning → Train Models → Evaluate Results → Productionize

Extract Data & Store as Graph: import data; create co-author graph
Explore, Clean, Modify: identify sparse feature areas; feature engineering: new graphy features
Prepare for Machine Learning: train / test split; resample: downsampled for proportional representation
Train Models: model selection: random forest (ensemble method)
41. Training Our Model

This is one decision tree in our random forest, used as a binary classifier to learn how to classify a pair: predicting either linked or not linked.
42. 4 Layered Models Trained

Multiple graph features are used to train the models:
• Common Authors Model: Common Authors
• “Graphy” Model: adds Preferential Attachment, Total Neighbors
• Triangles Model: adds Min & Max Triangles, Min & Max Clustering Coefficient
• Community Model: adds Label Propagation, Louvain Modularity
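The cumulative feature sets above can be sketched as follows (the column names are illustrative, not the exact identifiers from the talk's code):

```python
# Each layered model adds features on top of the previous one.
base = ["common_authors"]
graphy = base + ["preferential_attachment", "total_neighbors"]
triangles = graphy + ["min_triangles", "max_triangles",
                      "min_clustering", "max_clustering"]
community = triangles + ["label_propagation", "louvain"]

models = {
    "Common Authors": base,
    "Graphy": graphy,
    "Triangles": triangles,
    "Community": community,
}

# Training the same classifier on each feature set shows how much
# each group of graph features contributes to the result.
for name, cols in models.items():
    print(name, len(cols))
```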
43. Our Link Prediction Workflow

Extract Data & Store as Graph → Explore, Clean, Modify → Prepare for Machine Learning → Train Models → Evaluate Results → Productionize

Extract Data & Store as Graph: import data; create co-author graph
Explore, Clean, Modify: identify sparse feature areas; feature engineering: new graphy features
Prepare for Machine Learning: train / test split; resample: downsampled for proportional representation
Train Models: model selection: random forest (ensemble method)
Evaluate Results: precision, accuracy, recall; ROC curve & AUC
44. Measures

Accuracy: proportion of total predictions that are correct. Beware of skewed data!
Precision: proportion of positive predictions that are correct. Low score = more false positives.
Recall / True Positive Rate: proportion of actual positives that are correctly predicted. Low score = more false negatives.
False Positive Rate: proportion of actual negatives incorrectly predicted positive.
ROC Curve & AUC: X-Y chart plotting the two rates above (TPR against FPR), with the area under the curve summarizing performance.
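These measures can be sketched from a confusion matrix; the labels and predictions below are toy values, not results from the talk:

```python
# Toy true labels and model predictions for a binary classifier.
y_true = [1, 1, 1, 0, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 1, 1, 0, 0]

tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)

accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp)
recall = tp / (tp + fn)   # true positive rate
fpr = fp / (fp + tn)      # false positive rate

print(accuracy, precision, recall, fpr)  # 0.75 0.75 0.75 0.25
```

Spark's MLlib and other libraries compute these directly; the point of the sketch is how each measure reads off the confusion matrix.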
45. Result: First Model ROC & AUC

[ROC chart for Model 1 (Common Authors), annotated with regions of false positives and false negatives]
47. Iteration & Tuning: Feature Influence

For feature importance, the Spark random forest averages the reduction in impurity across all trees in the forest.
Feature rankings are relative to the group of features evaluated.
Also try PageRank! And try removing different features (e.g. Label Propagation).
48. Graph Machine Learning Workflow

Extract Data & Store as Graph → Explore, Clean, Modify → Prepare for Machine Learning → Train Models → Evaluate Results → Productionize

Extract Data & Store as Graph: data aggregation; create and store graphs
Explore, Clean, Modify: identify uninteresting features; cleanse (outliers+)
Prepare for Machine Learning: feature engineering/extraction; train / test split; resample for meaningful representation (proportional, etc.)
Train Models: model & variable selection; hyperparameter tuning; ensemble methods
Evaluate Results: precision, accuracy, recall (ROC curve & AUC); SME review; cross-validation
49. Resources

neo4j.com
• /sandbox
• /developer/graph-algorithms/
• /graphacademy/online-training/

Data & Code:
• This example is from the O’Reilly book: bit.ly/2FPgGVV (ML folder)
• neo4j.com/graph-algorithms-book

Amy.Hodler@neo4j.com · @amyhodler
Jennifer.Reif@neo4j.com · @JMHReif