Graph enhancements to AI and ML are changing the landscape of intelligent applications. In this webinar, we’ll focus on using graph feature engineering to improve the accuracy, precision, and recall of machine learning models. You’ll learn how graph algorithms can provide more predictive features as well as aid in feature selection to reduce overfitting. We’ll illustrate a link prediction workflow using Spark and Neo4j to predict collaboration and discuss our missteps and tips to get to measurable improvements.
3. Relationships: Strongest Predictors of Behavior!
“Increasingly we’re learning that you can make better predictions about people by getting all the information from their friends and their friends’ friends than you can from the information you have about the person themselves.”
— James Fowler
(Pictured: David Burkus, James Fowler, Albert-László Barabási)
4. • Graphs for Predictions
• Connected Features
• Link Prediction
• Neo4j + Spark Workflow

Amy E. Hodler
Graph Analytics & AI Program Manager, Neo4j
Amy.Hodler@neo4j.com · @amyhodler

Jennifer Reif
Labs Engineer, Neo4j
Jennifer.Reif@neo4j.com · @JMHReif
5. Native Graph Platforms Are Designed for Connected Data

Traditional platforms: store and retrieve data; real-time storage & retrieval
Big data technology: aggregate and filter data; long-running queries for aggregation & filtering
Native graph platforms: connections in data; real-time connected insights

“Our Neo4j solution is literally thousands of times faster than the prior MySQL solution, with queries that require 10-100 times less code”
— Volker Pacher, Senior Developer
Max # of hops: ~3 before vs. millions with the graph platform
7. Graph in AI Research Is Taking Off

[Chart: research papers on graph-related AI, 2010-2018, mentions in the Dimensions Knowledge System rising toward ~4,000 per year. Terms tracked: graph neural network, graph convolutional, graph embedding, graph learning, graph attention, graph kernel, graph completion]
8. Machine Learning Eats a Lot of Data

Machine learning uses algorithms to train software through specific examples and progressive improvements.
Algorithms iterate, continually adjusting to get closer to an objective, such as error reduction.
This learning requires feeding a lot of data to the model, enabling it to learn how to process and incorporate that information.
9. More Accurate Predictions with the Data You Already Have

• Many data science models ignore network structure & complex relationships
• Graphs add highly predictive features to existing ML models
• Relationships enable predictions that are otherwise unattainable

[Diagram: machine learning pipeline]
13. What Are Connected Features?

Connection-related metrics about our graph, such as the number of relationships going into or out of nodes, a count of potential triangles, or neighbors in common.
14. Deriving Connected Features

Local patterns — Query (e.g. Cypher): real-time, local decisioning and pattern matching. You know what you’re looking for and are making a decision.

Global computation — Graph algorithm libraries: global analysis and iterations. You’re learning the overall structure of a network, updating data, and predicting.
15. Graph Feature Engineering

Feature engineering is how we combine and process the data to create new, more meaningful features, such as clustering or connectivity metrics.

Add more descriptive features through extraction:
• Influence
• Relationships
• Communities
16. Graph Feature Categories & Algorithms

• Pathfinding & Search: finds the optimal paths or evaluates route availability and quality
• Centrality / Importance: determines the importance of distinct nodes in the network
• Community Detection: detects group clustering or partition options
• Heuristic Link Prediction: estimates the likelihood of nodes forming a relationship
• Similarity: evaluates how alike nodes are
• Embeddings: learned representations of connectivity or topology
18. Can we infer new interactions in the future? What unobserved facts are we missing?
19. Example: het.io

50+ years of biomedical data integrated in a knowledge graph.
Predicting new uses for drugs by using the graph structure to create features for link prediction.
21. Using Graph Algorithms

Explore, Plan, Measure: find significant patterns and plan for optimal structures; score outcomes and set a threshold value for a prediction.

Feature Engineering for Machine Learning: use the measures as features to train a model.

1st Node | 2nd Node | Common Neighbors | Preferential Attachment | label
1        | 2        | 4                | 15                      | 1
3        | 4        | 7                | 12                      | 1
5        | 6        | 1                | 1                       | 0
23. Predicting Collaboration with a Graph Enhanced ML Model

• Citation Network Dataset - research dataset
  – “ArnetMiner: Extraction and Mining of Academic Social Networks”, by J. Tang et al.
  – Used a subset with 52K papers, 80K authors, 140K author relationships, and 29K citation relationships
• Neo4j
  – Create a co-authorship graph and do connected feature engineering
• Spark and MLlib
  – Train and test our model using a random forest classifier
24. Our Link Prediction Workflow

Extract Data & Store as Graph → Explore, Clean, Modify → Prepare for Machine Learning → Train Models → Evaluate Results → Productionize
25. Our Link Prediction Workflow

Extract Data & Store as Graph → Explore, Clean, Modify → Prepare for Machine Learning → Train Models → Evaluate Results → Productionize

Extract Data & Store as Graph: import data; create co-author graph
27. Our Link Prediction Workflow

Extract Data & Store as Graph → Explore, Clean, Modify → Prepare for Machine Learning → Train Models → Evaluate Results → Productionize

Extract Data & Store as Graph: import data; create co-author graph
Explore, Clean, Modify: identify sparse feature areas; feature engineering: new graphy features
28. Graph Algorithms Used for Feature Engineering (a few examples)

Preferential Attachment multiplies the number of neighbors for a pair of nodes.
Common Neighbors counts the neighbors two nodes have in common (potential triadic closures).
Illustrations: be.amazd.com/link-prediction/
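As a rough sketch (not Neo4j's implementation), the two heuristics above can be computed directly from an adjacency list; the graph here is a made-up toy example:

```python
# Toy sketch of two link-prediction heuristics.
# Hypothetical adjacency list: node -> set of neighbors (undirected).
graph = {
    1: {2, 3, 4},
    2: {1, 3},
    3: {1, 2, 4},
    4: {1, 3},
}

def common_neighbors(g, a, b):
    """Number of neighbors shared by nodes a and b."""
    return len(g[a] & g[b])

def preferential_attachment(g, a, b):
    """Product of the neighbor counts of a and b."""
    return len(g[a]) * len(g[b])

print(common_neighbors(graph, 1, 3))         # nodes 2 and 4 are shared -> 2
print(preferential_attachment(graph, 1, 3))  # 3 * 3 = 9
```

Neo4j's algorithm library exposes equivalent scoring functions; this sketch only shows what the numbers in the feature tables mean.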
29. Graph Algorithms Used for Feature Engineering (a few examples)

Triangle counting and clustering coefficients measure the density of connections around nodes.
Louvain Modularity identifies interacting communities and hierarchies.
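A local clustering coefficient can likewise be sketched over the same kind of adjacency list (a toy illustration, not the library's algorithm):

```python
from itertools import combinations

# Hypothetical adjacency list: node -> set of neighbors (undirected).
graph = {
    1: {2, 3, 4},
    2: {1, 3},
    3: {1, 2, 4},
    4: {1, 3},
}

def local_clustering(g, node):
    """Fraction of a node's neighbor pairs that are themselves connected."""
    neighbors = g[node]
    if len(neighbors) < 2:
        return 0.0
    links = sum(1 for a, b in combinations(neighbors, 2) if b in g[a])
    possible = len(neighbors) * (len(neighbors) - 1) / 2
    return links / possible

# Node 1's neighbors {2, 3, 4} have 2 of 3 possible links between them.
print(local_clustering(graph, 1))  # 0.666...
```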
30. Our Link Prediction Workflow

Extract Data & Store as Graph → Explore, Clean, Modify → Prepare for Machine Learning → Train Models → Evaluate Results → Productionize

Extract Data & Store as Graph: import data; create co-author graph
Explore, Clean, Modify: identify sparse feature areas; feature engineering: new graphy features
Prepare for Machine Learning: train / test split; resample: downsampled for proportional representation
33. OMG I’m Good!

Did you get really high accuracy on your first run without tuning? Watch out for data leakage: graph metric computation for the train set touching data from the test set.
34. Train and Test Graphs: Time-Based Split

Train (< 2006):
1st Node | 2nd Node | Common Neighbors | Preferential Attachment | label
1        | 2        | 4                | 15                      | 1
3        | 4        | 7                | 12                      | 1
5        | 6        | 1                | 1                       | 0

Test (>= 2006):
1st Node | 2nd Node | Common Neighbors | Preferential Attachment | label
2        | 12       | 3                | 3                       | 0
4        | 9        | 4                | 8                       | 1
7        | 10       | 12               | 36                      | 1
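The time-based split can be sketched as partitioning relationships by year before any graph metrics are computed, so that train-set features never touch test-set data (the records and cutoff year here are illustrative assumptions):

```python
# Hypothetical co-authorship records: (author_a, author_b, year).
collaborations = [
    (1, 2, 2004), (3, 4, 2005), (5, 6, 2005),
    (2, 12, 2006), (4, 9, 2007), (7, 10, 2008),
]

CUTOFF = 2006  # train on earlier collaborations, test on later ones

train = [c for c in collaborations if c[2] < CUTOFF]
test = [c for c in collaborations if c[2] >= CUTOFF]

# Graph features for the train rows must be computed only from the
# train-side graph, otherwise information leaks in from the future.
print(len(train), len(test))  # 3 3
```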
39. Our Link Prediction Workflow

Extract Data & Store as Graph → Explore, Clean, Modify → Prepare for Machine Learning → Train Models → Evaluate Results → Productionize

Extract Data & Store as Graph: import data; create co-author graph
Explore, Clean, Modify: identify sparse feature areas; feature engineering: new graphy features
Prepare for Machine Learning: train / test split; resample: downsampled for proportional representation
Train Models: model selection: random forest (ensemble method)
41. Training Our Model

This is one decision tree in our random forest, used as a binary classifier to learn how to classify a pair: predicting either linked or not linked.
42. 4 Layered Models Trained

Multiple graph features are used to train the models:
• Common Authors Model: Common Authors
• “Graphy” Model: adds Preferential Attachment, Total Neighbors
• Triangles Model: adds Min & Max Triangles, Min & Max Clustering Coefficient
• Community Model: adds Label Propagation, Louvain Modularity
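The cumulative feature sets above can be sketched as follows (the column names are illustrative, not the exact identifiers from the talk's code):

```python
# Each layered model adds features on top of the previous one.
base = ["common_authors"]
graphy = base + ["preferential_attachment", "total_neighbors"]
triangles = graphy + ["min_triangles", "max_triangles",
                      "min_clustering", "max_clustering"]
community = triangles + ["label_propagation", "louvain"]

models = {
    "Common Authors": base,
    "Graphy": graphy,
    "Triangles": triangles,
    "Community": community,
}

# Training the same classifier on each feature set shows how much
# each group of graph features contributes to the result.
for name, cols in models.items():
    print(name, len(cols))
```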
43. Our Link Prediction Workflow

Extract Data & Store as Graph → Explore, Clean, Modify → Prepare for Machine Learning → Train Models → Evaluate Results → Productionize

Extract Data & Store as Graph: import data; create co-author graph
Explore, Clean, Modify: identify sparse feature areas; feature engineering: new graphy features
Prepare for Machine Learning: train / test split; resample: downsampled for proportional representation
Train Models: model selection: random forest (ensemble method)
Evaluate Results: precision, accuracy, recall; ROC curve & AUC
44. Measures

Accuracy: proportion of total predictions that are correct. Beware of skewed data!
Precision: proportion of positive predictions that are correct. Low score = more false positives.
Recall / True Positive Rate: proportion of actual positives that are correctly predicted. Low score = more false negatives.
False Positive Rate: proportion of actual negatives incorrectly predicted positive.
ROC Curve & AUC: X-Y chart plotting the two rates above (TPR against FPR), with the area under the curve summarizing performance.
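These measures can be sketched from a confusion matrix; the labels and predictions below are toy values, not results from the talk:

```python
# Toy true labels and model predictions for a binary classifier.
y_true = [1, 1, 1, 0, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 1, 1, 0, 0]

tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)

accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp)
recall = tp / (tp + fn)   # true positive rate
fpr = fp / (fp + tn)      # false positive rate

print(accuracy, precision, recall, fpr)  # 0.75 0.75 0.75 0.25
```

Spark's MLlib and other libraries compute these directly; the point of the sketch is how each measure reads off the confusion matrix.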
45. Result: First Model ROC & AUC

[ROC chart for Model 1 (Common Authors), annotated with regions of false positives and false negatives]
47. Iteration & Tuning: Feature Influence

For feature importance, the Spark random forest averages the reduction in impurity across all trees in the forest.
Feature rankings are relative to the group of features evaluated.
Also try PageRank! And try removing different features (e.g. Label Propagation).
48. Graph Machine Learning Workflow

Extract Data & Store as Graph → Explore, Clean, Modify → Prepare for Machine Learning → Train Models → Evaluate Results → Productionize

Extract Data & Store as Graph: data aggregation; create and store graphs
Explore, Clean, Modify: identify uninteresting features; cleanse (outliers+)
Prepare for Machine Learning: feature engineering/extraction; train / test split; resample for meaningful representation (proportional, etc.)
Train Models: model & variable selection; hyperparameter tuning; ensemble methods
Evaluate Results: precision, accuracy, recall (ROC curve & AUC); SME review; cross-validation
49. Resources

neo4j.com
• /sandbox
• /developer/graph-algorithms/
• /graphacademy/online-training/

Data & Code:
• This example is from the O’Reilly book: bit.ly/2FPgGVV (ML folder)
• neo4j.com/graph-algorithms-book

Amy.Hodler@neo4j.com · @amyhodler
Jennifer.Reif@neo4j.com · @JMHReif