Leveraging Graph Analytics for Fraud Detection in PaySim Data

Graph Analytics for
Fraud Detection
Using PaySim and the Neo4j Graph Data Science Library
Dave Voutila <dave.voutila@neo4j.com>
Sales Engineer

● Sales Engineer with Neo4j! 👋
● Based in Vermont, USA 🌲🍁⛰
○ Work primarily with our Canadian clients
● You can find me on…
○ ...the web: https://sisu.io
○ ...LinkedIn: https://www.linkedin.com/in/davevoutila/
○ ...GitHub: https://github.com/voutilad
○ In the hive of scum and villainy aka Twitter: @voutilad
Who am I?

● Generating realistic, synthetic financial transactions with PaySim
● Quick rundown of the Neo4j Graph Data Science Library
● Live Demo of using graph algorithms to analyze PaySim for fraudulent and risky behavior
A Tale in 3 Acts

Meet PaySim 👋
● Simulates actors in a mobile
money network
○ Clients
○ Merchants
○ Banks
● Generate synthetic data that
is realistic in the aggregate
● Open source, customizable
● DETERMINISTIC!
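A minimal sketch of what "deterministic" means for a seeded simulation. This is a toy stand-in written in Python, not PaySim's code (PaySim itself is a Java project); the function name, event shape, and amounts are all hypothetical.

```python
import random


def simulate(seed, steps=5, clients=3):
    """Toy seeded simulation: returns a list of (step, client, amount) events."""
    # A private generator, so the run depends only on the seed that was passed in.
    rng = random.Random(seed)
    events = []
    for step in range(steps):
        client = rng.randrange(clients)
        amount = round(rng.uniform(1.0, 500.0), 2)
        events.append((step, client, amount))
    return events


run_a = simulate(seed=42)
run_b = simulate(seed=42)  # same seed: identical event stream
run_c = simulate(seed=43)  # different seed: a different stream
print(run_a == run_b)
```

Re-running with the same seed reproduces the same data set, which is what makes results comparable between runs.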

● Parameterized simulation of Client transactions
● Some fraud simulation, specifically money mules
PaySim v1 & v2-snapshot

● 1st Party / Synthetic Fraud
○ Reuse of identifiers (ssn, email, phone)
○ Fabrication of identifiers
● 3rd Party Fraud
○ Attacks via Merchant vectors
○ Persistence and retargeting of victims
● You can more easily build your own fraudsters
● And a bunch more knobs and dials
PaySim v2.3 (my fork)

An Aside: 1st Party & Synthetic Fraud
See: https://sisu.io/posts/paysim-part3/

The Neo4j
Graph Data Science
Library

What is Graph Data Science?
Graph Data Science is a science-driven approach to gain knowledge from the relationships and structures in data, typically to power predictions.
Data scientists use relationships to answer questions.

Query (e.g. Cypher): Local Patterns
● Real-time, local decisioning and pattern matching
● You know what you’re looking for and making a decision
Graph Algorithms: Global Computation
● Global analysis and iterations
● You’re learning the overall structure of a network, updating data, and predicting

Graph Algorithms & Functions in Neo4j
Pathfinding & Search:
• Shortest Path
• Single-Source Shortest Path
• All Pairs Shortest Path
• A* Shortest Path
• Yen’s K Shortest Path
• Minimum Weight Spanning Tree
• K-Spanning Tree (MST)
• Random Walk
Centrality / Importance:
• Degree Centrality
• Closeness Centrality
• CC Variations: Harmonic, Dangalchev, Wasserman & Faust
• Betweenness Centrality & Approximate
• PageRank
• Personalized PageRank
• ArticleRank
• Eigenvector Centrality
Community Detection:
• Triangle Count
• Clustering Coefficients
• Connected Components (Union Find)
• Strongly Connected Components
• Label Propagation
• Louvain Modularity
• Balanced Triad (identification)
• K-1 Coloring
• Modularity Optimization
Similarity:
• Euclidean Distance
• Cosine Similarity
• Node Similarity (Jaccard)
• Overlap Similarity
• Pearson Similarity
• Approximate KNN
Link Prediction:
• Adamic Adar
• Common Neighbors
• Preferential Attachment
• Resource Allocations
• Same Community
• Total Neighbors
...and also Auxiliary Functions:
• Random graph generation
• One hot encoding
• Distributions & metrics

It’s easier than it sounds (promise)
The GDS doesn’t operate using the Neo4j kernel API

Graph Algorithms for Detecting Fraud
Graph algorithms enable reasoning about network structure:
● Louvain to identify communities that frequently interact
● PageRank to measure influence and transaction volumes
● Connected components to identify disjoint groups sharing identifiers
● Jaccard to measure account similarity
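Two of these ideas are small enough to sketch in plain Python: connected components (via union-find) over clients that share identifiers, and Jaccard similarity between the merchant sets of two accounts. This is an illustration of the techniques only, not how the Graph Data Science Library implements them; the client IDs, identifiers, and merchant names are made up.

```python
from collections import defaultdict


def connected_components(edges):
    """Group nodes joined by any chain of edges, using union-find."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = defaultdict(set)
    for node in list(parent):
        groups[find(node)].add(node)
    return sorted(sorted(group) for group in groups.values())


def jaccard(a, b):
    """Size of the intersection divided by size of the union (0.0 if both are empty)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical client-to-identifier links: c1..c3 are chained by a shared phone and email.
links = [
    ("c1", "phone:555-0100"), ("c2", "phone:555-0100"),
    ("c2", "email:a@example.com"), ("c3", "email:a@example.com"),
    ("c4", "ssn:000-00-0001"), ("c5", "ssn:000-00-0001"),
]
client_groups = [
    [node for node in group if node.startswith("c")]
    for group in connected_components(links)
]
similarity = jaccard({"m1", "m2", "m3"}, {"m2", "m3", "m4"})
print(client_groups, similarity)
```

Clients that land in the same component share at least one identifier directly or through an intermediary, which is the pattern the identifier-reuse bullet describes.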

● At each step, the fraudster has a probability of committing fraud
● If they’re feeling malicious…
○ They have a probability of re-victimizing someone
○ Or they’ll find a new victim via a high-risk Merchant that they target (peeking into the Merchant’s history)
● They perform test charges (payments)
● They may subsequently transfer the balance
Meet our 3rd Party Fraudster
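The victim-selection behavior described above could be sketched roughly as follows. This is a hypothetical toy in Python, not PaySim's actual logic: the function name, the parameter names, and the exact way the two probabilities interact are assumptions made for illustration.

```python
import random


def choose_victim(rng, past_victims, exposed_clients, fraud_p, new_victim_p):
    """Pick this step's victim, or return None when no fraud is attempted.

    exposed_clients stands in for clients found in a targeted merchant's history.
    """
    if rng.random() >= fraud_p:
        return None  # most steps: the fraudster does nothing
    fresh = sorted(exposed_clients - past_victims)
    if fresh and (not past_victims or rng.random() < new_victim_p):
        victim = rng.choice(fresh)  # new victim via the targeted merchant
    elif past_victims:
        victim = rng.choice(sorted(past_victims))  # re-victimize someone
    else:
        return None  # nobody to target
    past_victims.add(victim)
    return victim


rng = random.Random(7)
victims = set()
idle = choose_victim(rng, victims, {"c1", "c2"}, fraud_p=0.0, new_victim_p=0.5)
first = choose_victim(rng, victims, {"c1", "c2"}, fraud_p=1.0, new_victim_p=0.5)
again = choose_victim(rng, victims, {"c1", "c2"}, fraud_p=1.0, new_victim_p=0.0)
print(idle, first, again)
```

With `new_victim_p=0.0` the sketch always returns to an earlier victim, which mimics the persistence and retargeting behavior.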

● The Goal
○ Find unreported Fraud Victims
○ Find at-risk individuals
● The Approach
○ Build a training set of clients
○ Engineer some sort of risk score for merchants (our alleged
fraud vector)
○ Use Client transaction history with Merchants to categorize
them as likely fraud victims
Our mission

● PaySim generated
○ 1.6M Transactions
○ ~10k Clients
○ 500 Merchants
● The graph
○ 1.6M nodes (98% transactions)
○ 5M relationships (we’ll be making more)
Our Playground

● Known Fraud Victims
○ Folks that reported fraudulent charges
○ In the case of PaySim, we are all-seeing
● Known Non-Victims
○ What’s a term for non-victims anyway?!
○ These are accounts that have no fraud
Our Training Set

● We’ll primarily be relating Clients to Merchants
Bipartite Graphs
Clients
Merchants
Transactions
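One way to relate Clients to Merchants is to collapse the Transaction nodes between them into weighted Client-Merchant edges. A minimal Python sketch of that idea, using made-up payments with amounts in cents; the tuple layout and function name are hypothetical and not taken from the demo.

```python
from collections import Counter


def client_merchant_weights(transactions):
    """Collapse (client, merchant, amount) payments into weighted client-merchant edges."""
    counts = Counter()
    totals = Counter()
    for client, merchant, amount in transactions:
        counts[(client, merchant)] += 1
        totals[(client, merchant)] += amount
    # Each edge carries (number of payments, total amount paid).
    return {pair: (counts[pair], totals[pair]) for pair in counts}


payments = [
    ("c1", "m1", 1200),
    ("c1", "m1", 300),
    ("c1", "m2", 4500),
    ("c2", "m1", 999),
]
edges = client_merchant_weights(payments)
print(edges)
```

The result is a bipartite graph: every edge connects one Client to one Merchant, and no edge connects two nodes of the same kind.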

Exponential de-what-now?
Figure: exponential decay plot (Plot-exponential-decay.png: PeterQ, derivative work: Autopilot / CC BY-SA, https://creativecommons.org/licenses/by-sa/2.5)
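Exponential decay follows N(t) = N0 * e^(-λt). A minimal Python sketch, assuming the decay is used to make older events count for less; the half-life parameterization and the function name are choices made here for illustration, not values from the demo.

```python
import math


def decay_weight(age, half_life):
    """Weight in (0, 1] that halves every `half_life` units of age."""
    decay_rate = math.log(2) / half_life  # λ in N(t) = N0 * e^(-λt)
    return math.exp(-decay_rate * age)


# Hypothetical example: events aged 0, 1, and 2 half-lives.
weights = [decay_weight(age, half_life=24) for age in (0, 24, 48)]
print(weights)
```

The weight drops to one half after one half-life and one quarter after two, so recent activity dominates any score built from these weights.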

PageRank
What: Finds important nodes
based on their relationships
Why: Recommendations, identifying influencers
Features:
- Tolerance
- Damping
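To show where damping and tolerance enter the computation, here is a textbook power-iteration PageRank in plain Python. It is a sketch of the general algorithm, not the Graph Data Science Library's implementation; the parameter names and the tiny example graph are assumptions for illustration.

```python
def pagerank(edges, damping=0.85, tolerance=1e-7, max_iterations=100):
    """Power-iteration PageRank over directed (source, target) edges."""
    nodes = sorted({node for edge in edges for node in edge})
    if not nodes:
        return {}
    out_links = {node: [] for node in nodes}
    for source, target in edges:
        out_links[source].append(target)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(max_iterations):
        # Rank held by nodes with no outgoing edges is spread evenly over all nodes.
        dangling = sum(rank[node] for node in nodes if not out_links[node])
        new_rank = {node: (1.0 - damping) / n + damping * dangling / n for node in nodes}
        for source in nodes:
            for target in out_links[source]:
                new_rank[target] += damping * rank[source] / len(out_links[source])
        # Tolerance: stop once the total change between iterations is small enough.
        delta = sum(abs(new_rank[node] - rank[node]) for node in nodes)
        rank = new_rank
        if delta < tolerance:
            break
    return rank


# Hypothetical payments: three clients pay merchant "m", which pays out to "a".
scores = pagerank([("a", "m"), ("b", "m"), ("c", "m"), ("m", "a")])
print(max(scores, key=scores.get))
```

Damping is the probability of following an edge rather than jumping to a random node, and tolerance controls when iteration stops. Scores in this sketch are normalized to sum to 1, which may differ from the scaling a given library uses.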

Label Propagation
What: Finds communities
Why: Useful for recommendations, fraud detection, finding common co-occurrences. Very fast.
Features:
- Seeding
- Directed relationships
- Weighted relationships
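A minimal Python sketch of label propagation with optional seed labels and relationship weights. It treats relationships as undirected and breaks ties deterministically, both simplifications chosen here; it is not the Graph Data Science Library's implementation, and the node names and weights are made up.

```python
from collections import defaultdict


def label_propagation(weighted_edges, seeds=None, max_iterations=10):
    """Each node repeatedly adopts the label with the largest total weight among its neighbors."""
    neighbors = defaultdict(dict)
    for a, b, weight in weighted_edges:
        neighbors[a][b] = neighbors[a].get(b, 0.0) + weight
        neighbors[b][a] = neighbors[b].get(a, 0.0) + weight
    labels = {node: node for node in neighbors}  # start with one label per node
    for node, label in (seeds or {}).items():
        if node in labels:
            labels[node] = label  # seeding: caller-supplied starting labels
    for _ in range(max_iterations):
        changed = False
        for node in sorted(neighbors):
            votes = defaultdict(float)
            for other, weight in neighbors[node].items():
                votes[labels[other]] += weight
            # Heaviest label wins; ties go to the alphabetically smallest label.
            best = min(votes, key=lambda label: (-votes[label], label))
            if best != labels[node]:
                labels[node] = best
                changed = True
        if not changed:
            break
    return labels


# Two tight triangles joined by one weak relationship.
graph = [
    ("a", "b", 1.0), ("b", "c", 1.0), ("a", "c", 1.0),
    ("d", "e", 1.0), ("e", "f", 1.0), ("d", "f", 1.0),
    ("c", "d", 0.1),
]
communities = label_propagation(graph)
seeded = label_propagation(graph, seeds={"a": "x", "b": "x", "c": "x"})
print(communities, seeded)
```

The weak relationship between the triangles is outvoted on both sides, so each triangle settles on its own label. In the seeded run, the first triangle keeps the supplied label `x`.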

Going further
Some recommended next steps

● Those ~650 high-risk client accounts...
○ Can similarity routines reveal
anything?
○ What if we look at additional historical
transactions?
● That suspect merchant…
○ What can we glean from their activity?
● Operationalizing our findings...
○ How can we implement mutable
graph projections?
Possible next steps in our investigation

● Make your own Fraudsters
○ https://github.com/voutilad/paysim
○ https://www.sisu.io/posts/paysim
○ Requires Java JDK 8 or newer (tested with 11)
● Integrate PaySim with Neo4j
○ https://github.com/voutilad/paysim-demo
○ Works with both Neo4j 3.5 and 4.0
Your Turn: Getting & Using PaySim

seed=time
nbSteps=720
multiplier=1
nbClients=10000
nbFraudsters=500
nbMerchants=500
nbBanks=5
firstPartyFraudProbability=0.001
thirdPartyFraudProbability=0.05
thirdPartyNewVictimProbability=0.025
thirdPartyPercentHighRiskMerchants=0.005
The PaySim Parameters
SORRY...I used TIME as a seed!
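For anyone scripting around these settings, a small Python sketch that reads the key=value pairs shown above into typed values. The parser is a generic helper written for this example, not part of PaySim; only the keys and values come from the slide.

```python
PAYSIM_PROPERTIES = """\
seed=time
nbSteps=720
multiplier=1
nbClients=10000
nbFraudsters=500
nbMerchants=500
nbBanks=5
firstPartyFraudProbability=0.001
thirdPartyFraudProbability=0.05
thirdPartyNewVictimProbability=0.025
thirdPartyPercentHighRiskMerchants=0.005
"""


def coerce(value):
    """Return an int or float when the text parses as one, else the original string."""
    for cast in (int, float):
        try:
            return cast(value)
        except ValueError:
            pass
    return value


def parse_properties(text):
    """Parse simple key=value lines, skipping blanks and # comments."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        settings[key.strip()] = coerce(value.strip())
    return settings


settings = parse_properties(PAYSIM_PROPERTIES)
print(settings["seed"], settings["nbSteps"], settings["thirdPartyFraudProbability"])
```

Because the seed here is `time` rather than a fixed number, this particular run cannot be reproduced exactly, which is what the apology on the slide is about.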
