Optimized Graph
Algorithms in Neo4j
Use the Power of Connections to Drive Discovery
January 2018
Mark Needham
Amy Hodler
Mark Needham
Software Engineer, Neo4j
mark.needham@neo4j.com
@markhneedham
Next 50 Minutes
• Why Use Graph Analytics
• Randomness vs. Reality
• Graph Analytics Takes Off
• How to Run Graph Analytics
• Neo4j Graph Analytics and Algorithms
• Demos and Implementation
Graph
Algorithms
Real-World
Networks
Amy E. Hodler
Analytics Marketing, Neo4j
amy.hodler@neo4j.com
@amyhodler
Understand. Predict. Prescribe.
Forecast Complex Network Behavior
and Prescribe Action
Cascading Failures
Airline Congestion - 2010
Source: “Systemic delay propagation in the US airport network” – Fleurquin, Ramasco, Eguiluz
Planning and Least
Cost Routing
Bridge Points
Languages – Telecom Network
Source: “Fast unfolding of communities in large networks” – Blondel, Guillaume, Lambiotte, Lefebvre
Extract Structure and Model Processes
Real Networks Aren’t Random
Preferential
Attachment
Nodes tend to link to nodes
that already have a lot of links
Origins Debated
• Local Mechanisms
• Global Optimization
• Mixed or Other
Network Structures are Inseparable from Development
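A minimal sketch of preferential attachment, assuming the simplest growth rule: each new node adds one link to an existing node, picked with probability proportional to that node's current number of links. The function names and the node count are illustrative and do not come from the slides.

```python
import random
from collections import Counter


def preferential_attachment(n_nodes, seed=42):
    """Grow a graph one node at a time; every new node links to one
    existing node chosen with probability proportional to its degree."""
    rng = random.Random(seed)
    edges = [(0, 1)]      # start from a single link between nodes 0 and 1
    # Each link contributes both endpoints to this list, so a uniform
    # draw from it selects a node in proportion to its degree.
    endpoints = [0, 1]
    for new_node in range(2, n_nodes):
        target = rng.choice(endpoints)
        edges.append((new_node, target))
        endpoints.extend([new_node, target])
    return edges


def degree_counts(edges):
    """Count the links attached to every node."""
    degree = Counter()
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    return degree


edges = preferential_attachment(10_000)
degree = degree_counts(edges)
histogram = Counter(degree.values())   # k -> number of nodes with k links
print("nodes with a single link:", histogram[1])
print("links on the largest hub:", max(degree.values()))
```

Under these assumptions most nodes end up with a single link while a handful of early nodes grow into hubs, which is the skewed shape the following slides contrast with an average distribution.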
Concentrated
Distribution
Source: “How Stuff Spreads” – Pulsar Platform
[Plot: Nodes with k Links vs. Number of links (k)]
Many nodes with only
a few links
A few hubs with a
large number of links
Power Law Distribution
“There is No Network in Nature that we
know of that would be described by the
Random network model.”
- Albert-László Barabási
Small-World
High local clustering
and short average path
lengths. Hub and spoke
architecture.
Scale-Free
Hub and spoke
architecture preserved
at multiple scales. High
power law distribution.
Random
Average distributions.
No structure or
hierarchical patterns.
Reality
The Lure of Averages
Average Distribution - Random -
[Plot: Nodes with k Links vs. Number of Links (k)]
Most nodes have the same number of links
No highly connected nodes
Source: Network Science - Barabasi
Art: Ulysses and the Sirens – Herbert James Draper
Resist The Lure of Averages
Power Law Distribution - Scale-Free -
[Plot: Nodes with k Links vs. Number of links (k)]
Many nodes with only a few links
A few hubs with a large number of links
You’ll Miss the Structure Hidden in Your Networks
- Scale-Free -
- Small World -
Graph Analytics
Takes Off
#Finally!
Leonhard Euler 1707-1783
Critical Mass
• Collect, share and analyze
massive connected data
• Discovered common
principles and structures
• Existing mathematical tools
• Unfulfilled promises of
big data
Insights from Algorithms
Graph Algorithms
• Metrics
• Relevance
• Clustering
• Structural Insights
Machine Learning
• Classification, Regression
• NLP, Structural/Content
Predictions
• Neural Networks as Graphs
• Graph As Compute Fabric
Structures Can Hide
Source: “Communities, modules and large-scale structure in networks“ - Mark Newman
Source: “Hierarchical structure and the prediction of missing links in networks”; ”Structure and
inference in annotated networks” - A. Clauset, C. Moore, and M.E.J. Newman.
Graph of Thrones
A. Beveridge: GoT - Interaction Graph from Books
How to Run
Graph Analytics?
Existing Options (so far)
• Data Processing
• Spark with GraphX, Flink with Gelly
• Dedicated Graph Processing
• Urika, GraphLab, Giraph, Mosaic, GPS, Signal-Collect, Gradoop
• Data Scientist Toolkit
• igraph, NetworkX, Boost (graph-tool) in Python, R, C
Drawbacks
• Manage several tools
• Selection -> learning -> installation -> operation
• Data selection, projection and transfer
• Tedious and time consuming
• Scalability
• Especially classic data science tools
An Example
From Past GraphConnect
Source: John Swain - Twitter Analytics Right Relevance Talk
Many Moving Parts!
Example Workflow Pipeline
[Diagram components: Twitter Streaming API, Python Tweet Collection (includes user data), RabbitMQ, MongoDB, Neo4j, R Scripts (Graph Stats, Community Detection), MySQL, Graph .graphml, Tableau, Graph Visualization]
• Moved from Twitter Search API to Streaming API
• Replaced Python Twitter libraries (Tweepy) with raw API calls
• Streaming tweets in message queue
• Full tweets and user data stored in MongoDB
• Built graph for analysis in Neo4j from tweets persisted in MongoDB
• Analysis in R: iGraph libraries for algorithms; some text analysis, e.g. LDA topics
• Results published in MySQL for Tableau
• Graphml for import to Gephi with stats precalculated
Our Goal
[Diagram repeats the Example Workflow Pipeline: Twitter Streaming API, Python Tweet Collection (includes user data), RabbitMQ, MongoDB, Neo4j, R Scripts (Graph Stats, Community Detection), MySQL, Graph .graphml, Tableau, Graph Visualization]
Neo4j Graph Analytics
and Algorithms
Neo4j
Native Graph
Database
Analytics
Integrations
Cypher Query
Language
Wide Range of
APOC Procedures
Optimized
Graph Algorithms
Pathfinding: Finds the optimal path or evaluates route availability and quality
Community Detection: Evaluates how a group is clustered or partitioned
Centrality: Determines the importance of distinct nodes in the network
1. Call as Cypher procedure
2. Pass in specification (Label, Prop, Query) and configuration
3. The .stream variant returns (a lot of) results
CALL algo.<name>.stream('Label','TYPE',{conf})
YIELD nodeId, score
4. The non-stream variant writes results to the graph and returns statistics
CALL algo.<name>('Label','TYPE',{conf})
Usage
Pass in Cypher statements for the node list and the relationship list.
CALL algo.<name>(
'MATCH ... RETURN id(n)',
'MATCH (n)-->(m)
RETURN id(n) as source,
id(m) as target', {graph:'cypher'})
Cypher Projection
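Conceptually, a Cypher projection hands the algorithm two result sets: a list of node ids and a list of (source, target) rows. The sketch below shows one way such rows could be packed into a compact in-memory adjacency structure. The compressed sparse row (CSR) layout, the function names, and the sample ids are assumptions for illustration; the slides do not describe the library's internal layout.

```python
def build_csr(node_ids, relationships):
    """Pack (source, target) rows into compressed sparse row arrays."""
    # Map external node ids to dense positions 0..n-1.
    index = {node_id: i for i, node_id in enumerate(node_ids)}
    # Keep only rows whose endpoints both appear in the node list.
    rows = [(index[s], index[t]) for s, t in relationships
            if s in index and t in index]
    # Count outgoing links per node, then turn counts into start offsets.
    offsets = [0] * (len(node_ids) + 1)
    for s, _ in rows:
        offsets[s + 1] += 1
    for i in range(len(node_ids)):
        offsets[i + 1] += offsets[i]
    # Write every target into the slice that belongs to its source.
    targets = [0] * len(rows)
    cursor = offsets[:-1].copy()
    for s, t in rows:
        targets[cursor[s]] = t
        cursor[s] += 1
    return offsets, targets


def neighbors(offsets, targets, i):
    """Outgoing neighbors of the node at dense position i."""
    return targets[offsets[i]:offsets[i + 1]]


node_ids = [10, 20, 30, 40]                        # e.g. rows of RETURN id(n)
relationships = [(10, 20), (10, 30), (30, 10), (40, 99)]
offsets, targets = build_csr(node_ids, relationships)
print(neighbors(offsets, targets, 0))              # neighbors of node id 10
```

Two flat arrays replace per-node objects, so scanning a node's neighbors is a single contiguous slice.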
• PageRank (baseline)
• Betweenness
• Closeness
• Degree
Algorithms - Centralities
• Label Propagation
• Union Find / WCC
• Strongly Connected Components
• Louvain
• Triangle-Count / Clustering Coefficient
Algorithms - Community Detection
• Single Source Shortest Path
• All-Nodes SSSP
• Parallel BFS / DFS
Algorithms - Pathfinding
Iterate Quickly
• Combine data from sources into one graph
• Project to relevant subgraphs
• Enrich data with algorithms
• Traverse, collect, filter, aggregate with queries
• Visualize, Explore, Decide, Export
• From all APIs and Tools
Demo Time!
Datasets
Yelp Business Graph
• 5m nodes
• 17m relationships
Bitcoin
• 1.7bn nodes,
• 2.7bn rels
DBPedia
• 11m nodes
• 116m relationships
DBpedia
Shallow Copy of Wikipedia: (Page) -[:Link]-> (Page)
CALL algo.pageRank.stream('Page', 'Link', {iterations:5}) YIELD node, score
WITH *
ORDER BY score DESC
LIMIT 5
RETURN node.title, score;
+--------------------------------------+
| node.title | score |
+--------------------------------------+
| "United States" | 13349.2 |
| "Animal" | 6077.77 |
| "France" | 5025.61 |
| "List of sovereign states" | 4913.92 |
| "Germany" | 4662.32 |
+--------------------------------------+
5 rows 46 seconds
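For readers who want to see what an iterative PageRank run does, here is a small power-iteration sketch on a toy link graph. It assumes the common damping-factor formulation with scores normalized to sum to 1, so its numbers are on a different scale from the scores in the table above. The function name, the toy pages, and the iteration count are illustrative and are not the library's implementation.

```python
def pagerank(out_links, iterations=20, damping=0.85):
    """Power iteration over a dict mapping node -> list of linked nodes.
    Every link target must also appear as a key."""
    nodes = list(out_links)
    n = len(nodes)
    score = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by pages without outgoing links is spread evenly.
        dangling = sum(score[v] for v in nodes if not out_links[v])
        new_score = {v: (1.0 - damping) / n + damping * dangling / n
                     for v in nodes}
        # Every page passes an equal share of its rank along each link.
        for v in nodes:
            for target in out_links[v]:
                new_score[target] += damping * score[v] / len(out_links[v])
        score = new_score
    return score


pages = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}
scores = pagerank(pages, iterations=50)
top = max(scores, key=scores.get)
print(top, round(scores[top], 3))
```

With few iterations the ranking on a small graph can still oscillate, which is why the iteration count is a parameter here, just as it is in the procedure call on the slide.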
DBPedia – Largest Clusters
CALL algo.labelPropagation();
// First 1M pages by Rank
MATCH (n:Page)
WITH n
ORDER BY n.pagerank DESC
LIMIT 1000000
// group by partition
WITH n.partition AS partition,
count(*) AS clusterSize,
collect(n.title) AS pages
// return most influential node for largest clusters
RETURN pages[0] AS mainPage,
pages[1..10] AS otherPages
ORDER BY clusterSize DESC
LIMIT 20
Yelp
• Business Reviews by Users
• Businesses have Categories and Locations
• Users have Friends
• Bipartite network (:User)-->(:Business), with projections (:User)<-->(:User) and (:Business)<-->(:Business)
Yelp – Social - Statistics
MATCH (u:User) where exists ( (u)-[:FRIENDS]-() )
WITH u.average_stars as stars, u.review_count as reviews, u.funny as funny
RETURN max(stars),avg(stars),stdev(stars),max(reviews),avg(reviews),stdev(reviews),max(funny),avg(funny),stdev(funny);
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| max(stars) | avg(stars) | stdev(stars) | max(reviews) | avg(reviews) | stdev(reviews) | max(funny) | avg(funny) | stdev(funny) |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| 5.0 | 3.8238072950764947 | 0.8862511758625753 | 11284 | 45.81704314022204 | 120.52419266925014 | 170896 | 36.26637835535585 | 731.6024752545679 |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------+
MATCH (u:User) where exists ( (u)-[:FRIENDS]-() )
WITH u.yelping_since as since
RETURN substring(since,0,4) as year, count(*) as total
ORDER BY year asc limit 10;
+----------------+
| year | total |
+----------------+
| "2004" | 64 |
| "2005" | 844 |
| "2006" | 4504 |
| "2007" | 11833 |
| "2008" | 20729 |
| "2009" | 33965 |
| "2010" | 53046 |
| "2011" | 70331 |
| "2012" | 62596 |
| "2013" | 57330 |
+----------------+
Yelp – Social - PageRank
call algo.pageRank.stream('User','FRIENDS')
yield node,score with node,score
order by score desc limit 10
return node {.name, .review_count, .average_stars,.useful,.yelping_since,.funny},
score,
size( (node)<-[:FRIENDS]-()<-[:FRIENDS]-()) as in,
size( (node)-[:FRIENDS]->()-[:FRIENDS]->()) as out;
+-----------------------------------------------------------------------------------------------------------------------------------------------------+
| node | score |
+-----------------------------------------------------------------------------------------------------------------------------------------------------+
| {funny -> 61200, name -> "Philip", average_stars -> 3.93, review_count -> 788, useful -> 69448, yelping_since -> "2007-06-09"} | 208.31336799999994 |
| {funny -> 21432, name -> "Des", average_stars -> 3.88, review_count -> 78, useful -> 140024, yelping_since -> "2014-04-01"} | 201.28600150000003 |
| {funny -> 465, name -> "Dallas", average_stars -> 4.17, review_count -> 330, useful -> 5517, yelping_since -> "2010-11-07"} | 192.164762 |
| {funny -> 1019, name -> "Cara", average_stars -> 3.96, review_count -> 842, useful -> 11738, yelping_since -> "2010-07-21"} | 184.01898249999996 |
| {funny -> 1233, name -> "Walker", average_stars -> 3.91, review_count -> 462, useful -> 12332, yelping_since -> "2007-01-25"} | 180.48898350000005 |
| {funny -> 13432, name -> "Gabi", average_stars -> 4.05, review_count -> 1730, useful -> 20759, yelping_since -> "2007-08-10"} | 163.29424850000004 |
| {funny -> 12848, name -> "Ruggy", average_stars -> 3.92, review_count -> 2118, useful -> 72265, yelping_since -> "2007-07-31"} | 161.87635500000002 |
| {funny -> 9997, name -> "Bill", average_stars -> 3.38, review_count -> 595, useful -> 12074, yelping_since -> "2014-04-05"} | 157.0438075 |
| {funny -> 1544, name -> "Ashley", average_stars -> 3.7, review_count -> 224, useful -> 1610, yelping_since -> "2009-09-29"} | 150.21423599999997 |
| {funny -> 3599, name -> "Risa", average_stars -> 4.08, review_count -> 1044, useful -> 22121, yelping_since -> "2011-07-30"} | 138.20863199999997 |
+-----------------------------------------------------------------------------------------------------------------------------------------------------+
10 rows
3236 ms
Yelp
• Inferred network of users, via jointly reviewed businesses
• (u1:User)-[:WROTE]->(review1)-[:REVIEWS]->(business)<-[:REVIEWS]-(review2)<-[:WROTE]-(u2:User)
• 1.3bn paths
• Inferred network of businesses, via users who reviewed both
• (b1:Business)<-[:REVIEWS]-()<-[:WROTE]-(u)-[:WROTE]->()-[:REVIEWS]->(b2:Business)
• 214m paths
• subset: (b1:Business)-[:CO_OCCURENT_REVIEWS]-(b2:Business)
Yelp – Business – Co-Occurrence
•Find clusters of "similar" businesses
•Find peer groups of similar people
•Clusters of "interests"
Yelp – Business – Co-Occurrence
CALL apoc.periodic.iterate(
'MATCH (b:Business)
WHERE size((b)<-[:REVIEWS]-()) > 5 AND b.city="Las Vegas"
RETURN b',
'MATCH (b)<-[:REVIEWS]-(r1)<-[:WROTE]-(u)-[:WROTE]->(r2)-[:REVIEWS]->(b2)
WHERE id(b) < id(b2) AND b2.city="Las Vegas"
AND size((b2)<-[:REVIEWS]-()) > 5
AND r1.stars = r2.stars
WITH b, b2, count(*) AS weight, avg(r1.stars) as rating where weight > 5
MERGE (b)-[cr:B2B]-(b2)
ON CREATE SET cr.weight = weight, cr.rating = rating
SET b:Marked, b2:Marked',
{batchSize: 1});
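The co-occurrence step above projects the bipartite user-business graph onto businesses: two businesses become linked when the same user gave both the same star rating, and the link carries a count and an average rating. The Python sketch below mirrors that logic on a few made-up review tuples. The function name, the sample data, and the `min_weight` parameter are assumptions for illustration (the query on the slide keeps pairs with weight > 5).

```python
from collections import defaultdict
from itertools import combinations


def co_occurrence(reviews, min_weight=1):
    """reviews: iterable of (user, business, stars) tuples.
    Returns {(b1, b2): (weight, avg_rating)} for business pairs that the
    same user rated with the same number of stars."""
    by_user = defaultdict(list)
    for user, business, stars in reviews:
        by_user[user].append((business, stars))

    stats = defaultdict(lambda: [0, 0.0])   # pair -> [count, sum of stars]
    for rated in by_user.values():
        for (b1, s1), (b2, s2) in combinations(rated, 2):
            if b1 != b2 and s1 == s2:
                # Sorting the pair plays the role of id(b) < id(b2):
                # each undirected pair is counted under a single key.
                pair = tuple(sorted((b1, b2)))
                stats[pair][0] += 1
                stats[pair][1] += s1

    return {pair: (count, total / count)
            for pair, (count, total) in stats.items()
            if count >= min_weight}


reviews = [
    ("u1", "cafe", 5), ("u1", "bar", 5),
    ("u2", "cafe", 4), ("u2", "bar", 4),
    ("u3", "cafe", 5), ("u3", "bar", 3), ("u3", "gym", 5),
]
links = co_occurrence(reviews)
strong_links = co_occurrence(reviews, min_weight=2)
print(links)
```

Raising `min_weight` prunes weak links before clustering, the same role the weight filter plays in the query.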
Yelp - Clustering Union Find
CALL algo.unionFind.stream(
'MATCH (b:Business:Marked) RETURN id(b) as id',
'MATCH (b1:Business:Marked)-[r:B2B]-(b2)
RETURN id(b1) as source,
id(b2) as target,
count(r) as value',
{graph:'cypher'}) YIELD setId as cluster, nodeId
RETURN cluster, count(*) as size
ORDER BY size DESC LIMIT 10;
+--------------+
|cluster| size |
+--------------+
| 3 | 5625 |
| 1876 | 3 |
| 155 | 2 |
| 1091 | 2 |
| 1728 | 2 |
| 1177 | 2 |
| 337 | 2 |
| 3046 | 2 |
| 674 | 2 |
| 1948 | 2 |
+--------------+
10 rows
6615 ms
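What a union-find pass computes can be sketched in a few lines of Python: every node starts in its own set, and each relationship merges the sets of its two endpoints. This is a sequential illustration with hypothetical function names and sample ids, not the parallel implementation behind the procedure.

```python
from collections import Counter


def connected_components(node_ids, relationships):
    """Return {node: set_id}, where set_id is the smallest node id
    in that node's connected component."""
    parent = {n: n for n in node_ids}

    def find(n):
        # Path halving: point nodes at their grandparent while walking up.
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            # Keep the smaller id as representative for a stable set id.
            parent[max(root_a, root_b)] = min(root_a, root_b)

    for source, target in relationships:
        union(source, target)
    return {n: find(n) for n in node_ids}


node_ids = range(1, 8)
relationships = [(1, 2), (2, 3), (4, 5), (3, 6)]
set_ids = connected_components(node_ids, relationships)
# Same shape as the query result: cluster id and size, largest first.
sizes = Counter(set_ids.values()).most_common()
print(sizes)
```

As in the result table above, counting nodes per set id shows whether one large component dominates.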
Yelp - PageRank
CALL algo.pageRank.stream(
'MATCH (b:Business:Marked)
RETURN id(b) as id',
'MATCH (b1:Business:Marked)-[r:B2B]-(b2)
RETURN id(b1) as source,
id(b2) as target',
{graph:'cypher'})
YIELD node, score
RETURN node.name, score
ORDER BY score DESC
LIMIT 10;
+-------------------------------------------------------+
| node.name | score |
+-------------------------------------------------------+
| "McCarran International Airport" | 27.49973599999999 |
| "Hash House A Go Go" | 19.062398000000005 |
| "Bachi Burger" | 18.1494385 |
| "Mon Ami Gabi" | 17.720350000000003 |
| "Bacchanal Buffet" | 15.783480500000003 |
| "Yard House Town Square" | 14.427296999999998 |
| "Secret Pizza" | 13.156547 |
| "Rollin Smoke Barbeque" | 12.808718499999998 |
| "Wicked Spoon" | 12.639942499999997 |
| "Monta Ramen" | 12.3904845 |
+-------------------------------------------------------+
10 rows
6979 ms
BitCoin
BitCoin Graph
• Full Copy of the BitCoin BlockChain
• from learnmeabitcoin.com (Greg Walker)
• 1.7 billion nodes, 2.7 billion rels
• 474k blocks, 240m tx, 280m addresses, 650m outputs
• 600 GB on disk
BitCoin Graph
Distribution of "locked" relationships for "addresses"
(participation in transactions)
call apoc.stats.degrees('<locked');
+--------------------------------------------------------------------------------------------------------------+
| type | direction | total | p50 | p75 | p90 | p95 | p99 | p999 | max | min | mean |
+--------------------------------------------------------------------------------------------------------------+
| "locked" | "INCOMING" | 654662356 | 0 | 0 | 1 | 1 | 2 | 28 | 1891327 | 0 | 0.37588608290716047 |
+--------------------------------------------------------------------------------------------------------------+
1 row
308 seconds
BitCoin Graph
Inferred network of addresses, via transaction and output
(a1)<-[:locked]-(o1)-[:in]->(tx)-[:out]->(o2)-[:locked]->(a2)
CALL algo.unionFind.stream(
'match (o:output)-[:locked]->(a) with a limit 10000000 return id(a) as id',
'match (o:output)-[:locked]->(a) with o, a limit 10000000
match (o)-[:in]->(tx)-[:out]->(o2)-[:locked]->(a2)
return id(a) as source, id(a2) as target, count(tx) as weight',
{graph:'cypher'})
YIELD setId as cluster, nodeId
RETURN cluster, count(*) AS size
ORDER BY size DESC
LIMIT 10;
+-------------------+
| cluster | size |
+-------------------+
| 5036 | 4409420 |
| 6295282 | 1999 |
| 5839746 | 1488 |
| 9356302 | 833 |
| 6560901 | 733 |
| 6370777 | 637 |
| 8101710 | 392 |
| 5945867 | 369 |
| 2489036 | 264 |
| 1703620 | 203 |
+-------------------+
10 rows, 296 seconds
Implementation
Design Considerations
• Ease of Use – Call as Procedures
• Parallelize everything: load, compute, write
• Efficiency: Use direct access, efficient datastructures, provide
high-level API
• Scale to billions of nodes and relationships
Use up to hundreds of CPUs and Terabytes of RAM
1. Load data in parallel from Neo4j
2. Store in efficient data structures
3. Run graph algorithm in parallel using Graph API
4. Write data back in parallel
[Diagram: Neo4j, Algorithm Datastructures and Graph API, connected by arrows numbered 1 to 4 to match the steps]
Architecture
Scale: 144 CPU
Neo4j Graph Platform with Neo4j Algorithms
vs. Apache Spark’s GraphX
[Bar chart: runtime in seconds (axis 0 to 450) for Union-Find (Connected Components) and PageRank; extracted bar values, in order: 251, 152, 416, 124]
Neo4j is Significantly Faster
Spark GraphX results publicly available
• Amazon EC2 cluster running 64-bit Linux
• 128 CPUs with 68 GB of memory, 2 hard disks
Neo4j Configuration
• Physical machine running 64-bit Linux
• 128 CPUs with 55 GB RAM, SSDs
Twitter 2010 Dataset
• 1.47 Billion Relationships
• 41.65 Million Nodes
Chart legend: GraphX, Neo4j
Compute At Scale – Payment Graph
3,000,000,000 nodes and 18,000,000,000 relationships (600G)
PageRank (20 iterations) on 1 machine, 20 threads, 900G RAM
call algo.pageRank('Account','SENT',
{graph:'huge',iterations:20,write:true,concurrency:20});
+-------------------------------------------------------------------+
| nodes | iterations | loadMillis | computeMillis | writeMillis |
+-------------------------------------------------------------------+
| 300000000 | 20 | 401404 | 6024994 | 47106 |
+-------------------------------------------------------------------+
1 row 6473526 ms -> 1h 47min
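The timing columns can be cross-checked with a little arithmetic. The sketch below uses only the numbers from the result row above; the variable names are illustrative.

```python
# Phase timings and total, copied from the result row above (milliseconds).
load_ms, compute_ms, write_ms = 401_404, 6_024_994, 47_106
total_ms = 6_473_526

# The three phases account for nearly all of the reported total.
phases_ms = load_ms + compute_ms + write_ms

# Convert the total to hours and whole minutes.
hours, remainder_s = divmod(total_ms // 1000, 3600)
minutes = remainder_s // 60
summary = f"{hours}h {minutes}min"

# Share of the run spent in the compute phase.
compute_share = compute_ms / total_ms

print(phases_ms, summary, f"{compute_share:.1%}")
```

Loading and writing together take under eight minutes; the 20 PageRank iterations dominate the run.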
We Need Your Feedback
• neo4j.com/slack at #neo4j-graph-algorithms
• github.com/neo4j-contrib/neo4j-graph-algorithms
• Whitepaper on neo4j.com/graph-analytics
“Graphs are one of the Unifying Themes of computer science . . .
That so many different structures can be modeled using a single formalism
is a Source of Great Power to the educated programmer.”
- Steven S. Skiena, The Algorithm Design Manual
Kudos:
Paul Horn
Martin Knobloch from Avantgarde Labs
Tomasz Bratanic (docs)
Thank You!
Questions !?

More Related Content

What's hot

Streaming Trend Discovery: Real-Time Discovery in a Sea of Events with Scott ...
Streaming Trend Discovery: Real-Time Discovery in a Sea of Events with Scott ...Streaming Trend Discovery: Real-Time Discovery in a Sea of Events with Scott ...
Streaming Trend Discovery: Real-Time Discovery in a Sea of Events with Scott ...
Databricks
 
Intro to Graphs and Neo4j
Intro to Graphs and Neo4jIntro to Graphs and Neo4j
Intro to Graphs and Neo4j
jexp
 
GraphX: Graph analytics for insights about developer communities
GraphX: Graph analytics for insights about developer communitiesGraphX: Graph analytics for insights about developer communities
GraphX: Graph analytics for insights about developer communities
Paco Nathan
 

What's hot (20)

Neo4j Graph Platform Overview, Kurt Freytag, Neo4j
Neo4j Graph Platform Overview, Kurt Freytag, Neo4jNeo4j Graph Platform Overview, Kurt Freytag, Neo4j
Neo4j Graph Platform Overview, Kurt Freytag, Neo4j
 
Graph Analytics for big data
Graph Analytics for big dataGraph Analytics for big data
Graph Analytics for big data
 
Introducing Neo4j 3.0
Introducing Neo4j 3.0Introducing Neo4j 3.0
Introducing Neo4j 3.0
 
How Graph Databases efficiently store, manage and query connected data at s...
How Graph Databases efficiently  store, manage and query  connected data at s...How Graph Databases efficiently  store, manage and query  connected data at s...
How Graph Databases efficiently store, manage and query connected data at s...
 
Graph Analytics in Spark
Graph Analytics in SparkGraph Analytics in Spark
Graph Analytics in Spark
 
Graph Algorithms for Developers
Graph Algorithms for DevelopersGraph Algorithms for Developers
Graph Algorithms for Developers
 
Gradoop: Scalable Graph Analytics with Apache Flink @ Flink & Neo4j Meetup Be...
Gradoop: Scalable Graph Analytics with Apache Flink @ Flink & Neo4j Meetup Be...Gradoop: Scalable Graph Analytics with Apache Flink @ Flink & Neo4j Meetup Be...
Gradoop: Scalable Graph Analytics with Apache Flink @ Flink & Neo4j Meetup Be...
 
Streaming Trend Discovery: Real-Time Discovery in a Sea of Events with Scott ...
Streaming Trend Discovery: Real-Time Discovery in a Sea of Events with Scott ...Streaming Trend Discovery: Real-Time Discovery in a Sea of Events with Scott ...
Streaming Trend Discovery: Real-Time Discovery in a Sea of Events with Scott ...
 
Neo4j Morpheus: Interweaving Table and Graph Data with SQL and Cypher in Apac...
Neo4j Morpheus: Interweaving Table and Graph Data with SQL and Cypher in Apac...Neo4j Morpheus: Interweaving Table and Graph Data with SQL and Cypher in Apac...
Neo4j Morpheus: Interweaving Table and Graph Data with SQL and Cypher in Apac...
 
Apache Spark GraphX & GraphFrame Synthetic ID Fraud Use Case
Apache Spark GraphX & GraphFrame Synthetic ID Fraud Use CaseApache Spark GraphX & GraphFrame Synthetic ID Fraud Use Case
Apache Spark GraphX & GraphFrame Synthetic ID Fraud Use Case
 
Building Fullstack Graph Applications With Neo4j
Building Fullstack Graph Applications With Neo4j Building Fullstack Graph Applications With Neo4j
Building Fullstack Graph Applications With Neo4j
 
Intro to Graphs and Neo4j
Intro to Graphs and Neo4jIntro to Graphs and Neo4j
Intro to Graphs and Neo4j
 
GraphX: Graph analytics for insights about developer communities
GraphX: Graph analytics for insights about developer communitiesGraphX: Graph analytics for insights about developer communities
GraphX: Graph analytics for insights about developer communities
 
Graph database & neo4j
Graph database & neo4jGraph database & neo4j
Graph database & neo4j
 
Graphs are everywhere! Distributed graph computing with Spark GraphX
Graphs are everywhere! Distributed graph computing with Spark GraphXGraphs are everywhere! Distributed graph computing with Spark GraphX
Graphs are everywhere! Distributed graph computing with Spark GraphX
 
Spark, GraphX, and Blockchains: Building a Behavioral Analytics Platform for ...
Spark, GraphX, and Blockchains: Building a Behavioral Analytics Platform for ...Spark, GraphX, and Blockchains: Building a Behavioral Analytics Platform for ...
Spark, GraphX, and Blockchains: Building a Behavioral Analytics Platform for ...
 
GraphFrames: Graph Queries In Spark SQL
GraphFrames: Graph Queries In Spark SQLGraphFrames: Graph Queries In Spark SQL
GraphFrames: Graph Queries In Spark SQL
 
Big Graph Analytics on Neo4j with Apache Spark
Big Graph Analytics on Neo4j with Apache SparkBig Graph Analytics on Neo4j with Apache Spark
Big Graph Analytics on Neo4j with Apache Spark
 
Einführung in Neo4j
Einführung in Neo4jEinführung in Neo4j
Einführung in Neo4j
 
GraphFrames: Graph Queries in Spark SQL by Ankur Dave
GraphFrames: Graph Queries in Spark SQL by Ankur DaveGraphFrames: Graph Queries in Spark SQL by Ankur Dave
GraphFrames: Graph Queries in Spark SQL by Ankur Dave
 

Similar to Graph Analytics: Graph Algorithms Inside Neo4j

Microservices, containers, and machine learning
Microservices, containers, and machine learningMicroservices, containers, and machine learning
Microservices, containers, and machine learning
Paco Nathan
 
Distributed computing poli
Distributed computing poliDistributed computing poli
Distributed computing poli
ivascucristian
 
Big data & hadoop framework
Big data & hadoop frameworkBig data & hadoop framework
Big data & hadoop framework
Tu Pham
 
FOSDEM2014 - Social Network Benchmark (SNB) Graph Generator - Peter Boncz
FOSDEM2014 - Social Network Benchmark (SNB) Graph Generator - Peter BonczFOSDEM2014 - Social Network Benchmark (SNB) Graph Generator - Peter Boncz
FOSDEM2014 - Social Network Benchmark (SNB) Graph Generator - Peter Boncz
Ioan Toma
 
Predicting Influence and Communities Using Graph Algorithms
Predicting Influence and Communities Using Graph AlgorithmsPredicting Influence and Communities Using Graph Algorithms
Predicting Influence and Communities Using Graph Algorithms
Databricks
 

Similar to Graph Analytics: Graph Algorithms Inside Neo4j (20)

Azure Databricks for Data Scientists
Azure Databricks for Data ScientistsAzure Databricks for Data Scientists
Azure Databricks for Data Scientists
 
Neo4j: What's Under the Hood & How Knowing This Can Help You
Neo4j: What's Under the Hood & How Knowing This Can Help You Neo4j: What's Under the Hood & How Knowing This Can Help You
Neo4j: What's Under the Hood & How Knowing This Can Help You
 
10 Big Data Technologies you Didn't Know About
10 Big Data Technologies you Didn't Know About 10 Big Data Technologies you Didn't Know About
10 Big Data Technologies you Didn't Know About
 
Yarn spark next_gen_hadoop_8_jan_2014
Yarn spark next_gen_hadoop_8_jan_2014Yarn spark next_gen_hadoop_8_jan_2014
Yarn spark next_gen_hadoop_8_jan_2014
 
Microservices, containers, and machine learning
Microservices, containers, and machine learningMicroservices, containers, and machine learning
Microservices, containers, and machine learning
 
Etosha - Data Asset Manager : Status and road map
Etosha - Data Asset Manager : Status and road mapEtosha - Data Asset Manager : Status and road map
Etosha - Data Asset Manager : Status and road map
 
New Features in Neo4j 3.4 / 3.3 - Graph Algorithms, Spatial, Date-Time & Visu...
New Features in Neo4j 3.4 / 3.3 - Graph Algorithms, Spatial, Date-Time & Visu...New Features in Neo4j 3.4 / 3.3 - Graph Algorithms, Spatial, Date-Time & Visu...
New Features in Neo4j 3.4 / 3.3 - Graph Algorithms, Spatial, Date-Time & Visu...
 
Distributed computing poli
Distributed computing poliDistributed computing poli
Distributed computing poli
 
Keynote at AImWD
Keynote at AImWDKeynote at AImWD
Keynote at AImWD
 
04 open source_tools
04 open source_tools04 open source_tools
04 open source_tools
 
Webinar: Enterprise Data Management in the Era of MongoDB and Data Lakes
Webinar: Enterprise Data Management in the Era of MongoDB and Data LakesWebinar: Enterprise Data Management in the Era of MongoDB and Data Lakes
Webinar: Enterprise Data Management in the Era of MongoDB and Data Lakes
 
Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習
Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習 Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習
Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習
 
Optimizing Your Supply Chain with Neo4j
Optimizing Your Supply Chain with Neo4jOptimizing Your Supply Chain with Neo4j
Optimizing Your Supply Chain with Neo4j
 
Modeling the Impact of R & Python Packages: Dependency and Contributor Networks
Modeling the Impact of R & Python Packages: Dependency and Contributor NetworksModeling the Impact of R & Python Packages: Dependency and Contributor Networks
Modeling the Impact of R & Python Packages: Dependency and Contributor Networks
 
Big data & hadoop framework
Big data & hadoop frameworkBig data & hadoop framework
Big data & hadoop framework
 
Relationships Matter: Using Connected Data for Better Machine Learning
Relationships Matter: Using Connected Data for Better Machine LearningRelationships Matter: Using Connected Data for Better Machine Learning
Relationships Matter: Using Connected Data for Better Machine Learning
 
Machine Learning with ML.NET and Azure - Andy Cross
Machine Learning with ML.NET and Azure - Andy CrossMachine Learning with ML.NET and Azure - Andy Cross
Machine Learning with ML.NET and Azure - Andy Cross
 
How Graph Databases used in Police Department?
How Graph Databases used in Police Department?How Graph Databases used in Police Department?
How Graph Databases used in Police Department?
 
FOSDEM2014 - Social Network Benchmark (SNB) Graph Generator - Peter Boncz
FOSDEM2014 - Social Network Benchmark (SNB) Graph Generator - Peter BonczFOSDEM2014 - Social Network Benchmark (SNB) Graph Generator - Peter Boncz
FOSDEM2014 - Social Network Benchmark (SNB) Graph Generator - Peter Boncz
 
Predicting Influence and Communities Using Graph Algorithms
Predicting Influence and Communities Using Graph AlgorithmsPredicting Influence and Communities Using Graph Algorithms
Predicting Influence and Communities Using Graph Algorithms
 

More from Neo4j

More from Neo4j (20)

Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
 
QIAGEN: Biomedical Knowledge Graphs for Data Scientists and Bioinformaticians
QIAGEN: Biomedical Knowledge Graphs for Data Scientists and BioinformaticiansQIAGEN: Biomedical Knowledge Graphs for Data Scientists and Bioinformaticians
QIAGEN: Biomedical Knowledge Graphs for Data Scientists and Bioinformaticians
 
EY_Graph Database Powered Sustainability
EY_Graph Database Powered SustainabilityEY_Graph Database Powered Sustainability
EY_Graph Database Powered Sustainability
 
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphSIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
 
Build your next Gen AI Breakthrough - April 2024
Build your next Gen AI Breakthrough - April 2024Build your next Gen AI Breakthrough - April 2024
Build your next Gen AI Breakthrough - April 2024
 
Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Graph Analytics: Graph Algorithms Inside Neo4j

  • 1. Optimized Graph Algorithms in Neo4j Use the Power of Connections to Drive Discovery January 2018 Mark Needham Amy Hodler
  • 2. Mark Needham Software Engineer, Neo4j mark.needham@neo4j.com @markhneedham Next 50 Minutes • Why Use Graph Analytics • Randomness vs. Reality • Graph Analytics Takes Off • How to Run Graph Analytics • Neo4j Graph Analytics and Algorithms • Demos and Implementation Graph Algorithms Real-World Networks Amy E. Hodler Analytics Marketing, Neo4j amy.hodler@neo4j.com @amyhodler
  • 4. Forecast Complex Network Behavior and Prescribe Action
  • 5. Cascading Failures Airline Congestion - 2010 Source: “Systemic delay propagation in the US airport network” – Fleurquin, Ramasco, Eguiluz
  • 7. Bridge Points Languages – Telecom Network Source: “Fast unfolding of communities in large networks” – Blondel, Guillaume, Lambiotte, Lefebvre
  • 8. Extract Structure and Model Processes
  • 10. Preferential Attachment Nodes tend to link to nodes that already have a lot of links Origins Debated • Local Mechanisms • Global Optimization • Mixed or Other Network Structures are Inseparable from Development
  • 11. Concentrated Distribution: Power Law Distribution. Many nodes with only a few links; a few hubs with a large number of links. Axes: Nodes with k Links vs. Number of links (k). Source: “How Stuff Spreads” – Pulsar Platform
  • 12. “There is No Network in Nature that we know of that would be described by the Random network model.” - Albert-László Barabási
  • 13. Small-World High local clustering and short average path lengths. Hub and spoke architecture. Scale-Free Hub and spoke architecture preserved at multiple scales. High power law distribution. Random Average distributions. No structure or hierarchical patterns.
  • 15. The Lure of Averages. Average Distribution (Random): most nodes have the same number of links; no highly connected nodes. Axes: Nodes with k Links vs. Number of Links (k). Source: Network Science - Barabasi. Art: Ulysses and the Sirens – Herbert James Draper
  • 16. Resist The Lure of Averages. Average Distribution (Random): most nodes have the same number of links; no highly connected nodes. Power Law Distribution (Scale-Free): many nodes with only a few links; a few hubs with a large number of links. Axes: Nodes with k Links vs. Number of Links (k). Source: Network Science - Barabasi
  • 17. Resist The Lure of Averages. Average Distribution (Random): most nodes have the same number of links; no highly connected nodes. You’ll Miss the Structure Hidden in Your Networks (Scale-Free, Small World). Art: Ulysses and the Sirens – Herbert James Draper
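The contrast drawn in slides 10 to 17 can be reproduced with a small experiment. The following is a minimal sketch, not taken from the talk: both function names are hypothetical, the random graph simply joins uniformly chosen node pairs (so repeated links are possible), and the growth rule is a simplified Barabási-Albert style preferential attachment.

```python
import random
from collections import Counter


def random_graph_degrees(n, num_edges, rng):
    """Degrees of a graph whose links join uniformly random node pairs."""
    degree = Counter()
    for _ in range(num_edges):
        a, b = rng.sample(range(n), 2)  # two distinct endpoints
        degree[a] += 1
        degree[b] += 1
    return [degree[i] for i in range(n)]


def preferential_attachment_degrees(n, m, rng):
    """Degrees of a graph grown node by node; each new node links to m
    existing nodes picked with probability proportional to their degree."""
    degree = Counter()
    endpoints = []  # each node appears once per link it has
    # Start from a fully connected core of m + 1 nodes.
    for a in range(m + 1):
        for b in range(a + 1, m + 1):
            degree[a] += 1
            degree[b] += 1
            endpoints += [a, b]
    for new in range(m + 1, n):
        targets = set()
        while len(targets) < m:
            targets.add(rng.choice(endpoints))  # degree-proportional pick
        for t in targets:
            degree[new] += 1
            degree[t] += 1
            endpoints += [new, t]
    return [degree[i] for i in range(n)]


rng = random.Random(42)
scale_free = preferential_attachment_degrees(2000, 2, rng)
uniform = random_graph_degrees(2000, sum(scale_free) // 2, rng)
print("largest degree, random graph:", max(uniform))
print("largest degree, preferential attachment:", max(scale_free))
```

Both graphs have the same number of nodes and links, so the average degree is identical; only the preferential-attachment graph grows a few hubs, which is the structure an average hides.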
  • 21. Critical Mass • Collect, share and analyze massive connected data • Discovered common principles and structures • Existing mathematical tools • Unfulfilled promises of big data
  • 23. Insights from Algorithms Graph Algorithms • Metrics • Relevance • Clustering • Structural Insights Machine Learning • Classification, Regression • NLP, Structural/Content Predictions • Neural Networks as Graphs • Graph As Compute Fabric
  • 24. Structures Can Hide Source: “Communities, modules and large-scale structure in networks” - Mark Newman Source: “Hierarchical structure and the prediction of missing links in networks”; “Structure and inference in annotated networks” - A. Clauset, C. Moore, and M.E.J. Newman.
  • 25. Graph of Thrones A. Beveridge: GoT - Interaction Graph from Books
  • 26. Graph of Thrones A. Beveridge: GoT - Interaction Graph from Books
  • 27. How to Run Graph Analytics?
  • 28. Existing Options (so far) • Data Processing: Spark with GraphX, Flink with Gelly • Dedicated Graph Processing: Urika, GraphLab, Giraph, Mosaic, GPS, Signal-Collect, Gradoop • Data Scientist Toolkit: igraph, NetworkX, Boost (graph-tool) in Python, R, C
  • 29. Drawbacks • Manage several tools: selection -> learning -> installation -> operation • Data selection, projection and transfer: tedious and time consuming • Scalability: especially classic data science tools
  • 30. An Example From Past GraphConnect
  • 31. Source: John Swain - Twitter Analytics Right Relevance Talk
  • 32. Many Moving Parts! Example Workflow Pipeline: Twitter Streaming API; Python Tweet Collection (includes user data); Rabbit MQ; MongoDB; Neo4j; R Scripts (Graph Stats, Community Detection); MySQL; Graph .graphml; Tableau; Graph Visualization. Notes: Moved from Twitter Search API to Streaming API. Replaced Python Twitter libraries (Tweepy) with raw API calls. Streaming tweets in message queue. Full tweets and user data stored in MongoDB. Built graph for analysis in Neo4j from tweets persisted in MongoDB. Analysis in R with iGraph libraries for algorithms. Some text analysis, e.g. LDA topics. Results published in MySQL for Tableau. Graphml for import to Gephi with stats precalculated.
  • 33. Our Goal Twitter Streaming API Python Tweet Collection (includes user data) Rabbit MQ MongoDB Neo4j R Scripts -Graph Stats -Community Detection MySQL Graph .graphml Tableau Graph Visualization Example Workflow Pipeline
  • 35. Neo4j Native Graph Database Analytics Integrations Cypher Query Language Wide Range of APOC Procedures Optimized Graph Algorithms
  • 36. Finds the optimal path or evaluates route availability and quality Evaluates how a group is clustered or partitioned Determines the importance of distinct nodes in the network
  • 37. Usage 1. Call as Cypher procedure 2. Pass in specification (Label, Prop, Query) and configuration 3. The ~.stream variant returns (a lot of) results: CALL algo.<name>.stream('Label','TYPE',{conf}) YIELD nodeId, score 4. The non-stream variant writes results to the graph and returns statistics: CALL algo.<name>('Label','TYPE',{conf})
  • 38. Cypher Projection: Pass in Cypher statements for node and relationship lists. CALL algo.<name>( 'MATCH ... RETURN id(n)', 'MATCH (n)-->(m) RETURN id(n) as source, id(m) as target', {graph:'cypher'})
  • 39. • PageRank (baseline) • Betweenness • Closeness • Degree Algorithms - Centralities Pathfinding Centrality Community Detection
  • 40. • Label Propagation • Union Find / WCC • Strongly Connected Components • Louvain • Triangle-Count / Clustering Coefficient Algorithms – Community Detection Pathfinding Community Detection Centrality
  • 41. • Single Source Shortest Path • All-Nodes SSP • Parallel BFS / DFS Algorithms - Pathfinding Centrality Community Detection Pathfinding
  • 42. Iterate Quickly • Combine data from sources into one graph • Project to relevant subgraphs • Enrich data with algorithms • Traverse, collect, filter, aggregate with queries • Visualize, Explore, Decide, Export • From all APIs and Tools
  • 44. Datasets Yelp Business Graph • 5m nodes • 17m relationships Bitcoin • 1.7bn nodes, • 2.7bn rels DBPedia • 11m nodes • 116m relationships
  • 46. DBPedia. Shallow Copy of Wikipedia: (Page) -[:Link]-> (Page)
      CALL algo.pageRank.stream('Page', 'Link', {iterations:5})
      YIELD node, score
      WITH * ORDER BY score DESC LIMIT 5
      RETURN node.title, score;
      +--------------------------------------+
      | node.title                 | score   |
      +--------------------------------------+
      | "United States"            | 13349.2 |
      | "Animal"                   | 6077.77 |
      | "France"                   | 5025.61 |
      | "List of sovereign states" | 4913.92 |
      | "Germany"                  | 4662.32 |
      +--------------------------------------+
      5 rows
      46 seconds
  • 47. DBPedia – Largest Clusters
      CALL algo.labelPropagation();
      // First 1M pages by Rank
      MATCH (n:Page) WITH n ORDER BY n.pagerank DESC LIMIT 1000000
      // group by partition
      WITH n.partition AS partition, count(*) AS clusterSize, collect(n.title) AS pages
      // return most influential node for largest clusters
      RETURN pages[0] AS mainPage, pages[1..10] AS otherPages
      ORDER BY clusterSize DESC LIMIT 20
  • 48. Yelp
  • 49. Yelp • Business Reviews by Users •Businesses have Categories and Locations •Users have Friends •Bi-partite-Network (:User)-->(:Business) projections (:User)<-->(:User) & (:Business)<-->(:Business)
  • 50. Yelp – Social - Statistics
      MATCH (u:User) where exists ( (u)-[:FRIENDS]-() )
      WITH u.average_stars as stars, u.review_count as reviews, u.funny as funny
      RETURN max(stars),avg(stars),stdev(stars),max(reviews),avg(reviews),stdev(reviews),max(funny),avg(funny),stdev(funny);
      +-------------------------------------------------------------------------------------------------------------------------------------------------------------------+
      | max(stars) | avg(stars)         | stdev(stars)       | max(reviews) | avg(reviews)      | stdev(reviews)     | max(funny) | avg(funny)        | stdev(funny)      |
      +-------------------------------------------------------------------------------------------------------------------------------------------------------------------+
      | 5.0        | 3.8238072950764947 | 0.8862511758625753 | 11284        | 45.81704314022204 | 120.52419266925014 | 170896     | 36.26637835535585 | 731.6024752545679 |
      +-------------------------------------------------------------------------------------------------------------------------------------------------------------------+
      MATCH (u:User) where exists ( (u)-[:FRIENDS]-() )
      WITH u.yelping_since as since
      RETURN substring(since,0,4) as year, count(*) as total
      ORDER BY year asc limit 10;
      +----------------+
      | year   | total |
      +----------------+
      | "2004" | 64    |
      | "2005" | 844   |
      | "2006" | 4504  |
      | "2007" | 11833 |
      | "2008" | 20729 |
      | "2009" | 33965 |
      | "2010" | 53046 |
      | "2011" | 70331 |
      | "2012" | 62596 |
      | "2013" | 57330 |
      +----------------+
  • 51. Yelp – Social - PageRank
      call algo.pageRank.stream('User','FRIENDS') yield node,score
      with node,score order by score desc limit 10
      return node {.name, .review_count, .average_stars,.useful,.yelping_since,.funny}, score,
             size( (node)<-[:FRIENDS]-()<-[:FRIENDS]-()) as in,
             size( (node)-[:FRIENDS]->()-[:FRIENDS]->()) as out;
      +-----------------------------------------------------------------------------------------------------------------------------------------------------+
      | node                                                                                                                           | score              |
      +-----------------------------------------------------------------------------------------------------------------------------------------------------+
      | {funny -> 61200, name -> "Philip", average_stars -> 3.93, review_count -> 788, useful -> 69448, yelping_since -> "2007-06-09"} | 208.31336799999994 |
      | {funny -> 21432, name -> "Des", average_stars -> 3.88, review_count -> 78, useful -> 140024, yelping_since -> "2014-04-01"}    | 201.28600150000003 |
      | {funny -> 465, name -> "Dallas", average_stars -> 4.17, review_count -> 330, useful -> 5517, yelping_since -> "2010-11-07"}    | 192.164762         |
      | {funny -> 1019, name -> "Cara", average_stars -> 3.96, review_count -> 842, useful -> 11738, yelping_since -> "2010-07-21"}    | 184.01898249999996 |
      | {funny -> 1233, name -> "Walker", average_stars -> 3.91, review_count -> 462, useful -> 12332, yelping_since -> "2007-01-25"}  | 180.48898350000005 |
      | {funny -> 13432, name -> "Gabi", average_stars -> 4.05, review_count -> 1730, useful -> 20759, yelping_since -> "2007-08-10"}  | 163.29424850000004 |
      | {funny -> 12848, name -> "Ruggy", average_stars -> 3.92, review_count -> 2118, useful -> 72265, yelping_since -> "2007-07-31"} | 161.87635500000002 |
      | {funny -> 9997, name -> "Bill", average_stars -> 3.38, review_count -> 595, useful -> 12074, yelping_since -> "2014-04-05"}    | 157.0438075        |
      | {funny -> 1544, name -> "Ashley", average_stars -> 3.7, review_count -> 224, useful -> 1610, yelping_since -> "2009-09-29"}    | 150.21423599999997 |
      | {funny -> 3599, name -> "Risa", average_stars -> 4.08, review_count -> 1044, useful -> 22121, yelping_since -> "2011-07-30"}   | 138.20863199999997 |
      +-----------------------------------------------------------------------------------------------------------------------------------------------------+
      10 rows
      3236 ms
  • 52. Yelp • Inferred network of users, via jointly reviewed businesses • (u1:User)-[:WROTE]->(review1)-[:REVIEWS]->(business)<-[:REVIEWS]-(review2)<-[:WROTE]-(u2:User) • 1.3bn paths • Inferred network of businesses, via jointly reviewed by user • (b1:Business)<-[:REVIEWS]-()<-[:WROTE]-(u)-[:WROTE]->()-[:REVIEWS]->(b2:Business) • 214m paths • subset: (b1:Business)-[:CO_OCCURENT_REVIEWS]-(b2:Business)
  • 53. Yelp •Inferred network of users, via jointly reviewed businesses • (u1:User)-[:WROTE]->(review1)-[:REVIEWS]->(business)<-[:REVIEWS]-(review2)<-[:WROTE]-(u2:User) • 1.3bn paths • Inferred network of businesses, via jointly reviewed by user • (b1:Business)<-[:REVIEWS]-()<-[:WROTE]-(u)-[:WROTE]->()-[:REVIEWS]->(b2:Business) • 214m paths
  • 54. Yelp – Business – Co-Occurrence •Find clusters of "similar" businesses •Find peer groups of similar people •Clusters of "interests"
  • 55. Yelp – Business – Co-Occurrence
      CALL apoc.periodic.iterate(
        'MATCH (b:Business)
         WHERE size((b)<-[:REVIEWS]-()) > 5 AND b.city="Las Vegas"
         RETURN b',
        'MATCH (b)<-[:REVIEWS]-(r1)<-[:WROTE]-(u)-[:WROTE]->(r2)-[:REVIEWS]->(b2)
         WHERE id(b) < id(b2) AND b2.city="Las Vegas"
           AND size((b2)<-[:REVIEWS]-()) > 5 AND r1.stars = r2.stars
         WITH b, b2, count(*) AS weight, avg(r1.stars) as rating
         where weight > 5
         MERGE (b)-[cr:B2B]-(b2)
           ON CREATE SET cr.weight = weight, cr.rating = rating
         SET b:Marked, b2:Marked',
        {batchSize: 1});
  • 56. Yelp - Clustering Union Find
      CALL algo.unionFind.stream(
        'MATCH (b:Business:Marked) RETURN id(b) as id',
        'MATCH (b1:Business:Marked)-[r:B2B]-(b2)
         RETURN id(b1) as source, id(b2) as target, count(r) as value',
        {graph:'cypher'})
      YIELD setId as cluster, nodeId
      RETURN cluster, count(*) as size ORDER BY size DESC LIMIT 10;
      +----------------+
      | cluster | size |
      +----------------+
      | 3       | 5625 |
      | 1876    | 3    |
      | 155     | 2    |
      | 1091    | 2    |
      | 1728    | 2    |
      | 1177    | 2    |
      | 337     | 2    |
      | 3046    | 2    |
      | 674     | 2    |
      | 1948    | 2    |
      +----------------+
      10 rows
      6615 ms
  • 57. Yelp - PageRank
      CALL algo.pageRank.stream(
        'MATCH (b:Business:Marked) RETURN id(b) as id',
        'MATCH (b1:Business:Marked)-[r:B2B]-(b2)
         RETURN id(b1) as source, id(b2) as target',
        {graph:'cypher'})
      YIELD node, score
      RETURN node.name, score ORDER BY score DESC LIMIT 10;
      +-------------------------------------------------------+
      | node.name                        | score              |
      +-------------------------------------------------------+
      | "McCarran International Airport" | 27.49973599999999  |
      | "Hash House A Go Go"             | 19.062398000000005 |
      | "Bachi Burger"                   | 18.1494385         |
      | "Mon Ami Gabi"                   | 17.720350000000003 |
      | "Bacchanal Buffet"               | 15.783480500000003 |
      | "Yard House Town Square"         | 14.427296999999998 |
      | "Secret Pizza"                   | 13.156547          |
      | "Rollin Smoke Barbeque"          | 12.808718499999998 |
      | "Wicked Spoon"                   | 12.639942499999997 |
      | "Monta Ramen"                    | 12.3904845         |
      +-------------------------------------------------------+
      10 rows
      6979 ms
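For readers who want to see what a PageRank score measures, here is a minimal sketch of the textbook power-iteration method. It is an illustration only and not the Neo4j implementation: the function name, the dict-of-lists input, and the damping default are assumptions, and the scores are normalized to sum to 1, unlike the unnormalized scores shown in the slides.

```python
def pagerank(links, damping=0.85, iterations=20):
    """Power-iteration PageRank over {node: [targets]}; scores sum to 1."""
    nodes = set(links)
    for targets in links.values():
        nodes.update(targets)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes without outgoing links is spread evenly.
        dangling = sum(rank[v] for v in nodes if not links.get(v))
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {v: base for v in nodes}
        for source, targets in links.items():
            if not targets:
                continue
            share = damping * rank[source] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical toy graph: three pages link to a hub, which links back.
toy = {"A": ["Hub"], "B": ["Hub"], "C": ["Hub"], "Hub": ["A", "B", "C"]}
for name, score in sorted(pagerank(toy).items(), key=lambda kv: -kv[1]):
    print(name, round(score, 4))
```

The node that receives links from every other node ends up with the largest score, which is the same effect that puts the airport at the top of the co-review graph above.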
  • 59. BitCoin Graph • Full Copy of the BitCoin BlockChain • from learnmeabitcoin.com (Greg Walker) • 1.7 billion nodes, 2.7 billion rels • 474k blocks, 240m tx, 280m addresses, 650m outputs • 600 GB on disk
  • 61. BitCoin Graph: Distribution of "locked" relationships for "addresses" (participation in transactions)
      call apoc.stats.degrees('<locked');
      +--------------------------------------------------------------------------------------------------------------+
      | type     | direction  | total     | p50 | p75 | p90 | p95 | p99 | p999 | max     | min | mean                |
      +--------------------------------------------------------------------------------------------------------------+
      | "locked" | "INCOMING" | 654662356 | 0   | 0   | 1   | 1   | 2   | 28   | 1891327 | 0   | 0.37588608290716047 |
      +--------------------------------------------------------------------------------------------------------------+
      1 row
      308 seconds
  • 62. BitCoin Graph: Inferred network of addresses, via transaction and output
      (a1)<-[:locked]-(o1)-[:in]->(tx)-[:out]->(o2)-[:locked]->(a2)
      CALL algo.unionFind.stream(
        'match (o:output)-[:locked]->(a) with a limit 10000000 return id(a) as id',
        'match (o:output)-[:locked]->(a) with o, a limit 10000000
         match (o)-[:in]->(tx)-[:out]->(o2)-[:locked]->(a2)
         return id(a) as source, id(a2) as target, count(tx) as weight',
        {graph:'cypher'})
      YIELD setId as cluster, nodeId
      RETURN cluster, count(*) AS size ORDER BY size DESC LIMIT 10;
      +-------------------+
      | cluster | size    |
      +-------------------+
      | 5036    | 4409420 |
      | 6295282 | 1999    |
      | 5839746 | 1488    |
      | 9356302 | 833     |
      | 6560901 | 733     |
      | 6370777 | 637     |
      | 8101710 | 392     |
      | 5945867 | 369     |
      | 2489036 | 264     |
      | 1703620 | 203     |
      +-------------------+
      10 rows, 296 seconds
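The clustering in the Union Find slides groups nodes into connected components. A minimal sketch of the classic disjoint-set technique is shown below; it is not the library's parallel implementation, the function name and the toy edge list are hypothetical, and node ids are assumed to be comparable values such as integers.

```python
from collections import Counter


def connected_components(node_ids, edges):
    """Map each node id to a set id (the smallest node id in its component)."""
    parent = {node: node for node in node_ids}

    def find(x):
        # Follow parent pointers to the root, halving the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for source, target in edges:
        root_s, root_t = find(source), find(target)
        if root_s != root_t:
            # Keep the smaller id as the representative of the merged set.
            parent[max(root_s, root_t)] = min(root_s, root_t)
    return {node: find(node) for node in node_ids}


# Hypothetical toy data: one cluster of three, one pair, two isolated nodes.
set_ids = connected_components(range(7), [(0, 1), (1, 2), (3, 4)])
sizes = Counter(set_ids.values())
for cluster, size in sizes.most_common(3):
    print(cluster, size)
```

Counting nodes per set id and sorting by size mirrors the `RETURN cluster, count(*) AS size ORDER BY size DESC` step in the queries above.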
  • 64. Design Considerations • Ease of Use – Call as Procedures • Parallelize everything: load, compute, write • Efficiency: Use direct access, efficient data structures, provide high-level API • Scale to billions of nodes and relationships • Use up to hundreds of CPUs and Terabytes of RAM
  • 65. Architecture 1. Load data in parallel from Neo4j 2. Store in efficient data structures 3. Run graph algorithm in parallel using Graph API 4. Write data back in parallel. Diagram labels: Neo4j, Algorithm Datastructures, Graph API
  • 67. Neo4j Graph Platform with Neo4j Algorithms vs. Apache Spark’s GraphX: Neo4j is Significantly Faster. Bar chart (seconds, axis 0 to 450) for Union-Find (Connected Components) and PageRank, comparing GraphX and Neo4j; bar values shown: 251, 152, 416, 124. Spark GraphX results publicly available • Amazon EC2 cluster running 64-bit Linux • 128 CPUs with 68 GB of memory, 2 hard disks. Neo4j Configuration • Physical machine running 64-bit Linux • 128 CPUs with 55 GB RAM, SSDs. Twitter 2010 Dataset • 1.47 Billion Relationships • 41.65 Million Nodes
  • 68. Compute At Scale – Payment Graph: 3,000,000,000 nodes and 18,000,000,000 relationships (600G). PageRank (20 iterations) on 1 machine, 20 threads, 900G RAM
      call algo.pageRank('Account','SENT', {graph:'huge',iterations:20,write:true,concurrency:20});
      +-------------------------------------------------------------------+
      | nodes     | iterations | loadMillis | computeMillis | writeMillis |
      +-------------------------------------------------------------------+
      | 300000000 | 20         | 401404     | 6024994       | 47106       |
      +-------------------------------------------------------------------+
      1 row
      6473526 ms -> 1h 47min
  • 69. We Need Your Feedback • neo4j.com/slack at #neo4j-graph-algorithms • github.com/neo4j-contrib/neo4j-graph-algorithms • Whitepaper on neo4j.com/graph-analytics
  • 70. “Graphs are one of the Unifying Themes of computer science . . . That so many different structures can be modeled using a single formalism is a Source of Great Power to the educated programmer.” - Steven S. Skiena, The Algorithm Design Manual
  • 71. Kudos: Paul Horn, Martin Knobloch from Avantgarde Labs, Tomasz Bratanic (docs)