MILAN 20/21.11.2015
Graphs are everywhere!
Distributed graph computing with Spark GraphX
Andrea Iacono
MILAN 20/21.11.2015 - Andrea Iacono
Agenda:
● Graph definitions and usages
● GraphX introduction
● Pregel
● Code examples
The main focus will be the programming model
The code is available at:
https://github.com/andreaiacono/TalkGraphX
A graph is a set of vertices and edges that connect them:
(figure labels: Edge, Vertex)
Graphs are used for modeling very different domains.
Networks
Routing
PageRank
Definitions
● Undirected / Directed
● Connected / Disconnected
● Complete (K5) / Bipartite and complete (K2,3)
● Cyclic / Acyclic
● Multigraph / Pseudograph
An undirected acyclic connected graph is a tree!
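The tree definition above can be checked mechanically. What follows is a minimal Python sketch, not code from the talk: the function name `is_tree` and the input format (a vertex count plus a list of undirected edges over vertices 0..n-1) are assumptions made for illustration. It relies on the standard fact that a connected graph with n vertices is acyclic exactly when it has n - 1 edges.

```python
from collections import defaultdict, deque

def is_tree(n, edges):
    """Return True if the undirected graph on vertices 0..n-1 is a tree."""
    # A tree on n vertices has exactly n - 1 edges.
    if n == 0 or len(edges) != n - 1:
        return False
    # Build an adjacency list, storing each undirected edge in both directions.
    adj = defaultdict(list)
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    # Breadth-first search from vertex 0 to test connectivity.
    seen = {0}
    queue = deque([0])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in seen:
                seen.add(v)
                queue.append(v)
    # With n - 1 edges, reaching every vertex implies there is no cycle.
    return len(seen) == n
```

A star with one centre and three leaves passes the check, while a triangle plus an isolated vertex has the right number of edges but fails the connectivity test.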
What's wrong with MapReduce?
Every MapReduce run reads the initial data from disk (e.g. HDFS),
computes the results and then stores them back on disk. Since most
graph algorithms are iterative, every iteration has to read the whole
dataset from disk and write it back.
It's better to use a distributed dataflow framework.
GraphX is a graph processing system
built on top of Apache Spark
“Graph processing systems represent graph structured data as a property
graph, which associates user-defined properties with each vertex and edge.”
“The Spark storage abstraction called Resilient Distributed Datasets (RDDs)
enables applications to keep data in memory, which is essential for iterative
graph algorithms.”
“RDDs permit user-defined data partitioning, and the execution engine can
exploit this to co-partition RDDs and co-schedule tasks to avoid data
movement. This is essential for encoding partitioned graphs.”
Excerpt from GraphX: Graph Processing in a Distributed Dataflow Framework
https://amplab.cs.berkeley.edu/wp-content/uploads/2014/09/graphx.pdf
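The property graph mentioned in the first excerpt can be pictured with a small data structure. This is a plain-Python sketch under stated assumptions, not the GraphX API: the class name `PropertyGraph` and its methods are hypothetical, and the `triplets` helper is only similar in spirit to the triplets view that GraphX offers over a graph.

```python
from dataclasses import dataclass, field

@dataclass
class PropertyGraph:
    """Hypothetical in-memory property graph: user-defined data on vertices and edges."""
    vertices: dict = field(default_factory=dict)  # vertex id -> vertex property
    edges: list = field(default_factory=list)     # (source id, destination id, edge property)

    def add_vertex(self, vid, prop):
        self.vertices[vid] = prop

    def add_edge(self, src, dst, prop):
        # Refuse edges whose endpoints are not known vertices.
        if src not in self.vertices or dst not in self.vertices:
            raise KeyError("both endpoints must exist")
        self.edges.append((src, dst, prop))

    def triplets(self):
        # Yield (source property, edge property, destination property) for each edge.
        for src, dst, prop in self.edges:
            yield (self.vertices[src], prop, self.vertices[dst])

g = PropertyGraph()
g.add_vertex(1, {"name": "Alice"})
g.add_vertex(2, {"name": "Bob"})
g.add_edge(1, 2, {"relation": "follows"})
```

In a distributed system the vertex and edge collections would be partitioned across machines; here they are ordinary Python containers so that the shape of the data stays visible.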
GraphX / Spark software stack
(image source: Spark site)
Graph Databases
● Storage
● Query Language
● Transactions
● Examples: Neo4j, OrientDB, Titan
Graph Processing Systems
● APIs for traversing and processing
● Better performance (in-memory data)
● Examples: GraphX, Giraph, GraphLab
Pregel
is a computational model designed by Google
(https://kowshik.github.io/JPregel/pregel_paper.pdf)
It consists of a sequence of supersteps that runs until termination. In each superstep,
every vertex can:
● modify its state or that of its outgoing edges
● receive the messages sent to it during the previous superstep
● send messages to other vertices (they will be received in the next superstep)
● vote to halt
When a vertex votes to halt, it becomes inactive; if it receives a message in a later
superstep, the framework wakes it up by setting its state back to active.
When all the vertices have voted to halt and no messages are in transit, the computation
stops; a maximum number of iterations can also be set.
Edges don't perform any computation.
When writing algorithms, you have to think like a vertex.
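The superstep loop can be simulated on a single machine. The sketch below is plain Python written for illustration, not Pregel or GraphX code: the function name `pregel_max`, its parameters and the sample graph are assumptions. Each vertex keeps the largest value it has seen, forwards it only when it changed, and otherwise votes to halt; an incoming message wakes a halted vertex again.

```python
def pregel_max(values, out_edges, max_supersteps=30):
    """Propagate the maximum value through the graph, one superstep at a time.

    values: vertex id -> initial value
    out_edges: vertex id -> list of target vertex ids
    """
    values = dict(values)
    active = set(values)                 # every vertex starts active
    inbox = {v: [] for v in values}      # messages to be read in this superstep
    superstep = 0
    while superstep < max_supersteps and (active or any(inbox.values())):
        outbox = {v: [] for v in values}  # messages to be read in the next superstep
        for v in values:
            msgs = inbox[v]
            if v not in active and not msgs:
                continue                 # halted and nothing received: stay asleep
            active.add(v)                # a received message wakes the vertex up
            new_value = max([values[v]] + msgs)
            changed = new_value != values[v]
            values[v] = new_value
            if superstep == 0 or changed:
                for target in out_edges.get(v, []):
                    outbox[target].append(new_value)
            else:
                active.discard(v)        # vote to halt
        inbox = outbox
        superstep += 1
    return values, superstep

# Sample graph (an assumption): a path a - b - c - d with edges in both directions.
values = {"a": 3, "b": 6, "c": 2, "d": 1}
out_edges = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
result, supersteps = pregel_max(values, out_edges)
# Every vertex ends up holding 6, the largest initial value.
```

Keeping separate `inbox` and `outbox` dictionaries is what makes the sequential loop behave like a synchronous superstep: no vertex sees a message that was sent during the same superstep.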
Pregel sample
Image source: Pregel paper
GraphX implementation of Pregel
GraphX uses three functions for implementing Pregel:
● vprog: the vertex program, run on each vertex that receives a message; it takes
the incoming message and computes a new vertex value
● sendMsg: the function used for sending messages to other vertices
● mergeMsg: a function that takes two incoming messages and merges
them into a single message
Unlike Google's Pregel, the GraphX implementation of Pregel:
● leaves message construction out of the vertex program, to allow
a more efficient distributed execution
● permits access to the attributes of both vertices of an edge while building the
messages
● constrains message sending to the graph structure (only to neighbours)
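The division of work among the three functions can be imitated in a few lines. The following is a single-machine Python sketch, not the GraphX Scala API: the driver `pregel`, its parameter names and the sample graph are assumptions, and only the roles of `vprog`, `send_msg` and `merge_msg` mirror the functions listed above. The example computes single-source shortest paths from vertex 1.

```python
import math

def pregel(vertices, edges, initial_msg, vprog, send_msg, merge_msg, max_iterations=20):
    """Simplified stand-in for a GraphX-style Pregel loop.

    vertices: vertex id -> attribute; edges: list of (src id, dst id, edge attribute).
    """
    # First round: every vertex receives the initial message.
    vertices = {vid: vprog(vid, attr, initial_msg) for vid, attr in vertices.items()}
    active = set(vertices)  # vertices that received a message in the previous round
    for _ in range(max_iterations):
        inbox = {}
        for src, dst, attr in edges:
            # Only look at edges touching a vertex that just received a message.
            if src not in active and dst not in active:
                continue
            # send_msg sees the attributes of both endpoints of the edge.
            for target, msg in send_msg(src, vertices[src], dst, vertices[dst], attr):
                inbox[target] = merge_msg(inbox[target], msg) if target in inbox else msg
        if not inbox:
            break  # no messages were sent: the computation has converged
        for vid, msg in inbox.items():
            vertices[vid] = vprog(vid, vertices[vid], msg)
        active = set(inbox)
    return vertices

def vprog(vid, dist, msg):
    # Keep the shortest distance seen so far.
    return min(dist, msg)

def send_msg(src, src_dist, dst, dst_dist, weight):
    # Message the destination only when the path through src is shorter.
    if src_dist + weight < dst_dist:
        return [(dst, src_dist + weight)]
    return []

def merge_msg(a, b):
    # Several candidate distances for one vertex collapse to the smallest.
    return min(a, b)

vertices = {1: 0.0, 2: math.inf, 3: math.inf, 4: math.inf}
edges = [(1, 2, 1.0), (2, 3, 2.0), (1, 3, 5.0), (3, 4, 1.0)]
distances = pregel(vertices, edges, math.inf, vprog, send_msg, merge_msg)
```

Note how `send_msg` works on an edge and reads both endpoint attributes, while `vprog` never builds messages itself; that is the separation the slide describes.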
GraphX Pregel communication diagram
When to use GraphX
GraphX is well suited for algorithms that:
● respect the neighborhood structure
GraphX is NOT well suited for algorithms that:
● need interaction among distant vertices
● change the structure of the graph
Algorithms out of the box (as of Spark v1.5.1):
- Connected Components
- Label Propagation
- PageRank
- SVD++
- Shortest Paths
- Strongly Connected Components
- Triangle Count
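To give a feel for what one of these algorithms computes, here is a single-machine Python sketch of connected components. It is not the GraphX implementation: the function name `connected_components` and the input format are assumptions. It follows the same neighbourhood-only idea of spreading the smallest vertex id along edges until nothing changes, so every vertex ends up labelled with the lowest id in its component.

```python
def connected_components(vertex_ids, edges):
    """Label every vertex with the smallest vertex id in its component.

    vertex_ids: iterable of comparable vertex ids
    edges: list of (u, v) pairs, treated as undirected
    """
    label = {v: v for v in vertex_ids}  # each vertex starts in its own component
    changed = True
    while changed:
        changed = False
        for u, v in edges:
            low = min(label[u], label[v])
            # Pull both endpoints down to the smaller of their two labels.
            if label[u] != low or label[v] != low:
                label[u] = label[v] = low
                changed = True
    return label

components = connected_components(range(1, 7), [(1, 2), (2, 3), (4, 5)])
```

Vertices 1, 2 and 3 share label 1, vertices 4 and 5 share label 4, and the isolated vertex 6 keeps its own id.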
Now some code!
Questions & Answers
Leave your feedback on Joind.in!
https://m.joind.in/event/codemotion-milan-2015

Student Profile Sample report on improving academic performance by uniting gr...
 
Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)
 
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
 
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
 
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
 
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
 
Real-Time AI Streaming - AI Max Princeton
Real-Time AI  Streaming - AI Max PrincetonReal-Time AI  Streaming - AI Max Princeton
Real-Time AI Streaming - AI Max Princeton
 
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
 
Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2
 
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
 
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
 
Defining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryDefining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data Story
 
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
 

Distributed Graph Computing with Spark GraphX

  • 1. MILAN 20/21.11.2015. Graphs are everywhere! Distributed graph computing with Spark GraphX. Andrea Iacono
  • 2. Agenda: graph definitions and usages; GraphX introduction; Pregel; code examples. The main focus will be the programming model. The code is available at: https://github.com/andreaiacono/TalkGraphX
  • 3. A graph is a set of vertices and the edges that connect them. Graphs are used for modeling very different domains. (Figure labels: Vertex, Edge)
  • 4. Networks
  • 5. Routing
  • 6. PageRank
  • 7. Definitions: Undirected / Directed
  • 8. Definitions: Connected / Disconnected
  • 9. Definitions: Complete (K5) / Bipartite (and complete) (K2,3)
  • 10. Definitions: Cyclic / Acyclic
  • 11. Definitions: Multigraph / Pseudograph
  • 12. Definitions: An undirected acyclic connected graph is a tree!
  • 13. What's wrong with MapReduce? Every run of MapReduce reads the initial data from disk (e.g. HDFS), computes the results and then stores them on disk. Since most graph algorithms are iterative, the whole dataset must be read from and written to disk at every iteration. It's better to use a distributed dataflow framework.
  • 14. GraphX is a graph processing system built on top of Apache Spark. “Graph processing systems represent graph structured data as a property graph, which associates user-defined properties with each vertex and edge.” “The Spark storage abstraction called Resilient Distributed Datasets (RDDs) enables applications to keep data in memory, which is essential for iterative graph algorithms.” “RDDs permit user-defined data partitioning, and the execution engine can exploit this to co-partition RDDs and co-schedule tasks to avoid data movement. This is essential for encoding partitioned graphs.” Excerpt from "GraphX: Graph Processing in a Distributed Dataflow Framework", https://amplab.cs.berkeley.edu/wp-content/uploads/2014/09/graphx.pdf
  • 15. GraphX / Spark software stack (image source: Spark site)
  • 16. Graph Databases: storage; query language; transactions; examples: Neo4j, OrientDB, Titan. Graph Processing Systems: APIs for traversing and processing; better performance (in-memory data); examples: GraphX, Giraph, GraphLab.
  • 17. Pregel is a computational model designed by Google (https://kowshik.github.io/JPregel/pregel_paper.pdf). It consists of a sequence of supersteps that runs until termination. In each superstep, every vertex can:
      ● modify its state or that of its outgoing edges
      ● receive the messages sent to it during the previous superstep
      ● send messages to other vertices (they will be received in the next superstep)
      ● vote to halt
    When a vertex votes to halt, it becomes inactive; if it receives a message in a later superstep, the framework wakes it up by setting its state back to active. The computation stops when all the vertices have voted to halt and no messages are in transit; alternatively, a maximum number of iterations can be set. Edges do not perform any computation. When writing algorithms, you have to think like a vertex.
  • 18. Pregel sample (image source: Pregel paper)
  • 19. GraphX implementation of Pregel. GraphX uses three functions for implementing Pregel:
      ● vprog: the vertex program, run on each vertex that receives an incoming message, which computes a new vertex value
      ● sendMsg: the function used for sending messages to other vertices
      ● mergeMsg: a function that takes two incoming messages and merges them into a single message
    Unlike Google's Pregel, the GraphX implementation of Pregel:
      ● leaves the message construction out of the vertex program, to allow a more efficient distributed execution
      ● permits access to the attributes of both vertices of an edge while building the messages
      ● constrains message sending to the graph structure (messages go only to neighbours)
  • 20. GraphX Pregel communication diagram
  • 21. When to use GraphX. GraphX is well suited for algorithms that: respect the neighborhood structure. GraphX is NOT well suited for algorithms that: need iteration among distant vertices; change the structure of the graph.
  • 22. Algorithms out of the box (as of Spark v1.5.1): Connected Components, Label Propagation, PageRank, SVD++, Shortest Paths, Strongly Connected Components, Triangle Count
  • 23. Now some code!
  • 24. Questions & Answers
  • 25. The code is available at: https://github.com/andreaiacono/TalkGraphX
  • 26. Leave your feedback on Joind.in! https://m.joind.in/event/codemotion-milan-2015
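
To make the roles of vprog, sendMsg and mergeMsg (slide 19) more concrete, here is a minimal sketch that simulates the model in plain Python on a single machine. It is not the GraphX API (which is Scala) and not the code from the talk's repository: the names `pregel`, `Triplet`, `relax` and `shortest_paths` are hypothetical, and the rule for choosing which edges may send messages only loosely mirrors what GraphX does. The example runs single-source shortest paths, one of the algorithms listed on slide 22.

```python
import math
from typing import NamedTuple


class Triplet(NamedTuple):
    """An edge plus the current values of both endpoints (hypothetical helper)."""
    src: int
    src_value: float
    dst: int
    dst_value: float
    attr: float


def pregel(vertices, edges, initial_msg, vprog, send_msg, merge_msg,
           max_iterations=20):
    """Run supersteps until no messages are sent or max_iterations is reached."""
    # Superstep 0: every vertex runs vprog on the initial message.
    values = {vid: vprog(vid, val, initial_msg) for vid, val in vertices.items()}
    active = set(values)
    for _ in range(max_iterations):
        inbox = {}
        for src, dst, attr in edges:
            # Assumption: an edge may send only if one endpoint was just updated.
            if src not in active and dst not in active:
                continue
            triplet = Triplet(src, values[src], dst, values[dst], attr)
            for target, msg in send_msg(triplet):
                # mergeMsg combines messages addressed to the same vertex.
                inbox[target] = merge_msg(inbox[target], msg) if target in inbox else msg
        if not inbox:
            break  # nobody sent anything: the computation has converged
        for vid, msg in inbox.items():
            values[vid] = vprog(vid, values[vid], msg)
        active = set(inbox)
    return values


def relax(t):
    """sendMsg: offer the destination a shorter distance, if there is one."""
    if t.src_value + t.attr < t.dst_value:
        yield (t.dst, t.src_value + t.attr)


def shortest_paths(edges, source):
    """Distances from source over directed (src, dst, weight) edges."""
    ids = {v for src, dst, _ in edges for v in (src, dst)} | {source}
    vertices = {vid: (0.0 if vid == source else math.inf) for vid in ids}
    return pregel(vertices, edges, math.inf,
                  vprog=lambda vid, dist, msg: min(dist, msg),
                  send_msg=relax,
                  merge_msg=min)


if __name__ == "__main__":
    sample_edges = [(1, 2, 1.0), (2, 3, 2.0), (1, 3, 5.0), (3, 4, 1.0)]
    print(shortest_paths(sample_edges, source=1))
```

Note how the message is built in `relax` from the edge triplet rather than inside the vertex program, and how a vertex can only reach the other end of one of its edges: these are the two differences from Google's Pregel that slide 19 points out.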

Editor's Notes

  1. Questions for the audience: Who knows what a graph is? Who has ever used one? Who knows the most commonly used algorithms (BFS, DFS, Dijkstra)? Who knows Scala?
  2. Vertices and edges.
  3. Triangle counting for grouping. Commercial interest: targeted offers to groups with the same interests.
  4. Vertices = intersections; edges = roads. Shortest-path algorithm (Dijkstra), where the edges carry several weights: typically distance, traffic, toll payment, etc.
  5. Pages = vertices; edges = incoming links. Each outgoing edge has a weight tied to that of its vertex; the greater the sum of the values of the incoming edges, the greater the weight of the vertex. Iterative algorithm.
  6. Directed / undirected.
  7. Connected / disconnected.
  8. K is the standard nomenclature for this type of graph. A bipartite graph is useful for e-commerce, where every user node can buy any of the product nodes.
  9. Cyclic / acyclic (i.e. without cycles).
  10. Multigraph: when there can be several edges with the same source and the same destination. Pseudograph: when an edge can have the same vertex as both source and destination.
  11. When I said that graphs are everywhere, this is mostly why!
  12. Here we are talking about large graphs, which do not fit in the RAM of a single PC.
  13. The graph shown is a multi-pseudograph. ????? Internal representation?
  14. Unlike Spark, which offers its APIs in Scala, Java and Python, GraphX offers them only in Scala; however, the other languages should become available in the near future.
  15. Gremlin graph query language (TinkerPop): Gremlin is a DSL for traversing property graphs. Neo4j uses (proprietary) Cypher as its native query language. Titan is a graph database that supports the following storage backends: Cassandra (column), HBase (column), BerkeleyDB (key-value).
  16. Imagine we have a value for each vertex and want to find the maximum value in the whole graph. With this computational model, the idea is that we have to propagate information between the nodes. In each superstep, every vertex that has received a value higher than its own sends it to all its neighbours. When no vertex changes any more, the algorithm has terminated.
  17. Commutative: 2 + 3 == 3 + 2. Associative: (2 + 3) + 4 == 2 + (3 + 4).
  18. JetBrains prize draw.
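
The maximum-value example from note 16 can be sketched in a few lines of plain Python. This is a single-machine simulation written for illustration, not GraphX code; the function name `propagate_max` and the adjacency-list input format are assumptions. Incoming messages are merged with `max`, which is commutative and associative (note 17), so the order in which messages arrive does not affect the result.

```python
from typing import Dict, List


def propagate_max(values: Dict[int, int],
                  neighbours: Dict[int, List[int]]) -> Dict[int, int]:
    """Spread the largest vertex value through each connected component."""
    values = dict(values)  # work on a copy, leave the caller's dict untouched

    # Superstep 0: every vertex sends its own value to all its neighbours.
    inbox: Dict[int, List[int]] = {v: [] for v in values}
    for vertex, value in values.items():
        for n in neighbours.get(vertex, []):
            inbox[n].append(value)

    # Keep running supersteps while at least one message is in transit.
    while any(inbox.values()):
        next_inbox: Dict[int, List[int]] = {v: [] for v in values}
        for vertex, messages in inbox.items():
            if not messages:
                continue  # no message received: the vertex stays inactive
            best = max(messages)  # merge step: commutative and associative
            if best > values[vertex]:
                # The vertex changed, so it tells all its neighbours.
                values[vertex] = best
                for n in neighbours.get(vertex, []):
                    next_inbox[n].append(best)
            # Otherwise the vertex sends nothing, i.e. it votes to halt.
        inbox = next_inbox
    return values


if __name__ == "__main__":
    # A chain of four vertices: 1 - 2 - 3 - 4 (undirected, so both directions).
    chain = {1: [2], 2: [1, 3], 3: [2, 4], 4: [3]}
    print(propagate_max({1: 3, 2: 6, 3: 2, 4: 1}, chain))
```

On a disconnected graph each component converges to its own maximum, because messages only travel along edges.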