Distributed Graph Computing with Spark GraphX

MILAN 20/21.11.2015
Graphs are everywhere!
Andrea Iacono

Agenda:
● Graph definitions and usages
● GraphX introduction
● Pregel
● Code examples
The main focus will be the programming model
The code is available at:
https://github.com/andreaiacono/TalkGraphX

A graph is a set of vertices and edges that connect them:
(image: a graph with one edge and one vertex labelled)
Graphs are used for modeling very different domains.

Networks

Routing

Page Rank

Definitions
● Undirected / Directed
● Connected / Disconnected
● Complete (K5) / Bipartite and complete (K2,3)
● Cyclic / Acyclic
● Multigraph / Pseudograph
An undirected acyclic connected graph is a tree!
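
The tree definition can be checked mechanically. The sketch below is plain Python, not GraphX code, and `is_tree` is a name made up for this example. It relies on a standard fact: a connected undirected graph with n vertices is acyclic exactly when it has n - 1 edges. It assumes that every edge refers to a vertex in the given list.

```python
from collections import defaultdict, deque


def is_tree(vertices, edges):
    """Return True if the undirected graph (vertices, edges) is a tree."""
    vertices = list(vertices)
    if not vertices:
        return False  # convention chosen for this sketch
    # A tree on n vertices has exactly n - 1 edges.
    if len(edges) != len(vertices) - 1:
        return False
    # Undirected graph: store each edge in both directions.
    adjacency = defaultdict(set)
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    # Breadth-first search from an arbitrary vertex to test connectivity.
    seen = {vertices[0]}
    queue = deque([vertices[0]])
    while queue:
        current = queue.popleft()
        for neighbour in adjacency[current]:
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return len(seen) == len(vertices)


print(is_tree([1, 2, 3, 4], [(1, 2), (1, 3), (3, 4)]))  # connected, acyclic
print(is_tree([1, 2, 3], [(1, 2), (2, 3), (3, 1)]))     # cyclic
print(is_tree([1, 2, 3, 4], [(1, 2), (3, 4)]))          # disconnected
```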

What's wrong with MapReduce?
Every MapReduce run reads the initial data from disk (e.g. HDFS),
computes the results and then stores them back on disk. Since most
graph algorithms are iterative, the whole dataset must be read from
and written to disk at every iteration.
It's better to use a distributed dataflow framework.

GraphX is a graph processing system
built on top of Apache Spark
“Graph processing systems represent graph structured data as a property
graph, which associates user-defined properties with each vertex and edge.”
“The Spark storage abstraction called Resilient Distributed Datasets (RDDs)
enables applications to keep data in memory, which is essential for iterative
graph algorithms.”
“RDDs permit user-defined data partitioning, and the execution engine can
exploit this to co-partition RDDs and co-schedule tasks to avoid data
movement. This is essential for encoding partitioned graphs.”
Excerpt from GraphX: Graph Processing in a Distributed Dataflow Framework
https://amplab.cs.berkeley.edu/wp-content/uploads/2014/09/graphx.pdf
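
To make the property graph idea from the first quote concrete, here is a minimal plain-Python sketch. It is not the GraphX API: the `Edge` and `PropertyGraph` classes and the `triplets` method are names chosen for this example, loosely inspired by the vertex, edge and triplet views GraphX offers. The sample data is invented.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Tuple


@dataclass(frozen=True)
class Edge:
    src: int   # id of the source vertex
    dst: int   # id of the destination vertex
    attr: Any  # user-defined edge property


@dataclass
class PropertyGraph:
    vertices: Dict[int, Any]  # vertex id -> user-defined vertex property
    edges: List[Edge]

    def triplets(self) -> List[Tuple[Any, Any, Any]]:
        """Join each edge with the properties of its two endpoints."""
        return [(self.vertices[e.src], e.attr, self.vertices[e.dst])
                for e in self.edges]


graph = PropertyGraph(
    vertices={1: "alice", 2: "bob", 3: "carol"},
    edges=[Edge(1, 2, "follows"), Edge(2, 3, "follows"), Edge(3, 1, "blocks")],
)
for source, relation, destination in graph.triplets():
    print(source, relation, destination)
```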

GraphX / Spark software stack
(image source: Spark site)

Graph Databases
● Storage
● Query Language
● Transactions
● Examples:
  ● Neo4j
  ● OrientDB
  ● Titan
Graph Processing Systems
● APIs for traversing and processing
● Better performance (in-memory data)
● Examples:
  ● GraphX
  ● Giraph
  ● GraphLab

Pregel
is a computational model designed by Google
(https://kowshik.github.io/JPregel/pregel_paper.pdf)
It consists of a sequence of supersteps that runs until termination. In each
superstep, every vertex can:
● modify its state or that of its outgoing edges
● receive the messages sent to it during the previous superstep
● send messages to other vertices, typically its neighbours (they will be received in the next superstep)
● vote to halt
When a vertex votes to halt, it becomes inactive; if it receives a message in a
later superstep, the framework wakes it up and sets its state back to active.
The computation stops when all the vertices have voted to halt and no messages
are in transit; a maximum number of iterations can also be set.
Edges don't perform any computation.
When writing algorithms, you have to think like a vertex.
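
The superstep loop can be sketched in a few lines of plain Python. This is a single-machine illustration of the model, not Google's or GraphX's implementation, and `pregel_max` is a name made up here. Each vertex propagates the largest value it has seen. Every vertex votes to halt at the end of each superstep and is woken up only by incoming messages. The sample graph is invented, and the sketch assumes that every neighbour listed in `out_edges` is also a key of `values`.

```python
def pregel_max(values, out_edges, max_supersteps=30):
    """Spread the maximum vertex value through the graph, Pregel style.

    values:    {vertex: initial value}
    out_edges: {vertex: [neighbours reachable through outgoing edges]}
    Returns the final values and the number of supersteps executed.
    """
    values = dict(values)
    inbox = {v: [] for v in values}
    active = set(values)  # every vertex starts in the active state
    superstep = 0
    while active and superstep < max_supersteps:
        outbox = {v: [] for v in values}
        for v in sorted(active):
            messages = inbox[v]  # sent during the previous superstep
            changed = bool(messages) and max(messages) > values[v]
            if changed:
                values[v] = max(messages)  # the vertex modifies its own state
            if superstep == 0 or changed:
                for neighbour in out_edges.get(v, ()):
                    outbox[neighbour].append(values[v])
        # Every vertex votes to halt; only vertices with mail are reactivated.
        inbox = outbox
        active = {v for v, messages in inbox.items() if messages}
        superstep += 1
    return values, superstep


final_values, supersteps = pregel_max(
    {"A": 3, "B": 6, "C": 2, "D": 1},
    {"A": ["B"], "B": ["A", "D"], "C": ["B", "D"], "D": ["C"]},
)
print(final_values, supersteps)
```

Note that the function body only ever looks at one vertex, its inbox and its outgoing edges, which is what thinking like a vertex means in practice.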

Pregel sample
Image source: Pregel paper

GraphX implementation of Pregel
GraphX uses three functions for implementing Pregel:
● vprog: the vertex program, run on each vertex that receives an incoming message, which computes the new vertex value
● sendMsg: the function used for sending messages to other vertices
● mergeMsg: a function that takes two incoming messages and merges them into a single message
Unlike Google's Pregel, the GraphX implementation of Pregel:
● leaves message construction out of the vertex program, to allow a more efficient distributed execution
● permits access to the attributes of both vertices of an edge while building the messages
● constrains message sending to the graph structure (only to neighbours)
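
How the three functions fit together can be sketched in plain Python. This is a single-machine imitation written for this example, not the GraphX API: the `pregel` and `shortest_paths` functions, their parameters and the sample edges are all assumptions. It mirrors the points above: `send_msg` runs per edge and sees the attributes of both endpoints, and messages can only travel along edges. As a further assumption, an edge is considered only if at least one of its endpoints received a message in the previous round.

```python
import math


def pregel(vertices, edges, initial_msg, vprog, send_msg, merge_msg,
           max_iterations=20):
    """vertices: {id: attribute}; edges: [(src_id, dst_id, edge_attribute)]."""
    # First round: every vertex receives the initial message.
    vertices = {vid: vprog(vid, attr, initial_msg)
                for vid, attr in vertices.items()}
    active = set(vertices)
    for _ in range(max_iterations):
        inbox = {}
        for src, dst, attr in edges:
            if src not in active and dst not in active:
                continue  # neither endpoint got a message last round
            # send_msg sees both endpoint attributes and the edge attribute.
            for target, msg in send_msg(src, vertices[src], attr,
                                        dst, vertices[dst]):
                # merge_msg combines messages addressed to the same vertex.
                inbox[target] = (merge_msg(inbox[target], msg)
                                 if target in inbox else msg)
        if not inbox:
            break  # no messages in flight: the computation is over
        for vid, msg in inbox.items():
            vertices[vid] = vprog(vid, vertices[vid], msg)
        active = set(inbox)
    return vertices


def shortest_paths(vertex_ids, weighted_edges, source):
    """Single-source shortest paths expressed with the three functions."""
    def vprog(vid, distance, msg):
        return min(distance, msg)

    def send_msg(src, src_distance, weight, dst, dst_distance):
        if src_distance + weight < dst_distance:
            return [(dst, src_distance + weight)]
        return []

    start = {vid: (0.0 if vid == source else math.inf) for vid in vertex_ids}
    return pregel(start, weighted_edges, math.inf, vprog, send_msg, min)


sample_edges = [(1, 2, 7.0), (1, 3, 2.0), (3, 2, 3.0), (2, 4, 1.0), (3, 4, 8.0)]
distances = shortest_paths([1, 2, 3, 4], sample_edges, source=1)
print(distances)
```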

GraphX Pregel communication diagram

When to use GraphX
GraphX is well suited for algorithms that:
● respect the neighborhood structure
GraphX is NOT well suited for algorithms that:
● need interaction among distant vertices
● change the structure of the graph

Algorithms out of the box:
(as of Spark v1.5.1)
- Connected Components
- Label Propagation
- PageRank
- SVD++
- Shortest Paths
- Strongly Connected Components
- Triangle Count
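
As an illustration of what one of these algorithms computes, here is a plain-Python sketch of PageRank by power iteration. It is the textbook formulation in which the ranks sum to 1, not the GraphX implementation, and the `page_rank` function, its parameters and the sample graphs are assumptions made for this example. It assumes that every vertex appears as a key and has at least one outgoing link.

```python
def page_rank(out_links, damping=0.85, tolerance=1e-9, max_iterations=100):
    """out_links: {vertex: [vertices it links to]}, no dangling vertices."""
    n = len(out_links)
    ranks = {v: 1.0 / n for v in out_links}
    for _ in range(max_iterations):
        # Each vertex splits its current rank evenly among its out-links.
        incoming = {v: 0.0 for v in out_links}
        for v, targets in out_links.items():
            share = ranks[v] / len(targets)
            for target in targets:
                incoming[target] += share
        updated = {v: (1.0 - damping) / n + damping * incoming[v]
                   for v in out_links}
        delta = sum(abs(updated[v] - ranks[v]) for v in out_links)
        ranks = updated
        if delta < tolerance:
            break  # ranks have stopped changing
    return ranks


cycle_ranks = page_rank({"a": ["b"], "b": ["c"], "c": ["a"]})
hub_ranks = page_rank({"a": ["hub"], "b": ["hub"], "c": ["hub"], "hub": ["a"]})
print(cycle_ranks)
print(max(hub_ranks, key=hub_ranks.get))  # the vertex everyone links to
```

Each iteration only moves rank along existing edges, so the algorithm respects the neighborhood structure.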

Now some code!

Questions & Answers


Leave your feedback on Joind.in!
https://m.joind.in/event/codemotion-milan-2015
