Spark GraphX & Pregel
Challenges and Best Practices
Ashutosh Trivedi (IIIT Bangalore)
Kaushik Ranjan (IIIT Bangalore)
Sigmoid-Meetup Bangalore
https://github.com/anantasty/SparkAlgorithms
Ashutosh & Kaushik, Sigmoid-Meetup
Bangalore Dec-2014
Agenda
•Introduction to GraphX
– How to describe a graph
– RDDs to store Graph
– Algorithms available
•Application in graph algorithms
– Feedback Vertex Set of a Graph
– Identifying parallel parts of the solution.
•Challenges we faced
•Best practices
GraphX - Representation
Graph Representation
class Graph[V, E] {
    def Graph(vertices: Table[(Id, V)],
              edges: Table[(Id, Id, E)])
}
• VertexRDD[A] extends RDD[(VertexID, A)] and adds the constraint that each VertexID occurs only once.
• A VertexRDD[A] therefore represents a set of vertices, each with an attribute of type A.
• EdgeRDD[ED] extends RDD[Edge[ED]].
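A minimal plain-Python sketch of the two tables described above. It is not the GraphX API: `PropertyGraph` and `Edge` are hypothetical names, and the sample vertices and edge labels are made up. The constructor enforces the constraint that each vertex ID occurs only once.

```python
from dataclasses import dataclass
from typing import Dict, Generic, List, TypeVar

V = TypeVar("V")  # vertex attribute type
E = TypeVar("E")  # edge attribute type


@dataclass(frozen=True)
class Edge(Generic[E]):
    src: int
    dst: int
    attr: E


class PropertyGraph(Generic[V, E]):
    """Hypothetical stand-in: a vertex table keyed by unique ID plus an edge table."""

    def __init__(self, vertices, edges):
        self.vertices: Dict[int, V] = {}
        for vid, attr in vertices:
            # Same constraint as VertexRDD: each vertex ID occurs only once.
            if vid in self.vertices:
                raise ValueError(f"duplicate vertex id {vid}")
            self.vertices[vid] = attr
        self.edges: List[Edge[E]] = [Edge(s, d, a) for s, d, a in edges]


g = PropertyGraph(
    vertices=[(1, "A"), (2, "B"), (3, "C"), (4, "D")],
    edges=[(1, 2, "ab"), (1, 3, "ac"), (2, 3, "bc"), (3, 4, "cd")],
)
```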
GraphX - Representation
Vertex and Edges
(Figure: a vertex and an edge)
Triplets Join Vertices and Edges
• The triplets operator joins vertices and edges:
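Vertices: A, B, C, D
Edges: A → B, A → C, B → C, C → D
Triplets: one per edge, each pairing the edge with its source and destination vertices: (A, B), (A, C), (B, C), (C, D)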
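The join can be sketched in plain Python on the slide's graph (vertices A to D, edges A→B, A→C, B→C, C→D). This is an illustration, not GraphX code: the `Triplet` tuple and the attribute strings are assumptions, with field names loosely following GraphX's EdgeTriplet.

```python
from collections import namedtuple

# Hypothetical record; field names loosely mirror GraphX's EdgeTriplet.
Triplet = namedtuple("Triplet", "src_id src_attr dst_id dst_attr attr")

vertices = {"A": "attr of A", "B": "attr of B", "C": "attr of C", "D": "attr of D"}
edges = [("A", "B", "A->B"), ("A", "C", "A->C"), ("B", "C", "B->C"), ("C", "D", "C->D")]


def triplets(vertices, edges):
    """Join every edge with the attributes of its source and destination vertex."""
    return [Triplet(s, vertices[s], d, vertices[d], a) for s, d, a in edges]


for t in triplets(vertices, edges):
    print(t.src_id, "->", t.dst_id, "|", t.src_attr, "|", t.attr, "|", t.dst_attr)
```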
Triplets elements
Subgraphs
Predicates vpred and epred
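A rough plain-Python sketch of how the two predicates restrict a graph. The function below is a hypothetical stand-in, not the GraphX operator: here `epred` receives (src, dst, attr) rather than a full triplet, and the sample data is invented. A vertex is kept if `vpred` holds; an edge is kept if `epred` holds and both of its endpoints were kept.

```python
def subgraph(vertices, edges,
             vpred=lambda vid, attr: True,
             epred=lambda src, dst, attr: True):
    """Return (vertices, edges) restricted by a vertex and an edge predicate."""
    kept_vertices = {vid: attr for vid, attr in vertices.items() if vpred(vid, attr)}
    kept_edges = [
        (s, d, a)
        for s, d, a in edges
        # An edge survives only if both endpoints survive and epred accepts it.
        if s in kept_vertices and d in kept_vertices and epred(s, d, a)
    ]
    return kept_vertices, kept_edges


users = {1: "user", 2: "user", 3: "bot"}
links = [(1, 2, 5), (2, 3, 1), (1, 3, 7)]

# Drop the "bot" vertex; edges touching it disappear as well.
no_bots = subgraph(users, links, vpred=lambda vid, attr: attr != "bot")
# Keep all vertices but only edges with weight above 3.
heavy = subgraph(users, links, epred=lambda s, d, w: w > 3)
```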
Feedback Vertex Set
• A feedback vertex set of a graph is a set of vertices whose removal leaves a graph without cycles.
• Equivalently, a feedback vertex set contains at least one vertex of every cycle in the graph.
• Finding a minimum feedback vertex set is hard: the decision version of the problem is NP-complete.
• Naive approach: enumerate each simple cycle.
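The definition can be checked mechanically. The sketch below is an illustration in plain Python with hypothetical function names and a made-up graph: it removes a candidate set and tests whether the remaining directed graph is acyclic, using Kahn's topological-sort algorithm.

```python
from collections import deque


def is_acyclic(vertices, edges):
    """Kahn's algorithm: a directed graph is acyclic iff every vertex can be peeled off."""
    indegree = {v: 0 for v in vertices}
    out = {v: [] for v in vertices}
    for s, d in edges:
        out[s].append(d)
        indegree[d] += 1
    queue = deque(v for v in indegree if indegree[v] == 0)
    peeled = 0
    while queue:
        v = queue.popleft()
        peeled += 1
        for w in out[v]:
            indegree[w] -= 1
            if indegree[w] == 0:
                queue.append(w)
    return peeled == len(indegree)


def is_feedback_vertex_set(vertices, edges, candidate):
    """True if removing `candidate` leaves a graph without cycles."""
    rest = {v for v in vertices if v not in candidate}
    kept = [(s, d) for s, d in edges if s in rest and d in rest]
    return is_acyclic(rest, kept)


vertices = [1, 2, 3, 4]
edges = [(1, 2), (2, 3), (3, 1), (3, 4)]  # one cycle: 1 -> 2 -> 3 -> 1
print(is_feedback_vertex_set(vertices, edges, {1}))  # vertex 1 lies on the only cycle
print(is_feedback_vertex_set(vertices, edges, {4}))  # vertex 4 lies on no cycle
```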
Strongly Connected Components
(Figure: a directed graph on vertices 1 to 10)
Each strongly connected component can be processed in parallel,
since no cycle spans two components.
SC1 – (1) SC2 – (5) SC3 – (8) SC4 – (9)
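For reference, strongly connected components can be computed on a single machine with Kosaraju's two-pass algorithm. The sketch below is plain Python on an invented graph, not the distributed GraphX routine; it uses recursion, so it suits small graphs only. Labelling each vertex with the smallest ID in its component is one possible way to obtain an sccID per vertex.

```python
def strongly_connected_components(vertices, edges):
    """Kosaraju's algorithm; returns a list of vertex sets."""
    out = {v: [] for v in vertices}
    rev = {v: [] for v in vertices}
    for s, d in edges:
        out[s].append(d)
        rev[d].append(s)

    # Pass 1: record vertices in order of DFS completion.
    order, seen = [], set()

    def visit(v):
        seen.add(v)
        for w in out[v]:
            if w not in seen:
                visit(w)
        order.append(v)

    for v in vertices:
        if v not in seen:
            visit(v)

    # Pass 2: sweep the reversed graph in decreasing completion order.
    components, assigned = [], set()

    def collect(v, comp):
        assigned.add(v)
        comp.add(v)
        for w in rev[v]:
            if w not in assigned:
                collect(w, comp)

    for v in reversed(order):
        if v not in assigned:
            comp = set()
            collect(v, comp)
            components.append(comp)
    return components


vertices = [1, 2, 3, 4, 5, 6]
edges = [(1, 2), (2, 3), (3, 1), (3, 4), (4, 5), (5, 4), (6, 6)]
components = strongly_connected_components(vertices, edges)
# One label per vertex: the smallest vertex ID in its component.
scc_id = {v: min(comp) for comp in components for v in comp}
```

No cycle crosses a component boundary, so each set in `components` can be handed to a separate worker.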
FVS Algorithm
# Greedy recursive solution
FVS(G)
    sccGraph = scc(G)
    for each graph in sccGraph
        for each vertex
            remove the vertex and recompute scc
        V = the vertex whose removal gives the maximum number of SCCs
            # i.e. the vertex that breaks the most cycles
        subGraph = subgraph(remove V)
        FVS(subGraph)
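The greedy recursion can be tried out on a single machine. The following is a plain-Python sketch under stated assumptions, not the authors' Spark code: the helper names are hypothetical, SCCs are found with a simple forward/backward reachability intersection, and ties between vertices go to the smallest ID. Being a heuristic, it returns a valid feedback vertex set but not necessarily a minimum one.

```python
def reach(start, adj):
    """All vertices reachable from `start` (including itself) via `adj`."""
    seen, stack = {start}, [start]
    while stack:
        for w in adj[stack.pop()]:
            if w not in seen:
                seen.add(w)
                stack.append(w)
    return seen


def scc(vertices, edges):
    """SCC of v = vertices v can reach, intersected with vertices that can reach v."""
    out = {v: set() for v in vertices}
    rev = {v: set() for v in vertices}
    for s, d in edges:
        out[s].add(d)
        rev[d].add(s)
    components, assigned = [], set()
    for v in sorted(vertices):
        if v not in assigned:
            comp = reach(v, out) & reach(v, rev)
            components.append(comp)
            assigned |= comp
    return components


def induced(vertices, edges):
    """Edges with both endpoints inside `vertices`."""
    return [(s, d) for s, d in edges if s in vertices and d in vertices]


def greedy_fvs(vertices, edges):
    vertices = set(vertices)
    edges = induced(vertices, edges)
    chosen = []
    for comp in scc(vertices, edges):
        comp_edges = induced(comp, edges)
        if not comp_edges:
            continue  # a lone vertex without a self-loop lies on no cycle

        def scc_count_without(v):
            rest = comp - {v}
            return len(scc(rest, induced(rest, comp_edges)))

        # Remove the vertex that splits the component into the most SCCs.
        best = max(sorted(comp), key=scc_count_without)
        chosen.append(best)
        chosen.extend(greedy_fvs(comp - {best}, comp_edges))
    return chosen


# Two cycles, 1 -> 2 -> 3 -> 1 and 2 -> 4 -> 2, that share vertex 2.
print(greedy_fvs([1, 2, 3, 4], [(1, 2), (2, 3), (3, 1), (2, 4), (4, 2)]))
```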
(Figure: worked example on a four-vertex graph, vertices 1 to 4, tabulating Graph, Iteration and SCC count for the iterations Remove 1, Remove 2 and Remove 3)
Feedback Vertex Set: 1, 5, 8, 9
FVS – Spark Implementation
• sccGraph carries one more property, sccID, on each vertex; extract it.
• sccGraph = scc(G); for each graph in sccGraph
• # Greedy recursive function
• For each vertex, remove the vertex and recompute scc.
  # Z is the list of SCC counts after removing each vertex
• V = the vertex whose removal gives the maximum number of SCCs
  # i.e. the vertex that breaks the most cycles
• subGraph = subgraph(remove V); FVS(subGraph)
Pregel
• Graph DB
– Data Storage
– Data Mining
• Advantages
– Large-scale distributed computations
– Parallel algorithms for graphs on multiple machines
– Fault tolerance and distributability
Oldest Follower
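What is the age of the oldest follower of each user?
val oldestFollowerAge = graph
    .aggregateMessages[Int](
        // map: each edge sends the follower's (source's) age to the followed user (destination)
        ctx => ctx.sendToDst(ctx.srcAttr.age),
        // reduce: keep the largest age received
        (a, b) => math.max(a, b)
    )
mapReduceTriplets is now aggregateMessages, which returns a VertexRDD directly.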
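The send-then-merge pattern behind this query can be mimicked in a few lines of plain Python. This is a sketch, not GraphX: `aggregate_messages` is a hypothetical helper, an edge (follower, followed) is assumed to point from follower to followed user, and the names and ages are invented.

```python
def aggregate_messages(vertices, edges, send, merge):
    """Run `send` on every edge and combine the messages per receiving vertex."""
    inbox = {}

    def send_to(vid, msg):
        # Merge with whatever this vertex has already received.
        inbox[vid] = merge(inbox[vid], msg) if vid in inbox else msg

    for src, dst in edges:
        send(src, vertices[src], dst, vertices[dst], send_to)
    return inbox


ages = {"alice": 28, "bob": 27, "carol": 65, "dave": 42}
follows = [("bob", "alice"), ("carol", "alice"), ("dave", "bob"), ("alice", "carol")]

oldest_follower_age = aggregate_messages(
    ages,
    follows,
    # map: the follower's age goes to the followed user
    send=lambda src, src_age, dst, dst_age, send_to: send_to(dst, src_age),
    # reduce: keep the maximum
    merge=max,
)
print(oldest_follower_age)
```

In this sketch a user nobody follows ("dave") receives no message and therefore has no entry in the result.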
New in GraphX
In aggregateMessages:
• An EdgeContext exposes the triplet fields.
• It provides functions to explicitly send messages to the source and destination vertices.
• The user can indicate which fields of the triplet are actually required.
Theory – it’s Good
How it works – that’s awesome
Graphs are recursive data structures: the property of a vertex depends on the properties of its neighbors, which in turn depend on the properties of their neighbors.
Graph.Pregel( initialMessage ) (
    # message consumption
    ( vertexID, initialProperty, message ) → compute new property
    ,
    # message generation
    triplet → .. code ..
        Iterator( (vertexID, message) )    to send a message, or
        Iterator.empty                     to send nothing
    ,
    # message aggregation
    ( existing message set, new message ) → NEW message set
)
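The three functions fit together in a superstep loop. Below is a simplified single-machine sketch in plain Python, not the GraphX operator: the `pregel` function, its parameter names and the sample values are assumptions, and every edge is rescanned in every superstep, whereas GraphX's own Pregel is more selective about which edges it revisits. The example propagates the maximum value through a directed ring.

```python
def pregel(vertices, edges, initial_msg, vprog, send_msg, merge_msg, max_iterations=20):
    # Superstep 0: every vertex consumes the initial message.
    state = {vid: vprog(vid, attr, initial_msg) for vid, attr in vertices.items()}
    for _ in range(max_iterations):
        inbox = {}
        # Message generation, with aggregation per receiving vertex.
        for src, dst in edges:
            for target, msg in send_msg(src, state[src], dst, state[dst]):
                inbox[target] = merge_msg(inbox[target], msg) if target in inbox else msg
        if not inbox:
            break  # no messages in flight: the computation has converged
        # Message consumption: only vertices that received something are updated.
        for vid, msg in inbox.items():
            state[vid] = vprog(vid, state[vid], msg)
    return state


values = {1: 10, 2: 30, 3: 20, 4: 5}
ring = [(1, 2), (2, 3), (3, 4), (4, 1)]

result = pregel(
    values,
    ring,
    initial_msg=float("-inf"),
    vprog=lambda vid, value, msg: max(value, msg),
    # Send only when the source knows a larger value; otherwise send nothing.
    send_msg=lambda s, s_val, d, d_val: [(d, s_val)] if s_val > d_val else [],
    merge_msg=max,
)
print(result)
```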
Architecture
(Figure: supersteps of the example on a four-vertex graph, vertices 1 to 4; vertices combine incoming messages with max, e.g. max [30,10,20], max [20], max [10])
Example - output
(Figure: final vertex values on the four-vertex graph)
Applications - GIS
• Algorithm: compute all vertices in a directed graph that can reach a given vertex.
• Can be used for watershed delineation in Geographic Information Systems.
Vertices that can reach E are A and B.
Algorithm
Graph.Pregel( Seq[vertexIDs] ) (
    # message consumption
    if vertex.state == 1
        vertex.state → 2
    else if vertex.state == 0
        if ( vertex.adjacentVertices ∩ Seq[vertexIDs] ) isNotEmpty
            vertex.state → 1
    ,
    # message generation
    for each triplet
        if destinationVertex.state == 1
            message( sourceVertexID, Seq[destinationVertexID] )
            message( destinationVertexID, Seq[destinationVertexID] )
        else if sourceVertex.state == 1 and destinationVertex.state == 2
            message( sourceVertexID, Seq[destinationVertexID] )
        else message( empty )
    ,
    # message aggregation
    Seq[existing vertex IDs] ∪ Seq[new vertex ID]
)
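A single-machine reading of this algorithm, as a plain-Python sketch rather than the authors' GraphX code. Several things are assumptions here: the target starts in state 1, a not-yet-reached vertex (state 0) first moves to state 1 and only then to state 2, "adjacent vertices" means out-neighbours, and the example graph is invented so that only A and B can reach E.

```python
def can_reach(vertices, edges, target):
    """Vertices with a directed path to `target`, found by Pregel-style supersteps.

    States: 0 = not reached, 1 = reached in the last superstep, 2 = reached and done.
    """
    out_neighbours = {v: set() for v in vertices}
    for s, d in edges:
        out_neighbours[s].add(d)

    state = {v: 0 for v in vertices}
    state[target] = 1  # assumption: the target starts as the frontier

    def consume(v, ids):
        if state[v] == 1:
            return 2
        if state[v] == 0 and out_neighbours[v] & ids:
            return 1
        return state[v]

    # Initial message: the target ID, delivered to every vertex.
    state = {v: consume(v, {target}) for v in vertices}

    while True:
        inbox = {}
        for s, d in edges:  # message generation, merged by set union
            if state[d] == 1:
                inbox.setdefault(s, set()).add(d)
                inbox.setdefault(d, set()).add(d)
            elif state[s] == 1 and state[d] == 2:
                inbox.setdefault(s, set()).add(d)
        if not inbox:
            break
        state = {v: consume(v, inbox[v]) if v in inbox else state[v] for v in vertices}

    return {v for v in vertices if state[v] == 2 and v != target}


vertices = ["A", "B", "C", "D", "E"]
edges = [("A", "B"), ("B", "E"), ("E", "C"), ("C", "D")]
print(sorted(can_reach(vertices, edges, "E")))
```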
References
• Fork our repository at
• https://github.com/anantasty/SparkAlgorithms
• Follow us at
• https://github.com/codeAshu
• https://github.com/kaushikranjan
• https://spark.apache.org/docs/latest/graphx-programming-guide.html
31

More Related Content

What's hot

The Patterns of Distributed Logging and Containers
The Patterns of Distributed Logging and ContainersThe Patterns of Distributed Logging and Containers
The Patterns of Distributed Logging and ContainersSATOSHI TAGOMORI
 
What is Apache Kafka and What is an Event Streaming Platform?
What is Apache Kafka and What is an Event Streaming Platform?What is Apache Kafka and What is an Event Streaming Platform?
What is Apache Kafka and What is an Event Streaming Platform?confluent
 
High Availability Storage (susecon2016)
High Availability Storage (susecon2016)High Availability Storage (susecon2016)
High Availability Storage (susecon2016)Roger Zhou 周志强
 
Real-time Analytics with Trino and Apache Pinot
Real-time Analytics with Trino and Apache PinotReal-time Analytics with Trino and Apache Pinot
Real-time Analytics with Trino and Apache PinotXiang Fu
 
OpenStack Architecture
OpenStack ArchitectureOpenStack Architecture
OpenStack ArchitectureMirantis
 
Designing and Implementing a Real-time Data Lake with Dynamically Changing Sc...
Designing and Implementing a Real-time Data Lake with Dynamically Changing Sc...Designing and Implementing a Real-time Data Lake with Dynamically Changing Sc...
Designing and Implementing a Real-time Data Lake with Dynamically Changing Sc...Databricks
 
MoP(MQTT on Pulsar) - a Powerful Tool for Apache Pulsar in IoT - Pulsar Summi...
MoP(MQTT on Pulsar) - a Powerful Tool for Apache Pulsar in IoT - Pulsar Summi...MoP(MQTT on Pulsar) - a Powerful Tool for Apache Pulsar in IoT - Pulsar Summi...
MoP(MQTT on Pulsar) - a Powerful Tool for Apache Pulsar in IoT - Pulsar Summi...StreamNative
 
Hive User Meeting August 2009 Facebook
Hive User Meeting August 2009 FacebookHive User Meeting August 2009 Facebook
Hive User Meeting August 2009 Facebookragho
 
How Prometheus Store the Data
How Prometheus Store the DataHow Prometheus Store the Data
How Prometheus Store the DataHao Chen
 
From Idea to Model: Productionizing Data Pipelines with Apache Airflow
From Idea to Model: Productionizing Data Pipelines with Apache AirflowFrom Idea to Model: Productionizing Data Pipelines with Apache Airflow
From Idea to Model: Productionizing Data Pipelines with Apache AirflowDatabricks
 
Percona Live 2022 - The Evolution of a MySQL Database System
Percona Live 2022 - The Evolution of a MySQL Database SystemPercona Live 2022 - The Evolution of a MySQL Database System
Percona Live 2022 - The Evolution of a MySQL Database SystemFrederic Descamps
 
Automate Your Kafka Cluster with Kubernetes Custom Resources
Automate Your Kafka Cluster with Kubernetes Custom Resources Automate Your Kafka Cluster with Kubernetes Custom Resources
Automate Your Kafka Cluster with Kubernetes Custom Resources confluent
 
Apache Iceberg - A Table Format for Hige Analytic Datasets
Apache Iceberg - A Table Format for Hige Analytic DatasetsApache Iceberg - A Table Format for Hige Analytic Datasets
Apache Iceberg - A Table Format for Hige Analytic DatasetsAlluxio, Inc.
 
Using Apache Arrow, Calcite, and Parquet to Build a Relational Cache
Using Apache Arrow, Calcite, and Parquet to Build a Relational CacheUsing Apache Arrow, Calcite, and Parquet to Build a Relational Cache
Using Apache Arrow, Calcite, and Parquet to Build a Relational CacheDremio Corporation
 
Case Study: Stream Processing on AWS using Kappa Architecture
Case Study: Stream Processing on AWS using Kappa ArchitectureCase Study: Stream Processing on AWS using Kappa Architecture
Case Study: Stream Processing on AWS using Kappa ArchitectureJoey Bolduc-Gilbert
 
Add Redis to Postgres to Make Your Microservices Go Boom!
Add Redis to Postgres to Make Your Microservices Go Boom!Add Redis to Postgres to Make Your Microservices Go Boom!
Add Redis to Postgres to Make Your Microservices Go Boom!Dave Nielsen
 
Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...
Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...
Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...HostedbyConfluent
 

What's hot (20)

The Patterns of Distributed Logging and Containers
The Patterns of Distributed Logging and ContainersThe Patterns of Distributed Logging and Containers
The Patterns of Distributed Logging and Containers
 
What is Apache Kafka and What is an Event Streaming Platform?
What is Apache Kafka and What is an Event Streaming Platform?What is Apache Kafka and What is an Event Streaming Platform?
What is Apache Kafka and What is an Event Streaming Platform?
 
InnoDB Locking Explained with Stick Figures
InnoDB Locking Explained with Stick FiguresInnoDB Locking Explained with Stick Figures
InnoDB Locking Explained with Stick Figures
 
Elk
Elk Elk
Elk
 
High Availability Storage (susecon2016)
High Availability Storage (susecon2016)High Availability Storage (susecon2016)
High Availability Storage (susecon2016)
 
Real-Time Event Processing
Real-Time Event ProcessingReal-Time Event Processing
Real-Time Event Processing
 
Real-time Analytics with Trino and Apache Pinot
Real-time Analytics with Trino and Apache PinotReal-time Analytics with Trino and Apache Pinot
Real-time Analytics with Trino and Apache Pinot
 
OpenStack Architecture
OpenStack ArchitectureOpenStack Architecture
OpenStack Architecture
 
Designing and Implementing a Real-time Data Lake with Dynamically Changing Sc...
Designing and Implementing a Real-time Data Lake with Dynamically Changing Sc...Designing and Implementing a Real-time Data Lake with Dynamically Changing Sc...
Designing and Implementing a Real-time Data Lake with Dynamically Changing Sc...
 
MoP(MQTT on Pulsar) - a Powerful Tool for Apache Pulsar in IoT - Pulsar Summi...
MoP(MQTT on Pulsar) - a Powerful Tool for Apache Pulsar in IoT - Pulsar Summi...MoP(MQTT on Pulsar) - a Powerful Tool for Apache Pulsar in IoT - Pulsar Summi...
MoP(MQTT on Pulsar) - a Powerful Tool for Apache Pulsar in IoT - Pulsar Summi...
 
Hive User Meeting August 2009 Facebook
Hive User Meeting August 2009 FacebookHive User Meeting August 2009 Facebook
Hive User Meeting August 2009 Facebook
 
How Prometheus Store the Data
How Prometheus Store the DataHow Prometheus Store the Data
How Prometheus Store the Data
 
From Idea to Model: Productionizing Data Pipelines with Apache Airflow
From Idea to Model: Productionizing Data Pipelines with Apache AirflowFrom Idea to Model: Productionizing Data Pipelines with Apache Airflow
From Idea to Model: Productionizing Data Pipelines with Apache Airflow
 
Percona Live 2022 - The Evolution of a MySQL Database System
Percona Live 2022 - The Evolution of a MySQL Database SystemPercona Live 2022 - The Evolution of a MySQL Database System
Percona Live 2022 - The Evolution of a MySQL Database System
 
Automate Your Kafka Cluster with Kubernetes Custom Resources
Automate Your Kafka Cluster with Kubernetes Custom Resources Automate Your Kafka Cluster with Kubernetes Custom Resources
Automate Your Kafka Cluster with Kubernetes Custom Resources
 
Apache Iceberg - A Table Format for Hige Analytic Datasets
Apache Iceberg - A Table Format for Hige Analytic DatasetsApache Iceberg - A Table Format for Hige Analytic Datasets
Apache Iceberg - A Table Format for Hige Analytic Datasets
 
Using Apache Arrow, Calcite, and Parquet to Build a Relational Cache
Using Apache Arrow, Calcite, and Parquet to Build a Relational CacheUsing Apache Arrow, Calcite, and Parquet to Build a Relational Cache
Using Apache Arrow, Calcite, and Parquet to Build a Relational Cache
 
Case Study: Stream Processing on AWS using Kappa Architecture
Case Study: Stream Processing on AWS using Kappa ArchitectureCase Study: Stream Processing on AWS using Kappa Architecture
Case Study: Stream Processing on AWS using Kappa Architecture
 
Add Redis to Postgres to Make Your Microservices Go Boom!
Add Redis to Postgres to Make Your Microservices Go Boom!Add Redis to Postgres to Make Your Microservices Go Boom!
Add Redis to Postgres to Make Your Microservices Go Boom!
 
Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...
Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...
Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...
 

Viewers also liked

GraphX: Graph Analytics in Apache Spark (AMPCamp 5, 2014-11-20)
GraphX: Graph Analytics in Apache Spark (AMPCamp 5, 2014-11-20)GraphX: Graph Analytics in Apache Spark (AMPCamp 5, 2014-11-20)
GraphX: Graph Analytics in Apache Spark (AMPCamp 5, 2014-11-20)Ankur Dave
 
Building a Graph of all US Businesses Using Spark Technologies by Alexis Roos
Building a Graph of all US Businesses Using Spark Technologies by Alexis RoosBuilding a Graph of all US Businesses Using Spark Technologies by Alexis Roos
Building a Graph of all US Businesses Using Spark Technologies by Alexis RoosSpark Summit
 
Machine Learning and GraphX
Machine Learning and GraphXMachine Learning and GraphX
Machine Learning and GraphXAndy Petrella
 
Implementing Near-Realtime Datacenter Health Analytics using Model-driven Ver...
Implementing Near-Realtime Datacenter Health Analytics using Model-driven Ver...Implementing Near-Realtime Datacenter Health Analytics using Model-driven Ver...
Implementing Near-Realtime Datacenter Health Analytics using Model-driven Ver...Spark Summit
 
GraphX: Graph analytics for insights about developer communities
GraphX: Graph analytics for insights about developer communitiesGraphX: Graph analytics for insights about developer communities
GraphX: Graph analytics for insights about developer communitiesPaco Nathan
 
An excursion into Graph Analytics with Apache Spark GraphX
An excursion into Graph Analytics with Apache Spark GraphXAn excursion into Graph Analytics with Apache Spark GraphX
An excursion into Graph Analytics with Apache Spark GraphXKrishna Sankar
 
Relationship Extraction from Unstructured Text-Based on Stanford NLP with Spa...
Relationship Extraction from Unstructured Text-Based on Stanford NLP with Spa...Relationship Extraction from Unstructured Text-Based on Stanford NLP with Spa...
Relationship Extraction from Unstructured Text-Based on Stanford NLP with Spa...Spark Summit
 
Graph x pregel
Graph x pregelGraph x pregel
Graph x pregelSigmoid
 
5 Reasons Enterprise Adoption of Spark is Unstoppable by Mike Gualtieri
 5 Reasons Enterprise Adoption of Spark is Unstoppable by Mike Gualtieri 5 Reasons Enterprise Adoption of Spark is Unstoppable by Mike Gualtieri
5 Reasons Enterprise Adoption of Spark is Unstoppable by Mike GualtieriSpark Summit
 
Fighting financial crime with graph analysis at BIWA Summit 2017
Fighting financial crime with graph analysis at BIWA Summit 2017Fighting financial crime with graph analysis at BIWA Summit 2017
Fighting financial crime with graph analysis at BIWA Summit 2017Linkurious
 
Using graphs technologies for intelligence analysis.
Using graphs technologies for intelligence analysis. Using graphs technologies for intelligence analysis.
Using graphs technologies for intelligence analysis. Linkurious
 
Xia Zhu – Intel at MLconf ATL
Xia Zhu – Intel at MLconf ATLXia Zhu – Intel at MLconf ATL
Xia Zhu – Intel at MLconf ATLMLconf
 
Using GraphX/Pregel on Browsing History to Discover Purchase Intent by Lisa Z...
Using GraphX/Pregel on Browsing History to Discover Purchase Intent by Lisa Z...Using GraphX/Pregel on Browsing History to Discover Purchase Intent by Lisa Z...
Using GraphX/Pregel on Browsing History to Discover Purchase Intent by Lisa Z...Spark Summit
 
ggplot2.SparkR: Rebooting ggplot2 for Scalable Big Data Visualization by Jong...
ggplot2.SparkR: Rebooting ggplot2 for Scalable Big Data Visualization by Jong...ggplot2.SparkR: Rebooting ggplot2 for Scalable Big Data Visualization by Jong...
ggplot2.SparkR: Rebooting ggplot2 for Scalable Big Data Visualization by Jong...Spark Summit
 
An excursion into Text Analytics with Apache Spark
An excursion into Text Analytics with Apache SparkAn excursion into Text Analytics with Apache Spark
An excursion into Text Analytics with Apache SparkKrishna Sankar
 
Lambda at Weather Scale by Robbie Strickland
Lambda at Weather Scale by Robbie StricklandLambda at Weather Scale by Robbie Strickland
Lambda at Weather Scale by Robbie StricklandSpark Summit
 
Social Network Analysis with Spark
Social Network Analysis with SparkSocial Network Analysis with Spark
Social Network Analysis with SparkGhulam Imaduddin
 
Using spark for timeseries graph analytics
Using spark for timeseries graph analyticsUsing spark for timeseries graph analytics
Using spark for timeseries graph analyticsSigmoid
 

Viewers also liked (20)

GraphX: Graph Analytics in Apache Spark (AMPCamp 5, 2014-11-20)
GraphX: Graph Analytics in Apache Spark (AMPCamp 5, 2014-11-20)GraphX: Graph Analytics in Apache Spark (AMPCamp 5, 2014-11-20)
GraphX: Graph Analytics in Apache Spark (AMPCamp 5, 2014-11-20)
 
Building a Graph of all US Businesses Using Spark Technologies by Alexis Roos
Building a Graph of all US Businesses Using Spark Technologies by Alexis RoosBuilding a Graph of all US Businesses Using Spark Technologies by Alexis Roos
Building a Graph of all US Businesses Using Spark Technologies by Alexis Roos
 
Machine Learning and GraphX
Machine Learning and GraphXMachine Learning and GraphX
Machine Learning and GraphX
 
Implementing Near-Realtime Datacenter Health Analytics using Model-driven Ver...
Implementing Near-Realtime Datacenter Health Analytics using Model-driven Ver...Implementing Near-Realtime Datacenter Health Analytics using Model-driven Ver...
Implementing Near-Realtime Datacenter Health Analytics using Model-driven Ver...
 
GraphX: Graph analytics for insights about developer communities
GraphX: Graph analytics for insights about developer communitiesGraphX: Graph analytics for insights about developer communities
GraphX: Graph analytics for insights about developer communities
 
An excursion into Graph Analytics with Apache Spark GraphX
An excursion into Graph Analytics with Apache Spark GraphXAn excursion into Graph Analytics with Apache Spark GraphX
An excursion into Graph Analytics with Apache Spark GraphX
 
Relationship Extraction from Unstructured Text-Based on Stanford NLP with Spa...
Relationship Extraction from Unstructured Text-Based on Stanford NLP with Spa...Relationship Extraction from Unstructured Text-Based on Stanford NLP with Spa...
Relationship Extraction from Unstructured Text-Based on Stanford NLP with Spa...
 
Ashutosh pycon
Ashutosh pyconAshutosh pycon
Ashutosh pycon
 
Spark algorithms
Spark algorithmsSpark algorithms
Spark algorithms
 
Graph x pregel
Graph x pregelGraph x pregel
Graph x pregel
 
5 Reasons Enterprise Adoption of Spark is Unstoppable by Mike Gualtieri
 5 Reasons Enterprise Adoption of Spark is Unstoppable by Mike Gualtieri 5 Reasons Enterprise Adoption of Spark is Unstoppable by Mike Gualtieri
5 Reasons Enterprise Adoption of Spark is Unstoppable by Mike Gualtieri
 
Fighting financial crime with graph analysis at BIWA Summit 2017
Fighting financial crime with graph analysis at BIWA Summit 2017Fighting financial crime with graph analysis at BIWA Summit 2017
Fighting financial crime with graph analysis at BIWA Summit 2017
 
Using graphs technologies for intelligence analysis.
Using graphs technologies for intelligence analysis. Using graphs technologies for intelligence analysis.
Using graphs technologies for intelligence analysis.
 
Xia Zhu – Intel at MLconf ATL
Xia Zhu – Intel at MLconf ATLXia Zhu – Intel at MLconf ATL
Xia Zhu – Intel at MLconf ATL
 
Using GraphX/Pregel on Browsing History to Discover Purchase Intent by Lisa Z...
Using GraphX/Pregel on Browsing History to Discover Purchase Intent by Lisa Z...Using GraphX/Pregel on Browsing History to Discover Purchase Intent by Lisa Z...
Using GraphX/Pregel on Browsing History to Discover Purchase Intent by Lisa Z...
 
ggplot2.SparkR: Rebooting ggplot2 for Scalable Big Data Visualization by Jong...
ggplot2.SparkR: Rebooting ggplot2 for Scalable Big Data Visualization by Jong...ggplot2.SparkR: Rebooting ggplot2 for Scalable Big Data Visualization by Jong...
ggplot2.SparkR: Rebooting ggplot2 for Scalable Big Data Visualization by Jong...
 
An excursion into Text Analytics with Apache Spark
An excursion into Text Analytics with Apache SparkAn excursion into Text Analytics with Apache Spark
An excursion into Text Analytics with Apache Spark
 
Lambda at Weather Scale by Robbie Strickland
Lambda at Weather Scale by Robbie StricklandLambda at Weather Scale by Robbie Strickland
Lambda at Weather Scale by Robbie Strickland
 
Social Network Analysis with Spark
Social Network Analysis with SparkSocial Network Analysis with Spark
Social Network Analysis with Spark
 
Using spark for timeseries graph analytics
Using spark for timeseries graph analyticsUsing spark for timeseries graph analytics
Using spark for timeseries graph analytics
 

Similar to GraphX and Pregel - Apache Spark

Intro to Apache Spark and Scala, Austin ACM SIGKDD, 7/9/2014
Intro to Apache Spark and Scala, Austin ACM SIGKDD, 7/9/2014Intro to Apache Spark and Scala, Austin ACM SIGKDD, 7/9/2014
Intro to Apache Spark and Scala, Austin ACM SIGKDD, 7/9/2014Roger Huang
 
Multiple Graphs: Updatable Views
Multiple Graphs: Updatable ViewsMultiple Graphs: Updatable Views
Multiple Graphs: Updatable ViewsopenCypher
 
Stockage, manipulation et analyse de données matricielles avec PostGIS Raster
Stockage, manipulation et analyse de données matricielles avec PostGIS RasterStockage, manipulation et analyse de données matricielles avec PostGIS Raster
Stockage, manipulation et analyse de données matricielles avec PostGIS RasterACSG Section Montréal
 
Learn basics of Clojure/script and Reagent
Learn basics of Clojure/script and ReagentLearn basics of Clojure/script and Reagent
Learn basics of Clojure/script and ReagentMaty Fedak
 
Introduction to Mahout
Introduction to MahoutIntroduction to Mahout
Introduction to MahoutTed Dunning
 
Introduction to Mahout given at Twin Cities HUG
Introduction to Mahout given at Twin Cities HUGIntroduction to Mahout given at Twin Cities HUG
Introduction to Mahout given at Twin Cities HUGMapR Technologies
 
Roadmap y Novedades de producto
Roadmap y Novedades de productoRoadmap y Novedades de producto
Roadmap y Novedades de productoNeo4j
 
Shape Safety in Tensor Programming is Easy for a Theorem Prover -SBTB 2021
Shape Safety in Tensor Programming is Easy for a Theorem Prover -SBTB 2021Shape Safety in Tensor Programming is Easy for a Theorem Prover -SBTB 2021
Shape Safety in Tensor Programming is Easy for a Theorem Prover -SBTB 2021Peng Cheng
 
Grill at bigdata-cloud conf
Grill at bigdata-cloud confGrill at bigdata-cloud conf
Grill at bigdata-cloud confamarsri
 
GraphQL & DGraph with Go
GraphQL & DGraph with GoGraphQL & DGraph with Go
GraphQL & DGraph with GoJames Tan
 
GraphQL Meetup Bangkok 3.0
GraphQL Meetup Bangkok 3.0GraphQL Meetup Bangkok 3.0
GraphQL Meetup Bangkok 3.0Tobias Meixner
 
RR & Docker @ MuensteR Meetup (Sep 2017)
RR & Docker @ MuensteR Meetup (Sep 2017)RR & Docker @ MuensteR Meetup (Sep 2017)
RR & Docker @ MuensteR Meetup (Sep 2017)Daniel Nüst
 
ABSTRACT GRAPH MACHINE: MODELING ORDERINGS IN ASYNCHRONOUS DISTRIBUTED-MEMORY...
ABSTRACT GRAPH MACHINE: MODELING ORDERINGS IN ASYNCHRONOUS DISTRIBUTED-MEMORY...ABSTRACT GRAPH MACHINE: MODELING ORDERINGS IN ASYNCHRONOUS DISTRIBUTED-MEMORY...
ABSTRACT GRAPH MACHINE: MODELING ORDERINGS IN ASYNCHRONOUS DISTRIBUTED-MEMORY...Thejaka Amila Kanewala, Ph.D.
 
Multiple graphs in openCypher
Multiple graphs in openCypherMultiple graphs in openCypher
Multiple graphs in openCypheropenCypher
 
PG-4034, Using OpenGL and DirectX for Heterogeneous Compute, by Karl Hillesland
PG-4034, Using OpenGL and DirectX for Heterogeneous Compute, by Karl HilleslandPG-4034, Using OpenGL and DirectX for Heterogeneous Compute, by Karl Hillesland
PG-4034, Using OpenGL and DirectX for Heterogeneous Compute, by Karl HilleslandAMD Developer Central
 
Intro to Apache Spark by Marco Vasquez
Intro to Apache Spark by Marco VasquezIntro to Apache Spark by Marco Vasquez
Intro to Apache Spark by Marco VasquezMapR Technologies
 
2017 RM-URISA Track: Spatial SQL - The Best Kept Secret in the Geospatial World
2017 RM-URISA Track:  Spatial SQL - The Best Kept Secret in the Geospatial World2017 RM-URISA Track:  Spatial SQL - The Best Kept Secret in the Geospatial World
2017 RM-URISA Track: Spatial SQL - The Best Kept Secret in the Geospatial WorldGIS in the Rockies
 
Loom & Functional Graphs in Clojure @ LambdaConf 2015
Loom & Functional Graphs in Clojure @ LambdaConf 2015Loom & Functional Graphs in Clojure @ LambdaConf 2015
Loom & Functional Graphs in Clojure @ LambdaConf 2015Aysylu Greenberg
 

Similar to GraphX and Pregel - Apache Spark (20)

Intro to Apache Spark and Scala, Austin ACM SIGKDD, 7/9/2014
Intro to Apache Spark and Scala, Austin ACM SIGKDD, 7/9/2014Intro to Apache Spark and Scala, Austin ACM SIGKDD, 7/9/2014
Intro to Apache Spark and Scala, Austin ACM SIGKDD, 7/9/2014
 
Scala 20140715
Scala 20140715Scala 20140715
Scala 20140715
 
Multiple Graphs: Updatable Views
Multiple Graphs: Updatable ViewsMultiple Graphs: Updatable Views
Multiple Graphs: Updatable Views
 
Stockage, manipulation et analyse de données matricielles avec PostGIS Raster
Stockage, manipulation et analyse de données matricielles avec PostGIS RasterStockage, manipulation et analyse de données matricielles avec PostGIS Raster
Stockage, manipulation et analyse de données matricielles avec PostGIS Raster
 
Learn basics of Clojure/script and Reagent
Learn basics of Clojure/script and ReagentLearn basics of Clojure/script and Reagent
Learn basics of Clojure/script and Reagent
 
Introduction to Mahout
Introduction to MahoutIntroduction to Mahout
Introduction to Mahout
 
Introduction to Mahout given at Twin Cities HUG
Introduction to Mahout given at Twin Cities HUGIntroduction to Mahout given at Twin Cities HUG
Introduction to Mahout given at Twin Cities HUG
 
Roadmap y Novedades de producto
Roadmap y Novedades de productoRoadmap y Novedades de producto
Roadmap y Novedades de producto
 
Shape Safety in Tensor Programming is Easy for a Theorem Prover -SBTB 2021
Shape Safety in Tensor Programming is Easy for a Theorem Prover -SBTB 2021Shape Safety in Tensor Programming is Easy for a Theorem Prover -SBTB 2021
Shape Safety in Tensor Programming is Easy for a Theorem Prover -SBTB 2021
 
Grill at bigdata-cloud conf
Grill at bigdata-cloud confGrill at bigdata-cloud conf
Grill at bigdata-cloud conf
 
GraphQL & DGraph with Go
GraphQL & DGraph with GoGraphQL & DGraph with Go
GraphQL & DGraph with Go
 
GraphQL Meetup Bangkok 3.0
GraphQL Meetup Bangkok 3.0GraphQL Meetup Bangkok 3.0
GraphQL Meetup Bangkok 3.0
 
RR & Docker @ MuensteR Meetup (Sep 2017)
RR & Docker @ MuensteR Meetup (Sep 2017)RR & Docker @ MuensteR Meetup (Sep 2017)
RR & Docker @ MuensteR Meetup (Sep 2017)
 
ABSTRACT GRAPH MACHINE: MODELING ORDERINGS IN ASYNCHRONOUS DISTRIBUTED-MEMORY...
ABSTRACT GRAPH MACHINE: MODELING ORDERINGS IN ASYNCHRONOUS DISTRIBUTED-MEMORY...ABSTRACT GRAPH MACHINE: MODELING ORDERINGS IN ASYNCHRONOUS DISTRIBUTED-MEMORY...
ABSTRACT GRAPH MACHINE: MODELING ORDERINGS IN ASYNCHRONOUS DISTRIBUTED-MEMORY...
 
Multiple graphs in openCypher
Multiple graphs in openCypherMultiple graphs in openCypher
Multiple graphs in openCypher
 
PG-4034, Using OpenGL and DirectX for Heterogeneous Compute, by Karl Hillesland
PG-4034, Using OpenGL and DirectX for Heterogeneous Compute, by Karl HilleslandPG-4034, Using OpenGL and DirectX for Heterogeneous Compute, by Karl Hillesland
PG-4034, Using OpenGL and DirectX for Heterogeneous Compute, by Karl Hillesland
 
Intro to Apache Spark by Marco Vasquez
Intro to Apache Spark by Marco VasquezIntro to Apache Spark by Marco Vasquez
Intro to Apache Spark by Marco Vasquez
 
2017 RM-URISA Track: Spatial SQL - The Best Kept Secret in the Geospatial World
2017 RM-URISA Track:  Spatial SQL - The Best Kept Secret in the Geospatial World2017 RM-URISA Track:  Spatial SQL - The Best Kept Secret in the Geospatial World
2017 RM-URISA Track: Spatial SQL - The Best Kept Secret in the Geospatial World
 
Graph.pptx
Graph.pptxGraph.pptx
Graph.pptx
 
Loom & Functional Graphs in Clojure @ LambdaConf 2015
Loom & Functional Graphs in Clojure @ LambdaConf 2015Loom & Functional Graphs in Clojure @ LambdaConf 2015
Loom & Functional Graphs in Clojure @ LambdaConf 2015
 

GraphX and Pregel - Apache Spark

  • 1. Spark GraphX & Pregel: Challenges and Best Practices. Ashutosh Trivedi (IIIT Bangalore), Kaushik Ranjan (IIIT Bangalore). Sigmoid-Meetup Bangalore. https://github.com/anantasty/SparkAlgorithms
  • 2. Ashutosh & Kaushik, Sigmoid-Meetup Bangalore, Dec-2014. Agenda
      • Introduction to GraphX
          – How to describe a graph
          – RDDs to store a graph
          – Algorithms available
      • Application in graph algorithms
          – Feedback Vertex Set of a graph
          – Identifying parallel parts of the solution
      • Challenges we faced
      • Best practices
  • 3. GraphX - Representation
  • 4. Graph Representation
      class Graph[V, E] {
        def Graph(vertices: Table[(Id, V)],
                  edges: Table[(Id, Id, E)])
      }
      • VertexRDD[A] extends RDD[(VertexID, A)] and adds the constraint that each VertexID occurs only once.
      • VertexRDD[A] therefore represents a set of vertices, each with an attribute of type A.
      • EdgeRDD[ED] extends RDD[Edge[ED]].
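The schematic constructor on this slide corresponds roughly to building a `Graph` from a vertex RDD and an edge RDD. The following is a minimal sketch, assuming Spark with GraphX on the classpath and a local master; the object name `GraphConstruction` and the sample data are illustrative, not from the talk.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph, VertexId}
import org.apache.spark.rdd.RDD

// Hypothetical helper object, for illustration only.
object GraphConstruction {
  // Builds a small property graph: String vertex attributes, Int edge attributes.
  def build(sc: SparkContext): Graph[String, Int] = {
    val vertices: RDD[(VertexId, String)] = sc.parallelize(Seq(
      (1L, "A"), (2L, "B"), (3L, "C"), (4L, "D")))
    val edges: RDD[Edge[Int]] = sc.parallelize(Seq(
      Edge(1L, 2L, 1), Edge(1L, 3L, 1), Edge(2L, 3L, 1), Edge(3L, 4L, 1)))
    Graph(vertices, edges)
  }

  def main(args: Array[String]): Unit = {
    val sc = SparkContext.getOrCreate(
      new SparkConf().setMaster("local[*]").setAppName("graph-construction"))
    val g = build(sc)
    // g.vertices is a VertexRDD[String]; g.edges is an EdgeRDD[Int].
    println(s"${g.vertices.count()} vertices, ${g.edges.count()} edges")
    sc.stop()
  }
}
```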
  • 5. GraphX - Representation
  • 6. Vertex and Edges [figure: vertex and edge illustration]
  • 7. Triplets Join Vertices and Edges
      • The triplets operator joins vertices and edges.
      • Vertices: A, B, C, D
      • Edges: A→B, A→C, B→C, C→D
      • Triplets: one per edge, carrying the edge together with its source and destination vertices
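The join on this slide can be sketched with `graph.triplets`, using the same four vertices and four edges. This is a minimal sketch, assuming Spark with GraphX on the classpath; the object name `TripletsExample` and the edge labels are illustrative.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph}

// Hypothetical helper object, for illustration only.
object TripletsExample {
  // Returns one "src -> dst" string per triplet of the slide's example graph.
  def describe(sc: SparkContext): Array[String] = {
    val vertices = sc.parallelize(Seq((1L, "A"), (2L, "B"), (3L, "C"), (4L, "D")))
    val edges = sc.parallelize(Seq(
      Edge(1L, 2L, "e1"), Edge(1L, 3L, "e2"), Edge(2L, 3L, "e3"), Edge(3L, 4L, "e4")))
    val graph = Graph(vertices, edges)
    // Each EdgeTriplet exposes srcId, dstId, attr, srcAttr and dstAttr.
    graph.triplets.map(t => s"${t.srcAttr} -> ${t.dstAttr}").collect()
  }

  def main(args: Array[String]): Unit = {
    val sc = SparkContext.getOrCreate(
      new SparkConf().setMaster("local[*]").setAppName("triplets"))
    describe(sc).foreach(println)
    sc.stop()
  }
}
```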
  • 8. Triplets elements
  • 9. Subgraphs: predicates vpred (selects vertices) and epred (selects edges)
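A minimal sketch of how the two predicates might be used with `Graph.subgraph`, assuming Spark with GraphX on the classpath. The object name `SubgraphExample` and the sample graph are illustrative; edges touching a removed vertex are dropped along with it.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph, VertexId}

// Hypothetical helper object, for illustration only.
object SubgraphExample {
  def sample(sc: SparkContext): Graph[String, Int] = {
    val vertices = sc.parallelize(Seq((1L, "A"), (2L, "B"), (3L, "C"), (4L, "D")))
    val edges = sc.parallelize(Seq(
      Edge(1L, 2L, 5), Edge(1L, 3L, 1), Edge(2L, 3L, 7), Edge(3L, 4L, 2)))
    Graph(vertices, edges)
  }

  // vpred keeps every vertex except `removed`.
  def withoutVertex(g: Graph[String, Int], removed: VertexId): Graph[String, Int] =
    g.subgraph(vpred = (id, _) => id != removed)

  // epred keeps only edges whose attribute is at least `minWeight`.
  def heavyEdges(g: Graph[String, Int], minWeight: Int): Graph[String, Int] =
    g.subgraph(epred = t => t.attr >= minWeight)

  def main(args: Array[String]): Unit = {
    val sc = SparkContext.getOrCreate(
      new SparkConf().setMaster("local[*]").setAppName("subgraph"))
    val g = sample(sc)
    println(withoutVertex(g, 3L).edges.count())
    println(heavyEdges(g, 5).edges.count())
    sc.stop()
  }
}
```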
  • 10. Feedback Vertex Set
      • A feedback vertex set of a graph is a set of vertices whose removal leaves a graph without cycles.
      • Each feedback vertex set contains at least one vertex of every cycle in the graph.
      • The feedback vertex set problem is NP-complete.
      • Enumerate each simple cycle.
  • 11. Strongly Connected Components [figure: graph on vertices 1 to 10]. Each strongly connected component can be processed in parallel, since no two components share a cycle. SC1 – (1), SC2 – (5), SC3 – (8), SC4 – (9)
  • 12. FVS Algorithm
      # Greedy recursive solution
      FVS(G)
        sccGraph = scc(G)
        for each graph in sccGraph
          for each vertex
            remove the vertex and recalculate scc
          V = the vertex whose removal gives the maximum number of SCCs
              # which means it kills the maximum number of cycles
          subGraph = subgraph(remove V)
          FVS(subGraph)
  • 13. [figure: table with columns Graph, Iteration and SCC count, showing the graph on vertices 1 to 4 after Remove 1, Remove 2 and Remove 3]
  • 14. Feedback Vertex Set [figure: vertices 1, 5, 8, 9]
  • 15. FVS – Spark Implementation: sccGraph has one more property, sccID, on each vertex; extract it
  • 16. FVS – Spark Implementation: sccGraph = scc(G); for each graph in sccGraph
  • 17. FVS – Spark Implementation: # Greedy recursive function
  • 18. FVS – Spark Implementation: for each vertex, remove the vertex and recalculate scc # Z is a list of SCC counts after removing each vertex
  • 19. FVS – Spark Implementation: V = the vertex whose removal gives the maximum number of SCCs # which means it kills the maximum number of cycles
  • 20. FVS – Spark Implementation: subGraph = subgraph(remove V); FVS(subGraph)
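The code shown on the implementation slides did not survive in this transcript. The greedy idea from the pseudocode can be sketched as follows; this is not the authors' implementation. It is a simplified variant that works on the whole graph instead of one SCC at a time, runs its candidate loop on the driver, and ignores self-loops. The object name `GreedyFvs` and the `numIter` parameter are assumptions, and Spark with GraphX is assumed to be on the classpath.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph, VertexId}

// Hypothetical sketch of the greedy heuristic; not the code from the talk.
object GreedyFvs {
  // Number of strongly connected components of g.
  private def sccCount(g: Graph[Int, Int], numIter: Int): Long =
    g.stronglyConnectedComponents(numIter).vertices.map(_._2).distinct().count()

  @annotation.tailrec
  def fvs(g: Graph[Int, Int], numIter: Int,
          acc: Set[VertexId] = Set.empty): Set[VertexId] = {
    // (vertexId, sccId) pairs, collected to the driver (small graphs only).
    val comp = g.stronglyConnectedComponents(numIter).vertices.collect()
    val sizes = comp.groupBy(_._2).map { case (c, vs) => c -> vs.length }
    // Only vertices in an SCC with more than one vertex lie on a cycle.
    val candidates = comp.collect { case (v, c) if sizes(c) > 1 => v }
    if (candidates.isEmpty) acc
    else {
      // Pick the vertex whose removal yields the most SCCs.
      val best = candidates.maxBy { v =>
        sccCount(g.subgraph(vpred = (id, _) => id != v), numIter)
      }
      fvs(g.subgraph(vpred = (id, _) => id != best), numIter, acc + best)
    }
  }

  def main(args: Array[String]): Unit = {
    val sc = SparkContext.getOrCreate(
      new SparkConf().setMaster("local[*]").setAppName("greedy-fvs"))
    val edges = sc.parallelize(Seq(
      Edge(1L, 2L, 0), Edge(2L, 3L, 0), Edge(3L, 1L, 0), Edge(3L, 4L, 0)))
    println(fvs(Graph.fromEdges(edges, 0), numIter = 10))
    sc.stop()
  }
}
```

Each candidate triggers a full SCC computation, so this sketch is only practical for small graphs; treating each SCC independently, as the slides suggest, is what makes the parallel version worthwhile.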
  • 21. Pregel
      • Graph DB
          – Data storage
          – Data mining
      • Advantages
          – Large-scale distributed computations
          – Parallel algorithms for graphs on multiple machines
          – Fault tolerance and distributability
  • 22. Oldest Follower: what is the age of the oldest follower of each user?
      val oldestFollowerAge = graph
        .aggregateMessages(
          # map
          word => (word.dst.id, word.src.age),
          # reduce
          (a, b) => max(a, b)
        )
        .vertices
      mapReduceTriplets is now aggregateMessages.
  • 23. New in GraphX. In aggregateMessages:
      • An EdgeContext exposes the triplet fields.
      • It provides functions to explicitly send messages to the source and destination vertex.
      • The user must indicate which fields of the triplet are actually required.
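The oldest-follower query might look like this with the `aggregateMessages` API described above. This is a minimal sketch, assuming vertex attributes hold ages as `Int`, edges point from follower to followed user, and Spark with GraphX is on the classpath; the object name `OldestFollower` and the sample data are illustrative.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph, TripletFields, VertexRDD}

// Hypothetical helper object, for illustration only.
object OldestFollower {
  // Vertex attribute = age; an edge src -> dst means "src follows dst".
  def oldestFollowerAge(graph: Graph[Int, String]): VertexRDD[Int] =
    graph.aggregateMessages[Int](
      ctx => ctx.sendToDst(ctx.srcAttr),  // send the follower's age to the followed user
      (a, b) => math.max(a, b),           // keep the largest age per user
      TripletFields.Src)                  // only the source attribute is read

  def main(args: Array[String]): Unit = {
    val sc = SparkContext.getOrCreate(
      new SparkConf().setMaster("local[*]").setAppName("oldest-follower"))
    val users = sc.parallelize(Seq((1L, 25), (2L, 40), (3L, 31)))
    val follows = sc.parallelize(Seq(
      Edge(1L, 3L, "follows"), Edge(2L, 3L, "follows"), Edge(3L, 1L, "follows")))
    oldestFollowerAge(Graph(users, follows)).collect().foreach(println)
    sc.stop()
  }
}
```

Users with no followers receive no message and therefore do not appear in the result.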
  • 24. Theory: it's good. How it works: that's awesome. Graphs are recursive data structures, where the property of a vertex depends on the properties of its neighbors, which in turn depend on the properties of their neighbors.
  • 25. Architecture
      Graph.Pregel(initialMessage)(
        # message consumption
        (vertexID, initialProperty, message) → compute new property,
        # message generation
        triplet → .. code ..
          Iterator(vertexID, message)
          Iterator.empty,
        # message aggregation
        (existing message set, new message) → NEW message set
      )
  • 26. [figure: Pregel example on a graph with vertices 1 to 4; at each step a vertex takes the max of its incoming messages, e.g. max [30, 10, 20], max [20], max [10]]
  • 27. Example - output [figure: graph with vertices 1 to 4]
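The three parts of the Pregel call (message consumption, generation and aggregation) can be illustrated with a small max-propagation sketch. It does not reconstruct the figure above; it assumes Int vertex values pushed along edge direction until nothing changes, with Spark and GraphX on the classpath. The object name `MaxPropagation` and the sample values are illustrative.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph}

// Hypothetical helper object, for illustration only.
object MaxPropagation {
  // Every vertex ends up with the largest value among itself and
  // all vertices that can reach it along directed edges.
  def run(graph: Graph[Int, Int]): Graph[Int, Int] =
    graph.pregel(Int.MinValue)(
      // message consumption: keep the larger of own value and message
      (id, value, msg) => math.max(value, msg),
      // message generation: only send when the destination would change
      t => if (t.srcAttr > t.dstAttr) Iterator((t.dstId, t.srcAttr)) else Iterator.empty,
      // message aggregation: combine messages for the same vertex
      (a, b) => math.max(a, b))

  def main(args: Array[String]): Unit = {
    val sc = SparkContext.getOrCreate(
      new SparkConf().setMaster("local[*]").setAppName("pregel-max"))
    val vertices = sc.parallelize(Seq((1L, 10), (2L, 30), (3L, 20), (4L, 5)))
    val edges = sc.parallelize(Seq(Edge(1L, 2L, 0), Edge(2L, 3L, 0), Edge(3L, 4L, 0)))
    run(Graph(vertices, edges)).vertices.collect().foreach(println)
    sc.stop()
  }
}
```

Returning `Iterator.empty` when nothing would change is what lets the computation stop: Pregel terminates once no messages are sent in a superstep.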
  • 28. Applications - GIS
      • Algorithm: compute all vertices in a directed graph that can reach a given vertex.
      • Can be used for watershed delineation in Geographic Information Systems.
      • Example [figure]: the vertices that can reach E are A and B.
  • 29. Algorithm
      Graph.Pregel(Seq[vertexIDs])(
        # message consumption
        if vertex.state == 1
          vertex.state → 2
        else if vertex.state == 0
          if (vertex.adjacentVertices ∩ Seq[vertexIDs]) isNotEmpty
            vertex.state → 2
        # message aggregator
        Seq[existing vertex IDs] ∪ Seq[new vertex ID]
      )
  • 30. Algorithm
      # message generation
      for each triplet
        if destinationVertex.state == 1
          message(sourceVertexID, Seq[destinationVertexID])
          message(destinationVertexID, Seq[destinationVertexID])
        else if sourceVertex.state == 1 and destinationVertex.state == 2
          message(sourceVertexID, Seq[destinationVertexID])
        else
          message(empty)
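The reachability computation can also be sketched with a simpler state than the three-state scheme above. This is a simplified variant, not the authors' algorithm: each vertex carries a single Boolean ("can reach the target"), and a vertex is marked once any of its out-neighbors is marked. The object name `ReachesTarget` and the sample graph are assumptions, and Spark with GraphX is assumed to be on the classpath.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph, VertexId}

// Hypothetical sketch; a Boolean-state simplification of the slides' algorithm.
object ReachesTarget {
  // Returns the ids of all vertices (other than target) with a directed path to target.
  def canReach(graph: Graph[String, Int], target: VertexId): Set[VertexId] = {
    // Initially only the target itself is marked.
    val init = graph.mapVertices((id, _) => id == target)
    val result = init.pregel(false)(
      // message consumption: once marked, a vertex stays marked
      (id, reaches, msg) => reaches || msg,
      // message generation: a marked destination marks an unmarked source
      t => if (t.dstAttr && !t.srcAttr) Iterator((t.srcId, true)) else Iterator.empty,
      // message aggregation
      (a, b) => a || b)
    result.vertices
      .filter { case (id, reaches) => reaches && id != target }
      .map(_._1).collect().toSet
  }

  def main(args: Array[String]): Unit = {
    val sc = SparkContext.getOrCreate(
      new SparkConf().setMaster("local[*]").setAppName("reachability"))
    val vertices = sc.parallelize(Seq(
      (1L, "A"), (2L, "B"), (3L, "C"), (4L, "D"), (5L, "E")))
    val edges = sc.parallelize(Seq(
      Edge(1L, 2L, 0), Edge(2L, 5L, 0), Edge(5L, 3L, 0), Edge(4L, 3L, 0)))
    println(canReach(Graph(vertices, edges), 5L))
    sc.stop()
  }
}
```

In a watershed setting, the edges would point downstream and the target would be the outlet cell; the sample graph here is made up so that A and B reach E, as in the slide's example.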
  • 31. References
      • Fork our repository at https://github.com/anantasty/SparkAlgorithms
      • Follow us at https://github.com/codeAshu and https://github.com/kaushikranjan
      • https://spark.apache.org/docs/latest/graphx-programming-guide.html