SlideShare a Scribd company logo
1 of 48
Download to read offline
Scalable graph analysis with
Apache Giraph and Spark GraphX
Roman Shaposhnik rvs@apache.org @rhatr
Director of Open Source, Pivotal Inc.
Introduction into
scalable graph analysis with
Apache Giraph and Spark GraphX
Roman Shaposhnik rvs@apache.org @rhatr
Director of Open Source, Pivotal Inc.
Shameless plug #1
Shameless plug #1
Agenda:
Lets define some terms
•  Graph is a G = (V, E), where E VxV
•  Directed multigraphs with properties attached to each vertex and edge

foo
bar
fee
Lets define some terms
•  Graph is a G = (V, E), where E VxV
•  Directed multigraphs with properties attached to each vertex and edge

foo
bar
fee
Lets define some terms
•  Graph is a G = (V, E), where E VxV
•  Directed multigraphs with properties attached to each vertex and edge

foo
bar
fee
2
1
Lets define some terms
•  Graph is a G = (V, E), where E VxV
•  Directed multigraphs with properties attached to each vertex and edge

foo
bar
fee
2
1 foo
bar
fee
42
fum
What kind of graphs are we talking about?
• Page ranking on Facebook social graph (mid 2013)
•  10^9 (billions) vertices
•  10^12 (trillion) edges
•  10^15 (petabtybe) cold storage data scale
•  200 servers
•  …all in under 4 minutes!
“On day one Doug created
HDFS and MapReduce”
Google papers that started it all
• GFS (file system)
•  distributed
•  replicated
•  non-POSIX"

• MapReduce (computational framework)
•  distributed
•  batch-oriented (long jobs; final results)
•  data-gravity aware
•  designed for “embarrassingly parallel” algorithms
HDFS pools and abstracts direct-attached storage
…
HDFS
MR MR
A Unix analogy
§ It is as though instead of:
$	
  grep	
  foo	
  bar.txt	
  |	
  tr	
  “,”	
  “	
  “	
  |	
  sort	
  -­‐u	
  
	
  
§ We are doing:
$	
  grep	
  foo	
  <	
  bar.txt	
  >	
  /tmp/1.txt	
  
$	
  tr	
  “,”	
  “	
  “	
  	
  <	
  /tmp/1.txt	
  >	
  /tmp/2.txt	
  
$	
  sort	
  –u	
  <	
  /tmp/2.txt	
  
Enter Apache Spark
RAM is the new disk, Disk is the new tape
Source: UC Berkeley Spark project (just the image)
RDDs instead of HDFS files, RAM instead of Disk
warnings = textFile(…).filter(_.contains(“warning”))
.map(_.split(‘ ‘)(1))
HadoopRDD
path = hdfs://
FilteredRDD
contains…
MappedRDD
split…
pooled RAM
RDDs: resilient, distributed, datasets
§ Distributed on a cluster in RAM
§ Immutable (mostly)
§ Can be evicted, snapshotted, etc.
§ Manipulated via parallel operators (map, etc.)
§ Automatically rebuilt on failure
§ A parallel ecosystem
§ A solution to iterative and multi-stage apps
What’s so special about Graphs and
big data?
Graph relationships
§ Entities in your data: tuples
-  customer data
-  product data
-  interaction data
§ Connection between entities: graphs
-  social network or my customers
-  clustering of customers vs. products
A word about Graph databases
§  Plenty available
-  Neo4J, Titan, etc.
§  Benefits
-  Query language
-  Tightly integrate systems with few moving parts
-  High performance on known data sets
§  Shortcomings
-  Not easy to scale horizontally
-  Don’t integrate with HDFS
-  Combine storage and computational layers
-  A sea of APIs
What’s the key API?
§ Directed multi-graph with labels attached to vertices and edges
§ Defining vertices and edges dynamically
§ Selecting sub-graphs
§ Mutating the topology of the graph
§ Partitioning the graph
§ Computing model that is
-  iterative
-  scalable (shared nothing)
-  resilient
-  easy to manage at scale
Bulk Synchronous Parallel
BSP compute model
BSP in a nutshell
time
communications
local
processing
barrier #1
barrier #2
barrier #3
Vertex-centric BSP application
@rhatr
@TheASF
@c0sin
“Think like a vertex”
•  I know my local state
•  I know my neighbors
•  I can send messages to vertices
•  I can declare that I am done
•  I can mutate graph topology
Local state, global messaging
time
communications
vertices are
doing local
computing
and pooling 
messages
superstep #1
all vertices are
done computing
superstep #2
Lets put it all together
Hadoop ecosystem view
HDFS
Pig
Sqoop Flume
MR
Hive
Tez
Giraph
Mahout
Spark
SparkSQL
MLib
GraphX
HAWQ
Kafka
YARN
MADlib
Spark view
HDFS, Ceph, GlusterFS, S3
Hive
Spark
SparkSQL
MLib
GraphX
Kafka
YARN, Mesos, MR
Enough boxology!
Lets look at some code
Our toy for the rest of this talk
Adjacency lists stored on HDFS
$ hadoop fs –cat /tmp/graph/1.txt
1
2 1 3
3 1 2
@rhatr
@TheASF
@c0sin
3
1
2
Graph modeling in GraphX
§  The property graph is parameterized over the vertex (VD) and edge (ED) types
class Graph[VD, ED] {
val vertices: VertexRDD[VD]
val edges: EdgeRDD[ED]
}
§  Graph[(String, String), String]
Hello world in GraphX
$ spark*/bin/spark-shell
scala val inputFile = sc.textFile(“hdfs:///tmp/graph/1.txt”)
scala val edges = inputFile.flatMap(s = { // “2 1 3”
val l = s.split(t); // [ “2”, “1”, “3” ]
l.drop(1).map(x = (l.head.toLong, x.toLong)) // [ (2, 1), (2, 3) ]
})
scala val graph = Graph.fromEdgeTuples(edges, ) // Graph[String, Int]
scala val result = graph.collectNeighborIds(EdgeDirection.Out).map(x =
println(Hello world from the:  + x._1 +  :  + x._2.mkString( )) )
scala result.collect() // don’t try this @home
Hello world from the: 1 :
Hello world from the: 2 : 1 3
Hello world from the: 3 : 1 2
Graph modeling in Giraph
BasicComputationI	
  extends	
  WritableComparable,	
  	
  	
  	
  	
  //	
  VertexID	
  	
  	
  -­‐-­‐	
  vertex	
  ref	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  
	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  V	
  extends	
  Writable,	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  //	
  VertexData	
  -­‐-­‐	
  a	
  vertex	
  datum	
  
	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  E	
  extends	
  Writable,	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  //	
  EdgeData	
  	
  	
  -­‐-­‐	
  an	
  edge	
  label	
  
	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  M	
  extends	
  Writable	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  //	
  MessageData-­‐–	
  message	
  payload	
  
	
  
	
  
V	
  is	
  sort	
  of	
  like	
  VD	
  
E	
  is	
  sort	
  of	
  like	
  ED	
  
Hello world in Giraph
public class GiraphHelloWorld extends
BasicComputationIntWritable, IntWritable, NullWritable, NullWritable {
public void compute(VertexIntWritable, IntWritable, NullWritable vertex,
IterableNullWritable messages) {
System.out.print(“Hello world from the: “ + vertex.getId() + “ : “);
for (EdgeIntWritable, NullWritable e : vertex.getEdges()) {
System.out.print(“ “ + e.getTargetVertexId());
}
System.out.println(“”);
vertex.voteToHalt();
}
}
How to run it
$ giraph target/*.jar giraph.GiraphHelloWorld 
-vip /tmp/graph/ 
-vif org.apache.giraph.io.formats.IntIntNullTextInputFormat 
-w 1 
-ca giraph.SplitMasterWorker=false,giraph.logLevel=error
Hello world from the: 1 :
Hello world from the: 2 : 1 3
Hello world from the: 3 : 1 2
Anatomy of Giraph run
BSP assumes an exclusively vertex view
Turning Twitter into Facebook
@rhatr
@TheASF
@c0sin
@rhatr
@TheASF
@c0sin
Hello world in Giraph
public void compute(VertexText, DoubleWritable, DoubleWritable vertex, IterableText ms ){
if (getSuperstep() == 0) {
sendMessageToAllEdges(vertex, vertex.getId());
} else {
for (Text m : ms) {
if (vertex.getEdgeValue(m) == null) {
vertex.addEdge(EdgeFactory.create(m, SYNTHETIC_EDGE));
}
}
}
vertex.voteToHalt();
}
BSP in GraphX
Single source shortest path
scala val sssp = graph.pregel(Double.PositiveInfinity) // Initial message
((id, dist, newDist) = math.min(dist, newDist), // Vertex Program
triplet = { // Send Message
if (triplet.srcAttr + triplet.attr  triplet.dstAttr) {
Iterator((triplet.dstId, triplet.srcAttr + triplet.attr))
} else {
Iterator.empty
}
},
(a,b) = math.min(a,b)) // Merge Messages
scala println(sssp.vertices.collect.mkString(n))
2
42
0
3
Single source shortest path
scala val sssp = graph.pregel(Double.PositiveInfinity) // Initial message
((id, dist, newDist) = math.min(dist, newDist), // Vertex Program
triplet = { // Send Message
if (triplet.srcAttr + triplet.attr  triplet.dstAttr) {
Iterator((triplet.dstId, triplet.srcAttr + triplet.attr))
} else {
Iterator.empty
}
},
(a,b) = math.min(a,b)) // Merge Messages
scala println(sssp.vertices.collect.mkString(n))
2
5
0
3
Operational views of the graph
Masking instead of mutation
§ def subgraph(
epred: EdgeTriplet[VD,ED] = Boolean = (x = true),
vpred: (VertexID, VD) = Boolean = ((v, d) = true))
: Graph[VD, ED]
§ def mask[VD2, ED2](other: Graph[VD2, ED2]): Graph[VD, ED]
Built-in algorithms
§  def pageRank(tol: Double, resetProb: Double = 0.15):
Graph[Double, Double]
§  def connectedComponents(): Graph[VertexID, ED]
§  def triangleCount(): Graph[Int, ED]
§  def stronglyConnectedComponents(numIter: Int): Graph[VertexID, ED]
Final thoughts
Giraph
§ An unconstrained BSP framework
§ Specialized fully mutable,
dynamically balanced in-memory
graph representation
§ Very procedural, vertex-centric
programming model
§ Genuine part of Hadoop ecosystem
§ Definitely a 1.0
GraphX
§ An RDD framework
§ Graphs are “views” on RDDs and
thus immutable
§ Functional-like, “declarative”
programming model
§ Genuine part of Spark ecosystem
§ Technically still an alpha
QA
Thanks!

More Related Content

What's hot

Dynamic Community Detection for Large-scale e-Commerce data with Spark Stream...
Dynamic Community Detection for Large-scale e-Commerce data with Spark Stream...Dynamic Community Detection for Large-scale e-Commerce data with Spark Stream...
Dynamic Community Detection for Large-scale e-Commerce data with Spark Stream...Spark Summit
 
Ehtsham Elahi, Senior Research Engineer, Personalization Science and Engineer...
Ehtsham Elahi, Senior Research Engineer, Personalization Science and Engineer...Ehtsham Elahi, Senior Research Engineer, Personalization Science and Engineer...
Ehtsham Elahi, Senior Research Engineer, Personalization Science and Engineer...MLconf
 
Apache Giraph: Large-scale graph processing done better
Apache Giraph: Large-scale graph processing done betterApache Giraph: Large-scale graph processing done better
Apache Giraph: Large-scale graph processing done better🧑‍💻 Manuel Coppotelli
 
GraphX: Graph analytics for insights about developer communities
GraphX: Graph analytics for insights about developer communitiesGraphX: Graph analytics for insights about developer communities
GraphX: Graph analytics for insights about developer communitiesPaco Nathan
 
An excursion into Graph Analytics with Apache Spark GraphX
An excursion into Graph Analytics with Apache Spark GraphXAn excursion into Graph Analytics with Apache Spark GraphX
An excursion into Graph Analytics with Apache Spark GraphXKrishna Sankar
 
Matrix Factorizations at Scale: a Comparison of Scientific Data Analytics on ...
Matrix Factorizations at Scale: a Comparison of Scientific Data Analytics on ...Matrix Factorizations at Scale: a Comparison of Scientific Data Analytics on ...
Matrix Factorizations at Scale: a Comparison of Scientific Data Analytics on ...Databricks
 
Mapreduce in Search
Mapreduce in SearchMapreduce in Search
Mapreduce in SearchAmund Tveit
 
Graph databases: Tinkerpop and Titan DB
Graph databases: Tinkerpop and Titan DBGraph databases: Tinkerpop and Titan DB
Graph databases: Tinkerpop and Titan DBMohamed Taher Alrefaie
 
Exploring Titan and Spark GraphX for Analyzing Time-Varying Electrical Networks
Exploring Titan and Spark GraphX for Analyzing Time-Varying Electrical Networks Exploring Titan and Spark GraphX for Analyzing Time-Varying Electrical Networks
Exploring Titan and Spark GraphX for Analyzing Time-Varying Electrical Networks DataWorks Summit/Hadoop Summit
 
Large Scale Machine Learning with Apache Spark
Large Scale Machine Learning with Apache SparkLarge Scale Machine Learning with Apache Spark
Large Scale Machine Learning with Apache SparkCloudera, Inc.
 
Apache spark-melbourne-april-2015-meetup
Apache spark-melbourne-april-2015-meetupApache spark-melbourne-april-2015-meetup
Apache spark-melbourne-april-2015-meetupNed Shawa
 
Data profiling in Apache Calcite
Data profiling in Apache CalciteData profiling in Apache Calcite
Data profiling in Apache CalciteDataWorks Summit
 
PyCascading for Intuitive Flow Processing with Hadoop (gabor szabo)
PyCascading for Intuitive Flow Processing with Hadoop (gabor szabo)PyCascading for Intuitive Flow Processing with Hadoop (gabor szabo)
PyCascading for Intuitive Flow Processing with Hadoop (gabor szabo)PyData
 
Fast Data Analytics with Spark and Python
Fast Data Analytics with Spark and PythonFast Data Analytics with Spark and Python
Fast Data Analytics with Spark and PythonBenjamin Bengfort
 
A Scalable Hierarchical Clustering Algorithm Using Spark: Spark Summit East t...
A Scalable Hierarchical Clustering Algorithm Using Spark: Spark Summit East t...A Scalable Hierarchical Clustering Algorithm Using Spark: Spark Summit East t...
A Scalable Hierarchical Clustering Algorithm Using Spark: Spark Summit East t...Spark Summit
 
Challenging Web-Scale Graph Analytics with Apache Spark with Xiangrui Meng
Challenging Web-Scale Graph Analytics with Apache Spark with Xiangrui MengChallenging Web-Scale Graph Analytics with Apache Spark with Xiangrui Meng
Challenging Web-Scale Graph Analytics with Apache Spark with Xiangrui MengDatabricks
 
Spark & Cassandra at DataStax Meetup on Jan 29, 2015
Spark & Cassandra at DataStax Meetup on Jan 29, 2015 Spark & Cassandra at DataStax Meetup on Jan 29, 2015
Spark & Cassandra at DataStax Meetup on Jan 29, 2015 Sameer Farooqui
 
Intro to Spark and Spark SQL
Intro to Spark and Spark SQLIntro to Spark and Spark SQL
Intro to Spark and Spark SQLjeykottalam
 
Spark Meetup @ Netflix, 05/19/2015
Spark Meetup @ Netflix, 05/19/2015Spark Meetup @ Netflix, 05/19/2015
Spark Meetup @ Netflix, 05/19/2015Yves Raimond
 

What's hot (20)

Dynamic Community Detection for Large-scale e-Commerce data with Spark Stream...
Dynamic Community Detection for Large-scale e-Commerce data with Spark Stream...Dynamic Community Detection for Large-scale e-Commerce data with Spark Stream...
Dynamic Community Detection for Large-scale e-Commerce data with Spark Stream...
 
Ehtsham Elahi, Senior Research Engineer, Personalization Science and Engineer...
Ehtsham Elahi, Senior Research Engineer, Personalization Science and Engineer...Ehtsham Elahi, Senior Research Engineer, Personalization Science and Engineer...
Ehtsham Elahi, Senior Research Engineer, Personalization Science and Engineer...
 
Apache Giraph: Large-scale graph processing done better
Apache Giraph: Large-scale graph processing done betterApache Giraph: Large-scale graph processing done better
Apache Giraph: Large-scale graph processing done better
 
GraphX: Graph analytics for insights about developer communities
GraphX: Graph analytics for insights about developer communitiesGraphX: Graph analytics for insights about developer communities
GraphX: Graph analytics for insights about developer communities
 
An excursion into Graph Analytics with Apache Spark GraphX
An excursion into Graph Analytics with Apache Spark GraphXAn excursion into Graph Analytics with Apache Spark GraphX
An excursion into Graph Analytics with Apache Spark GraphX
 
Matrix Factorizations at Scale: a Comparison of Scientific Data Analytics on ...
Matrix Factorizations at Scale: a Comparison of Scientific Data Analytics on ...Matrix Factorizations at Scale: a Comparison of Scientific Data Analytics on ...
Matrix Factorizations at Scale: a Comparison of Scientific Data Analytics on ...
 
Mapreduce in Search
Mapreduce in SearchMapreduce in Search
Mapreduce in Search
 
Graph databases: Tinkerpop and Titan DB
Graph databases: Tinkerpop and Titan DBGraph databases: Tinkerpop and Titan DB
Graph databases: Tinkerpop and Titan DB
 
Exploring Titan and Spark GraphX for Analyzing Time-Varying Electrical Networks
Exploring Titan and Spark GraphX for Analyzing Time-Varying Electrical Networks Exploring Titan and Spark GraphX for Analyzing Time-Varying Electrical Networks
Exploring Titan and Spark GraphX for Analyzing Time-Varying Electrical Networks
 
Large Scale Machine Learning with Apache Spark
Large Scale Machine Learning with Apache SparkLarge Scale Machine Learning with Apache Spark
Large Scale Machine Learning with Apache Spark
 
Apache spark-melbourne-april-2015-meetup
Apache spark-melbourne-april-2015-meetupApache spark-melbourne-april-2015-meetup
Apache spark-melbourne-april-2015-meetup
 
Data profiling in Apache Calcite
Data profiling in Apache CalciteData profiling in Apache Calcite
Data profiling in Apache Calcite
 
PyCascading for Intuitive Flow Processing with Hadoop (gabor szabo)
PyCascading for Intuitive Flow Processing with Hadoop (gabor szabo)PyCascading for Intuitive Flow Processing with Hadoop (gabor szabo)
PyCascading for Intuitive Flow Processing with Hadoop (gabor szabo)
 
Fast Data Analytics with Spark and Python
Fast Data Analytics with Spark and PythonFast Data Analytics with Spark and Python
Fast Data Analytics with Spark and Python
 
Apache Giraph
Apache GiraphApache Giraph
Apache Giraph
 
A Scalable Hierarchical Clustering Algorithm Using Spark: Spark Summit East t...
A Scalable Hierarchical Clustering Algorithm Using Spark: Spark Summit East t...A Scalable Hierarchical Clustering Algorithm Using Spark: Spark Summit East t...
A Scalable Hierarchical Clustering Algorithm Using Spark: Spark Summit East t...
 
Challenging Web-Scale Graph Analytics with Apache Spark with Xiangrui Meng
Challenging Web-Scale Graph Analytics with Apache Spark with Xiangrui MengChallenging Web-Scale Graph Analytics with Apache Spark with Xiangrui Meng
Challenging Web-Scale Graph Analytics with Apache Spark with Xiangrui Meng
 
Spark & Cassandra at DataStax Meetup on Jan 29, 2015
Spark & Cassandra at DataStax Meetup on Jan 29, 2015 Spark & Cassandra at DataStax Meetup on Jan 29, 2015
Spark & Cassandra at DataStax Meetup on Jan 29, 2015
 
Intro to Spark and Spark SQL
Intro to Spark and Spark SQLIntro to Spark and Spark SQL
Intro to Spark and Spark SQL
 
Spark Meetup @ Netflix, 05/19/2015
Spark Meetup @ Netflix, 05/19/2015Spark Meetup @ Netflix, 05/19/2015
Spark Meetup @ Netflix, 05/19/2015
 

Viewers also liked

Using Kafka and Kudu for fast, low-latency SQL analytics on streaming data
Using Kafka and Kudu for fast, low-latency SQL analytics on streaming dataUsing Kafka and Kudu for fast, low-latency SQL analytics on streaming data
Using Kafka and Kudu for fast, low-latency SQL analytics on streaming dataMike Percy
 
Kudu - Fast Analytics on Fast Data
Kudu - Fast Analytics on Fast DataKudu - Fast Analytics on Fast Data
Kudu - Fast Analytics on Fast DataRyan Bosshart
 
Hadoop Graph Processing with Apache Giraph
Hadoop Graph Processing with Apache GiraphHadoop Graph Processing with Apache Giraph
Hadoop Graph Processing with Apache GiraphDataWorks Summit
 
Apache Arrow (Strata-Hadoop World San Jose 2016)
Apache Arrow (Strata-Hadoop World San Jose 2016)Apache Arrow (Strata-Hadoop World San Jose 2016)
Apache Arrow (Strata-Hadoop World San Jose 2016)Wes McKinney
 
Machine Learning with GraphLab Create
Machine Learning with GraphLab CreateMachine Learning with GraphLab Create
Machine Learning with GraphLab CreateTuri, Inc.
 
Time Series Analysis with Spark
Time Series Analysis with SparkTime Series Analysis with Spark
Time Series Analysis with SparkSandy Ryza
 
Scaling Up Machine Learning: How to Benchmark GraphLab Create on Huge Datasets
Scaling Up Machine Learning: How to Benchmark GraphLab Create on Huge DatasetsScaling Up Machine Learning: How to Benchmark GraphLab Create on Huge Datasets
Scaling Up Machine Learning: How to Benchmark GraphLab Create on Huge DatasetsTuri, Inc.
 
Introduction to Apache Kudu
Introduction to Apache KuduIntroduction to Apache Kudu
Introduction to Apache KuduJeff Holoman
 
The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Ana...
The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Ana...The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Ana...
The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Ana...DataWorks Summit/Hadoop Summit
 
Efficient Data Storage for Analytics with Apache Parquet 2.0
Efficient Data Storage for Analytics with Apache Parquet 2.0Efficient Data Storage for Analytics with Apache Parquet 2.0
Efficient Data Storage for Analytics with Apache Parquet 2.0Cloudera, Inc.
 
Next-generation Python Big Data Tools, powered by Apache Arrow
Next-generation Python Big Data Tools, powered by Apache ArrowNext-generation Python Big Data Tools, powered by Apache Arrow
Next-generation Python Big Data Tools, powered by Apache ArrowWes McKinney
 

Viewers also liked (13)

Using Kafka and Kudu for fast, low-latency SQL analytics on streaming data
Using Kafka and Kudu for fast, low-latency SQL analytics on streaming dataUsing Kafka and Kudu for fast, low-latency SQL analytics on streaming data
Using Kafka and Kudu for fast, low-latency SQL analytics on streaming data
 
Kudu - Fast Analytics on Fast Data
Kudu - Fast Analytics on Fast DataKudu - Fast Analytics on Fast Data
Kudu - Fast Analytics on Fast Data
 
HPE Keynote Hadoop Summit San Jose 2016
HPE Keynote Hadoop Summit San Jose 2016HPE Keynote Hadoop Summit San Jose 2016
HPE Keynote Hadoop Summit San Jose 2016
 
Hadoop Graph Processing with Apache Giraph
Hadoop Graph Processing with Apache GiraphHadoop Graph Processing with Apache Giraph
Hadoop Graph Processing with Apache Giraph
 
Apache Arrow (Strata-Hadoop World San Jose 2016)
Apache Arrow (Strata-Hadoop World San Jose 2016)Apache Arrow (Strata-Hadoop World San Jose 2016)
Apache Arrow (Strata-Hadoop World San Jose 2016)
 
Apache kudu
Apache kuduApache kudu
Apache kudu
 
Machine Learning with GraphLab Create
Machine Learning with GraphLab CreateMachine Learning with GraphLab Create
Machine Learning with GraphLab Create
 
Time Series Analysis with Spark
Time Series Analysis with SparkTime Series Analysis with Spark
Time Series Analysis with Spark
 
Scaling Up Machine Learning: How to Benchmark GraphLab Create on Huge Datasets
Scaling Up Machine Learning: How to Benchmark GraphLab Create on Huge DatasetsScaling Up Machine Learning: How to Benchmark GraphLab Create on Huge Datasets
Scaling Up Machine Learning: How to Benchmark GraphLab Create on Huge Datasets
 
Introduction to Apache Kudu
Introduction to Apache KuduIntroduction to Apache Kudu
Introduction to Apache Kudu
 
The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Ana...
The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Ana...The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Ana...
The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Ana...
 
Efficient Data Storage for Analytics with Apache Parquet 2.0
Efficient Data Storage for Analytics with Apache Parquet 2.0Efficient Data Storage for Analytics with Apache Parquet 2.0
Efficient Data Storage for Analytics with Apache Parquet 2.0
 
Next-generation Python Big Data Tools, powered by Apache Arrow
Next-generation Python Big Data Tools, powered by Apache ArrowNext-generation Python Big Data Tools, powered by Apache Arrow
Next-generation Python Big Data Tools, powered by Apache Arrow
 

Similar to Introduction into scalable graph analysis with Apache Giraph and Spark GraphX

Hadoop trainingin bangalore
Hadoop trainingin bangaloreHadoop trainingin bangalore
Hadoop trainingin bangaloreappaji intelhunt
 
The Fundamentals Guide to HDP and HDInsight
The Fundamentals Guide to HDP and HDInsightThe Fundamentals Guide to HDP and HDInsight
The Fundamentals Guide to HDP and HDInsightGert Drapers
 
Apache Flink & Graph Processing
Apache Flink & Graph ProcessingApache Flink & Graph Processing
Apache Flink & Graph ProcessingVasia Kalavri
 
Behm Shah Pagerank
Behm Shah PagerankBehm Shah Pagerank
Behm Shah Pagerankgothicane
 
Full stack analytics with Hadoop 2
Full stack analytics with Hadoop 2Full stack analytics with Hadoop 2
Full stack analytics with Hadoop 2Gabriele Modena
 
Big Data for Mobile
Big Data for MobileBig Data for Mobile
Big Data for MobileBugSense
 
Brief introduction on Hadoop,Dremel, Pig, FlumeJava and Cassandra
Brief introduction on Hadoop,Dremel, Pig, FlumeJava and CassandraBrief introduction on Hadoop,Dremel, Pig, FlumeJava and Cassandra
Brief introduction on Hadoop,Dremel, Pig, FlumeJava and CassandraSomnath Mazumdar
 
Apache Spark: What? Why? When?
Apache Spark: What? Why? When?Apache Spark: What? Why? When?
Apache Spark: What? Why? When?Massimo Schenone
 
MAP REDUCE IN DATA SCIENCE.pptx
MAP REDUCE IN DATA SCIENCE.pptxMAP REDUCE IN DATA SCIENCE.pptx
MAP REDUCE IN DATA SCIENCE.pptxHARIKRISHNANU13
 
Big Data Essentials meetup @ IBM Ljubljana 23.06.2015
Big Data Essentials meetup @ IBM Ljubljana 23.06.2015Big Data Essentials meetup @ IBM Ljubljana 23.06.2015
Big Data Essentials meetup @ IBM Ljubljana 23.06.2015Andrey Vykhodtsev
 

Similar to Introduction into scalable graph analysis with Apache Giraph and Spark GraphX (20)

Hadoop trainingin bangalore
Hadoop trainingin bangaloreHadoop trainingin bangalore
Hadoop trainingin bangalore
 
The Fundamentals Guide to HDP and HDInsight
The Fundamentals Guide to HDP and HDInsightThe Fundamentals Guide to HDP and HDInsight
The Fundamentals Guide to HDP and HDInsight
 
Apache Flink & Graph Processing
Apache Flink & Graph ProcessingApache Flink & Graph Processing
Apache Flink & Graph Processing
 
Behm Shah Pagerank
Behm Shah PagerankBehm Shah Pagerank
Behm Shah Pagerank
 
Full stack analytics with Hadoop 2
Full stack analytics with Hadoop 2Full stack analytics with Hadoop 2
Full stack analytics with Hadoop 2
 
Hadoop
HadoopHadoop
Hadoop
 
Big Data for Mobile
Big Data for MobileBig Data for Mobile
Big Data for Mobile
 
Hadoop ecosystem
Hadoop ecosystemHadoop ecosystem
Hadoop ecosystem
 
Brief introduction on Hadoop,Dremel, Pig, FlumeJava and Cassandra
Brief introduction on Hadoop,Dremel, Pig, FlumeJava and CassandraBrief introduction on Hadoop,Dremel, Pig, FlumeJava and Cassandra
Brief introduction on Hadoop,Dremel, Pig, FlumeJava and Cassandra
 
Cloud jpl
Cloud jplCloud jpl
Cloud jpl
 
Apache Spark: What? Why? When?
Apache Spark: What? Why? When?Apache Spark: What? Why? When?
Apache Spark: What? Why? When?
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
 
Spark training-in-bangalore
Spark training-in-bangaloreSpark training-in-bangalore
Spark training-in-bangalore
 
Hadoop ecosystem
Hadoop ecosystemHadoop ecosystem
Hadoop ecosystem
 
Osd ctw spark
Osd ctw sparkOsd ctw spark
Osd ctw spark
 
MAP REDUCE IN DATA SCIENCE.pptx
MAP REDUCE IN DATA SCIENCE.pptxMAP REDUCE IN DATA SCIENCE.pptx
MAP REDUCE IN DATA SCIENCE.pptx
 
Map Reduce
Map ReduceMap Reduce
Map Reduce
 
Big Data Essentials meetup @ IBM Ljubljana 23.06.2015
Big Data Essentials meetup @ IBM Ljubljana 23.06.2015Big Data Essentials meetup @ IBM Ljubljana 23.06.2015
Big Data Essentials meetup @ IBM Ljubljana 23.06.2015
 
Scala+data
Scala+dataScala+data
Scala+data
 
Map Reduce
Map ReduceMap Reduce
Map Reduce
 

More from rhatr

Unikernels: in search of a killer app and a killer ecosystem
Unikernels: in search of a killer app and a killer ecosystemUnikernels: in search of a killer app and a killer ecosystem
Unikernels: in search of a killer app and a killer ecosystemrhatr
 
You Call that Micro, Mr. Docker? How OSv and Unikernels Help Micro-services S...
You Call that Micro, Mr. Docker? How OSv and Unikernels Help Micro-services S...You Call that Micro, Mr. Docker? How OSv and Unikernels Help Micro-services S...
You Call that Micro, Mr. Docker? How OSv and Unikernels Help Micro-services S...rhatr
 
Tachyon and Apache Spark
Tachyon and Apache SparkTachyon and Apache Spark
Tachyon and Apache Sparkrhatr
 
Apache Spark: killer or savior of Apache Hadoop?
Apache Spark: killer or savior of Apache Hadoop?Apache Spark: killer or savior of Apache Hadoop?
Apache Spark: killer or savior of Apache Hadoop?rhatr
 
OSv: probably the best OS for cloud workloads you've never hear of
OSv: probably the best OS for cloud workloads you've never hear ofOSv: probably the best OS for cloud workloads you've never hear of
OSv: probably the best OS for cloud workloads you've never hear ofrhatr
 
Apache Bigtop: a crash course in deploying a Hadoop bigdata management platform
Apache Bigtop: a crash course in deploying a Hadoop bigdata management platformApache Bigtop: a crash course in deploying a Hadoop bigdata management platform
Apache Bigtop: a crash course in deploying a Hadoop bigdata management platformrhatr
 
Building Google-in-a-box: using Apache SolrCloud and Bigtop to index your big...
Building Google-in-a-box: using Apache SolrCloud and Bigtop to index your big...Building Google-in-a-box: using Apache SolrCloud and Bigtop to index your big...
Building Google-in-a-box: using Apache SolrCloud and Bigtop to index your big...rhatr
 
Elephant in the cloud
Elephant in the cloudElephant in the cloud
Elephant in the cloudrhatr
 

More from rhatr (8)

Unikernels: in search of a killer app and a killer ecosystem
Unikernels: in search of a killer app and a killer ecosystemUnikernels: in search of a killer app and a killer ecosystem
Unikernels: in search of a killer app and a killer ecosystem
 
You Call that Micro, Mr. Docker? How OSv and Unikernels Help Micro-services S...
You Call that Micro, Mr. Docker? How OSv and Unikernels Help Micro-services S...You Call that Micro, Mr. Docker? How OSv and Unikernels Help Micro-services S...
You Call that Micro, Mr. Docker? How OSv and Unikernels Help Micro-services S...
 
Tachyon and Apache Spark
Tachyon and Apache SparkTachyon and Apache Spark
Tachyon and Apache Spark
 
Apache Spark: killer or savior of Apache Hadoop?
Apache Spark: killer or savior of Apache Hadoop?Apache Spark: killer or savior of Apache Hadoop?
Apache Spark: killer or savior of Apache Hadoop?
 
OSv: probably the best OS for cloud workloads you've never hear of
OSv: probably the best OS for cloud workloads you've never hear ofOSv: probably the best OS for cloud workloads you've never hear of
OSv: probably the best OS for cloud workloads you've never hear of
 
Apache Bigtop: a crash course in deploying a Hadoop bigdata management platform
Apache Bigtop: a crash course in deploying a Hadoop bigdata management platformApache Bigtop: a crash course in deploying a Hadoop bigdata management platform
Apache Bigtop: a crash course in deploying a Hadoop bigdata management platform
 
Building Google-in-a-box: using Apache SolrCloud and Bigtop to index your big...
Building Google-in-a-box: using Apache SolrCloud and Bigtop to index your big...Building Google-in-a-box: using Apache SolrCloud and Bigtop to index your big...
Building Google-in-a-box: using Apache SolrCloud and Bigtop to index your big...
 
Elephant in the cloud
Elephant in the cloudElephant in the cloud
Elephant in the cloud
 

Recently uploaded

WSO2CON 2024 - Cloud Native Middleware: Domain-Driven Design, Cell-Based Arch...
WSO2CON 2024 - Cloud Native Middleware: Domain-Driven Design, Cell-Based Arch...WSO2CON 2024 - Cloud Native Middleware: Domain-Driven Design, Cell-Based Arch...
WSO2CON 2024 - Cloud Native Middleware: Domain-Driven Design, Cell-Based Arch...WSO2
 
MarTech Trend 2024 Book : Marketing Technology Trends (2024 Edition) How Data...
MarTech Trend 2024 Book : Marketing Technology Trends (2024 Edition) How Data...MarTech Trend 2024 Book : Marketing Technology Trends (2024 Edition) How Data...
MarTech Trend 2024 Book : Marketing Technology Trends (2024 Edition) How Data...Jittipong Loespradit
 
%in Soweto+277-882-255-28 abortion pills for sale in soweto
%in Soweto+277-882-255-28 abortion pills for sale in soweto%in Soweto+277-882-255-28 abortion pills for sale in soweto
%in Soweto+277-882-255-28 abortion pills for sale in sowetomasabamasaba
 
%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...
%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...
%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...masabamasaba
 
%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...
%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...
%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...masabamasaba
 
%+27788225528 love spells in Knoxville Psychic Readings, Attraction spells,Br...
%+27788225528 love spells in Knoxville Psychic Readings, Attraction spells,Br...%+27788225528 love spells in Knoxville Psychic Readings, Attraction spells,Br...
%+27788225528 love spells in Knoxville Psychic Readings, Attraction spells,Br...masabamasaba
 
%in kempton park+277-882-255-28 abortion pills for sale in kempton park
%in kempton park+277-882-255-28 abortion pills for sale in kempton park %in kempton park+277-882-255-28 abortion pills for sale in kempton park
%in kempton park+277-882-255-28 abortion pills for sale in kempton park masabamasaba
 
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...Health
 
%in Benoni+277-882-255-28 abortion pills for sale in Benoni
%in Benoni+277-882-255-28 abortion pills for sale in Benoni%in Benoni+277-882-255-28 abortion pills for sale in Benoni
%in Benoni+277-882-255-28 abortion pills for sale in Benonimasabamasaba
 
%in tembisa+277-882-255-28 abortion pills for sale in tembisa
%in tembisa+277-882-255-28 abortion pills for sale in tembisa%in tembisa+277-882-255-28 abortion pills for sale in tembisa
%in tembisa+277-882-255-28 abortion pills for sale in tembisamasabamasaba
 
%in Midrand+277-882-255-28 abortion pills for sale in midrand
%in Midrand+277-882-255-28 abortion pills for sale in midrand%in Midrand+277-882-255-28 abortion pills for sale in midrand
%in Midrand+277-882-255-28 abortion pills for sale in midrandmasabamasaba
 
Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...
Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...
Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...SelfMade bd
 
Direct Style Effect Systems - The Print[A] Example - A Comprehension Aid
Direct Style Effect Systems -The Print[A] Example- A Comprehension AidDirect Style Effect Systems -The Print[A] Example- A Comprehension Aid
Direct Style Effect Systems - The Print[A] Example - A Comprehension AidPhilip Schwarz
 
%in tembisa+277-882-255-28 abortion pills for sale in tembisa
%in tembisa+277-882-255-28 abortion pills for sale in tembisa%in tembisa+277-882-255-28 abortion pills for sale in tembisa
%in tembisa+277-882-255-28 abortion pills for sale in tembisamasabamasaba
 
WSO2CON 2024 - Does Open Source Still Matter?
WSO2CON 2024 - Does Open Source Still Matter?WSO2CON 2024 - Does Open Source Still Matter?
WSO2CON 2024 - Does Open Source Still Matter?WSO2
 
%in Rustenburg+277-882-255-28 abortion pills for sale in Rustenburg
%in Rustenburg+277-882-255-28 abortion pills for sale in Rustenburg%in Rustenburg+277-882-255-28 abortion pills for sale in Rustenburg
%in Rustenburg+277-882-255-28 abortion pills for sale in Rustenburgmasabamasaba
 
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...Steffen Staab
 

Recently uploaded (20)

WSO2CON 2024 - Cloud Native Middleware: Domain-Driven Design, Cell-Based Arch...
WSO2CON 2024 - Cloud Native Middleware: Domain-Driven Design, Cell-Based Arch...WSO2CON 2024 - Cloud Native Middleware: Domain-Driven Design, Cell-Based Arch...
WSO2CON 2024 - Cloud Native Middleware: Domain-Driven Design, Cell-Based Arch...
 
MarTech Trend 2024 Book : Marketing Technology Trends (2024 Edition) How Data...
MarTech Trend 2024 Book : Marketing Technology Trends (2024 Edition) How Data...MarTech Trend 2024 Book : Marketing Technology Trends (2024 Edition) How Data...
MarTech Trend 2024 Book : Marketing Technology Trends (2024 Edition) How Data...
 
%in Soweto+277-882-255-28 abortion pills for sale in soweto
%in Soweto+277-882-255-28 abortion pills for sale in soweto%in Soweto+277-882-255-28 abortion pills for sale in soweto
%in Soweto+277-882-255-28 abortion pills for sale in soweto
 
%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...
%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...
%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...
 
%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...
%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...
%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...
 
Abortion Pill Prices Tembisa [(+27832195400*)] 🏥 Women's Abortion Clinic in T...
Abortion Pill Prices Tembisa [(+27832195400*)] 🏥 Women's Abortion Clinic in T...Abortion Pill Prices Tembisa [(+27832195400*)] 🏥 Women's Abortion Clinic in T...
Abortion Pill Prices Tembisa [(+27832195400*)] 🏥 Women's Abortion Clinic in T...
 
%+27788225528 love spells in Knoxville Psychic Readings, Attraction spells,Br...
%+27788225528 love spells in Knoxville Psychic Readings, Attraction spells,Br...%+27788225528 love spells in Knoxville Psychic Readings, Attraction spells,Br...
%+27788225528 love spells in Knoxville Psychic Readings, Attraction spells,Br...
 
%in kempton park+277-882-255-28 abortion pills for sale in kempton park
%in kempton park+277-882-255-28 abortion pills for sale in kempton park %in kempton park+277-882-255-28 abortion pills for sale in kempton park
%in kempton park+277-882-255-28 abortion pills for sale in kempton park
 
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
 
%in Benoni+277-882-255-28 abortion pills for sale in Benoni
%in Benoni+277-882-255-28 abortion pills for sale in Benoni%in Benoni+277-882-255-28 abortion pills for sale in Benoni
%in Benoni+277-882-255-28 abortion pills for sale in Benoni
 
%in tembisa+277-882-255-28 abortion pills for sale in tembisa
%in tembisa+277-882-255-28 abortion pills for sale in tembisa%in tembisa+277-882-255-28 abortion pills for sale in tembisa
%in tembisa+277-882-255-28 abortion pills for sale in tembisa
 
Microsoft AI Transformation Partner Playbook.pdf
Microsoft AI Transformation Partner Playbook.pdfMicrosoft AI Transformation Partner Playbook.pdf
Microsoft AI Transformation Partner Playbook.pdf
 
%in Midrand+277-882-255-28 abortion pills for sale in midrand
%in Midrand+277-882-255-28 abortion pills for sale in midrand%in Midrand+277-882-255-28 abortion pills for sale in midrand
%in Midrand+277-882-255-28 abortion pills for sale in midrand
 
Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...
Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...
Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...
 
Direct Style Effect Systems - The Print[A] Example - A Comprehension Aid
Direct Style Effect Systems -The Print[A] Example- A Comprehension AidDirect Style Effect Systems -The Print[A] Example- A Comprehension Aid
Direct Style Effect Systems - The Print[A] Example - A Comprehension Aid
 
Abortion Pills In Pretoria ](+27832195400*)[ 🏥 Women's Abortion Clinic In Pre...
Abortion Pills In Pretoria ](+27832195400*)[ 🏥 Women's Abortion Clinic In Pre...Abortion Pills In Pretoria ](+27832195400*)[ 🏥 Women's Abortion Clinic In Pre...
Abortion Pills In Pretoria ](+27832195400*)[ 🏥 Women's Abortion Clinic In Pre...
 
%in tembisa+277-882-255-28 abortion pills for sale in tembisa
%in tembisa+277-882-255-28 abortion pills for sale in tembisa%in tembisa+277-882-255-28 abortion pills for sale in tembisa
%in tembisa+277-882-255-28 abortion pills for sale in tembisa
 
WSO2CON 2024 - Does Open Source Still Matter?
WSO2CON 2024 - Does Open Source Still Matter?WSO2CON 2024 - Does Open Source Still Matter?
WSO2CON 2024 - Does Open Source Still Matter?
 
%in Rustenburg+277-882-255-28 abortion pills for sale in Rustenburg
%in Rustenburg+277-882-255-28 abortion pills for sale in Rustenburg%in Rustenburg+277-882-255-28 abortion pills for sale in Rustenburg
%in Rustenburg+277-882-255-28 abortion pills for sale in Rustenburg
 
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
 

Introduction into scalable graph analysis with Apache Giraph and Spark GraphX

  • 1. Scalable graph analysis with Apache Giraph and Spark GraphX Roman Shaposhnik rvs@apache.org @rhatr Director of Open Source, Pivotal Inc.
  • 2. Introduction into scalable graph analysis with Apache Giraph and Spark GraphX Roman Shaposhnik rvs@apache.org @rhatr Director of Open Source, Pivotal Inc.
  • 6. Lets define some terms •  Graph is a G = (V, E), where E VxV •  Directed multigraphs with properties attached to each vertex and edge foo bar fee
  • 7. Lets define some terms •  Graph is a G = (V, E), where E VxV •  Directed multigraphs with properties attached to each vertex and edge foo bar fee
  • 8. Lets define some terms •  Graph is a G = (V, E), where E VxV •  Directed multigraphs with properties attached to each vertex and edge foo bar fee 2 1
  • 9. Lets define some terms •  Graph is a G = (V, E), where E VxV •  Directed multigraphs with properties attached to each vertex and edge foo bar fee 2 1 foo bar fee 42 fum
  • 10. What kind of graphs are we talking about? • Page ranking on Facebook social graph (mid 2013) •  10^9 (billions) vertices •  10^12 (trillion) edges •  10^15 (petabtybe) cold storage data scale •  200 servers •  …all in under 4 minutes!
  • 11. “On day one Doug created HDFS and MapReduce”
  • 12. Google papers that started it all • GFS (file system) •  distributed •  replicated •  non-POSIX" • MapReduce (computational framework) •  distributed •  batch-oriented (long jobs; final results) •  data-gravity aware •  designed for “embarrassingly parallel” algorithms
  • 13. HDFS pools and abstracts direct-attached storage … HDFS MR MR
  • 14. A Unix analogy § It is as though instead of: $  grep  foo  bar.txt  |  tr  “,”  “  “  |  sort  -­‐u     § We are doing: $  grep  foo  <  bar.txt  >  /tmp/1.txt   $  tr  “,”  “  “    <  /tmp/1.txt  >  /tmp/2.txt   $  sort  –u  <  /tmp/2.txt  
  • 16. RAM is the new disk, Disk is the new tape Source: UC Berkeley Spark project (just the image)
  • 17. RDDs instead of HDFS files, RAM instead of Disk warnings = textFile(…).filter(_.contains(“warning”)) .map(_.split(‘ ‘)(1)) HadoopRDD path = hdfs:// FilteredRDD contains… MappedRDD split… pooled RAM
  • 18. RDDs: resilient, distributed, datasets § Distributed on a cluster in RAM § Immutable (mostly) § Can be evicted, snapshotted, etc. § Manipulated via parallel operators (map, etc.) § Automatically rebuilt on failure § A parallel ecosystem § A solution to iterative and multi-stage apps
  • 19. What’s so special about Graphs and big data?
  • 20. Graph relationships § Entities in your data: tuples -  customer data -  product data -  interaction data § Connection between entities: graphs -  social network or my customers -  clustering of customers vs. products
  • 21. A word about Graph databases §  Plenty available -  Neo4J, Titan, etc. §  Benefits -  Query language -  Tightly integrate systems with few moving parts -  High performance on known data sets §  Shortcomings -  Not easy to scale horizontally -  Don’t integrate with HDFS -  Combine storage and computational layers -  A sea of APIs
  • 22. What’s the key API? § Directed multi-graph with labels attached to vertices and edges § Defining vertices and edges dynamically § Selecting sub-graphs § Mutating the topology of the graph § Partitioning the graph § Computing model that is -  iterative -  scalable (shared nothing) -  resilient -  easy to manage at scale
  • 24. BSP in a nutshell time communications local processing barrier #1 barrier #2 barrier #3
  • 25. Vertex-centric BSP application @rhatr @TheASF @c0sin “Think like a vertex” •  I know my local state •  I know my neighbors •  I can send messages to vertices •  I can declare that I am done •  I can mutate graph topology
  • 26. Local state, global messaging time communications vertices are doing local computing and pooling messages superstep #1 all vertices are done computing superstep #2
  • 27. Lets put it all together
  • 28. Hadoop ecosystem view HDFS Pig Sqoop Flume MR Hive Tez Giraph Mahout Spark SparkSQL MLib GraphX HAWQ Kafka YARN MADlib
  • 29. Spark view HDFS, Ceph, GlusterFS, S3 Hive Spark SparkSQL MLib GraphX Kafka YARN, Mesos, MR
  • 31. Our toy for the rest of this talk Adjacency lists stored on HDFS $ hadoop fs –cat /tmp/graph/1.txt 1 2 1 3 3 1 2 @rhatr @TheASF @c0sin 3 1 2
  • 32. Graph modeling in GraphX §  The property graph is parameterized over the vertex (VD) and edge (ED) types class Graph[VD, ED] { val vertices: VertexRDD[VD] val edges: EdgeRDD[ED] } §  Graph[(String, String), String]
  • 33. Hello world in GraphX $ spark*/bin/spark-shell scala val inputFile = sc.textFile(“hdfs:///tmp/graph/1.txt”) scala val edges = inputFile.flatMap(s = { // “2 1 3” val l = s.split(t); // [ “2”, “1”, “3” ] l.drop(1).map(x = (l.head.toLong, x.toLong)) // [ (2, 1), (2, 3) ] }) scala val graph = Graph.fromEdgeTuples(edges, ) // Graph[String, Int] scala val result = graph.collectNeighborIds(EdgeDirection.Out).map(x = println(Hello world from the: + x._1 + : + x._2.mkString( )) ) scala result.collect() // don’t try this @home Hello world from the: 1 : Hello world from the: 2 : 1 3 Hello world from the: 3 : 1 2
  • 34. Graph modeling in Giraph BasicComputationI  extends  WritableComparable,          //  VertexID      -­‐-­‐  vertex  ref                                                                                                        V  extends  Writable,                              //  VertexData  -­‐-­‐  a  vertex  datum                                    E  extends  Writable,                              //  EdgeData      -­‐-­‐  an  edge  label                                    M  extends  Writable                              //  MessageData-­‐–  message  payload       V  is  sort  of  like  VD   E  is  sort  of  like  ED  
  • 35. Hello world in Giraph public class GiraphHelloWorld extends BasicComputationIntWritable, IntWritable, NullWritable, NullWritable { public void compute(VertexIntWritable, IntWritable, NullWritable vertex, IterableNullWritable messages) { System.out.print(“Hello world from the: “ + vertex.getId() + “ : “); for (EdgeIntWritable, NullWritable e : vertex.getEdges()) { System.out.print(“ “ + e.getTargetVertexId()); } System.out.println(“”); vertex.voteToHalt(); } }
  • 36. How to run it $ giraph target/*.jar giraph.GiraphHelloWorld -vip /tmp/graph/ -vif org.apache.giraph.io.formats.IntIntNullTextInputFormat -w 1 -ca giraph.SplitMasterWorker=false,giraph.logLevel=error Hello world from the: 1 : Hello world from the: 2 : 1 3 Hello world from the: 3 : 1 2
  • 38. BSP assumes an exclusively vertex view
  • 39. Turning Twitter into Facebook @rhatr @TheASF @c0sin @rhatr @TheASF @c0sin
  • 40. Hello world in Giraph public void compute(VertexText, DoubleWritable, DoubleWritable vertex, IterableText ms ){ if (getSuperstep() == 0) { sendMessageToAllEdges(vertex, vertex.getId()); } else { for (Text m : ms) { if (vertex.getEdgeValue(m) == null) { vertex.addEdge(EdgeFactory.create(m, SYNTHETIC_EDGE)); } } } vertex.voteToHalt(); }
  • 42. Single source shortest path scala val sssp = graph.pregel(Double.PositiveInfinity) // Initial message ((id, dist, newDist) = math.min(dist, newDist), // Vertex Program triplet = { // Send Message if (triplet.srcAttr + triplet.attr triplet.dstAttr) { Iterator((triplet.dstId, triplet.srcAttr + triplet.attr)) } else { Iterator.empty } }, (a,b) = math.min(a,b)) // Merge Messages scala println(sssp.vertices.collect.mkString(n)) 2 42 0 3
  • 43. Single source shortest path scala val sssp = graph.pregel(Double.PositiveInfinity) // Initial message ((id, dist, newDist) = math.min(dist, newDist), // Vertex Program triplet = { // Send Message if (triplet.srcAttr + triplet.attr triplet.dstAttr) { Iterator((triplet.dstId, triplet.srcAttr + triplet.attr)) } else { Iterator.empty } }, (a,b) = math.min(a,b)) // Merge Messages scala println(sssp.vertices.collect.mkString(n)) 2 5 0 3
  • 44. Operational views of the graph
  • 45. Masking instead of mutation § def subgraph( epred: EdgeTriplet[VD,ED] = Boolean = (x = true), vpred: (VertexID, VD) = Boolean = ((v, d) = true)) : Graph[VD, ED] § def mask[VD2, ED2](other: Graph[VD2, ED2]): Graph[VD, ED]
  • 46. Built-in algorithms §  def pageRank(tol: Double, resetProb: Double = 0.15): Graph[Double, Double] §  def connectedComponents(): Graph[VertexID, ED] §  def triangleCount(): Graph[Int, ED] §  def stronglyConnectedComponents(numIter: Int): Graph[VertexID, ED]
  • 47. Final thoughts Giraph § An unconstrained BSP framework § Specialized fully mutable, dynamically balanced in-memory graph representation § Very procedural, vertex-centric programming model § Genuine part of Hadoop ecosystem § Definitely a 1.0 GraphX § An RDD framework § Graphs are “views” on RDDs and thus immutable § Functional-like, “declarative” programming model § Genuine part of Spark ecosystem § Technically still an alpha