Beyond Hadoop 1.0: A Holistic View of Hadoop YARN, Spark and GraphLab

Beyond Hadoop Map-Reduce

Dr. Vijay Srinivas Agneeswaran,
Director and Head, Big-data R&D,
Innovation Labs, Impetus

Contents
• Big Data Computations
• Hadoop 2.0 (Hadoop YARN)
• Berkeley data analytics stack
  • BDAS Spark
  • BDAS Discretized Streams
• Real-time analytics with Storm
• PMML Scoring for Naïve Bayes
  • PMML Primer
  • Naïve Bayes Primer

Big Data Computations
Computations/Operations (the "giants" of massive data analysis [1])

• Giant 1 (simple stats) is perfect for Hadoop 1.0.
• Giants 2 (linear algebra), 3 (N-body) and 4 (optimization): Spark from UC Berkeley is efficient.
  • Logistic regression, kernel SVMs, conjugate gradient descent, collaborative filtering, Gibbs sampling, alternating least squares.
• Giant 5 (graph processing): GraphLab, Pregel, Giraph.
• Interactive/on-the-fly data processing: Storm.
• OLAP (data cube operations): Dremel/Drill.
• Data sets that are not embarrassingly parallel?
  • Example is the social group-first approach for consumer churn analysis [2].
• Deep Learning / Artificial Neural Networks
  • Machine vision from Google
  • Speech analysis from Microsoft

[1] National Research Council. Frontiers in Massive Data Analysis. Washington, DC: The National Academies Press, 2013.
[2] Yossi Richter, Elad Yom-Tov, and Noam Slonim. Predicting Customer Churn in Mobile Networks through Analysis of Social Groups. In: Proceedings of the SIAM International Conference on Data Mining, 2010, pp. 732-741.

Hadoop YARN Requirements or 1.0 shortcomings

R1: Scalability
• Single cluster limitation

R2: Multi-tenancy
• Addressed by Hadoop-on-Demand
• Security, quotas

R3: Locality awareness
• Shuffle of records

R4: Shared cluster utilization
• Hogging by users
• Typed slots

R5: Reliability/Availability
• Job Tracker bugs

R6: Iterative Machine Learning

Vinod Kumar Vavilapalli, Arun C Murthy, Chris Douglas, Sharad Agarwal, Mahadev Konar, Robert Evans, Thomas Graves, Jason Lowe, Hitesh Shah, Siddharth Seth, Bikas Saha, Carlo Curino, Owen O'Malley, Sanjay Radia, Benjamin Reed, and Eric Baldeschwieler, “Apache Hadoop YARN: Yet Another Resource Negotiator”, ACM Symposium on Cloud Computing, Oct 2013, ACM Press.

YARN Internals

Application Master
• Sends ResourceRequests to the YARN RM.
• Each request captures the number of containers, resources per container and locality preferences.

YARN RM
• Generates tokens and containers.
• Global view of cluster: monolithic scheduling.

Node Manager
• Node health monitoring; advertises available resources through heartbeats to the RM.
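
The interplay of the three components can be illustrated with a small toy model. This is only a sketch under stated assumptions: the class and field names (`ResourceRequest`, `ToyResourceManager`, `preferred_nodes` and so on) are hypothetical and are not the YARN API, and the greedy placement policy is a simplification chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ResourceRequest:
    """What an Application Master asks for (field names are illustrative)."""
    num_containers: int
    memory_mb: int                      # resources per container
    vcores: int
    preferred_nodes: list = field(default_factory=list)  # locality preference


class ToyResourceManager:
    """Keeps a global view of free resources reported by node heartbeats."""

    def __init__(self):
        self.free = {}  # node name -> (free memory in MB, free vcores)

    def heartbeat(self, node, free_memory_mb, free_vcores):
        # A Node Manager advertises what it currently has available.
        self.free[node] = (free_memory_mb, free_vcores)

    def allocate(self, req):
        """Grant up to req.num_containers, trying preferred nodes first."""
        granted = []
        preferred = [n for n in req.preferred_nodes if n in self.free]
        others = [n for n in sorted(self.free) if n not in req.preferred_nodes]
        for node in preferred + others:
            mem, cores = self.free[node]
            while (len(granted) < req.num_containers
                   and mem >= req.memory_mb and cores >= req.vcores):
                mem -= req.memory_mb
                cores -= req.vcores
                granted.append(node)
            self.free[node] = (mem, cores)
        return granted


rm = ToyResourceManager()
rm.heartbeat("node-a", 4096, 4)
rm.heartbeat("node-b", 2048, 2)
request = ResourceRequest(num_containers=3, memory_mb=1024, vcores=1,
                          preferred_nodes=["node-b"])
placement = rm.allocate(request)
print(placement)
```

The request is satisfied on the preferred node as far as its advertised resources allow, and the remainder spills over to other nodes, which is the essence of a locality preference as opposed to a hard constraint.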


Berkeley Big-data Analytics Stack (BDAS)


BDAS: Spark

Transformations/Actions: Description
• Map(function f1): Pass each element of the RDD through f1 in parallel and return the resulting RDD.
• Filter(function f2): Select elements of the RDD that return true when passed through f2.
• flatMap(function f3): Similar to Map, but f3 returns a sequence to facilitate mapping a single input to multiple outputs.
• Union(RDD r1): Returns the union of the RDD r1 with self.
• Sample(flag, p, seed): Returns a randomly sampled (with seed) p percentage of the RDD.
• groupByKey(noTasks): Can only be invoked on key-value paired data; returns data grouped by key. No. of parallel tasks is given as an argument (default is 8).
• reduceByKey(function f4, noTasks): Aggregates the result of applying f4 on elements with the same key. No. of parallel tasks is the second argument.
• Join(RDD r2, noTasks): Joins RDD r2 with self; computes all possible pairs for a given key.
• groupWith(RDD r3, noTasks): Joins RDD r3 with self and groups by key.
• sortByKey(flag): Sorts the self RDD in ascending or descending order based on flag.
• Reduce(function f5): Aggregates the result of applying function f5 on all elements of the self RDD.
• Collect(): Returns all elements of the RDD as an array.
• Count(): Counts the no. of elements in the RDD.
• take(n): Gets the first n elements of the RDD.
• First(): Equivalent to take(1).
• saveAsTextFile(path): Persists the RDD in a file in HDFS or another Hadoop-supported file system at the given path.
• saveAsSequenceFile(path): Persists the RDD as a Hadoop sequence file. Can be invoked only on key-value paired RDDs that implement the Hadoop Writable interface or equivalent.
• foreach(function f6): Runs f6 in parallel on elements of the self RDD.

[MZ12] Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin, Scott Shenker, and Ion Stoica. 2012. Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing. In Proceedings of the 9th USENIX conference on Networked Systems Design and Implementation (NSDI'12). USENIX Association, Berkeley, CA, USA, 2-2.
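
To make the semantics of flatMap, groupByKey and reduceByKey concrete, here is a word-count sketch that mimics them on ordinary Python lists. It deliberately does not use Spark: the helpers `flat_map`, `group_by_key` and `reduce_by_key` are hypothetical local stand-ins, a plain list stands in for an RDD, and nothing here runs in parallel.

```python
from collections import defaultdict
from functools import reduce


def flat_map(f, data):
    """Like flatMap: f returns a sequence per element; results are flattened."""
    return [out for item in data for out in f(item)]


def group_by_key(pairs):
    """Like groupByKey: collect the values of each key into a list."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)


def reduce_by_key(f, pairs):
    """Like reduceByKey: fold the values of each key with f."""
    return {key: reduce(f, values)
            for key, values in group_by_key(pairs).items()}


lines = ["spark storm", "spark graphlab", "storm"]
words = flat_map(lambda line: line.split(), lines)   # one line -> many words
pairs = [(word, 1) for word in words]                # the Map step
counts = reduce_by_key(lambda a, b: a + b, pairs)    # aggregate per key
print(sorted(counts.items()))
```

In Spark the same pipeline would be expressed as a chain of transformations ending in an action such as Collect(), with the grouping and reduction distributed across tasks.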

BDAS: Use Cases

Ooyala
• Uses Cassandra for video data personalization.
• Pre-compute aggregates vs. on-the-fly queries.
• Moved to Spark for ML and computing views.
• Moved to Shark for on-the-fly queries: C* OLAP aggregate queries take 130 secs on Cassandra, 60 ms in Spark.

Conviva
• Uses Hive for repeatedly running ad-hoc queries on video data.
• Optimized ad-hoc queries using Spark RDDs; found Spark is 30 times faster than Hive.
• ML for connection analysis and video streaming optimization.

Yahoo
• Advertisement targeting: 30K nodes on Hadoop YARN.
  • Hadoop: batch processing
  • Spark: iterative processing
  • Storm: on-the-fly processing
• Content recommendation: collaborative filtering.

Real-time Analytics: R over Storm

Real-time Analytics UC 1: Internet Traffic Analysis

Real-time Analytics UC 2: Arrhythmia Detection

GraphLab: Ideal Engine for Processing Natural Graphs [YL12]

Goals: targeted at machine learning.
• Model graph dependencies, be asynchronous, iterative, dynamic.

Data associated with edges (weights, for instance) and vertices (user profile data, current interests etc.).

Update functions live on each vertex.
• Transforms data in scope of vertex.
• Can choose to trigger neighbours (for example only if Rank changes drastically).
• Run asynchronously till convergence; no global barrier.

Consistency is important in ML algorithms (some do not even converge when there are inconsistent updates, e.g. collaborative filtering).
• GraphLab provides varying levels of consistency: parallelism vs. consistency.

Implemented several algorithms, including ALS, K-means, SVM, belief propagation, matrix factorization, Gibbs sampling, SVD, CoEM etc.
• Co-EM (Expectation Maximization) algorithm 15x faster than Hadoop MR; on distributed GraphLab, only 0.3% of Hadoop execution time.
[YL12] Yucheng Low, Danny Bickson, Joseph Gonzalez, Carlos Guestrin, Aapo Kyrola, and Joseph M. Hellerstein. 2012. Distributed
GraphLab: a framework for machine learning and data mining in the cloud. Proceedings of the VLDB Endowment 5, 8 (April 2012), 716-727.
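
The idea of a per-vertex update function that reschedules its neighbours only when its value changes noticeably can be sketched as follows. This is a single-threaded illustration, not GraphLab code: the function name `async_pagerank`, the toy graph and the `tolerance` parameter are assumptions, and a simple work queue stands in for GraphLab's scheduler.

```python
from collections import deque


def async_pagerank(out_links, damping=0.85, tolerance=1e-4):
    """Run a PageRank update function per vertex until the work queue drains."""
    in_links = {v: [] for v in out_links}
    for src, targets in out_links.items():
        for dst in targets:
            in_links[dst].append(src)

    rank = {v: 1.0 for v in out_links}
    queue = deque(out_links)          # every vertex is scheduled once
    scheduled = set(out_links)

    while queue:                      # no global barrier between updates
        v = queue.popleft()
        scheduled.discard(v)
        # Update function: reads only the scope of v (its in-neighbours).
        new_rank = (1 - damping) + damping * sum(
            rank[u] / len(out_links[u]) for u in in_links[v])
        changed = abs(new_rank - rank[v])
        rank[v] = new_rank
        # Trigger neighbours only if the rank changed noticeably.
        if changed > tolerance:
            for w in out_links[v]:
                if w not in scheduled:
                    scheduled.add(w)
                    queue.append(w)
    return rank


graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}   # hypothetical toy graph
ranks = async_pagerank(graph)
print({v: round(r, 3) for v, r in ranks.items()})
```

Vertices whose inputs have settled drop out of the queue early, so work concentrates on the parts of the graph that are still changing.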

GraphLab 2: PowerGraph – Modeling Natural Graphs [1]

GraphLab could not scale to the Altavista web graph (2002): 1.4B vertices, 6.7B edges.
• Most graph-parallel abstractions assume small neighbourhoods, i.e. low-degree vertices.
• But natural graphs (LinkedIn, Facebook, Twitter) are power-law graphs.
• Power-law graphs are hard to partition; high-degree vertices limit parallelism.

PowerGraph provides a new way of partitioning power-law graphs.
• Edges are tied to machines; vertices (esp. high-degree ones) span machines.
• Execution is split into 3 phases: gather, apply and scatter.

Triangle counting on the Twitter graph
• Hadoop MR took 423 minutes on 1536 machines.
• GraphLab 2 took 1.5 minutes on 1024 cores (64 machines).

[1] Joseph E. Gonzalez, Yucheng Low, Haijie Gu, Danny Bickson, and Carlos Guestrin (2012). "PowerGraph:
Distributed Graph-Parallel Computation on Natural Graphs." Proceedings of the 10th USENIX Symposium
on Operating Systems Design and Implementation (OSDI '12).
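
The gather, apply and scatter phases over edge-partitioned data can be sketched with one PageRank round. This is an illustrative single-process model, not the PowerGraph API: the function `gas_pagerank_step`, the two "machines" and the toy graph are assumptions made for the example.

```python
def gas_pagerank_step(edge_partitions, rank, out_degree, damping=0.85):
    """One gather-apply-scatter round over edges partitioned across machines."""
    # Gather: each machine computes partial sums from its own edges only.
    partials = []
    for edges in edge_partitions:
        partial = {}
        for src, dst in edges:
            partial[dst] = partial.get(dst, 0.0) + rank[src] / out_degree[src]
        partials.append(partial)

    # A high-degree vertex spans machines, so its partial sums are combined.
    gathered = {v: 0.0 for v in rank}
    for partial in partials:
        for v, value in partial.items():
            gathered[v] += value

    # Apply: update the vertex value from the combined gather result.
    new_rank = {v: (1 - damping) + damping * gathered[v] for v in rank}

    # Scatter: activate the targets of edges whose source value changed.
    activated = {dst for edges in edge_partitions for src, dst in edges
                 if abs(new_rank[src] - rank[src]) > 1e-9}
    return new_rank, activated


# Hypothetical toy graph: vertex "hub" has in-edges on both machines.
machine_0 = [("a", "hub"), ("b", "hub")]
machine_1 = [("c", "hub"), ("hub", "a")]
partitions = [machine_0, machine_1]
rank = {v: 1.0 for v in ("a", "b", "c", "hub")}
out_degree = {"a": 1, "b": 1, "c": 1, "hub": 1}
rank, activated = gas_pagerank_step(partitions, rank, out_degree)
print(rank, activated)
```

Because the gather result is a sum, each machine can reduce its local edges independently and only the per-vertex partial sums need to cross machine boundaries, which is what lets a high-degree vertex be split.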

PMML Primer

• Predictive Model Markup Language.
• Developed by DMG (Data Mining Group).
• PMML offers a standard to define a model, so that a model generated in tool-A can be directly used in tool-B.
• XML representation of a model.
• May contain a myriad of data transformations (pre- and post-processing) as well as one or more predictive models.

Naïve Bayes Primer

• A simple probabilistic classifier based on Bayes' Theorem.
• Given features X1, X2, …, Xn, predict a label Y by calculating the probability for all possible Y values.

[Formula figure: Bayes' theorem, with terms labelled Likelihood, Prior and Normalization Constant]
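
The labels Likelihood, Prior and Normalization Constant correspond to the terms of Bayes' theorem under the naïve assumption that the features are conditionally independent given the label. The following is a reconstruction from that standard form, not a copy of the original slide figure:

```latex
P(Y \mid X_1, \dots, X_n)
  = \frac{\overbrace{P(Y)}^{\text{prior}} \;
          \overbrace{\prod_{i=1}^{n} P(X_i \mid Y)}^{\text{likelihood}}}
         {\underbrace{P(X_1, \dots, X_n)}_{\text{normalization constant}}}
```

Since the normalization constant is the same for every candidate label, the predicted label is the Y that maximizes the numerator.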


PMML Scoring for Naïve Bayes

• Wrote a PMML-based scoring engine for the Naïve Bayes algorithm.
• This can theoretically be used in any framework for data processing by invoking the API.
• Deployed a Naïve Bayes PMML generated from R into the Storm, Spark and Samza frameworks.
• Real-time predictions with the above APIs.

Header
• Version and timestamp
• Model development environment information

Data Dictionary
• Variable types; missing, valid and invalid values

Data Munging/Transformation
• Normalization, mapping, discretization

Model
• Model-specific attributes
• Mining Schema
  • Treatment for missing and outlier values
• Targets
  • Prior probability and default
• Outputs
  • List of computed output fields
  • Post-processing
• Definition of model architecture/parameters

<DataDictionary numberOfFields="4">
  <DataField name="Class" optype="categorical" dataType="string">
    <Value value="democrat"/>
    <Value value="republican"/>
  </DataField>
  <DataField name="V1" optype="categorical" dataType="string">
    <Value value="n"/>
    <Value value="y"/>
  </DataField>
  <DataField name="V2" optype="categorical" dataType="string">
    <Value value="n"/>
    <Value value="y"/>
  </DataField>
  <DataField name="V3" optype="categorical" dataType="string">
    <Value value="n"/>
    <Value value="y"/>
  </DataField>
</DataDictionary>

(ctd on the next slide)


<NaiveBayesModel modelName="naiveBayes_Model" functionName="classification"
                 threshold="0.003">
  <MiningSchema>
    <MiningField name="Class" usageType="predicted"/>
    <MiningField name="V1" usageType="active"/>
  </MiningSchema>
  <Output>
    <OutputField name="Predicted_Class" feature="predictedValue"/>
    <OutputField name="Probability_democrat" optype="continuous" dataType="double"
                 feature="probability" value="democrat"/>
    <OutputField name="Probability_republican" optype="continuous" dataType="double"
                 feature="probability" value="republican"/>
  </Output>
  <BayesInputs>
(ctd on the next page)


<BayesInputs>
  <BayesInput fieldName="V1">
    <PairCounts value="n">
      <TargetValueCounts>
        <TargetValueCount value="democrat" count="51"/>
        <TargetValueCount value="republican" count="85"/>
      </TargetValueCounts>
    </PairCounts>
    <PairCounts value="y">
      <TargetValueCounts>
        <!-- ... -->
      </TargetValueCounts>
    </PairCounts>
  </BayesInput>
  <!-- ... -->
</BayesInputs>
<BayesOutput fieldName="Class">
  <TargetValueCounts>
    <!-- ... -->
  </TargetValueCounts>
</BayesOutput>

Definition Of Elements:
• DataDictionary: Definitions for fields as used in mining models (Class, V1, V2, V3).
• NaiveBayesModel: Indicates that this is a NaiveBayes PMML.
• MiningSchema: Lists fields as used in that model. Class is the “predicted” field; V1, V2, V3 are “active” predictor fields.
• Output: Describes a set of result values that can be returned from a model.
• BayesInputs: For each value of each input field, contains the counts of each target value.
• BayesOutput: Contains the counts associated with the values of the target field.

Sample Input
Eg1 - n y y n y y n n n n n n y y y y
Eg2 - n y n y y y n n n n n y y y n y

• 1st, 2nd and 3rd columns: predictor variables (attribute “name” in element MiningField).

• Using these we predict whether the output is Democrat or Republican (PMML element BayesOutput).
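
How the counts in BayesInputs and BayesOutput turn into a prediction can be sketched in a few lines of Python. This is not the scoring engine described in these slides, only a minimal illustration under stated assumptions: the V1 = "n" counts and the threshold are taken from the PMML shown earlier, while every other count is a hypothetical value added to make the fragment complete, and the function names `read_counts` and `score` are made up for this example.

```python
import xml.etree.ElementTree as ET

# Only the V1 = "n" counts and the threshold come from the slides; every
# other count below is a hypothetical value chosen to complete the fragment.
PMML_FRAGMENT = """
<NaiveBayesModel threshold="0.003">
  <BayesInputs>
    <BayesInput fieldName="V1">
      <PairCounts value="n">
        <TargetValueCounts>
          <TargetValueCount value="democrat" count="51"/>
          <TargetValueCount value="republican" count="85"/>
        </TargetValueCounts>
      </PairCounts>
      <PairCounts value="y">
        <TargetValueCounts>
          <TargetValueCount value="democrat" count="73"/>
          <TargetValueCount value="republican" count="23"/>
        </TargetValueCounts>
      </PairCounts>
    </BayesInput>
  </BayesInputs>
  <BayesOutput fieldName="Class">
    <TargetValueCounts>
      <TargetValueCount value="democrat" count="124"/>
      <TargetValueCount value="republican" count="108"/>
    </TargetValueCounts>
  </BayesOutput>
</NaiveBayesModel>
"""


def read_counts(element):
    """Map target value -> count for all TargetValueCount descendants."""
    return {tvc.get("value"): float(tvc.get("count"))
            for tvc in element.iter("TargetValueCount")}


def score(pmml_text, record):
    """Return P(class | record) for each class using the PMML counts."""
    model = ET.fromstring(pmml_text)
    threshold = float(model.get("threshold"))
    class_counts = read_counts(model.find("BayesOutput"))
    scores = dict(class_counts)          # start from the class counts (prior)
    for bayes_input in model.find("BayesInputs"):
        value = record.get(bayes_input.get("fieldName"))
        for pair in bayes_input.findall("PairCounts"):
            if pair.get("value") != value:
                continue
            pair_counts = read_counts(pair)
            for label, total in class_counts.items():
                likelihood = pair_counts.get(label, 0.0) / total
                # A zero likelihood is replaced by the model threshold.
                scores[label] *= likelihood if likelihood > 0 else threshold
    norm = sum(scores.values())
    return {label: s / norm for label, s in scores.items()}


probabilities = score(PMML_FRAGMENT, {"V1": "n"})
predicted = max(probabilities, key=probabilities.get)
print(predicted, probabilities)
```

Because the model is just a set of counts, scoring needs no training-time library, which is what makes it practical to embed the same logic in different data-processing frameworks.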


• 3-node Storm cluster of Xeon machines (8 quad-core CPUs, 32 GB RAM, 32 GB swap space; 1 Nimbus, 2 Supervisors)

Number of records (in millions)    Time taken (seconds)
0.1                                4
0.4                                7
1.0                                12
2.0                                21
10                                 129
25                                 310

• 3-node Spark cluster of Xeon machines (8 quad-core CPUs, 32 GB RAM and 32 GB swap space)

Number of records (in millions)    Time taken
0.1                                1 min 47 sec
0.2                                3 min 35 sec
0.4                                6 min 40 sec
1.0                                35 min 17 sec
10                                 More than 3 hrs
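
The two timing tables can be put on a common scale by converting them to records per second. The short calculation below uses only the record counts that appear in both tables and assumes, which the slides do not state explicitly, that both runs measure the same scoring workload.

```python
# Timings copied from the two tables: records (millions) -> seconds.
storm_seconds = {0.1: 4, 0.4: 7, 1.0: 12}
spark_seconds = {0.1: 1 * 60 + 47, 0.4: 6 * 60 + 40, 1.0: 35 * 60 + 17}


def records_per_second(millions, seconds):
    """Convert a table row into a throughput figure."""
    return millions * 1_000_000 / seconds


ratios = {}
for millions in sorted(storm_seconds):
    storm_rate = records_per_second(millions, storm_seconds[millions])
    spark_rate = records_per_second(millions, spark_seconds[millions])
    ratios[millions] = storm_rate / spark_rate
    print(f"{millions}M records: Storm {storm_rate:,.0f}/s, "
          f"Spark {spark_rate:,.0f}/s, ratio {ratios[millions]:.1f}x")
```

The ratio grows with input size in these measurements, so the gap is not a fixed start-up cost.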


Conclusion
• Beyond Hadoop Map-Reduce philosophy
  • Optimization and other problems.
  • Real-time computation
  • Processing specialized data structures
• PMML scoring
  • Allows traditional analytical tools/algorithms to be re-used.
• Spark for batch computations
• Spark streaming and Storm for real-time.

Thank You!

• Mail: vijay.sa@impetus.co.in
• LinkedIn: http://in.linkedin.com/in/vijaysrinivasagneeswaran
• Blogs: blogs.impetus.com
• Twitter: @a_vijaysrinivas
