Beyond Hadoop Map-Reduce

Dr. Vijay Srinivas Agneeswaran,
Director and Head, Big-data R&D,
Innovation Labs, Impetus
Contents
Big Data Computations

Hadoop 2.0 (Hadoop YARN)

Berkeley data analytics stack
• BDAS Spark
• BDAS Discretized Streams

Real-time analytics with Storm

PMML Scoring for Naïve Bayes
• PMML Primer
• Naïve Bayes Primer

Big Data Computations
Computations/Operations

Giant 1 (simple stats) is perfect
for Hadoop 1.0.

For Giants 2 (linear algebra), 3 (N-body) and 4 (optimization),
Spark from UC Berkeley is efficient.

Interactive/On-the-fly data
processing – Storm.

Logistic regression, kernel SVMs,
conjugate gradient descent,
collaborative filtering, Gibbs
sampling, alternating least squares.

An example is the social group-first
approach for consumer churn
analysis [2]

OLAP – data cube operations.
Dremel/Drill

Data sets – not embarrassingly
parallel?

Deep Learning / Artificial Neural Networks
• Machine vision from Google
• Speech analysis from Microsoft

Giant 5 – Graph processing –
GraphLab, Pregel, Giraph


[1] National Research Council. Frontiers in Massive Data Analysis. Washington, DC: The National Academies Press, 2013.
[2] Yossi Richter, Elad Yom-Tov, and Noam Slonim. Predicting Customer Churn in Mobile Networks through Analysis of
Social Groups. In: Proceedings of the SIAM International Conference on Data Mining, 2010, pp. 732-741.
Hadoop YARN Requirements or 1.0 shortcomings
R1: Scalability
• Single cluster limitation

R2: Multi-tenancy
• Addressed by Hadoop-on-Demand
• Security, Quotas

R3: Locality awareness
• Shuffle of records

R4: Shared cluster utilization
• Hogging by users
• Typed slots

R5: Reliability/Availability
• Job Tracker bugs

R6: Iterative Machine Learning

Vinod Kumar Vavilapalli, Arun C. Murthy, Chris Douglas, Sharad Agarwal, Mahadev Konar, Robert Evans, Thomas Graves,
Jason Lowe, Hitesh Shah, Siddharth Seth, Bikas Saha, Carlo Curino, Owen O'Malley, Sanjay Radia, Benjamin Reed, and
Eric Baldeschwieler, "Apache Hadoop YARN: Yet Another Resource Negotiator", ACM Symposium on Cloud Computing,
Oct 2013, ACM Press.

Hadoop YARN Architecture
YARN Internals

Application Master

• Sends
ResourceRequests
to the YARN RM
• Captures
containers,
resources per
container, locality
preferences.

YARN RM

• Generates tokens
and containers
• Global view of
cluster – monolithic
scheduling.

Node Manager

• Node health
monitoring,
advertise available
resources through
heartbeats to RM.
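
The division of labour above can be illustrated with a toy model. This is a minimal sketch, not the YARN API: the class names, their fields, and the single "memory" resource are hypothetical simplifications chosen for illustration. It shows an Application Master request carrying a container count, a per-container resource size and a locality preference, Node Managers advertising free resources through heartbeats, and a Resource Manager granting containers from its global view.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ResourceRequest:
    """What an Application Master asks for (hypothetical fields)."""
    num_containers: int
    memory_mb: int                        # resources per container
    preferred_node: Optional[str] = None  # locality preference


@dataclass
class NodeManager:
    name: str
    free_memory_mb: int

    def heartbeat(self):
        """Advertise the currently available resources to the RM."""
        return self.name, self.free_memory_mb


@dataclass
class ResourceManager:
    """Keeps a global view of the cluster, built from heartbeats."""
    free: dict = field(default_factory=dict)

    def on_heartbeat(self, node, free_memory_mb):
        self.free[node] = free_memory_mb

    def allocate(self, req):
        """Grant containers, trying the preferred node first."""
        granted = []
        # False sorts before True, so the preferred node comes first.
        nodes = sorted(self.free, key=lambda n: n != req.preferred_node)
        for node in nodes:
            while (len(granted) < req.num_containers
                   and self.free[node] >= req.memory_mb):
                self.free[node] -= req.memory_mb
                granted.append(node)
        return granted


rm = ResourceManager()
for nm in (NodeManager("node-a", 4096), NodeManager("node-b", 2048)):
    rm.on_heartbeat(*nm.heartbeat())

containers = rm.allocate(
    ResourceRequest(num_containers=3, memory_mb=1024, preferred_node="node-b")
)
```

In this run the preferred node can host only two containers, so the third one falls back to the other node; a real scheduler would also handle queues, tokens and rack-level locality.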

Berkeley Big-data Analytics Stack (BDAS)

BDAS: Spark
Transformations/Actions and their descriptions [MZ12]
• Map(function f1): Pass each element of the RDD through f1 in parallel and return the resulting RDD.
• Filter(function f2): Select elements of the RDD that return true when passed through f2.
• flatMap(function f3): Similar to Map, but f3 returns a sequence to facilitate mapping a single input to multiple outputs.
• Union(RDD r1): Returns the result of the union of the RDD r1 with the self.
• Sample(flag, p, seed): Returns a randomly sampled (with seed) p percentage of the RDD.
• groupByKey(noTasks): Can only be invoked on key-value paired data – returns the values grouped by key. No. of parallel tasks is given as an argument (default is 8).
• reduceByKey(function f4, noTasks): Aggregates the result of applying f4 on elements with the same key. No. of parallel tasks is the second argument.
• Join(RDD r2, noTasks): Joins RDD r2 with self – computes all possible pairs for a given key.
• groupWith(RDD r3, noTasks): Joins RDD r3 with self and groups by key.
• sortByKey(flag): Sorts the self RDD in ascending or descending order based on flag.
• Reduce(function f5): Aggregates the result of applying function f5 on all elements of the self RDD.
• Collect(): Return all elements of the RDD as an array.
• Count(): Count the no. of elements in the RDD.
• take(n): Get the first n elements of the RDD.
• First(): Equivalent to take(1).
• saveAsTextFile(path): Persists the RDD in a file in HDFS or other Hadoop supported file system at the given path.
• saveAsSequenceFile(path): Persists the RDD as a Hadoop sequence file. Can be invoked only on key-value paired RDDs that implement the Hadoop writable interface or equivalent.
• foreach(function f6): Run f6 in parallel on elements of the self RDD.

[MZ12] Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael
J. Franklin, Scott Shenker, and Ion Stoica. 2012. Resilient distributed datasets: a fault-tolerant abstraction for
in-memory cluster computing. In Proceedings of the 9th USENIX conference on Networked Systems Design and
Implementation (NSDI'12). USENIX Association, Berkeley, CA, USA, 2-2.
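
To make the semantics of a few of these operations concrete, here is a minimal sketch that emulates flatMap, Filter, Map, groupByKey and reduceByKey on ordinary Python lists. It is an assumption-laden illustration, not Spark code: the helper names `flat_map`, `group_by_key` and `reduce_by_key` are hypothetical, everything runs in one process, and there is no partitioning or lazy evaluation.

```python
from collections import defaultdict
from functools import reduce


def flat_map(f, data):
    """flatMap: f returns a sequence per element; the results are flattened."""
    return [out for item in data for out in f(item)]


def group_by_key(pairs):
    """groupByKey: collect the values of each key into a list."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)


def reduce_by_key(f, pairs):
    """reduceByKey: fold the values that share a key with f."""
    return {key: reduce(f, values)
            for key, values in group_by_key(pairs).items()}


# Word count, skipping the word "or".
lines = ["to be or", "not to be"]
words = flat_map(str.split, lines)                 # flatMap
pairs = [(w, 1) for w in words if w != "or"]       # Filter, then Map
counts = reduce_by_key(lambda a, b: a + b, pairs)  # reduceByKey
```

The same pipeline shape (split, filter, pair, reduce by key) is what the RDD operations listed above express, with the difference that Spark distributes each step across the cluster.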
BDAS: Use Cases
Ooyala
• Uses Cassandra for video data personalization.
• Pre-compute aggregates vs. on-the-fly queries.
• Moved to Spark for ML and computing views.
• Moved to Shark for on-the-fly queries – C* OLAP aggregate queries: 130 secs on Cassandra, 60 ms in Spark.

Conviva
• Uses Hive for repeatedly running ad-hoc queries on video data.
• Optimized ad-hoc queries using Spark RDDs – found Spark is 30 times faster than Hive.
• ML for connection analysis and video streaming optimization.

Yahoo
• Advertisement targeting: 30K nodes on Hadoop YARN.
• Hadoop – batch processing; Spark – iterative processing; Storm – on-the-fly processing.
• Content recommendation – collaborative filtering.
Real-time Analytics: R over Storm

Real-time Analytics UC 1: Internet Traffic Analysis
Real-time Analysis UC2: Arrhythmia Detection

GraphLab: Ideal Engine for Processing Natural Graphs [YL12]
Goals – targeted at machine learning.
• Model graph dependencies, be asynchronous, iterative, dynamic.

Data associated with edges (weights, for instance) and
vertices (user profile data, current interests etc.).

Update functions – live on each vertex
• Transforms data in scope of vertex.
• Can choose to trigger neighbours (for example only if Rank changes drastically)
• Run asynchronously till convergence – no global barrier.

Consistency is important in ML algorithms (some do not even
converge when there are inconsistent updates – collaborative filtering).
• GraphLab – provides varying levels of consistency. Parallelism vs. consistency.

Implemented several algorithms, including ALS, K-means, SVM,
Belief propagation, matrix factorization, Gibbs sampling, SVD, CoEM etc.
• Co-EM (Expectation Maximization) algorithm 15x faster than Hadoop MR –
on distributed GraphLab, only 0.3% of Hadoop execution time.
[YL12] Yucheng Low, Danny Bickson, Joseph Gonzalez, Carlos Guestrin, Aapo Kyrola, and Joseph M. Hellerstein. 2012. Distributed
GraphLab: a framework for machine learning and data mining in the cloud. Proceedings of the VLDB Endowment 5, 8 (April 2012), 716-727.
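
The update-function idea can be sketched in a few lines. This is a minimal single-threaded illustration, assuming a PageRank-style update; `async_pagerank`, its parameters and the tolerance value are hypothetical and not part of GraphLab. It shows the three properties listed above: the update reads only the scope of one vertex, it triggers neighbours only when the rank moved noticeably, and there is no global barrier between updates.

```python
from collections import deque


def async_pagerank(out_links, damping=0.85, tol=1e-4):
    """Apply a per-vertex update function until no vertex is scheduled.

    out_links maps every vertex to the list of vertices it links to.
    """
    vertices = list(out_links)
    in_links = {v: [] for v in vertices}
    for src, targets in out_links.items():
        for dst in targets:
            in_links[dst].append(src)

    rank = {v: 1.0 for v in vertices}
    queue = deque(vertices)      # vertices waiting for an update
    scheduled = set(vertices)

    while queue:                 # no global barrier between updates
        v = queue.popleft()
        scheduled.discard(v)
        # Update function: reads only the scope of v (its in-neighbours).
        new_rank = (1 - damping) + damping * sum(
            rank[u] / len(out_links[u]) for u in in_links[v]
        )
        changed = abs(new_rank - rank[v])
        rank[v] = new_rank
        # Trigger neighbours only if the rank moved noticeably.
        if changed > tol:
            for w in out_links[v]:
                if w not in scheduled:
                    scheduled.add(w)
                    queue.append(w)
    return rank


links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = async_pagerank(links)
```

A distributed engine would run many such updates concurrently, which is where the consistency levels mentioned above come in; this sketch sidesteps that by updating one vertex at a time.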
GraphLab 2: PowerGraph – Modeling Natural Graphs [1]

GraphLab could not scale to the Altavista web graph 2002, 1.4B vertices, 6.7B edges.
• Most graph parallel abstractions assume small neighbourhoods – low degree vertices
• But natural graphs (LinkedIn, Facebook, Twitter) – power law graphs.
• Hard to partition power law graphs, high degree vertices limit parallelism.

PowerGraph provides a new way of partitioning power law graphs
• Edges are tied to machines, vertices (esp. high degree ones) span machines
• Execution split into 3 phases:
  • Gather, apply and scatter.

Triangle counting on Twitter graph
• Hadoop MR took 423 minutes on 1536 machines
• GraphLab 2 took 1.5 minutes on 1024 cores (64 machines)

[1] Joseph E. Gonzalez, Yucheng Low, Haijie Gu, Danny Bickson, and Carlos Guestrin (2012). "PowerGraph:
Distributed Graph-Parallel Computation on Natural Graphs." Proceedings of the 10th USENIX Symposium
on Operating Systems Design and Implementation (OSDI '12).
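
The gather, apply and scatter phases and the edge-based partitioning can be sketched as follows. This is a simplified single-process illustration, not PowerGraph code: `gas_pagerank` and `replication` are hypothetical names, edges are placed round-robin (an assumption made for brevity), and the scatter step is reduced to publishing the new vertex values.

```python
from collections import defaultdict


def gas_pagerank(edges, num_machines=2, damping=0.85, iterations=20):
    """PageRank written as gather / apply / scatter over a vertex-cut."""
    vertices = sorted({v for edge in edges for v in edge})
    out_degree = defaultdict(int)
    for src, _ in edges:
        out_degree[src] += 1

    # Vertex-cut: each edge lives on exactly one machine.
    machines = [[] for _ in range(num_machines)]
    for i, edge in enumerate(edges):
        machines[i % num_machines].append(edge)

    rank = {v: 1.0 for v in vertices}
    for _ in range(iterations):
        # Gather: every machine builds partial sums from its local edges.
        partials = []
        for local_edges in machines:
            acc = defaultdict(float)
            for src, dst in local_edges:
                acc[dst] += rank[src] / out_degree[src]
            partials.append(acc)
        # Apply: combine the partial sums of each vertex and update it.
        new_rank = {
            v: (1 - damping) + damping * sum(p.get(v, 0.0) for p in partials)
            for v in vertices
        }
        # Scatter (simplified): publish the new values for the next round.
        rank = new_rank
    return rank


def replication(edges, num_machines=2):
    """Number of machines each vertex spans under the same edge placement."""
    spans = defaultdict(set)
    for i, (src, dst) in enumerate(edges):
        spans[src].add(i % num_machines)
        spans[dst].add(i % num_machines)
    return {v: len(m) for v, m in spans.items()}


star = [("hub", "a"), ("hub", "b"), ("hub", "c"), ("a", "hub")]
ranks = gas_pagerank(star)
spans = replication(star)
```

Because the gather result is a sum, each machine can compute its share for a high-degree vertex independently and only the partial sums need to be combined, which is the point of letting such vertices span machines.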
PMML Primer

Predictive Model Markup
Language

Developed by DMG (Data
Mining Group)

PMML offers a standard
to define a model, so that
a model generated in
tool-A can be directly
used in tool-B.

XML representation of a
model.

May contain a myriad of
data transformations
(pre- and post-processing)
as well as one or more
predictive models.

Naïve Bayes Primer
A simple probabilistic
classifier based on
Bayes Theorem

Given features
X1, X2, …, Xn, predict a
label Y by calculating
the probability for all
possible values of Y

Likelihood

Normalization Constant

Prior
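
The three labels above name the parts of Bayes' theorem as it is used by the classifier. Written out under the usual naive assumption that the features are conditionally independent given the label (the notation below is the standard textbook form, not taken from the slide):

```latex
P(Y = y \mid X_1, \dots, X_n)
  = \frac{\overbrace{P(Y = y)}^{\text{prior}} \;
          \overbrace{\prod_{i=1}^{n} P(X_i \mid Y = y)}^{\text{likelihood}}}
         {\underbrace{P(X_1, \dots, X_n)}_{\text{normalization constant}}},
\qquad
\hat{y} = \arg\max_{y} \; P(Y = y) \prod_{i=1}^{n} P(X_i \mid Y = y)
```

The normalization constant is the same for every candidate label, so it can be dropped when only the most probable label is needed.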

PMML Scoring for Naïve Bayes
Wrote a PMML based
scoring engine for
Naïve Bayes
algorithm.

This can theoretically
be used in any
framework for data
processing by
invoking the API

Deployed a Naïve
Bayes PMML
generated from R into
Storm / Spark and
Samza frameworks

Real time predictions
with the above APIs

Header
• Version and timestamp
• Model development
environment information

Data Dictionary
• Variable types, missing, valid and invalid values

Data Munging/Transformation
• Normalization, mapping, discretization

Model
• Model-specific attributes
• Mining Schema
  • Treatment for missing and outlier values
• Targets
  • Prior probability and default
• Outputs
  • List of computed output fields
  • Post-processing
• Definition of model architecture/parameters.

PMML Scoring for Naïve Bayes
<DataDictionary numberOfFields="4">
  <DataField name="Class" optype="categorical" dataType="string">
    <Value value="democrat"/>
    <Value value="republican"/>
  </DataField>
  <DataField name="V1" optype="categorical" dataType="string">
    <Value value="n"/>
    <Value value="y"/>
  </DataField>
  <DataField name="V2" optype="categorical" dataType="string">
    <Value value="n"/>
    <Value value="y"/>
  </DataField>
  <DataField name="V3" optype="categorical" dataType="string">
    <Value value="n"/>
    <Value value="y"/>
  </DataField>
</DataDictionary>

<NaiveBayesModel modelName="naiveBayes_Model" functionName="classification" threshold="0.003">
  <MiningSchema>
    <MiningField name="Class" usageType="predicted"/>
    <MiningField name="V1" usageType="active"/>
    <MiningField name="V2" usageType="active"/>
    <MiningField name="V3" usageType="active"/>
  </MiningSchema>
  <Output>
    <OutputField name="Predicted_Class" feature="predictedValue"/>
    <OutputField name="Probability_democrat" optype="continuous" dataType="double" feature="probability" value="democrat"/>
    <OutputField name="Probability_republican" optype="continuous" dataType="double" feature="probability" value="republican"/>
  </Output>
  <BayesInputs>
    <BayesInput fieldName="V1">
      <PairCounts value="n">
        <TargetValueCounts>
          <TargetValueCount value="democrat" count="51"/>
          <TargetValueCount value="republican" count="85"/>
        </TargetValueCounts>
      </PairCounts>
      <PairCounts value="y">
        <TargetValueCounts>
          <TargetValueCount value="democrat" count="73"/>
          <TargetValueCount value="republican" count="23"/>
        </TargetValueCounts>
      </PairCounts>
    </BayesInput>
    <BayesInput fieldName="V2">
    *
    <BayesInput fieldName="V3">
    *
  </BayesInputs>
  <BayesOutput fieldName="Class">
    <TargetValueCounts>
      <TargetValueCount value="democrat" count="124"/>
      <TargetValueCount value="republican" count="108"/>
    </TargetValueCounts>
  </BayesOutput>
PMML Scoring for Naïve Bayes
Definition of Elements:
DataDictionary: Definitions for fields as used in mining models (Class, V1, V2, V3)
NaiveBayesModel: Indicates that this is a Naïve Bayes PMML model
MiningSchema: Lists fields as used in that model. Class is the "predicted" field; V1, V2, V3 are "active" predictor fields
Output: Describes a set of result values that can be returned from a model
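
As a rough illustration of how a scoring engine might read these elements, the sketch below parses a trimmed copy of the PMML shown earlier with Python's standard `xml.etree.ElementTree`. The `<PMML>` wrapper without a namespace, the reduction to two fields, and the function name `read_schema` are assumptions made to keep the example short; real PMML files declare an XML namespace that the lookups would have to include.

```python
import xml.etree.ElementTree as ET

# Trimmed fragment: only Class and V1 are kept from the original model.
PMML_FRAGMENT = """
<PMML>
  <DataDictionary numberOfFields="2">
    <DataField name="Class" optype="categorical" dataType="string">
      <Value value="democrat"/>
      <Value value="republican"/>
    </DataField>
    <DataField name="V1" optype="categorical" dataType="string">
      <Value value="n"/>
      <Value value="y"/>
    </DataField>
  </DataDictionary>
  <NaiveBayesModel modelName="naiveBayes_Model" functionName="classification" threshold="0.003">
    <MiningSchema>
      <MiningField name="Class" usageType="predicted"/>
      <MiningField name="V1" usageType="active"/>
    </MiningSchema>
  </NaiveBayesModel>
</PMML>
"""


def read_schema(pmml_text):
    """Return (allowed values per field, usage type per field)."""
    root = ET.fromstring(pmml_text)
    allowed = {
        data_field.get("name"): [v.get("value") for v in data_field.findall("Value")]
        for data_field in root.iter("DataField")
    }
    usage = {
        mining_field.get("name"): mining_field.get("usageType")
        for mining_field in root.iter("MiningField")
    }
    return allowed, usage


allowed, usage = read_schema(PMML_FRAGMENT)
target = [name for name, kind in usage.items() if kind == "predicted"]
```

Here `allowed` can be used to validate incoming records, and `usage` separates the target field from the active predictors.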
PMML Scoring for Naïve Bayes
Definition of Elements (ctd.):
BayesInputs: For each input field and each of its values, contains the counts of
each target value
BayesOutput: Contains the counts associated with the values of the
target field
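
The counts are all that is needed to score a record. The sketch below is an illustration under stated assumptions, not the scoring engine described in this deck: it hard-codes the V1 counts and the class counts from the PMML shown earlier (V2 and V3 are elided there, so they are left out here too), and the function name `score` is hypothetical. The threshold is used in place of a conditional probability whose count is zero, which is how the PMML `threshold` attribute is defined.

```python
CLASS_COUNTS = {"democrat": 124, "republican": 108}   # from BayesOutput
PAIR_COUNTS = {                                        # from BayesInputs
    "V1": {
        "n": {"democrat": 51, "republican": 85},
        "y": {"democrat": 73, "republican": 23},
    },
}
THRESHOLD = 0.003  # threshold attribute of NaiveBayesModel


def score(record):
    """Return (predicted label, probability per label) for one record."""
    total = sum(CLASS_COUNTS.values())
    joint = {}
    for label, class_count in CLASS_COUNTS.items():
        p = class_count / total                    # prior
        for field_name, value in record.items():
            count = PAIR_COUNTS[field_name][value][label]
            # A zero count would wipe out the product, so use the threshold.
            p *= count / class_count if count else THRESHOLD
        joint[label] = p
    norm = sum(joint.values())                     # normalization constant
    probs = {label: p / norm for label, p in joint.items()}
    return max(probs, key=probs.get), probs


label_y, probs_y = score({"V1": "y"})
label_n, probs_n = score({"V1": "n"})
```

With V1 alone, a "y" vote points to democrat with probability 73/96 (about 0.76) and an "n" vote points to republican with probability 85/136 (0.625); adding V2 and V3 would multiply in their conditional probabilities in the same loop.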

PMML Scoring for Naïve Bayes
Sample Input
Eg1 - n y y n y y n n n n n n y y y y
Eg2 - n y n y y y n n n n n y y y n y

• 1st , 2nd and 3rd Columns:
Predictor variables ( Attribute “name” in element MiningField )

• Using these we predict whether the Output is Democrat or
Republican ( PMML element BayesOutput)

PMML Scoring for Naïve Bayes
• 3-node Xeon Storm cluster (8 quad-core CPUs, 32 GB RAM, 32 GB
swap space, 1 Nimbus, 2 Supervisors)

Number of records (in millions)    Time taken (seconds)
0.1                                4
0.4                                7
1.0                                12
2.0                                21
10                                 129
25                                 310

PMML Scoring for Naïve Bayes
• 3-node Xeon Spark cluster (8 quad-core CPUs, 32 GB RAM and 32
GB swap space)

Number of records (in millions)    Time taken
0.1                                1 min 47 sec
0.2                                3 min 35 sec
0.4                                6 min 40 sec
1.0                                35 min 17 sec
10                                 More than 3 hrs

Conclusion
• Beyond Hadoop Map-Reduce philosophy
  • Optimization and other problems.
  • Real-time computation
  • Processing specialized data structures

• PMML scoring
  • Allows traditional analytical tools/algorithms to be re-used.
  • Spark for batch computations
  • Spark streaming and Storm for real-time.

Thank You!

Mail
• vijay.sa@impetus.co.in

LinkedIn
• http://in.linkedin.com/in/vijaysrinivasagneeswaran

Blogs
• blogs.impetus.com

Twitter
• @a_vijaysrinivas

Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostZilliz
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningLars Bell
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxhariprasad279825
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionDilum Bandara
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfPrecisely
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Enterprise Knowledge
 

Recently uploaded (20)

The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine Tuning
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptx
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An Introduction
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 

Beyond Hadoop 1.0: A Holistic View of Hadoop YARN, Spark and GraphLab

  • 1. Beyond Hadoop Map-Reduce. Dr. Vijay Srinivas Agneeswaran, Director and Head, Big-data R&D, Innovation Labs, Impetus
  • 2. Contents
      • Big Data Computations
      • Hadoop 2.0 (Hadoop YARN)
      • Berkeley data analytics stack: BDAS Spark, BDAS Discretized Streams
      • Real-time analytics with Storm
      • PMML scoring for Naïve Bayes: PMML Primer, Naïve Bayes Primer
  • 3. Big Data Computations (Computations/Operations)
      • Giant 1 (simple stats) is perfect for Hadoop 1.0.
      • Giants 2 (linear algebra), 3 (N-body), 4 (optimization): Spark from UC Berkeley is efficient. Logistic regression, kernel SVMs, conjugate gradient descent, collaborative filtering, Gibbs sampling, alternating least squares. An example is the social group-first approach for consumer churn analysis [2].
      • Giant 5 (graph processing): GraphLab, Pregel, Giraph.
      • Interactive/on-the-fly data processing: Storm.
      • OLAP (data cube operations): Dremel/Drill.
      • Data sets that are not embarrassingly parallel? Deep learning with artificial neural networks: machine vision from Google, speech analysis from Microsoft.
      [1] National Research Council. Frontiers in Massive Data Analysis. Washington, DC: The National Academies Press, 2013.
      [2] Yossi Richter, Elad Yom-Tov, and Noam Slonim. Predicting Customer Churn in Mobile Networks through Analysis of Social Groups. In: Proceedings of SIAM International Conference on Data Mining, 2010, pp. 732-741.
  • 4. Hadoop YARN Requirements or 1.0 shortcomings
      • R1: Scalability. Single cluster limitation.
      • R2: Multi-tenancy. Addressed by Hadoop-on-Demand; security, quotas.
      • R3: Locality awareness. Shuffle of records.
      • R4: Shared cluster utilization. Hogging by users; typed slots.
      • R5: Reliability/Availability. Job Tracker bugs.
      • R6: Iterative Machine Learning.
      Vinod Kumar Vavilapalli, Arun C Murthy, Chris Douglas, Sharad Agarwal, Mahadev Konar, Robert Evans, Thomas Graves, Jason Lowe, Hitesh Shah, Siddharth Seth, Bikas Saha, Carlo Curino, Owen O'Malley, Sanjay Radia, Benjamin Reed, and Eric Baldeschwieler, “Apache Hadoop YARN: Yet Another Resource Negotiator”, ACM Symposium on Cloud Computing, Oct 2013, ACM Press.
  • 6. YARN Internals
      • Application Master: sends ResourceRequests to the YARN RM. A request captures containers, resources per container, and locality preferences.
      • YARN RM: generates tokens and containers. Global view of cluster, monolithic scheduling.
      • Node Manager: node health monitoring; advertises available resources through heartbeats to the RM.
  • 8. BDAS: Spark (Transformations/Actions and their descriptions)
      • map(function f1): Pass each element of the RDD through f1 in parallel and return the resulting RDD.
      • filter(function f2): Select elements of the RDD that return true when passed through f2.
      • flatMap(function f3): Similar to map, but f3 returns a sequence to facilitate mapping a single input to multiple outputs.
      • union(RDD r1): Returns the union of the RDD r1 with the self RDD.
      • sample(flag, p, seed): Returns a randomly sampled (with seed) p percentage of the RDD.
      • groupByKey(noTasks): Can only be invoked on key-value paired data; returns data grouped by key. No. of parallel tasks is given as an argument (default is 8).
      • reduceByKey(function f4, noTasks): Aggregates the result of applying f4 on elements with the same key. No. of parallel tasks is the second argument.
      • join(RDD r2, noTasks): Joins RDD r2 with self; computes all possible pairs for a given key.
      • groupWith(RDD r3, noTasks): Joins RDD r3 with self and groups by key.
      • sortByKey(flag): Sorts the self RDD in ascending or descending order based on flag.
      • reduce(function f5): Aggregates the result of applying function f5 on all elements of the self RDD.
      • collect(): Returns all elements of the RDD as an array.
      • count(): Counts the no. of elements in the RDD.
      • take(n): Gets the first n elements of the RDD.
      • first(): Equivalent to take(1).
      • saveAsTextFile(path): Persists the RDD in a file in HDFS or another Hadoop-supported file system at the given path.
      • saveAsSequenceFile(path): Persists the RDD as a Hadoop sequence file. Can be invoked only on key-value paired RDDs that implement the Hadoop Writable interface or equivalent.
      • foreach(function f6): Runs f6 in parallel on the elements of the self RDD.
      [MZ12] Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin, Scott Shenker, and Ion Stoica. 2012. Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing. In Proceedings of the 9th USENIX conference on Networked Systems Design and Implementation (NSDI'12). USENIX Association, Berkeley, CA, USA, 2-2.
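The semantics of a few of the operations on this slide can be tried on an in-memory list, without a cluster. The sketch below is plain Python, not the Spark API: the helper names `flat_map`, `group_by_key` and `reduce_by_key` are made up for illustration, and the word-count input is an assumed toy example rather than anything from the deck.

```python
from collections import defaultdict
from functools import reduce


def flat_map(f, items):
    """Apply f to each item and flatten the resulting sequences into one list."""
    return [out for item in items for out in f(item)]


def group_by_key(pairs):
    """Collect the values of (key, value) pairs under their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)


def reduce_by_key(f, pairs):
    """Fold the values that share a key with the binary function f."""
    return {key: reduce(f, values) for key, values in group_by_key(pairs).items()}


lines = ["spark storm", "spark yarn spark"]
words = flat_map(str.split, lines)        # one input line yields many words
pairs = [(word, 1) for word in words]     # map each word to a (key, value) pair
counts = reduce_by_key(lambda a, b: a + b, pairs)
print(counts)
```

In a real RDD the same chain would run partition by partition across workers; the point here is only which operations group by key and which fold values.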
  • 9. BDAS: Use Cases
      • Ooyala: Uses Cassandra for video data personalization. Pre-computed aggregates vs on-the-fly queries. Moved to Spark for ML and computing views. Moved to Shark for on-the-fly queries: C* OLAP aggregate queries took 130 secs on Cassandra, 60 ms in Spark.
      • Conviva: Uses Hive for repeatedly running ad-hoc queries on video data. Optimized ad-hoc queries using Spark RDDs and found Spark to be 30 times faster than Hive. ML for connection analysis and video streaming optimization.
      • Yahoo: Advertisement targeting on 30K nodes on Hadoop YARN. Hadoop for batch processing, Spark for iterative processing, Storm for on-the-fly processing. Content recommendation with collaborative filtering.
  • 11. Real-time Analytics: R over Storm
  • 12. Real-time Analytics UC 1: Internet Traffic Analysis
  • 13. Real-time Analytics UC 2: Arrhythmia Detection
  • 14. GraphLab: Ideal Engine for Processing Natural Graphs [YL12]
      • Goals: targeted at machine learning. Model graph dependencies; be asynchronous, iterative, dynamic.
      • Data is associated with edges (weights, for instance) and vertices (user profile data, current interests etc.).
      • Update functions live on each vertex. An update function transforms data in the scope of the vertex, can choose to trigger neighbours (for example, only if Rank changes drastically), and runs asynchronously till convergence with no global barrier.
      • Consistency is important in ML algorithms (some, such as collaborative filtering, do not even converge when there are inconsistent updates). GraphLab provides varying levels of consistency: parallelism vs consistency.
      • Implemented several algorithms, including ALS, K-means, SVM, belief propagation, matrix factorization, Gibbs sampling, SVD, CoEM etc. The Co-EM (Expectation Maximization) algorithm is 15x faster than Hadoop MR; on distributed GraphLab, only 0.3% of Hadoop execution time.
      [YL12] Yucheng Low, Danny Bickson, Joseph Gonzalez, Carlos Guestrin, Aapo Kyrola, and Joseph M. Hellerstein. 2012. Distributed GraphLab: a framework for machine learning and data mining in the cloud. Proceedings of the VLDB Endowment 5, 8 (April 2012), 716-727.
  • 15. GraphLab 2: PowerGraph – Modeling Natural Graphs [1]
      • GraphLab could not scale to the Altavista web graph of 2002 (1.4B vertices, 6.7B edges).
          • Most graph-parallel abstractions assume small neighbourhoods, i.e. low-degree vertices.
          • But natural graphs (LinkedIn, Facebook, Twitter) are power-law graphs.
          • Power-law graphs are hard to partition; high-degree vertices limit parallelism.
      • PowerGraph provides a new way of partitioning power-law graphs.
          • Edges are tied to machines; vertices (esp. high-degree ones) span machines.
          • Execution is split into 3 phases: gather, apply and scatter.
      • Triangle counting on the Twitter graph:
          • Hadoop MR took 423 minutes on 1536 machines.
          • GraphLab 2 took 1.5 minutes on 1024 cores (64 machines).
      [1] Joseph E. Gonzalez, Yucheng Low, Haijie Gu, Danny Bickson, and Carlos Guestrin (2012). "PowerGraph: Distributed Graph-Parallel Computation on Natural Graphs." Proceedings of the 10th USENIX Symposium on Operating Systems Design and Implementation (OSDI '12).
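To make the gather, apply and scatter phases concrete, here is a minimal single-machine sketch. It is not PowerGraph code: PageRank is chosen here as an assumed example workload, the function name `pagerank_gas` and the four-edge graph are made up, and the loop is synchronous, whereas the real system distributes the phases across machines.

```python
def pagerank_gas(edges, num_iters=20, damping=0.85):
    """Run synchronous PageRank written as gather, apply and scatter steps."""
    vertices = sorted({v for edge in edges for v in edge})
    out_degree = {v: 0 for v in vertices}
    in_neighbours = {v: [] for v in vertices}
    for src, dst in edges:
        out_degree[src] += 1
        in_neighbours[dst].append(src)

    rank = {v: 1.0 / len(vertices) for v in vertices}
    for _ in range(num_iters):
        # Gather: each vertex sums the contributions arriving on its in-edges.
        gathered = {
            v: sum(rank[u] / out_degree[u] for u in in_neighbours[v])
            for v in vertices
        }
        # Apply: each vertex computes its new value from the gathered sum.
        new_rank = {
            v: (1 - damping) / len(vertices) + damping * gathered[v]
            for v in vertices
        }
        # Scatter: a real engine would update edge data and signal neighbours
        # here; this sketch simply reruns every vertex in the next round.
        rank = new_rank
    return rank


# Hypothetical toy graph in which every vertex has at least one out-edge.
edges = [("a", "b"), ("b", "c"), ("c", "a"), ("a", "c")]
ranks = pagerank_gas(edges)
print(ranks)
```

Because gather is a sum over edges, a high-degree vertex can have its gather computed in parts on several machines and combined, which is what lets vertices span machines.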
  • 16. PMML Primer
      • Predictive Model Markup Language, developed by DMG (Data Mining Group).
      • PMML offers a standard to define a model, so that a model generated in tool A can be directly used in tool B.
      • XML representation of a model.
      • May contain a myriad of data transformations (pre- and post-processing) as well as one or more predictive models.
  • 17. Naïve Bayes Primer
      • A simple probabilistic classifier based on Bayes' theorem.
      • Given features X1, X2, …, Xn, predict a label Y by calculating the probability for all possible values of Y.
      • [Formula on slide, with its terms labelled Likelihood, Prior and Normalization Constant]
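The three terms named on this slide (likelihood, prior, normalization constant) fit together in Bayes' theorem. The block below is the standard textbook form, given as a reconstruction under that assumption rather than a copy of the slide's own formula; the second line adds the naïve conditional-independence assumption that gives the classifier its name.

```latex
P(Y \mid X_1, \dots, X_n)
  = \frac{\overbrace{P(X_1, \dots, X_n \mid Y)}^{\text{likelihood}}\;
          \overbrace{P(Y)}^{\text{prior}}}
         {\underbrace{P(X_1, \dots, X_n)}_{\text{normalization constant}}}

P(Y \mid X_1, \dots, X_n) \;\propto\; P(Y) \prod_{i=1}^{n} P(X_i \mid Y)
```

The predicted label is the value of Y that maximizes the right-hand side; the normalization constant is the same for every Y, so it only matters when actual probabilities are reported.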
  • 18. PMML Scoring for Naïve Bayes
      • Wrote a PMML-based scoring engine for the Naïve Bayes algorithm. This can theoretically be used in any data processing framework by invoking the API.
      • Deployed a Naïve Bayes PMML generated from R into the Storm, Spark and Samza frameworks.
      • Real-time predictions with the above APIs.
  • 19. Structure of a PMML document
      • Header: version and timestamp; model development environment information.
      • Data Dictionary: variable types; missing, valid and invalid values.
      • Data Munging/Transformation: normalization, mapping, discretization.
      • Model:
          • Model-specific attributes
          • Mining Schema: treatment for missing and outlier values
          • Targets: prior probability and default
          • Outputs: list of computed output fields; post-processing
          • Definition of model architecture/parameters
  • 20. PMML Scoring for Naïve Bayes
      <DataDictionary numberOfFields="4">
        <DataField name="Class" optype="categorical" dataType="string">
          <Value value="democrat"/>
          <Value value="republican"/>
        </DataField>
        <DataField name="V1" optype="categorical" dataType="string">
          <Value value="n"/>
          <Value value="y"/>
        </DataField>
        <DataField name="V2" optype="categorical" dataType="string">
          <Value value="n"/>
          <Value value="y"/>
        </DataField>
        <DataField name="V3" optype="categorical" dataType="string">
          <Value value="n"/>
          <Value value="y"/>
        </DataField>
      </DataDictionary>
      (ctd on the next slide)
  • 21. PMML Scoring for Naïve Bayes
      <NaiveBayesModel modelName="naiveBayes_Model" functionName="classification" threshold="0.003">
        <MiningSchema>
          <MiningField name="Class" usageType="predicted"/>
          <MiningField name="V1" usageType="active"/>
          <MiningField name="V2" usageType="active"/>
          <MiningField name="V3" usageType="active"/>
        </MiningSchema>
        <Output>
          <OutputField name="Predicted_Class" feature="predictedValue"/>
          <OutputField name="Probability_democrat" optype="continuous" dataType="double" feature="probability" value="democrat"/>
          <OutputField name="Probability_republican" optype="continuous" dataType="double" feature="probability" value="republican"/>
        </Output>
        <BayesInputs>
      (ctd on the next slide)
  • 22. PMML Scoring for Naïve Bayes
        <BayesInputs>
          <BayesInput fieldName="V1">
            <PairCounts value="n">
              <TargetValueCounts>
                <TargetValueCount value="democrat" count="51"/>
                <TargetValueCount value="republican" count="85"/>
              </TargetValueCounts>
            </PairCounts>
            <PairCounts value="y">
              <TargetValueCounts>
                <TargetValueCount value="democrat" count="73"/>
                <TargetValueCount value="republican" count="23"/>
              </TargetValueCounts>
            </PairCounts>
          </BayesInput>
          <BayesInput fieldName="V2"> *
          <BayesInput fieldName="V3"> *
        </BayesInputs>
        <BayesOutput fieldName="Class">
          <TargetValueCounts>
            <TargetValueCount value="democrat" count="124"/>
            <TargetValueCount value="republican" count="108"/>
          </TargetValueCounts>
        </BayesOutput>
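The counts in this PMML fragment are enough to score a record by hand. The sketch below is not the scoring engine described in the deck: the function name `score` and the dictionary layout are hypothetical, only V1 is used because the counts for V2 and V3 are elided on the slide, and the zero-count handling reflects one reading of the `threshold` attribute (a small default probability substituted when a pair count is zero).

```python
# Counts copied from the BayesInput for V1 and the BayesOutput on the slide.
PAIR_COUNTS = {
    "V1": {
        "n": {"democrat": 51, "republican": 85},
        "y": {"democrat": 73, "republican": 23},
    },
}
TARGET_COUNTS = {"democrat": 124, "republican": 108}
THRESHOLD = 0.003  # threshold attribute of the NaiveBayesModel element


def score(record, pair_counts=PAIR_COUNTS, target_counts=TARGET_COUNTS,
          threshold=THRESHOLD):
    """Return normalized class probabilities for a dict of field -> value."""
    total = sum(target_counts.values())
    joint = {}
    for target, target_count in target_counts.items():
        value = target_count / total  # prior P(target)
        for field, observed in record.items():
            count = pair_counts[field][observed][target]
            # Assumed behaviour: a zero count is replaced by the threshold so
            # that one unseen pairing does not zero out the whole product.
            value *= count / target_count if count > 0 else threshold
        joint[target] = value
    norm = sum(joint.values())  # normalization constant
    return {target: value / norm for target, value in joint.items()}


probabilities = score({"V1": "n"})
predicted = max(probabilities, key=probabilities.get)
print(predicted, probabilities)
```

For a record with V1 = "n" this gives roughly 0.375 for democrat and 0.625 for republican, since the priors cancel against the per-class totals and leave 51 against 85. A full scorer would multiply in the V2 and V3 terms in the same loop.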
  • 23. PMML Scoring for Naïve Bayes: Definition of Elements
      • DataDictionary: definitions of the fields used in mining models (Class, V1, V2, V3).
      • NaiveBayesModel: indicates that this is a Naïve Bayes PMML.
      • MiningSchema: lists the fields used in that model. Class is the “predicted” field; V1, V2 and V3 are “active” predictor fields.
      • Output: describes a set of result values that can be returned from a model.
  • 24. PMML Scoring for Naïve Bayes: Definition of Elements (ctd)
      • BayesInputs: for each input field, contains the counts of each target value paired with each value of that input.
      • BayesOutput: contains the counts associated with the values of the target field.
  • 25. PMML Scoring for Naïve Bayes: Sample Input
      Eg1 - n y y n y y n n n n n n y y y y
      Eg2 - n y n y y y n n n n n y y y n y
      • 1st, 2nd and 3rd columns: predictor variables (attribute “name” in element MiningField).
      • Using these we predict whether the output is Democrat or Republican (PMML element BayesOutput).
  • 26. PMML Scoring for Naïve Bayes
      • 3-node Xeon Storm cluster (8 quad-core CPUs, 32 GB RAM, 32 GB swap space, 1 Nimbus, 2 Supervisors)
      Number of records (in millions)    Time taken (seconds)
      0.1                                4
      0.4                                7
      1.0                                12
      2.0                                21
      10                                 129
      25                                 310
  • 27. PMML Scoring for Naïve Bayes
      • 3-node Xeon Spark cluster (8 quad-core CPUs, 32 GB RAM and 32 GB swap space)
      Number of records (in millions)    Time taken
      0.1                                1 min 47 sec
      0.2                                3 min 35 sec
      0.4                                6 min 40 sec
      1.0                                35 min 17 sec
      10                                 More than 3 hrs
  • 28. Conclusion
      • Beyond Hadoop Map-Reduce philosophy
          • Optimization and other problems.
          • Real-time computation.
          • Processing specialized data structures.
      • PMML scoring
          • Allows traditional analytical tools/algorithms to be re-used.
          • Spark for batch computations.
          • Spark Streaming and Storm for real-time.
  • 29. Thank You!
      • Mail: vijay.sa@impetus.co.in
      • LinkedIn: http://in.linkedin.com/in/vijaysrinivasagneeswaran
      • Blogs: blogs.impetus.com
      • Twitter: @a_vijaysrinivas