http://bit.ly/1BTaXZP – Apache Spark is currently one of the most active projects in the Hadoop ecosystem, and as such there has been plenty of hype about it in recent months. But how much of the discussion is marketing spin, and what are the facts? MapR and Databricks, the company that created and leads the development of the Spark stack, cut through the noise to uncover the practical advantages of having the full set of Spark technologies at your disposal, and reveal the benefits of running Spark on Hadoop.
This presentation was given at a webinar hosted by Data Science Central and co-presented by MapR + Databricks.
To see the webinar, please go to: http://www.datasciencecentral.com/video/let-spark-fly-advantages-and-use-cases-for-spark-on-hadoop
10. Spark is the Most Active Open Source Project in Big Data
[Bar chart: project contributors in the past year for Spark, Giraph, Storm, and Tez (axis 0–140); Spark has the most contributors.]
12. Spark: Easy and Fast Big Data
Easy to Develop
> Rich APIs in Java, Scala, Python
> Interactive shell
> 2–5× less code
Fast to Run
> General execution graphs
> In-memory storage
> Up to 10× faster on disk, 100× in memory
13. Easy: Get Started Immediately
• Multi-language support
• Interactive shell
Python
lines = sc.textFile(...)
lines.filter(lambda s: "ERROR" in s).count()
Scala
val lines = sc.textFile(...)
lines.filter(x => x.contains("ERROR")).count()
Java
JavaRDD<String> lines = sc.textFile(...);
lines.filter(new Function<String, Boolean>() {
  public Boolean call(String s) {
    return s.contains("ERROR");
  }
}).count();
15. Easy: Clean API
Resilient Distributed Datasets
• Collections of objects spread across a cluster, stored in RAM or on disk
• Built through parallel transformations
• Automatically rebuilt on failure
Operations
• Transformations (e.g. map, filter, groupBy)
• Actions (e.g. count, collect, save)
Write programs in terms of transformations on distributed datasets (see the sketch below).
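To make the transformation/action split concrete, here is a minimal Scala sketch (it assumes an existing SparkContext named sc; the data and operations are illustrative, not from the deck):
val nums = sc.parallelize(1 to 1000)   // build an RDD from a local collection
val evens = nums.filter(_ % 2 == 0)    // transformation: lazily defines a new RDD
val grouped = evens.groupBy(_ % 10)    // transformation: still no cluster work done
println(grouped.count())               // action: triggers the actual computation
Transformations only describe the lineage of a dataset; work happens when an action such as count or collect demands a result.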
18. Easy: Example – Word Count
Hadoop MapReduce
public static class WordCountMapClass extends MapReduceBase
  implements Mapper<LongWritable, Text, Text, IntWritable> {
  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();
  public void map(LongWritable key, Text value,
                  OutputCollector<Text, IntWritable> output,
                  Reporter reporter) throws IOException {
    String line = value.toString();
    StringTokenizer itr = new StringTokenizer(line);
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      output.collect(word, one);
    }
  }
}
public static class WordCountReduce extends MapReduceBase
  implements Reducer<Text, IntWritable, Text, IntWritable> {
  public void reduce(Text key, Iterator<IntWritable> values,
                     OutputCollector<Text, IntWritable> output,
                     Reporter reporter) throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    output.collect(key, new IntWritable(sum));
  }
}
Spark
val spark = new SparkContext(master, appName, [sparkHome], [jars])
val file = spark.textFile("hdfs://...")
val counts = file.flatMap(line => line.split(" "))
                 .map(word => (word, 1))
                 .reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://...")
20. Easy: Works Well With Hadoop
Data Compatibility
• Access your existing Hadoop data
• Use the same data formats
• Adheres to data locality for efficient processing
Deployment Models
• “Standalone” deployment
• YARN-based deployment
• Mesos-based deployment
• Deploy on an existing Hadoop cluster or side-by-side (see the sketch below)
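To illustrate how little application code changes across these deployment models, here is a hedged Scala sketch: the cluster manager is chosen by the master URL handed to Spark, and the host names below are placeholders:
import org.apache.spark.{SparkConf, SparkContext}
// Swap the master URL to move between deployment models, e.g.
// "spark://host:7077" for standalone or "mesos://host:5050" for Mesos;
// on YARN the master is typically supplied at launch time via spark-submit.
val conf = new SparkConf()
  .setAppName("MyApp")
  .setMaster("spark://host:7077")
val sc = new SparkContext(conf)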
21. Easy: User-Driven Roadmap
Language support
> Improved Python support
> SparkR
> Java 8
> Integrated schema and SQL support in Spark’s APIs
Better ML
> Sparse data support
> Model evaluation framework
> Performance testing
22. Example: Logistic Regression
data = spark.textFile(...).map(readPoint).cache()
w = numpy.random.rand(D)
for i in range(iterations):
    gradient = data.map(
        lambda p: (1 / (1 + exp(-p.y * w.dot(p.x))) - 1) * p.y * p.x
    ).reduce(lambda a, b: a + b)
    w -= gradient
print "Final w: %s" % w
23. Fast: Logistic Regression Performance
[Chart: running time in seconds vs. number of iterations (1, 5, 10, 20, 30) for Hadoop and Spark. Hadoop: ~110 s per iteration. Spark: 80 s for the first iteration, ~1 s for each further iteration.]
24. Fast: Using RAM, Operator Graphs
In-memory Caching
• Data partitions are read from RAM instead of disk
Operator Graphs
• Scheduling optimizations
• Fault tolerance
[Diagram: an RDD operator graph with map, filter, groupBy, and join operators divided into Stages 1–3; RDDs A–F shown, with cached partitions marked.]
26. Easy: Fault Recovery
RDDs track lineage information that can be used to efficiently recompute lost data:
msgs = textFile.filter(lambda s: s.startswith("ERROR")) \
               .map(lambda s: s.split("\t")[2])
[Diagram: lineage graph: HDFS File → Filtered RDD (filter, func = startswith(…)) → Mapped RDD (map, func = split(...))]
29. Hive Compatibility
• Interfaces to access data and code in the Hive ecosystem:
o Support for writing queries in HQL
o Catalog that interfaces with the Hive MetaStore
o Table scan operator that uses Hive SerDes
o Wrappers for Hive UDFs, UDAFs, and UDTFs (see the usage sketch below)
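As a hedged usage sketch against the Spark 1.0-era API (the table name src is illustrative, not from the deck):
import org.apache.spark.sql.hive.HiveContext
val hiveCtx = new HiveContext(sc)                    // resolves tables via the Hive MetaStore
val rows = hiveCtx.hql("SELECT key, value FROM src") // HQL query over Hive-managed data
rows.take(10).foreach(println)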
30. Parquet Support
Native support for reading data stored in Parquet:
• Columnar storage avoids reading unneeded data
• Currently only supports flat structures (nested data on the short-term roadmap)
• RDDs can be written to Parquet files, preserving the schema (see the round-trip sketch below)
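A hedged round-trip sketch against the Spark 1.0-era SQL API (the Person case class and the paths are illustrative):
import org.apache.spark.sql.SQLContext

case class Person(name: String, age: Int)

val sqlCtx = new SQLContext(sc)
import sqlCtx.createSchemaRDD   // implicitly turns RDDs of case classes into SchemaRDDs

val people = sc.textFile("hdfs://.../people.txt")
  .map(_.split(","))
  .map(f => Person(f(0), f(1).trim.toInt))
people.saveAsParquetFile("hdfs://.../people.parquet")         // schema travels with the file
val loaded = sqlCtx.parquetFile("hdfs://.../people.parquet")  // columnar, schema-aware read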
31. Mixing SQL and Machine Learning
val trainingDataTable = sql("""
  SELECT e.action, u.age, u.latitude, u.longitude
  FROM Users u
  JOIN Events e ON u.userId = e.userId""")

// Since sql returns an RDD, the results can easily be used in MLlib
val trainingData = trainingDataTable.map { row =>
  val features = Array[Double](row(1), row(2), row(3))
  LabeledPoint(row(0), features)
}

val model = new LogisticRegressionWithSGD().run(trainingData)
32. Relationship to Shark
Borrows
• Hive data loading code / in-memory columnar representation
• Hardened Spark execution engine
Adds
• RDD-aware optimizer / query planner
• Execution engine
• Language interfaces
Catalyst/Spark SQL is a nearly from-scratch rewrite that leverages the best parts of Shark.
34. Spark Streaming
Run a streaming computation as a series of very small, deterministic batch jobs.
[Diagram: a live data stream enters Spark Streaming, which chops it into batches of X seconds; Spark processes each batch and emits the results.]
• Chop up the live stream into batches of ½ second or more, leveraging RDDs for micro-batch processing
• Use the same familiar Spark APIs to process streams
• Combine your batch and online processing in a single system
• Guarantee exactly-once semantics
A minimal sketch of this micro-batch model follows.
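A hedged Scala sketch of the micro-batch model (the socket source, port, and one-second batch interval are illustrative choices, not from the deck):
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(1))               // chop the stream into 1 s batches
val lines = ssc.socketTextStream("localhost", 9999)          // live data stream
val errorCounts = lines.filter(_.contains("ERROR")).count()  // same RDD-style API, per batch
errorCounts.print()                                          // processed results, once per batch
ssc.start()
ssc.awaitTermination()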
35. Window-based Transformations
val tweets = ssc.twitterStream()
val hashTags = tweets.flatMap(status => getTags(status))
val tagCounts = hashTags.window(Minutes(1), Seconds(5)).countByValue()
[Diagram: a sliding window operation over a DStream of data, annotated with the window length and the sliding interval.]
39. The GraphX Unified Approach
Enabling users to easily and efficiently express the entire graph analytics pipeline.
New API: blurs the distinction between tables and graphs.
New System: combines data-parallel and graph-parallel systems.
A small GraphX sketch follows.
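To illustrate the table/graph duality, a hedged GraphX sketch (the edge-list path is illustrative; assumes the Spark 1.0-era GraphX API):
import org.apache.spark.graphx.GraphLoader

val graph = GraphLoader.edgeListFile(sc, "hdfs://.../edges.txt")  // graph view of the data
val ranks = graph.pageRank(0.0001).vertices  // graph-parallel computation
ranks.take(5).foreach(println)               // results consumed as an ordinary RDD (table view)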
42. Interactive Exploratory Analytics
• Leverage Spark’s in-memory caching and efficient execution to explore large distributed datasets
• Use Spark’s APIs to explore any kind of data (structured, unstructured, semi-structured, etc.) and combine programming models
• Execute arbitrary code using a fully-functional interactive programming environment
• Connect external tools via SQL drivers (a shell-session sketch follows)
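A hedged shell-session sketch of this workflow (the log path and tab-separated layout are assumptions for illustration):
// The first action loads the file and caches it; later queries are served from RAM.
val events = sc.textFile("hdfs://.../events.log").cache()
events.filter(_.contains("checkout")).count()     // exploratory query 1: scans and caches
events.map(_.split("\t")(1)).distinct().count()   // exploratory query 2: hits the cache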
43. Machine Learning
• Improve the performance of iterative algorithms by caching frequently accessed datasets
• Develop programs that are easy to reason about, using a fully-capable functional programming style
• Refine algorithms using the interactive REPL
• Use carefully-curated algorithms out of the box with MLlib (see the sketch below)
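For instance, a hedged sketch of an out-of-the-box MLlib algorithm over a cached dataset (the input path, cluster count, and iteration count are illustrative):
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

val points = sc.textFile("hdfs://.../points.txt")
  .map(line => Vectors.dense(line.split(' ').map(_.toDouble)))
  .cache()                                 // iterative algorithm: keep the training data in RAM
val model = KMeans.train(points, 10, 20)   // 10 clusters, 20 iterations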
44. Power Real-time Dashboards
• Use Spark Streaming to perform low-latency, window-based aggregations
• Combine offline models with streaming data for online clustering and classification within the dashboard
• Use Spark’s core APIs and/or Spark SQL to give users large-scale, low-latency drill-down capabilities when exploring dashboard data (a windowed-aggregation sketch follows)
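A hedged sketch of the windowed aggregation that would feed such a dashboard (the source, window length, and slide interval are illustrative; assumes an existing StreamingContext named ssc):
import org.apache.spark.streaming.Seconds
import org.apache.spark.streaming.StreamingContext._   // enables pair-DStream operations

val views = ssc.socketTextStream("localhost", 9999).map(url => (url, 1))
// Count per URL over the last 30 seconds, refreshed every 5 seconds.
val counts = views.reduceByKeyAndWindow(_ + _, Seconds(30), Seconds(5))
counts.print()   // a real dashboard would push each batch to its serving layer instead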
45. Faster ETL
• Leverage Spark’s optimized scheduling for more efficient I/O on large datasets, and in-memory processing for aggregations, shuffles, and more
• Use Spark SQL to perform ETL through a familiar SQL interface
• Easily port Pig scripts to Spark’s API
• Run existing Hive queries directly on Spark SQL or Shark (a simple ETL sketch follows)
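A hedged sketch of a simple ETL pass in Spark’s core API (the paths and the tab-separated record layout are assumptions):
val raw = sc.textFile("hdfs://.../raw/*.log")
val cleaned = raw.map(_.split("\t"))
  .filter(_.length == 3)                  // drop malformed records
  .map(f => (f(0), f(2).toLong))          // extract (key, value)
val totals = cleaned.reduceByKey(_ + _)   // aggregation via an in-memory shuffle
totals.map { case (k, v) => k + "\t" + v }.saveAsTextFile("hdfs://.../out")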
46. Spark Summit: San Francisco, June 30 – July 2
• Use Cases
• Tech Talks
• Training
http://spark-summit.org/
The power of MapR begins with the power of open source innovation and community participation. In some cases MapR leads the community in projects like Apache Mahout (machine learning) or Apache Drill (SQL on Hadoop). In other areas, MapR contributes to and integrates Apache and other open source software (OSS) projects into the MapR distribution, delivering a more reliable and performant system with lower overall TCO and easier system management. MapR releases a new version with the latest OSS innovations on a monthly basis, and adds 2–4 new Apache projects annually as new projects become production-ready and based on customer demand.
You can find project resources on the Apache Incubator site. You’ll also find information about the mailing list there (including archives).
One of the most exciting things you’ll find is the community, which is growing all the time (the “NASCAR slide” of logos), including several sponsors of this event who are just starting to get involved. If your logo is not up here, forgive us – it’s hard to keep up!
Key idea: add “variables” to the “functions” in functional programming
The logistic regression benchmark is for a 29 GB dataset on 20 EC2 m1.xlarge machines (4 cores each).