http://bit.ly/1BTaXZP – Apache Spark is currently one of the most active projects in the Hadoop ecosystem, and as such there has been plenty of hype about it in recent months. But how much of the discussion is marketing spin, and what are the facts? MapR and Databricks, the company that created and leads the development of the Spark stack, cut through the noise to uncover the practical advantages of having the full set of Spark technologies at your disposal, and reveal the benefits of running Spark on Hadoop.
This presentation was given at a webinar hosted by Data Science Central and co-presented by MapR + Databricks.
To see the webinar, please go to: http://www.datasciencecentral.com/video/let-spark-fly-advantages-and-use-cases-for-spark-on-hadoop
10. Spark is the Most Active Open Source Project in Big Data
[Bar chart: project contributors in the past year for Spark, Giraph, Storm, and Tez (axis 0–140); Spark has the most contributors.]
12. Spark: Easy and Fast Big Data
Easy to Develop
> Rich APIs in Java, Scala, Python
> Interactive shell
> 2–5× less code
Fast to Run
> General execution graphs
> In-memory storage
> Up to 10× faster on disk, 100× in memory
13. Easy: Get Started Immediately
• Multi-language support
• Interactive shell
Python
lines = sc.textFile(...)
lines.filter(lambda s: "ERROR" in s).count()
Scala
val lines = sc.textFile(...)
lines.filter(x => x.contains("ERROR")).count()
Java
JavaRDD<String> lines = sc.textFile(...);
lines.filter(new Function<String, Boolean>() {
  public Boolean call(String s) {
    return s.contains("ERROR");
  }
}).count();
15. Easy: Clean API
Resilient Distributed Datasets
• Collections of objects spread across a cluster, stored in RAM or on disk
• Built through parallel transformations
• Automatically rebuilt on failure
Operations
• Transformations (e.g. map, filter, groupBy)
• Actions (e.g. count, collect, save)
Write programs in terms of transformations on distributed datasets (see the sketch below).
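To make the transformation/action split concrete, here is a minimal Scala sketch (it assumes an existing SparkContext named sc; the data and operations are illustrative, not from the deck):
val nums = sc.parallelize(1 to 1000)   // build an RDD from a local collection
val evens = nums.filter(_ % 2 == 0)    // transformation: lazily defines a new RDD
val grouped = evens.groupBy(_ % 10)    // transformation: still no cluster work done
println(grouped.count())               // action: triggers the actual computation
Transformations only describe the lineage of a dataset; work happens when an action such as count or collect demands a result.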
18. Easy: Example – Word Count
Hadoop MapReduce
public static class WordCountMapClass extends MapReduceBase
  implements Mapper<LongWritable, Text, Text, IntWritable> {
  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();
  public void map(LongWritable key, Text value,
                  OutputCollector<Text, IntWritable> output,
                  Reporter reporter) throws IOException {
    String line = value.toString();
    StringTokenizer itr = new StringTokenizer(line);
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      output.collect(word, one);
    }
  }
}
public static class WordCountReduce extends MapReduceBase
  implements Reducer<Text, IntWritable, Text, IntWritable> {
  public void reduce(Text key, Iterator<IntWritable> values,
                     OutputCollector<Text, IntWritable> output,
                     Reporter reporter) throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    output.collect(key, new IntWritable(sum));
  }
}
Spark
val spark = new SparkContext(master, appName, [sparkHome], [jars])
val file = spark.textFile("hdfs://...")
val counts = file.flatMap(line => line.split(" "))
                 .map(word => (word, 1))
                 .reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://...")
20. Easy: Works Well With Hadoop
Data Compatibility
• Access your existing Hadoop data
• Use the same data formats
• Adheres to data locality for efficient processing
Deployment Models
• “Standalone” deployment
• YARN-based deployment
• Mesos-based deployment
• Deploy on an existing Hadoop cluster or side-by-side (see the sketch below)
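To illustrate how little application code changes across these deployment models, here is a hedged Scala sketch: the cluster manager is chosen by the master URL handed to Spark, and the host names below are placeholders:
import org.apache.spark.{SparkConf, SparkContext}
// Swap the master URL to move between deployment models, e.g.
// "spark://host:7077" for standalone or "mesos://host:5050" for Mesos;
// on YARN the master is typically supplied at launch time via spark-submit.
val conf = new SparkConf()
  .setAppName("MyApp")
  .setMaster("spark://host:7077")
val sc = new SparkContext(conf)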
21. Easy: User-Driven Roadmap
Language support
> Improved Python support
> SparkR
> Java 8
> Integrated schema and SQL support in Spark’s APIs
Better ML
> Sparse data support
> Model evaluation framework
> Performance testing
22. Example: Logistic Regression
data = spark.textFile(...).map(readPoint).cache()
w = numpy.random.rand(D)
for i in range(iterations):
    gradient = data.map(
        lambda p: (1 / (1 + exp(-p.y * w.dot(p.x))) - 1) * p.y * p.x
    ).reduce(lambda a, b: a + b)
    w -= gradient
print "Final w: %s" % w
23. Fast: Logistic Regression Performance
[Chart: running time in seconds vs. number of iterations (1, 5, 10, 20, 30) for Hadoop and Spark. Hadoop: ~110 s per iteration. Spark: 80 s for the first iteration, ~1 s for each further iteration.]
24. Fast: Using RAM, Operator Graphs
In-memory Caching
• Data partitions are read from RAM instead of disk
Operator Graphs
• Scheduling optimizations
• Fault tolerance
[Diagram: an RDD operator graph with map, filter, groupBy, and join operators divided into Stages 1–3; RDDs A–F shown, with cached partitions marked.]
26. Easy: Fault Recovery
RDDs track lineage information that can be used to efficiently recompute lost data:
msgs = textFile.filter(lambda s: s.startswith("ERROR")) \
               .map(lambda s: s.split("\t")[2])
[Diagram: lineage graph: HDFS File → Filtered RDD (filter, func = startswith(…)) → Mapped RDD (map, func = split(...))]
29. Hive Compatibility
• Interfaces to access data and code in the Hive ecosystem:
o Support for writing queries in HQL
o Catalog that interfaces with the Hive MetaStore
o Table scan operator that uses Hive SerDes
o Wrappers for Hive UDFs, UDAFs, and UDTFs (see the usage sketch below)
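As a hedged usage sketch against the Spark 1.0-era API (the table name src is illustrative, not from the deck):
import org.apache.spark.sql.hive.HiveContext
val hiveCtx = new HiveContext(sc)                    // resolves tables via the Hive MetaStore
val rows = hiveCtx.hql("SELECT key, value FROM src") // HQL query over Hive-managed data
rows.take(10).foreach(println)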
30. Parquet Support
Native support for reading data stored in Parquet:
• Columnar storage avoids reading unneeded data
• Currently only supports flat structures (nested data on the short-term roadmap)
• RDDs can be written to Parquet files, preserving the schema (see the round-trip sketch below)
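A hedged round-trip sketch against the Spark 1.0-era SQL API (the Person case class and the paths are illustrative):
import org.apache.spark.sql.SQLContext

case class Person(name: String, age: Int)

val sqlCtx = new SQLContext(sc)
import sqlCtx.createSchemaRDD   // implicitly turns RDDs of case classes into SchemaRDDs

val people = sc.textFile("hdfs://.../people.txt")
  .map(_.split(","))
  .map(f => Person(f(0), f(1).trim.toInt))
people.saveAsParquetFile("hdfs://.../people.parquet")         // schema travels with the file
val loaded = sqlCtx.parquetFile("hdfs://.../people.parquet")  // columnar, schema-aware read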
31. Mixing SQL and Machine Learning
val trainingDataTable = sql("""
  SELECT e.action, u.age, u.latitude, u.longitude
  FROM Users u
  JOIN Events e ON u.userId = e.userId""")

// Since sql returns an RDD, the results can easily be used in MLlib
val trainingData = trainingDataTable.map { row =>
  val features = Array[Double](row(1), row(2), row(3))
  LabeledPoint(row(0), features)
}

val model = new LogisticRegressionWithSGD().run(trainingData)
32. Relationship to Shark
Borrows
• Hive data loading code / in-memory columnar representation
• Hardened Spark execution engine
Adds
• RDD-aware optimizer / query planner
• Execution engine
• Language interfaces
Catalyst/Spark SQL is a nearly from-scratch rewrite that leverages the best parts of Shark.
34. Spark Streaming
Run a streaming computation as a series of very small, deterministic batch jobs.
[Diagram: a live data stream enters Spark Streaming, which chops it into batches of X seconds; Spark processes each batch and emits the results.]
• Chop up the live stream into batches of ½ second or more, leveraging RDDs for micro-batch processing
• Use the same familiar Spark APIs to process streams
• Combine your batch and online processing in a single system
• Guarantee exactly-once semantics
A minimal sketch of this micro-batch model follows.
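A hedged Scala sketch of the micro-batch model (the socket source, port, and one-second batch interval are illustrative choices, not from the deck):
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(1))               // chop the stream into 1 s batches
val lines = ssc.socketTextStream("localhost", 9999)          // live data stream
val errorCounts = lines.filter(_.contains("ERROR")).count()  // same RDD-style API, per batch
errorCounts.print()                                          // processed results, once per batch
ssc.start()
ssc.awaitTermination()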
35. Window-based Transformations
val tweets = ssc.twitterStream()
val hashTags = tweets.flatMap(status => getTags(status))
val tagCounts = hashTags.window(Minutes(1), Seconds(5)).countByValue()
[Diagram: a sliding window operation over a DStream of data, annotated with the window length and the sliding interval.]
39. The GraphX Unified Approach
Enabling users to easily and efficiently express the entire graph analytics pipeline.
New API: blurs the distinction between tables and graphs.
New System: combines data-parallel and graph-parallel systems.
A small GraphX sketch follows.
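To illustrate the table/graph duality, a hedged GraphX sketch (the edge-list path is illustrative; assumes the Spark 1.0-era GraphX API):
import org.apache.spark.graphx.GraphLoader

val graph = GraphLoader.edgeListFile(sc, "hdfs://.../edges.txt")  // graph view of the data
val ranks = graph.pageRank(0.0001).vertices  // graph-parallel computation
ranks.take(5).foreach(println)               // results consumed as an ordinary RDD (table view)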
42. Interactive Exploratory Analytics
• Leverage Spark’s in-memory caching and efficient execution to explore large distributed datasets
• Use Spark’s APIs to explore any kind of data (structured, unstructured, semi-structured, etc.) and combine programming models
• Execute arbitrary code using a fully-functional interactive programming environment
• Connect external tools via SQL drivers (a shell-session sketch follows)
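A hedged shell-session sketch of this workflow (the log path and tab-separated layout are assumptions for illustration):
// The first action loads the file and caches it; later queries are served from RAM.
val events = sc.textFile("hdfs://.../events.log").cache()
events.filter(_.contains("checkout")).count()     // exploratory query 1: scans and caches
events.map(_.split("\t")(1)).distinct().count()   // exploratory query 2: hits the cache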
43. Machine Learning
• Improve the performance of iterative algorithms by caching frequently accessed datasets
• Develop programs that are easy to reason about, using a fully-capable functional programming style
• Refine algorithms using the interactive REPL
• Use carefully-curated algorithms out of the box with MLlib (see the sketch below)
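For instance, a hedged sketch of an out-of-the-box MLlib algorithm over a cached dataset (the input path, cluster count, and iteration count are illustrative):
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

val points = sc.textFile("hdfs://.../points.txt")
  .map(line => Vectors.dense(line.split(' ').map(_.toDouble)))
  .cache()                                 // iterative algorithm: keep the training data in RAM
val model = KMeans.train(points, 10, 20)   // 10 clusters, 20 iterations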
44. Power Real-time Dashboards
• Use Spark Streaming to perform low-latency, window-based aggregations
• Combine offline models with streaming data for online clustering and classification within the dashboard
• Use Spark’s core APIs and/or Spark SQL to give users large-scale, low-latency drill-down capabilities when exploring dashboard data (a windowed-aggregation sketch follows)
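A hedged sketch of the windowed aggregation that would feed such a dashboard (the source, window length, and slide interval are illustrative; assumes an existing StreamingContext named ssc):
import org.apache.spark.streaming.Seconds
import org.apache.spark.streaming.StreamingContext._   // enables pair-DStream operations

val views = ssc.socketTextStream("localhost", 9999).map(url => (url, 1))
// Count per URL over the last 30 seconds, refreshed every 5 seconds.
val counts = views.reduceByKeyAndWindow(_ + _, Seconds(30), Seconds(5))
counts.print()   // a real dashboard would push each batch to its serving layer instead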
45. Faster ETL
• Leverage Spark’s optimized scheduling for more efficient I/O on large datasets, and in-memory processing for aggregations, shuffles, and more
• Use Spark SQL to perform ETL through a familiar SQL interface
• Easily port Pig scripts to Spark’s API
• Run existing Hive queries directly on Spark SQL or Shark (a simple ETL sketch follows)
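A hedged sketch of a simple ETL pass in Spark’s core API (the paths and the tab-separated record layout are assumptions):
val raw = sc.textFile("hdfs://.../raw/*.log")
val cleaned = raw.map(_.split("\t"))
  .filter(_.length == 3)                  // drop malformed records
  .map(f => (f(0), f(2).toLong))          // extract (key, value)
val totals = cleaned.reduceByKey(_ + _)   // aggregation via an in-memory shuffle
totals.map { case (k, v) => k + "\t" + v }.saveAsTextFile("hdfs://.../out")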
46. Spark Summit: San Francisco, June 30 – July 2
• Use Cases
• Tech Talks
• Training
http://spark-summit.org/
The power of MapR begins with the power of open source innovation and community participation. In some cases MapR leads the community in projects like Apache Mahout (machine learning) or Apache Drill (SQL on Hadoop). In other areas, MapR contributes to and integrates Apache and other open source software (OSS) projects into the MapR distribution, delivering a more reliable and performant system with lower overall TCO and easier system management. MapR releases a new version with the latest OSS innovations on a monthly basis, and adds 2–4 new Apache projects annually as new projects become production-ready and based on customer demand.
You can find project resources on the Apache Incubator site. You’ll also find information about the mailing list there (including archives).
One of the most exciting things you’ll find is the community, which is growing all the time (the “NASCAR slide” of logos), including several sponsors of this event who are just starting to get involved. If your logo is not up here, forgive us – it’s hard to keep up!
Key idea: add “variables” to the “functions” in functional programming
The logistic regression benchmark is for a 29 GB dataset on 20 EC2 m1.xlarge machines (4 cores each).