This document provides an overview of Apache Spark, including its core concepts, transformations and actions, persistence, parallelism, and worked examples. Spark is introduced as a fast and general engine for large-scale data processing, with advantages such as in-memory computing, fault tolerance, and rich APIs. Key concepts covered include resilient distributed datasets (RDDs) and lazy evaluation. The document also discusses Spark SQL, GraphX, MLlib, and integration with the Hadoop ecosystem.
6. 2 Spark: In a tweet
“Spark … is what you might call a Swiss Army knife of Big Data analytics tools”
– Reynold Xin (@rxin), Berkeley AMPLab Shark Development Lead
7. 2 Spark: In a nutshell
• Fast and general engine for large-scale data processing
• Advanced DAG execution engine with support for
  – in-memory storage
  – data locality
  – (micro-)batch streaming support
• Improves usability via
  – rich APIs in Scala, Java, and Python
  – an interactive shell
• Runs standalone, on YARN, on Mesos, and on Amazon EC2
8. 2 Spark is also…
• Came out of AMPLab at UCB in 2009
• A top-level Apache project as of 2014
– http://spark.apache.org
• Backed by a commercial entity: Databricks
• A toolset for data scientists / analysts
• An implementation of Resilient Distributed Datasets (RDDs) in Scala
• Hadoop-compatible
9. 2 Spark: Trends
[Chart: Google Trends comparison of search interest in Apache Drill, Apache Storm, Apache Spark, Apache YARN, and Apache Tez; generated using http://www.google.com/trends/]
14. 2 Spark: Core Concept
• Resilient Distributed Dataset (RDD)
  – Conceptually, RDDs can be roughly viewed as partitioned, locality-aware distributed vectors
  [Diagram: an RDD split into partitions A11, A12, A13]
• Read-only collection of objects spread across a cluster
• Built through parallel transformations (map, filter, …)
• Computation can be represented by lazily evaluated lineage DAGs composed of connected RDDs
• Automatically rebuilt on failure
• Controllable persistence
15. 2 Spark: RDD Example
Base RDD from HDFS:
val lines = spark.textFile("hdfs://...")
val errors = lines.filter(_.startsWith("ERROR"))
val messages = errors.map(_.split('\t')(2))
messages.cache()   // keep this RDD in memory
Iterative processing:
for (str <- Array("foo", "bar"))
  messages.filter(_.contains(str)).count()
16. 2 Spark: Transformations
Transformations – create new datasets from existing ones (e.g. map)
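A minimal sketch (not from the original slides) of a few common transformations; sc is assumed to be an existing SparkContext and the names and paths are illustrative:
// build an RDD and derive new RDDs from it via transformations
val lines  = sc.textFile("hdfs://docs/")          // RDD[String], one element per line
val words  = lines.flatMap(_.split("\\s+"))       // one element per word
val longer = words.filter(_.length > 5)           // keep only longer words
val pairs  = words.map(word => (word, 1))         // RDD[(String, Int)]
val counts = pairs.reduceByKey(_ + _)             // counts per word, still lazy at this point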
18. 2 Spark: Actions
Actions – return a value to the client after running a computation on the dataset (e.g. reduce)
19. 2 Spark: Actions
reduce(func)
collect()
count()
first()
countByKey()
foreach(func)
take(n)
takeSample(withReplacement, num, [seed])
takeOrdered(n, [ordering])
saveAsTextFile(path)
saveAsSequenceFile(path)   (Java and Scala only)
saveAsObjectFile(path)     (Java and Scala only)
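A minimal sketch (not part of the original slides) showing a few of these actions; sc is assumed to be an existing SparkContext and the output path is illustrative:
val nums = sc.parallelize(1 to 100, numSlices = 4)   // small RDD with 4 partitions
val total = nums.reduce(_ + _)                       // 5050, computed on the cluster
val firstTen = nums.take(10)                         // Array(1, 2, ..., 10) returned to the driver
nums.saveAsTextFile("hdfs://out/nums")               // writes one part file per partition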
20. 2 Spark: Dataflow
All transformations in Spark are lazy and are only computed when an action requires them.
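A small sketch (assumed, not from the slides) of this laziness; the path is illustrative:
val lines  = sc.textFile("hdfs://logs/")           // nothing is read yet, only lineage is recorded
val errors = lines.filter(_.contains("ERROR"))     // still lazy, no job has run
val n      = errors.count()                        // action: the job actually executes here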
21. 2 Spark: Persistence
One of the most important capabilities in Spark is caching a dataset in memory across operations:
• cache() – uses the MEMORY_ONLY level
• persist() – defaults to MEMORY_ONLY
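A small sketch (assumed) showing that cache() is shorthand for persist() with the default level; the path is illustrative:
val docs = sc.textFile("hdfs://docs/")
docs.cache()        // same as docs.persist(StorageLevel.MEMORY_ONLY)
docs.count()        // the first action materializes and caches the partitions
docs.first()        // subsequent actions reuse the cached data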
22. 2 Spark: Storage Levels
• persist(StorageLevel)

Storage Level – Meaning
• MEMORY_ONLY – Store RDD as deserialized Java objects in the JVM. If the RDD does not fit in memory, some partitions will not be cached and will be recomputed on the fly each time they're needed. This is the default level.
• MEMORY_AND_DISK – Store RDD as deserialized Java objects in the JVM. If the RDD does not fit in memory, store the partitions that don't fit on disk, and read them from there when they're needed.
• MEMORY_ONLY_SER – Store RDD as serialized Java objects (one byte array per partition). This is generally more space-efficient than deserialized objects, especially when using a fast serializer, but more CPU-intensive to read.
• MEMORY_AND_DISK_SER – Similar to MEMORY_ONLY_SER, but spill partitions that don't fit in memory to disk instead of recomputing them on the fly each time they're needed.
• DISK_ONLY – Store the RDD partitions only on disk.
• MEMORY_ONLY_2, MEMORY_AND_DISK_2, … – Same as the levels above, but replicate each partition on two cluster nodes.
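A short sketch (assumed) of choosing a non-default level, here serialized in-memory storage with spill-over to disk; the input path is illustrative:
import org.apache.spark.storage.StorageLevel

val freq = sc.textFile("hdfs://docs/")
  .flatMap(_.split("\\s+"))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
freq.persist(StorageLevel.MEMORY_AND_DISK_SER)   // serialized in memory, spilled to disk if needed
freq.count()                                     // the first action materializes and persists the RDD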
23. 2 Spark: Parallelism
Parallelism can be specified in a number of different ways (a short end-to-end sketch follows the list):
• RDD partition number
  – sc.textFile(input, minPartitions = 10)
  – sc.parallelize(1 to 10000, numSlices = 10)
• Mapper-side parallelism
  – Usually inherited from the parent RDD(s)
• Reducer-side parallelism
  – rdd.reduceByKey(_ + _, numPartitions = 10)
  – rdd.reduceByKey(p, _ + _)   // p is a Partitioner
• "Zoom in/out"
  – rdd.repartition(numPartitions: Int)
  – rdd.coalesce(numPartitions: Int, shuffle: Boolean)
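A minimal end-to-end sketch (assumed) of controlling partition counts:
val nums   = sc.parallelize(1 to 10000, numSlices = 10)    // 10 partitions
val pairs  = nums.map(n => (n % 100, 1))                   // partitioning inherited from the parent
val counts = pairs.reduceByKey(_ + _, numPartitions = 4)   // 4 reduce-side partitions
val narrow = counts.coalesce(1, shuffle = false)           // "zoom out" to a single partition
println(narrow.partitions.length)                          // prints 1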
24. 2 Spark: Example
Text Processing Example
Top words by frequency
25. 2 Spark: Frequency Example
Create RDD from external data
[Diagram: data sources supported via Hadoop I/O – HDFS, S3, HBase, Cassandra, ElasticSearch, MongoDB, …; I/O via Hadoop is optional]
// Step 1. Create RDD from Hadoop text files
val docs = spark.textFile("hdfs://docs/")
26. 2 Spark: Frequency Example
Function map
[Diagram: an RDD[String] with the lines "Hello World", "This is", "Spark", "Spark", "The end" is mapped to their lower-case versions, yielding another RDD[String]]
.map(line => line.toLowerCase) is equivalent to .map(_.toLowerCase)
// Step 2. Convert lines to lower case
val lower = docs.map(line => line.toLowerCase)
29. 2 Spark: Frequency Example
map vs. flatMap
[Diagram: .map(_.split("\\s+")) turns the RDD[String] of lines into an RDD[Array[String]] of word arrays; flattening that result yields an RDD[String] with one element per word ("hello", "world", "this", …, "end")]
.flatMap(line => line.split("\\s+")) is equivalent to .map(_.split("\\s+")) followed by flattening
// Step 3. Split lines into words
val words = lower.flatMap(line => line.split("\\s+"))
33. 2 Spark: Frequency Example
Key-Value Pairs
[Diagram: the RDD[String] of words ("hello", "world", "spark", "spark", "end", …) is mapped to an RDD[(String, Int)] of (word, 1) pairs]
.map(word => Tuple2(word, 1)) is equivalent to .map(word => (word, 1))
// Step 4. Convert into tuples
val counts = words.map(word => (word, 1))
36. 2 Spark: Frequency Example
Shuffling
[Diagram: .groupByKey turns the RDD[(String, Int)] of (word, 1) pairs into an RDD[(String, Iterable[Int])] that groups the 1s per word; .mapValues(_.reduce((a, b) => a + b)) then sums each group, e.g. (spark, 2), (hello, 1), (world, 1), (end, 1)]
.reduceByKey((a, b) => a + b) combines the grouping and the reduction in a single operation
// Step 5. Count all words
val freq = counts.reduceByKey(_ + _)
40. 2 Spark: Frequency Example
Top N (Prepare data)
[Diagram: .map(_.swap) turns the RDD[(String, Int)] of (word, count) pairs into an RDD[(Int, String)] keyed by count]
// Step 6. Swap tuples (partial code)
freq.map(_.swap)
41. 2 Spark: Frequency Example
Top N (First Attempt)
[Diagram: .sortByKey orders the RDD[(Int, String)] by count: (2, spark), (1, end), (1, hello), (1, world)]
42. 2 Spark: Frequency Example
Top N
[Diagram: .top(N) computes a local top N per partition of the RDD[(Int, String)] and reduces these local results into the global top N, returned to the driver as an Array[(Int, String)], e.g. (2, spark), (1, end)]
// Step 6. Swap tuples (complete code)
val top = freq.map(_.swap).top(N)
45. 2 Spark: Frequency Example
val spark = new SparkContext()
// Create RDD from Hadoop text file
val docs = spark.textFile("hdfs://docs/")
// Split lines into words and process
val lower = docs.map(line => line.toLowerCase)
val words = lower.flatMap(line => line.split("\\s+"))
val counts = words.map(word => (word, 1))
// Count all words
val freq = counts.reduceByKey(_ + _)
// Swap tuples and get top results
val top = freq.map(_.swap).top(N)
top.foreach(println)
50. 2 Spark: SQL
• Spark SQL allows relational queries expressed in SQL, HiveQL, or Scala
• Uses SchemaRDDs composed of Row objects (= a table in a traditional RDBMS)
• A SchemaRDD can be created from
  – an existing RDD (see the sketch below)
  – a Parquet file
  – a JSON dataset
  – by running HiveQL against data stored in Apache Hive
• Supports a domain-specific language for writing queries
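A minimal sketch (assumed, against the Spark 1.x SQL API) of creating a SchemaRDD from the existing freq RDD and querying it; sc is an existing SparkContext, and the case class Count and the table name counts are illustrative:
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
import sqlContext.createSchemaRDD               // implicit conversion RDD -> SchemaRDD

case class Count(word: String, total: Int)      // schema is inferred from the case class
val countsRdd = freq.map { case (word, total) => Count(word, total) }
countsRdd.registerTempTable("counts")           // expose the RDD as a queryable table

val top10 = sqlContext.sql("SELECT word, total FROM counts ORDER BY total DESC LIMIT 10")
top10.collect().foreach(println)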
51. 2 Spark: SQL
registerFunction("LEN", (_: String).length)
val queryRdd = sql("""
  SELECT * FROM counts
  WHERE LEN(word) = 10
  ORDER BY total DESC
  LIMIT 10
""")
queryRdd
  .map(c => s"word: ${c(0)} \t| total: ${c(1)}")
  .collect()
  .foreach(println)
52. 2 Spark: GraphX
• GraphX is the Spark API for graphs and graph-parallel computation
• APIs to join and traverse graphs
• Optimally partitions and indexes vertices and edges (represented as RDDs)
• Supports PageRank, connected components, triangle counting, …
53. 2 Spark: GraphX
import org.apache.spark.graphx._

val graph = Graph(userIdRDD, assocRDD)
val ranks = graph.pageRank(0.0001).vertices
val userRDD = sc.textFile("graphx/data/users.txt")
val users = userRDD.map { line =>
  val fields = line.split(",")
  (fields(0).toLong, fields(1))
}
val ranksByUsername = users.join(ranks).map {
  case (id, (username, rank)) => (username, rank)
}
54. 2 Spark: MLlib
• Machine learning library similar to Apache Mahout
• Supports statistics, regression, decision trees, clustering, PCA, gradient descent, …
• Iterative algorithms are much faster due to in-memory processing
55. 2 Spark: MLlib
import org.apache.spark.mllib.regression.{LabeledPoint, LinearRegressionWithSGD}
import org.apache.spark.mllib.linalg.Vectors

val data = sc.textFile("data.txt")
val parsedData = data.map { line =>
  val parts = line.split(',')
  LabeledPoint(
    parts(0).toDouble,
    Vectors.dense(parts(1).split(' ').map(_.toDouble)))
}
val model = LinearRegressionWithSGD.train(parsedData, 100)
val valuesAndPreds = parsedData.map { point =>
  val prediction = model.predict(point.features)
  (point.label, prediction)
}
val MSE = valuesAndPreds
  .map { case (v, p) => math.pow(v - p, 2) }.mean()
57. 2 Use Case: Yahoo Native Ads
Logistic regression algorithm
• 120 LOC in Spark/Scala
• 30 min. model creation for 100M samples and 13K features
Initial version launched within 2 hours of the Spark-on-YARN announcement
• Compared to several days for hardware acquisition, system setup, and data movement
http://de.slideshare.net/Hadoop_Summit/sparkonyarn-empower-spark-applications-on-hadoop-cluster
58. 2 Use Case: Yahoo Mobile Ads
Learn from mobile search ads click data
• 600M labeled examples on HDFS
• 100M sparse features
Spark programs for Gradient Boosting Decision Trees
• 6 hours for model training with 100 workers
• Model accuracy very close to heavily manually tuned Logistic Regression models
http://de.slideshare.net/Hadoop_Summit/sparkonyarn-empower-spark-applications-on-hadoop-cluster
62. 2 Spark: Future work
• Spark Core
  – Focus on maturity, optimization, and pluggability
  – Enable long-running services (Slider)
  – Give resources back to the cluster when idle
  – Integrate with Hadoop enhancements
    • Timeline Server
    • ORC file format
• Spark ecosystem
  – Focus on adding capabilities
63. 2 One more thing…
Let’s get started with Spark!