This document provides an overview of Apache Spark, including its core concepts, transformations and actions, persistence, parallelism, and worked examples. Spark is introduced as a fast and general engine for large-scale data processing, with advantages such as in-memory computing, fault tolerance, and rich APIs. Key concepts covered include resilient distributed datasets (RDDs) and lazy evaluation. The document also discusses Spark SQL, GraphX, MLlib, and integration with the Hadoop ecosystem.
6. 2 Spark: In a tweet
“Spark … is what you might call a Swiss Army knife of Big Data analytics tools”
– Reynold Xin (@rxin), Berkeley AMPLab Shark Development Lead
7. 2 Spark: In a nutshell
• Fast and general engine for large-scale data processing
• Advanced DAG execution engine with support for
  – in-memory storage
  – data locality
  – (micro-)batch streaming support
• Improves usability via
  – rich APIs in Scala, Java, and Python
  – an interactive shell
• Runs standalone, on YARN, on Mesos, and on Amazon EC2
8. 2 Spark is also…
• Came out of AMPLab at UCB in 2009
• A top-level Apache project as of 2014
– http://spark.apache.org
• Backed by a commercial entity: Databricks
• A toolset for data scientists / analysts
• An implementation of Resilient Distributed Datasets (RDDs) in Scala
• Hadoop-compatible
9. 2 Spark: Trends
[Chart: Google Trends comparison of search interest in Apache Drill, Apache Storm, Apache Spark, Apache YARN, and Apache Tez; generated using http://www.google.com/trends/]
14. 2 Spark: Core Concept
• Resilient Distributed Dataset (RDD)
  – Conceptually, RDDs can be roughly viewed as partitioned, locality-aware distributed vectors
  [Diagram: an RDD split into partitions A11, A12, A13]
• Read-only collection of objects spread across a cluster
• Built through parallel transformations (map, filter, …)
• Computation can be represented by lazily evaluated lineage DAGs composed of connected RDDs
• Automatically rebuilt on failure
• Controllable persistence
15. 2 Spark: RDD Example
Base RDD from HDFS:
val lines = spark.textFile("hdfs://...")
val errors = lines.filter(_.startsWith("ERROR"))
val messages = errors.map(_.split('\t')(2))
messages.cache()   // keep this RDD in memory
Iterative processing:
for (str <- Array("foo", "bar"))
  messages.filter(_.contains(str)).count()
16. 2 Spark: Transformations
Transformations – create new datasets from existing ones (e.g. map)
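A minimal sketch (not from the original slides) of a few common transformations; sc is assumed to be an existing SparkContext and the names and paths are illustrative:
// build an RDD and derive new RDDs from it via transformations
val lines  = sc.textFile("hdfs://docs/")          // RDD[String], one element per line
val words  = lines.flatMap(_.split("\\s+"))       // one element per word
val longer = words.filter(_.length > 5)           // keep only longer words
val pairs  = words.map(word => (word, 1))         // RDD[(String, Int)]
val counts = pairs.reduceByKey(_ + _)             // counts per word, still lazy at this point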
18. 2 Spark: Actions
Actions – return a value to the client after running a computation on the dataset (e.g. reduce)
19. 2 Spark: Actions
reduce(func)
collect()
count()
first()
countByKey()
foreach(func)
take(n)
takeSample(withReplacement, num, [seed])
takeOrdered(n, [ordering])
saveAsTextFile(path)
saveAsSequenceFile(path)   (Java and Scala only)
saveAsObjectFile(path)     (Java and Scala only)
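A minimal sketch (not part of the original slides) showing a few of these actions; sc is assumed to be an existing SparkContext and the output path is illustrative:
val nums = sc.parallelize(1 to 100, numSlices = 4)   // small RDD with 4 partitions
val total = nums.reduce(_ + _)                       // 5050, computed on the cluster
val firstTen = nums.take(10)                         // Array(1, 2, ..., 10) returned to the driver
nums.saveAsTextFile("hdfs://out/nums")               // writes one part file per partition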
20. 2 Spark: Dataflow
All transformations in Spark are lazy and are only computed when an action requires them.
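A small sketch (assumed, not from the slides) of this laziness; the path is illustrative:
val lines  = sc.textFile("hdfs://logs/")           // nothing is read yet, only lineage is recorded
val errors = lines.filter(_.contains("ERROR"))     // still lazy, no job has run
val n      = errors.count()                        // action: the job actually executes here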
21. 2 Spark: Persistence
One of the most important capabilities in Spark is caching a dataset in memory across operations:
• cache() – uses the MEMORY_ONLY level
• persist() – defaults to MEMORY_ONLY
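A small sketch (assumed) showing that cache() is shorthand for persist() with the default level; the path is illustrative:
val docs = sc.textFile("hdfs://docs/")
docs.cache()        // same as docs.persist(StorageLevel.MEMORY_ONLY)
docs.count()        // the first action materializes and caches the partitions
docs.first()        // subsequent actions reuse the cached data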
22. 2 Spark: Storage Levels
• persist(StorageLevel)

Storage Level – Meaning
• MEMORY_ONLY – Store RDD as deserialized Java objects in the JVM. If the RDD does not fit in memory, some partitions will not be cached and will be recomputed on the fly each time they're needed. This is the default level.
• MEMORY_AND_DISK – Store RDD as deserialized Java objects in the JVM. If the RDD does not fit in memory, store the partitions that don't fit on disk, and read them from there when they're needed.
• MEMORY_ONLY_SER – Store RDD as serialized Java objects (one byte array per partition). This is generally more space-efficient than deserialized objects, especially when using a fast serializer, but more CPU-intensive to read.
• MEMORY_AND_DISK_SER – Similar to MEMORY_ONLY_SER, but spill partitions that don't fit in memory to disk instead of recomputing them on the fly each time they're needed.
• DISK_ONLY – Store the RDD partitions only on disk.
• MEMORY_ONLY_2, MEMORY_AND_DISK_2, … – Same as the levels above, but replicate each partition on two cluster nodes.
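A short sketch (assumed) of choosing a non-default level, here serialized in-memory storage with spill-over to disk; the input path is illustrative:
import org.apache.spark.storage.StorageLevel

val freq = sc.textFile("hdfs://docs/")
  .flatMap(_.split("\\s+"))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
freq.persist(StorageLevel.MEMORY_AND_DISK_SER)   // serialized in memory, spilled to disk if needed
freq.count()                                     // the first action materializes and persists the RDD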
23. 2 Spark: Parallelism
Parallelism can be specified in a number of different ways (a short end-to-end sketch follows the list):
• RDD partition number
  – sc.textFile(input, minPartitions = 10)
  – sc.parallelize(1 to 10000, numSlices = 10)
• Mapper-side parallelism
  – Usually inherited from the parent RDD(s)
• Reducer-side parallelism
  – rdd.reduceByKey(_ + _, numPartitions = 10)
  – rdd.reduceByKey(p, _ + _)   // p is a Partitioner
• "Zoom in/out"
  – rdd.repartition(numPartitions: Int)
  – rdd.coalesce(numPartitions: Int, shuffle: Boolean)
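A minimal end-to-end sketch (assumed) of controlling partition counts:
val nums   = sc.parallelize(1 to 10000, numSlices = 10)    // 10 partitions
val pairs  = nums.map(n => (n % 100, 1))                   // partitioning inherited from the parent
val counts = pairs.reduceByKey(_ + _, numPartitions = 4)   // 4 reduce-side partitions
val narrow = counts.coalesce(1, shuffle = false)           // "zoom out" to a single partition
println(narrow.partitions.length)                          // prints 1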
24. 2 Spark: Example
Text Processing Example
Top words by frequency
25. 2 Spark: Frequency Example
Create RDD from external data
[Diagram: data sources supported via Hadoop I/O – HDFS, S3, HBase, Cassandra, ElasticSearch, MongoDB, …; I/O via Hadoop is optional]
// Step 1. Create RDD from Hadoop text files
val docs = spark.textFile("hdfs://docs/")
26. 2 Spark: Frequency Example
Function map
[Diagram: an RDD[String] with the lines "Hello World", "This is", "Spark", "Spark", "The end" is mapped to their lower-case versions, yielding another RDD[String]]
.map(line => line.toLowerCase) is equivalent to .map(_.toLowerCase)
// Step 2. Convert lines to lower case
val lower = docs.map(line => line.toLowerCase)
29. 2 Spark: Frequency Example
map vs. flatMap
[Diagram: .map(_.split("\\s+")) turns the RDD[String] of lines into an RDD[Array[String]] of word arrays; flattening that result yields an RDD[String] with one element per word ("hello", "world", "this", …, "end")]
.flatMap(line => line.split("\\s+")) is equivalent to .map(_.split("\\s+")) followed by flattening
// Step 3. Split lines into words
val words = lower.flatMap(line => line.split("\\s+"))
33. 2 Spark: Frequency Example
Key-Value Pairs
[Diagram: the RDD[String] of words ("hello", "world", "spark", "spark", "end", …) is mapped to an RDD[(String, Int)] of (word, 1) pairs]
.map(word => Tuple2(word, 1)) is equivalent to .map(word => (word, 1))
// Step 4. Convert into tuples
val counts = words.map(word => (word, 1))
36. 2 Spark: Frequency Example
Shuffling
[Diagram: .groupByKey turns the RDD[(String, Int)] of (word, 1) pairs into an RDD[(String, Iterable[Int])] that groups the 1s per word; .mapValues(_.reduce((a, b) => a + b)) then sums each group, e.g. (spark, 2), (hello, 1), (world, 1), (end, 1)]
.reduceByKey((a, b) => a + b) combines the grouping and the reduction in a single operation
// Step 5. Count all words
val freq = counts.reduceByKey(_ + _)
40. 2 Spark: Frequency Example
Top N (Prepare data)
[Diagram: .map(_.swap) turns the RDD[(String, Int)] of (word, count) pairs into an RDD[(Int, String)] keyed by count]
// Step 6. Swap tuples (partial code)
freq.map(_.swap)
41. 2 Spark: Frequency Example
Top N (First Attempt)
[Diagram: .sortByKey orders the RDD[(Int, String)] by count: (2, spark), (1, end), (1, hello), (1, world)]
42. 2 Spark: Frequency Example
Top N
[Diagram: .top(N) computes a local top N per partition of the RDD[(Int, String)] and reduces these local results into the global top N, returned to the driver as an Array[(Int, String)], e.g. (2, spark), (1, end)]
// Step 6. Swap tuples (complete code)
val top = freq.map(_.swap).top(N)
45. 2 Spark: Frequency Example
val spark = new SparkContext()
// Create RDD from Hadoop text file
val docs = spark.textFile("hdfs://docs/")
// Split lines into words and process
val lower = docs.map(line => line.toLowerCase)
val words = lower.flatMap(line => line.split("\\s+"))
val counts = words.map(word => (word, 1))
// Count all words
val freq = counts.reduceByKey(_ + _)
// Swap tuples and get top results
val top = freq.map(_.swap).top(N)
top.foreach(println)
50. 2 Spark: SQL
• Spark SQL allows relational queries expressed in SQL, HiveQL, or Scala
• Uses SchemaRDDs composed of Row objects (= a table in a traditional RDBMS)
• A SchemaRDD can be created from
  – an existing RDD (see the sketch below)
  – a Parquet file
  – a JSON dataset
  – by running HiveQL against data stored in Apache Hive
• Supports a domain-specific language for writing queries
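A minimal sketch (assumed, against the Spark 1.x SQL API) of creating a SchemaRDD from the existing freq RDD and querying it; sc is an existing SparkContext, and the case class Count and the table name counts are illustrative:
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
import sqlContext.createSchemaRDD               // implicit conversion RDD -> SchemaRDD

case class Count(word: String, total: Int)      // schema is inferred from the case class
val countsRdd = freq.map { case (word, total) => Count(word, total) }
countsRdd.registerTempTable("counts")           // expose the RDD as a queryable table

val top10 = sqlContext.sql("SELECT word, total FROM counts ORDER BY total DESC LIMIT 10")
top10.collect().foreach(println)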
51. 2 Spark: SQL
registerFunction("LEN", (_: String).length)
val queryRdd = sql("""
  SELECT * FROM counts
  WHERE LEN(word) = 10
  ORDER BY total DESC
  LIMIT 10
""")
queryRdd
  .map(c => s"word: ${c(0)} \t| total: ${c(1)}")
  .collect()
  .foreach(println)
52. 2 Spark: GraphX
• GraphX is the Spark API for graphs and graph-parallel computation
• APIs to join and traverse graphs
• Optimally partitions and indexes vertices and edges (represented as RDDs)
• Supports PageRank, connected components, triangle counting, …
53. 2 Spark: GraphX
import org.apache.spark.graphx._

val graph = Graph(userIdRDD, assocRDD)
val ranks = graph.pageRank(0.0001).vertices
val userRDD = sc.textFile("graphx/data/users.txt")
val users = userRDD.map { line =>
  val fields = line.split(",")
  (fields(0).toLong, fields(1))
}
val ranksByUsername = users.join(ranks).map {
  case (id, (username, rank)) => (username, rank)
}
54. 2 Spark: MLlib
• Machine learning library similar to Apache Mahout
• Supports statistics, regression, decision trees, clustering, PCA, gradient descent, …
• Iterative algorithms are much faster due to in-memory processing
55. 2 Spark: MLlib
import org.apache.spark.mllib.regression.{LabeledPoint, LinearRegressionWithSGD}
import org.apache.spark.mllib.linalg.Vectors

val data = sc.textFile("data.txt")
val parsedData = data.map { line =>
  val parts = line.split(',')
  LabeledPoint(
    parts(0).toDouble,
    Vectors.dense(parts(1).split(' ').map(_.toDouble)))
}
val model = LinearRegressionWithSGD.train(parsedData, 100)
val valuesAndPreds = parsedData.map { point =>
  val prediction = model.predict(point.features)
  (point.label, prediction)
}
val MSE = valuesAndPreds
  .map { case (v, p) => math.pow(v - p, 2) }.mean()
57. 2 Use Case: Yahoo Native Ads
Logistic regression algorithm
• 120 LOC in Spark/Scala
• 30 min. model creation for 100M samples and 13K features
Initial version launched within 2 hours of the Spark-on-YARN announcement
• Compared to several days for hardware acquisition, system setup, and data movement
http://de.slideshare.net/Hadoop_Summit/sparkonyarn-empower-spark-applications-on-hadoop-cluster
58. 2 Use Case: Yahoo Mobile Ads
Learn from mobile search ads click data
• 600M labeled examples on HDFS
• 100M sparse features
Spark programs for Gradient Boosting Decision Trees
• 6 hours for model training with 100 workers
• Model accuracy very close to heavily manually tuned Logistic Regression models
http://de.slideshare.net/Hadoop_Summit/sparkonyarn-empower-spark-applications-on-hadoop-cluster
62. 2 Spark: Future work
• Spark Core
  – Focus on maturity, optimization, and pluggability
  – Enable long-running services (Slider)
  – Give resources back to the cluster when idle
  – Integrate with Hadoop enhancements
    • Timeline Server
    • ORC file format
• Spark ecosystem
  – Focus on adding capabilities
63. 2 One more thing…
Let’s get started with Spark!