Big Data is like...
“Big Data is like teenage sex: everyone
talks about it, nobody really knows how
to do it, everyone thinks everyone else
is doing it, so everyone claims they are
doing it”
Big Data is all about...
● well, the data :)
● It is said that 2.5 exabytes (2.5×10^18 bytes) of data are created around the world every single day
● It is a volume at which standard tools and methods of processing can no longer be used
To the rescue
MAP REDUCE
“'MapReduce' is a framework for processing
parallelizable problems across huge datasets
using a cluster, taking into consideration
scalability and fault-tolerance”
Word Count
● The “Hello World” of the Big Data world.
● For an initial input of multiple lines, extract all words together with their number of occurrences.
Input:
To be or not to be
Let it be
Be me
It must be
Let it be
Result:
be 6
to 2
let 2
it 3
or 1
not 1
must 1
me 1
Input → Splitting → Mapping

Input:
To be or not to be
Let it be
Be me
It must be
Let it be

Splitting: the input is divided into splits, each handled by a separate mapper.

Mapping: every mapper emits a (word, 1) pair for each word in its split:
(to,1) (be,1) (or,1) (not,1) (to,1) (be,1)
(let,1) (it,1) (be,1)
(be,1) (me,1)
(it,1) (must,1) (be,1)
(let,1) (it,1) (be,1)
Input → Splitting → Mapping → Shuffling

Shuffling: the emitted pairs are grouped by key, so all occurrences of the same word end up together:
(be,1) (be,1) (be,1) (be,1) (be,1) (be,1)
(to,1) (to,1)
(let,1) (let,1)
(it,1) (it,1) (it,1)
(or,1)
(not,1)
(must,1)
(me,1)

Moving data between nodes makes the shuffle the EXPENSIVE part of the job.
Input → Splitting → Mapping → Shuffling → Reducing → Final result

Reducing: for every key the partial counts are summed, and the merged reducer outputs form the final result:
be 6
to 2
let 2
it 3
or 1
not 1
must 1
me 1
Word count - pseudo-code
function map(String name, String document):
  for each word w in document:
    emit (w, 1)

function reduce(String word, Iterator partialCounts):
  sum = 0
  for each pc in partialCounts:
    sum += ParseInt(pc)
  emit (word, sum)
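To make the pseudo-code concrete, here is a minimal single-machine sketch of the same two functions in Scala, run over plain collections rather than a cluster; the names and the tokenization are illustrative only, not part of any framework.

// map phase: emit (word, 1) for every word of every line
def map(document: Seq[String]): Seq[(String, Int)] =
  for {
    line <- document
    word <- line.toLowerCase.split("\\s+") if word.nonEmpty
  } yield (word, 1)

// reduce phase: sum the partial counts collected for one word
def reduce(word: String, partialCounts: Iterable[Int]): (String, Int) =
  (word, partialCounts.sum)

val input = Seq("To be or not to be", "Let it be", "Be me", "It must be", "Let it be")
val mapped   = map(input)                            // the mapping step
val shuffled = mapped.groupBy { case (w, _) => w }   // the shuffling step: group pairs by key
val counts   = shuffled.map { case (w, pairs) => reduce(w, pairs.map(_._2)) }
// counts("be") == 6, counts("to") == 2, ...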
Word count - revisited
function map(String name, String document):
  for each word w in document:
    emit (w, 1)

function reduce(String word, Iterator partialCounts):
  sum = 0
  for each pc in partialCounts:
    sum += ParseInt(pc)
  emit (word, sum)
Word count: Hadoop implementation
public class WordCount {

  public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
      String line = value.toString();
      StringTokenizer tokenizer = new StringTokenizer(line);
      while (tokenizer.hasMoreTokens()) {
        word.set(tokenizer.nextToken());
        context.write(word, one);
      }
    }
  }

  public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) { sum += val.get(); }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "wordcount");
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    job.setMapperClass(Map.class);
    job.setReducerClass(Reduce.class);
    job.setInputFormatClass(TextInputFormat.class);
    // remaining standard job wiring to complete the example: output format, paths, submission
    job.setOutputFormatClass(TextOutputFormat.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    job.waitForCompletion(true);
  }
}
Performance issues
● Everything must be expressed as a Map-Reduce pair
● Output of every job is saved to files
● Iterative algorithms go through the IO path again and again
● Poor (key, value) API; even a basic join requires expensive code
Problems with MapReduce
1. MapReduce provides a difficult programming model for developers
2. It suffers from a number of performance issues
3. While batch-mode analysis is still important, reacting to events as they arrive has become more important (MapReduce lacks support for near-real-time processing)
The Big Picture
Driver Program
Cluster (Standalone, Yarn, Mesos)
Master
Executor 1
Executor 2
Executor 3

val master = "spark://host:pt"
val conf = new SparkConf().setMaster(master)
val sc = new SparkContext(conf)
RDD - the definition
RDD stands for resilient distributed dataset
Resilient - if data is lost, it can be recreated
Distributed - stored on nodes across the cluster
Dataset - initial data comes from a file
or can be created programmatically
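As a minimal sketch of both kinds of initial data (assuming a SparkContext called sc is already available; the file name is just an example):

val fromFile = sc.textFile("logs.txt")    // RDD[String], one element per line of the file
val fromCode = sc.parallelize(1 to 1000)  // RDD[Int] created programmatically from a local collection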
Getting started with Spark
https://github.com/rabbitonweb/spark-workshop
● Make it a love story: print out all lines that have both Juliet & Romeo in them
● Scala API docs: http://spark.apache.org/docs/latest/api/scala/index.html#org.
● Would be nice to have a REPL
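One possible solution sketch for the exercise (the file path mirrors the one used later in the deck, and treating the match as case-insensitive is an assumption):

sc.textFile("src/main/resources/all-shakespeare.txt")
  .filter { line =>
    val l = line.toLowerCase
    l.contains("romeo") && l.contains("juliet") // keep only lines mentioning both characters
  }
  .collect()
  .foreach(println)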
RDD - example
val logs = sc.textFile("logs.txt")
val lcLogs = logs.map(_.toLowerCase)
val errors = lcLogs.filter(_.contains("error"))
Each call creates yet another RDD - a performance alert?!?!
RDD - Operations
1. Transformations
a. map
b. filter
c. flatMap
d. sample
e. union
f. intersection
g. distinct
h. groupByKey
i. ...
2. Actions
a. reduce
b. collect
c. count
d. first
e. take(n)
f. takeSample
g. saveAsTextFile
h. ...
RDD - example
val logs = sc.textFile("logs.txt")
val lcLogs = logs.map(_.toLowerCase)
val errors = lcLogs.filter(_.contains("error"))
val numberOfErrors = errors.count
The call to count is an action: it triggers the computation, and numberOfErrors holds the calculated value (a Long).
Directed Acyclic Graph
val startings = allShakespeare
  .filter(_.trim != "")
  .map(line => (line.charAt(0), line))
  .mapValues(_.size)
  .reduceByKey { case (acc, length) => acc + length }
println(startings.toDebugString)

(2) ShuffledRDD[5] at reduceByKey at Ex3.scala:18 []
 +-(2) MapPartitionsRDD[4] at mapValues at Ex3.scala:17 []
    |  MapPartitionsRDD[3] at map at Ex3.scala:16 []
    |  MapPartitionsRDD[2] at filter at Ex3.scala:15 []
    |  src/main/resources/all-shakespeare.txt MapPartitionsRDD[1]
    |  src/main/resources/all-shakespeare.txt HadoopRDD[0] at textFile
Directed Acyclic Graph
val startings = allShakespeare
  .filter(_.trim != "")
  .groupBy(_.charAt(0))
  .mapValues(_.size)
  .reduceByKey { case (acc, length) => acc + length }
println(startings.toDebugString)

(2) MapPartitionsRDD[6] at reduceByKey at Ex3.scala:42
 |  MapPartitionsRDD[5] at mapValues at Ex3.scala:41
 |  ShuffledRDD[4] at groupBy at Ex3.scala:40
 +-(2) MapPartitionsRDD[3] at groupBy at Ex3.scala:40
    |  MapPartitionsRDD[2] at filter at Ex3.scala:39
    |  src/main/resources/all-shakespeare.txt MapPartitionsRDD[1]
    |  src/main/resources/all-shakespeare.txt HadoopRDD[0]

Compare where the shuffle sits: in the previous DAG the only shuffle (ShuffledRDD) happens at the final reduceByKey, after the values have already been reduced to small (char, number) pairs, while here groupBy shuffles the full lines before any counting is done.
What about Resilience?
RDD stands for resilient distributed dataset
Resilient - if data is lost, it can be recreated
Distributed - stored on nodes across the cluster
Dataset - initial data comes from a file or can be created programmatically
Resilience
Driver Program
Cluster (Standalone, Yarn, Mesos)
Master
Executor 1
Executor 2
Executor 3

val master = "spark://host:pt"
val conf = new SparkConf().setMaster(master)
val sc = new SparkContext(conf)
val logs = sc.textFile("logs.txt")
println(logs.count())

(A sequence of animated slides follows the three tasks T1, T2 and T3 of the count job: the master distributes them across Executors 1-3 and the partial results flow back to the driver; when one executor dies mid-job, its task is rescheduled on the surviving executors and the job still completes.)
Exercise 5 - The Big Data problem
● Write a Word Count program using Spark
● Use all-shakespeare.txt as input
Example input:
To be or not to be
Let it be
Be me
It must be
Let it be
Expected output:
be 6
to 2
let 2
it 3
or 1
not 1
must 1
me 1
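A sketch of one possible solution using the operations shown so far (the file path and the tokenization regex are assumptions); it deliberately uses groupBy, whose drawback is discussed right after:

val counts = sc.textFile("src/main/resources/all-shakespeare.txt")
  .flatMap(_.toLowerCase.split("\\W+")) // split every line into words
  .filter(_.nonEmpty)
  .groupBy(identity)                    // gather all occurrences of each word
  .mapValues(_.size)                    // and count them

counts.collect().foreach { case (word, n) => println(s"$word $n") }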
But that solution has a major flaw
● Flaw: groupBy
● But before we can understand it, we have to:
○ instantiate a Standalone cluster
○ understand how the cluster works
○ talk about serialization (& uber jar!)
○ see the Spark UI
○ talk about Spark configuration
● http://spark.apache.org/docs/latest/configuration.html
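For later reference, a minimal sketch of the contrast (words is assumed to be an RDD[String] of individual words): groupBy ships every single occurrence across the network before counting, while reduceByKey sums partial counts on each partition first and only shuffles the already-reduced pairs.

// groupBy variant: all occurrences are shuffled, then counted
val countsWithGroupBy = words.groupBy(identity).mapValues(_.size)

// reduceByKey variant: per-partition partial sums are combined map-side, so far less data is shuffled
val countsWithReduceByKey = words.map(w => (w, 1)).reduceByKey(_ + _)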
Mid-term exam
● Given all-shakespeare.txt
● Given name popularity in male-names.txt & female-names.txt
● Show how popular a given name is nowadays & how many times it occurred in Shakespeare
● Result: a key-value pair (key: name, value: pair)
E.g. Romeo is mentioned 340 times in Shakespeare and is nowadays the 688th most popular name,
so the result will be: (romeo,(688,340))
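A rough solution sketch; the layout of the name files ("<rank> <name>" per line) and the parseNameAndRank helper are assumptions made here for illustration, since the slides do not show the file format.

// assumed layout of male-names.txt / female-names.txt: "<rank> <name>" per line
def parseNameAndRank(line: String): (String, Int) = {
  val Array(rank, name) = line.trim.split("\\s+", 2)
  (name.toLowerCase, rank.toInt)
}

val occurrences = sc.textFile("src/main/resources/all-shakespeare.txt")
  .flatMap(_.toLowerCase.split("\\W+"))
  .filter(_.nonEmpty)
  .map(w => (w, 1))
  .reduceByKey(_ + _)                       // (name, number of mentions in Shakespeare)

val popularity = (sc.textFile("male-names.txt") ++ sc.textFile("female-names.txt"))
  .map(parseNameAndRank)                    // (name, present-day popularity rank)

val result = popularity.join(occurrences)   // e.g. (romeo,(688,340))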
What is an RDD?
An RDD needs to hold 3 chunks of information in order to do its work:
1. pointer to its parent
2. how its internal data is partitioned
3. how to evaluate its internal data
What is a partition?
A partition represents a subset of the data within your distributed collection.
override def getPartitions: Array[Partition] = ???
How this subset is defined depends on the type of the RDD.
Example: HadoopRDD
val journal = sc.textFile("hdfs://journal/*")
How is a HadoopRDD partitioned?
In a HadoopRDD a partition corresponds exactly to a file chunk in HDFS.
Example: HadoopRDD
class HadoopRDD[K, V](...) extends RDD[(K, V)](sc, Nil) with Logging {
  ...
  override def getPartitions: Array[Partition] = {
    val jobConf = getJobConf()
    SparkHadoopUtil.get.addCredentials(jobConf)
    val inputFormat = getInputFormat(jobConf)
    if (inputFormat.isInstanceOf[Configurable]) {
      inputFormat.asInstanceOf[Configurable].setConf(jobConf)
    }
    val inputSplits = inputFormat.getSplits(jobConf, minPartitions)
    val array = new Array[Partition](inputSplits.size)
    for (i <- 0 until inputSplits.size) { array(i) = new HadoopPartition(id, i, inputSplits(i)) }
    array
  }
Example: MapPartitionsRDD
val journal = sc.textFile("hdfs://journal/*")
val fromMarch = journal.filter {
  case (date, size) => LocalDate.parse(date) isAfter LocalDate.of(2015,3,1)
}
How is a MapPartitionsRDD partitioned?
MapPartitionsRDD inherits partition information from its parent RDD.
What is an RDD?
An RDD needs to hold 3 chunks of information in order to do its work:
1. pointer to its parent
2. how its internal data is partitioned
3. how to evaluate its internal data
Running Job aka materializing DAG
sc.textFile() .groupBy() .map { } .filter { } .collect()
The chain is divided into two stages at the shuffle introduced by groupBy (Stage 1 before it, Stage 2 after it). collect() is the action, and actions are implemented using the sc.runJob method.
Running Job aka materializing DAG
/**
 * Run a function on a given set of partitions in an RDD and return the results as an array.
 */
def runJob[T, U](
    rdd: RDD[T],
    partitions: Seq[Int],
    func: Iterator[T] => U,
): Array[U]
Running Job aka materializing DAG
/**
 * Return an array that contains all of the elements in this RDD.
 */
def collect(): Array[T] = {
  val results = sc.runJob(this, (iter: Iterator[T]) => iter.toArray)
  Array.concat(results: _*)
}

/**
 * Return the number of elements in the RDD.
 */
def count(): Long = sc.runJob(this, Utils.getIteratorSize _).sum
Multiple jobs for a single action
/**
 * Take the first num elements of the RDD. It works by first scanning one partition, and uses the results from that
 * partition to estimate the number of additional partitions needed to satisfy the limit.
 */
def take(num: Int): Array[T] = {
  (….)
  val left = num - buf.size
  val res = sc.runJob(this, (it: Iterator[T]) => it.take(left).toArray, p, allowLocal = true)
  (….)
  res.foreach(buf ++= _.take(num - buf.size))
  partsScanned += numPartsToTry
  (….)
  buf.toArray
}
Towards efficiency
val events = sc.textFile("hdfs://journal/*")
  .groupBy(extractDate _)
  .map { case (date, events) => (date, events.size) }
  .filter {
    case (date, size) => LocalDate.parse(date) isAfter LocalDate.of(2015,3,1)
  }

scala> events.toDebugString
(4) MapPartitionsRDD[22] at filter at <console>:50 []
 |  MapPartitionsRDD[21] at map at <console>:49 []
 |  ShuffledRDD[20] at groupBy at <console>:48 []
 +-(6) HadoopRDD[17] at textFile at <console>:47 []

events.count
Stage 1
val events = sc.textFile("hdfs://journal/*")
  .groupBy(extractDate _)
  .map { case (date, events) => (date, events.size) }
  .filter { case (date, size) => LocalDate.parse(date) isAfter LocalDate.of(2015,3,1) }
A bit more about partitions
val events = sc.textFile("hdfs://journal/*") // here a small number of partitions, let's say 4
  .repartition(256)                          // note, this will cause a shuffle
  .map(e => (extractDate(e), e))

val events = sc.textFile("hdfs://journal/*") // here a lot of partitions, let's say 1024
  .filter { e => LocalDate.parse(extractDate(e)) isAfter LocalDate.of(2015,3,1) }
  .coalesce(64)                              // this will NOT shuffle
  .map(e => (extractDate(e), e))
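A small sketch (illustrative path and numbers) of how the partition counts can be inspected around those two calls:

val raw = sc.textFile("hdfs://journal/*")
println(raw.partitions.size)        // e.g. 4, driven by the number of input splits

val widened = raw.repartition(256)  // full shuffle, spreads the data over 256 partitions
println(widened.partitions.size)    // 256

val narrowed = widened.coalesce(64) // merges existing partitions, no shuffle
println(narrowed.partitions.size)   // 64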
What is an RDD?
An RDD needs to hold 3 + 2 chunks of information in order to do its work:
1. pointer to its parent
2. how its internal data is partitioned
3. how to evaluate its internal data
4. data locality
5. partitioner
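Those five chunks correspond roughly to the following members of Spark's RDD class (signatures abbreviated; an outline, not a complete listing):

abstract class RDD[T](...) {
  // 1. pointer to its parent(s)
  protected def getDependencies: Seq[Dependency[_]]
  // 2. how its internal data is partitioned
  protected def getPartitions: Array[Partition]
  // 3. how to evaluate its internal data
  def compute(split: Partition, context: TaskContext): Iterator[T]
  // 4. data locality: preferred locations for a given partition
  protected def getPreferredLocations(split: Partition): Seq[String]
  // 5. optional partitioner for key-value RDDs
  val partitioner: Option[Partitioner] = None
}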
Spark performance - vs Hadoop (3)
“(...) we decided to participate in the Sort Benchmark (...), an industry benchmark on how fast a system can sort 100 TB of data (1 trillion records). The previous world record was 72 minutes, set by (...) Hadoop (...) cluster of 2100 nodes. Using Spark on 206 nodes, we completed the benchmark in 23 minutes. This means that Spark sorted the same data 3X faster using 10X fewer machines. All (...) without using Spark’s in-memory cache.”