2. Agenda
• Execution Engine (how it works)
• The different APIs of Spark
• Optimization (how Spark optimizes joins and filters)
• The shuffling mechanism in distributed systems
• Whole-Stage Code Generation (fusing operators together by identifying stages in Spark)
• Spark internals and what makes Spark faster
3. Partitioning and Parallelism in Spark
What is a partition in Spark?
• Resilient Distributed Datasets (RDDs) are collections of data items so large that they cannot
fit on a single node and must be partitioned across several nodes.
• A partition in Spark is a logical division of the data stored on a node in the cluster.
• Partitions are the basic units of parallelism in Apache Spark.
• RDDs in Apache Spark are collections of partitions.
• Apache Spark can run one concurrent task for every partition of an RDD, up to the total number
of cores in the cluster (see the sketch below).
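A minimal sketch of controlling partitioning and parallelism; the data and partition counts here are made up for illustration:

val rdd = sc.parallelize(1 to 1000000, 8) // create an RDD with 8 partitions
println(rdd.getNumPartitions)             // 8 partitions => up to 8 concurrent tasks
val wider = rdd.repartition(16)           // full shuffle into 16 partitions
val narrower = wider.coalesce(4)          // merge down to 4 partitions, avoiding a shuffle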
5. Lazy Evaluation & Lineage Graph
• Lazy Evaluation
val lines = sc.textFile("words.txt")                        // Transformation (1)
val filtered = lines.filter(line => line.contains("word1")) // Transformation (2)
filtered.first()                                            // Action
The benefit of lazy evaluation here is that Spark only needs to read the file
until it finds the first matching line instead of reading the whole file, and it
never has to hold the complete file contents in memory.
• Caching – rdd.cache() marks an RDD to be kept in memory after it is first computed,
so later actions can reuse it.
• Spark lineage – the graph of transformations that must be executed once an action
has been called (see the sketch below).
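A minimal sketch of caching and inspecting lineage, reusing the example above (cache() and toDebugString are standard RDD methods):

val lines = sc.textFile("words.txt")
val filtered = lines.filter(line => line.contains("word1"))
filtered.cache()                // keep the filtered RDD in memory after first use
println(filtered.toDebugString) // prints the lineage graph: textFile -> filter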
6. Spark APIs
Write Less Code: Compute an Average
Using RDDs
data = sc.textFile(...).map(lambda line: line.split("\t"))
data.map(lambda x: (x[0], [int(x[1]), 1])) \
    .reduceByKey(lambda x, y: [x[0] + y[0], x[1] + y[1]]) \
    .map(lambda x: [x[0], x[1][0] / x[1][1]]) \
    .collect()
Using DataFrames
sqlCtx.table("people")
    .groupBy("name")
    .agg(avg("age"))
    .collect()
Using SQL
SELECT name, avg(age)
FROM people
GROUP BY name
10. Trees: Abstractions of Users' Programs
SELECT sum(v)
FROM (
  SELECT
    t1.id,
    1 + 2 + t1.value AS v
  FROM t1 JOIN t2
  WHERE
    t1.id = t2.id AND
    t2.id > 50 * 1000) tmp

QueryPlan (a tree of operators, leaves at the bottom):
Aggregate [sum(v)]
  Project [t1.id, 1 + 2 + t1.value AS v]
    Filter [t1.id = t2.id AND t2.id > 50 * 1000]
      Join
        Scan (t1)
        Scan (t2)
11. Logical Plan
• A Logical Plan describes computation on datasets without defining how to
conduct the computation.
• output: a list of attributes generated by this Logical Plan, e.g. [id, v]

Aggregate [sum(v)]
  Project [t1.id, 1 + 2 + t1.value AS v]
    Filter [t1.id = t2.id AND t2.id > 50 * 1000]
      Join
        Scan (t1)
        Scan (t2)
12. Physical Plan
• A Physical Plan describes computation on datasets with specific definitions on
how to conduct the computation.

Hash-Aggregate [sum(v)]
  Project [t1.id, 1 + 2 + t1.value AS v]
    Filter [t1.id = t2.id AND t2.id > 50 * 1000]
      Sort-Merge Join
        Parquet Scan (t1)
        JSON Scan (t2)
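A quick way to see these plans for yourself is DataFrame.explain; a minimal sketch, assuming t1 and t2 are registered as tables:

val q = sqlContext.sql(
  "SELECT sum(v) FROM (SELECT t1.id, 1 + 2 + t1.value AS v " +
  "FROM t1 JOIN t2 WHERE t1.id = t2.id AND t2.id > 50 * 1000) tmp")
q.explain(true) // prints the logical, optimized logical, and physical plans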
13. Combining Multiple Rules
Before optimization:
Aggregate [sum(v)]
  Project [t1.id, 1 + 2 + t1.value AS v]
    Filter [t1.id = t2.id AND t2.id > 50 * 1000]
      Join
        Scan (t1)
        Scan (t2)

After Predicate Pushdown:
Aggregate [sum(v)]
  Project [t1.id, 1 + 2 + t1.value AS v]
    Join [t1.id = t2.id]
      Scan (t1)
      Filter [t2.id > 50 * 1000]
        Scan (t2)
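Other rules combine with pushdown in the same pass; for example, Catalyst's constant folding collapses 1 + 2 to 3 and 50 * 1000 to 50000. A simplified sketch of such a rule (Catalyst's real rule is named ConstantFolding; this stand-in uses the same foldable/eval machinery):

val folded = plan transformAllExpressions {
  // any expression whose value is known at planning time becomes a literal
  case e if e.foldable => Literal.create(e.eval(), e.dataType)
}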
16. An Example Query
SELECT name
FROM (
  SELECT id, name
  FROM People) p
WHERE p.id = 1

Logical Plan:
Project [name]
  Filter [id = 1]
    Project [id, name]
      People
17. Naive Query Planning
SELECT name
FROM (
  SELECT id, name
  FROM People) p
WHERE p.id = 1

Logical Plan:
Project [name]
  Filter [id = 1]
    Project [id, name]
      People

Physical Plan (a naive, operator-by-operator translation):
Project [name]
  Filter [id = 1]
    Project [id, name]
      TableScan [People]
18. Writing Rules as Tree Transformations
1. Find filters on top of projections.
2. Check that the filter can be
evaluated without the result of
the project.
3. If so, switch the operators.
Original Plan:
Project [name]
  Filter [id = 1]
    Project [id, name]
      People

After Filter Push-Down:
Project [name]
  Project [id, name]
    Filter [id = 1]
      People
19. Filter Push Down Transformation
val newPlan = queryPlan transform {
  case f @ Filter(_, p @ Project(_, grandChild))
      if f.references subsetOf grandChild.output =>
    p.copy(child = f.copy(child = grandChild))
}
20. Filter Push Down Transformation
val newPlan = queryPlan transform {
  case f @ Filter(_, p @ Project(_, grandChild))
      if f.references subsetOf grandChild.output =>
    p.copy(child = f.copy(child = grandChild))
}
queryPlan is the Tree; the case block passed to transform is a Partial Function.
21. Filter Push Down Transformation
val newPlan = queryPlan transform {
  case f @ Filter(_, p @ Project(_, grandChild))
      if f.references subsetOf grandChild.output =>
    p.copy(child = f.copy(child = grandChild))
}
Find Filter on Project.
22. Filter Push Down Transformation
val newPlan = queryPlan transform {
  case f @ Filter(_, p @ Project(_, grandChild))
      if f.references subsetOf grandChild.output =>
    p.copy(child = f.copy(child = grandChild))
}
Check that the filter can be evaluated without the result of the project.
23. Filter Push Down Transformation
val newPlan = queryPlan transform {
  case f @ Filter(_, p @ Project(_, grandChild))
      if f.references subsetOf grandChild.output =>
    p.copy(child = f.copy(child = grandChild))
}
If so, switch the order.
24. Optimizing with Rules
Original Plan:
Project [name]
  Filter [id = 1]
    Project [id, name]
      People

Filter Push-Down:
Project [name]
  Project [id, name]
    Filter [id = 1]
      People

Combine Projection:
Project [name]
  Filter [id = 1]
    People

Physical Plan:
IndexLookup [id = 1, return: name]
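In the same style as the filter rule, a deliberately simplified sketch of combining two adjacent projections (this is not Catalyst's real CollapseProject rule, which also has to substitute aliases; the guard here only covers the pass-through case shown above):

val collapsed = newPlan transform {
  case p @ Project(_, Project(_, grandChild))
      if p.references subsetOf grandChild.output =>
    p.copy(child = grandChild) // drop the inner, redundant projection
}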
25. Shuffle Hash Join
A Shuffle Hash Join is the most basic type of join and goes back to MapReduce
fundamentals (see the sketch after this list):
• Map over the two different DataFrames/tables.
• Use the fields in the join condition as the output key.
• Shuffle both datasets by the output key.
• In the reduce phase, join the two datasets: all rows of both tables with the
same key are now on the same machine, and they are sorted.
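A minimal sketch of this map-shuffle-reduce pattern at the RDD level, using made-up data:

val people = sc.parallelize(Seq(("CA", "Alice"), ("RI", "Bob")))               // key = join field
val states = sc.parallelize(Seq(("CA", "California"), ("RI", "Rhode Island")))
val joined = people.join(states) // shuffles both sides by key:
                                 // ("CA", ("Alice", "California")), ("RI", ("Bob", "Rhode Island"))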
27. Shuffle Hash Join Performance
Works best when the DFs:
• Distribute evenly by the key you are joining on.
• Have an adequate number of keys for parallelism.

join_rdd = sqlContext.sql("SELECT *
  FROM people_in_the_us
  JOIN states
  ON people_in_the_us.state = states.name")
28. Uneven Sharding & Limited Parallelism
All the data for the US will be shuffled into only 50 keys, one for each of the states.
Problems:
● Uneven sharding – **all** the data for CA lands in one partition and **all** the
data for RI in another, however many partitions (1 to N) the US DF started with,
while the states DF stays small.
● Limited parallelism – only 50 output partitions.
• A larger Spark Cluster will not solve these problems!
29. More Performance Considerations
join_rdd = sqlContext.sql("SELECT *
  FROM people_in_california
  LEFT JOIN all_the_people_in_the_world
  ON people_in_california.id = all_the_people_in_the_world.id")
Final output keys = # of people in CA, so we don't need a huge Spark cluster, right?
30. Left Join – Shuffle Step
No: the size of the Spark cluster needed to run this job is limited by the large
table rather than by the medium-sized table.
Not a problem:
● Even sharding
● Good parallelism
The real problem: the shuffle moves all the data from both tables (the entire CA DF
and the entire World DF) across the network before the join drops the non-matching
keys from the final joined output.
31. A Better Solution
Filter the World DF down to only the entries whose IDs match the CA DF, then shuffle
and join the CA DF with the resulting partial World DF (see the sketch below).
Benefits:
● Less data shuffled over the network, and less shuffle space needed.
● More transforms, but still faster.
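A hedged sketch of that filter transform using a left-semi join; the DataFrame names caDF and worldDF are hypothetical stand-ins for people_in_california and all_the_people_in_the_world:

val caIds = caDF.select("id").distinct()
val partialWorldDF = worldDF.join(caIds, Seq("id"), "left_semi") // the filter transform
val joined = caDF.join(partialWorldDF, Seq("id"), "left")        // much smaller shuffle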
32. Broadcast Hash Join
Optimization: when one of the DFs is small enough to fit in memory on a single
machine, broadcast the small DF to every partition of the large DF (partitions 1
through N).
Parallelism of the large DF is maintained (n output partitions), and a shuffle is
not even needed.
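A minimal sketch of requesting a broadcast hash join explicitly (largeDF and smallDF are hypothetical names; broadcast() is the standard hint in org.apache.spark.sql.functions):

import org.apache.spark.sql.functions.broadcast
val joined = largeDF.join(broadcast(smallDF), Seq("id"))
// Spark also chooses a broadcast join automatically when the small side is
// below spark.sql.autoBroadcastJoinThreshold (10 MB by default).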