2. Agenda
• Execution Engine (how it works)
• The different APIs of Spark
• Optimization (how Spark optimizes joins and filters)
• The shuffling mechanism in distributed systems
• Whole-Stage Code Generation (fusing operators together by identifying stages in Spark)
• Spark internals and what makes Spark faster
3. Partitioning and Parallelism in Spark
What is a partition in Spark?
• Resilient Distributed Datasets (RDDs) are collections of data items so large that they cannot
fit on a single node and must be partitioned across several nodes.
• A partition in Spark is a logical division of the data stored on a node in the cluster.
• Partitions are the basic units of parallelism in Apache Spark.
• RDDs in Apache Spark are collections of partitions.
• Apache Spark can run one concurrent task for every partition of an RDD, up to the total number
of cores in the cluster (see the sketch below).
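A minimal sketch of controlling partitioning and parallelism; the data and partition counts here are made up for illustration:

val rdd = sc.parallelize(1 to 1000000, 8) // create an RDD with 8 partitions
println(rdd.getNumPartitions)             // 8 partitions => up to 8 concurrent tasks
val wider = rdd.repartition(16)           // full shuffle into 16 partitions
val narrower = wider.coalesce(4)          // merge down to 4 partitions, avoiding a shuffle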
5. Lazy Evaluation & Lineage Graph
• Lazy Evaluation
val lines = sc.textFile("words.txt")                        // Transformation (1)
val filtered = lines.filter(line => line.contains("word1")) // Transformation (2)
filtered.first()                                            // Action
The benefit of lazy evaluation here is that Spark only needs to read the file
until it finds the first matching line instead of reading the whole file, and it
never has to hold the complete file contents in memory.
• Caching – rdd.cache() marks an RDD to be kept in memory after it is first computed,
so later actions can reuse it.
• Spark lineage – the graph of transformations that must be executed once an action
has been called (see the sketch below).
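A minimal sketch of caching and inspecting lineage, reusing the example above (cache() and toDebugString are standard RDD methods):

val lines = sc.textFile("words.txt")
val filtered = lines.filter(line => line.contains("word1"))
filtered.cache()                // keep the filtered RDD in memory after first use
println(filtered.toDebugString) // prints the lineage graph: textFile -> filter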
6. Spark APIs
Write Less Code: Compute an Average
Using RDDs
data = sc.textFile(...).map(lambda line: line.split("\t"))
data.map(lambda x: (x[0], [int(x[1]), 1])) \
    .reduceByKey(lambda x, y: [x[0] + y[0], x[1] + y[1]]) \
    .map(lambda x: [x[0], x[1][0] / x[1][1]]) \
    .collect()
Using DataFrames
sqlCtx.table("people")
    .groupBy("name")
    .agg(avg("age"))
    .collect()
Using SQL
SELECT name, avg(age)
FROM people
GROUP BY name
10. Trees: Abstractions of Users' Programs
SELECT sum(v)
FROM (
  SELECT
    t1.id,
    1 + 2 + t1.value AS v
  FROM t1 JOIN t2
  WHERE
    t1.id = t2.id AND
    t2.id > 50 * 1000) tmp

QueryPlan (a tree of operators, leaves at the bottom):
Aggregate [sum(v)]
  Project [t1.id, 1 + 2 + t1.value AS v]
    Filter [t1.id = t2.id AND t2.id > 50 * 1000]
      Join
        Scan (t1)
        Scan (t2)
11. Logical Plan
• A Logical Plan describes computation on datasets without defining how to
conduct the computation.
• output: a list of attributes generated by this Logical Plan, e.g. [id, v]

Aggregate [sum(v)]
  Project [t1.id, 1 + 2 + t1.value AS v]
    Filter [t1.id = t2.id AND t2.id > 50 * 1000]
      Join
        Scan (t1)
        Scan (t2)
12. Physical Plan
• A Physical Plan describes computation on datasets with specific definitions on
how to conduct the computation.

Hash-Aggregate [sum(v)]
  Project [t1.id, 1 + 2 + t1.value AS v]
    Filter [t1.id = t2.id AND t2.id > 50 * 1000]
      Sort-Merge Join
        Parquet Scan (t1)
        JSON Scan (t2)
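A quick way to see these plans for yourself is DataFrame.explain; a minimal sketch, assuming t1 and t2 are registered as tables:

val q = sqlContext.sql(
  "SELECT sum(v) FROM (SELECT t1.id, 1 + 2 + t1.value AS v " +
  "FROM t1 JOIN t2 WHERE t1.id = t2.id AND t2.id > 50 * 1000) tmp")
q.explain(true) // prints the logical, optimized logical, and physical plans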
13. Combining Multiple Rules
Before optimization:
Aggregate [sum(v)]
  Project [t1.id, 1 + 2 + t1.value AS v]
    Filter [t1.id = t2.id AND t2.id > 50 * 1000]
      Join
        Scan (t1)
        Scan (t2)

After Predicate Pushdown:
Aggregate [sum(v)]
  Project [t1.id, 1 + 2 + t1.value AS v]
    Join [t1.id = t2.id]
      Scan (t1)
      Filter [t2.id > 50 * 1000]
        Scan (t2)
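Other rules combine with pushdown in the same pass; for example, Catalyst's constant folding collapses 1 + 2 to 3 and 50 * 1000 to 50000. A simplified sketch of such a rule (Catalyst's real rule is named ConstantFolding; this stand-in uses the same foldable/eval machinery):

val folded = plan transformAllExpressions {
  // any expression whose value is known at planning time becomes a literal
  case e if e.foldable => Literal.create(e.eval(), e.dataType)
}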
16. An Example Query
SELECT name
FROM (
  SELECT id, name
  FROM People) p
WHERE p.id = 1

Logical Plan:
Project [name]
  Filter [id = 1]
    Project [id, name]
      People
17. Naive Query Planning
SELECT name
FROM (
  SELECT id, name
  FROM People) p
WHERE p.id = 1

Logical Plan:
Project [name]
  Filter [id = 1]
    Project [id, name]
      People

Physical Plan (a naive, operator-by-operator translation):
Project [name]
  Filter [id = 1]
    Project [id, name]
      TableScan [People]
18. Writing Rules as Tree Transformations
1. Find filters on top of projections.
2. Check that the filter can be
evaluated without the result of
the project.
3. If so, switch the operators.
Original Plan:
Project [name]
  Filter [id = 1]
    Project [id, name]
      People

After Filter Push-Down:
Project [name]
  Project [id, name]
    Filter [id = 1]
      People
19. Filter Push Down Transformation
val newPlan = queryPlan transform {
  case f @ Filter(_, p @ Project(_, grandChild))
      if f.references subsetOf grandChild.output =>
    p.copy(child = f.copy(child = grandChild))
}
20. Filter Push Down Transformation
val newPlan = queryPlan transform {
  case f @ Filter(_, p @ Project(_, grandChild))
      if f.references subsetOf grandChild.output =>
    p.copy(child = f.copy(child = grandChild))
}
queryPlan is the Tree; the case block passed to transform is a Partial Function.
21. Filter Push Down Transformation
val newPlan = queryPlan transform {
  case f @ Filter(_, p @ Project(_, grandChild))
      if f.references subsetOf grandChild.output =>
    p.copy(child = f.copy(child = grandChild))
}
Find Filter on Project.
22. Filter Push Down Transformation
val newPlan = queryPlan transform {
  case f @ Filter(_, p @ Project(_, grandChild))
      if f.references subsetOf grandChild.output =>
    p.copy(child = f.copy(child = grandChild))
}
Check that the filter can be evaluated without the result of the project.
23. Filter Push Down Transformation
val newPlan = queryPlan transform {
  case f @ Filter(_, p @ Project(_, grandChild))
      if f.references subsetOf grandChild.output =>
    p.copy(child = f.copy(child = grandChild))
}
If so, switch the order.
24. Optimizing with Rules
Original Plan:
Project [name]
  Filter [id = 1]
    Project [id, name]
      People

Filter Push-Down:
Project [name]
  Project [id, name]
    Filter [id = 1]
      People

Combine Projection:
Project [name]
  Filter [id = 1]
    People

Physical Plan:
IndexLookup [id = 1, return: name]
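In the same style as the filter rule, a deliberately simplified sketch of combining two adjacent projections (this is not Catalyst's real CollapseProject rule, which also has to substitute aliases; the guard here only covers the pass-through case shown above):

val collapsed = newPlan transform {
  case p @ Project(_, Project(_, grandChild))
      if p.references subsetOf grandChild.output =>
    p.copy(child = grandChild) // drop the inner, redundant projection
}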
25. Shuffle Hash Join
A Shuffle Hash Join is the most basic type of join and goes back to MapReduce
fundamentals (see the sketch after this list):
• Map over the two different DataFrames/tables.
• Use the fields in the join condition as the output key.
• Shuffle both datasets by the output key.
• In the reduce phase, join the two datasets: all rows of both tables with the
same key are now on the same machine, and they are sorted.
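A minimal sketch of this map-shuffle-reduce pattern at the RDD level, using made-up data:

val people = sc.parallelize(Seq(("CA", "Alice"), ("RI", "Bob")))               // key = join field
val states = sc.parallelize(Seq(("CA", "California"), ("RI", "Rhode Island")))
val joined = people.join(states) // shuffles both sides by key:
                                 // ("CA", ("Alice", "California")), ("RI", ("Bob", "Rhode Island"))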
27. Shuffle Hash Join Performance
Works best when the DFs:
• Distribute evenly by the key you are joining on.
• Have an adequate number of keys for parallelism.

join_rdd = sqlContext.sql("SELECT *
  FROM people_in_the_us
  JOIN states
  ON people_in_the_us.state = states.name")
28. Uneven Sharding & Limited Parallelism
All the data for the US will be shuffled into only 50 keys, one for each of the states.
Problems:
● Uneven sharding – **all** the data for CA lands in one partition and **all** the
data for RI in another, however many partitions (1 to N) the US DF started with,
while the states DF stays small.
● Limited parallelism – only 50 output partitions.
• A larger Spark Cluster will not solve these problems!
29. More Performance Considerations
join_rdd = sqlContext.sql("SELECT *
  FROM people_in_california
  LEFT JOIN all_the_people_in_the_world
  ON people_in_california.id = all_the_people_in_the_world.id")
Final output keys = # of people in CA, so we don't need a huge Spark cluster, right?
30. Left Join – Shuffle Step
No: the size of the Spark cluster needed to run this job is limited by the large
table rather than by the medium-sized table.
Not a problem:
● Even sharding
● Good parallelism
The real problem: the shuffle moves all the data from both tables (the entire CA DF
and the entire World DF) across the network before the join drops the non-matching
keys from the final joined output.
31. A Better Solution
Filter the World DF down to only the entries whose IDs match the CA DF, then shuffle
and join the CA DF with the resulting partial World DF (see the sketch below).
Benefits:
● Less data shuffled over the network, and less shuffle space needed.
● More transforms, but still faster.
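A hedged sketch of that filter transform using a left-semi join; the DataFrame names caDF and worldDF are hypothetical stand-ins for people_in_california and all_the_people_in_the_world:

val caIds = caDF.select("id").distinct()
val partialWorldDF = worldDF.join(caIds, Seq("id"), "left_semi") // the filter transform
val joined = caDF.join(partialWorldDF, Seq("id"), "left")        // much smaller shuffle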
32. Broadcast Hash Join
Optimization: when one of the DFs is small enough to fit in memory on a single
machine, broadcast the small DF to every partition of the large DF (partitions 1
through N).
Parallelism of the large DF is maintained (n output partitions), and a shuffle is
not even needed.
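A minimal sketch of requesting a broadcast hash join explicitly (largeDF and smallDF are hypothetical names; broadcast() is the standard hint in org.apache.spark.sql.functions):

import org.apache.spark.sql.functions.broadcast
val joined = largeDF.join(broadcast(smallDF), Seq("id"))
// Spark also chooses a broadcast join automatically when the small side is
// below spark.sql.autoBroadcastJoinThreshold (10 MB by default).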