Optimization and Internals of Apache Spark
Arpit Tak
Spark Developer, Quotient Technology Inc.
http://in.linkedin.com/in/arpittak/
Agenda
• Execution Engine (how it works)
• The different APIs of Spark
• Optimization (how Spark optimizes joins and filters)
• The shuffle mechanism in a distributed system
• Whole-Stage Code Generation (fusing operators together by identifying stages in Spark)
• Spark internals and what makes Spark faster
Partitioning and Parallelism in Spark
What is a partition in Spark?
• Resilient Distributed Datasets (RDDs) are collections of data items that are too large to fit on a single node, so they must be partitioned across several nodes.
• A partition in Spark is a logical division of the data stored on a node in the cluster.
• Partitions are the basic units of parallelism in Apache Spark.
• RDDs in Apache Spark are collections of partitions.
• Apache Spark can run one concurrent task for every partition of an RDD, up to the total number of cores in the cluster (see the sketch below).
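As a rough illustration of partitions as the unit of parallelism, here is a minimal Scala sketch, assuming a local SparkSession and a hypothetical words.txt input file; repartition() and coalesce() change the partition count and therefore the maximum parallelism:

import org.apache.spark.sql.SparkSession

object PartitionDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("partition-demo").master("local[4]").getOrCreate()
    val sc = spark.sparkContext

    // "words.txt" is a hypothetical input file; minPartitions is only a hint to the input format.
    val lines = sc.textFile("words.txt", minPartitions = 8)
    println(s"initial partitions = ${lines.getNumPartitions}")

    // repartition() redistributes data with a shuffle; coalesce() narrows without one.
    val widened  = lines.repartition(16)
    val narrowed = widened.coalesce(4)
    println(s"after repartition = ${widened.getNumPartitions}, after coalesce = ${narrowed.getNumPartitions}")

    spark.stop()
  }
}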
How your Spark Application runs on a Hadoop cluster
Lazy Evaluation & Lineage Graph
• Lazy Evaluation
val lines = sc.textFile("words.txt")                        //Transformation (1)
val filtered = lines.filter(line => line.contains("word1"))
filtered.first()                                            //Action (2)
The benefit of this lazy evaluation is that Spark only needs to read as far as the first matching line instead of the whole file, and there is no need to hold the complete file content in memory.
• Caching – rdd.cache()
• Spark Lineage – the graph of transformations that need to be executed after an action has been called (see the sketch below).
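A minimal sketch of how caching and lineage show up in practice, assuming the `filtered` RDD from the snippet above and an active SparkContext:

// Marks the RDD for reuse; nothing is computed yet because of lazy evaluation.
val cached = filtered.cache()
// Prints the lineage graph of transformations leading to this RDD.
println(cached.toDebugString)
cached.first()    // action: triggers evaluation and populates the cache
cached.count()    // a second action reuses the cached partitions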
Spark APIs
Write Less Code: Compute an Average

Using RDDs
data = sc.textFile(...).map(lambda line: line.split("\t"))
data.map(lambda x: (x[0], [int(x[1]), 1])) \
    .reduceByKey(lambda x, y: [x[0] + y[0], x[1] + y[1]]) \
    .map(lambda x: [x[0], x[1][0] / x[1][1]]) \
    .collect()

Using DataFrames
sqlCtx.table("people") \
    .groupBy("name") \
    .agg(avg("age")) \
    .collect()
Using SQL
SELECT name, avg(age)
FROM people
GROUP BY name
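For comparison, a minimal Scala sketch of the same aggregation with the DataFrame API, assuming a SparkSession named `spark` and a registered `people` table:

import org.apache.spark.sql.functions.avg

// Same computation as the SQL above: average age per name.
val result = spark.table("people")
  .groupBy("name")
  .agg(avg("age"))
  .collect()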
Not Just Less Code: Faster Implementations

[Bar chart: time to aggregate 10 million int pairs (secs), comparing DataFrame SQL, DataFrame R, DataFrame Python, DataFrame Scala, RDD Python, and RDD Scala.]
How Catalyst Works: An Overview

SQL AST / DataFrame / Dataset  →  Query Plan  →  Catalyst transformations  →  Optimized Query Plan  →  RDDs

Catalyst operates on trees: abstractions of users' programs.
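One way to watch these stages on a real query is Dataset.explain(true). A minimal sketch, assuming a SparkSession named `spark` and a registered `people` view:

val df = spark.sql("SELECT name, avg(age) FROM people GROUP BY name")

// Prints the parsed and analyzed logical plans, the Catalyst-optimized
// logical plan, and the final physical plan.
df.explain(true)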
Trees: Abstractions of Users' Programs
Example 1
SELECT sum(v)
FROM (
SELECT
t1.id,
1 + 2 + t1.value AS v
FROM t1 JOIN t2
WHERE
t1.id = t2.id AND
t2.id > 50 * 1000) tmp
Trees: Abstractions of Users' Programs

QueryPlan for:
SELECT sum(v)
FROM (
  SELECT
    t1.id,
    1 + 2 + t1.value AS v
  FROM t1 JOIN t2
  WHERE
    t1.id = t2.id AND
    t2.id > 50 * 1000) tmp

Aggregate [sum(v)]
  Project [t1.id, 1 + 2 + t1.value AS v]
    Filter [t1.id = t2.id AND t2.id > 50 * 1000]
      Join
        Scan (t1)   Scan (t2)
Logical Plan
• A Logical Plan describes computation on datasets without defining how to conduct the computation.
• output: the list of attributes generated by this Logical Plan, e.g. [id, v]

Aggregate [sum(v)]
  Project [t1.id, 1 + 2 + t1.value AS v]
    Filter [t1.id = t2.id AND t2.id > 50 * 1000]
      Join
        Scan (t1)   Scan (t2)
Physical Plan
• A Physical Plan describes computation on datasets with specific definitions of how to conduct the computation.

Hash-Aggregate [sum(v)]
  Project [t1.id, 1 + 2 + t1.value AS v]
    Filter [t1.id = t2.id AND t2.id > 50 * 1000]
      Sort-Merge Join
        Parquet Scan (t1)   JSON Scan (t2)
Combining Multiple Rules

[Diagram: the query plan from the previous slides — Aggregate [sum(v)], Project [t1.id, 1 + 2 + t1.value AS v], Filter [t1.id = t2.id AND t2.id > 50 * 1000], Join, Scan (t1), Scan (t2) — shown before and after Predicate Pushdown, which moves the filter predicates down toward the scans of t1 and t2.]
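Predicate pushdown is easy to observe from the plans themselves. A minimal sketch, assuming a SparkSession named `spark` and a hypothetical Parquet dataset at /data/t2; the pushed predicate appears under PushedFilters in the scan node:

import spark.implicits._

val t2 = spark.read.parquet("/data/t2")   // hypothetical path

// The filter is written after the read, but the optimized and physical plans
// show it pushed down into the Parquet scan ("PushedFilters" in the output).
t2.filter($"id" > 50 * 1000).explain(true)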
Whole-Stage Code Generation
• Fuses multiple operators together into a single Java function, aiming to improve execution performance.
• Identifies chains of operators.
• Collapses a query into a single optimized function that eliminates virtual function calls and keeps intermediate data in CPU registers.
Performance optimizations:
• No virtual function dispatches
• Intermediate data kept in CPU registers instead of memory
• Loop unrolling
A sketch of inspecting the fused stages follows below.
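A minimal sketch of inspecting the fused stages and the generated code, assuming Spark 3.x and a SparkSession named `spark`; explain("codegen") prints the plan with WholeStageCodegen markers and dumps the generated Java source:

import spark.implicits._

val df = spark.range(0L, 1000L)
  .filter($"id" > 500)
  .selectExpr("sum(id)")

// Operators fused into one stage appear under a single WholeStageCodegen node.
df.explain("codegen")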
Tree Transformations
Developers express tree transformations as PartialFunction[TreeType,TreeType]:
1. If the function applies to an operator, that operator is replaced with the result.
2. When the function does not apply to an operator, that operator is left unchanged.
3. The transformation is applied recursively to all children (a toy sketch follows below).
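The snippet below is a toy illustration of this pattern, not Catalyst's real classes: a tiny expression tree, a transform helper that applies a partial function where it matches and recurses into children, and a constant-folding rule in the spirit of 1 + 2 + t1.value ==> 3 + t1.value:

// Toy expression tree (not Catalyst's classes).
sealed trait Expr
case class Literal(v: Int) extends Expr
case class Add(left: Expr, right: Expr) extends Expr
case class Attribute(name: String) extends Expr

def transformDown(e: Expr)(rule: PartialFunction[Expr, Expr]): Expr = {
  // 1./2. apply the rule where it matches, otherwise leave the node unchanged
  val applied = if (rule.isDefinedAt(e)) rule(e) else e
  // 3. recurse into all children
  applied match {
    case Add(l, r) => Add(transformDown(l)(rule), transformDown(r)(rule))
    case other     => other
  }
}

// Constant folding on: 1 + 2 + t1.value
val expr   = Add(Add(Literal(1), Literal(2)), Attribute("t1.value"))
val folded = transformDown(expr) {
  case Add(Literal(a), Literal(b)) => Literal(a + b)
}
// folded == Add(Literal(3), Attribute("t1.value"))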
An example query
SELECT name
FROM (
  SELECT id, name
  FROM People) p
WHERE p.id = 1

Logical Plan:
Project [name]
  Filter [id = 1]
    Project [id, name]
      People
Naive Query Planning
SELECT name
FROM (
  SELECT id, name
  FROM People) p
WHERE p.id = 1

Logical Plan:
Project [name]
  Filter [id = 1]
    Project [id, name]
      People

Physical Plan:
Project [name]
  Filter [id = 1]
    Project [id, name]
      TableScan (People)
Writing Rules as Tree Transformations
1. Find filters on top of projections.
2. Check that the filter can be evaluated without the result of the project.
3. If so, switch the operators.

Original Plan:
Project [name]
  Filter [id = 1]
    Project [id, name]
      People

Filter Push-Down:
Project [name]
  Project [id, name]
    Filter [id = 1]
      People
Filter Push Down Transformation
val newPlan = queryPlan transform {
  case f @ Filter(_, p @ Project(_, grandChild))
      if (f.references subsetOf grandChild.output) =>
    p.copy(child = f.copy(child = grandChild))
}
Filter Push Down Transformation
val newPlan = queryPlan transform {
  case f @ Filter(_, p @ Project(_, grandChild))
      if (f.references subsetOf grandChild.output) =>
    p.copy(child = f.copy(child = grandChild))
}
Tree: queryPlan. Partial Function: the { case ... } block passed to transform.
Filter Push Down Transformation
val newPlan = queryPlan transform {
  case f @ Filter(_, p @ Project(_, grandChild))
      if (f.references subsetOf grandChild.output) =>
    p.copy(child = f.copy(child = grandChild))
}
Find Filter on Project: the case pattern matches a Filter whose immediate child is a Project.
Filter Push Down Transformation
val newPlan = queryPlan transform {
  case f @ Filter(_, p @ Project(_, grandChild))
      if (f.references subsetOf grandChild.output) =>
    p.copy(child = f.copy(child = grandChild))
}
Check that the filter can be evaluated without the result of the project (the guard: f.references subsetOf grandChild.output).
Filter Push Down Transformation
val newPlan = queryPlan transform {
  case f @ Filter(_, p @ Project(_, grandChild))
      if (f.references subsetOf grandChild.output) =>
    p.copy(child = f.copy(child = grandChild))
}
If so, switch the order: the Project becomes the parent and the Filter is pushed down onto the grandchild.
Optimizing with Rules

Original Plan:
Project [name]
  Filter [id = 1]
    Project [id, name]
      People

Filter Push-Down:
Project [name]
  Project [id, name]
    Filter [id = 1]
      People

Combine Projection:
Project [name]
  Filter [id = 1]
    People

IndexLookup [id = 1], return: name
Shuffle Hash Join
A Shuffle Hash Join is the most basic type of join and goes back to MapReduce fundamentals (see the sketch after this list).
• Map over the two different data frames/tables.
• Use the fields in the join condition as the output key.
• Shuffle both datasets by the output key.
• In the reduce phase, join the two datasets: all rows of both tables with the same key are now on the same machine and are sorted.
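A minimal sketch of requesting this join strategy explicitly, assuming Spark 3.x join hints, a SparkSession named `spark`, and hypothetical Parquet paths; without a hint, Spark may well choose a sort-merge or broadcast join instead:

val people = spark.read.parquet("/data/people_in_the_us")   // hypothetical paths
val states = spark.read.parquet("/data/states")

// The "shuffle_hash" hint asks Spark to use a shuffle hash join on the hinted side.
val joined = people.join(
  states.hint("shuffle_hash"),
  people("state") === states("name"))

joined.explain()   // the physical plan should show a ShuffledHashJoin node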
Shuffle Hash Join

[Diagram: Table 1 and Table 2 go through MAP, SHUFFLE, and REDUCE phases, producing one joined output per reducer.]
Shuffle Hash Join Performance
Works best when the DFs:
• Distribute evenly with the key you are joining on.
• Have an adequate number of keys for parallelism.

join_rdd = sqlContext.sql("""
  SELECT *
  FROM people_in_the_us
  JOIN states
  ON people_in_the_us.state = states.name""")
Uneven Sharding & Limited Parallelism

[Diagram: US DF partitions 1..N joined against a small state DF; **all** the data for CA lands in one task and **all** the data for RI in another.]

Problems:
● Uneven sharding — all the rows for a given state (e.g. CA, RI) end up on a single machine.
● Limited parallelism — all of the US data is shuffled into only 50 output partitions, one key per state.
● A larger Spark cluster will not solve these problems!
More Performance Considerations

join_rdd = sqlContext.sql("""
  SELECT *
  FROM people_in_california
  LEFT JOIN all_the_people_in_the_world
  ON people_in_california.id = all_the_people_in_the_world.id""")

Final output keys = # of people in CA, so we don't need a huge Spark cluster, right?

Left Join - Shuffle Step
The size of the Spark cluster needed to run this job is limited by the large table rather than the medium-sized table.
Not a problem here:
● Even sharding
● Good parallelism
But the shuffle moves all the data from both tables across the network before any keys are dropped.

[Diagram: All CA DF and All World DF — all the data from both tables is shuffled, then the final joined output is produced.]
A Better Solution
Filter the World DF down to only the entries that match a CA ID, then join (see the sketch below).

Benefits:
● Less data shuffled over the network and less shuffle space needed.
● More transforms, but still faster.

[Diagram: a Filter transform turns the All World DF into a Partial World DF, which is shuffled and joined with the All CA DF to produce the final joined output.]
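One way to realize the "filter before the big join" idea is a left-semi join on the CA ids. A minimal sketch, assuming a SparkSession named `spark` and hypothetical Parquet paths; this is one possible implementation, not the only one:

val caPeople    = spark.read.parquet("/data/people_in_california")        // hypothetical paths
val worldPeople = spark.read.parquet("/data/all_the_people_in_the_world")

// The left-semi join keeps only world rows whose id appears in the CA set,
// so the real join that follows works on a much smaller "partial world" DF.
val partialWorld = worldPeople.join(caPeople.select("id"), Seq("id"), "left_semi")

val joined = caPeople.join(partialWorld, Seq("id"), "left")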
Broadcast Hash Join

Optimization: when one of the DFs is small enough to fit in memory on a single machine, broadcast it to every executor (see the sketch below).
Parallelism of the large DF is maintained (n output partitions), and no shuffle is needed at all.

[Diagram: the Small DF is broadcast to each partition of the Large DF (Partition 1 .. Partition N), which then join locally.]
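A minimal sketch of asking for a broadcast hash join with the broadcast() hint, assuming a SparkSession named `spark` and hypothetical paths; Spark also broadcasts automatically when a side is below spark.sql.autoBroadcastJoinThreshold:

import org.apache.spark.sql.functions.broadcast

val largeDF = spark.read.parquet("/data/large")   // hypothetical paths
val smallDF = spark.read.parquet("/data/small")

// broadcast() hints Spark to ship smallDF to every executor; the large side
// keeps its partitioning and no shuffle is needed for the join itself.
val joined = largeDF.join(broadcast(smallDF), Seq("id"))
joined.explain()   // the physical plan should show a BroadcastHashJoin node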
Thank You