WIFI SSID:SparkAISummit | Password: UnifiedAnalytics
Daniel Tomes, Databricks
Spark Core – Proper Optimization
Me
• Norman, OK
– Undergrad OU – SOONER
– Masters – OK State
• ConocoPhillips
• Raleigh, NC
• Cloudera
• Databricks
/in/tomes
Talking Points
• Spark Hierarchy
• The Spark UI
• Rightsizing & Optimizing
• Advanced Optimizations
Spark Hierarchy
• Actions are eager
– Made of transformations (lazy)
• narrow
• wide (requires shuffle)
– Actions spawn jobs
• Jobs spawn stages
– Stages spawn tasks
» Tasks do the work & utilize the hardware
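For instance (a minimal sketch; `people` is a hypothetical DataFrame):

```scala
import org.apache.spark.sql.functions.col

// Transformation: lazy – this only builds the plan, nothing runs yet
val adults = people.filter(col("age") >= 18)   // `people` is hypothetical

// Action: eager – spawns a job, which spawns stages, which spawn tasks
val n = adults.count()
```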
Navigating The Spark UI
DEMO
Understand Your Hardware
• Core Count & Speed
• Memory Per Core (Working & Storage)
• Local Disk Type, Count, Size, & Speed
• Network Speed & Topology
• Data Lake Properties (rate limits)
• Cost / Core / Hour
– Financial For Cloud
– Opportunity for Shared & On Prem
Get A Baseline
• Is your action efficient?
– Long Stages, Spills, Laggard Tasks, etc?
• CPU Utilization
– GANGLIA / YARN / Etc
– Tails
Goal
Minimize Data Scans (Lazy Load)
• Data Skipping
– HIVE Partitions
– Bucketing
• Only Experts – Nearly Impossible to Maintain
– Databricks Delta Z-Ordering
• What is It
• How To Do It
[Screenshots: "No Lazy Loading" vs. "With Lazy Loading" – the lazy-loaded run shows a simple plan with extra shuffle partitions]
[Screenshots: scan "Without Partition Filter" vs. "With Partition Filter" – shrink the partition range using a filter on the HIVE-partitioned column]
Partitions – Definition
Each of a number of portions into which some
operating systems divide memory or storage
HIVE PARTITION ≠ SPARK PARTITION
Spark Partitions – Types
• Input
– Controls - Size
• spark.default.parallelism (don’t use)
• spark.sql.files.maxPartitionBytes (mutable)
– assuming source has sufficient partitions
• Shuffle
– Control = Count
• spark.sql.shuffle.partitions
• Output
– Control = Size
• Coalesce(n) to shrink
• Repartition(n) to increase and/or balance (shuffle)
• df.write.option("maxRecordsPerFile", N)
Partitions – Shuffle – Default
Default = 200 Shuffle Partitions
Partitions – Right Sizing – Shuffle – Master Equation
• Largest Shuffle Stage
– Target Size <= 200 MB/partition
• Partition Count = Stage Input Data / Target Size
– Solve for Partition Count
EXAMPLE
Shuffle Stage Input = 210GB
x = 210000MB / 200MB = 1050
spark.conf.set("spark.sql.shuffle.partitions", 1050)
BUT -> If cluster has 2000 cores
spark.conf.set("spark.sql.shuffle.partitions", 2000)
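As a sketch, the rule above can be scripted like this (the helper and its defaults are illustrative, not part of the deck):

```scala
// Sketch of the sizing rule: target <= 200MB per shuffle partition,
// but never fewer partitions than the cluster has cores.
def shufflePartitionCount(stageInputMB: Long, clusterCores: Int, targetMB: Int = 200): Int = {
  val bySize = math.ceil(stageInputMB.toDouble / targetMB).toInt
  math.max(bySize, clusterCores)   // don't leave cores idle
}

// 210000MB / 200MB = 1050; on a 2000-core cluster, max() lifts it to 2000
spark.conf.set("spark.sql.shuffle.partitions", shufflePartitionCount(210000L, 96).toString)
```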
Stage 21 -> Shuffle Fed By Stage 19 & 20
THUS
Stage 21 Shuffle Input = 45.4g + 8.6g == 54g
Default Shuffle Partitions == 200 -> 54000mb / 200 parts ≈ 270mb per shuffle partition
Cluster Spec
96 cores @ 7.625g/core
3.8125g Working Mem
3.8125g Storage Mem
Spills
Cluster Spec
96 cores @ 7.625g/core
3.8125g Working Mem
3.8125g Storage Mem
480 shuffle partitions – WHY?
Target shuffle partition size == 100m
p = 54g / 100m == 540
540p / 96 cores == 5.625 waves
96 cores * 5 full waves == 480
If p == 540, a final wave of only 60 partitions would have to be
loaded and processed after the first 5 full waves complete
NO SPILL
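A sketch of the wave-rounding logic above (helper name illustrative):

```scala
// Sketch: round the raw partition count down to a whole number of task
// waves (multiples of total cores) so no small trailing wave is left over.
def fullWaves(rawPartitions: Int, clusterCores: Int): Int = {
  val waves = math.max(1, rawPartitions / clusterCores)  // 540 / 96 = 5 full waves
  waves * clusterCores                                   // 5 * 96 = 480
}

// 54g / 100m target = 540 raw partitions -> 480 after rounding;
// each partition grows to ~115mb, still well under working memory -> no spill
spark.conf.set("spark.sql.shuffle.partitions", fullWaves(540, 96).toString)
```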
Input Partitions – Right Sizing
• Use Spark Defaults (128MB) unless…
– Increase Parallelism
– Heavily Nested/Repetitive Data
– Generating Data – i.e. Explode
– Source Structure is not optimal (upstream)
– UDFs
spark.conf.set("spark.sql.files.maxPartitionBytes", 16777216)
[Screenshots: the same scan with 128mb input partitions (default) vs. 16mb]
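A sketch of the input-partition shrink shown above (the path is illustrative):

```scala
// 16MB input partitions (16 * 1024 * 1024 = 16777216 bytes) instead of the
// 128MB default -> roughly 8x more input tasks for the same scan.
spark.conf.set("spark.sql.files.maxPartitionBytes", 16777216L)

val nested = spark.read.parquet("/data/nested")  // hypothetical path
println(nested.rdd.getNumPartitions)             // ~ total input bytes / 16MB, for splittable files
```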
Output Partitions – Right Sizing
• Write Once -> Read Many
– More Time to Write but Faster to Read
• Perfect writes limit parallelism
– Compactions (minor & major)
Write Data Size = 14.7GB
Desired File Size = 1500MB
Max write stage parallelism = 10
96 – 10 == 86 cores idle during write
[Screenshots: only 10 cores used -> average file size == 1.5g; all 96 cores used -> average file size == 0.16g]
Output Partitions – Composition
• df.write.option("maxRecordsPerFile", n)
• df.coalesce(n).write…
• df.repartition(n).write…
• df.repartition(n, [colA, …]).write…
• spark.conf.set("spark.sql.shuffle.partitions", n)
• df.localCheckpoint(…).repartition(n).write…
• df.localCheckpoint(…).coalesce(n).write…
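A sketch combining two of the levers above for the 14.7GB example: localCheckpoint keeps the upstream work at full parallelism, and only the final write runs with 10 partitions (~1.5g files). Paths are illustrative:

```scala
// Upstream transforms run on all 96 cores and are materialized by
// localCheckpoint; only the write stage is constrained to 10 partitions,
// instead of starving 86 of 96 cores for the whole job.
df.localCheckpoint()
  .repartition(10)
  .write
  .mode("overwrite")
  .parquet("/data/out")   // hypothetical output path
```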
Partitions – Why So Serious?
• Avoid The Spill
• Maximize Parallelism
– Utilize All Cores
– Provision only the cores you need
Advanced Optimizations
• Finding Imbalances
• Persisting
• Join Optimizations
• Handling Skew
• Expensive Operations
• UDFs
• Multi-Dimensional Parallelism
Balance
• Maximizing Resources Requires Balance
– Task Duration
– Partition Size
• SKEW
– When some partitions are significantly larger than most
Input Partitions
Shuffle Partitions
Output Files
Spills
GC Times
Straggling Tasks
75th percentile ~ 2m recs
max ~ 45m recs
stragglers take > 22X longer IF no spillage
With spillage, 100X longer
Minimize Data Scans (Persistence)
• Persistence
– Not Free
• Repetition
– SQL Plan
df.cache == df.persist(StorageLevel.MEMORY_AND_DISK)
• Types
– Default (MEMORY_AND_DISK)
• Deserialized
– Deserialized = Faster = Bigger
– Serialized = Slower = Smaller
– _2 = Safety = 2X bigger
– MEMORY_ONLY
– DISK_ONLY
Don’t Forget To Cleanup!
df.unpersist()
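A sketch of the persist/unpersist lifecycle (the storage level choice and path are illustrative):

```scala
import org.apache.spark.storage.StorageLevel

// Persist a DataFrame that feeds several actions, then release it.
val hot = spark.read.parquet("/data/hot")      // hypothetical path
  .persist(StorageLevel.MEMORY_AND_DISK_SER)   // serialized: slower but smaller

hot.count()      // materialize the cache once, before the dependent jobs
// ... multiple downstream jobs reuse `hot` without re-scanning the source ...
hot.unpersist()  // don't forget to clean up
```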
[Screenshot: TPCDS Query 4 – repetition in the SQL plan]
Minimize Data Scans (Delta Cache)
CACHE SELECT column_name[, column_name, ...] FROM [db_name.]table_name [ WHERE boolean_expression ]
Why So Fast?
HOW TO USE
AWS - i3s – On By Default
AZURE – Ls-series – On By Default
spark.databricks.io.cache.enabled true
spark.databricks.io.cache.maxDiskUsage 50g
spark.databricks.io.cache.maxMetaDataCache 1g
spark.databricks.io.cache.compression.enabled false
Hot Data Auto Cached
Super Fast
Relieves Memory Pressure
Join Optimization
• SortMergeJoins (Standard)
• Broadcast Joins (Fastest)
• Skew Joins
• Range Joins
• BroadcastNestedLoop Joins (BNLJ)
Join Optimization
• SortMerge Join – Both sides are large
• Broadcast Joins – One side is small
– Automatic If:
(one side < spark.sql.autoBroadcastJoinThreshold) (default 10MB)
– Risks
• Not Enough Driver Memory
• DF > spark.driver.maxResultSize
• DF > Single Executor Available Working Memory
– Prod – Mitigate The Risks
• Validation Functions
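A sketch of an explicit broadcast wrapped in a validation guard of the kind mentioned above (the row-count threshold and table names are illustrative):

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.broadcast

// Only force the broadcast when the small side is provably small,
// so a surprise-large side doesn't blow up the driver or an executor.
def broadcastIfSmall(df: DataFrame, maxRows: Long = 1000000L): DataFrame =
  if (df.count() <= maxRows) broadcast(df) else df  // hypothetical guard

// `orders` and `customers` are hypothetical DataFrames
val joined = orders.join(broadcastIfSmall(customers),
  orders("o_custId") === customers("c_custId"))
```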
Persistence Vs. Broadcast
Persistence: attempts to send the compute to the data
Broadcast: data availability guaranteed – each executor has the entire dataset
[Screenshots: the same 12gb dataset persisted vs. broadcast]
[Screenshots: 126MB vs. 270MB]
From 6h and barely started
TO 8m -> Lazy Loading
TO 2.5m -> Broadcast
SQL Example:
SELECT /*+ BROADCAST(customers) */ * FROM customers, orders WHERE o_custId = c_custId
Skew Join Optimization
• OSS Fix – Salting (see the sketch below)
– Add a column to each side with a random int between 0 and spark.sql.shuffle.partitions – 1
– Add the generated column to the join clause
– Drop the temp columns from the result
• Databricks Fix (Skew Join)
val skewedKeys = List("id1", "id200", "id-99")
df.join(
  skewDF.hint("SKEW", "skewKey", skewedKeys),
  Seq(keyCol), "inner")
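A sketch of the OSS salting recipe above (names illustrative; a smaller salt range than spark.sql.shuffle.partitions is common, since the small side is replicated once per salt value):

```scala
import org.apache.spark.sql.functions.{array, explode, floor, lit, rand}

// Big side gets a random salt per row; the smaller side is exploded once per
// salt value so every (key, salt) pair on the big side finds its match.
val salts = 32  // illustrative; the slide uses spark.sql.shuffle.partitions - 1

val saltedBig   = bigDF.withColumn("salt", floor(rand() * salts).cast("int"))
val saltedSmall = smallDF.withColumn(
  "salt", explode(array((0 until salts).map(i => lit(i)): _*)))

val joined = saltedBig
  .join(saltedSmall, Seq("key", "salt"), "inner")  // "key" is hypothetical
  .drop("salt")
```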
Skewed Aggregates
df.groupBy("city", "state").agg(<f(x)>).orderBy(col.desc)

// Salted version: the salt must vary per row, so generate it inside the
// DataFrame (a single driver-side random would put every row in one bucket)
val numParts = spark.conf.get("spark.sql.shuffle.partitions").toInt
df.withColumn("salt", (rand() * numParts).cast("int"))
  .groupBy("city", "state", "salt")
  .agg(<f(x)>)
  .drop("salt")  // for non-additive aggregates, re-aggregate without the salt to merge partials
  .orderBy(col.desc)
BroadcastNestedLoopJoin (BNLJ)
Range Join Optimization
• Range Joins Types
– Point In Interval Range Join
• Predicate specifies value in one relation that is between two values from the other relation
– Interval Overlap Range Join
• Predicate specifies an overlap of intervals between two values from each relation
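For instance, a point-in-interval predicate in DataFrame form (a sketch with illustrative names):

```scala
// Point-in-interval range join: each point row matches the interval rows
// that contain its value. `points` and `intervals` are hypothetical.
val matched = points.join(
  intervals,
  points("value") >= intervals("start") && points("value") < intervals("end"))
```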
Omit Expensive Ops
• Repartition
– Use Coalesce or Shuffle Partition Count
• Count – Do you really need it?
• DistinctCount
– use approx_count_distinct() (see the sketch below)
• If distincts are required, put them in the right place
– Use dropDuplicates
– dropDuplicates BEFORE the join
– dropDuplicates BEFORE the groupBy
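A sketch contrasting the exact and approximate distinct counts (the column is illustrative):

```scala
import org.apache.spark.sql.functions.{approx_count_distinct, countDistinct}

// Exact distinct count: a full wide aggregation over every value.
df.agg(countDistinct("userId")).show()               // hypothetical column

// HyperLogLog++ estimate: far cheaper; rsd is the max relative error.
df.agg(approx_count_distinct("userId", rsd = 0.05)).show()
```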
UDF Penalties
• Traditional UDFs cannot use Tungsten
– Use org.apache.spark.sql.functions
– Use PandasUDFs
• Utilizes Apache Arrow
– Use SparkR UDFs
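A sketch of the first bullet: replacing an opaque UDF with its built-in equivalent (column name illustrative):

```scala
import org.apache.spark.sql.functions.{col, udf, upper}

// The UDF is a black box: rows are deserialized out of Tungsten's binary
// format, and the optimizer can't reason about or push through it.
val upperUdf = udf((s: String) => if (s == null) null else s.toUpperCase)
df.select(upperUdf(col("name")))   // slow path (hypothetical column)

// The built-in stays inside codegen and the optimizer.
df.select(upper(col("name")))
```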
Advanced Parallelism
• Spark’s Three Levels of Parallelism
– Driver Parallelism
– Horizontal Parallelism
– Executor Parallelism
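As a hedged illustration of driver parallelism, independent jobs can be submitted concurrently from the driver, e.g. with Scala futures, so the scheduler can fill otherwise idle cores (paths are illustrative):

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration

// Two independent jobs run concurrently; Spark's FIFO/FAIR scheduler
// interleaves their tasks across the executors.
val jobs = Seq("/data/a", "/data/b").map { path =>   // hypothetical paths
  Future { spark.read.parquet(path).count() }
}
val counts = Await.result(Future.sequence(jobs), Duration.Inf)
```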
Summary
• Utilize Lazy Loading (Data Skipping)
• Maximize Your Hardware
• Right Size Spark Partitions
• Balance
• Optimized Joins
• Minimize Data Movement
• Minimize Repetition
• Only Use Vectorized UDFs
QUESTIONS
DON’T FORGET TO RATE
AND REVIEW THE SESSIONS
SEARCH SPARK + AI SUMMIT