Will Manning + Matt Cheah, Palantir Technologies
Reliable Performance at Scale
with Spark on Kubernetes
#UnifiedDataAnalytics #SparkAISummit
About us: Will Manning
(Slide timeline markers: 2013, 2015, 2016)
– Joined Palantir
– First production adopter of YARN & Parquet
– Helped form our Data Flows & Analytics product groups
– Responsible for Engineering and Architecture for Compute (including Spark)
About us: Matt Cheah
(Slide timeline markers: 2014, 2018)
– Joined Palantir
– Migrated Spark cluster management from standalone to YARN, and later from YARN to Kubernetes
– Spark committer / open source Spark developer
Agenda
1. A (Very) Quick Primer on Palantir
2. Why We Moved from YARN
3. Spark on Kubernetes
4. Key Production Challenges
   – Kubernetes Scheduling
   – Shuffle Resiliency
A (Very) Quick Primer on Palantir
… and how Spark on Kubernetes helps power Palantir Foundry
Who are we?
Headquartered: Palo Alto, CA
Presence: Global
Employees: ~2500 / Mostly Engineers
Founded: 2004
Software: Data Integration
Supporting Counterterrorism
From intelligence operations
to mission planning in the field.
Energy
Institutions must evolve or die.
Technology and data are driving
this evolution.
Manufacturing
Ferrari uses Palantir Foundry
to increase performance + reliability.
Aviation Safety
Palantir and Airbus founded
Skywise to help make air travel
safer and more economical.
Cancer Research
Syntropy brings together the greatest
minds and institutions to advance
research toward the common goal
of improving human lives.
Products Built
for a Purpose
Integrate, manage, secure,
and analyze all of your
enterprise data.
Amplify and extend the power
of data integration.
Enabling analytics in Palantir Foundry
Executing untrusted code on behalf of trusted users in a multitenant environment
– Users can author code (e.g., using Spark SQL
or PySpark) to define complex data
transformations or to perform analysis
– Our users want to write code once and have it
keep working the same way indefinitely
– Foundry is responsible for executing arbitrary
code on users’ behalf securely
– Even though the customer might trust the
user, Palantir infrastructure can’t
Enabling collaboration across organizations
Using Spark on Kubernetes to enable multitenant compute
Engineers from Airline A
Airbus Employees
Engineers from Airline B
Enabling analytics in Palantir Foundry
Executing untrusted code on behalf of trusted users in a multitenant environment
– Repeatable performance: users want to write code once and have it keep working the same way indefinitely
– Security: Foundry is responsible for executing arbitrary code on users’ behalf securely; even though the customer might trust the user, Palantir infrastructure can’t
Why We Moved from YARN
Hadoop/YARN Security
Two modes
1. Kerberos
2. None (only mode until Hadoop 2.x)
NB: I recommend reading Steve Loughran’s “Madness Beyond the Gate” to learn more
Hadoop/YARN Security
Containerization as of Hadoop 3.1.1 (late 2018)
Performance in YARN
YARN’s scheduler attempts to maximize utilization
Spark on YARN with dynamic allocation is great at:
– extracting maximum performance from static resources
– providing bursts of resources for one-off batch work
– running “just one more thing”
YARN: Clown Car Scheduling
(Image Credit: 20th Century Fox Television)
Performance in YARN
YARN’s scheduler attempts to maximize utilization
Spark on YARN with dynamic allocation is terrible at:
– providing consistency from run to run
– isolating performance between different users/tenants
(i.e., if you kick off a big job, then my job is likely to run slower)
So… Kubernetes?
✓ Native containerization
✓ Extreme extensibility (e.g., scheduler, networking/firewalls)
✓ Active community with a fast-moving code base
✓ Single platform for microservices and compute*
*Spoiler alert: the Kubernetes scheduler is excellent for web services, not optimized for batch
Spark on Kubernetes
Timeline
Sep ’16: Begin initial prototype of Spark on K8s
Nov ’16: Establish working group with community
Mar ’17: Minimal integration complete, begin experimental deployment
Oct ’17: First PR merged into upstream master
Jan ’18: Begin first migration from YARN to K8s
Jun ’18: Complete first migration from YARN to K8s in production
Spark on K8s Architecture
Client runs spark-submit with arguments
spark-submit converts arguments into a PodSpec for the driver
K8s-specific implementation of SchedulerBackend interface requests
executor pods in batches
Spark on K8s Architecture
spark-submit input:
    …
    --master k8s://example.com:8443
    --conf spark.kubernetes.image=example.com/appImage
    --conf …
    com.palantir.spark.app.main.Main
spark-submit output (CREATE DRIVER POD):
    …
    spec:
      containers:
      - name: example
        image: example.com/appImage
        command: ["/opt/spark/entrypoint.sh"]
        args: ["--driver", "--class", "com.palantir.spark.app.main.Main"]
    …
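As a rough illustration of the conversion above, the following Python sketch maps a few submit arguments onto a driver pod spec shaped like the output on the slide. The function name `build_driver_pod_spec` and the dictionary layout are assumptions made for illustration; the real conversion happens inside Spark's Kubernetes submission code and sets many more fields.

```python
from typing import Dict, List


def build_driver_pod_spec(conf: Dict[str, str], main_class: str,
                          container_name: str = "example") -> dict:
    """Hypothetical helper: turn submit arguments into a minimal driver pod spec."""
    image = conf["spark.kubernetes.image"]  # key as written on the slide
    args: List[str] = ["--driver", "--class", main_class]
    return {
        "spec": {
            "containers": [
                {
                    "name": container_name,
                    "image": image,
                    "command": ["/opt/spark/entrypoint.sh"],
                    "args": args,
                }
            ]
        }
    }


spec = build_driver_pod_spec(
    {"spark.kubernetes.image": "example.com/appImage"},
    "com.palantir.spark.app.main.Main",
)
print(spec["spec"]["containers"][0]["args"])
```

The point of the sketch is only that the driver is an ordinary pod: once the spec exists, Kubernetes treats it like any other workload.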
Spark on K8s Architecture
With spark.executor.instances = 4 and spark.kubernetes.allocation.batch.size = 2:
1. Driver Pod: CREATE 2 EXECUTOR PODS
2. Executor Pods #1 and #2 REGISTER with the Driver Pod
3. Driver Pod: CREATE 2 MORE EXECUTOR PODS
4. Executor Pods #3 and #4 REGISTER with the Driver Pod
5. Driver Pod: RUN TASKS on Executor Pods #1 to #4
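The batched executor requests in this walkthrough can be modeled with a few lines of Python. This is a simplified sketch, not Spark's scheduler backend: the function name `allocate_in_batches` is hypothetical, and the waiting between rounds is left out.

```python
from typing import List


def allocate_in_batches(target_executors: int, batch_size: int) -> List[List[int]]:
    """Return executor IDs grouped by the allocation round that requests them."""
    rounds: List[List[int]] = []
    next_id = 1
    while next_id <= target_executors:
        last_id = min(next_id + batch_size - 1, target_executors)
        rounds.append(list(range(next_id, last_id + 1)))
        next_id = last_id + 1
    return rounds


# Values from the slide: spark.executor.instances = 4, batch size = 2.
for round_no, batch in enumerate(allocate_in_batches(4, 2), start=1):
    print(f"round {round_no}: create executor pods {batch}")
```

With the slide's values this yields two rounds of two pods each, matching the "CREATE: 2" then "CREATE: 2 MORE" steps.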
Early Challenge: Disk Performance
YARN: one Node Manager (Memory = M) runs two executors (Heap = E each) over a shared file system
– OS File System Cache: Max Size = M – 2E
Kubernetes: each Executor Pod (Memory = C) runs one executor (Heap = E = ~C) over its own file system
– OS File System Cache: Max Size = C – E = ~0
Early Challenge: Disk Performance
OS filesystem cache is now container-local
– i.e., dramatically smaller
– disks must be fast without hitting FS cache
– Solution: use NVMe drives for temp storage
Docker disk interface is slower than direct disk access
– Solution: Use EmptyDir volumes for temp storage
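To make the cache arithmetic from the diagram concrete (M – 2E on a YARN node versus C – E inside a pod), here is a small worked example in Python. The memory figures are hypothetical and chosen only to show the order-of-magnitude difference; they are not measurements from the talk.

```python
def yarn_cache_headroom_gb(node_memory_gb: int, executor_heap_gb: int,
                           executors_per_node: int) -> int:
    """Page-cache ceiling on a shared YARN node: M - n * E."""
    return node_memory_gb - executors_per_node * executor_heap_gb


def pod_cache_headroom_gb(pod_memory_gb: int, executor_heap_gb: int) -> int:
    """Page-cache ceiling inside a single executor pod: C - E."""
    return pod_memory_gb - executor_heap_gb


# Hypothetical sizes: a 128 GB node with two 40 GB heaps vs. a 44 GB pod.
print(yarn_cache_headroom_gb(128, 40, 2))  # shared cache available on the node
print(pod_cache_headroom_gb(44, 40))       # cache left inside one pod
```

Under these assumed sizes the pod keeps only a sliver of memory for file system cache, which is why fast local disks matter more on Kubernetes.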
Key Production Challenges
Challenge I: Kubernetes Scheduling
– The built-in Kubernetes scheduler isn’t really designed
for distributed batch workloads (e.g., MapReduce)
– Historically, optimized for microservice instances or
single-pod, one-off jobs (what k8s natively supports)
Challenge II: Shuffle Resiliency
– External Shuffle Service is unavailable in Kubernetes
– Jobs must be written more carefully to avoid executor
failure (e.g., OOM) and the subsequent need to
recompute lost blocks
Kubernetes Scheduling
Reliable, Repeatable Runtimes
Recall: A key goal is to make runtimes of the same workload consistent from run to run
Corollary: When a driver is deployed and wants to do work, it should receive the same resources from run to run
Problem: Using the vanilla k8s scheduler led to partial starvation as the cluster became saturated
Kubernetes Scheduling
[Diagram, two scheduling rounds: pods P1, P2 and P3 wait in the scheduling queue. P1 and P3 move to the running pods, while P2 is sent “back to the end of the line!” and is still waiting after round 2.]
Naïve Spark-on-K8s Scheduling
[Diagram: in round 1 the driver pod D moves from the scheduling queue to the running pods. In round 2 its executor pods E1 and E2 enter the scheduling queue and are then scheduled next to D.]
The “1000 Drivers” Problem
[Diagram: in round 1, drivers D1, D2, … D1000 move from the scheduling queue to the running pods. In round 2, their executors E1, E2, … EN join the scheduling queue while the drivers already occupy the running pods. (Uh oh!)]
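A toy model helps show why scheduling pods one at a time goes wrong here. The sketch below assumes every pod takes exactly one slot and that drivers are all queued before any executor exists; the function name `naive_schedule` and the slot model are illustrative assumptions, not the Kubernetes scheduler's actual algorithm.

```python
from typing import Tuple


def naive_schedule(capacity: int, num_drivers: int,
                   executors_per_driver: int) -> Tuple[int, int]:
    """Schedule pods one at a time; return (running drivers, running executors)."""
    free = capacity
    # Round 1: only driver pods exist, so they claim slots first.
    running_drivers = min(num_drivers, free)
    free -= running_drivers
    # Round 2: every running driver now asks for its executors.
    wanted = running_drivers * executors_per_driver
    running_executors = min(wanted, free)
    return running_drivers, running_executors


# 1000 drivers on a cluster with 1000 slots: drivers fill the cluster.
print(naive_schedule(capacity=1000, num_drivers=1000, executors_per_driver=2))
# A lightly loaded cluster behaves fine.
print(naive_schedule(capacity=10, num_drivers=2, executors_per_driver=2))
```

In the saturated case every slot holds a driver and no executor can start, so no application makes progress even though the cluster is fully "utilized".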
So… Kubernetes?
✓ Native containerization
✓ Extreme extensibility (e.g., scheduler, networking/firewalls)
✓ Active community with a fast-moving code base
✓ Single platform for microservices and compute*
*Spoiler alert: the Kubernetes scheduler is excellent for web services, not optimized for batch
K8s Spark Scheduler
Idea: use the Kubernetes scheduler extender API to add
– Gang scheduling of drivers & executors
– FIFO (within instance groups)
K8s Spark Scheduler
Goal: build entirely with native K8s extension points
– Scheduler extender API: fn(pod, [node]) -> Option[node]
– Custom resource definition: ResourceReservation
– Driver annotations for resource requests (executors)
K8s Spark Scheduler
if driver
    get cluster usage (running pods + reservations)
    bin pack pending resource requests in FIFO order
    reserve resources (with CRD)
if executor
    find unbound reservation
    bind reservation
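The driver and executor branches above can be sketched in Python as an extender-style function of the shape fn(pod, [node]) -> Option[node]. This is a minimal model under stated assumptions, not the k8s-spark-scheduler source: every pod is assumed to need one slot, the class names are hypothetical, and `executors_wanted` stands in for the driver annotation.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Pod:
    name: str
    app: str
    role: str                  # "driver" or "executor"
    executors_wanted: int = 0  # stands in for the driver's annotation


@dataclass
class Reservation:
    node: str
    bound_to: Optional[str] = None  # executor pod name once bound


@dataclass
class GangScheduler:
    free: Dict[str, int]  # node name -> slots neither running nor reserved
    reservations: Dict[str, List[Reservation]] = field(default_factory=dict)

    def schedule(self, pod: Pod, nodes: List[str]) -> Optional[str]:
        """Pick a node for the pod, or None to leave it pending."""
        if pod.role == "driver":
            return self._schedule_driver(pod, nodes)
        return self._schedule_executor(pod)

    def _schedule_driver(self, pod: Pod, nodes: List[str]) -> Optional[str]:
        needed = 1 + pod.executors_wanted
        if sum(self.free[n] for n in nodes) < needed:
            return None  # all-or-nothing: reserve nothing if the gang does not fit
        placements: List[str] = []
        for node in nodes:  # first-fit packing, one slot per pod
            while self.free[node] > 0 and len(placements) < needed:
                self.free[node] -= 1
                placements.append(node)
        driver_node, executor_nodes = placements[0], placements[1:]
        self.reservations[pod.app] = [Reservation(n) for n in executor_nodes]
        return driver_node

    def _schedule_executor(self, pod: Pod) -> Optional[str]:
        for reservation in self.reservations.get(pod.app, []):
            if reservation.bound_to is None:
                reservation.bound_to = pod.name
                return reservation.node
        return None  # no unbound reservation for this application


def schedule_drivers_fifo(scheduler: GangScheduler, drivers: List[Pod],
                          nodes: List[str]) -> List[str]:
    """Admit drivers strictly in arrival order; stop at the first that does not fit."""
    admitted: List[str] = []
    for driver in drivers:
        if scheduler.schedule(driver, nodes) is None:
            break
        admitted.append(driver.name)
    return admitted


nodes = ["node-a", "node-b"]
scheduler = GangScheduler(free={"node-a": 3, "node-b": 3})
drivers = [Pod(f"d{i}", f"app{i}", "driver", executors_wanted=2) for i in (1, 2, 3)]
admitted = schedule_drivers_fifo(scheduler, drivers, nodes)
executor_node = scheduler.schedule(Pod("e1", "app1", "executor"), nodes)
print(admitted, executor_node)
```

Because a driver is only admitted when room for its executors can be reserved at the same time, the third application waits in the queue instead of starting a driver that could never get executors.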
K8s Spark Scheduler
[Diagram: in round 1, scheduling driver D1 also places reservations R1 and R2, while D2 … D1000 stay in the scheduling queue. In round 2+, executors E1 and E2 take the place of R1 and R2 next to D1.]
Spark-Aware Autoscaling
Idea: use the resource request annotations for unscheduled drivers to
project desired cluster size
– Again, we use a CRD to represent this unsatisfied Demand
– Sum resources for pods + reservations + demand
– In the interest of time, out of scope for today
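The talk leaves the autoscaler out of scope, so the following is only a guess at how the summation above could feed a size projection. Dividing by a per-node capacity, the CPU-only accounting, and the function name `desired_node_count` are all assumptions made for illustration.

```python
import math
from typing import Iterable


def desired_node_count(pod_cpus: Iterable[int], reservation_cpus: Iterable[int],
                       demand_cpus: Iterable[int], cpus_per_node: int) -> int:
    """Project cluster size from running pods + reservations + unsatisfied demand."""
    total = sum(pod_cpus) + sum(reservation_cpus) + sum(demand_cpus)
    return math.ceil(total / cpus_per_node)


# Hypothetical figures: 8 CPUs running, 4 reserved, 8 demanded, 8-CPU nodes.
print(desired_node_count([4, 4], [2, 2], [8], cpus_per_node=8))
```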
[Chart: Spark pods per day]
[Chart: Pod processing time]
“Soft” dynamic allocation
Problem: Static allocation wastes resources
Idea: Voluntarily give up executors that aren’t needed, no preemption (same from run to run)
Problem: No external shuffle, so executors store shuffle files
Idea: The driver already tracks shuffle file locations, so it can determine which executors are safe to give up
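The second idea, using the driver's knowledge of shuffle file locations to decide which executors can be released, might look roughly like this. The data shapes and the function name `releasable_executors` are hypothetical; Spark's own bookkeeping is more involved.

```python
from typing import Dict, Set


def releasable_executors(idle_executors: Set[str],
                         shuffle_locations: Dict[int, Set[str]],
                         active_shuffles: Set[int]) -> Set[str]:
    """Idle executors that hold no output for any shuffle still in use."""
    holding: Set[str] = set()
    for shuffle_id in active_shuffles:
        holding |= shuffle_locations.get(shuffle_id, set())
    return idle_executors - holding


idle = {"e1", "e2", "e3"}
locations = {0: {"e1"}, 1: {"e2"}}  # shuffle id -> executors storing its files
print(sorted(releasable_executors(idle, locations, active_shuffles={1})))
```

Here e2 is kept because shuffle 1 is still needed, while e1 only holds files for a shuffle that is no longer active and can be given up along with e3.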
“Soft” dynamic allocation
– Saves $$$$, ~no runtime variance if consistent from run to run
– Inspired by a 2018 Databricks blog post [1]
– Merged into our fork in ~January 2019
– Recently adapted by @vanzin (thanks!) and merged upstream [2]
[1] https://databricks.com/blog/2018/05/02/introducing-databricks-optimized-auto-scaling.html
[2] https://github.com/apache/spark/pull/24817
K8s Spark Scheduler
See our engineering blog on Medium
– https://medium.com/palantir/spark-scheduling-in-kubernetes-4976333235f3
Or, check out the source for yourself! (Apache v2 license)
– https://github.com/palantir/k8s-spark-scheduler
Shuffle Resiliency
Shuffle Resiliency
Shuffles have a map side and a reduce side
— Mapper executors write temporary data to local disk
— Reducer executors contact mapper executors to retrieve written data
[Diagram: a mapper executor and a reducer executor, each inside a YARN Node Manager with its own local disk]
Shuffle Resiliency
If an executor crashes, all data written by that executor’s map tasks is lost
— Spark must re-schedule the map tasks on other executors
— Cannot remove executors to save resources because we would lose shuffle files
Shuffle Resiliency
Spark’s external shuffle service preserves shuffle data in cases of executor loss
— Required to make preemption non-pathological
— Prevents loss of work when executors crash
[Diagram: each YARN Node Manager runs a shuffle service alongside its mapper or reducer executor and local disk]
Shuffle Resiliency
Problem: for security, containers have isolated storage
[Diagram: on YARN, the shuffle service and the mapper executor sit inside the same Node Manager with one local disk; on Kubernetes, the mapper executor pod and a shuffle service pod each have their own local disk]
Shuffle Resiliency
Idea: asynchronously back up shuffle files to a distributed storage system
[Diagram: inside the executor pod, a map task thread and a backup thread both work against the local disk]
Shuffle Resiliency
1. Reducers first try to fetch from other executors
2. Download from remote storage if mapper is unreachable
[Diagram: reducer executor pods fetch directly from a live mapper executor pod, and fall back to remote storage for a dead mapper]
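The two ideas on the last slides, asynchronous backup on the mapper and fetch-with-fallback on the reducer, can be sketched together in Python. This is a toy model under loud assumptions: dictionaries stand in for the local disk and the distributed storage system, and `MapperExecutor` and `reducer_fetch` are hypothetical names, not the API proposed in SPARK-25299.

```python
import queue
import threading
from typing import Dict, Optional


class MapperExecutor:
    """Toy mapper: writes blocks locally and backs them up on a separate thread."""

    def __init__(self, remote_store: Dict[str, bytes]) -> None:
        self.local_disk: Dict[str, bytes] = {}
        self.alive = True
        self._remote = remote_store
        self._pending: "queue.Queue[Optional[str]]" = queue.Queue()
        self._backup = threading.Thread(target=self._backup_loop, daemon=True)
        self._backup.start()

    def write_shuffle_block(self, block_id: str, data: bytes) -> None:
        self.local_disk[block_id] = data  # the map task writes locally first
        self._pending.put(block_id)       # the upload happens off the task's path

    def _backup_loop(self) -> None:
        while True:
            block_id = self._pending.get()
            if block_id is None:          # sentinel: stop the backup thread
                break
            self._remote[block_id] = self.local_disk[block_id]

    def flush(self) -> None:
        """Wait until every queued block has been backed up."""
        self._pending.put(None)
        self._backup.join()

    def fetch(self, block_id: str) -> bytes:
        if not self.alive:
            raise ConnectionError("mapper unreachable")
        return self.local_disk[block_id]


def reducer_fetch(mapper: MapperExecutor, remote_store: Dict[str, bytes],
                  block_id: str) -> bytes:
    """Try the mapper first; fall back to remote storage if it is unreachable."""
    try:
        return mapper.fetch(block_id)
    except ConnectionError:
        return remote_store[block_id]


remote: Dict[str, bytes] = {}
mapper = MapperExecutor(remote)
mapper.write_shuffle_block("shuffle_0_0_0", b"rows")
mapper.flush()
mapper.alive = False  # simulate the executor pod dying
print(reducer_fetch(mapper, remote, "shuffle_0_0_0"))
```

Keeping the upload on its own thread means a slow storage system delays only the backup, not the map task; the cost is a window in which a freshly written block exists only on local disk.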
Shuffle Resiliency
– Wanted to generalize the framework for storing shuffle data in arbitrary
storage systems
– API In Progress: https://issues.apache.org/jira/browse/SPARK-25299
– Goal: Open source asynchronous backup strategy by end of 2019
Thanks to the team!
Q&A
Kubecon seattle 2018 workshop slidesKubecon seattle 2018 workshop slides
Kubecon seattle 2018 workshop slides
 
The Analytic Platform behind IBM’s Watson Data Platform by Luciano Resende a...
 The Analytic Platform behind IBM’s Watson Data Platform by Luciano Resende a... The Analytic Platform behind IBM’s Watson Data Platform by Luciano Resende a...
The Analytic Platform behind IBM’s Watson Data Platform by Luciano Resende a...
 

More from Databricks

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDatabricks
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Databricks
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Databricks
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Databricks
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Databricks
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of HadoopDatabricks
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDatabricks
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceDatabricks
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringDatabricks
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixDatabricks
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationDatabricks
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchDatabricks
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesDatabricks
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsDatabricks
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkDatabricks
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkDatabricks
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesDatabricks
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkDatabricks
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeDatabricks
 
Machine Learning CI/CD for Email Attack Detection
Machine Learning CI/CD for Email Attack DetectionMachine Learning CI/CD for Email Attack Detection
Machine Learning CI/CD for Email Attack DetectionDatabricks
 

More from Databricks (20)

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
 
Machine Learning CI/CD for Email Attack Detection
Machine Learning CI/CD for Email Attack DetectionMachine Learning CI/CD for Email Attack Detection
Machine Learning CI/CD for Email Attack Detection
 

Recently uploaded

wepik-insightful-infographics-a-data-visualization-overview-20240401133220kwr...
wepik-insightful-infographics-a-data-visualization-overview-20240401133220kwr...wepik-insightful-infographics-a-data-visualization-overview-20240401133220kwr...
wepik-insightful-infographics-a-data-visualization-overview-20240401133220kwr...KarteekMane1
 
Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...Seán Kennedy
 
Conf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming PipelinesConf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming PipelinesTimothy Spann
 
Semantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxSemantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxMike Bennett
 
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...Amil Baba Dawood bangali
 
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...Boston Institute of Analytics
 
SMOTE and K-Fold Cross Validation-Presentation.pptx
SMOTE and K-Fold Cross Validation-Presentation.pptxSMOTE and K-Fold Cross Validation-Presentation.pptx
SMOTE and K-Fold Cross Validation-Presentation.pptxHaritikaChhatwal1
 
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default  Presentation : Data Analysis Project PPTPredictive Analysis for Loan Default  Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPTBoston Institute of Analytics
 
The Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptx
The Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptxThe Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptx
The Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptxTasha Penwell
 
Defining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryDefining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryJeremy Anderson
 
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...Dr Arash Najmaei ( Phd., MBA, BSc)
 
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...Boston Institute of Analytics
 
Networking Case Study prepared by teacher.pptx
Networking Case Study prepared by teacher.pptxNetworking Case Study prepared by teacher.pptx
Networking Case Study prepared by teacher.pptxHimangsuNath
 
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024Susanna-Assunta Sansone
 
INTRODUCTION TO Natural language processing
INTRODUCTION TO Natural language processingINTRODUCTION TO Natural language processing
INTRODUCTION TO Natural language processingsocarem879
 
Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Cathrine Wilhelmsen
 
Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 217djon017
 
Unveiling the Role of Social Media Suspect Investigators in Preventing Online...
Unveiling the Role of Social Media Suspect Investigators in Preventing Online...Unveiling the Role of Social Media Suspect Investigators in Preventing Online...
Unveiling the Role of Social Media Suspect Investigators in Preventing Online...Milind Agarwal
 
Cyber awareness ppt on the recorded data
Cyber awareness ppt on the recorded dataCyber awareness ppt on the recorded data
Cyber awareness ppt on the recorded dataTecnoIncentive
 
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...Boston Institute of Analytics
 

Recently uploaded (20)

wepik-insightful-infographics-a-data-visualization-overview-20240401133220kwr...
wepik-insightful-infographics-a-data-visualization-overview-20240401133220kwr...wepik-insightful-infographics-a-data-visualization-overview-20240401133220kwr...
wepik-insightful-infographics-a-data-visualization-overview-20240401133220kwr...
 
Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...
 
Conf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming PipelinesConf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
 
Semantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxSemantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptx
 
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
 
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
 
SMOTE and K-Fold Cross Validation-Presentation.pptx
SMOTE and K-Fold Cross Validation-Presentation.pptxSMOTE and K-Fold Cross Validation-Presentation.pptx
SMOTE and K-Fold Cross Validation-Presentation.pptx
 
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default  Presentation : Data Analysis Project PPTPredictive Analysis for Loan Default  Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
 
The Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptx
The Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptxThe Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptx
The Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptx
 
Defining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryDefining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data Story
 
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
 
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
 
Networking Case Study prepared by teacher.pptx
Networking Case Study prepared by teacher.pptxNetworking Case Study prepared by teacher.pptx
Networking Case Study prepared by teacher.pptx
 
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
 
INTRODUCTION TO Natural language processing
INTRODUCTION TO Natural language processingINTRODUCTION TO Natural language processing
INTRODUCTION TO Natural language processing
 
Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)
 
Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2
 
Unveiling the Role of Social Media Suspect Investigators in Preventing Online...
Unveiling the Role of Social Media Suspect Investigators in Preventing Online...Unveiling the Role of Social Media Suspect Investigators in Preventing Online...
Unveiling the Role of Social Media Suspect Investigators in Preventing Online...
 
Cyber awareness ppt on the recorded data
Cyber awareness ppt on the recorded dataCyber awareness ppt on the recorded data
Cyber awareness ppt on the recorded data
 
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
 

Reliable Performance at Scale with Apache Spark on Kubernetes

  • 2. Reliable Performance at Scale with Spark on Kubernetes. Will Manning + Matt Cheah, Palantir Technologies. #UnifiedDataAnalytics #SparkAISummit
  • 3. About us: Will Manning. Joined Palantir; first production adopter of YARN & Parquet; helped form our Data Flows & Analytics product groups; responsible for Engineering and Architecture for Compute (including Spark). Timeline markers: 2013, 2015, 2016.
  • 4. About us: Matt Cheah. Joined Palantir; migrated Spark cluster management from standalone to YARN, and later from YARN to Kubernetes; Spark committer / open source Spark developer. Timeline markers: 2014, 2018.
  • 5. Agenda: 1. A (Very) Quick Primer on Palantir 2. Why We Moved from YARN 3. Spark on Kubernetes 4. Key Production Challenges (Kubernetes Scheduling; Shuffle Resiliency)
  • 6. A (Very) Quick Primer on Palantir … and how Spark on Kubernetes helps power Palantir Foundry
  • 7. Who are we? Founded: 2004. Headquartered: Palo Alto, CA. Presence: Global. Employees: ~2500 / Mostly Engineers. Software: Data Integration.
  • 8. Supporting Counterterrorism: From intelligence operations to mission planning in the field.
  • 9. Energy: Institutions must evolve or die. Technology and data are driving this evolution.
  • 10. Manufacturing: Ferrari uses Palantir Foundry to increase performance + reliability.
  • 11. Aviation Safety: Palantir and Airbus founded Skywise to help make air travel safer and more economical.
  • 12. Cancer Research: Syntropy brings together the greatest minds and institutions to advance research toward the common goal of improving human lives.
  • 13. Products Built for a Purpose: Integrate, manage, secure, and analyze all of your enterprise data. Amplify and extend the power of data integration.
  • 14. Enabling analytics in Palantir Foundry: Executing untrusted code on behalf of trusted users in a multitenant environment. – Users can author code (e.g., using Spark SQL or PySpark) to define complex data transformations or to perform analysis – Our users want to write code once and have it keep working the same way indefinitely
  • 15. Enabling analytics in Palantir Foundry: Executing untrusted code on behalf of trusted users in a multitenant environment. – Users can author code (e.g., using Spark SQL or PySpark) to define complex data transformations or to perform analysis – Our users want to write code once and have it keep working the same way indefinitely – Foundry is responsible for executing arbitrary code on users’ behalf securely – Even though the customer might trust the user, Palantir infrastructure can’t
  • 16. Enabling collaboration across organizations: Using Spark on Kubernetes to enable multitenant compute. Diagram labels: Engineers from Airline A; Airbus Employees; Engineers from Airline B.
  • 17. Enabling analytics in Palantir Foundry: Executing untrusted code on behalf of trusted users in a multitenant environment. – Users can author code (e.g., using Spark SQL or PySpark) to define complex data transformations or to perform analysis – Our users want to write code once and have it keep working the same way indefinitely – Foundry is responsible for executing arbitrary code on users’ behalf securely – Even though the customer might trust the user, Palantir infrastructure can’t. Labels: Repeatable Performance; Security.
  • 18. Why We Moved from YARN
  • 20. Hadoop/YARN Security: Two modes: 1. Kerberos 2. None (only mode until Hadoop 2.x). NB: I recommend reading Steve Loughran’s “Madness Beyond the Gate” to learn more.
  • 21. Hadoop/YARN Security: Containerization as of 3.1.1 (late 2018)
  • 22. Performance in YARN: YARN’s scheduler attempts to maximize utilization
  • 23. Performance in YARN: YARN’s scheduler attempts to maximize utilization. Spark on YARN with dynamic allocation is great at: – extracting maximum performance from static resources – providing bursts of resources for one-off batch work – running “just one more thing”
  • 25. YARN: Clown Car Scheduling (Image Credit: 20th Century Fox Television)
  • 26. Performance in YARN: YARN’s scheduler attempts to maximize utilization. Spark on YARN with dynamic allocation is terrible at: – providing consistency from run to run – isolating performance between different users/tenants (i.e., if you kick off a big job, then my job is likely to run slower)
  • 27. So… Kubernetes? ✓ Native containerization ✓ Extreme extensibility (e.g., scheduler, networking/firewalls) ✓ Active community with a fast-moving code base ✓ Single platform for microservices and compute* *Spoiler alert: the Kubernetes scheduler is excellent for web services, not optimized for batch
  • 29. Timeline: Sep ’16: Begin initial prototype of Spark on K8s. Nov ’16: Establish working group with community. Mar ’17: Minimal integration complete, begin experimental deployment. Oct ’17: First PR merged into upstream master. Jan ’18: Begin first migration from YARN to K8s. Jun ’18: Complete first migration from YARN to K8s in production.
  • 30. Spark on K8s Architecture: Client runs spark-submit with arguments. spark-submit converts arguments into a PodSpec for the driver.
  • 31. Spark on K8s Architecture: Client runs spark-submit with arguments. spark-submit converts arguments into a PodSpec for the driver. K8s-specific implementation of the SchedulerBackend interface requests executor pods in batches.
  • 32. Spark on K8s Architecture: spark-submit issues CREATE DRIVER POD. spark-submit input: … --master k8s://example.com:8443 --conf spark.kubernetes.image=example.com/appImage --conf … com.palantir.spark.app.main.Main. spark-submit output: … spec: containers: - name: example image: example.com/appImage command: ["/opt/spark/entrypoint.sh"] args: ["--driver", "--class", "com.palantir.spark.app.main.Main"] …
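The conversion described on slides 30 to 32 can be pictured with a small sketch. This is only an illustration, not Spark's submission client: the function name `driver_pod_spec` and the `pod_name` argument are hypothetical, and the real driver pod carries many more fields. The image key, entrypoint, and arguments are the ones shown on slide 32.

```python
def driver_pod_spec(conf, main_class, pod_name="example-driver"):
    """Build a minimal driver pod manifest as a dict (illustrative only)."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": pod_name},  # hypothetical name, not from the slides
        "spec": {
            "containers": [{
                "name": "example",
                "image": conf["spark.kubernetes.image"],
                "command": ["/opt/spark/entrypoint.sh"],
                "args": ["--driver", "--class", main_class],
            }],
        },
    }


# Settings taken from the spark-submit input on slide 32.
conf = {"spark.kubernetes.image": "example.com/appImage"}
pod = driver_pod_spec(conf, "com.palantir.spark.app.main.Main")
print(pod["spec"]["containers"][0])
```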
  • 33. Spark on K8s Architecture (spark.executor.instances = 4, spark.kubernetes.allocation.batch.size = 2): the driver pod issues CREATE: 2 EXECUTOR PODS.
  • 34. Spark on K8s Architecture: Executor Pod #1 and Executor Pod #2 REGISTER with the driver pod.
  • 35. Spark on K8s Architecture: the driver pod issues CREATE: 2 MORE EXECUTOR PODS.
  • 36. Spark on K8s Architecture: Executor Pod #3 and Executor Pod #4 REGISTER with the driver pod.
  • 37. Spark on K8s Architecture: the driver pod sends RUN TASKS to all four executor pods.
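The batching shown on slides 33 to 37 amounts to splitting the target executor count into requests of at most the batch size. A minimal sketch, assuming each round simply asks for the next batch; the helper name `allocation_rounds` is hypothetical, and the timing between rounds is ignored.

```python
def allocation_rounds(target_executors, batch_size):
    """Return how many executor pods are requested in each round."""
    rounds = []
    requested = 0
    while requested < target_executors:
        # Never ask for more than the batch size or more than is still missing.
        batch = min(batch_size, target_executors - requested)
        rounds.append(batch)
        requested += batch
    return rounds


# spark.executor.instances = 4, spark.kubernetes.allocation.batch.size = 2
print(allocation_rounds(4, 2))
```

With the slide's settings this yields two rounds of two pods each, matching "CREATE: 2 EXECUTOR PODS" followed by "CREATE: 2 MORE EXECUTOR PODS".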
  • 38. Early Challenge: Disk Performance. On YARN: a YARN Node Manager (Memory = M) hosts two executors (Heap = E each), leaving an OS File System Cache (Max Size = M – 2E) in front of the File System.
  • 39. Early Challenge: Disk Performance. On YARN: as on slide 38. On Kubernetes: each Executor Pod (Memory = C) hosts one executor (Heap = E = ~C), leaving an OS File System Cache (Max Size = C – E = ~0) in front of the File System.
  • 40. Early Challenge: Disk Performance. OS filesystem cache is now container-local – i.e., dramatically smaller – disks must be fast without hitting FS cache – Solution: use NVMe drives for temp storage. Docker disk interface is slower than direct disk access – Solution: use EmptyDir volumes for temp storage
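The cache-size expressions on slides 38 and 39 can be checked with a quick calculation. The memory figures below are made-up values chosen only to show the shape of the problem; the function names are hypothetical.

```python
def yarn_fs_cache(node_memory, executor_heap, executors_per_node=2):
    """Upper bound on the shared OS file system cache on a YARN node: M - 2E."""
    return node_memory - executors_per_node * executor_heap


def pod_fs_cache(container_memory, executor_heap):
    """Upper bound on the container-local cache in an executor pod: C - E."""
    return container_memory - executor_heap


# Hypothetical sizes in GiB: M = 256, E = 64, C = 66.
print(yarn_fs_cache(256, 64))  # shared cache available to both executors
print(pod_fs_cache(66, 64))    # per-pod cache when the heap nearly fills the pod
```

Under these assumed numbers the node-level cache is 128 GiB while each pod is left with 2 GiB, which is why the slides call for fast local disks.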
  • 42. Challenge I: Kubernetes Scheduling – The built-in Kubernetes scheduler isn’t really designed for distributed batch workloads (e.g., MapReduce) – Historically, optimized for microservice instances or single-pod, one-off jobs (what k8s natively supports)
  • 43. Challenge II: Shuffle Resiliency – External Shuffle Service is unavailable in Kubernetes – Jobs must be written more carefully to avoid executor failure (e.g., OOM) and the subsequent need to recompute lost blocks
  • 45. Reliable, Repeatable Runtimes. Recall: A key goal is to make runtimes of the same workload consistent from run to run.
  • 46. Reliable, Repeatable Runtimes. Recall: A key goal is to make runtimes of the same workload consistent from run to run. Corollary: When a driver is deployed and wants to do work, it should receive the same resources from run to run.
  • 47. Reliable, Repeatable Runtimes. Recall: A key goal is to make runtimes of the same workload consistent from run to run. Corollary: When a driver is deployed and wants to do work, it should receive the same resources from run to run. Problem: Using the vanilla k8s scheduler led to partial starvation as the cluster became saturated.
  • 50. Kubernetes Scheduling (diagram: Scheduling Queue and Running Pods across Rd 1 and Rd 2, with pods P1, P2, P3): P2 is not placed and goes “Back to the end of the line!”
  • 51. Kubernetes Scheduling (same diagram, next step): P1 and P3 are running; P2 is “(Still waiting…)”
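The queue behaviour sketched on slides 50 and 51 can be imitated with a toy model. This is an assumption-laden simplification, not the Kubernetes scheduler: pods are (name, size) pairs, the cluster is a single pool of free slots, and the sizes are invented so that P2 does not fit.

```python
from collections import deque


def run_round(queue, free_slots):
    """One pass over the queue: place what fits, requeue what does not."""
    scheduled = []
    for _ in range(len(queue)):
        name, size = queue.popleft()
        if size <= free_slots:
            free_slots -= size
            scheduled.append(name)
        else:
            queue.append((name, size))  # back to the end of the line
    return scheduled, free_slots


# Hypothetical pod sizes: P2 is too large for what remains after P1.
queue = deque([("P1", 1), ("P2", 3), ("P3", 1)])
scheduled, free_slots = run_round(queue, free_slots=2)
print(scheduled, list(queue), free_slots)
```

In this toy run P1 and P3 are placed while P2 stays in the queue, which mirrors the "still waiting" situation on slide 51.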
  • 52. Naïve Spark-on-K8s Scheduling 52 Scheduling Queue Running Pods Rd 1 D
  • 53. Naïve Spark-on-K8s Scheduling 53 Scheduling Queue Running Pods D Rd 1 D
  • 54. Naïve Spark-on-K8s Scheduling 54 Scheduling Queue Running Pods D D Rd 1 Rd 2 D E2E1
  • 55. Naïve Spark-on-K8s Scheduling 55 Scheduling Queue Running Pods D D Rd 1 Rd 2 D D E2E1 E2E1
  • 56–59. The “1000 Drivers” Problem [Diagram, built up over four slides: drivers D1 … D1000 are queued and run in round 1; in round 2 their executors (E1, E2, … EN) queue up behind them, captioned “Uh oh!”]
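One way to read the “1000 Drivers” diagram is as a simple capacity model: if pods are admitted one at a time with no notion of which pods belong together, drivers can take up the cluster before any executor gets a slot. The sketch below is illustrative only. The function name, the one-slot-per-pod assumption, and the numbers are hypothetical and do not come from the talk.

```python
from collections import deque


def naive_schedule(num_drivers, executors_per_driver, capacity):
    """Admit pods one at a time in queue order, with no knowledge of Spark apps.

    Assumption: every pod (driver or executor) needs exactly one slot.
    Returns how many applications end up with all of their executors running.
    """
    queue = deque(("driver", app) for app in range(num_drivers))
    running_executors = {app: 0 for app in range(num_drivers)}
    free = capacity
    while queue and free > 0:
        kind, app = queue.popleft()
        free -= 1
        if kind == "driver":
            # A running driver requests its executors, which join the back of the queue.
            queue.extend(("executor", app) for _ in range(executors_per_driver))
        else:
            running_executors[app] += 1
    return sum(1 for n in running_executors.values() if n == executors_per_driver)
```

Under these assumptions, `naive_schedule(1000, 2, 1000)` returns 0: every slot is held by a driver, and no application has the executors it needs to make progress.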
  • 60. So… Kubernetes? ✓ Native containerization ✓ Extreme extensibility (e.g., scheduler, networking/firewalls) ✓ Active community with a fast-moving code base ✓ Single platform for microservices and compute* (*Spoiler alert: the Kubernetes scheduler is excellent for web services but not optimized for batch)
  • 61. K8s Spark Scheduler Idea: use the Kubernetes scheduler extender API to add – Gang scheduling of drivers & executors – FIFO (within instance groups)
  • 62. K8s Spark Scheduler Goal: build entirely with native K8s extension points – Scheduler extender API: fn(pod, [node]) -> Option[node] – Custom resource definition: ResourceReservation – Driver annotations for resource requests (executors)
  • 63. K8s Spark Scheduler – If driver: get cluster usage (running pods + reservations), bin pack pending resource requests in FIFO order, then reserve resources (with CRD) – If executor: find an unbound reservation, then bind the reservation
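The driver branch of this flow can be sketched as all-or-nothing (gang) admission in FIFO order. This is a minimal illustration, not the code of k8s-spark-scheduler: the function names, the single "units" resource dimension, first-fit packing, and the strict stop-at-first-misfit FIFO policy are all assumptions made for the example.

```python
def try_reserve(free_by_node, requests):
    """First-fit bin packing that places every request or none (gang semantics).

    free_by_node: dict of node name -> free units, with running pods and
    existing reservations already subtracted.
    requests: list of unit sizes, driver first, then its executors.
    Returns one node name per request, or None if the whole gang does not fit.
    """
    remaining = dict(free_by_node)  # work on a copy so a failure has no side effects
    placement = []
    for size in requests:
        node = next((n for n, free in remaining.items() if free >= size), None)
        if node is None:
            return None
        remaining[node] -= size
        placement.append(node)
    return placement


def schedule_fifo(free_by_node, pending_apps):
    """Reserve resources for apps in FIFO order, stopping at the first misfit."""
    free = dict(free_by_node)
    reservations = {}
    for app_id, requests in pending_apps:
        placement = try_reserve(free, requests)
        if placement is None:
            break  # FIFO: later apps must not jump ahead of this one
        for node, size in zip(placement, requests):
            free[node] -= size
        reservations[app_id] = placement
    return reservations
```

For example, with `{"n1": 4, "n2": 2}` free and apps `a`, `b` (each a 1-unit driver plus two 2-unit executors) followed by a 1-unit app `c`, only `a` receives reservations. App `b` does not fit as a whole, and `c` waits behind it even though it would fit on its own.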
  • 64. K8s Spark Scheduler [Diagram: D1 runs in round 1 with reservations R1 and R2 held for its executors; in round 2 and later, E1 and E2 bind to those reservations while D2 … D1000 remain in the scheduling queue]
  • 65. Spark-Aware Autoscaling Idea: use the resource request annotations for unscheduled drivers to project desired cluster size – Again, we use a CRD to represent this unsatisfied Demand – Sum resources for pods + reservations + demand – In the interest of time, out of scope for today
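The “sum resources for pods + reservations + demand” step can be illustrated with a small projection function. This is a rough sketch under stated assumptions: the function name, the (CPU, memory) tuple shape, and uniform node sizes are hypothetical, the three inputs are assumed not to overlap, and the result is only a lower bound because it ignores fragmentation across nodes.

```python
import math


def desired_node_count(pods, reservations, demand, node_size):
    """Project how many nodes are needed to hold all requested resources.

    Each of pods, reservations, and demand is a list of (cpu, memory) tuples;
    node_size is the (cpu, memory) capacity of one node.
    """
    groups = (pods, reservations, demand)
    cpu = sum(r[0] for group in groups for r in group)
    mem = sum(r[1] for group in groups for r in group)
    # The scarcer dimension determines the projected size.
    return max(math.ceil(cpu / node_size[0]), math.ceil(mem / node_size[1]))
```

With running pods `[(4, 16), (2, 8)]`, one reservation `(2, 8)`, unsatisfied demand `[(9, 32)]`, and nodes of size `(8, 32)`, the CPU total of 17 cores drives the projection to 3 nodes.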
  • 68–71. “Soft” dynamic allocation – Problem: static allocation wastes resources – Idea: voluntarily give up executors that aren’t needed, with no preemption (same from run to run) – Problem: there is no external shuffle, so executors store shuffle files – Idea: the driver already tracks shuffle file locations, so it can determine which executors are safe to give up
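The second idea, using the driver's knowledge of shuffle file locations to decide which executors are safe to release, can be sketched as a set computation. This is not Spark's implementation: the function name and the data shapes (sets of executor ids, a map from shuffle id to the executors holding its map output) are hypothetical simplifications.

```python
def removable_executors(executors, idle, shuffle_locations, active_shuffles):
    """Return the executors that can be given up without losing needed shuffle data.

    executors: set of all executor ids.
    idle: set of executor ids with no running tasks.
    shuffle_locations: dict of shuffle id -> set of executor ids holding its files.
    active_shuffles: set of shuffle ids that later stages may still read.
    """
    holding = set()
    for shuffle_id in active_shuffles:
        holding |= shuffle_locations.get(shuffle_id, set())
    # Safe to release: idle and not holding output for any shuffle still in use.
    return {e for e in executors if e in idle and e not in holding}
```

For instance, if `e2` and `e3` are idle, shuffle 0 is still active with output on `e1` and `e2`, and `e3` only holds output for a finished shuffle, then only `e3` is returned.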
  • 72. “Soft” dynamic allocation – Saves $$$$, with ~no runtime variance if consistent from run to run – Inspired by a 2018 Databricks blog post [1] – Merged into our fork in ~January 2019 – Recently adapted by @vanzin (thanks!) and merged upstream [2] [1] https://databricks.com/blog/2018/05/02/introducing-databricks-optimized-auto-scaling.html [2] https://github.com/apache/spark/pull/24817
  • 73. K8s Spark Scheduler See our engineering blog on Medium – https://medium.com/palantir/spark-scheduling-in-kubernetes-4976333235f3 Or, check out the source for yourself! (Apache v2 license) – https://github.com/palantir/k8s-spark-scheduler
  • 75. Shuffle Resiliency Shuffles have a map side and a reduce side – Mapper executors write temporary data to local disk – Reducer executors contact mapper executors to retrieve written data [Diagram: a mapper executor and a reducer executor, each under a YARN Node Manager with its own local disk]
  • 76. Shuffle Resiliency If an executor crashes, all data written by that executor’s map tasks is lost – Spark must re-schedule the map tasks on other executors – Executors cannot be removed to save resources because their shuffle files would be lost [Diagram: a mapper executor and a reducer executor, each under a YARN Node Manager with its own local disk]
  • 77. Shuffle Resiliency Spark’s external shuffle service preserves shuffle data in cases of executor loss – Required to make preemption non-pathological – Prevents loss of work when executors crash [Diagram: each YARN Node Manager now hosts a shuffle service alongside the executor and its local disk]
  • 78. Shuffle Resiliency Problem: for security, containers have isolated storage [Diagram comparing YARN, where the node manager’s shuffle service and the mapper executor sit on the same node, with Kubernetes, where the mapper executor and the shuffle service run in separate pods]
  • 79. Shuffle Resiliency Idea: asynchronously back up shuffle files to a distributed storage system [Diagram: inside an executor pod, a map task thread writes to local disk while a separate backup thread handles the backup]
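The asynchronous backup idea can be sketched with a background thread and a queue, so the map task never waits on the upload. This is a toy illustration, not Palantir's or Spark's code: the class name is hypothetical, and a local directory stands in for the distributed storage system.

```python
import queue
import shutil
import threading
from pathlib import Path


class ShuffleBackup:
    """Copies finished shuffle files to 'remote' storage on a background thread."""

    def __init__(self, remote_dir):
        # Assumption: a plain directory plays the role of the remote store.
        self.remote_dir = Path(remote_dir)
        self.remote_dir.mkdir(parents=True, exist_ok=True)
        self._pending = queue.Queue()
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def submit(self, local_file):
        """Called by the map task after writing a file; returns immediately."""
        self._pending.put(Path(local_file))

    def _run(self):
        while True:
            path = self._pending.get()
            try:
                shutil.copy2(path, self.remote_dir / path.name)
            except OSError:
                pass  # a real system would retry or report the failed backup
            finally:
                self._pending.task_done()

    def wait(self):
        """Block until every submitted file has been processed."""
        self._pending.join()
```

The map task calls `submit` and carries on; `wait` exists only so that a caller (or a test) can confirm that the queue has drained.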
  • 80. Shuffle Resiliency 1. Reducers first try to fetch from other executors 2. Download from remote storage if mapper is unreachable [Diagram: one reducer executor pod fetching from a live mapper executor pod, and another reducer whose mapper executor pod is dead]
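The two-step fetch order can be written as a simple fallback. This is a sketch under assumptions: the function and helper names are hypothetical, an unreachable mapper is modeled as a `ConnectionError`, and a dictionary stands in for the remote storage system.

```python
def fetch_block(block_id, fetch_from_mapper, remote_store):
    """Fetch a shuffle block, preferring the mapper executor over remote storage.

    fetch_from_mapper: callable taking a block id and returning bytes, or
    raising ConnectionError if the mapper executor is unreachable.
    remote_store: mapping of block id -> bytes for backed-up blocks.
    """
    try:
        return fetch_from_mapper(block_id)   # step 1: try the executor
    except ConnectionError:
        return remote_store[block_id]        # step 2: fall back to the backup


def live_mapper(block_id):
    return b"from-executor"


def dead_mapper(block_id):
    raise ConnectionError("executor lost")
```

With a backup of `{"b0": b"from-backup"}`, `fetch_block("b0", live_mapper, ...)` returns the executor's copy, while `fetch_block("b0", dead_mapper, ...)` returns the backed-up copy.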
  • 81. Shuffle Resiliency – Wanted to generalize the framework for storing shuffle data in arbitrary storage systems – API In Progress: https://issues.apache.org/jira/browse/SPARK-25299 – Goal: Open source asynchronous backup strategy by end of 2019
  • 82. Thanks to the team!
  • 83. Q&A
  • 84. DON’T FORGET TO RATE AND REVIEW THE SESSIONS SEARCH SPARK + AI SUMMIT