About Anya (she/her)
Sr. Member of Technical Staff (SRE)
Salesforce Production Engineering
Salesforce Einstein Platform
Co-organizer SF Big Analytics
Spark Tuning
• Cheat-sheet
• Talks
Previously at Alpine Data, SRI
PhD Mayo Clinic, BS Johns Hopkins
@anyabida1
1700s: 1st Industrial Revolution - Steam
1800s: 2nd Industrial Revolution - Electricity
1900s: 3rd Industrial Revolution - Computing
Today: 4th Industrial Revolution - Intelligence
Fourth Industrial Revolution
Intelligence is transforming the customer experience
Just Enough DevOps for Data Scientists
Part II: Handling Infra Failures When Running Spark
abida@salesforce.com
@anyabida1
Anya Bida, SRE at Salesforce
What is DevOps?
Software Development
Network & Security
Infrastructure
Build & Release
Data Science
Hello Ada!
Spark Primer
Apache Spark
https://spark.apache.org/examples.html
Blue Green Deployments
https://docs.mobingi.com/official/guide/bg-deploy
Blue Machine
(old)
Green Machine
(new)
Users
How to avoid potential HDFS failures
- Use high availability for the NameNode
- Plenty of disk space for HDFS
- Plenty of disk space per disk
- Block replication = 3
- Monitor disk I/O and network connectivity
- Correct permissions
Spark Context defines the application
Spark operations
textFile
flatMap
map
reduceByKey
saveAsTextFile
Stage Boundaries
A wide transformation defines a new stage
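To make the stage boundary concrete, here is a minimal plain-Python sketch (not Spark code) of the word count pipeline from the Spark examples page. The function names, the use of crc32 as a stand-in hash partitioner, and the default of two reducers are all assumptions made for illustration. The narrow steps (flatMap, map) work on one partition at a time, while reduceByKey needs every occurrence of a word in the same place, which forces a regrouping step between the two stages.

```python
from collections import defaultdict
import zlib


def narrow_stage(partition):
    """Stage 1 task: flatMap then map on one partition; no data leaves it."""
    return [(word, 1) for line in partition for word in line.split(" ")]


def shuffle(mapped_partitions, num_reducers):
    """Stage boundary: regroup pairs so each word lands in exactly one bucket."""
    buckets = [defaultdict(list) for _ in range(num_reducers)]
    for part in mapped_partitions:
        for word, one in part:
            # crc32 is a hypothetical stand-in for the partitioner's hash
            idx = zlib.crc32(word.encode("utf-8")) % num_reducers
            buckets[idx][word].append(one)
    return buckets


def reduce_stage(bucket):
    """Stage 2 task: sum the counts for every word in one bucket."""
    return {word: sum(ones) for word, ones in bucket.items()}


def word_count(partitions, num_reducers=2):
    mapped = [narrow_stage(p) for p in partitions]  # one task per input partition
    buckets = shuffle(mapped, num_reducers)         # wide transformation
    result = {}
    for bucket in buckets:                          # one task per bucket
        result.update(reduce_stage(bucket))
    return result
```

Calling `word_count([["a b a"], ["b c"]])` treats each inner list as one input partition and returns the merged per-word counts.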
Anatomy of a Spark Job
(High Performance Spark, Karau & Warren, O’Reilly)
Spark Application: Spark Context / Spark Session Object
Job: Actions (e.g. collect, saveAsTextFile)
Stage: Wide Transformations (sort, groupByKey)
Task: Computation to evaluate one partition (combine narrow transforms)
Where are the tasks?
Tasks run on executors
Apache Spark
How to avoid common task failures
- Use the default retry and exponential backoff settings
- Spark tolerates single-node and multi-node failures
- Spark 2.2 tolerates single-disk failures, even on non-RAID commodity hardware
- Etc.
- Optimize the number of partitions
- Beware of data skew and dirty data
- Etc.
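One way to spot data skew before it causes task failures is to estimate how records would spread across partitions. The sketch below is plain Python, not Spark: `partition_sizes`, `skew_ratio`, and the crc32-based partitioning are hypothetical helpers chosen for illustration, and a real partitioner may hash keys differently.

```python
import zlib


def partition_sizes(keys, num_partitions):
    """Count how many records land in each partition under hash partitioning."""
    sizes = [0] * num_partitions
    for key in keys:
        # crc32 is an assumed, stable stand-in for the partitioner's hash
        idx = zlib.crc32(str(key).encode("utf-8")) % num_partitions
        sizes[idx] += 1
    return sizes


def skew_ratio(sizes):
    """Largest partition divided by the mean size (1.0 means perfectly even)."""
    mean = sum(sizes) / len(sizes)
    return max(sizes) / mean if mean else 0.0
```

A ratio well above 1.0 suggests that one task will do most of the work, so a sample of the keys can be checked this way before choosing the number of partitions.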
Spark operations
reduceByKey
Stage Boundaries
The Shuffle
RDD Re-use
                                                      Cache  Persist  Checkpoint   Local Checkpoint
Local memory cache                                    MEM    MEM      -            MEM
Local disk                                            -      DISK     -            DISK
HDFS / S3                                             -      -        Specify dir  -
If executor is decommissioned, are writes available?  No     No       Yes          No
If job finishes, are writes available?                No     No       Yes          No
Preserve lineage graph?                               Yes    Yes      No           No
Persist to improve speed, Checkpoint to improve fault tolerance
Spark operations
saveAsTextFile
Stage Boundaries
The Write
- Reading and writing are not efficient
- Writing a few large files is more efficient than writing thousands of small files
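As a rough illustration of "a few large files", the sketch below estimates how many output files a write should produce for a given data size. The helper name and the 128 MiB target are assumptions, not values from the talk; the result is the kind of number one might pass to a repartitioning step before the write.

```python
import math


def target_output_files(total_bytes, target_file_bytes=128 * 1024 * 1024):
    """Number of output files so that each is roughly target_file_bytes.

    The 128 MiB default is an assumed target, not a recommendation from the deck.
    """
    if total_bytes <= 0:
        return 1
    return max(1, math.ceil(total_bytes / target_file_bytes))
```

For example, 10 GiB of output at the assumed 128 MiB target works out to 80 files rather than thousands of small ones.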
Spark operations
saveAsTextFile
Stage Boundaries
The Write - S3
- S3 partitions != HDFS partitions
- S3 partitions != Spark partitions
- S3 partitioning can slow your write
- S3 partitioning depends on the first few characters of the bucket path
- S3://mybucket/hash-myresultfile
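The `hash-myresultfile` pattern above can be sketched as a small key-naming helper. This is an illustrative assumption, not the speaker's implementation: `hashed_key`, the choice of SHA-256, and the four-character prefix length are all hypothetical.

```python
import hashlib


def hashed_key(filename, prefix_len=4):
    """Prepend a short, deterministic hash so object names vary in their first characters."""
    digest = hashlib.sha256(filename.encode("utf-8")).hexdigest()
    return f"{digest[:prefix_len]}-{filename}"
```

Because the prefix is derived from the file name, the same name always maps to the same key, so a reader can recompute the key instead of listing the bucket.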
FAILURE
FAILURE
FAILURE
Common failures
Where do I find Metrics? Logs?
Ganglia
• windowing, dashboarding
Spark History Server
More info:
• Site Reliability Engineering: How Google Runs Production Systems (book)
• High Performance Spark (book)
• Chaos Engineering
abida@salesforce.com
@anyabida1
Anya Bida, SRE at Salesforce

Editor's Notes

  1. Key Messages: In the Fourth Industrial Revolution, artificial intelligence, robotics, and the Internet of Things (IoT) are transforming the customer experience. The lines between the physical and digital worlds are blurring, especially with technology like voice command, autonomous cars, and smart devices that keep you connected and always-on.
     Talk Track: Over the last 300 years, our world has seen incredible innovation and unprecedented technological change. Today, we're entering the Fourth Industrial Revolution, where artificial intelligence, robotics, and the Internet of Things (IoT) are transforming the customer experience.
     - AI is making devices and apps smarter
     - The lines between the physical and digital worlds are blurring
     - We are seeing incredible new products and services, like connected motorcycles and connected coolers
     You can see this change in our everyday lives: shopping via voice command, autonomous cars, and smart devices that keep you connected and always-on.
     Transition: These next-generation technologies are connecting us to our customers in a whole new way. And customer expectations, in turn, are changing.
  2. Salesforce Einstein is serving 475 Million predictions per day, and growing. So how do we do this from an infra perspective?
  3. What DevOps actually is: a cross section of infrastructure. Here are all the things data scientists need to support themselves at scale.
  4. What DevOps actually is: a cross section of infrastructure. Here are all the things data scientists need to support themselves at scale.
  5. What DevOps actually is: a cross section of infrastructure. Here are all the things data scientists need to support themselves at scale.
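  6. // sc is the SparkContext provided by the application or shell
     val textFile = sc.textFile("hdfs://...")
     // Narrow steps (flatMap, map), then a wide step (reduceByKey)
     val counts = textFile.flatMap(line => line.split(" "))
                          .map(word => (word, 1))
                          .reduceByKey(_ + _)
     // The action that triggers the job
     counts.saveAsTextFile("hdfs://...")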
  7. When to use:
     Persist(useDisk) - when partitions cannot fit into memory, and when jobs are slow, e.g. due to network delays
     Checkpoint - when the cost to recompute is high, when there’s no time to recompute on failure, or when jobs fail due to OOM or network interruptions
     Local checkpoint - when the lineage graph is super long
  8. CloudWatch stores all my cluster and host metrics by default. So when I’ve got a huge job running, my available cluster memory might drop significantly, and I can monitor this in CloudWatch. Ganglia is nice for creating dashboards and configuring the monitors for my clusters. I should mention logs too: the Spark History Server lets me view logs even for clusters that are terminated.
  9. This has been “Just Enough DevOps for Data Scientists”.