Anya Bida is a senior member of technical staff working on Spark tuning at Salesforce. She has a PhD from Mayo Clinic and a BS from Johns Hopkins. This talk covers DevOps concepts for data scientists, such as handling infrastructure failures when running Spark jobs. It gives an overview of Spark operations like map, reduceByKey, and saveAsTextFile, and discusses best practices for avoiding common Spark and HDFS failures: high availability, sufficient disk space, well-chosen partition counts, and persisting or checkpointing data.
Just Enough DevOps for Data Scientists Part II: Handling Infra Failures When Running Spark
2. About Anya (she/her)
Sr. Member of Technical Staff (SRE)
Salesforce Production Engineering
Salesforce Einstein Platform
Co-organizer SF Big Analytics
Spark Tuning
• Cheat-sheet
• Talks
Previously at Alpine Data, SRI
PhD Mayo Clinic, BS Johns Hopkins
@anyabida1
5. Just Enough DevOps for Data Scientists
Part II: Handling Infra Failures When Running Spark
abida@salesforce.com
@anyabida1
Anya Bida, SRE at Salesforce
13. https://spark.apache.org/examples.html
How to avoid potential HDFS failures
- Use high availability for the namenode
- Plenty of disk space for hdfs
- Plenty of disk space per disk
- Block replication = 3
- Monitor disk I/O, network connectivity
- Correct permissions
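As a sketch, a couple of the settings above map onto hdfs-site.xml like this (the property names are stock Hadoop; the reserved-space value is an illustrative assumption, not a recommendation):

```xml
<configuration>
  <!-- Keep three copies of every block (the Hadoop default) -->
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
  <!-- Reserve space per disk so DataNode volumes never fill completely
       (10 GB here is an example value; size it for your workload) -->
  <property>
    <name>dfs.datanode.du.reserved</name>
    <value>10737418240</value>
  </property>
  <!-- NameNode high availability is configured via dfs.nameservices and
       dfs.ha.namenodes.* in this same file; see the Hadoop HA docs -->
</configuration>
```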
21. Tasks run on executors
Apache Spark
How to avoid common task failures
- Use the default retry & exponential backoff settings
- Spark is tolerant of single- and multi-node failures
- Spark 2.2+ tolerates single-disk failures, even on non-RAID commodity hardware
- Optimize the number of partitions
- Beware of data skew & dirty data
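The retry knobs referenced above live in Spark configuration. A hedged spark-defaults.conf sketch showing the stock defaults (these values are Spark's documented defaults, listed for orientation, not as tuning advice):

```properties
spark.task.maxFailures              4    # task attempts before the job fails
spark.stage.maxConsecutiveAttempts  4    # stage retries before aborting
spark.shuffle.io.maxRetries         3    # shuffle fetch retries (Netty)
spark.shuffle.io.retryWait          5s   # wait between shuffle fetch retries
```

Leaving these at their defaults is usually right; raising them papers over flaky infrastructure rather than fixing it.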
23. RDD Re-use: Cache vs. Persist vs. Checkpoint vs. Local Checkpoint

|                                                 | Cache | Persist     | Checkpoint        | Local Checkpoint |
| Stored in local memory                          | Yes   | MEM levels  | No                | Yes              |
| Stored on local disk                            | No    | DISK levels | No                | Yes              |
| Stored on HDFS / S3                             | No    | No          | Yes (specify dir) | No               |
| Writes available if executor is decommissioned? | No    | No          | Yes               | No               |
| Writes available after the job finishes?        | No    | No          | Yes               | No               |
| Preserves lineage graph?                        | Yes   | Yes         | No                | No               |

Persist to improve speed, Checkpoint to improve fault tolerance
Key Messages
In the Fourth Industrial Revolution, artificial intelligence, robotics, and the Internet of Things (IoT) are transforming the customer experience.
The lines between the physical and digital worlds are blurring, especially with technology like voice command, autonomous cars, and smart devices that keep you connected and always-on.
Talk Track: Over the last 300 years, our world has seen incredible innovation and unprecedented technological change. Today, we're entering the Fourth Industrial Revolution, where artificial intelligence, robotics, and the Internet of Things (IoT) are transforming the customer experience.
AI is making devices and apps smarter
The lines between the physical and digital worlds are blurring
We are seeing incredible new products and services — like connected motorcycles and connected coolers
You can see this change in our everyday lives: shopping via voice command, autonomous cars, and smart devices that keep you connected and always-on.
Transition: These next-generation technologies are connecting us to our customers in a whole new way. And customer expectations, in turn, are changing.
Salesforce Einstein is serving 475 million predictions per day, and growing. So how do we do this from an infra perspective?
What DevOps actually IS:
-- a cross-section of infrastructure concerns
-- all the things data scientists need to support themselves at scale
When to use each:
- Persist(useDisk): when partitions cannot fit into memory, and when jobs are slow, e.g. due to network delays
- Checkpoint: when the cost to recompute is high, when there's no time to recompute on failure, or when jobs fail due to OOM or network interruptions
- Local checkpoint: when the lineage graph is very long
CloudWatch stores all of my cluster and host metrics by default. So when I've got a huge job running and my cluster memory drops significantly, I can watch that happen in CloudWatch.
Ganglia is nice for building dashboards and configuring monitors for my clusters.
I should mention logs too: the Spark History Server lets me view logs even for clusters that have been terminated.
This has been “Just enough devops for data scientists”