We all dread “Lost task” and “Container killed by YARN for exceeding memory limits” messages in our scaled-up Spark-on-YARN applications. Even answering the question “How much memory did my application use?” is surprisingly tricky in a distributed YARN environment. Sqrrl has developed a testing framework for observing vital statistics of Spark jobs, including executor-by-executor memory and CPU usage over time for both the JVM and Python portions of PySpark YARN containers. This talk details the methods we use to collect, store, and report Spark YARN resource usage. This information has proved invaluable for performance and regression testing of the Spark jobs in Sqrrl Enterprise.
Monitoring the Dynamic Resource Usage of Scala and Python Spark Jobs in Yarn: Spark Summit East talk by Ed Barnes and Ruslan Vaulin
1. Monitoring the Dynamic Resource Usage of Scala and Python Spark Jobs in Yarn
Ed Barnes, Ruslan Vaulin and Chris McCubbin
Sqrrl Data
3. ML Application Workflow
[Diagram: a YARN cluster running Spark, with Python and Scala components. An ETL → Training pipeline reads from a database/graph and produces stats models; an ETL → Prediction pipeline applies the trained model back against the database/graph.]
4. Taking Spark Applications into Production
• Requires an execution framework
– Scalable, Robust, Tested
• Test at scale
• Many issues show up only at scale
– Performance
– Memory requirements
– Failures
– Scaling
• Debugging distributed applications is really hard!
8. Case Study: Py4J Issue
• Testing engineer: “Your code blows up with OOM when processing X amount of data!”
• Why?!!!!
• Nothing obvious in the Spark UI!
9. Case Study: Py4J Issue (cont.)
[Screenshot: the OOM exception]
11. Overall System Design
[Diagram: a YARN cluster with a Name Node and Data Nodes 1 through n, each data node running a Sqrrl Server alongside a process_monitor.py instance, plus a Sqrrl Server on the Name Node; a Performance Test System runs run_perf.py; everything reports over a REST API to a central Django/Apache server backed by MySQL.]
● Centralized reporting/logging system
○ REST API for scripts
○ Browser-based reporting
○ Database for process data and logging
● Main performance script (run_perf.py)
● Process monitoring script (process_monitor.py)
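To make the design concrete, here is a minimal sketch of what a per-node monitor in the spirit of process_monitor.py could look like. This is not Sqrrl's actual script: the use of psutil, the endpoint URL, the sampling interval, and the field names are all assumptions for illustration.

# process_monitor.py-style sketch (hypothetical): periodically sample memory
# and CPU for JVM and Python processes on a data node and POST the samples to
# a central REST reporting service.
import socket
import time

import psutil    # assumed dependency for per-process statistics
import requests  # assumed dependency for calling the REST API

REPORT_URL = "http://reporting-server/api/process_stats"  # hypothetical endpoint
INTERVAL_SECONDS = 5

def sample_processes():
    """Collect RSS and CPU usage for the java and python processes on this node."""
    samples = []
    for proc in psutil.process_iter(["pid", "name", "memory_info", "cpu_percent"]):
        try:
            if proc.info["name"] in ("java", "python") and proc.info["memory_info"]:
                samples.append({
                    "host": socket.gethostname(),
                    "timestamp": time.time(),
                    "pid": proc.info["pid"],
                    "name": proc.info["name"],
                    "rss_bytes": proc.info["memory_info"].rss,
                    # note: the first cpu_percent sample per process reads 0.0
                    "cpu_percent": proc.info["cpu_percent"],
                })
        except psutil.NoSuchProcess:
            continue  # the process exited between listing and sampling
    return samples

if __name__ == "__main__":
    while True:
        requests.post(REPORT_URL, json=sample_processes(), timeout=10)
        time.sleep(INTERVAL_SECONDS)

In a setup like this, run_perf.py would presumably tag each test run so the stored samples can be correlated with it, and the browser-based reports would read the same database the REST API writes to.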
21. Py4J Issue: Evidence
• Memory spiked during loading of trained ML models into the Python process.
• Only on the driver!
[Chart: Python process memory over time, showing the memory spikes on the driver]
22. Diagnosis and Conclusions
• Loading a trained model from HDFS required decryption with Java libraries, then Py4J to hand the result to Python.
• The Py4J protocol inflated the size by almost 100x!
• Solution: implement a custom protocol for loading data from the JVM into Python via a socket.
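The slides do not show the replacement protocol, but a length-prefixed socket transfer of the kind described might look roughly like this on the Python side (the port, framing, and function names are illustrative assumptions, not Sqrrl's actual code):

# Hypothetical sketch of the JVM-to-Python transfer: the JVM side decrypts the
# model bytes and writes them to a local socket as a 4-byte big-endian length
# followed by the raw payload, avoiding Py4J's inflating serialization.
import socket
import struct

def read_model_bytes(port=9099):  # port is an arbitrary illustration
    """Receive one length-prefixed payload from the JVM over a local socket."""
    with socket.create_connection(("localhost", port)) as conn:
        header = _read_exactly(conn, 4)
        (length,) = struct.unpack(">I", header)  # big-endian unsigned length
        return _read_exactly(conn, length)

def _read_exactly(conn, n):
    """Read exactly n bytes, looping because recv() may return partial data."""
    chunks = []
    remaining = n
    while remaining > 0:
        chunk = conn.recv(remaining)
        if not chunk:
            raise ConnectionError("socket closed before payload was complete")
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)

Sending the decrypted model as a single raw, length-prefixed byte payload avoids re-encoding it through Py4J's protocol, which is where the reported ~100x inflation came from.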
23. Py4J Issue: After Fix
• Memory spikes are gone!
• Ready for production!
24. Recommendations & Lessons Learned
• Do not take scalability for granted!
• Understand Spark’s architecture
– Python/JVM interaction
• Follow best practices
– Iterators, not lists (see the sketch after this list)
– Be careful with joins
• Understand your computing demands
• Test at scale
• Invest in tools
• Think distributed and your code will shine!
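As a concrete illustration of the “iterators, not lists” advice, here is a small hypothetical PySpark example: mapPartitions hands each task an iterator, and yielding results one at a time keeps per-executor memory bounded, while materializing the whole partition as a list is exactly the kind of pattern that produces the OOMs discussed above.

# Hypothetical illustration of the "iterators, not lists" best practice.
from pyspark import SparkContext

sc = SparkContext(appName="iterator-example")
rdd = sc.parallelize(range(1_000_000), numSlices=8)

# Risky: materializes every record of the partition in memory at once.
def transform_as_list(partition):
    return [record * 2 for record in partition]

# Better: streams records through one at a time; memory stays bounded.
def transform_as_iterator(partition):
    for record in partition:
        yield record * 2

doubled = rdd.mapPartitions(transform_as_iterator)
print(doubled.take(5))  # [0, 2, 4, 6, 8]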