Making Sense of Spark Performance (Kay Ousterhout, UC Berkeley)
1. Making Sense of Performance in Data Analytics Frameworks
Kay Ousterhout
Joint work with Ryan Rasti, Sylvia Ratnasamy, Scott Shenker, Byung-Gon Chun
UC Berkeley
2. About Me
PhD student at UC Berkeley
Thesis work focuses on the performance of large-scale distributed systems
Spark PMC member
6–10. How can we make this job faster?
Cache input data in memory
Optimize the network
Mitigate effect of stragglers
[Diagram: many parallel tasks; one straggler task runs for 30s while the others finish in 2–3s]
13. Stragglers
Scarlett [EuroSys '11], SkewTune [SIGMOD '12], LATE [OSDI '08], Mantri [OSDI '10], Dolly [NSDI '13], GRASS [NSDI '14], Wrangler [SoCC '14]
Disk
Themis [SoCC '12], PACMan [NSDI '12], Spark [NSDI '12], Tachyon [SoCC '14]
Network
Load balancing: VL2 [SIGCOMM '09], Hedera [NSDI '10], Sinbad [SIGCOMM '13]
Application semantics: Orchestra [SIGCOMM '11], Baraat [SIGCOMM '14], Varys [SIGCOMM '14]
Reduce data sent: PeriSCOPE [OSDI '12], SUDO [NSDI '12]
In-network aggregation: Camdoop [NSDI '12]
Better isolation and fairness: Oktopus [SIGCOMM '11], EyeQ [NSDI '12], FairCloud [SIGCOMM '12]
Widely-accepted mantras:
Network and disk I/O are bottlenecks
Stragglers are a major issue with unknown causes
14. This work:
(1) How can we quantify performance bottlenecks?
Blocked time analysis
(2) Do the mantras hold?
Takeaways based on three workloads run with Spark
15. Takeaways based on three Spark workloads:
Network optimizations can reduce job completion time by at most 2%
CPU (not I/O) often the bottleneck: <19% reduction in completion time from optimizing disk
Many straggler causes can be identified and fixed
16. Takeaways will not hold for every single analytics workload, nor for all time
17. This work:
Accepted mantras are often not true
Methodology to avoid performance misunderstandings in the future
18. Outline
• Methodology: How can we measure Spark bottlenecks?
• Workloads: What workloads did we use?
• Results: How well do the mantras hold?
• Why?: Why do our results differ from past work?
• Demo: How can you understand your own workload?
21. What exactly happens in a Spark task?
Task reads shuffle data, generates in-memory output
[Timeline of the task's compute, network, and disk use; each segment is the time to handle one shuffle block]
(1) Request a few shuffle blocks
(2) Start processing local data
(3) Process data fetched remotely
(4) Continue fetching remote data
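A minimal Python sketch of the pipelining described above, assuming hypothetical fetch_block and process_block helpers (this is not Spark's actual shuffle-read code): the task requests a few remote blocks, processes local data while those requests are in flight, and keeps a bounded number of fetches outstanding so that network and compute overlap.

from concurrent.futures import ThreadPoolExecutor

MAX_IN_FLIGHT = 5  # request only a few remote blocks at a time

def run_reduce_task(local_blocks, remote_blocks, fetch_block, process_block):
    # (1) Request a few shuffle blocks.
    pending = list(remote_blocks)
    with ThreadPoolExecutor(max_workers=MAX_IN_FLIGHT) as pool:
        in_flight = [pool.submit(fetch_block, b) for b in pending[:MAX_IN_FLIGHT]]
        pending = pending[MAX_IN_FLIGHT:]

        # (2) Start processing local data while those fetches are in flight.
        for block in local_blocks:
            process_block(block)

        # (3) Process data fetched remotely, and (4) continue fetching remote
        # data: each completed fetch is replaced by a new request.
        while in_flight:
            oldest = in_flight.pop(0)
            process_block(oldest.result())  # waits only if the fetch is still running
            if pending:
                in_flight.append(pool.submit(fetch_block, pending.pop(0)))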
22. What's the bottleneck for this task?
Task reads shuffle data, generates in-memory output
[Same timeline: the task is bottlenecked on network and disk, then on the network alone, then on CPU]
23. What's the bottleneck for the job?
[Per-task timelines of compute, network, and disk use]
Task x: may be bottlenecked on different resources at different times
Time t: different tasks may be bottlenecked on different resources
24. How does the network affect the job's completion time?
[Per-task timelines, highlighting the time when each task is blocked on the network]
Blocked time analysis: how much faster would the job complete if tasks never blocked on the network?
25. Blocked time analysis
(1) Measure time when tasks are blocked on the network
(2) Simulate how job completion time would change
26. (1) Measure time when tasks are blocked on the network
[Task timeline of compute, network, and disk, highlighting time blocked on the network and time blocked on disk]
Original task runtime
Best case: task runtime if the network were infinitely fast
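For example, step (1) amounts to subtracting each task's network-blocked time from its measured runtime. A sketch with illustrative field names (Spark's real metric names differ):

def best_case_runtime(task):
    # Runtime the task would have had if it had never blocked on the network.
    # "runtime" and "time_blocked_on_network" are illustrative field names.
    return task["runtime"] - task["time_blocked_on_network"]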
27. (2) Simulate how job completion time would change
[Top: Task 0, Task 1, and Task 2 run on 2 slots over time; t_o = original job completion time]
[Bottom: naively subtracting each task's time blocked on the network gives an incorrectly computed time, because it doesn't account for task scheduling]
28. (2) Simulate how job completion time would change
[Top: Task 0, Task 1, and Task 2 run on 2 slots over time; t_o = original job completion time]
[Bottom: tasks are replayed on the same 2 slots with their network-blocked time removed; t_n = job completion time with an infinitely fast network]
29. Blocked time analysis: how quickly
could a job have completed if a resource
were infinitely fast?
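Step (2) can be approximated by replaying the (shortened) tasks on a fixed number of slots. A simplified sketch, assuming tasks are launched in their original order as soon as a slot frees up (the actual simulation used in this work may differ):

import heapq

def simulate_completion_time(task_runtimes, num_slots):
    # Replay tasks, in their original order, on num_slots slots; each task
    # starts as soon as a slot frees up.
    slots = [0.0] * num_slots
    heapq.heapify(slots)
    finish_time = 0.0
    for runtime in task_runtimes:
        start = heapq.heappop(slots)      # earliest slot to become free
        end = start + runtime
        finish_time = max(finish_time, end)
        heapq.heappush(slots, end)
    return finish_time

# t_o: simulate with the original task runtimes.
# t_n: simulate with best-case runtimes (network-blocked time removed) to see
#      how much faster the job could have completed.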
30. Outline
• Methodology: How can we measure Spark bottlenecks?
• Workloads: What workloads did we use?
• Results: How well do the mantras hold?
• Why?: Why do our results differ from past work?
• Demo: How can you understand your own workload?
32. SQL workloads run on Spark
TPC-DS (20 machines, 850GB; 60 machines, 2.5TB; 200 machines, 2.5TB)
Big Data Benchmark (5 machines, 60GB)
Databricks (production; 9 machines, tens of GB)
2 versions of each: in-memory, on-disk
Caveats: only 3 workloads, small cluster sizes
33. Outline
• Methodology: How can we measure Spark bottlenecks?
• Workloads: What workloads did we use?
• Results: How well do the mantras hold?
• Why?: Why do our results differ from past work?
• Demo: How can you understand your own workload?
34. How much faster could jobs get from optimizing network performance?
35. How much faster could jobs get from optimizing network performance?
[Box plot of per-job improvement; whiskers mark the 5th, 25th, 50th, 75th, and 95th percentiles]
Median improvement: 2%
95th-percentile improvement: 10%
95%ile improvement: 10%
36. How much faster could jobs get from optimizing network performance?
[Box plots of per-job improvement; whiskers mark the 5th, 25th, 50th, 75th, and 95th percentiles]
Median improvement at most 2%
37. How much faster could jobs get from optimizing disk performance?
Median improvement at most 19%
38. How important is CPU?
CPU much more highly utilized than disk or network!
39. What about stragglers?
5–10% improvement from eliminating stragglers (based on simulation)
Can explain >60% of stragglers in >75% of jobs
Fixing the underlying cause can speed up other tasks too: 2x speedup from fixing one straggler cause
40. Takeaways based on three Spark workloads:
Network optimizations can reduce job completion time by at most 2%
CPU (not I/O) often the bottleneck: <19% reduction in completion time from optimizing disk
Many straggler causes can be identified and fixed
41. Outline
• Methodology: How can we measure Spark bottlenecks?
• Workloads: What workloads did we use?
• Results: How well do the mantras hold?
• Why?: Why do our results differ from past work?
• Demo: How can you understand your own workload?
42. Why are our results so different from what's stated in prior work?
Are the workloads we measured unusually
network-light?
How can we compare our workloads to large-scale traces used to motivate prior work?
43. How much data is transferred per CPU second?
Microsoft '09-'10: 1.9–6.35 Mb / task second
Google '04-'07: 1.34–1.61 Mb / machine second
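A rough way to make the same comparison for your own workload, using illustrative per-task fields (not Spark's exact metric names):

def megabits_per_task_second(tasks):
    # Network data sent per second of task time, for comparison with the
    # coarse-grained numbers above. Field names are illustrative.
    total_megabits = sum(t["network_bytes_sent"] for t in tasks) * 8 / 1e6
    total_task_seconds = sum(t["runtime_seconds"] for t in tasks)
    return total_megabits / total_task_seconds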
44. Why are our results so different from what's stated in prior work?
Our workloads are network light
1) Incomplete metrics
2) Conflation of CPU and network time
45. When is the network used?
[Diagram: map tasks read input data locally and shuffle intermediate data to reduce tasks, which write output data]
(1) To shuffle intermediate data
(2) To replicate output data
Some work focuses only on the shuffle
46. How does the data transferred over the network compare to the input data?
Shuffled data is only ~1/3 of the input data, and output data is even less
Not realistic to look only at the shuffle, or to use workloads where all input is shuffled!
47. Prior work conflates CPU and network time
To send data over the network:
(1) Serialize objects into bytes
(2) Send bytes
(1) and (2) are often conflated; reducing the application data sent reduces both!
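A toy Python illustration of the distinction (not Spark code): timing serialization separately from the send shows how much of the apparent network cost is really CPU.

import pickle
import time

def send_records(records, sock):
    t0 = time.time()
    payload = pickle.dumps(records)   # (1) serialize objects into bytes: CPU cost
    t1 = time.time()
    sock.sendall(payload)             # (2) send bytes: network cost
    t2 = time.time()
    # Timing only (t2 - t0) conflates the two; reporting them separately shows
    # how much of the "network" time is actually serialization CPU.
    return {"serialize_seconds": t1 - t0, "send_seconds": t2 - t1}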
48. When does the network matter?
Network important when:
(1) Computation optimized
(2) Serialization time low
(3) Large amount of data sent over the network
49. Why are our results so different from what's stated in prior work?
Our workloads are network light
1) Incomplete metrics
e.g., looking only at shuffle time
2) Conflation of CPU and network time
Sending data over the network has an associated CPU cost
51. Limitations aren't fatal
Only three workloads: they are industry-standard workloads, and the results were sanity-checked against larger production traces
Small cluster sizes: takeaways don't change when we move between cluster sizes
52. Outline
• Methodology: How can we measure Spark bottlenecks?
• Workloads: What workloads did we use?
• Results: How well do the mantras hold?
• Why?: Why do our results differ from past work?
• Demo: How can you understand your own workload?
54. Demo
Often you can tune parameters to shift the bottleneck (e.g., change the compression codec from snappy to lzf)
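For instance, the compression codec is a single Spark configuration setting; a minimal PySpark sketch of the snappy-to-lzf change mentioned above:

from pyspark import SparkConf, SparkContext

# Switch shuffle/I/O compression from the default (snappy) to lzf to trade
# compression CPU time against the amount of data written and sent.
conf = SparkConf().set("spark.io.compression.codec", "lzf")
sc = SparkContext(conf=conf)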
55. What’s missing from Spark metrics?
Time blocked on reading input data and writing output
data (HADOOP-11873)
Time spent spilling intermediate data to disk
(SPARK-3577)
56. Network optimizations can reduce job completion time by at most 2%
CPU (not I/O) often the bottleneck: <19% reduction in completion time from optimizing disk
Many straggler causes can be identified and fixed
All traces and tools publicly available:
tinyurl.com/summit-traces
Takeaway: performance understandability should
be a first-class concern!
(Almost) all instrumentation is now part of Spark
I want your workload! keo@eecs.berkeley.edu