Apache Spark is one of the most popular big data projects, offering greatly improved performance over traditional MapReduce models. Much of Apache Spark’s power comes from lazy evaluation along with intelligent pipelining, which can make debugging more challenging. This talk will examine how to debug Apache Spark applications, the different options for logging in PySpark, as well as some common errors and how to detect them.
Spark’s own internal logging can often be quite verbose, and this talk will examine how to effectively search logs from Apache Spark to spot common problems. In addition to the internal logging, this talk will look at options for logging from within our program itself.
Spark’s accumulators have gotten a bad rap because of how they interact in the event of cache misses or partial recomputes, but this talk will look at how to effectively use Spark’s current accumulators for debugging as well as a look to future for data property type accumulators which may be coming to Spark in future version.
In addition to reading logs, and instrumenting our program with accumulators, Spark’s UI can be of great help for quickly detecting certain types of problems.
Debuggers are a wonderful tool, however when you have 100 computers the “wonder” can be a bit more like “pain”. This talk will look at how to connect remote debuggers, but also remind you that it’s probably not the easiest path forward.
Model Call Girl in Jamuna Vihar Delhi reach out to us at 🔝9953056974🔝
Debugging PySpark: Understanding Errors and Tracing Failures
1. Debugging PySpark
--
Or trying to make sense of a JVM stack trace when
you were minding your own bus
with your friend
Holden
Jody McIntyre
2. Debugging PySpark
--
Or trying to make sense of a JVM stack trace when
you were minding your own bus
with your friend
Holden
And the JVM :p Jody McIntyre
3. Holden:
● My name is Holden Karau
● Prefered pronouns are she/her
● Developer Advocate at Google
● Apache Spark PMC, Beam contributor
● previously IBM, Alpine, Databricks, Google, Foursquare & Amazon
● co-author of Learning Spark & High Performance Spark
● Twitter: @holdenkarau
● Slide share http://www.slideshare.net/hkarau
● Code review livestreams: https://www.twitch.tv/holdenkarau /
https://www.youtube.com/user/holdenkarau
● Spark Talk Videos http://bit.ly/holdenSparkVideos
● Talk feedback (if you are so inclined): http://bit.ly/holdenTalkFeedback
4.
5. Who is Boo?
● Boo uses she/her pronouns (as I told the Texas house committee)
● Best doge
● Lot’s of experience barking at computers to make them go faster
● Author of “Learning to Bark” & “High Performance Barking”
● On twitter @BooProgrammer
6. Who do we think y’all are?
● Friendly[ish] people
● Don’t mind pictures of cats or stuffed animals
● May or may not know some Python
○ If you’re new to Python welcome to the community! If this isn’t your first rodeo thanks for
making this place awesome :)
● Might know some Spark
● Want to debug your Spark applications
● Ok with things getting a little bit silly
Lori Erickson
7. What will be covered?
● What is PySpark & why it’s hard to debug
● Getting at Spark’s logs & persisting them
● What your options for logging are
● Attempting to understand common Spark error messages
● Understanding the DAG (and how pipelining can impact your life)
● Subtle attempts to get you to use spark-testing-base or similar
● Fancy Java Debugging tools & clusters - not entirely the path of sadness
● Holden’s even less subtle attempts to get you to buy her new book
● Pictures of cats & stuffed animals
8. Key takeaways
● Debugging distributed systems is hard
● Debugging mixed language systems is hard
● Lazy evaluation can be difficult to debug
○ We’re even talking about it on the dev@ mailing list (maybe partially fixing with a better
repr).
● PySpark is both of those things
● Don’t feel bad when you get lost
● Some common pointers or tricks to help you on your debugging journey
● After you’ve debugged your code please add tests for it!
jeffreyw
9. What is Spark?
● General purpose distributed system
○ With a really nice API including Python :)
● Apache project (one of the most
active)
● Must faster than Hadoop
Map/Reduce
● Good when too big for a single
machine
● Built on top of two abstractions for
distributed data: RDDs & Datasets
10. Why people come to Spark:
Well this MapReduce
job is going to take
16 hours - how long
could it take to learn
Spark?
dougwoods
11. Why people come to Spark:
My DataFrame won’t fit
in memory on my cluster
anymore, let alone my
MacBook Pro :( Maybe
this Spark business will
solve that...
brownpau
12. The different pieces of Spark
Apache Spark
SQL, DataFrames & Datasets
Structured
Streaming
Scala,
Java,
Python, &
R
Spark ML
bagel &
Graph X
MLLib
Scala,
Java,
PythonStreaming
Graph
Frames
Paul Hudson
13. Word count - Because this is Big Data
lines = sc.textFile(src)
words = lines.flatMap(lambda x: x.split(" "))
word_count =
(words.map(lambda x: (x, 1))
.reduceByKey(lambda x, y: x+y))
word_count.saveAsTextFile("output")
No data is read or
processed until after
this line
This is an “action”
which forces spark to
evaluate the RDD
daniilr
15. So where are the logs/errors?
(e.g. before we can identify a monster we have to find it)
● Error messages reported to the console*
● Log messages reported to the console*
● Log messages on the workers - access through the
Spark Web UI or Spark History Server :)
● Where to error: driver versus worker
(*When running in client mode)
PROAndrey
16. One weird trick to debug anything
● Don’t read the logs (yet)
● Draw (possibly in your head) a model of how you think a
working app would behave
● Then predict where in that model things are broken
● Now read logs to prove or disprove your theory
● Repeat
Krzysztof Belczyński
17. Working in YARN?
(e.g. before we can identify a monster we have to find it)
● Note: not the Javascript Yarn
○ Think “enterprise scheduling solution” written Java
○ Like Kubernetes, but with 80% more Java! Go TEAM!
● Use yarn logs to get logs after log collection
● Or set up the Spark history server
● Or yarn.nodemanager.delete.debug-delay-sec :)
Lauren Mitchell
18. Spark is pretty verbose by default
● Most of the time it tells you things you already know
● Or don’t need to know
● You can dynamically control the log level with
sc.setLogLevel
● This is especially useful to increase logging near the
point of error in your code
19. But what about when we get an error?
● Python Spark errors come in two-ish-parts often
● JVM Stack Trace (Friend Monster - comes most errors)
● Python Stack Trace (Boo - has information)
● Buddy - Often used to report the information from Friend
Monster and Boo
20. So what is that JVM stack trace?
● Java/Scala
○ Normal stack trace
○ Sometimes can come from worker or driver, if from worker may be
repeated several times for each partition & attempt with the error
○ Driver stack trace wraps worker stack trace
● R/Python
○ Same as above but...
○ Doesn’t want your actual error message to get lonely
○ Wraps any exception on the workers (& some exceptions on the
drivers)
○ Not always super useful
21. Let’s make a simple mistake & debug :)
● Error in transformation (divide by zero)
Image by: Tomomi
22. Bad outer transformation (Python):
data = sc.parallelize(range(10))
transform1 = data.map(lambda x: x + 1)
transform2 = transform1.map(lambda x: x / 0)
transform2.count()
David Martyn
Hunt
23. Let’s look at the error messages for it:
[Stage 0:> (0 + 0) / 4]17/02/01 09:52:07 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/home/holden/repos/spark/python/lib/pyspark.zip/pyspark/worker.py", line 180, in main
process()
File "/home/holden/repos/spark/python/lib/pyspark.zip/pyspark/worker.py", line 175, in process
serializer.dump_stream(func(split_index, iterator), outfile)
File "/home/holden/repos/spark/python/pyspark/rdd.py", line 2406, in pipeline_func
return func(split, prev_func(split, iterator))
File "/home/holden/repos/spark/python/pyspark/rdd.py", line 2406, in pipeline_func
return func(split, prev_func(split, iterator))
File "/home/holden/repos/spark/python/pyspark/rdd.py", line 2406, in pipeline_func
return func(split, prev_func(split, iterator))
File "/home/holden/repos/spark/python/pyspark/rdd.py", line 345, in func
return f(iterator)
File "/home/holden/repos/spark/python/pyspark/rdd.py", line 1040, in <lambda>
return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum()
Continued for ~400 lines
File "high_performance_pyspark/bad_pyspark.py", line 32, in <lambda>
24. Working in Jupyter?
“The error messages were so useless -
I looked up how to disable error reporting in Jupyter”
(paraphrased from PyData DC)
25. Solution* upgrade Spark
Now (2.2+) with 80% more working*
(Some log messages are still lost but
we pass most of the useful part of the error message)
26. Working in Jupyter - try your terminal for help
Possibly fix by https://issues.apache.org/jira/browse/SPARK-19094 but may not get in
tonynetone
AttributeError: unicode object has no attribute
endsWith
29. A scroll down (not quite to the bottom)
File "high_performance_pyspark/bad_pyspark.py",
line 32, in <lambda>
transform2 = transform1.map(lambda x: x / 0)
ZeroDivisionError: integer division or modulo by zero
30. Or look at the bottom of console logs:
File "/home/holden/repos/spark/python/lib/pyspark.zip/pyspark/worker.py", line
180, in main
process()
File "/home/holden/repos/spark/python/lib/pyspark.zip/pyspark/worker.py", line
175, in process
serializer.dump_stream(func(split_index, iterator), outfile)
File "/home/holden/repos/spark/python/pyspark/rdd.py", line 2406, in
pipeline_func
return func(split, prev_func(split, iterator))
File "/home/holden/repos/spark/python/pyspark/rdd.py", line 2406, in
pipeline_func
return func(split, prev_func(split, iterator))
File "/home/holden/repos/spark/python/pyspark/rdd.py", line 2406, in
pipeline_func
return func(split, prev_func(split, iterator))
31. Or look at the bottom of console logs:
File "/home/holden/repos/spark/python/pyspark/rdd.py", line 345, in func
return f(iterator)
File "/home/holden/repos/spark/python/pyspark/rdd.py", line 1040, in <lambda>
return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum()
File "/home/holden/repos/spark/python/pyspark/rdd.py", line 1040, in <genexpr>
return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum()
File "high_performance_pyspark/bad_pyspark.py", line 32, in <lambda>
transform2 = transform1.map(lambda x: x / 0)
ZeroDivisionError: integer division or modulo by zero
33. Pipelines (& Python)
● Some pipelining happens inside of Python
○ For performance (less copies from Python to Scala)
● DAG visualization is generated inside of Scala
○ Misses Python pipelines :(
Regardless of language
● Can be difficult to determine which element failed
● Stack trace _sometimes_ helps (it did this time)
● take(1) + count() are your friends - but a lot of work :(
● persist can help a bit too.
Arnaud Roberti
34. Side note: Lambdas aren’t always your friend
● Lambda’s can make finding the error more challenging
● I love lambda x, y: x / y as much as the next human but
when y is zero :(
● A small bit of refactoring for your debugging never hurt
anyone*
● If your inner functions are causing errors it’s a good time
to have tests for them!
● Difficult to put logs inside of them
*A blatant lie, but…. it hurts less often than it helps
Zoli Juhasz
35. Testing - you should do it!
● spark-testing-base provides simple classes to build your
Spark tests with
○ It’s available on pip & maven central
● That’s a talk unto itself though (and it's on YouTube)
37. What does the future look like?*
*Source: https://databricks.com/blog/2017/10/30/introducing-vectorized-udfs-for-pyspark.html.
*Vendor
benchmark.
Trust but verify.
38. What about Arrow powered magic?
add = pandas_udf(lambda x, y: x + y, IntegerType())
James Willamor
39. And when it goes sideways….
Caused by: org.apache.spark.api.python.PythonException:
Traceback (most recent call last):
File
"/home/hkarau/Downloads/spark-2.3.0-bin-hadoop2.7/python/lib/pys
park.zip/pyspark/worker.py", line 229, in main
process()
File
"/home/hkarau/Downloads/spark-2.3.0-bin-hadoop2.7/python/lib/pys
park.zip/pyspark/worker.py", line 224, in process
serializer.dump_stream(func(split_index, iterator), outfile)
Gene Spesard
40. And when it goes sideways….
File
"/home/hkarau/Downloads/spark-2.3.0-bin-hadoop2.7/python/lib/pys
park.zip/pyspark/worker.py", line 149, in <lambda>
func = lambda _, it: map(mapper, it)
File "<string>", line 1, in <lambda>
File
"/home/hkarau/Downloads/spark-2.3.0-bin-hadoop2.7/python/lib/pys
park.zip/pyspark/worker.py", line 90, in <lambda>
return lambda *a: (verify_result_length(*a),
arrow_return_type)
Gene Spesard
41. And when it goes sideways….
File
"/home/hkarau/Downloads/spark-2.3.0-bin-hadoop2.7/python/lib/pys
park.zip/pyspark/worker.py", line 84, in verify_result_length
"Pandas.Series, but is {}".format(type(result)))
TypeError: Return type of the user-defined functon should be
Pandas.Series, but is <type 'float'>
Gene Spesard
Yes this is a typo. And it would
be a great starter issue to fix if
you wanted to get started
contributing to PySpark!
This error message could be
improved (e.g. returning a
integer would have been valid).
It’s probably a great 2nd issue
(or first if someone scoops you
on the other one).
42. What to look for?
org.apache.spark.api.python.PythonException: Traceback (most
recent call last):
What about when we don’t find it?
Probably not an error (directly) inside one of our
transformations, that’s awesome we can focus our search
elsewhere!
torne (where's my lens cap?)
43. Adding your own logging:
● Java users use Log4J & friends
● Python users: use logging library (or even print!)
● Accumulators
○ Behave a bit weirdly, don’t put large amounts of data in them
44. Ok what about if we run out of memory?
In the middle of some Java stack traces:
File "/home/holden/repos/spark/python/lib/pyspark.zip/pyspark/worker.py", line 180, in main
process()
File "/home/holden/repos/spark/python/lib/pyspark.zip/pyspark/worker.py", line 175, in process
serializer.dump_stream(func(split_index, iterator), outfile)
File "/home/holden/repos/spark/python/pyspark/rdd.py", line 2406, in pipeline_func
return func(split, prev_func(split, iterator))
File "/home/holden/repos/spark/python/pyspark/rdd.py", line 2406, in pipeline_func
return func(split, prev_func(split, iterator))
File "/home/holden/repos/spark/python/pyspark/rdd.py", line 2406, in pipeline_func
return func(split, prev_func(split, iterator))
File "/home/holden/repos/spark/python/pyspark/rdd.py", line 345, in func
return f(iterator)
File "/home/holden/repos/spark/python/pyspark/rdd.py", line 1040, in <lambda>
return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum()
File "/home/holden/repos/spark/python/pyspark/rdd.py", line 1040, in <genexpr>
return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum()
File "high_performance_pyspark/bad_pyspark.py", line 132, in generate_too_much
return range(10000000000000)
MemoryError
45. Tubbs doesn’t always look the same
● Out of memory can be pure JVM (worker)
○ OOM exception during join
○ GC timelimit exceeded
● OutOfMemory error, Executors being killed by kernel,
etc.
● Running in YARN? “Application overhead exceeded”
● JVM out of memory on the driver side from Py4J
46. Reasons for JVM worker OOMs
(w/PySpark)
● Unbalanced shuffles
● Buffering of Rows with PySpark + UDFs
○ If you have a down stream select move it up stream
● Individual jumbo records (after pickling)
● Off-heap storage
● Native code memory leak
47. Reasons for Python worker OOMs
(w/PySpark)
● Insufficient memory reserved for Python worker
● Jumbo records
● Eager entire partition evaluation (e.g. sort +
mapPartitions)
● Too large partitions (unbalanced or not enough
partitions)
48. And loading invalid paths:
org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/doesnotexist
at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:251)
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:270)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:202)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
49. Connecting Python Debuggers
● You’re going to have to change your code a bit :(
● You can use broadcast + singleton “hack” to start pydev
or desired remote debugging lib on all of the interpreters
● See https://wiki.python.org/moin/PythonDebuggingTools
for your remote debugging options and pick the one that
works with your toolchain
shadow planet
50. General techniques:
● Move take(1) up the dependency chain
● DAG in the WebUI -- less useful for Python :(
● toDebugString -- also less useful in Python :(
● Sample data and run locally
● Running in cluster mode? Consider debugging in client
mode
Melissa Wilkins
51. Key takeaways
● Debugging distributed systems is hard
● Debugging mixed language systems is hard
● Lazy evaluation can be difficult to debug
○ We’re even talking about it on the mailing list (maybe partially fixing
with a better repr).
● PySpark is both of those things
● Don’t feel bad when you get lost
● Some common pointers or tricks to help you on your
debugging journey
● After you’ve debugged your code please add tests for it!
jeffreyw
52. Learning Spark
Fast Data
Processing with
Spark
(Out of Date)
Fast Data
Processing with
Spark
(2nd edition)
Advanced
Analytics with
Spark
Spark in Action
Coming soon:
High Performance Spark
Learning PySpark
53. High Performance Spark!
Available today!
You can buy it from that scrappy Seattle bookstore, Jeff
Bezos needs another newspaper and I want a cup of
coffee.
http://bit.ly/hkHighPerfSpark
54. And some upcoming talks:
● May
○ Scala Days Berlin - Keeping the "fun" in Apache Spark: Datasets and
FP
○ Strata London - Spark Auto Tuning
○ JOnTheBeach (Spain) - General Purpose Big Data Systems are
eating the world - Is Tool Consolidation inevitable?
● June
○ Spark Summit SF (Accelerating Tensorflow & Accelerating Python +
Dependencies)
○ Scala Days NYC
○ FOSS Back Stage & BBuzz
● July
55. k thnx bye :)
If you care about Spark testing and
don’t hate surveys:
http://bit.ly/holdenTestingSpark
Will tweet results
“eventually” @holdenkarau
Any PySpark Users: Have some
simple UDFs you wish ran faster
you are willing to share?:
http://bit.ly/pySparkUDF
Pssst: Have feedback on the presentation? Give me a
shout (holden@pigscanfly.ca) if you feel comfortable doing
so :)
56. Also not all errors are “hard” errors
● Parsing input? Going to reject some malformed records
● flatMap or filter + map can make this simpler
● Still want to track number of rejected records (see
accumulators)
● Invest in dead letter queues
○ e.g. write malformed records to an Apache Kafka topic
Mustafasari
57. So using names & logging & accs could be:
data = sc.parallelize(range(10))
rejectedCount = sc.accumulator(0)
def loggedDivZero(x):
import logging
try:
return [x / 0]
except Exception as e:
rejectedCount.add(1)
logging.warning("Error found " + repr(e))
return []
transform1 = data.flatMap(loggedDivZero)
transform2 = transform1.map(add1)
transform2.count()
print("Reject " + str(rejectedCount.value))
58. Connecting Java Debuggers
● Add the JDWP incantation to your JVM launch:
-agentlib:jdwp=transport=dt_socket,server=y,address=[
debugport]
○ spark.executor.extraJavaOptions to attach debugger on the executors
○ --driver-java-options to attach on the driver process
○ Add “suspend=y” if only debugging a single worker & exiting too
quickly
● JDWP debugger is IDE specific - Eclipse & IntelliJ have
docs
● Honestly, as PySpark users you probably don’t want to
do this.
shadow planet