Debugging Apache Spark
“Professional Stack Trace Reading”
with your friends
Holden & Joey
Holden:
● My name is Holden Karau
● Preferred pronouns are she/her
● Developer Advocate at Google
● Apache Spark committer (as of January!) :)
● Previously at IBM, Alpine, Databricks, Google, Foursquare & Amazon
● co-author of Learning Spark & High Performance Spark
● @holdenkarau
● SlideShare http://www.slideshare.net/hkarau
● LinkedIn https://www.linkedin.com/in/holdenkarau
● GitHub https://github.com/holdenk
● Spark Videos http://bit.ly/holdenSparkVideos
Debugging Spark: Scala and Python - Super Happy Fun Times @ Data Day Texas 2018
Who is Boo?
● Boo uses she/her pronouns (as I told the Texas house committee)
● Best doge
● Lots of experience barking at computers to make them go faster
● Author of “Learning to Bark” & “High Performance Barking”
● On twitter @BooProgrammer
Who is Joey?
● Joey Echeverria
● Preferred pronouns: he/him
● Stream Processing at Splunk
● Previously at Rocana and Cloudera
● Lots of experience feeling the pain of debugging distributed systems
● Author of “Hadoop Security”
● @fwiffo
● https://github.com/joey
Who do we think y’all are?
● Friendly[ish] people
● Don’t mind pictures of cats or stuffed animals
● Know some Spark
● Want to debug your Spark applications
● Ok with things getting a little bit silly
Lori Erickson
What will be covered?
● Getting at Spark’s logs & persisting them
● What your options for logging are
● Attempting to understand common Spark error messages
● Understanding the DAG (and how pipelining can impact your life)
● Subtle attempts to get you to use spark-testing-base or similar
● Fancy Java Debugging tools & clusters - not entirely the path of sadness
● Holden’s even less subtle attempts to get you to buy her new book
● Pictures of cats & stuffed animals
Aka: Building our Monster Identification Guide
So where are the logs/errors?
(e.g. before we can identify a monster we have to find it)
● Error messages reported to the console*
● Log messages reported to the console*
● Log messages on the workers - access through the
Spark Web UI or Spark History Server :)
● Where the error occurs: driver versus worker
(*When running in client mode)
PROAndrey
One weird trick to debug anything
● Don’t read the logs (yet)
● Draw (possibly in your head) a model of how you think a
working app would behave
● Then predict where in that model things are broken
● Now read logs to prove or disprove your theory
● Repeat
Krzysztof Belczyński
Working in YARN?
(e.g. before we can identify a monster we have to find it)
● Use yarn logs to fetch logs after log aggregation
● Or set up the Spark history server
● Or yarn.nodemanager.delete.debug-delay-sec :)
Lauren Mitchell
Spark is pretty verbose by default
● Most of the time it tells you things you already know
● Or don’t need to know
● You can dynamically control the log level with
sc.setLogLevel
● This is especially useful for increasing logging near the point of error in your code (see the sketch below)
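For example, assuming an existing SparkContext named sc (the level names are the standard log4j ones):

# Quiet the global chatter first...
sc.setLogLevel("WARN")

data = sc.parallelize(range(10)).map(lambda x: x + 1)

# ...then crank the detail back up only around the suspect action.
sc.setLogLevel("DEBUG")
data.count()
sc.setLogLevel("WARN")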
But what about when we get an error?
● Python Spark errors often come in two-ish parts
● JVM Stack Trace (Friend Monster - comes with most errors)
● Python Stack Trace (Boo - has information)
● Buddy - Often used to report the information from Friend
Monster and Boo
So what is that JVM stack trace?
● Java/Scala
○ Normal stack trace
○ Can come from the worker or the driver; if from a worker it may be repeated several times, once for each partition & attempt that hit the error
○ Driver stack trace wraps worker stack trace
● R/Python
○ Same as above but...
○ Doesn’t want your actual error message to get lonely
○ Wraps any exception on the workers (& some exceptions on the
drivers)
○ Not always super useful
Let’s make a simple mistake & debug :)
● Error in transformation (divide by zero)
Image by: Tomomi
Bad outer transformation (Scala):
val data = sc.parallelize(1 to 10) // not shown on the slide; added so the snippet stands alone
val transform1 = data.map(x => x + 1)
val transform2 = transform1.map(x => x / 0) // Will throw an exception when forced to evaluate
transform2.count() // Forces evaluation
David Martyn Hunt
Let’s look at the error messages for it:
17/01/23 12:41:36 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
java.lang.ArithmeticException: / by zero
at com.highperformancespark.examples.errors.Throws$$anonfun$1.apply$mcII$sp(throws.scala:9)
at com.highperformancespark.examples.errors.Throws$$anonfun$1.apply(throws.scala:9)
at com.highperformancespark.examples.errors.Throws$$anonfun$1.apply(throws.scala:9)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:370)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:370)
at scala.collection.Iterator$class.foreach(Iterator.scala:750)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1202)
at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:295)
at scala.collection.AbstractIterator.to(Iterator.scala:1202)
at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:287)
at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1202)
Continued for ~100 lines
at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:274)
Bad outer transformation (Python):
data = sc.parallelize(range(10))
transform1 = data.map(lambda x: x + 1)
transform2 = transform1.map(lambda x: x / 0)  # Will throw an exception when forced to evaluate
transform2.count()  # Forces evaluation
David Martyn Hunt
Let’s look at the error messages for it:
[Stage 0:> (0 + 0) / 4]17/02/01 09:52:07 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/home/holden/repos/spark/python/lib/pyspark.zip/pyspark/worker.py", line 180, in main
process()
File "/home/holden/repos/spark/python/lib/pyspark.zip/pyspark/worker.py", line 175, in process
serializer.dump_stream(func(split_index, iterator), outfile)
File "/home/holden/repos/spark/python/pyspark/rdd.py", line 2406, in pipeline_func
return func(split, prev_func(split, iterator))
File "/home/holden/repos/spark/python/pyspark/rdd.py", line 2406, in pipeline_func
return func(split, prev_func(split, iterator))
File "/home/holden/repos/spark/python/pyspark/rdd.py", line 2406, in pipeline_func
return func(split, prev_func(split, iterator))
File "/home/holden/repos/spark/python/pyspark/rdd.py", line 345, in func
return f(iterator)
File "/home/holden/repos/spark/python/pyspark/rdd.py", line 1040, in <lambda>
return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum()
Continued for ~400 lines
File "high_performance_pyspark/bad_pyspark.py", line 32, in <lambda>
Working in Jupyter?
“The error messages were so useless -
I looked up how to disable error reporting in Jupyter”
(paraphrased from PyData DC)
Almost working in Jupyter?
Now with 80% more working*
(Log messages are still lost but we pass through the useful
part of the error message)
Working in Jupyter - try your terminal for help
Possibly fixed by https://issues.apache.org/jira/browse/SPARK-19094, but it may not get in
tonynetone
AttributeError: unicode object has no attribute endsWith
Ok maybe the web UI is easier? Mr Thinktank
And click through... afu007
A scroll down (not quite to the bottom)
File "high_performance_pyspark/bad_pyspark.py",
line 32, in <lambda>
transform2 = transform1.map(lambda x: x / 0)
ZeroDivisionError: integer division or modulo by zero
Or look at the bottom of console logs:
File "/home/holden/repos/spark/python/lib/pyspark.zip/pyspark/worker.py", line 180, in main
process()
File "/home/holden/repos/spark/python/lib/pyspark.zip/pyspark/worker.py", line 175, in process
serializer.dump_stream(func(split_index, iterator), outfile)
File "/home/holden/repos/spark/python/pyspark/rdd.py", line 2406, in pipeline_func
return func(split, prev_func(split, iterator))
File "/home/holden/repos/spark/python/pyspark/rdd.py", line 2406, in pipeline_func
return func(split, prev_func(split, iterator))
File "/home/holden/repos/spark/python/pyspark/rdd.py", line 2406, in pipeline_func
return func(split, prev_func(split, iterator))
Or look at the bottom of console logs:
File "/home/holden/repos/spark/python/pyspark/rdd.py", line 345, in func
return f(iterator)
File "/home/holden/repos/spark/python/pyspark/rdd.py", line 1040, in <lambda>
return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum()
File "/home/holden/repos/spark/python/pyspark/rdd.py", line 1040, in <genexpr>
return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum()
File "high_performance_pyspark/bad_pyspark.py", line 32, in <lambda>
transform2 = transform1.map(lambda x: x / 0)
ZeroDivisionError: integer division or modulo by zero
And in scala….
Caused by: java.lang.ArithmeticException: / by zero
at com.highperformancespark.examples.errors.Throws$$anonfun$4.apply$mcII$sp(throws.scala:17)
at com.highperformancespark.examples.errors.Throws$$anonfun$4.apply(throws.scala:17)
at com.highperformancespark.examples.errors.Throws$$anonfun$4.apply(throws.scala:17)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:370)
at scala.collection.Iterator$class.foreach(Iterator.scala:750)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1202)
at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:295)
(Aside): DAG differences illustrated Melissa Wilkins
Pipelines (& Python)
● Some pipelining happens inside of Python
○ For performance (fewer copies between Python and the JVM)
● DAG visualization is generated inside of Scala
○ Misses Python pipelines :(
Regardless of language
● Can be difficult to determine which element failed
● Stack trace _sometimes_ helps (it did this time)
● take(1) + count() are your friends - but a lot of work :(
● persist can help a bit too (see the sketch below).
Arnaud Roberti
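A minimal sketch of that bisection approach, assuming an existing SparkContext named sc (the transformations are hypothetical stand-ins):

data = sc.parallelize(range(10))
step1 = data.map(lambda x: x + 1)
step2 = step1.map(lambda x: 10 / (x - 5))  # suspect transformation

step1.persist()        # avoid recomputing the earlier work on every probe
print(step1.take(1))   # cheap check: does the first stage produce anything at all?
print(step1.count())   # forces full evaluation of the first stage only
print(step2.count())   # if this call blows up, the bug is in step2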
Side note: Lambdas aren’t always your friend
● Lambdas can make finding the error more challenging
● I love lambda x, y: x / y as much as the next human but
when y is zero :(
● A small bit of refactoring for your debugging never hurt anyone* (see the sketch below)
● If your inner functions are causing errors it’s a good time
to have tests for them!
● Difficult to put logs inside of them
*A blatant lie, but…. it hurts less often than it helps
Zoli Juhasz
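A sketch of that refactoring (the function and test names here are made up for illustration):

def safe_ratio(x, y):
    # A named function shows up by name in stack traces and is easy to unit test.
    if y == 0:
        return None  # or raise a more descriptive error
    return x / y

# Instead of pairs.map(lambda kv: kv[0] / kv[1]):
# result = pairs.map(lambda kv: safe_ratio(kv[0], kv[1]))

def test_safe_ratio_handles_zero():
    assert safe_ratio(4, 2) == 2
    assert safe_ratio(4, 0) is None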
Testing - you should do it!
● spark-testing-base provides simple classes to build your
Spark tests with
○ It’s available on pip & maven central
● That’s a talk unto itself though (and it's on YouTube); a rough local-mode sketch follows
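As a very rough sketch of what a local-mode test looks like, using plain unittest rather than spark-testing-base's own helper classes (which mostly exist to save you the setUp/tearDown boilerplate below):

import unittest
from pyspark import SparkContext

def add1(x):
    return x + 1

class AddOneTest(unittest.TestCase):
    def setUp(self):
        self.sc = SparkContext("local[2]", "add1-test")

    def tearDown(self):
        self.sc.stop()

    def test_add1(self):
        result = self.sc.parallelize([1, 2, 3]).map(add1).collect()
        self.assertEqual(result, [2, 3, 4])

if __name__ == "__main__":
    unittest.main()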
Adding your own logging:
● Java users use Log4J & friends
● Python users: use logging library (or even print!)
● Accumulators
○ Behave a bit weirdly, don’t put large amounts of data in them
Also not all errors are “hard” errors
● Parsing input? Going to reject some malformed records
● flatMap or filter + map can make this simpler
● Still want to track number of rejected records (see
accumulators)
● Invest in dead letter queues
○ e.g. write malformed records to an Apache Kafka topic (see the sketch below)
Mustafasari
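A hedged sketch of the dead letter queue idea, assuming the kafka-python package is installed on the workers and a "bad-records" topic exists (both are illustrative choices, not something from the talk):

def parse_or_dead_letter(partition):
    from kafka import KafkaProducer  # one producer per partition, created on the worker
    producer = KafkaProducer(bootstrap_servers="localhost:9092")
    for line in partition:
        try:
            yield int(line)  # stand-in for real parsing
        except ValueError:
            producer.send("bad-records", line.encode("utf-8"))
    producer.flush()

raw_records = sc.parallelize(["1", "2", "not-a-number"])
parsed = raw_records.mapPartitions(parse_or_dead_letter)
print(parsed.collect())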
So using names & logging & accs could be:
data = sc.parallelize(range(10))
rejectedCount = sc.accumulator(0)

def loggedDivZero(x):
    import logging
    try:
        return [x / 0]
    except Exception as e:
        rejectedCount.add(1)
        logging.warning("Error found " + repr(e))
        return []

def add1(x):  # not shown on the slide; defined here so the snippet runs
    return x + 1

transform1 = data.flatMap(loggedDivZero)
transform2 = transform1.map(add1)
transform2.count()
print("Reject " + str(rejectedCount.value))
Ok what about if we run out of memory?
In the middle of some Java stack traces:
File "/home/holden/repos/spark/python/lib/pyspark.zip/pyspark/worker.py", line 180, in main
process()
File "/home/holden/repos/spark/python/lib/pyspark.zip/pyspark/worker.py", line 175, in process
serializer.dump_stream(func(split_index, iterator), outfile)
File "/home/holden/repos/spark/python/pyspark/rdd.py", line 2406, in pipeline_func
return func(split, prev_func(split, iterator))
File "/home/holden/repos/spark/python/pyspark/rdd.py", line 2406, in pipeline_func
return func(split, prev_func(split, iterator))
File "/home/holden/repos/spark/python/pyspark/rdd.py", line 2406, in pipeline_func
return func(split, prev_func(split, iterator))
File "/home/holden/repos/spark/python/pyspark/rdd.py", line 345, in func
return f(iterator)
File "/home/holden/repos/spark/python/pyspark/rdd.py", line 1040, in <lambda>
return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum()
File "/home/holden/repos/spark/python/pyspark/rdd.py", line 1040, in <genexpr>
return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum()
File "high_performance_pyspark/bad_pyspark.py", line 132, in generate_too_much
return range(10000000000000)
MemoryError
Tubbs doesn’t always look the same
● Out of memory can be pure JVM (worker)
○ OOM exception during join
○ GC overhead limit exceeded
● OutOfMemory error, Executors being killed by kernel,
etc.
● Running in YARN? “Container killed by YARN for exceeding memory limits”
● JVM out of memory on the driver side from Py4J
Reasons for JVM worker OOMs
(w/PySpark)
● Unbalanced shuffles
● Buffering of Rows with PySpark + UDFs
○ If you have a downstream select, move it upstream
● Individual jumbo records (after pickling)
● Off-heap storage
● Native code memory leak
Reasons for Python worker OOMs
(w/PySpark)
● Insufficient memory reserved for Python worker
● Jumbo records
● Eager entire partition evaluation (e.g. sort +
mapPartitions)
● Partitions that are too large (unbalanced or not enough partitions) - see the config sketch below
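A sketch of the usual knobs for both of the previous slides; the values are placeholders, and the overhead setting uses the older YARN-specific name that matches this era of Spark:

from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .set("spark.executor.memory", "4g")                 # JVM heap per executor
        .set("spark.yarn.executor.memoryOverhead", "1024")  # MB left for Python workers & off-heap
        .set("spark.python.worker.memory", "512m"))         # per-Python-worker limit before spilling
sc = SparkContext(conf=conf)

# Unbalanced or jumbo partitions are the other common culprit:
# data = data.repartition(200)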
And loading invalid paths:
org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/doesnotexist
at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:251)
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:270)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:202)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
Connecting Java Debuggers
● Add the JDWP incantation to your JVM launch:
-agentlib:jdwp=transport=dt_socket,server=y,address=[debugport]
○ spark.executor.extraJavaOptions to attach debugger on the executors
○ --driver-java-options to attach on the driver process
○ Add “suspend=y” if only debugging a single worker & exiting too
quickly
● JDWP debuggers are IDE specific - Eclipse & IntelliJ both have docs (see the sketch below)
shadow planet
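For example, attaching to the executors from PySpark might look like this (the port and app name are placeholders; in client mode the driver JVM has already started, so driver-side options have to go through spark-submit's --driver-java-options instead):

from pyspark import SparkConf, SparkContext

jdwp = "-agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=5005"
conf = (SparkConf()
        .setAppName("debuggable-app")
        .set("spark.executor.extraJavaOptions", jdwp))
sc = SparkContext(conf=conf)
# Then point your IDE's remote (JDWP) debugger at executor-host:5005.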
Connecting Python Debuggers
● You’re going to have to change your code a bit :(
● You can use a broadcast + singleton “hack” to start pydev or your desired remote debugging lib on all of the interpreters (see the sketch below)
● See https://wiki.python.org/moin/PythonDebuggingTools
for your remote debugging options and pick the one that
works with your toolchain
shadow planet
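One possible shape for that hack, simplified to a module-level singleton instead of a broadcast variable (pydevd.settrace is a real call, but it only works if pydevd is installed on every worker and can reach the debug host/port, which are made up here):

_attached = False  # one attach per Python worker process

def attach_debugger_once(host="my-laptop.example.com", port=5678):
    global _attached
    if not _attached:
        import pydevd
        pydevd.settrace(host, port=port, suspend=False,
                        stdoutToServer=True, stderrToServer=True)
        _attached = True

def traced(x):
    attach_debugger_once()
    return x + 1

print(sc.parallelize(range(10)).map(traced).collect())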
Alternative approaches:
● Move take(1) up the dependency chain
● DAG in the WebUI -- less useful for Python :(
● toDebugString -- also less useful in Python :(
● Sample data and run locally (see the sketch below)
● Running in cluster mode? Consider debugging in client
mode
Melissa Wilkins
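For instance, sampling and running locally might look like this (the data, fraction, and suspect_transformation are stand-ins for your own job):

data = sc.parallelize(range(100000))

def suspect_transformation(record):  # stand-in for your real logic
    return record + 1

local_sample = data.sample(withReplacement=False, fraction=0.001, seed=42).collect()

# Plain local Python: pdb, print, and your IDE all work normally here.
for record in local_sample:
    print(suspect_transformation(record))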
Learning Spark
Fast Data Processing with Spark (Out of Date)
Fast Data Processing with Spark (2nd edition)
Advanced Analytics with Spark
Spark in Action
Coming soon:
High Performance Spark
Learning PySpark
High Performance Spark!
Available today!
You can buy it from that scrappy Seattle bookstore, Jeff
Bezos needs another newspaper and I want a cup of
coffee.
http://bit.ly/hkHighPerfSpark
And some upcoming talks:
● Jan
○ Data Day Texas - Another talk @ 2PM
● Feb
○ FOSDEM - One on testing one on scaling
○ JFokus in Stockholm - Adding deep learning to Spark
○ I disappear for a week and pretend computers work
● March
○ Strata San Jose - Big Data Beyond the JVM
k thnx bye :)
If you care about Spark testing and
don’t hate surveys:
http://bit.ly/holdenTestingSpark
Will tweet results
“eventually” @holdenkarau
Any PySpark users: have some
simple UDFs you wish ran faster that
you are willing to share?:
http://bit.ly/pySparkUDF
Pssst: Have feedback on the presentation? Give me a
shout (holden@pigscanfly.ca) if you feel comfortable doing
so :)