© 2017 IBM Corporation
Spark 2.x Troubleshooting Guide
IBM Big Data Performance
Jesse Chen, jesse.f.chen@gmail.com
3/2017
Troubleshooting Spark 2.x
§  Building Spark
§  Running Spark
-  ‘--verbose’
-  Missing external JARs
-  OOM on Spark driver
-  OOM on executors
-  GC policies
-  Spark Thrift Server for JDBC apps
-  HDFS block distribution
-  HDFS blocksize vs Parquet blocksize
§  Profiling Spark
-  Collecting thread & heap dumps in-flight
-  Collecting core dumps after jobs fail
Lots of errors when building a new Spark release on my own…
§  Run ‘make-distribution.sh’ (generates ‘bin/spark-shell’, ‘bin/spark-submit’, etc.)
§  Does not always work
-  Wrong JRE version or no JRE found
-  No Maven installed
-  Support for certain components not default, e.g., ‘hive’ support
§  TIP #1: Always explicitly set the following in ‘.bashrc’ for ‘root’
# for Spark distribution compiling
export JAVA_HOME=/usr/jdk64/java-1.8.0-openjdk-1.8.0.77-0.b03.el7_2.x86_64
export JRE_HOME=$JAVA_HOME/jre
export PATH=$JAVA_HOME/bin:$PATH
export CLASSPATH=.:$JAVA_HOME/lib:$JRE_HOME/lib:$CLASSPATH
# set Maven environment
export M2_HOME=/TestAutomation/downloads/tmp/spark-master/build/apache-maven-3.3.9
export MAVEN_OPTS="-Xms256m -Xmx2048m -XX:MaxPermSize=512m"
export PATH=$M2_HOME/bin:$PATH
§  TIP #2: Specify support you want explicitly
-  To build Spark with YARN and Hive support, do:
./dev/make-distribution.sh --name spark-master-2.1 --tgz -Pyarn -Phadoop-2.7 -Dhadoop.version=2.7.2 -Phive -Phive-thriftserver
Building a Spark release is extremely slow …
§  Use more build threads to speed up the build process (the default ‘-T 1C’ means one Maven thread per core)
§  Rebuild only modified source code (default is “clean”)
Edit the file ‘./dev/make-distribution.sh’, change line
BUILD_COMMAND=("$MVN" -T 1C clean package -DskipTests $@)
To:
BUILD_COMMAND=("$MVN" -T 48C package -DskipTests $@)
** Assuming you have 48 cores on your build machine
** Note that Maven reads ‘-T 48C’ as 48 threads per core; ‘-T 48’ requests exactly 48 threads
** Assuming you don’t need to always build clean, for iterative changes
§  Can cut build time from 45 min to 15 min on a typical 128GB-RAM 48-core node
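Maven's ‘-T’ option accepts either a plain thread count or a per-core multiplier with a ‘C’ suffix. The following is a minimal sketch of how such a value translates into a thread count; the function name is hypothetical and the logic only mirrors the documented behavior, it is not Maven's own code.

```python
def maven_thread_count(t_value: str, cores: int) -> int:
    """Translate a Maven -T value into a number of build threads.

    "48"   -> 48 threads, regardless of core count
    "1C"   -> 1 thread per core
    "1.5C" -> 1.5 threads per core, truncated to an integer
    """
    value = t_value.strip()
    if value.upper().endswith("C"):
        # Per-core multiplier: scale by the number of available cores.
        return int(float(value[:-1]) * cores)
    # Plain number: an absolute thread count.
    return int(value)


# On the 48-core build node from the slide:
print(maven_thread_count("1C", 48))   # 48
print(maven_thread_count("48", 48))   # 48
print(maven_thread_count("48C", 48))  # 2304
```

On a 48-core machine, ‘-T 1C’ and ‘-T 48’ ask for the same number of threads, so much of the saving in an iterative build is likely to come from dropping ‘clean’.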
Don’t know what settings are used when running Spark …
§  Always use the ‘--verbose’ option on the ‘spark-submit’ command to run your workload
§  Prints
-  All default properties
-  Command line options
-  Settings from spark ‘conf’ file
-  Settings from CLI
§  Example output
Spark properties used, including those specified through
--conf and those from the properties file /TestAutomation/spark-2.0/conf/spark-defaults.conf:
spark.yarn.queue -> default
spark.local.dir -> /data1/tmp,/data2/tmp,/data3/tmp,/data4/tmp
spark.history.kerberos.principal -> none
spark.sql.broadcastTimeout -> 800
spark.hadoop.yarn.timeline-service.enabled -> false
spark.yarn.max.executor.failures -> 3
spark.driver.memory -> 10g
spark.network.timeout -> 800
spark.yarn.historyServer.address -> node458.xyz.com:18080
spark.eventLog.enabled -> true
spark.history.ui.port -> 18080
spark.rpc.askTimeout -> 800
…
§  Example command:
spark-submit --driver-memory 10g --verbose --master yarn --executor-memory ….
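The ‘key -> value’ lines printed by ‘--verbose’ are easy to capture and compare between runs. The following is a minimal sketch, assuming the output has been saved as text in the format shown above; the function name and the sample string are illustrative only.

```python
def parse_verbose_properties(text: str) -> dict:
    """Collect 'key -> value' lines from spark-submit --verbose output."""
    props = {}
    for line in text.splitlines():
        if " -> " not in line:
            continue  # skip headers and unrelated log lines
        key, value = line.split(" -> ", 1)
        props[key.strip()] = value.strip()
    return props


# Sample taken from the example output on this slide.
sample = """Spark properties used, including those specified through
spark.yarn.queue -> default
spark.driver.memory -> 10g
spark.network.timeout -> 800"""

settings = parse_verbose_properties(sample)
print(settings["spark.driver.memory"])  # 10g
```

Diffing two such dictionaries is a quick way to spot which setting changed between a good run and a bad one.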
Missing external jars
§  Compiled OK, but run-time NoClassDefFoundError:
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/kafka/clients/producer/KafkaProducer
at java.lang.Class.getDeclaredMethods0(Native Method)
at java.lang.Class.privateGetDeclaredMethods(Class.java:2701)
at java.lang.Class.privateGetMethodRecursive(Class.java:3048)
at java.lang.Class.getMethod0(Class.java:3018)
§  Use ‘--packages’ to include a comma-separated list of Maven coordinates of JARs
§  Example
spark-submit --driver-memory 12g --verbose --master yarn --deploy-mode client --executor-memory 4096m --num-executors 20
--class com.ibm.biginsights.pqa.spark.SparkStreamingTest --packages org.apache.spark:spark-streaming-kafka_2.10:1.5.1 …
§  This includes JARs on both driver and executor classpaths
§  Order of look-up
-  The local Maven repo – local machine
-  Maven central - Web
-  Additional remote repositories specified in ‘--repositories’
OutOfMemory related to Spark driver
§  Types of OOM related to Spark driver heap size
15/10/06 17:10:00 ERROR akka.ErrorMonitor: Uncaught fatal error from thread [sparkDriver-akka.actor.default-dispatcher-29] shutting down ActorSystem [sparkDriver]
java.lang.OutOfMemoryError: Java heap space
Exception in thread "task-result-getter-0" java.lang.OutOfMemoryError: Java heap space
Subsequent error: Exception in thread "ResponseProcessor for block BP-1697216913-9.30.104.154-1438974319723:blk_1073847224_106652" java.lang.OutOfMemoryError: Java heap space
WARN nio.AbstractNioSelector: Unexpected exception in the selector loop.
java.lang.OutOfMemoryError: Java heap space
at org.jboss.netty.buffer.HeapChannelBuffer.<init>(HeapChannelBuffer.java:42)
§  Increasing ‘--driver-memory’ usually resolves these
§  The default (1g in Spark 2.x, 512m in early 1.x releases) is usually too small for serious workloads
§  Example: 8GB minimum needed for Spark SQL running TPCDS @ 1TB
§  Typical workloads that need large driver heap size
-  Spark SQL
-  Spark Streaming
OOM – GC overhead limit exceeded
15/12/09 19:57:02 WARN scheduler.TaskSetManager: Lost task 175.0 in stage 68.0 (TID 7588, rhel8.cisco.com): java.lang.OutOfMemoryError: GC overhead limit exceeded
at org.apache.spark.sql.catalyst.expressions.UnsafeRow.copy(UnsafeRow.java:478)
at org.apache.spark.sql.catalyst.expressions.UnsafeRow.copy(UnsafeRow.java:55)
§  Too much time is being spent in garbage collection (98% of the total time)
§  Less than 2% of the heap is recovered
§  From ‘top’, often see “1 CPU core fully used at 100%” but no work is done
§  Tuning #1: Increase executor heapsize
spark-submit … --executor-memory 4096m --num-executors 20 …
§  OR Tuning #2: Change GC policy (next slide)
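The two percentages above correspond to the HotSpot flags ‘GCTimeLimit’ (98) and ‘GCHeapFreeLimit’ (2). The following is a simplified sketch of the condition. The function name is hypothetical, and the real JVM evaluates this over several consecutive collections rather than a single sample.

```python
def gc_overhead_limit_exceeded(gc_time_pct: float,
                               heap_recovered_pct: float,
                               time_limit_pct: float = 98.0,
                               heap_free_limit_pct: float = 2.0) -> bool:
    """Return True when GC takes too long AND frees too little heap.

    Defaults mirror GCTimeLimit=98 and GCHeapFreeLimit=2.
    """
    return (gc_time_pct > time_limit_pct
            and heap_recovered_pct < heap_free_limit_pct)


print(gc_overhead_limit_exceeded(99.0, 1.0))   # True: almost all time in GC, little freed
print(gc_overhead_limit_exceeded(99.0, 10.0))  # False: GC is slow but still reclaims heap
```

Both conditions must hold, which is why adding executor heap (Tuning #1) usually clears the error: more free heap after each collection breaks the second condition.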
GC policies
§  Choose between -XX:+UseG1GC & -XX:+UseParallelGC
§  Show current GC settings
% /usr/jdk64/java-1.8.0-openjdk-1.8.0.45-28.b13.el6_6.x86_64/bin/java -XX:+PrintFlagsFinal
uintx GCHeapFreeLimit = 2 {product}
uintx GCLockerEdenExpansionPercent = 5 {product}
uintx GCLogFileSize = 8192 {product}
uintx GCTimeLimit = 98 {product}
uintx GCTimeRatio = 99 {product}
bool UseG1GC = false {product}
bool UseParallelGC := true {product}
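§  Tuning options
-  Spark default is -XX:+UseParallelGC (inherited from the Java 8 server JVM default)
-  Try overriding it with -XX:+UseG1GC, e.g., through ‘spark.executor.extraJavaOptions’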
§  Performance Impact: “Mythical at best”, “It depends”
§  Default is pretty good!
§  Databricks blog on Tuning GC for Spark
-  https://databricks.com/blog/2015/05/28/tuning-java-garbage-collection-for-spark-applications.html
Support JDBC Apps via Spark Thrift Server
§  Spark SQL can act as a distributed query engine using its JDBC/ODBC interface
§  Supported by running the Thrift JDBC/ODBC server
§  Has a single SparkContext with multiple sessions supporting
-  Concurrency
-  re-usable connections (pool)
-  Shared cache (e.g., catalog, tables, etc.)
§  Can specify any amount of memory and CPUs through standard spark-submit parameters:
-  --driver-memory
-  --executor-memory
-  --num-executors, etc.
§  Example, to start Thrift Server with 2.3TB of memory, 800 cores and YARN mode:
% $SPARK_HOME/sbin/start-thriftserver.sh --driver-memory 12g --verbose --master yarn --executor-memory 16g \
--num-executors 100 --executor-cores 8 --conf spark.hadoop.yarn.timeline-service.enabled=false \
--conf spark.yarn.executor.memoryOverhead=8192 --conf spark.driver.maxResultSize=5g
§  Default number of workers (sessions) = 500
§  Client tool bundled with Spark 2.0: Beeline
% $SPARK_HOME/bin/beeline -u "jdbc:hive2://node460.xyz.com:10013/my1tbdb" -n spark --force=true -f /test/query_00_01_96.sql
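The "2.3TB of memory, 800 cores" figure follows from the executor settings in the start command. Here is a small sketch of that arithmetic, assuming the total counts executor heap plus ‘memoryOverhead’ and leaves out the 12g driver; the function name is illustrative.

```python
def cluster_request(num_executors: int, executor_cores: int,
                    executor_memory_mb: int, memory_overhead_mb: int):
    """Total cores and memory (MB) requested for the executors."""
    total_cores = num_executors * executor_cores
    total_memory_mb = num_executors * (executor_memory_mb + memory_overhead_mb)
    return total_cores, total_memory_mb


# --num-executors 100 --executor-cores 8 --executor-memory 16g, overhead 8192 MB
cores, memory_mb = cluster_request(100, 8, 16 * 1024, 8192)
print(cores)                           # 800
print(round(memory_mb / 1024**2, 2))   # 2.34 (TB, binary units)
```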
Not all CPUs are busy …
§  Designed for big data
§  More cores and more memory are always better (well, until it breaks!)
§  Ways to max out your cluster, for example:
-  40 vCores per node
-  128GB memory per node
-  5-node cluster = 200 vCores, ~500GB RAM
§  Method #1 – Start with evenly divided memory and cores
--executor-memory 2500m --num-executors 200
Total # of executors = 200 (default: 1-core each)
# of executors/node = 40 (fully using all cores)
Total memory used = 500 GB
§  Method #2 – When heap size non-negotiable
--executor-memory 6g --num-executors 80
Total # of executors = 80 (1-core each)
# of executors/node = 16 (40% CPU utilization)
Total memory used ~= 500 GB
Can increase cores per executor as:
--executor-memory 6g --num-executors 80 --executor-cores 2
Forcing 80% utilization, boosting 33% performance!
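The per-node numbers in both methods can be checked with a few lines of arithmetic. This is a rough sketch that assumes YARN spreads executors evenly across the five nodes; the function name and the use of binary gigabytes are choices made for the example.

```python
def executor_layout(nodes: int, vcores_per_node: int, num_executors: int,
                    executor_cores: int, executor_memory_mb: int):
    """Return (executors per node, CPU utilization, total executor memory in GB)."""
    executors_per_node = num_executors // nodes
    cores_used_per_node = executors_per_node * executor_cores
    cpu_utilization = cores_used_per_node / vcores_per_node
    total_memory_gb = num_executors * executor_memory_mb / 1024
    return executors_per_node, cpu_utilization, total_memory_gb


# Method #1: --executor-memory 2500m --num-executors 200
print(executor_layout(5, 40, 200, 1, 2500))  # (40, 1.0, 488.28125)

# Method #2: --executor-memory 6g --num-executors 80
print(executor_layout(5, 40, 80, 1, 6144))   # (16, 0.4, 480.0)

# Method #2 with --executor-cores 2
print(executor_layout(5, 40, 80, 2, 6144))   # (16, 0.8, 480.0)
```

Both layouts land near the ~500GB memory budget; only the number of cores each executor can use changes the CPU utilization.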
Spread out Spark “scratch” space
§  Typical error
stage 89.3 failed 4 times, most recent failure:
Lost task 38.4 in stage 89.3 (TID 30100, rhel4.cisco.com): java.io.IOException: No space left on device
at java.io.FileOutputStream.writeBytes(Native Method)
at java.io.FileOutputStream.write(FileOutputStream.java:326)
at org.apache.spark.storage.TimeTrackingOutputStream.write(TimeTrackingOutputStream.java:58)
at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
at java.io.BufferedOutputStream.write(BufferedOutputStream.java:126)
§  Complains that ‘/tmp’ is full
§  Controlled by ‘spark.local.dir’ parameter
-  Default is ‘/tmp’
-  Stores map output files and RDDs
§  Two reasons ‘/tmp’ is not an ideal place for Spark “scratch” space
-  ‘/tmp’ is usually small and meant for the OS
-  ‘/tmp’ usually is a single disk, a potential IO bottleneck
§  To fix, add the following line to ‘spark-defaults.conf’ file:
spark.local.dir /data/disk1/tmp,/data/disk2/tmp,/data/disk3/tmp,/data/disk4/tmp,…
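When a node has many data disks, the ‘spark.local.dir’ line can be generated rather than typed by hand. A minimal sketch follows; the mount points and the ‘tmp’ subdirectory follow the example above, and the helper name is hypothetical.

```python
def local_dir_setting(mount_points, subdir="tmp"):
    """Build a spark-defaults.conf line that spreads scratch space over several disks."""
    dirs = [f"{mp.rstrip('/')}/{subdir}" for mp in mount_points]
    return "spark.local.dir " + ",".join(dirs)


line = local_dir_setting(["/data/disk1", "/data/disk2", "/data/disk3", "/data/disk4"])
print(line)
# spark.local.dir /data/disk1/tmp,/data/disk2/tmp,/data/disk3/tmp,/data/disk4/tmp
```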
Max result size exceeded
§  Typical error
stream5/query_05_22_77.sql.out:Error: org.apache.spark.SparkException: Job aborted due to stage failure:
Total size of serialized results of 381610 tasks (5.0 GB) is bigger than spark.driver.maxResultSize (5.0 GB) (state=,code=0))
§  Likely to occur with complex SQL on large data volumes
§  Limit of total size of serialized results of all partitions for each Spark action (e.g., collect)
§  Controlled by ‘spark.driver.maxResultSize’ parameter
-  Default is 1G
-  Can be set to ‘0’, which means unlimited
-  An unlimited setting risks OOM on the driver
§  To fix, add the following line to ‘spark-defaults.conf’ file:
spark.driver.maxResultSize 5g
** 5G is a learned value for Spark SQL running TPCDS queries at 1TB scale factors
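The limit check itself is simple: the job is aborted once the accumulated serialized results exceed the configured size, and a value of 0 disables the check. A simplified sketch follows; the size parser handles only single-letter binary suffixes, and both function names are illustrative rather than Spark's own.

```python
_UNITS = {"k": 1024, "m": 1024**2, "g": 1024**3, "t": 1024**4}


def parse_size(text: str) -> int:
    """Parse sizes such as '5g' or '512m' into bytes (bare numbers are bytes)."""
    text = text.strip().lower()
    if text and text[-1] in _UNITS:
        return int(text[:-1]) * _UNITS[text[-1]]
    return int(text)


def exceeds_max_result_size(total_result_bytes: int, max_result_size: str) -> bool:
    """True when the collected results are over the limit; '0' means unlimited."""
    limit = parse_size(max_result_size)
    return limit > 0 and total_result_bytes > limit


print(exceeds_max_result_size(6 * 1024**3, "5g"))  # True: 6 GB of results, 5 GB limit
print(exceeds_max_result_size(6 * 1024**3, "0"))   # False: limit disabled
```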
Catalyst errors
§  Typical error
stream7/query_07_24_48.sql.out:Error: org.apache.spark.sql.catalyst.errors.package$TreeNodeException: execute, tree:
at org.apache.spark.sql.execution.exchange.ShuffleExchange$$anonfun$doExecute$1.apply(ShuffleExchange.scala:122)
at org.apache.spark.sql.execution.exchange.ShuffleExchange$$anonfun$doExecute$1.apply(ShuffleExchange.scala:113)
at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:49)
... 96 more
Caused by: java.util.concurrent.TimeoutException: Futures timed out after [800 seconds]
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:190)
at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
at scala.concurrent.Await$.result(package.scala:190)
at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:190)
... 208 more
§  On the surface, appears to be a Catalyst (optimizer) error
§  Actually an internal Spark timeout error most likely to occur under concurrency
java.util.concurrent.TimeoutException: Futures timed out after [800 seconds]
§  Controlled by ‘spark.sql.broadcastTimeout’, an unpublished Spark setting
-  Default in source code shows 300 seconds
§  To fix, add the following line to ‘spark-defaults.conf’ file or as CLI --conf
spark.sql.broadcastTimeout 1200
** 1200 seconds is the run time of the longest-running query in our SQL workload.
Other timeouts
§  Typical errors
16/07/09 01:14:18 ERROR spark.ContextCleaner: Error cleaning broadcast 28267
org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [800 seconds]. This timeout is controlled by spark.rpc.askTimeout
at org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:48)
at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:63)
at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
at scala.PartialFunction$OrElse.apply(PartialFunction.scala:167)
at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:83)
at org.apache.spark.storage.BlockManagerMaster.removeBroadcast(BlockManagerMaster.scala:143)
And timeout exceptions related to the following:
spark.core.connection.ack.wait.timeout
spark.akka.timeout
spark.storage.blockManagerSlaveTimeoutMs
spark.shuffle.io.connectionTimeout
spark.rpc.askTimeout
spark.rpc.lookupTimeout
§  Depending on system resource usage, any of the above can occur (e.g., no heartbeats)
§  You can tune each individual setting OR use an “umbrella” timeout setting
§  Controlled by ‘spark.network.timeout’ parameter
-  Default is 120 seconds
-  Used in place of all the above timeout values when they are not set individually
§  To fix, add the following line to ‘spark-defaults.conf’ file:
spark.network.timeout 700
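The relationship between the individual timeouts and the umbrella setting can be pictured as a simple fallback chain. This is a simplified sketch, not Spark's code: the function name is hypothetical, values are plain seconds, and unit differences such as the milliseconds used by ‘spark.storage.blockManagerSlaveTimeoutMs’ are ignored.

```python
DEFAULT_NETWORK_TIMEOUT_S = 120


def effective_timeout(conf: dict, key: str) -> int:
    """Resolve a timeout: explicit setting, else spark.network.timeout, else 120s."""
    if key in conf:
        return conf[key]
    return conf.get("spark.network.timeout", DEFAULT_NETWORK_TIMEOUT_S)


print(effective_timeout({}, "spark.rpc.askTimeout"))                              # 120
print(effective_timeout({"spark.network.timeout": 700}, "spark.rpc.askTimeout"))  # 700
print(effective_timeout({"spark.network.timeout": 700,
                         "spark.rpc.askTimeout": 800}, "spark.rpc.askTimeout"))   # 800
```

An explicitly set timeout still wins, so check the ‘--verbose’ output for stale per-setting values when the umbrella setting seems to have no effect.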
Out of space on a few data nodes …
§  Unbalanced HDFS forces more IO over network
§  Run command ‘hdfs balancer’ to start rebalancing
§  dfs.datanode.balance.bandwidthPerSec
-  Default 6250000 or 6.25 MB/s network bandwidth
-  Increased to 6 GB/s on F1 to take advantage of fat pipe
§  dfs.datanode.balance.max.concurrent.moves
-  Default is undefined
-  Add this setting in hdfs-site
-  Set to 500 concurrent threads
-  Example shows 5.4 TB/hour balancing rate
16/10/05 10:17:24 INFO balancer.Balancer: 0 over-utilized: []
16/10/05 10:17:24 INFO balancer.Balancer: 0 underutilized: []
The cluster is balanced. Exiting...
Oct 5, 2016 10:17:24 AM         337   19.71 TB  0 B -1 B
Oct 5, 2016 10:17:24 AM  Balancing took 3.6939516666666665 hours
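The balancing rate can be derived from the last two lines of the balancer output (bytes moved and elapsed hours). Here is a short sketch of that arithmetic, plus the unit conversion for the default bandwidth; the function names are illustrative.

```python
def balancing_rate_tb_per_hour(tb_moved: float, hours: float) -> float:
    """Average balancing throughput in TB per hour."""
    return tb_moved / hours


def bytes_per_sec_to_mbit(bytes_per_sec: int) -> float:
    """Convert a bandwidthPerSec value (bytes/s) to megabits per second."""
    return bytes_per_sec * 8 / 1_000_000


rate = balancing_rate_tb_per_hour(19.71, 3.6939516666666665)
print(round(rate, 2))                  # 5.34
print(bytes_per_sec_to_mbit(6250000))  # 50.0
```

The computed figure of about 5.3 TB/hour is in line with the rate quoted on the slide; the small gap may come from rounding or unit conventions.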
What block size to use in HDFS and in Parquet?
Take-away:
Keep block size for both at default (128MB)
[Figure: rows of HDFS blocks (four per row) overlaid with one, two, and four Parquet blocks]
Remote reads occur when block boundaries cross
Slows down scan time
Prefer row group boundaries be at block boundaries
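To see why equal block sizes help, count how many Parquet row groups straddle an HDFS block boundary for a given pair of sizes. The sketch below makes simplifying assumptions: row groups are written back to back from offset 0, and file headers, footers, and padding are ignored. The function names are hypothetical.

```python
def blocks_spanned(offset: int, length: int, hdfs_block_size: int) -> int:
    """Number of HDFS blocks touched by a byte range."""
    first = offset // hdfs_block_size
    last = (offset + length - 1) // hdfs_block_size
    return last - first + 1


def count_crossing_row_groups(row_group_size: int, hdfs_block_size: int,
                              num_row_groups: int) -> int:
    """How many row groups cross at least one HDFS block boundary."""
    crossings = 0
    for i in range(num_row_groups):
        if blocks_spanned(i * row_group_size, row_group_size, hdfs_block_size) > 1:
            crossings += 1
    return crossings


MB = 1024 * 1024
print(count_crossing_row_groups(128 * MB, 128 * MB, 4))  # 0: boundaries line up
print(count_crossing_row_groups(100 * MB, 128 * MB, 4))  # 3: most row groups straddle two blocks
```

Every crossing row group may need a second block, possibly from another node, which is the remote read the slide warns about.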
In-flight capturing of executor thread & heap dumps
§  Typically run as YARN containers across multiple nodes, e.g.,
yarn 355583 355580 91 09:15 ? 00:05:35 /usr/jdk64/java-1.8.0-openjdk-1.8.0.45-28.b13.el6_6.x86_64/bin/java -server -XX:OnOutOfMemoryError=kill %p -Xms6144m -Xmx6144m
-Djava.io.tmpdir=/data6/hadoop/yarn/local/usercache/biadmin/appcache/application_1452558922304_0075/container_1452558922304_0075_01_000020/tmp -Dspark.driver.port=3110 -Dspark.history.ui.port=18080
-Dspark.yarn.app.container.log.dir=/data1/hadoop/yarn/log/application_1452558922304_0075/container_1452558922304_0075_01_000020 org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url
akka.tcp://sparkDriver@9.30.104.154:3110/user/CoarseGrainedScheduler --executor-id 19 --hostname node133.yxz.com --cores 1 --app-id application_1452558922304_0075 --user-class-path
file:/data6/hadoop/yarn/local/usercache/biadmin/appcache/application_1452558922304_0075/container_1452558922304_0075_01_000020/__app__.jar
§  OpenJDK has a set of tools for Java thread and heap dumps
jmap, jstack, jstat, jhat, etc.
§  Typical location of OpenJDK tools for IBM Hadoop platform
/usr/jdk64/java-1.8.0-openjdk-1.8.0.45-28.b13.el6_6.x86_64/bin/
§  To get a full thread dump
% jstack -l 355583 > /TestAutomation/results/twitter/javacore.355583.1
% jstack -l -F 355583 > /TestAutomation/results/twitter/javacore-hung.355583.1
Use -F to attach to a non-responsive JVM
§  To get a full heap dump
% jmap -dump:live,format=b,file=/TestAutomation/results/dump.355583.2 355583
Dumping heap to /TestAutomation/results/sparkstreamtests/dump.355583.2 ...
Heap dump file created
Can’t find core dumps even when Spark says there are ….
§  Core dumps created by Spark jobs
16/11/14 16:45:05 WARN scheduler.TaskSetManager: Lost task 692.0 in stage 4.0 (TID 129021, node12.xyz.com, executor 824): ExecutorLostFailure (executor 824 exited caused by one of the running tasks) Reason:
Container marked as failed: container_e69_1479156026828_0006_01_000825 on host: node12.xyz.com. Exit status: 134. Diagnostics: Exception from container-launch.
Exit code: 134
Container id: container_e69_1479156026828_0006_01_000825
Exception message: /bin/bash: line 1: 3694385 Aborted (core dumped) /usr/jdk64/java-1.8.0-openjdk-1.8.0.77-0.b03.el7_2.x86_64/bin/java -server -Xmx24576m -Djava.io.tmpdir=/data2/hadoop/yarn/local/
….ontainer.log.dir=/data5/hadoop/…container_e69_1479156026828_0006_01_000825/com.univocity_univocity-parsers-1.5.1.jar > /data5/hadoop/yarn/log/application_1479156026828_0006/container_e69_1479156026828_0006_01_000825/stdout 2> /data5/hadoop/yarn/log/application_1479156026828_0006/container_e69_1479156026828_0006_01_000825/stderr
Stack trace: ExitCodeException exitCode=134: /bin/bash: line 1: 3694385 Aborted (core dumped) /usr/jdk64/java-1.8.0-openjdk-1.8.0.77-0.b03.el7_2.x86_64/bin/java -server -Xmx24576m -Djava.io.tmpdir=/data2/hadoop/-…
container_e69_1479156026828_0006_01_000825/com.univocity_univocity-parsers-1.5.1.jar > /data5/hadoop/yarn/log/application_1479156026828_0006/container_e69_1479156026828_0006_01_000825/stdout 2> /data5/hadoop/yarn/log/application_1479156026828_0006/container_e69_1479156026828_0006_01_000825/stderr
§  YARN settings for core dump file retention
‘yarn.nodemanager.delete.debug-delay-sec’ default is 0, so files are deleted right after the application finishes
Set it to enough time to get to files and copy them for debugging
§  Steps:
1. Find the hostname in the error log
2. Find the local directory where ‘stderr’ resides
3. Open ‘stderr’; you will find lines similar to:
/data2/hadoop/yarn/local/usercache/spark/appcache/application_1479156026828_0006/container_e69_1479156026828_0006_01_000825/hs_err_pid3694385.log
§  and core dump files too!
§  More on this setting: https://hadoop.apache.org/docs/r2.7.3/hadoop-yarn/hadoop-yarn-common/yarn-default.xml
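The exit status in the container diagnostics already hints at a core dump: by shell convention, a status above 128 means the process was killed by signal (status - 128), and 134 maps to signal 6, SIGABRT, which matches the "Aborted (core dumped)" message. A small sketch of that decoding follows, using Linux signal numbers; the function name and the short signal table are illustrative.

```python
# A few common Linux signal numbers; not an exhaustive table.
SIGNAL_NAMES = {6: "SIGABRT", 9: "SIGKILL", 11: "SIGSEGV", 15: "SIGTERM"}


def describe_exit_status(status: int) -> str:
    """Decode a shell-style exit status into a signal name where applicable."""
    if status > 128:
        signal_number = status - 128
        return SIGNAL_NAMES.get(signal_number, f"signal {signal_number}")
    return f"exit code {status}"


print(describe_exit_status(134))  # SIGABRT
print(describe_exit_status(137))  # SIGKILL
```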
1
2

More Related Content

What's hot

Fine Tuning and Enhancing Performance of Apache Spark Jobs
Fine Tuning and Enhancing Performance of Apache Spark JobsFine Tuning and Enhancing Performance of Apache Spark Jobs
Fine Tuning and Enhancing Performance of Apache Spark JobsDatabricks
 
Top 5 Mistakes When Writing Spark Applications
Top 5 Mistakes When Writing Spark ApplicationsTop 5 Mistakes When Writing Spark Applications
Top 5 Mistakes When Writing Spark ApplicationsSpark Summit
 
Top 5 Mistakes to Avoid When Writing Apache Spark Applications
Top 5 Mistakes to Avoid When Writing Apache Spark ApplicationsTop 5 Mistakes to Avoid When Writing Apache Spark Applications
Top 5 Mistakes to Avoid When Writing Apache Spark ApplicationsCloudera, Inc.
 
Performance Optimizations in Apache Impala
Performance Optimizations in Apache ImpalaPerformance Optimizations in Apache Impala
Performance Optimizations in Apache ImpalaCloudera, Inc.
 
How does PostgreSQL work with disks: a DBA's checklist in detail. PGConf.US 2015
How does PostgreSQL work with disks: a DBA's checklist in detail. PGConf.US 2015How does PostgreSQL work with disks: a DBA's checklist in detail. PGConf.US 2015
How does PostgreSQL work with disks: a DBA's checklist in detail. PGConf.US 2015PostgreSQL-Consulting
 
Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Ro...
Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Ro...Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Ro...
Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Ro...Spark Summit
 
Presto on Apache Spark: A Tale of Two Computation Engines
Presto on Apache Spark: A Tale of Two Computation EnginesPresto on Apache Spark: A Tale of Two Computation Engines
Presto on Apache Spark: A Tale of Two Computation EnginesDatabricks
 
Hive Bucketing in Apache Spark with Tejas Patil
Hive Bucketing in Apache Spark with Tejas PatilHive Bucketing in Apache Spark with Tejas Patil
Hive Bucketing in Apache Spark with Tejas PatilDatabricks
 
Optimizing Apache Spark SQL Joins
Optimizing Apache Spark SQL JoinsOptimizing Apache Spark SQL Joins
Optimizing Apache Spark SQL JoinsDatabricks
 
Physical Plans in Spark SQL
Physical Plans in Spark SQLPhysical Plans in Spark SQL
Physical Plans in Spark SQLDatabricks
 
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...Databricks
 
Why your Spark job is failing
Why your Spark job is failingWhy your Spark job is failing
Why your Spark job is failingSandy Ryza
 
BlueStore: a new, faster storage backend for Ceph
BlueStore: a new, faster storage backend for CephBlueStore: a new, faster storage backend for Ceph
BlueStore: a new, faster storage backend for CephSage Weil
 
Apache Spark overview
Apache Spark overviewApache Spark overview
Apache Spark overviewDataArt
 
Building a SIMD Supported Vectorized Native Engine for Spark SQL
Building a SIMD Supported Vectorized Native Engine for Spark SQLBuilding a SIMD Supported Vectorized Native Engine for Spark SQL
Building a SIMD Supported Vectorized Native Engine for Spark SQLDatabricks
 
The Future of Column-Oriented Data Processing With Apache Arrow and Apache Pa...
The Future of Column-Oriented Data Processing With Apache Arrow and Apache Pa...The Future of Column-Oriented Data Processing With Apache Arrow and Apache Pa...
The Future of Column-Oriented Data Processing With Apache Arrow and Apache Pa...Dremio Corporation
 
Apache Spark on K8S Best Practice and Performance in the Cloud
Apache Spark on K8S Best Practice and Performance in the CloudApache Spark on K8S Best Practice and Performance in the Cloud
Apache Spark on K8S Best Practice and Performance in the CloudDatabricks
 

What's hot (20)

Fine Tuning and Enhancing Performance of Apache Spark Jobs
Fine Tuning and Enhancing Performance of Apache Spark JobsFine Tuning and Enhancing Performance of Apache Spark Jobs
Fine Tuning and Enhancing Performance of Apache Spark Jobs
 
Top 5 Mistakes When Writing Spark Applications
Top 5 Mistakes When Writing Spark ApplicationsTop 5 Mistakes When Writing Spark Applications
Top 5 Mistakes When Writing Spark Applications
 
Top 5 Mistakes to Avoid When Writing Apache Spark Applications
Top 5 Mistakes to Avoid When Writing Apache Spark ApplicationsTop 5 Mistakes to Avoid When Writing Apache Spark Applications
Top 5 Mistakes to Avoid When Writing Apache Spark Applications
 
Apache Spark Core
Apache Spark CoreApache Spark Core
Apache Spark Core
 
Performance Optimizations in Apache Impala
Performance Optimizations in Apache ImpalaPerformance Optimizations in Apache Impala
Performance Optimizations in Apache Impala
 
How does PostgreSQL work with disks: a DBA's checklist in detail. PGConf.US 2015
How does PostgreSQL work with disks: a DBA's checklist in detail. PGConf.US 2015How does PostgreSQL work with disks: a DBA's checklist in detail. PGConf.US 2015
How does PostgreSQL work with disks: a DBA's checklist in detail. PGConf.US 2015
 
Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Ro...
Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Ro...Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Ro...
Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Ro...
 
Hive: Loading Data
Hive: Loading DataHive: Loading Data
Hive: Loading Data
 
Presto on Apache Spark: A Tale of Two Computation Engines
Presto on Apache Spark: A Tale of Two Computation EnginesPresto on Apache Spark: A Tale of Two Computation Engines
Presto on Apache Spark: A Tale of Two Computation Engines
 
Hive Bucketing in Apache Spark with Tejas Patil
Hive Bucketing in Apache Spark with Tejas PatilHive Bucketing in Apache Spark with Tejas Patil
Hive Bucketing in Apache Spark with Tejas Patil
 
Optimizing Apache Spark SQL Joins
Optimizing Apache Spark SQL JoinsOptimizing Apache Spark SQL Joins
Optimizing Apache Spark SQL Joins
 
Physical Plans in Spark SQL
Physical Plans in Spark SQLPhysical Plans in Spark SQL
Physical Plans in Spark SQL
 
Node Labels in YARN
Node Labels in YARNNode Labels in YARN
Node Labels in YARN
 
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
 
Why your Spark job is failing
Why your Spark job is failingWhy your Spark job is failing
Why your Spark job is failing
 
BlueStore: a new, faster storage backend for Ceph
BlueStore: a new, faster storage backend for CephBlueStore: a new, faster storage backend for Ceph
BlueStore: a new, faster storage backend for Ceph
 
Apache Spark overview
Apache Spark overviewApache Spark overview
Apache Spark overview
 
Building a SIMD Supported Vectorized Native Engine for Spark SQL
Building a SIMD Supported Vectorized Native Engine for Spark SQLBuilding a SIMD Supported Vectorized Native Engine for Spark SQL
Building a SIMD Supported Vectorized Native Engine for Spark SQL
 
The Future of Column-Oriented Data Processing With Apache Arrow and Apache Pa...
The Future of Column-Oriented Data Processing With Apache Arrow and Apache Pa...The Future of Column-Oriented Data Processing With Apache Arrow and Apache Pa...
The Future of Column-Oriented Data Processing With Apache Arrow and Apache Pa...
 
Apache Spark on K8S Best Practice and Performance in the Cloud
Apache Spark on K8S Best Practice and Performance in the CloudApache Spark on K8S Best Practice and Performance in the Cloud
Apache Spark on K8S Best Practice and Performance in the Cloud
 

Viewers also liked

Top 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applicationsTop 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applicationshadooparchbook
 
Extreme Apache Spark: how in 3 months we created a pipeline that can process ...
Extreme Apache Spark: how in 3 months we created a pipeline that can process ...Extreme Apache Spark: how in 3 months we created a pipeline that can process ...
Extreme Apache Spark: how in 3 months we created a pipeline that can process ...Josef A. Habdank
 
Intro to Spark development
 Intro to Spark development  Intro to Spark development
Intro to Spark development Spark Summit
 
Advanced Data Science on Spark-(Reza Zadeh, Stanford)
Advanced Data Science on Spark-(Reza Zadeh, Stanford)Advanced Data Science on Spark-(Reza Zadeh, Stanford)
Advanced Data Science on Spark-(Reza Zadeh, Stanford)Spark Summit
 
SQL to Hive Cheat Sheet
SQL to Hive Cheat SheetSQL to Hive Cheat Sheet
SQL to Hive Cheat SheetHortonworks
 
Dynamically Allocate Cluster Resources to your Spark Application
Dynamically Allocate Cluster Resources to your Spark ApplicationDynamically Allocate Cluster Resources to your Spark Application
Dynamically Allocate Cluster Resources to your Spark ApplicationDataWorks Summit
 
Spark Compute as a Service at Paypal with Prabhu Kasinathan
Spark Compute as a Service at Paypal with Prabhu KasinathanSpark Compute as a Service at Paypal with Prabhu Kasinathan
Spark Compute as a Service at Paypal with Prabhu KasinathanDatabricks
 
Productionizing Spark and the Spark Job Server
Productionizing Spark and the Spark Job ServerProductionizing Spark and the Spark Job Server
Productionizing Spark and the Spark Job ServerEvan Chan
 
Get most out of Spark on YARN
Get most out of Spark on YARNGet most out of Spark on YARN
Get most out of Spark on YARNDataWorks Summit
 
Hadoop and Spark Analytics over Better Storage
Hadoop and Spark Analytics over Better StorageHadoop and Spark Analytics over Better Storage
Hadoop and Spark Analytics over Better StorageSandeep Patil
 
Spark Summit Europe: Building a REST Job Server for interactive Spark as a se...
Spark Summit Europe: Building a REST Job Server for interactive Spark as a se...Spark Summit Europe: Building a REST Job Server for interactive Spark as a se...
Spark Summit Europe: Building a REST Job Server for interactive Spark as a se...gethue
 
Spark-on-YARN: Empower Spark Applications on Hadoop Cluster
Spark-on-YARN: Empower Spark Applications on Hadoop ClusterSpark-on-YARN: Empower Spark Applications on Hadoop Cluster
Spark-on-YARN: Empower Spark Applications on Hadoop ClusterDataWorks Summit
 
ETL with SPARK - First Spark London meetup
ETL with SPARK - First Spark London meetupETL with SPARK - First Spark London meetup
ETL with SPARK - First Spark London meetupRafal Kwasny
 
Apache Spark Model Deployment
Apache Spark Model Deployment Apache Spark Model Deployment
Apache Spark Model Deployment Databricks
 

Viewers also liked (20)

Top 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applicationsTop 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applications
 
Extreme Apache Spark: how in 3 months we created a pipeline that can process ...
Extreme Apache Spark: how in 3 months we created a pipeline that can process ...Extreme Apache Spark: how in 3 months we created a pipeline that can process ...
Extreme Apache Spark: how in 3 months we created a pipeline that can process ...
 
  • 5. © 2017 IBM Corporation5 Don’t know what settings used when running Spark … §  Always use ‘–-verbose’ option on ‘spark-submit’ command to run your workload §  Prints -  All default properties -  Command line options -  Settings from spark ‘conf’ file -  Settings from CLI §  Example output Spark properties used, including those specified through --conf and those from the properties file /TestAutomation/spark-2.0/conf/spark-defaults.conf: spark.yarn.queue -> default spark.local.dir -> /data1/tmp,/data2/tmp,/data3/tmp,/data4/tmp spark.history.kerberos.principal -> none spark.sql.broadcastTimeout -> 800 spark.hadoop.yarn.timeline-service.enabled -> false spark.yarn.max.executor.failures -> 3 spark.driver.memory -> 10g spark.network.timeout -> 800 spark.yarn.historyServer.address -> node458.xyz.com:18080 spark.eventLog.enabled -> true spark.history.ui.port -> 18080 spark.rpc.askTimeout -> 800 … §  Example command: spark-submit --driver-memory 10g --verbose --master yarn --executor-memory ….
  • 6. © 2017 IBM Corporation6 Missing external jars §  Compiled OK, but run-time NoClassDefFoundError: Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/kafka/clients/producer/KafkaProducer at java.lang.Class.getDeclaredMethods0(Native Method) at java.lang.Class.privateGetDeclaredMethods(Class.java:2701) at java.lang.Class.privateGetMethodRecursive(Class.java:3048) at java.lang.Class.getMethod0(Class.java:3018) §  Use ‘--packages’ to include comma-separated list of Maven coordinates of JARs §  Example spark-submit --driver-memory 12g --verbose --master yarn-client --executor-memory 4096m --num-executors 20 --class com.ibm.biginsights.pqa.spark.SparkStreamingTest --packages org.apache.spark:spark-streaming- kafka_2.10:1.5.1 … §  This includes JARs on both driver and executor classpaths §  Order of look-up -  The local Maven repo – local machine -  Maven central - Web -  Additional remote repositories specified in –repositories
  • 7. © 2017 IBM Corporation7 OutOfMemory related to Spark driver §  Types of OOM related to Spark driver heap size 15/10/06 17:10:00 ERROR akka.ErrorMonitor: Uncaught fatal error from thread [sparkDriver- akka.actor.default-dispatcher-29] shutting down ActorSystem [sparkDriver] java.lang.OutOfMemoryError: Java heap space Exception in thread "task-result-getter-0" java.lang.OutOfMemoryError: Java heap space Subsequent error: Exception in thread "ResponseProcessor for block BP-1697216913-9.30.104.154-1438974319723:blk_1073847224_106652" java.lang.OutOfMemoryError: Java heap space WARN nio.AbstractNioSelector: Unexpected exception in the selector loop. java.lang.OutOfMemoryError: Java heap space at org.jboss.netty.buffer.HeapChannelBuffer.<init>(HeapChannelBuffer.java:42) §  Increase ‘--driver-memory’ usually resolves these §  Default 512M is usually too small for serious workloads §  Example: 8GB minimum needed for Spark SQL running TPCDS @ 1TB §  Typical workloads that need large driver heap size -  Spark SQL -  Spark Streaming
  • 8. © 2017 IBM Corporation8 OOM – GC overhead limit exceeded 15/12/09 19:57:02 WARN scheduler.TaskSetManager: Lost task 175.0 in stage 68.0 (TID 7588, rhel8.cisco.com): java.lang.OutOfMemoryError: GC overhead limit exceeded at org.apache.spark.sql.catalyst.expressions.UnsafeRow.copy(UnsafeRow.java:478) at org.apache.spark.sql.catalyst.expressions.UnsafeRow.copy(UnsafeRow.java:55) §  Too much time is being spent in garbage collection (98% of the total time) §  Less than 2% of the heap is recovered §  From ‘top’, often see “1 CPU core fully used at 100%” but no work is done §  Tuning #1: Increase executor heapsize spark-submit … --executor-memory 4096m --num-executors 20 … §  OR Tuning #2: Change GC policy (next slide)
  • 9. © 2017 IBM Corporation9 GC policies §  Choose between -XX:UseG1GC & -XX:UseParallelGC §  Show current GC settings % /usr/jdk64/java-1.8.0-openjdk-1.8.0.45-28.b13.el6_6.x86_64/bin/java -XX:+PrintFlagsFinal uintx GCHeapFreeLimit = 2 {product} uintx GCLockerEdenExpansionPercent = 5 {product} uintx GCLogFileSize = 8192 {product} uintx GCTimeLimit = 98 {product} uintx GCTimeRatio = 99 {product} bool UseG1GC = false {product} bool UseParallelGC := true {product} §  Tuning options -  Spark default is -XX:UseParallelGC -  Try overwrite with –XX:G1GC §  Performance Impact: “Mythical at best”, “It depends” §  Default is pretty good! §  Databricks blog on Tuning GC for Spark -  https://databricks.com/blog/2015/05/28/tuning-java-garbage-collection-for-spark- applications.html
  • 10. © 2017 IBM Corporation10 Support JDBC Apps via Spark Thrift Server §  Spark SQL can act as a distributed query engine using its JDBC/ODBC interface §  Supported by running the Thrift JDBC/ODBC server §  Has a single SparkContext with multiple sessions supporting -  Concurrency -  re-usable connections (pool) -  Shared cache (e.g., catalog, tables, etc.) §  Can specify any amount of memory, CPUs through standard Spark-submit parameters: -  Driver-memory -  Executor-memory -  Num-executors, etc. §  Example, to start Thrift Server with 2.3TB of memory, 800 cores and YARN mode: % $SPARK_HOME/sbin/start-thriftserver.sh --driver-memory 12g --verbose --master yarn --executor-memory 16g --num-executors 100 --executor-cores 8 --conf spark.hadoop.yarn.timeline-service.enabled=false --conf spark.yarn.executor.memoryOverhead=8192 --conf spark.driver.maxResultSize=5g §  Default number of workers (sessions) = 500 §  Client tool bundled with Spark 2.0: Beeline % $SPARK_HOME/bin/beeline -u "jdbc:hive2://node460.xyz.com:10013/my1tbdb" -n spark --force=true -f /test/ query_00_01_96.sql
  • 11. © 2017 IBM Corporation11 Not all CPUs are busy … §  Designed for big data §  More cores and more memory always better (well, until it breaks!) §  Ways to max out your cluster, for example: -  40 vCores per node -  128GB memory per node -  5-node cluster = 200 vCores, ~500GB RAM §  Method #1 – Start with evenly divided memory and cores --executor-memory 2500m --num-executors 200 Total # of executors = 200 (default: 1-core each) # of executors/node = 40 (fully using all cores) Total memory used = 500 GB §  Method #2 – When heap size non-negotiable --executor-memory 6g --num-executors 80 Total # of executors = 80 (1-core each) # of executors/node = 16 (40% CPU utilization) Total memory used ~= 500 GB Can increase cores per executor as: --executor-memory 6g --num-executors 80 –executor-cores 2 Forcing 80% utilization, boosting 33% performance!
  • 12. © 2017 IBM Corporation12 Spread out Spark “scratch” space §  Typical error stage 89.3 failed 4 times, most recent failure: Lost task 38.4 in stage 89.3 (TID 30100, rhel4.cisco.com): java.io.IOException: No space left on device at java.io.FileOutputStream.writeBytes(Native Method) at java.io.FileOutputStream.write(FileOutputStream.java:326) at org.apache.spark.storage.TimeTrackingOutputStream.write(TimeTrackingOutputStream.java:58) at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82) at java.io.BufferedOutputStream.write(BufferedOutputStream.java:126) §  Complains about ‘/tmp’ is full §  Controlled by ‘spark.local.dir’ parameter -  Default is ‘/tmp’ -  Stores map output files and RDDs §  Two reasons ‘/tmp’ is not an ideal place for Spark “scratch” space -  ‘/tmp’ usually is small and for OS -  ‘/tmp’ usually is a single disk, a potential IO bottleneck §  To fix, add the following line to ‘spark-defaults.conf’ file: spark.local.dir /data/disk1/tmp,/data/disk2/tmp,/data/disk3/tmp,/data/disk4/tmp,…
  • 13. © 2017 IBM Corporation13 Max result size exceeded §  Typical error stream5/query_05_22_77.sql.out:Error: org.apache.spark.SparkException: Job aborted due to stage failure: Total size of serialized results of 381610 tasks (5.0 GB) is bigger than spark.driver.maxResultSize (5.0 GB) (state=,code=0)) §  Likely to occur with complex SQL on large data volumes §  Limit of total size of serialized results of all partitions for each Spark action (e.g., collect) §  Controlled by ‘spark.driver.maxResultSize’ parameter -  Default is 1G -  Can be ‘0’ or ‘unlimited’ -  ‘unlimited’ will throw OOM on driver §  To fix, add the following line to ‘spark-defaults.conf’ file: spark.driver.maxResultSize 5g ** 5G is a learned value for Spark SQL running TPCDS queries at 1TB scale factors
  • 14. © 2017 IBM Corporation14 Catalyst errors §  Typical error stream7/query_07_24_48.sql.out:Error: org.apache.spark.sql.catalyst.errors.package$TreeNodeException: execute, tree: at org.apache.spark.sql.execution.exchange.ShuffleExchange$$anonfun$doExecute $1.apply(ShuffleExchange.scala:122) at org.apache.spark.sql.execution.exchange.ShuffleExchange$$anonfun$doExecute $1.apply(ShuffleExchange.scala:113) at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:49) ... 96 more Caused by: java.util.concurrent.TimeoutException: Futures timed out after [800 seconds] at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219) at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223) at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:190) at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53) at scala.concurrent.Await$.result(package.scala:190) at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:190) ... 208 more §  On surface appears to be Catalyst error (optimizer) §  Actually an internal Spark timeout error most likely to occur under concurrency java.util.concurrent.TimeoutException: Futures timed out after [800 seconds] §  Controlled by an unpublished Spark setting ‘spark.sql.broadcastTimeout’ parameter -  Default in source code shows 300 seconds §  To fix, add the following line to ‘spark-defaults.conf’ file or as CLI --conf spark.sql.broadcastTimeout 1200 **1200 is the longest running query in a SQL workload in our case.
  • 15. © 2017 IBM Corporation15 Other timeouts §  Typical errors 16/07/09 01:14:18 ERROR spark.ContextCleaner: Error cleaning broadcast 28267 org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [800 seconds]. This timeout is controlled by spark.rpc.askTimeout at org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$ $createRpcTimeoutException(RpcTimeout.scala:48) at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:63) at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59) at scala.PartialFunction$OrElse.apply(PartialFunction.scala:167) at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:83) at org.apache.spark.storage.BlockManagerMaster.removeBroadcast(BlockManagerMaster.scala:143) And timeout exceptions related to the following: spark.core.connection.ack.wait.timeout spark.akka.timeout spark.storage.blockManagerSlaveTimeoutMs spark.shuffle.io.connectionTimeout spark.rpc.askTimeout spark.rpc.lookupTimeout §  Depending on system resource usage, any of the above can occur (e.g., no heartbeats) §  You can tune each individual setting OR use an “umbrella” timeout setting §  Controlled by ‘spark.network.timeout’ parameter -  Default is 120 seconds -  Overrides all above timeout values §  To fix, add the following line to ‘spark-defaults.conf’ file: spark.network.timeout 700
  • 16. Out of space on a few data nodes …
§  Unbalanced HDFS forces more IO over the network
§  Run the command ‘hdfs balancer’ to start rebalancing
§  dfs.datanode.balance.bandwidthPerSec
-  Default 6250000, or 6.25 MB/s of network bandwidth
-  Increased to 6 GB/s on F1 to take advantage of the fat pipe
§  dfs.datanode.balance.max.concurrent.moves
-  Default is undefined
-  Add this setting in hdfs-site
-  Set to 500 concurrent threads
-  Example below shows a balancing rate of roughly 5.3 TB/hour (19.71 TB in about 3.7 hours)
16/10/05 10:17:24 INFO balancer.Balancer: 0 over-utilized: []
16/10/05 10:17:24 INFO balancer.Balancer: 0 underutilized: []
The cluster is balanced. Exiting...
Oct 5, 2016 10:17:24 AM 337 19.71 TB 0 B -1 B
Oct 5, 2016 10:17:24 AM Balancing took 3.6939516666666665 hours
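The balancing rate can be checked directly from the two figures in the balancer output. This is a small arithmetic sketch; it assumes the 19.71 TB figure is the total data moved, and the helper names are made up for illustration. The result should land close to the rate quoted above.

```python
def balancing_rate_tb_per_hour(tb_moved, hours):
    """Average balancing throughput in TB per hour."""
    return tb_moved / hours


def bytes_per_sec_to_mb(bytes_per_sec):
    """Convert a bandwidthPerSec value (bytes/s) to decimal MB/s."""
    return bytes_per_sec / 1_000_000


if __name__ == "__main__":
    # Figures copied from the balancer log in this guide.
    rate = balancing_rate_tb_per_hour(19.71, 3.6939516666666665)
    print(f"balancing rate: {rate:.2f} TB/hour")
    # The default bandwidthPerSec value quoted in this guide.
    print(f"default bandwidth: {bytes_per_sec_to_mb(6_250_000):.2f} MB/s")
```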
  • 17. What block size to use in HDFS and in Parquet?
§  Take-away: Keep block size for both at default (128MB)
[Diagram: rows of Parquet blocks laid out against rows of HDFS blocks]
§  Remote reads occur when block boundaries cross
§  This slows down scan time
§  Prefer row group boundaries to fall on HDFS block boundaries
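The effect of mismatched sizes can be illustrated with simple offset arithmetic. The sketch below is a simplified model, not how Parquet or HDFS actually lay out files: it assumes row groups are written back to back from offset 0 with no file header or footer, and the function names are hypothetical. It counts how many HDFS blocks each row group touches; any count above 1 means that row group crosses a block boundary and may need a remote read.

```python
MB = 1024 * 1024


def hdfs_blocks_spanned(start, length, block_size):
    """Number of HDFS blocks touched by the byte range [start, start + length)."""
    first_block = start // block_size
    last_block = (start + length - 1) // block_size
    return last_block - first_block + 1


def spans_for_file(row_group_size, block_size, n_row_groups):
    """Blocks touched by each of n back-to-back row groups of equal size."""
    return [
        hdfs_blocks_spanned(i * row_group_size, row_group_size, block_size)
        for i in range(n_row_groups)
    ]


if __name__ == "__main__":
    # Matching sizes: every row group sits inside exactly one HDFS block.
    print(spans_for_file(128 * MB, 128 * MB, 4))
    # Mismatched sizes: some row groups straddle two HDFS blocks.
    print(spans_for_file(96 * MB, 128 * MB, 4))
```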
  • 18. In-flight capturing of executor thread & heap dumps
§  Executors typically run as YARN containers across multiple nodes, e.g.,
yarn 355583 355580 91 09:15 ? 00:05:35 /usr/jdk64/java-1.8.0-openjdk-1.8.0.45-28.b13.el6_6.x86_64/bin/java -server -XX:OnOutOfMemoryError=kill %p -Xms6144m -Xmx6144m -Djava.io.tmpdir=/data6/hadoop/yarn/local/usercache/biadmin/appcache/application_1452558922304_0075/container_1452558922304_0075_01_000020/tmp -Dspark.driver.port=3110 -Dspark.history.ui.port=18080 -Dspark.yarn.app.container.log.dir=/data1/hadoop/yarn/log/application_1452558922304_0075/container_1452558922304_0075_01_000020 org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url akka.tcp://sparkDriver@9.30.104.154:3110/user/CoarseGrainedScheduler --executor-id 19 --hostname node133.yxz.com --cores 1 --app-id application_1452558922304_0075 --user-class-path file:/data6/hadoop/yarn/local/usercache/biadmin/appcache/application_1452558922304_0075/container_1452558922304_0075_01_000020/__app__.jar
§  OpenJDK has a set of tools for Java thread and heap dumps
jmap, jstack, jstat, jhat, etc.
§  Typical location of OpenJDK tools for IBM Hadoop platform
/usr/jdk64/java-1.8.0-openjdk-1.8.0.45-28.b13.el6_6.x86_64/bin/
§  To get a full thread dump
% jstack -l 355583 > /TestAutomation/results/twitter/javacore.355583.1
% jstack -l -F 355583 > /TestAutomation/results/twitter/javacore-hung.355583.1
-  Use -F to attach to a non-responsive JVM
§  To get a full heap dump
% jmap -dump:live,format=b,file=/TestAutomation/results/dump.355583.2 355583
Dumping heap to /TestAutomation/results/sparkstreamtests/dump.355583.2 ...
Heap dump file created
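When dumps have to be taken from many executors, the commands above can be assembled programmatically. The sketch below is a minimal illustration with hypothetical helper names; it only builds the argument lists, using the same ‘jstack’ and ‘jmap’ options shown above, and assumes the process line comes from ‘ps -ef’ (PID in the second column).

```python
def pid_from_ps_line(ps_line):
    """Extract the PID from a `ps -ef` style line (UID PID PPID ...)."""
    return int(ps_line.split()[1])


def jstack_cmd(pid, force=False):
    """Argument list for a full thread dump; force adds -F for a hung JVM."""
    cmd = ["jstack", "-l"]
    if force:
        cmd.append("-F")
    cmd.append(str(pid))
    return cmd


def jmap_heap_dump_cmd(pid, out_file):
    """Argument list for a binary heap dump of live objects."""
    return ["jmap", f"-dump:live,format=b,file={out_file}", str(pid)]


if __name__ == "__main__":
    # Shortened version of the executor process line shown in this guide.
    ps_line = "yarn 355583 355580 91 09:15 ? 00:05:35 /usr/jdk64/.../bin/java -server"
    pid = pid_from_ps_line(ps_line)
    print(" ".join(jstack_cmd(pid)))
    print(" ".join(jstack_cmd(pid, force=True)))
    print(" ".join(jmap_heap_dump_cmd(pid, "/TestAutomation/results/dump.355583.2")))
```

The lists can be handed to `subprocess.run` with `stdout` redirected to a file. The tools generally need to run as the same user that owns the JVM, which is ‘yarn’ in the example process line.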
  • 19. Can’t find core dumps even when Spark says there are some …
§  Core dumps created by Spark jobs
16/11/14 16:45:05 WARN scheduler.TaskSetManager: Lost task 692.0 in stage 4.0 (TID 129021, node12.xyz.com, executor 824): ExecutorLostFailure (executor 824 exited caused by one of the running tasks) Reason: Container marked as failed: container_e69_1479156026828_0006_01_000825 on host: node12.xyz.com. Exit status: 134. Diagnostics: Exception from container-launch.
Exit code: 134
Container id: container_e69_1479156026828_0006_01_000825
Exception message: /bin/bash: line 1: 3694385 Aborted (core dumped) /usr/jdk64/java-1.8.0-openjdk-1.8.0.77-0.b03.el7_2.x86_64/bin/java -server -Xmx24576m -Djava.io.tmpdir=/data2/hadoop/yarn/local/ ….ontainer.log.dir=/data5/hadoop/…container_e69_1479156026828_0006_01_000825/com.univocity_univocity-parsers-1.5.1.jar > /data5/hadoop/yarn/log/application_1479156026828_0006/container_e69_1479156026828_0006_01_000825/stdout 2> /data5/hadoop/yarn/log/application_1479156026828_0006/container_e69_1479156026828_0006_01_000825/stderr
Stack trace: ExitCodeException exitCode=134: /bin/bash: line 1: 3694385 Aborted (core dumped) /usr/jdk64/java-1.8.0-openjdk-1.8.0.77-0.b03.el7_2.x86_64/bin/java -server -Xmx24576m -Djava.io.tmpdir=/data2/hadoop/-… container_e69_1479156026828_0006_01_000825/com.univocity_univocity-parsers-1.5.1.jar > /data5/hadoop/yarn/log/application_1479156026828_0006/container_e69_1479156026828_0006_01_000825/stdout 2> /data5/hadoop/yarn/log/application_1479156026828_0006/container_e69_1479156026828_0006_01_000825/stderr
§  YARN setting for core dump file retention
yarn.nodemanager.delete.debug-delay-sec
-  Default is 0: files are deleted right after the application finishes
-  Set it to a delay long enough to reach the files and copy them for debugging
§  Steps:
1. Find the hostname in the error log;
2. Find the local directory where ‘stderr’ resides;
3. Open ‘stderr’; you will find lines similar to:
/data2/hadoop/yarn/local/usercache/spark/appcache/application_1479156026828_0006/container_e69_1479156026828_0006_01_000825/hs_err_pid3694385.log
§  The core dump files are referenced there too!
§  More on this setting
https://hadoop.apache.org/docs/r2.7.3/hadoop-yarn/hadoop-yarn-common/yarn-default.xml
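The first and third steps lend themselves to a small script. The sketch below is written against the log shapes shown on this slide only; the function names and regular expressions are assumptions, and other Hadoop or Spark versions may format these messages differently. It pulls the host, container id, and exit status out of the driver-side warning, then looks for the JVM crash log path in the contents of ‘stderr’.

```python
import re

# Patterns written against the messages quoted on this slide.
FAILED_RE = re.compile(
    r"Container marked as failed: (container_\w+) on host: (\S+?)\. Exit status: (\d+)"
)
HS_ERR_RE = re.compile(r"(\S*hs_err_pid(\d+)\.log)")


def parse_container_failure(log_line):
    """Return (host, container_id, exit_status) or None if the line does not match."""
    match = FAILED_RE.search(log_line)
    if match is None:
        return None
    container_id, host, status = match.groups()
    return host, container_id, int(status)


def find_crash_logs(stderr_text):
    """Return (path, pid) pairs for every hs_err_pid<PID>.log mentioned."""
    return [(path, int(pid)) for path, pid in HS_ERR_RE.findall(stderr_text)]


if __name__ == "__main__":
    warn = ("Reason: Container marked as failed: "
            "container_e69_1479156026828_0006_01_000825 on host: node12.xyz.com. "
            "Exit status: 134. Diagnostics: Exception from container-launch.")
    print(parse_container_failure(warn))

    stderr_text = ("/data2/hadoop/yarn/local/usercache/spark/appcache/"
                   "application_1479156026828_0006/"
                   "container_e69_1479156026828_0006_01_000825/hs_err_pid3694385.log")
    print(find_crash_logs(stderr_text))
```

The PID embedded in the crash log name matches the PID in the "Aborted (core dumped)" message, which gives a quick way to confirm the right container directory was found.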