SlideShare a Scribd company logo
1 of 21
Download to read offline
© 2016 IBM Corporation
Spark 2.0 Tuning Guide
Big Data Performance Team
08/2016
© 2016 IBM Corporation2
Troubleshooting and Tuning Spark 2.0
§  Building Spark
§  Running Spark
-  ‘--verbose’
-  Missing external JARs
-  OOM on Spark driver
-  OOM on executors
-  GC policies
-  Spark Thrift Server for JDBC apps
§  Tuning Spark
-  Ways to give more cores to your Spark workload
-  Spark Run Time Errors and Resolutions
-  Spark History Server tuning
-  Spread out Spark “scratch” space
-  SQL: Catalyst errors
-  SQL: Timeouts
-  Ways to monitor and tune streaming
§  Profiling Spark
-  Thread & heap dumps
© 2016 IBM Corporation3
Tips on building a new Spark release
§  Run ‘make-distribution.sh’ (generates ‘bin/spark-shell’, ‘bin/spark-submit’, etc.)
§  Official documentation at http://spark.apache.org/docs/latest/building-spark.html
§  Does not always work
-  Wrong JRE version or no JRE found
-  No Maven installed
-  Support for certain components not default, e.g., ‘hive’ support
§  TIP #1: Always explicitly set the following in ‘.bashrc’ for ‘root’
# for Spark distribution compiling
export JAVA_HOME=/usr/lib/jvm/java-1.7.0-openjdk-1.7.0.75.x86_64
export JRE_HOME=$JAVA_HOME/jre
export PATH=$JAVA_HOME/bin:$PATH
export CLASSPATH=.:$JAVA_HOME/lib:$JRE_HOME/lib:$CLASSPATH
#set maven environment
M2_HOME=/TestAutomation/spark-1.5-preview/build/apache-maven-3.3.3
export MAVEN_OPTS="-Xms512m –Xmx1024m -XX:MaxPermSize=256m"
export PATH=$M2_HOME/bin:$PATH
§  TIP #2: Specify support you want explicitly
-  To build Spark with YARN and Hive support, do:
./make-distribution.sh -Pyarn -Phive -Phive-thriftserver
For Spark 2.0, ./dev/make-distribution.sh -Pyarn -Phive -Phive-thriftserver
© 2016 IBM Corporation4
Running Spark: spark-submit ‘--verbose’
§  Always use ‘–-verbose’ option on ‘spark-submit’ command to run your workload
§  Prints
-  All default properties
-  Command line options
-  Settings from spark ‘conf’ file
-  Settings from CLI
§  Very low overhead since it prints only additional configuration info
§  Example output
spark.yarn.queue -> default
spark.history.kerberos.principal -> none
spark.driver.memory -> 12g
spark.yarn.jar -> hdfs://bigaperf132.svl.ibm.com:8020/iop/apps/4.1.0.0/spark/jars/spark-assembly.jar
spark.yarn.max.executor.failures -> 3
spark.yarn.historyServer.address -> bigaperf132.svl.ibm.com:18080
spark.eventLog.enabled -> true
spark.history.ui.port -> 18080
spark.yarn.applicationMaster.waitTries -> 10
…
§  Example command:
spark-submit --driver-memory 12g --verbose --master yarn-client 
--executor-memory 4096m --num-executors 20 …
© 2016 IBM Corporation5
Missing external jars
§  Compiled OK, but run-time NoClassDefFoundError:
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/kafka/clients/producer/KafkaProducer
at java.lang.Class.getDeclaredMethods0(Native Method)
at java.lang.Class.privateGetDeclaredMethods(Class.java:2701)
at java.lang.Class.privateGetMethodRecursive(Class.java:3048)
at java.lang.Class.getMethod0(Class.java:3018)
§  Use ‘--packages’ to include comma-separated list of Maven coordinates of JARs
§  Example
spark-submit --driver-memory 12g --verbose --master yarn-client --executor-memory 4096m --num-executors 20
--class com.ibm.biginsights.pqa.spark.SparkStreamingTest --packages org.apache.spark:spark-streaming-
kafka_2.10:1.5.1 …
§  This includes JARs on both driver and executor classpaths
§  Order of look-up
-  The local Maven repo – local machine
-  Maven central - Web
-  Additional remote repositories specified in –repositories
© 2016 IBM Corporation6
OOM related to Spark driver
§  Types of OOM related to Spark driver heap size
15/10/06 17:10:00 ERROR akka.ErrorMonitor: Uncaught fatal error from thread [sparkDriver-
akka.actor.default-dispatcher-29] shutting down ActorSystem [sparkDriver]
java.lang.OutOfMemoryError: Java heap space
Exception in thread "task-result-getter-0" java.lang.OutOfMemoryError: Java heap space
Subsequent error: Exception in thread "ResponseProcessor for block
BP-1697216913-9.30.104.154-1438974319723:blk_1073847224_106652" java.lang.OutOfMemoryError: Java heap
space
WARN nio.AbstractNioSelector: Unexpected exception in the selector loop.
java.lang.OutOfMemoryError: Java heap space at
org.jboss.netty.buffer.HeapChannelBuffer.<init>(HeapChannelBuffer.java:42)
§  Increase ‘--driver-memory’ usually resolves these
§  Default 512M is usually too small for serious workloads
§  Example: 8GB minimum needed for Spark SQL running TPCDS @ 1TB
§  Typical workloads that need large driver heap size
-  Spark SQL
-  Spark Streaming
§  Also, 1.6.1 requires more driver heap size than 1.5 or earlier
© 2016 IBM Corporation7
OOM – GC overhead limit exceeded
15/12/09 19:57:02 WARN scheduler.TaskSetManager: Lost task 175.0 in stage 68.0 (TID 7588,
rhel8.cisco.com): java.lang.OutOfMemoryError: GC overhead limit exceeded
at org.apache.spark.sql.catalyst.expressions.UnsafeRow.copy(UnsafeRow.java:478)
at org.apache.spark.sql.catalyst.expressions.UnsafeRow.copy(UnsafeRow.java:55)
§  Too much time is being spent in garbage collection (98% of the total time)
§  Less than 2% of the heap is recovered
§  From ‘top’, often see “1 CPU core fully used at 100%” but no work is done
§  Tuning #1: Increase executor heapsize
spark-submit … --executor-memory 4096m --num-executors 20 …
§  OR Tuning #2: Increase spark.storage.memoryFraction (fraction of 1)
-  Fraction of Java heap to use for Spark's memory cache, default 0.6
-  Note: this is deprecated in Spark 1.6.0
-  Example: Running TPCDS query 67 in Spark SQL
property: spark.storage.memoryFraction=0.6 – causes OOM
property: spark.storage.memoryFraction=0.8 – completes OK
§  OR Tuning #3: Change GC policy (next slide)
© 2016 IBM Corporation8
GC policies
§  Choose between -XX:UseG1GC & -XX:UseParallelGC
§  Show current GC settings
% /usr/jdk64/java-1.8.0-openjdk-1.8.0.45-28.b13.el6_6.x86_64/bin/java -XX:+PrintFlagsFinal
uintx GCHeapFreeLimit = 2 {product}
uintx GCLockerEdenExpansionPercent = 5 {product}
uintx GCLogFileSize = 8192 {product}
uintx GCTimeLimit = 98 {product}
uintx GCTimeRatio = 99 {product}
bool UseG1GC = false {product}
bool UseParallelGC := true {product}
§  Tuning options
-  Spark default is -XX:UseParallelGC
-  Try overwrite with –XX:G1GC
§  Performance Impact: “Mythical at best”, “It depends”
§  Default is GOOD ENOUGH in most cases!
§  Databricks blog on Tuning GC for Spark
-  https://databricks.com/blog/2015/05/28/tuning-java-garbage-collection-for-spark-
applications.html
© 2016 IBM Corporation9
Support JDBC Apps via Spark Thrift Server
§  Spark SQL can act as a distributed query engine using its JDBC/ODBC interface
§  Supported by running the Thrift JDBC/ODBC server
§  Has a single SparkContext with multiple sessions supporting
-  Concurrency
-  re-usable connections (pool)
-  Shared cache (e.g., catalog, tables, etc.)
§  Can specify any amount of memory, CPUs through standard Spark-submit parameters:
-  Driver-memory
-  Executor-memory
-  Num-executors, etc.
§  Example, to start Thrift Server with 2.3TB of memory, 800 cores and YARN mode:
% $SPARK_HOME/sbin/start-thriftserver.sh --driver-memory 12g --verbose --master yarn --executor-memory 16g
--num-executors 100 --executor-cores 8 --conf spark.hadoop.yarn.timeline-service.enabled=false --conf
spark.yarn.executor.memoryOverhead=8192 --conf spark.driver.maxResultSize=5g
§  Default number of workers (sessions) = 500
§  Client tool bundled with Spark 2.0: Beeline
% $SPARK_HOME/bin/beeline -u "jdbc:hive2://bigaperf460.svl.ibm.com:10013/my1tbdb" -n spark --force=true -f /
test/query_00_01_96.sql
© 2016 IBM Corporation10
Ways to give more cores to your Spark job
§  Designed for big data
§  More cores and more memory always better (well, until it breaks!)
§  Ways to max out your cluster, for example:
-  40 vCores per node
-  128GB memory per node
-  5-node cluster = 200 vCores, ~500GB RAM
§  Method #1 – Start with evenly divided memory and cores
--executor-memory 2500m --num-executors 200
Total # of executors = 200 (default: 1-core each)
# of executors/node = 40 (fully using all cores)
Total memory used = 500 GB
§  Method #2 – Heap size non-negotiable
--executor-memory 6g --num-executors 80
Total # of executors = 80 (1-core each)
# of executors/node = 16 (40% CPU utilization)
Total memory used ~= 500 GB
Can increase cores per executor as:
--executor-memory 6g --num-executors 80 –executor-cores 2
Forcing 80% utilization, boosting 33% performance!
© 2016 IBM Corporation11
History Server Tuning
§  Spark History Server – viewing job stats after the fact
§  Uses Spark event logs (by default in JSON)
§  Can have OOM when dealing with long-running jobs
§  Allocated 1GB by default with SPARK_DAEMON_MEMORY
§  Recommend SPARK_DAEMON_MEMORY=4g
§  Note: not “-Xmx4g”!
-  cd /usr/iop/current/spark-client/conf
-  vi spark-env.sh
-  Add ‘export SPARK_DAEMON_MEMORY=4g’
-  Restart Spark History Server from Ambari console
© 2016 IBM Corporation12
Spread out Spark “scratch” space
§  Typical error
stage 89.3 failed 4 times, most recent failure:
Lost task 38.4 in stage 89.3 (TID 30100, rhel4.cisco.com): java.io.IOException: No space left on device
at java.io.FileOutputStream.writeBytes(Native Method)
at java.io.FileOutputStream.write(FileOutputStream.java:326)
at org.apache.spark.storage.TimeTrackingOutputStream.write(TimeTrackingOutputStream.java:58)
at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
at java.io.BufferedOutputStream.write(BufferedOutputStream.java:126)
§  Complains about ‘/tmp’ is full
§  Controlled by ‘spark.local.dir’ parameter
-  Default is ‘/tmp’
-  Stores map output files and RDDs
§  Two reasons ‘/tmp’ is not an ideal place for Spark “scratch” space
-  ‘/tmp’ usually is small and for OS
-  ‘/tmp’ usually is a single disk, a potential IO bottleneck
§  To fix, add the following line to ‘spark-defaults.conf’ file:
spark.local.dir /data/disk1/tmp,/data/disk2/tmp,/data/disk3/tmp,/data/disk4/tmp,…
© 2016 IBM Corporation13
Max result size exceeded
§  Typical error
stream5/query_05_22_77.sql.out:Error: org.apache.spark.SparkException: Job aborted due to stage failure:
Total size of serialized results of 381610 tasks (5.0 GB) is bigger than spark.driver.maxResultSize (5.0
GB) (state=,code=0))
§  Likely to occur with complex SQL on large data volumes
§  Limit of total size of serialized results of all partitions for each Spark action (e.g., collect)
§  Controlled by ‘spark.driver.maxResultSize’ parameter
-  Default is 1G
-  Can be ‘0’ or ‘unlimited’
-  ‘unlimited’ will throw OOM on driver
§  To fix, add the following line to ‘spark-defaults.conf’ file:
spark.driver.maxResultSize 5g
** 5G is a learned value for Spark SQL running TPCDS queries at 1TB scale factors
© 2016 IBM Corporation14
Catalyst errors
§  Typical error
stream7/query_07_24_48.sql.out:Error: org.apache.spark.sql.catalyst.errors.package$TreeNodeException:
execute, tree: at org.apache.spark.sql.execution.exchange.ShuffleExchange$$anonfun$doExecute
$1.apply(ShuffleExchange.scala:122)
at org.apache.spark.sql.execution.exchange.ShuffleExchange$$anonfun$doExecute
$1.apply(ShuffleExchange.scala:113)
at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:49)
... 96 more
Caused by: java.util.concurrent.TimeoutException: Futures timed out after [800 seconds]
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:190)
at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
at scala.concurrent.Await$.result(package.scala:190)
at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:190)
... 208 more
§  On surface appears to be Catalyst error (optimizer)
§  Actually an internal Spark timeout error most likely to occur under concurrency
java.util.concurrent.TimeoutException: Futures timed out after [800 seconds]
§  Controlled by an unpublished Spark setting ‘spark.sql.broadcastTimeout’ parameter
-  Default in source code shows 300 seconds
§  To fix, add the following line to ‘spark-defaults.conf’ file or as CLI --conf
spark.sql.broadcastTimeout 1200
**1200 is the longest running query in a SQL workload in our case.
© 2016 IBM Corporation15
Other timeouts
§  Typical errors
16/07/09 01:14:18 ERROR spark.ContextCleaner: Error cleaning broadcast 28267
org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [800 seconds]. This timeout is
controlled by spark.rpc.askTimeout
at org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$
$createRpcTimeoutException(RpcTimeout.scala:48)
at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:63)
at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
at scala.PartialFunction$OrElse.apply(PartialFunction.scala:167)
at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:83)
at org.apache.spark.storage.BlockManagerMaster.removeBroadcast(BlockManagerMaster.scala:143)
And timeout exceptions related to the following:
spark.core.connection.ack.wait.timeout
spark.akka.timeout
spark.storage.blockManagerSlaveTimeoutMs
spark.shuffle.io.connectionTimeout
spark.rpc.askTimeout
spark.rpc.lookupTimeout
§  Depending on system resource usage, any of the above can occur (e.g., no heartbeats)
§  You can tune each individual setting OR use an “umbrella” timeout setting
§  Controlled by ‘spark.network.timeout’ parameter
-  Default is 120 seconds
-  Overrides all above timeout values
§  To fix, add the following line to ‘spark-defaults.conf’ file:
spark.network.timeout 700
© 2016 IBM Corporation16
SQL Query – SortMergeJoin vs ShuffleHashJoin
§  Certain multi-way join can force SortMergeJoin on fact tables
§  Query will run for a long time
§  See right for such a plan from Spark 2.0
§  By default
-  spark.sql.join.preferSortMergeJoin is true
§  Can set it to false and use ShuffleHashJoin instead
§  To tune, set it to false in ‘spark-defaults.conf’ file:
spark.sql.join.preferSortMergeJoin false
© 2016 IBM Corporation17
Spark streaming monitoring and tuning
§  Existing Spark UI pretty good
-  Timelines
-  Histograms
-  Average response times
-  Queue
§  Ways to tune
-  If response time > batch window
Increase batch window size
-  If more executors don’t generate
more throughput
Increase # of receivers
val numStreams=5
val multiTweetStreams=(1 to numStreams)
.map {i => KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc,
kafkaParams, topicsSet) }
// unified stream
val tweetStream= ssc.union(multiTweetStreams)
§  Very useful blog
https://databricks.com/blog/2015/
07/08/new-visualizations-for-
understanding-spark-streaming-
applications.html
1
3
2
3
2
1
© 2016 IBM Corporation18
Capture thread & heap dumps for Spark executors
§  Typically run as YARN containers across multiple nodes, e.g.,
yarn 355583 355580 91 09:15 ? 00:05:35 /usr/jdk64/java-1.8.0-
openjdk-1.8.0.45-28.b13.el6_6.x86_64/bin/java -server -XX:OnOutOfMemoryError=kill %p -Xms6144m -Xmx6144m -
Djava.io.tmpdir=/data6/hadoop/yarn/local/usercache/biadmin/appcache/application_1452558922304_0075/
container_1452558922304_0075_01_000020/tmp -Dspark.driver.port=3110 -Dspark.history.ui.port=18080 -
Dspark.yarn.app.container.log.dir=/data1/hadoop/yarn/log/application_1452558922304_0075/
container_1452558922304_0075_01_000020 org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url
akka.tcp://sparkDriver@9.30.104.154:3110/user/CoarseGrainedScheduler --executor-id 19 --hostname
bigaperf133.svl.ibm.com --cores 1 --app-id application_1452558922304_0075 --user-class-path file:/data6/
hadoop/yarn/local/usercache/biadmin/appcache/application_1452558922304_0075/
container_1452558922304_0075_01_000020/__app__.jar
§  OpenJDK has a set of tools for Java thread and heap dumps
jmap, jstack, jstat, jhat, etc.
§  Typical location of OpenJDK tools for IBM Hadoop platform
/usr/jdk64/java-1.8.0-openjdk-1.8.0.45-28.b13.el6_6.x86_64/bin/
§  To get a full thread dump
% jstack –l 355583 > /TestAutomation/results/twitter/javacore.355583.1
% jstack –l –F 355583 > /TestAutomation/results/twitter/javacore-hung.355583.1
Use –F to attach to a non-responsive JVM
§  To get a full heap dump
% jmap -dump:live,format=b,file=/TestAutomation/results/dump.355583.2 355583
Dumping heap to /TestAutomation/results/sparkstreamtests/dump.355583.2 ...
Heap dump file created
© 2016 IBM Corporation19
Contact
Jesse Chen, Performance Engineer, jfchen@us.ibm.com
Roy Cecil, Performance Engineer, roy.cecil@ie.ibm.com
© 2016 IBM Corporation21
Legal Disclaimer
•  © IBM Corporation 2014. All Rights Reserved.
•  The information contained in this publication is provided for informational purposes only. While efforts were made to verify the completeness and accuracy of the information contained
in this publication, it is provided AS IS without warranty of any kind, express or implied. In addition, this information is based on IBM’s current product plans and strategy, which are
subject to change by IBM without notice. IBM shall not be responsible for any damages arising out of the use of, or otherwise related to, this publication or any other materials. Nothing
contained in this publication is intended to, nor shall have the effect of, creating any warranties or representations from IBM or its suppliers or licensors, or altering the terms and
conditions of the applicable license agreement governing the use of IBM software.
•  References in this presentation to IBM products, programs, or services do not imply that they will be available in all countries in which IBM operates. Product release dates and/or
capabilities referenced in this presentation may change at any time at IBM’s sole discretion based on market opportunities or other factors, and are not intended to be a commitment to
future product or feature availability in any way. Nothing contained in these materials is intended to, nor shall have the effect of, stating or implying that any activities undertaken by you
will result in any specific sales, revenue growth or other results.
•  If the text contains performance statistics or references to benchmarks, insert the following language; otherwise delete:
Performance is based on measurements and projections using standard IBM benchmarks in a controlled environment. The actual throughput or performance that any user will
experience will vary depending upon many factors, including considerations such as the amount of multiprogramming in the user's job stream, the I/O configuration, the storage
configuration, and the workload processed. Therefore, no assurance can be given that an individual user will achieve results similar to those stated here.
•  If the text includes any customer examples, please confirm we have prior written approval from such customer and insert the following language; otherwise delete:
All customer examples described are presented as illustrations of how those customers have used IBM products and the results they may have achieved. Actual environmental costs
and performance characteristics may vary by customer.
•  Please review text for proper trademark attribution of IBM products. At first use, each product name must be the full name and include appropriate trademark symbols (e.g., IBM Lotus®
Sametime® Unyte™). Subsequent references can drop “IBM” but should include the proper branding (e.g., Lotus Sametime Gateway, or WebSphere Application Server). Please refer
to http://www.ibm.com/legal/copytrade.shtml for guidance on which trademarks require the ® or ™ symbol. Do not use abbreviations for IBM product names in your presentation. All
product names must be used as adjectives rather than nouns. Please list all of the trademarks that you use in your presentation as follows; delete any not included in your presentation.
IBM, the IBM logo, Lotus, Lotus Notes, Notes, Domino, Quickr, Sametime, WebSphere, UC2, PartnerWorld and Lotusphere are trademarks of International Business Machines
Corporation in the United States, other countries, or both. Unyte is a trademark of WebDialogs, Inc., in the United States, other countries, or both.
•  If you reference Adobe® in the text, please mark the first use and include the following; otherwise delete:
Adobe, the Adobe logo, PostScript, and the PostScript logo are either registered trademarks or trademarks of Adobe Systems Incorporated in the United States, and/or other countries.
•  If you reference Java™ in the text, please mark the first use and include the following; otherwise delete:
Java and all Java-based trademarks are trademarks of Sun Microsystems, Inc. in the United States, other countries, or both.
•  If you reference Microsoft® and/or Windows® in the text, please mark the first use and include the following, as applicable; otherwise delete:
Microsoft and Windows are trademarks of Microsoft Corporation in the United States, other countries, or both.
•  If you reference Intel® and/or any of the following Intel products in the text, please mark the first use and include those that you use as follows; otherwise delete:
Intel, Intel Centrino, Celeron, Intel Xeon, Intel SpeedStep, Itanium, and Pentium are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and
other countries.
•  If you reference UNIX® in the text, please mark the first use and include the following; otherwise delete:
UNIX is a registered trademark of The Open Group in the United States and other countries.
•  If you reference Linux® in your presentation, please mark the first use and include the following; otherwise delete:
Linux is a registered trademark of Linus Torvalds in the United States, other countries, or both. Other company, product, or service names may be trademarks or service marks of
others.
•  If the text/graphics include screenshots, no actual IBM employee names may be used (even your own), if your screenshots include fictitious company names (e.g., Renovations, Zeta
Bank, Acme) please update and insert the following; otherwise delete: All references to [insert fictitious company name] refer to a fictitious company and are used for illustration
purposes only.

More Related Content

Recently uploaded

GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]📊 Markus Baersch
 
Multiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdfMultiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdfchwongval
 
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degreeyuu sss
 
LLMs, LMMs, their Improvement Suggestions and the Path towards AGI
LLMs, LMMs, their Improvement Suggestions and the Path towards AGILLMs, LMMs, their Improvement Suggestions and the Path towards AGI
LLMs, LMMs, their Improvement Suggestions and the Path towards AGIThomas Poetter
 
detection and classification of knee osteoarthritis.pptx
detection and classification of knee osteoarthritis.pptxdetection and classification of knee osteoarthritis.pptx
detection and classification of knee osteoarthritis.pptxAleenaJamil4
 
Vision, Mission, Goals and Objectives ppt..pptx
Vision, Mission, Goals and Objectives ppt..pptxVision, Mission, Goals and Objectives ppt..pptx
Vision, Mission, Goals and Objectives ppt..pptxellehsormae
 
Real-Time AI Streaming - AI Max Princeton
Real-Time AI  Streaming - AI Max PrincetonReal-Time AI  Streaming - AI Max Princeton
Real-Time AI Streaming - AI Max PrincetonTimothy Spann
 
Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...Seán Kennedy
 
Thiophen Mechanism khhjjjjjjjhhhhhhhhhhh
Thiophen Mechanism khhjjjjjjjhhhhhhhhhhhThiophen Mechanism khhjjjjjjjhhhhhhhhhhh
Thiophen Mechanism khhjjjjjjjhhhhhhhhhhhYasamin16
 
ASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel CanterASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel Cantervoginip
 
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝DelhiRS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhijennyeacort
 
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理e4aez8ss
 
Semantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxSemantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxMike Bennett
 
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...Biometric Authentication: The Evolution, Applications, Benefits and Challenge...
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...GQ Research
 
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...Boston Institute of Analytics
 
Conf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming PipelinesConf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming PipelinesTimothy Spann
 
RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.natarajan8993
 
modul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptxmodul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptxaleedritatuxx
 
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一F sss
 

Recently uploaded (20)

GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]
 
Multiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdfMultiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdf
 
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
 
LLMs, LMMs, their Improvement Suggestions and the Path towards AGI
LLMs, LMMs, their Improvement Suggestions and the Path towards AGILLMs, LMMs, their Improvement Suggestions and the Path towards AGI
LLMs, LMMs, their Improvement Suggestions and the Path towards AGI
 
detection and classification of knee osteoarthritis.pptx
detection and classification of knee osteoarthritis.pptxdetection and classification of knee osteoarthritis.pptx
detection and classification of knee osteoarthritis.pptx
 
Vision, Mission, Goals and Objectives ppt..pptx
Vision, Mission, Goals and Objectives ppt..pptxVision, Mission, Goals and Objectives ppt..pptx
Vision, Mission, Goals and Objectives ppt..pptx
 
Real-Time AI Streaming - AI Max Princeton
Real-Time AI  Streaming - AI Max PrincetonReal-Time AI  Streaming - AI Max Princeton
Real-Time AI Streaming - AI Max Princeton
 
Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...
 
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
 
Thiophen Mechanism khhjjjjjjjhhhhhhhhhhh
Thiophen Mechanism khhjjjjjjjhhhhhhhhhhhThiophen Mechanism khhjjjjjjjhhhhhhhhhhh
Thiophen Mechanism khhjjjjjjjhhhhhhhhhhh
 
ASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel CanterASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel Canter
 
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝DelhiRS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
 
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
 
Semantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxSemantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptx
 
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...Biometric Authentication: The Evolution, Applications, Benefits and Challenge...
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...
 
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
 
Conf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming PipelinesConf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
 
RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.
 
modul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptxmodul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptx
 
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
 

Featured

PEPSICO Presentation to CAGNY Conference Feb 2024
PEPSICO Presentation to CAGNY Conference Feb 2024PEPSICO Presentation to CAGNY Conference Feb 2024
PEPSICO Presentation to CAGNY Conference Feb 2024Neil Kimberley
 
Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)contently
 
How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024Albert Qian
 
Social Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie InsightsSocial Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie InsightsKurio // The Social Media Age(ncy)
 
Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024Search Engine Journal
 
5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summary5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summarySpeakerHub
 
ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd Clark Boyd
 
Getting into the tech field. what next
Getting into the tech field. what next Getting into the tech field. what next
Getting into the tech field. what next Tessa Mero
 
Google's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search IntentGoogle's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search IntentLily Ray
 
Time Management & Productivity - Best Practices
Time Management & Productivity -  Best PracticesTime Management & Productivity -  Best Practices
Time Management & Productivity - Best PracticesVit Horky
 
The six step guide to practical project management
The six step guide to practical project managementThe six step guide to practical project management
The six step guide to practical project managementMindGenius
 
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...RachelPearson36
 
Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...
Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...
Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...Applitools
 
12 Ways to Increase Your Influence at Work
12 Ways to Increase Your Influence at Work12 Ways to Increase Your Influence at Work
12 Ways to Increase Your Influence at WorkGetSmarter
 
Ride the Storm: Navigating Through Unstable Periods / Katerina Rudko (Belka G...
Ride the Storm: Navigating Through Unstable Periods / Katerina Rudko (Belka G...Ride the Storm: Navigating Through Unstable Periods / Katerina Rudko (Belka G...
Ride the Storm: Navigating Through Unstable Periods / Katerina Rudko (Belka G...DevGAMM Conference
 
Barbie - Brand Strategy Presentation
Barbie - Brand Strategy PresentationBarbie - Brand Strategy Presentation
Barbie - Brand Strategy PresentationErica Santiago
 

Featured (20)

PEPSICO Presentation to CAGNY Conference Feb 2024
PEPSICO Presentation to CAGNY Conference Feb 2024PEPSICO Presentation to CAGNY Conference Feb 2024
PEPSICO Presentation to CAGNY Conference Feb 2024
 
Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)
 
How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024
 
Social Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie InsightsSocial Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie Insights
 
Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024
 
5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summary5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summary
 
ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd
 
Getting into the tech field. what next
Getting into the tech field. what next Getting into the tech field. what next
Getting into the tech field. what next
 
Google's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search IntentGoogle's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search Intent
 
How to have difficult conversations
How to have difficult conversations How to have difficult conversations
How to have difficult conversations
 
Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data Science
 
Time Management & Productivity - Best Practices
Time Management & Productivity -  Best PracticesTime Management & Productivity -  Best Practices
Time Management & Productivity - Best Practices
 
The six step guide to practical project management
The six step guide to practical project managementThe six step guide to practical project management
The six step guide to practical project management
 
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
 
Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...
Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...
Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...
 
12 Ways to Increase Your Influence at Work
12 Ways to Increase Your Influence at Work12 Ways to Increase Your Influence at Work
12 Ways to Increase Your Influence at Work
 
ChatGPT webinar slides
ChatGPT webinar slidesChatGPT webinar slides
ChatGPT webinar slides
 
More than Just Lines on a Map: Best Practices for U.S Bike Routes
More than Just Lines on a Map: Best Practices for U.S Bike RoutesMore than Just Lines on a Map: Best Practices for U.S Bike Routes
More than Just Lines on a Map: Best Practices for U.S Bike Routes
 
Ride the Storm: Navigating Through Unstable Periods / Katerina Rudko (Belka G...
Ride the Storm: Navigating Through Unstable Periods / Katerina Rudko (Belka G...Ride the Storm: Navigating Through Unstable Periods / Katerina Rudko (Belka G...
Ride the Storm: Navigating Through Unstable Periods / Katerina Rudko (Belka G...
 
Barbie - Brand Strategy Presentation
Barbie - Brand Strategy PresentationBarbie - Brand Strategy Presentation
Barbie - Brand Strategy Presentation
 

Apache Spark 2.0 Tuning Guide

  • 1. © 2016 IBM Corporation Spark 2.0 Tuning Guide Big Data Performance Team 08/2016
  • 2. © 2016 IBM Corporation2 Troubleshooting and Tuning Spark 2.0 §  Building Spark §  Running Spark -  ‘--verbose’ -  Missing external JARs -  OOM on Spark driver -  OOM on executors -  GC policies -  Spark Thrift Server for JDBC apps §  Tuning Spark -  Ways to give more cores to your Spark workload -  Spark Run Time Errors and Resolutions -  Spark History Server tuning -  Spread out Spark “scratch” space -  SQL: Catalyst errors -  SQL: Timeouts -  Ways to monitor and tune streaming §  Profiling Spark -  Thread & heap dumps
  • 3. © 2016 IBM Corporation3 Tips on building a new Spark release §  Run ‘make-distribution.sh’ (generates ‘bin/spark-shell’, ‘bin/spark-submit’, etc.) §  Official documentation at http://spark.apache.org/docs/latest/building-spark.html §  Does not always work -  Wrong JRE version or no JRE found -  No Maven installed -  Support for certain components not default, e.g., ‘hive’ support §  TIP #1: Always explicitly set the following in ‘.bashrc’ for ‘root’ # for Spark distribution compiling export JAVA_HOME=/usr/lib/jvm/java-1.7.0-openjdk-1.7.0.75.x86_64 export JRE_HOME=$JAVA_HOME/jre export PATH=$JAVA_HOME/bin:$PATH export CLASSPATH=.:$JAVA_HOME/lib:$JRE_HOME/lib:$CLASSPATH #set maven environment M2_HOME=/TestAutomation/spark-1.5-preview/build/apache-maven-3.3.3 export MAVEN_OPTS="-Xms512m –Xmx1024m -XX:MaxPermSize=256m" export PATH=$M2_HOME/bin:$PATH §  TIP #2: Specify support you want explicitly -  To build Spark with YARN and Hive support, do: ./make-distribution.sh -Pyarn -Phive -Phive-thriftserver For Spark 2.0, ./dev/make-distribution.sh -Pyarn -Phive -Phive-thriftserver
  • 4. © 2016 IBM Corporation4 Running Spark: spark-submit ‘--verbose’ §  Always use ‘–-verbose’ option on ‘spark-submit’ command to run your workload §  Prints -  All default properties -  Command line options -  Settings from spark ‘conf’ file -  Settings from CLI §  Very low overhead since it prints only additional configuration info §  Example output spark.yarn.queue -> default spark.history.kerberos.principal -> none spark.driver.memory -> 12g spark.yarn.jar -> hdfs://bigaperf132.svl.ibm.com:8020/iop/apps/4.1.0.0/spark/jars/spark-assembly.jar spark.yarn.max.executor.failures -> 3 spark.yarn.historyServer.address -> bigaperf132.svl.ibm.com:18080 spark.eventLog.enabled -> true spark.history.ui.port -> 18080 spark.yarn.applicationMaster.waitTries -> 10 … §  Example command: spark-submit --driver-memory 12g --verbose --master yarn-client --executor-memory 4096m --num-executors 20 …
  • 5. © 2016 IBM Corporation5 Missing external jars §  Compiled OK, but run-time NoClassDefFoundError: Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/kafka/clients/producer/KafkaProducer at java.lang.Class.getDeclaredMethods0(Native Method) at java.lang.Class.privateGetDeclaredMethods(Class.java:2701) at java.lang.Class.privateGetMethodRecursive(Class.java:3048) at java.lang.Class.getMethod0(Class.java:3018) §  Use ‘--packages’ to include comma-separated list of Maven coordinates of JARs §  Example spark-submit --driver-memory 12g --verbose --master yarn-client --executor-memory 4096m --num-executors 20 --class com.ibm.biginsights.pqa.spark.SparkStreamingTest --packages org.apache.spark:spark-streaming- kafka_2.10:1.5.1 … §  This includes JARs on both driver and executor classpaths §  Order of look-up -  The local Maven repo – local machine -  Maven central - Web -  Additional remote repositories specified in –repositories
  • 6. © 2016 IBM Corporation6 OOM related to Spark driver §  Types of OOM related to Spark driver heap size 15/10/06 17:10:00 ERROR akka.ErrorMonitor: Uncaught fatal error from thread [sparkDriver- akka.actor.default-dispatcher-29] shutting down ActorSystem [sparkDriver] java.lang.OutOfMemoryError: Java heap space Exception in thread "task-result-getter-0" java.lang.OutOfMemoryError: Java heap space Subsequent error: Exception in thread "ResponseProcessor for block BP-1697216913-9.30.104.154-1438974319723:blk_1073847224_106652" java.lang.OutOfMemoryError: Java heap space WARN nio.AbstractNioSelector: Unexpected exception in the selector loop. java.lang.OutOfMemoryError: Java heap space at org.jboss.netty.buffer.HeapChannelBuffer.<init>(HeapChannelBuffer.java:42) §  Increase ‘--driver-memory’ usually resolves these §  Default 512M is usually too small for serious workloads §  Example: 8GB minimum needed for Spark SQL running TPCDS @ 1TB §  Typical workloads that need large driver heap size -  Spark SQL -  Spark Streaming §  Also, 1.6.1 requires more driver heap size than 1.5 or earlier
  • 7. © 2016 IBM Corporation7 OOM – GC overhead limit exceeded 15/12/09 19:57:02 WARN scheduler.TaskSetManager: Lost task 175.0 in stage 68.0 (TID 7588, rhel8.cisco.com): java.lang.OutOfMemoryError: GC overhead limit exceeded at org.apache.spark.sql.catalyst.expressions.UnsafeRow.copy(UnsafeRow.java:478) at org.apache.spark.sql.catalyst.expressions.UnsafeRow.copy(UnsafeRow.java:55) §  Too much time is being spent in garbage collection (98% of the total time) §  Less than 2% of the heap is recovered §  From ‘top’, often see “1 CPU core fully used at 100%” but no work is done §  Tuning #1: Increase executor heapsize spark-submit … --executor-memory 4096m --num-executors 20 … §  OR Tuning #2: Increase spark.storage.memoryFraction (fraction of 1) -  Fraction of Java heap to use for Spark's memory cache, default 0.6 -  Note: this is deprecated in Spark 1.6.0 -  Example: Running TPCDS query 67 in Spark SQL property: spark.storage.memoryFraction=0.6 – causes OOM property: spark.storage.memoryFraction=0.8 – completes OK §  OR Tuning #3: Change GC policy (next slide)
  • 8. © 2016 IBM Corporation8 GC policies §  Choose between -XX:UseG1GC & -XX:UseParallelGC §  Show current GC settings % /usr/jdk64/java-1.8.0-openjdk-1.8.0.45-28.b13.el6_6.x86_64/bin/java -XX:+PrintFlagsFinal uintx GCHeapFreeLimit = 2 {product} uintx GCLockerEdenExpansionPercent = 5 {product} uintx GCLogFileSize = 8192 {product} uintx GCTimeLimit = 98 {product} uintx GCTimeRatio = 99 {product} bool UseG1GC = false {product} bool UseParallelGC := true {product} §  Tuning options -  Spark default is -XX:UseParallelGC -  Try overwrite with –XX:G1GC §  Performance Impact: “Mythical at best”, “It depends” §  Default is GOOD ENOUGH in most cases! §  Databricks blog on Tuning GC for Spark -  https://databricks.com/blog/2015/05/28/tuning-java-garbage-collection-for-spark- applications.html
  • 9. © 2016 IBM Corporation9 Support JDBC Apps via Spark Thrift Server §  Spark SQL can act as a distributed query engine using its JDBC/ODBC interface §  Supported by running the Thrift JDBC/ODBC server §  Has a single SparkContext with multiple sessions supporting -  Concurrency -  re-usable connections (pool) -  Shared cache (e.g., catalog, tables, etc.) §  Can specify any amount of memory, CPUs through standard Spark-submit parameters: -  Driver-memory -  Executor-memory -  Num-executors, etc. §  Example, to start Thrift Server with 2.3TB of memory, 800 cores and YARN mode: % $SPARK_HOME/sbin/start-thriftserver.sh --driver-memory 12g --verbose --master yarn --executor-memory 16g --num-executors 100 --executor-cores 8 --conf spark.hadoop.yarn.timeline-service.enabled=false --conf spark.yarn.executor.memoryOverhead=8192 --conf spark.driver.maxResultSize=5g §  Default number of workers (sessions) = 500 §  Client tool bundled with Spark 2.0: Beeline % $SPARK_HOME/bin/beeline -u "jdbc:hive2://bigaperf460.svl.ibm.com:10013/my1tbdb" -n spark --force=true -f / test/query_00_01_96.sql
  • 10. © 2016 IBM Corporation10 Ways to give more cores to your Spark job §  Designed for big data §  More cores and more memory always better (well, until it breaks!) §  Ways to max out your cluster, for example: -  40 vCores per node -  128GB memory per node -  5-node cluster = 200 vCores, ~500GB RAM §  Method #1 – Start with evenly divided memory and cores --executor-memory 2500m --num-executors 200 Total # of executors = 200 (default: 1-core each) # of executors/node = 40 (fully using all cores) Total memory used = 500 GB §  Method #2 – Heap size non-negotiable --executor-memory 6g --num-executors 80 Total # of executors = 80 (1-core each) # of executors/node = 16 (40% CPU utilization) Total memory used ~= 500 GB Can increase cores per executor as: --executor-memory 6g --num-executors 80 –executor-cores 2 Forcing 80% utilization, boosting 33% performance!
  • 11. © 2016 IBM Corporation11 History Server Tuning §  Spark History Server – viewing job stats after the fact §  Uses Spark event logs (by default in JSON) §  Can have OOM when dealing with long-running jobs §  Allocated 1GB by default with SPARK_DAEMON_MEMORY §  Recommend SPARK_DAEMON_MEMORY=4g §  Note: not “-Xmx4g”! -  cd /usr/iop/current/spark-client/conf -  vi spark-env.sh -  Add ‘export SPARK_DAEMON_MEMORY=4g’ -  Restart Spark History Server from Ambari console
  • 12. © 2016 IBM Corporation12 Spread out Spark “scratch” space §  Typical error stage 89.3 failed 4 times, most recent failure: Lost task 38.4 in stage 89.3 (TID 30100, rhel4.cisco.com): java.io.IOException: No space left on device at java.io.FileOutputStream.writeBytes(Native Method) at java.io.FileOutputStream.write(FileOutputStream.java:326) at org.apache.spark.storage.TimeTrackingOutputStream.write(TimeTrackingOutputStream.java:58) at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82) at java.io.BufferedOutputStream.write(BufferedOutputStream.java:126) §  Complains about ‘/tmp’ is full §  Controlled by ‘spark.local.dir’ parameter -  Default is ‘/tmp’ -  Stores map output files and RDDs §  Two reasons ‘/tmp’ is not an ideal place for Spark “scratch” space -  ‘/tmp’ usually is small and for OS -  ‘/tmp’ usually is a single disk, a potential IO bottleneck §  To fix, add the following line to ‘spark-defaults.conf’ file: spark.local.dir /data/disk1/tmp,/data/disk2/tmp,/data/disk3/tmp,/data/disk4/tmp,…
  • 13. © 2016 IBM Corporation13 Max result size exceeded §  Typical error stream5/query_05_22_77.sql.out:Error: org.apache.spark.SparkException: Job aborted due to stage failure: Total size of serialized results of 381610 tasks (5.0 GB) is bigger than spark.driver.maxResultSize (5.0 GB) (state=,code=0)) §  Likely to occur with complex SQL on large data volumes §  Limit of total size of serialized results of all partitions for each Spark action (e.g., collect) §  Controlled by ‘spark.driver.maxResultSize’ parameter -  Default is 1G -  Can be ‘0’ or ‘unlimited’ -  ‘unlimited’ will throw OOM on driver §  To fix, add the following line to ‘spark-defaults.conf’ file: spark.driver.maxResultSize 5g ** 5G is a learned value for Spark SQL running TPCDS queries at 1TB scale factors
  • 14. © 2016 IBM Corporation14 Catalyst errors §  Typical error stream7/query_07_24_48.sql.out:Error: org.apache.spark.sql.catalyst.errors.package$TreeNodeException: execute, tree: at org.apache.spark.sql.execution.exchange.ShuffleExchange$$anonfun$doExecute $1.apply(ShuffleExchange.scala:122) at org.apache.spark.sql.execution.exchange.ShuffleExchange$$anonfun$doExecute $1.apply(ShuffleExchange.scala:113) at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:49) ... 96 more Caused by: java.util.concurrent.TimeoutException: Futures timed out after [800 seconds] at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219) at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223) at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:190) at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53) at scala.concurrent.Await$.result(package.scala:190) at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:190) ... 208 more §  On surface appears to be Catalyst error (optimizer) §  Actually an internal Spark timeout error most likely to occur under concurrency java.util.concurrent.TimeoutException: Futures timed out after [800 seconds] §  Controlled by an unpublished Spark setting ‘spark.sql.broadcastTimeout’ parameter -  Default in source code shows 300 seconds §  To fix, add the following line to ‘spark-defaults.conf’ file or as CLI --conf spark.sql.broadcastTimeout 1200 **1200 is the longest running query in a SQL workload in our case.
  • 15. © 2016 IBM Corporation15 Other timeouts §  Typical errors 16/07/09 01:14:18 ERROR spark.ContextCleaner: Error cleaning broadcast 28267 org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [800 seconds]. This timeout is controlled by spark.rpc.askTimeout at org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$ $createRpcTimeoutException(RpcTimeout.scala:48) at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:63) at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59) at scala.PartialFunction$OrElse.apply(PartialFunction.scala:167) at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:83) at org.apache.spark.storage.BlockManagerMaster.removeBroadcast(BlockManagerMaster.scala:143) And timeout exceptions related to the following: spark.core.connection.ack.wait.timeout spark.akka.timeout spark.storage.blockManagerSlaveTimeoutMs spark.shuffle.io.connectionTimeout spark.rpc.askTimeout spark.rpc.lookupTimeout §  Depending on system resource usage, any of the above can occur (e.g., no heartbeats) §  You can tune each individual setting OR use an “umbrella” timeout setting §  Controlled by ‘spark.network.timeout’ parameter -  Default is 120 seconds -  Overrides all above timeout values §  To fix, add the following line to ‘spark-defaults.conf’ file: spark.network.timeout 700
  • 16. © 2016 IBM Corporation16 SQL Query – SortMergeJoin vs ShuffleHashJoin §  Certain multi-way join can force SortMergeJoin on fact tables §  Query will run for a long time §  See right for such a plan from Spark 2.0 §  By default -  spark.sql.join.preferSortMergeJoin is true §  Can set it to false and use ShuffleHashJoin instead §  To tune, set it to false in ‘spark-defaults.conf’ file: spark.sql.join.preferSortMergeJoin false
  • 17. © 2016 IBM Corporation17 Spark streaming monitoring and tuning §  Existing Spark UI pretty good -  Timelines -  Histograms -  Average response times -  Queue §  Ways to tune -  If response time > batch window Increase batch window size -  If more executors don’t generate more throughput Increase # of receivers val numStreams=5 val multiTweetStreams=(1 to numStreams) .map {i => KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topicsSet) } // unified stream val tweetStream= ssc.union(multiTweetStreams) §  Very useful blog https://databricks.com/blog/2015/ 07/08/new-visualizations-for- understanding-spark-streaming- applications.html 1 3 2 3 2 1
  • 18. © 2016 IBM Corporation18 Capture thread & heap dumps for Spark executors §  Typically run as YARN containers across multiple nodes, e.g., yarn 355583 355580 91 09:15 ? 00:05:35 /usr/jdk64/java-1.8.0- openjdk-1.8.0.45-28.b13.el6_6.x86_64/bin/java -server -XX:OnOutOfMemoryError=kill %p -Xms6144m -Xmx6144m - Djava.io.tmpdir=/data6/hadoop/yarn/local/usercache/biadmin/appcache/application_1452558922304_0075/ container_1452558922304_0075_01_000020/tmp -Dspark.driver.port=3110 -Dspark.history.ui.port=18080 - Dspark.yarn.app.container.log.dir=/data1/hadoop/yarn/log/application_1452558922304_0075/ container_1452558922304_0075_01_000020 org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url akka.tcp://sparkDriver@9.30.104.154:3110/user/CoarseGrainedScheduler --executor-id 19 --hostname bigaperf133.svl.ibm.com --cores 1 --app-id application_1452558922304_0075 --user-class-path file:/data6/ hadoop/yarn/local/usercache/biadmin/appcache/application_1452558922304_0075/ container_1452558922304_0075_01_000020/__app__.jar §  OpenJDK has a set of tools for Java thread and heap dumps jmap, jstack, jstat, jhat, etc. §  Typical location of OpenJDK tools for IBM Hadoop platform /usr/jdk64/java-1.8.0-openjdk-1.8.0.45-28.b13.el6_6.x86_64/bin/ §  To get a full thread dump % jstack –l 355583 > /TestAutomation/results/twitter/javacore.355583.1 % jstack –l –F 355583 > /TestAutomation/results/twitter/javacore-hung.355583.1 Use –F to attach to a non-responsive JVM §  To get a full heap dump % jmap -dump:live,format=b,file=/TestAutomation/results/dump.355583.2 355583 Dumping heap to /TestAutomation/results/sparkstreamtests/dump.355583.2 ... Heap dump file created
  • 19. © 2016 IBM Corporation19 Contact Jesse Chen, Performance Engineer, jfchen@us.ibm.com Roy Cecil, Performance Engineer, roy.cecil@ie.ibm.com
  • 20.
  • 21. © 2016 IBM Corporation21 Legal Disclaimer •  © IBM Corporation 2014. All Rights Reserved. •  The information contained in this publication is provided for informational purposes only. While efforts were made to verify the completeness and accuracy of the information contained in this publication, it is provided AS IS without warranty of any kind, express or implied. In addition, this information is based on IBM’s current product plans and strategy, which are subject to change by IBM without notice. IBM shall not be responsible for any damages arising out of the use of, or otherwise related to, this publication or any other materials. Nothing contained in this publication is intended to, nor shall have the effect of, creating any warranties or representations from IBM or its suppliers or licensors, or altering the terms and conditions of the applicable license agreement governing the use of IBM software. •  References in this presentation to IBM products, programs, or services do not imply that they will be available in all countries in which IBM operates. Product release dates and/or capabilities referenced in this presentation may change at any time at IBM’s sole discretion based on market opportunities or other factors, and are not intended to be a commitment to future product or feature availability in any way. Nothing contained in these materials is intended to, nor shall have the effect of, stating or implying that any activities undertaken by you will result in any specific sales, revenue growth or other results. •  If the text contains performance statistics or references to benchmarks, insert the following language; otherwise delete: Performance is based on measurements and projections using standard IBM benchmarks in a controlled environment. The actual throughput or performance that any user will experience will vary depending upon many factors, including considerations such as the amount of multiprogramming in the user's job stream, the I/O configuration, the storage configuration, and the workload processed. Therefore, no assurance can be given that an individual user will achieve results similar to those stated here. •  If the text includes any customer examples, please confirm we have prior written approval from such customer and insert the following language; otherwise delete: All customer examples described are presented as illustrations of how those customers have used IBM products and the results they may have achieved. Actual environmental costs and performance characteristics may vary by customer. •  Please review text for proper trademark attribution of IBM products. At first use, each product name must be the full name and include appropriate trademark symbols (e.g., IBM Lotus® Sametime® Unyte™). Subsequent references can drop “IBM” but should include the proper branding (e.g., Lotus Sametime Gateway, or WebSphere Application Server). Please refer to http://www.ibm.com/legal/copytrade.shtml for guidance on which trademarks require the ® or ™ symbol. Do not use abbreviations for IBM product names in your presentation. All product names must be used as adjectives rather than nouns. Please list all of the trademarks that you use in your presentation as follows; delete any not included in your presentation. IBM, the IBM logo, Lotus, Lotus Notes, Notes, Domino, Quickr, Sametime, WebSphere, UC2, PartnerWorld and Lotusphere are trademarks of International Business Machines Corporation in the United States, other countries, or both. Unyte is a trademark of WebDialogs, Inc., in the United States, other countries, or both. •  If you reference Adobe® in the text, please mark the first use and include the following; otherwise delete: Adobe, the Adobe logo, PostScript, and the PostScript logo are either registered trademarks or trademarks of Adobe Systems Incorporated in the United States, and/or other countries. •  If you reference Java™ in the text, please mark the first use and include the following; otherwise delete: Java and all Java-based trademarks are trademarks of Sun Microsystems, Inc. in the United States, other countries, or both. •  If you reference Microsoft® and/or Windows® in the text, please mark the first use and include the following, as applicable; otherwise delete: Microsoft and Windows are trademarks of Microsoft Corporation in the United States, other countries, or both. •  If you reference Intel® and/or any of the following Intel products in the text, please mark the first use and include those that you use as follows; otherwise delete: Intel, Intel Centrino, Celeron, Intel Xeon, Intel SpeedStep, Itanium, and Pentium are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries. •  If you reference UNIX® in the text, please mark the first use and include the following; otherwise delete: UNIX is a registered trademark of The Open Group in the United States and other countries. •  If you reference Linux® in your presentation, please mark the first use and include the following; otherwise delete: Linux is a registered trademark of Linus Torvalds in the United States, other countries, or both. Other company, product, or service names may be trademarks or service marks of others. •  If the text/graphics include screenshots, no actual IBM employee names may be used (even your own), if your screenshots include fictitious company names (e.g., Renovations, Zeta Bank, Acme) please update and insert the following; otherwise delete: All references to [insert fictitious company name] refer to a fictitious company and are used for illustration purposes only.