JVM and OS Tuning for accelerating Spark application

© 2015 IBM Corporation
Optimizing Spark Applications through JVM- and OS-Level Tuning
Feb. 8, 2016
Tatsuhiro Chiba (chiba@jp.ibm.com)
IBM Research - Tokyo

© 2016 IBM Corporation
Hadoop / Spark Conference Japan 2016
Performance Innovation Laboratory, IBM Research - Tokyo
Who am I?
 Tatsuhiro Chiba (千葉立寛)
 Staff Researcher at IBM Research – Tokyo
 Research Interests
– Parallel Distributed System and Middleware
– Parallel Distributed Programming Language
– High Performance Computing
 Twitter: @tatsuhiro
 Today’s content also appears in:
– Appendix D of “Sparkによる実践データ解析”, O’Reilly Japan
– “Workload Characterization and Optimization of TPC-H Queries on Apache Spark”, IBM Research Reports

Summary – after applying JVM and OS tuning
Machine Spec: CPU: POWER8 3.3GHz (2 sockets x 12 cores), Memory: 1TB, Disk: 1TB
OS: Ubuntu 14.10 (Kernel: 3.16.0-31-generic)
Optimized JVM Options: -Xmx24g -Xms24g -Xmn12g -Xgcthreads12 -Xtrace:none -Xnoloa -XlockReservation -Xgcthreads6 -Xnocompactgc -Xdisableexplicitgc -XX:-RuntimeInstrumentation -Xlp
Executor JVMs: 4
OS Settings: NUMA aware affinity = enabled, large page = enabled
Spark Version: 1.4.1
JVM Version: java version "1.8.0" (IBM J9 VM, build pxl6480sr2-20151023_01(SR2))
[Figure: execution time (sec.) of the original and optimized configurations, with speedup (%), for TPC-H Q1, Q3, Q5, Q9 and kmeans]
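The slide lists the optimized flags but not how they were handed to the executors. One possible way, sketched below, is to map them onto Spark properties. The property names `spark.executor.extraJavaOptions` and `spark.executor.memory` come from Spark's configuration, not from the slides, and the object and method names are hypothetical. Spark documents that the maximum heap size must not be set through the extra-options property, so the sketch routes `-Xmx` to `spark.executor.memory` and, to stay on the safe side, also drops `-Xms`.

```scala
// Sketch only: maps the slide's flag list onto Spark properties.
// ExecutorOptions and toSparkProperties are illustrative names.
object ExecutorOptions {
  // Flags copied verbatim from the "Optimized JVM Options" line
  // (the slide lists -Xgcthreads twice; both are kept as-is).
  val optimizedFlags: Seq[String] = Seq(
    "-Xmx24g", "-Xms24g", "-Xmn12g", "-Xgcthreads12", "-Xtrace:none", "-Xnoloa",
    "-XlockReservation", "-Xgcthreads6", "-Xnocompactgc", "-Xdisableexplicitgc",
    "-XX:-RuntimeInstrumentation", "-Xlp")

  def toSparkProperties(flags: Seq[String]): Map[String, String] = {
    // Separate the -Xmx / -Xms heap flags from everything else.
    val (heapFlags, others) =
      flags.partition(f => f.startsWith("-Xmx") || f.startsWith("-Xms"))
    // The last -Xmx value, if any, becomes the executor memory setting.
    val memory = heapFlags.filter(_.startsWith("-Xmx")).lastOption
      .map(f => "spark.executor.memory" -> f.stripPrefix("-Xmx"))
    Map("spark.executor.extraJavaOptions" -> others.mkString(" ")) ++ memory.toList
  }

  def main(args: Array[String]): Unit =
    toSparkProperties(optimizedFlags).foreach { case (key, value) =>
      println(s"--conf $key='$value'")
    }
}
```

The printed `--conf` pairs are meant to be reviewed by hand before use; the executor count (4 in the slide) is configured separately and is not covered here.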

Benchmark 1 – Kmeans
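import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// load the input data, parse each line into a dense vector, and cache it
val data = sc.textFile("file:///tmp/kmeans-data", 2)
val parsedData = data.map(s => Vectors.dense(
  s.split(' ').map(_.toDouble))).persist()

// run Kmeans with varying # of clusters, tracking the best (error, k) pair
var bestK = (Double.MaxValue, 1)
for (k <- 2 to 11) {
  val clusters = new KMeans()
    .setK(k).setMaxIterations(5)
    .setRuns(1).setInitializationMode("random")
    .setEpsilon(1e-30).run(parsedData)
  // evaluate: keep k if its clustering cost is the lowest so far
  val error = clusters.computeCost(parsedData)
  if (bestK._1 > error) {
    bestK = (error, k)
  }
}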
Kmeans
 Kmeans application
– Varied the number of clusters ‘K’ over the same dataset
– The first Kmeans job takes much longer because it loads the data into memory
 Synthetic data generator program
– Used BigDataBench published at http://prof.ict.ac.cn/
– Generated 6GB dataset which includes over 65M data points

Benchmark 2 - TPC-H
 TPC-H Benchmark on Spark SQL
– TPC-H is often used to benchmark SQL-on-Hadoop systems
– Spark SQL can run HiveQL directly through hiveserver2 (thrift server) and beeline (JDBC client)
– We modified the TPC-H queries published at https://github.com/rxin/TPC-H-Hive
 Table data generator
– Used DBGEN program and generated 100GB dataset (scale factor = 100)
– Loaded data into Hive tables with Parquet format and Snappy compression
select l_returnflag, l_linestatus, sum(l_quantity) as sum_qty,
sum(l_extendedprice) as sum_base_price,
sum(l_extendedprice*(1-l_discount)) as sum_disc_price,
sum(l_extendedprice*(1-l_discount)*(1+l_tax)) as sum_charge,
avg(l_quantity) as avg_qty, avg(l_extendedprice) as avg_price,
avg(l_discount) as avg_disc, count(*) as count_order
from lineitem
where l_shipdate <= '1998-09-01'
group by l_returnflag, l_linestatus
order by l_returnflag, l_linestatus;
TPC-H Q1 (Hive)

Machine & Software Spec and Spark Settings
Processor | # Cores | SMT | Memory | OS
POWER8 3.30 GHz * 2 | 24 cores (2 sockets * 12 cores) | 8 (total 192 hardware threads) | 1TB | Ubuntu 14.10 (kernel 3.16.0-31)
Xeon E5-2699 v3 2.30 GHz | 36 cores (2 sockets x 18 cores) | 2 (total 72 hardware threads) | 755GB | Ubuntu 15.04 (kernel 3.19.0-26)

Software | Version
Spark | 1.4.1, 1.5.2, 1.6.0
Hadoop (HDFS) | 2.6.0
Java | 1.8.0 (IBM J9 VM SR2)
Scala | 2.10.4
 Default Spark Settings
– # of Executor JVMs: 1
– # of worker threads: 48
– Total Heap size: 192GB (nursery = 48g, tenure = 144g)

JVM Tuning – Heap Space Sizing
 Garbage Collection tuning points
– GC algorithms
– GC threads
– Heap sizing
 Heap sizing is the simplest way to reduce GC overhead
– A bigger young space helps to achieve over 30% improvement
 However, a small old space may cause many global GCs
– Cached RDDs stay in the Java heap
[Figure: execution time (sec.) with -Xmn48g, -Xmn96g and -Xmn144g for Kmeans and TPC-H Q9]
Young Space (-Xmn) | Execution Time (sec) | GC ratio (%) | Minor GC avg. pause time | # Minor GC | # Major GC
48g (default) | 400 s | 20 % | 2.1 s | 39 | 1
96g | 306 s | 18 % | 3.4 s | 22 | 1
144g | 300 s | 14 % | 3.6 s | 14 | 0
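The trade-off in the table can be made concrete with a little arithmetic. The sketch below assumes the 192GB total heap from the default settings, so the old (tenure) space is whatever the young space leaves over, and it derives the time spent in GC from the reported GC ratio. `HeapSizing` and `Run` are illustrative names, and the three rows are the table values, not new measurements.

```scala
// Sketch only: derives tenure size, GC seconds and improvement from the table.
object HeapSizing {
  val totalHeapGb = 192 // assumption: total heap from the default Spark settings

  case class Run(youngGb: Int, execSec: Double, gcRatioPct: Double) {
    // Old space left after carving out the young space.
    def tenureGb: Int = totalHeapGb - youngGb
    // Seconds spent in GC, from the reported GC ratio.
    def gcSec: Double = execSec * gcRatioPct / 100.0
  }

  // Rows of the table: -Xmn size, execution time, GC ratio.
  val runs: Seq[Run] = Seq(Run(48, 400, 20), Run(96, 306, 18), Run(144, 300, 14))

  // Reduction in execution time relative to a baseline run, in percent.
  def improvementPct(baseline: Run, run: Run): Double =
    (baseline.execSec - run.execSec) / baseline.execSec * 100.0

  def main(args: Array[String]): Unit =
    runs.foreach { r =>
      val gain = improvementPct(runs.head, r)
      println(f"-Xmn${r.youngGb}g: tenure=${r.tenureGb}g, GC=${r.gcSec}%.1f s, gain=$gain%.1f%%")
    }
}
```

Growing the young space from 48g to 144g shrinks the tenure space from 144g to 48g under this assumption, which is why the slide warns that cached RDDs can trigger global GCs when the old space gets too small.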

JVM Tuning – Other Options
 JVM option tuning points
– Monitor threads tuning
– GC tuning
– Java thread tuning
– JIT tuning, etc.
 Result
– Proper JVM options help to improve application performance by over 20%
[Figure: execution time (sec.) of Q1 and Q5, with speedup (%) for each, for option 0 through option 4]
# | JVM Options
Option 0 (baseline) | -Xmn96g -Xdump:heap:none -Xdump:system:none -XX:+RuntimeInstrumentation -agentpath:/path/to/libjvmti_oprofile.so -verbose:gc -Xverbosegclog:/tmp/gc.log -Xjit:verbose={compileStart,compileEnd},vlog=/tmp/jit.log
Option 1 (Monitor) | Option 0 + "-Xtrace:none"
Option 2 (GC) | Option 1 + "-Xgcthreads48 -Xnoloa -Xnocompactgc -Xdisableexplicitgc"
Option 3 (Thread) | Option 2 + "-XlockReservation"
Option 4 (JIT) | Option 3 + "-XX:-RuntimeInstrumentation"
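Because each option level is defined as the previous level plus a few flags, the five configurations can be generated rather than written out by hand. The following is a minimal sketch of that accumulation; `JvmOptionLevels` and its fields are illustrative names, and the flags are copied from the table. Note that Option 4 appends `-XX:-RuntimeInstrumentation` after the baseline's `-XX:+RuntimeInstrumentation`; the sketch assumes the later flag takes effect.

```scala
// Sketch only: builds Option 0..4 cumulatively from the table above.
object JvmOptionLevels {
  // Option 0 (baseline).
  val baseline: Seq[String] = Seq(
    "-Xmn96g", "-Xdump:heap:none", "-Xdump:system:none",
    "-XX:+RuntimeInstrumentation", "-agentpath:/path/to/libjvmti_oprofile.so",
    "-verbose:gc", "-Xverbosegclog:/tmp/gc.log",
    "-Xjit:verbose={compileStart,compileEnd},vlog=/tmp/jit.log")

  // Flags added at each subsequent level, keyed by the tuning area.
  val increments: Seq[(String, Seq[String])] = Seq(
    "Monitor" -> Seq("-Xtrace:none"),
    "GC"      -> Seq("-Xgcthreads48", "-Xnoloa", "-Xnocompactgc", "-Xdisableexplicitgc"),
    "Thread"  -> Seq("-XlockReservation"),
    "JIT"     -> Seq("-XX:-RuntimeInstrumentation"))

  // levels(i) holds the complete flag list for Option i.
  val levels: Seq[Seq[String]] =
    increments.scanLeft(baseline) { case (acc, (_, extra)) => acc ++ extra }

  def main(args: Array[String]): Unit =
    levels.zipWithIndex.foreach { case (flags, i) =>
      println(s"Option $i: ${flags.mkString(" ")}")
    }
}
```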

JVM Tuning – JVM Counts
 Experiment
– Kept the total # of worker threads and the total heap size fixed
– Changed the # of Executor JVMs
– 1 JVM: 48 worker threads & 192GB heap
– 2 JVMs: 24 worker threads & 96GB heap each
– 4 JVMs: 12 worker threads & 48GB heap each
 Result
– Using a single big Executor JVM is not always best
– Dividing it into smaller JVMs
• helps to reduce GC overhead
• helps to reduce resource contention
 Kmeans case
– The performance gap comes from the first Kmeans job, especially from data loading
– Once the RDD is loaded into memory, computation performance is similar
[Figure: execution time (sec.) with 1, 2 and 4 JVMs, and improvement (%) of 2 JVMs and 4 JVMs, for Q1, Q3, Q5, Q9 and Kmeans]
[Figure: execution time (sec.) of each Kmeans clustering job iteration (K = 2, 3, .. 11) with 1, 2 and 4 JVMs]
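The experiment keeps the machine-wide totals fixed and only changes how they are divided. A minimal sketch of that division, using the 48 worker threads and 192GB heap from the slide, is shown below; `ExecutorSplit`, `Layout` and `split` are illustrative names.

```scala
// Sketch only: divides fixed totals evenly across N executor JVMs.
object ExecutorSplit {
  case class Layout(jvms: Int, threadsPerJvm: Int, heapGbPerJvm: Int)

  def split(totalThreads: Int, totalHeapGb: Int, jvms: Int): Layout = {
    // Only even splits are considered, as in the experiment.
    require(jvms > 0 && totalThreads % jvms == 0 && totalHeapGb % jvms == 0,
      "totals must divide evenly across the JVMs")
    Layout(jvms, totalThreads / jvms, totalHeapGb / jvms)
  }

  def main(args: Array[String]): Unit =
    Seq(1, 2, 4).foreach { n =>
      val l = split(totalThreads = 48, totalHeapGb = 192, jvms = n)
      println(s"${l.jvms} JVM(s): ${l.threadsPerJvm} worker threads & ${l.heapGbPerJvm}GB heap each")
    }
}
```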

OS Tuning – NUMA aware process affinity
 Setting NUMA aware process affinity for each Executor JVM helps to speed up execution
– By reducing scheduling overhead
– By reducing cache misses and stall cycles
 Result
– Achieved a 3-14% improvement in all benchmarks without any adverse effects
[Figure: four Spark Executor JVMs (JVM 0 to JVM 3, 12 threads each) placed on NUMA nodes NUMA0 to NUMA3 across Socket 0 and Socket 1, each node with its own DRAM]
numactl -c [0-7],[8-15],[16-23],[24-31],[32-39],[40-47]
[Figure: execution time (sec.) with NUMA off and NUMA on, with speedup (%), for Q1, Q5, Q9 and Kmeans]
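One way to picture the one-JVM-per-NUMA-node layout is to generate a `numactl` prefix for each executor launch command. The sketch below is an assumption-laden illustration, not the deck's setup: it uses `numactl`'s `--cpunodebind` and `--membind` options (the slide only shows a CPU binding), takes the node ids as input because node numbering differs between machines, and uses a placeholder launch command. `NumaPinning` and `pinnedCommands` are hypothetical names, and wiring the result into Spark's executor launch is left out.

```scala
// Sketch only: prefixes one launch command per NUMA node with numactl.
object NumaPinning {
  // Binds both CPU scheduling and memory allocation to the same node.
  // Memory binding is an assumption; the slide shows only a CPU binding.
  def pinnedCommands(numaNodes: Seq[Int], executorCommand: String): Seq[String] =
    numaNodes.map { node =>
      s"numactl --cpunodebind=$node --membind=$node $executorCommand"
    }

  def main(args: Array[String]): Unit =
    // Node ids 0..3 and the command are placeholders; check `numactl --hardware`.
    pinnedCommands(Seq(0, 1, 2, 3), "java -jar executor.jar").foreach(println)
}
```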

OS Tuning – Large Page
 How to use large pages
– Reserve large pages on Linux by changing a kernel parameter
– Append “-Xlp” to the Executor JVM options
 Result
– Achieved a 3-5% improvement
[Figure: Kmeans execution time (sec.) with PageSize=64K and PageSize=16M, with NUMA off and NUMA on]
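The slide does not name the kernel parameter or say how many pages to reserve. As a rough sketch, assuming the Linux `vm.nr_hugepages` sysctl is the parameter in question, the 16M page size from the chart, and the 4 executor JVMs with 24g heaps from the summary slide, the page count for the Java heaps alone works out as follows. `LargePages` and `pagesNeeded` are illustrative names, and a real reservation would likely need headroom beyond the heap.

```scala
// Sketch only: estimates how many large pages the executor heaps need.
object LargePages {
  // Rounds up so a partially used page is still reserved.
  def pagesNeeded(jvms: Int, heapGbPerJvm: Int, pageSizeMb: Int): Long = {
    val totalMb = jvms.toLong * heapGbPerJvm * 1024L
    (totalMb + pageSizeMb - 1) / pageSizeMb
  }

  def main(args: Array[String]): Unit = {
    val pages = pagesNeeded(jvms = 4, heapGbPerJvm = 24, pageSizeMb = 16)
    // Line in sysctl.conf style; vm.nr_hugepages is an assumed parameter name.
    println(s"vm.nr_hugepages=$pages")
  }
}
```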

Comparison of Default and Optimized w/ 1.4.1, 1.5.2, and 1.6.0
 Newer versions generally achieved good performance
 JVM & OS tuning still helps to improve Spark performance
 Tungsten & other new features (e.g. Unified Memory Management) can reduce GC overhead drastically
[Figure: execution time (sec.) of default vs. optimized settings on Spark 1.4.1, 1.5.2 and 1.6.0 for Q1, Q3 and Q5]
[Figure: execution time (sec.) of default vs. optimized settings on Spark 1.4.1, 1.5.2 and 1.6.0 for Q9, Q19 and Q21; two off-scale bars are labeled 711 and 632]
