JVM and OS Tuning for accelerating Spark application
Tatsuhiro Chiba
This presentation was used in my talk at Hadoop / Spark Conference Japan 2016.
JVM and OS Tuning for accelerating Spark application
1.
© 2015 IBM Corporation
Optimizing Spark Applications with JVM- and OS-Level Tuning
Feb. 8, 2016
Tatsuhiro Chiba (chiba@jp.ibm.com)
IBM Research - Tokyo
2.
© 2016 IBM Corporation
Hadoop / Spark Conference Japan 2016
Performance Innovation Laboratory, IBM Research - Tokyo

Who am I?
Tatsuhiro Chiba (千葉 立寛)
Staff Researcher at IBM Research - Tokyo
Research interests
– Parallel distributed systems and middleware
– Parallel distributed programming languages
– High performance computing
Twitter: @tatsuhiro
Today's contents appear in
– Appendix D of "Sparkによる実践データ解析" (O'Reilly Japan)
– "Workload Characterization and Optimization of TPC-H Queries on Apache Spark", IBM Research Report
3.
Summary – after applying JVM and OS tuning
Machine spec: CPU: POWER8 3.3 GHz (2 sockets x 12 cores), Memory: 1 TB, Disk: 1 TB
OS: Ubuntu 14.10 (kernel 3.16.0-31-generic)
Optimized JVM options: -Xmx24g -Xms24g -Xmn12g -Xgcthreads12 -Xtrace:none -Xnoloa -XlockReservation -Xgcthreads6 -Xnocompactgc -Xdisableexplicitgc -XX:-RuntimeInstrumentation -Xlp
Executor JVMs: 4
OS settings: NUMA-aware affinity = enabled, large pages = enabled
Spark version: 1.4.1
JVM version: java version "1.8.0" (IBM J9 VM, build pxl6480sr2-20151023_01 (SR2))
[Figure: execution time (sec.) of Q1, Q3, Q5, Q9, and Kmeans, original vs. optimized, with speedup (%)]
4.
Benchmark 1 – Kmeans
Kmeans application
– Varied the number of clusters 'K' for the same dataset
– The first Kmeans job takes much longer because it loads the data into memory
Synthetic data generator program
– Used BigDataBench, published at http://prof.ict.ac.cn/
– Generated a 6 GB dataset that includes over 65M data points

Kmeans

    import org.apache.spark.mllib.clustering.KMeans
    import org.apache.spark.mllib.linalg.Vectors

    // input data is cached (sc is the SparkContext, e.g. from spark-shell)
    val data = sc.textFile("file:///tmp/kmeans-data", 2)
    val parsedData = data.map(s => Vectors.dense(
      s.split(' ').map(_.toDouble))).persist()

    // run Kmeans with varying # of clusters, tracking (lowest cost, its k)
    var bestK = (Double.MaxValue, 1)
    for (k <- 2 to 11) {
      val clusters = new KMeans()
        .setK(k).setMaxIterations(5)
        .setRuns(1).setInitializationMode("random")
        .setEpsilon(1e-30).run(parsedData)
      // evaluate
      val error = clusters.computeCost(parsedData)
      if (bestK._1 > error) {
        bestK = (error, k)
      }
    }
5.
Benchmark 2 – TPC-H
TPC-H benchmark on Spark SQL
– TPC-H is often used to benchmark SQL-on-Hadoop systems
– Spark SQL can run HiveQL directly through hiveserver2 (thrift server) and beeline (JDBC client)
– We modified the TPC-H queries published at https://github.com/rxin/TPC-H-Hive
Table data generator
– Used the DBGEN program to generate a 100 GB dataset (scale factor = 100)
– Loaded the data into Hive tables in Parquet format with Snappy compression

TPC-H Q1 (Hive)

    select
      l_returnflag,
      l_linestatus,
      sum(l_quantity) as sum_qty,
      sum(l_extendedprice) as sum_base_price,
      sum(l_extendedprice*(1-l_discount)) as sum_disc_price,
      sum(l_extendedprice*(1-l_discount)*(1+l_tax)) as sum_charge,
      avg(l_quantity) as avg_qty,
      avg(l_extendedprice) as avg_price,
      avg(l_discount) as avg_disc,
      count(*) as count_order
    from lineitem
    where l_shipdate <= '1998-09-01'
    group by l_returnflag, l_linestatus
    order by l_returnflag, l_linestatus;
6.
Machine & Software Spec and Spark Settings

Processor | # Cores | SMT | Memory | OS
POWER8 3.30 GHz x 2 | 24 cores (2 sockets x 12 cores) | 8 (total 192 hardware threads) | 1 TB | Ubuntu 14.10 (kernel 3.16.0-31)
Xeon E5-2699 v3 2.30 GHz | 36 cores (2 sockets x 18 cores) | 2 (total 72 hardware threads) | 755 GB | Ubuntu 15.04 (kernel 3.19.0-26)

Software | Version
Spark | 1.4.1, 1.5.2, 1.6.0
Hadoop (HDFS) | 2.6.0
Java | 1.8.0 (IBM J9 VM SR2)
Scala | 2.10.4

Default Spark settings
– # of Executor JVMs: 1
– # of worker threads: 48
– Total heap size: 192 GB (nursery = 48g, tenure = 144g)
7.
JVM Tuning – Heap Space Sizing
Garbage collection tuning points
– GC algorithms
– GC threads
– Heap sizing
Heap sizing is the simplest way to reduce GC overhead
– A bigger young space helps to achieve over 30% improvement
But a small old space may cause many global GCs
– Cached RDDs stay in the Java heap

[Figure: execution time (sec.) for Kmeans and TPC-H Q9 with -Xmn48g, -Xmn96g, and -Xmn144g]

Young space (-Xmn) | Execution time (sec) | GC ratio (%) | Minor GC avg. pause time | Minor GC (count) | Major GC (count)
48g (default) | 400 s | 20 % | 2.1 s | 39 | 1
96g | 306 s | 18 % | 3.4 s | 22 | 1
144g | 300 s | 14 % | 3.6 s | 14 | 0
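The relative gains in the heap-sizing table can be checked with a few lines of arithmetic. The following is a minimal sketch in Python and is not part of the original slides. The function and variable names are illustrative, and the GC-time estimate assumes that "GC ratio" means the share of execution time spent in GC.

```python
# Measurements copied from the heap-sizing table:
# (young space, execution time in seconds, GC ratio in percent)
MEASUREMENTS = [
    ("48g", 400, 20),
    ("96g", 306, 18),
    ("144g", 300, 14),
]


def reduction_percent(baseline_sec, tuned_sec):
    """Return how much shorter the tuned run is, as a percentage of the baseline."""
    return (baseline_sec - tuned_sec) / baseline_sec * 100.0


def summarize(measurements):
    """Compare every row against the first (default) row."""
    baseline_sec = measurements[0][1]
    summary = {}
    for young, exec_sec, gc_ratio in measurements:
        summary[young] = {
            "reduction_percent": round(reduction_percent(baseline_sec, exec_sec), 1),
            # Assumption: GC ratio is the share of execution time spent in GC.
            "estimated_gc_sec": round(exec_sec * gc_ratio / 100.0, 1),
        }
    return summary


if __name__ == "__main__":
    for young, stats in summarize(MEASUREMENTS).items():
        print(young, stats)
```

For the rows in this table, the 144g setting comes out 25% faster than the 48g default.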
8.
JVM Tuning – Other Options
JVM option tuning points
– Monitor thread tuning
– GC tuning
– Java thread tuning
– JIT tuning, etc.
Result
– Proper JVM options improve application performance by over 20%

[Figure: execution time (sec.) of Q1 and Q5 for option 0 to option 4, with speedup (%) for Q1 and Q5]

# | JVM options
Option 0 (baseline) | -Xmn96g -Xdump:heap:none -Xdump:system:none -XX:+RuntimeInstrumentation -agentpath:/path/to/libjvmti_oprofile.so -verbose:gc -Xverbosegclog:/tmp/gc.log -Xjit:verbose={compileStart,compileEnd},vlog=/tmp/jit.log
Option 1 (Monitor) | Option 0 + "-Xtrace:none"
Option 2 (GC) | Option 1 + "-Xgcthreads48 -Xnoloa -Xnocompactgc -Xdisableexplicitgc"
Option 3 (Thread) | Option 2 + "-XlockReservation"
Option 4 (JIT) | Option 3 + "-XX:-RuntimeInstrumentation"
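Because each option set builds on the previous one, the full flag list for any step can be derived mechanically. The sketch below, in Python, is an illustration rather than the tooling used for the talk. It assumes the flags are passed to executors through Spark's `spark.executor.extraJavaOptions` property, which the slides do not state. It also assumes that a later `-XX:-Name` toggle should replace an earlier `-XX:+Name`. The helper names are hypothetical.

```python
# JVM options copied from the option table (IBM J9 syntax).
BASELINE = [
    "-Xmn96g",
    "-Xdump:heap:none",
    "-Xdump:system:none",
    "-XX:+RuntimeInstrumentation",
    "-agentpath:/path/to/libjvmti_oprofile.so",
    "-verbose:gc",
    "-Xverbosegclog:/tmp/gc.log",
    "-Xjit:verbose={compileStart,compileEnd},vlog=/tmp/jit.log",
]

# Each step adds options on top of the previous one.
STEPS = [
    ("Option 1 (Monitor)", ["-Xtrace:none"]),
    ("Option 2 (GC)", ["-Xgcthreads48", "-Xnoloa", "-Xnocompactgc", "-Xdisableexplicitgc"]),
    ("Option 3 (Thread)", ["-XlockReservation"]),
    ("Option 4 (JIT)", ["-XX:-RuntimeInstrumentation"]),
]


def toggle_name(option):
    """Return the flag name for -XX:+Name / -XX:-Name options, else None."""
    if option.startswith("-XX:+") or option.startswith("-XX:-"):
        return option[5:]
    return None


def add_options(current, added):
    """Append options, dropping an earlier -XX toggle of the same flag (assumption)."""
    result = list(current)
    for option in added:
        name = toggle_name(option)
        if name is not None:
            result = [o for o in result if toggle_name(o) != name]
        result.append(option)
    return result


def build_option_sets():
    """Return {label: option list} for the baseline and every cumulative step."""
    sets = {"Option 0 (baseline)": list(BASELINE)}
    current = list(BASELINE)
    for label, added in STEPS:
        current = add_options(current, added)
        sets[label] = current
    return sets


if __name__ == "__main__":
    for label, options in build_option_sets().items():
        print(label)
        # Assumption: flags reach the executors via spark.executor.extraJavaOptions.
        print('  --conf "spark.executor.extraJavaOptions=' + " ".join(options) + '"')
```

Keeping the steps as data makes it straightforward to rerun a benchmark with exactly one group of flags added at a time, which is how the table is organized.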
9.
JVM Tuning – JVM Counts
Experiment
– Kept the total # of worker threads and the total heap size constant
– Changed the # of Executor JVMs
– 1 JVM: 48 worker threads & 192 GB heap
– 2 JVMs: 24 worker threads & 96 GB heap each
– 4 JVMs: 12 worker threads & 48 GB heap each
Result
– Using a single big Executor JVM is not always best
– Dividing it into smaller JVMs
  • Helps to reduce GC overhead
  • Helps to reduce resource contention
Kmeans case
– The performance gap comes from the first Kmeans job, especially from data loading
– After the RDD is loaded in memory, computation performance is similar

[Figure: execution time (sec.) for Q1, Q3, Q5, Q9, and Kmeans with 1, 2, and 4 JVMs, and improvement (%) for 2 JVMs and 4 JVMs]
[Figure: execution time (sec.) of each Kmeans clustering job iteration (K = 2, 3, ..., 11) with 1, 2, and 4 JVMs]
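The three configurations in the experiment follow from dividing fixed totals evenly across the Executor JVMs. A minimal Python sketch of that arithmetic follows; the function name and the even-split check are illustrative assumptions, not something taken from the slides.

```python
TOTAL_WORKER_THREADS = 48  # total worker threads, kept constant in the experiment
TOTAL_HEAP_GB = 192        # total heap size in GB, kept constant in the experiment


def split_resources(num_jvms,
                    total_threads=TOTAL_WORKER_THREADS,
                    total_heap_gb=TOTAL_HEAP_GB):
    """Divide worker threads and heap evenly across num_jvms Executor JVMs."""
    if num_jvms <= 0:
        raise ValueError("num_jvms must be positive")
    if total_threads % num_jvms or total_heap_gb % num_jvms:
        raise ValueError("totals must divide evenly across the JVMs")
    return {
        "jvms": num_jvms,
        "threads_per_jvm": total_threads // num_jvms,
        "heap_gb_per_jvm": total_heap_gb // num_jvms,
    }


if __name__ == "__main__":
    for n in (1, 2, 4):
        print(split_resources(n))
```

Calling `split_resources(4)` yields 12 threads and 48 GB per JVM, matching the 4-JVM configuration listed in the experiment.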
10.
OS Tuning – NUMA aware process affinity
Setting NUMA-aware process affinity for each Executor JVM helps to speed up execution
– By reducing scheduling overhead
– By reducing cache misses and stall cycles
Result
– Achieved 3-14% improvement in all benchmarks without any adverse effects

[Diagram: two sockets (Socket 0, Socket 1) with four NUMA nodes (NUMA0 to NUMA3), each with its own DRAM; Spark Executor JVMs 0 to 3 run 12 threads each, one JVM per NUMA node]

    numactl -c [0-7],[8-15],[16-23],[24-31],[32-39],[40-47]

[Figure: execution time (sec.) of Q1, Q5, Q9, and Kmeans with NUMA off vs. NUMA on, with speedup (%)]
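The CPU list in the `numactl` example covers six cores of eight hardware threads each, which is one quarter of the POWER8 machine. The Python sketch below generalizes that to all four Executor JVMs. It assumes hardware threads are numbered contiguously core by core, and it uses numactl's `--physcpubind` option to pin by CPU number. The slides do not confirm either point, and memory binding is left out because the NUMA node numbering is not given.

```python
CORES = 24    # 2 sockets x 12 cores on the POWER8 machine
SMT = 8       # hardware threads per core
NUM_JVMS = 4  # one Executor JVM per NUMA node


def cpu_range_for_jvm(jvm_index, num_jvms=NUM_JVMS, cores=CORES, smt=SMT):
    """Return (first, last) hardware-thread IDs for one JVM.

    Assumption: hardware threads are numbered contiguously, core by core.
    """
    if not 0 <= jvm_index < num_jvms:
        raise ValueError("jvm_index out of range")
    threads_per_jvm = (cores // num_jvms) * smt
    first = jvm_index * threads_per_jvm
    return first, first + threads_per_jvm - 1


def numactl_prefix(jvm_index):
    """Build a numactl prefix that pins a process to that JVM's CPU range."""
    first, last = cpu_range_for_jvm(jvm_index)
    return "numactl --physcpubind={}-{}".format(first, last)


if __name__ == "__main__":
    for i in range(NUM_JVMS):
        print("JVM", i, "->", numactl_prefix(i))
```

Under these assumptions JVM 0 is pinned to CPUs 0-47, the same set as the bracketed ranges in the slide's example.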
11.
OS Tuning – Large Page
How to use large pages
– Reserve large pages on Linux by changing a kernel parameter
– Append "-Xlp" to the Executor JVM options
Result
– Achieved 3-5% improvement

[Figure: Kmeans execution time (sec.) with PageSize=64K and PageSize=16M, for NUMA off and NUMA on]
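Reserving large pages requires knowing how many pages the Executor heaps need. The Python sketch below estimates that number for 16 MB pages. It assumes the kernel parameter in question is the standard Linux `vm.nr_hugepages` sysctl, which the slides do not name. It also assumes only the Java heap is backed by large pages, and it uses the 24 GB heap and 4 Executor JVMs from the summary settings. A real deployment would likely need some headroom beyond this figure.

```python
import math


def large_pages_needed(heap_gb, page_mb=16):
    """Return the number of large pages needed to back a heap of heap_gb GB."""
    return math.ceil(heap_gb * 1024 / page_mb)


def total_large_pages(heap_gb_per_jvm, num_jvms, page_mb=16):
    """Return the page count for all Executor JVMs on one machine."""
    return large_pages_needed(heap_gb_per_jvm, page_mb) * num_jvms


if __name__ == "__main__":
    pages = total_large_pages(heap_gb_per_jvm=24, num_jvms=4)
    # Assumption: vm.nr_hugepages is the kernel parameter used for the reservation.
    print("sysctl -w vm.nr_hugepages={}".format(pages))
```

With these inputs each 24 GB heap needs 1536 pages of 16 MB, so four JVMs need 6144 pages in total.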
12.
Comparison of Default and Optimized w/ 1.4.1, 1.5.2, and 1.6.0
– Newer versions generally perform better
– JVM & OS tuning still helps to improve Spark performance
– Tungsten & other new features (e.g. Unified Memory Management) can reduce GC overhead drastically

[Figure: execution time (sec.) of Q1, Q3, and Q5 on Spark 1.4.1, 1.5.2, and 1.6.0, default vs. optimized]
[Figure: execution time (sec.) of Q9, Q19, and Q21 on Spark 1.4.1, 1.5.2, and 1.6.0, default vs. optimized; two off-scale bars are labeled 711 and 632]