Hive, Presto, and Spark on TPC-DS benchmark
Dongwon Kim, PhD
SK Telecom
Contents
• Experimental setup
• Experimental results
[Experimental setup] TPC-DS dataset and query
• Hive
  • Depends entirely on github.com/hortonworks/hive-testbench
    • Distributed data generator
      • A small dataset (100GB)
      • A large dataset (1TB)
    • DDLs
      • External table declaration
      • Partitioned table declaration (ORC)
    • 66 queries provided (out of 99 TPC-DS queries)
• Presto
  • Uses the hive-hadoop2 connector to read the same partitioned tables
  • Uses the same queries
• Spark
  • Connects to the Hive MetaStore to read the same partitioned tables
  • Uses the same queries
[Experimental setup] Cluster setup
• A single master node with 5 slave nodes
  • Two 12-core processors with 48 hyper-threads in total
  • 128GB main memory
  • 10 HDDs
• Hadoop 2.7.3
• Hive 2.1.1 + Tez 0.8.4
  • An LLAP worker on a node uses 192 cores and 80GB
• Presto 0.162
  • A Presto worker on a node uses 192 cores and 80GB
  • distributed-joins-enabled = false
• Spark 2.0.2
  • 4 Spark executors on a node use 192 cores and 80GB
[Experimental setup] Performance monitoring tool
• github.com/eastcirclek/swimlane-graphs
• Hive/Presto/Spark task swimlane graph + Ganglia resource utilization graph
• Used to identify the main cause of performance bottlenecks
[Experimental results] Characteristics of each engine
• Hive
  • Improves significantly with LLAP
  • Good for both small and large workloads
  • Especially good for IO-bound workloads
• Spark
  • Improves CPU performance through Whole Stage Code Generation
  • Especially good for CPU-bound workloads
  • Does not outperform Hive and Presto on IO-bound workloads
• Presto
  • Pipelined execution reduces unnecessary disk IO
  • Good for simple queries
  • Works well only when the data fits into memory
[Experimental results] Query execution time (100GB)
Pairwise comparison: reduction in sum of running times

Pair            with query72                            without query72                        Note
Spark / Hive    Spark > Hive, 26.3 % (1668s → 1229s)    Hive > Spark, 19.8 % (1143s → 916s)    Reversed
Hive / Presto   Hive > Presto, 55.6 % (2797s → 1241s)   Hive > Presto, 50.2 % (982s → 489s)
Spark / Presto  Spark > Presto, 62.0 % (2932s → 1114s)  Spark > Presto, 5.2 % (1116s → 1057s)  Gap reduced significantly
Overall         Spark > Hive >>> Presto                 Hive > Spark >= Presto

Hive with LLAP is good even for small workloads
* When comparing each pair of engines, I count only the queries that are completed by both engines.
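The pairwise metric described in the footnote can be sketched as follows. This is a minimal illustration, assuming the reduction is computed as (slower sum - faster sum) / slower sum, which matches the figures reported above (for example 1668s → 1229s gives 26.3 %). The function name `pairwise_reduction` and the per-query times are hypothetical and are not taken from the benchmark.

```python
def pairwise_reduction(times_a, times_b):
    """Compare two engines on the queries that both of them completed.

    times_a / times_b map query number -> running time in seconds;
    a query that failed on an engine is simply absent from its dict.
    Returns (slower_sum, faster_sum, reduction_percent).
    """
    common = times_a.keys() & times_b.keys()
    if not common:
        raise ValueError("no query was completed by both engines")
    sum_a = sum(times_a[q] for q in common)
    sum_b = sum(times_b[q] for q in common)
    slower, faster = max(sum_a, sum_b), min(sum_a, sum_b)
    return slower, faster, 100.0 * (slower - faster) / slower


# Hypothetical per-query times in seconds (not benchmark data).
engine_a = {3: 10.0, 7: 40.0, 72: 300.0}
engine_b = {3: 8.0, 7: 20.0, 19: 5.0}  # query 72 missing, query 19 only here

slower, faster, pct = pairwise_reduction(engine_a, engine_b)
print(f"{pct:.1f} % ({slower:.0f}s -> {faster:.0f}s)")

# The same formula applied to the sums reported for Spark vs. Hive (100GB).
print(round(100.0 * (1668 - 1229) / 1668, 1))
```

Only queries 3 and 7 enter the sums here, because query 72 and query 19 were each completed by just one of the two hypothetical engines.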
[Experimental results] Query execution time (1TB)
Pairwise comparison: reduction in sum of running times

Pair            with query72                            without query72                         Note
Spark / Hive    Hive > Spark, 28.2 % (6445s → 4625s)    Hive > Spark, 41.3 % (6165s → 3629s)
Hive / Presto   Hive > Presto, 56.4 % (5567s → 2426s)   Hive > Presto, 25.5 % (1460s → 1087s)
Spark / Presto  Spark > Presto, 29.2 % (5685s → 4026s)  Presto > Spark, 58.6 % (3812s → 1578s)  Reversed
Overall         Hive > Spark >>> Presto                 Hive > Presto > Spark

Hive with LLAP is good for large, I/O-bound workloads
Presto works well if the data fits into memory
* When comparing each pair of engines, I count only the queries that are completed by both engines.
[Experimental results] Curse of Query72 on Hive and Presto
[Figure: per-query execution time on the 100GB dataset, query 72 leftmost; panels: Presto vs. Spark, Presto vs. Hive, Spark vs. Hive; axis from 0 to 2000]
[Experimental results] Curse of Query72 on Hive and Presto (1TB dataset)
[Figure: per-query execution time on the 1TB dataset, query 72 leftmost; panels: Presto vs. Hive, Spark vs. Hive, Presto vs. Spark; axis from 0 to 4000]
[Experimental results] Curse of Query72 on Presto and Hive
• Presto and Hive: high CPU utilization for a long time (looks like a "plateau")
• Spark: no plateau observed thanks to WholeStageCodeGen
[Figure: CPU, network, and disk utilization graphs for Presto, Spark, and Hive; Spark shown w/ and w/o WholeStageCodeGen]
[Experimental results] Whole Stage Code Generation
• Without WholeStageCodeGeneration, a "plateau" is observed in Spark as well, as in Presto and Hive
  → The 10th stage takes much longer without WholeStageCodeGeneration
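To make the idea behind whole-stage code generation concrete, here is a small conceptual sketch. It is not Spark's implementation: it only contrasts an operator-at-a-time pipeline, where every row passes through several iterator boundaries, with a single fused loop of the kind that code generation aims to produce for a stage. The `Row` type, the operators, and the data are hypothetical.

```python
from typing import Callable, Iterable, Iterator

Row = tuple[int, int]  # (key, value), a hypothetical two-column row


# Operator-at-a-time (iterator) model: each operator pulls rows from
# its child, so every row crosses several generator boundaries.
def scan(rows: Iterable[Row]) -> Iterator[Row]:
    for row in rows:
        yield row


def filter_op(child: Iterator[Row], pred: Callable[[Row], bool]) -> Iterator[Row]:
    for row in child:
        if pred(row):
            yield row


def project_op(child: Iterator[Row], fn: Callable[[Row], int]) -> Iterator[int]:
    for row in child:
        yield fn(row)


def sum_interpreted(rows: Iterable[Row]) -> int:
    plan = project_op(
        filter_op(scan(rows), lambda r: r[0] % 2 == 0),
        lambda r: r[1] * 2,
    )
    return sum(plan)


# Fused model: the same scan -> filter -> project -> sum pipeline
# collapsed into one loop with no per-operator calls per row.
def sum_fused(rows: Iterable[Row]) -> int:
    total = 0
    for key, value in rows:
        if key % 2 == 0:
            total += value * 2
    return total


data = [(i, i * 10) for i in range(6)]
print(sum_interpreted(data), sum_fused(data))
```

Both functions compute the same result; the fused version does less per-row bookkeeping, which is the kind of CPU saving the slides attribute to WholeStageCodeGen. In Spark the feature is controlled by the `spark.sql.codegen.wholeStage` setting.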
[Experimental results] Whole Stage Code Generation
[Figure: per-query execution time of Spark w/ WSCG vs. w/o WSCG on the 100GB dataset; log-scale axis from 1 to 1000]
[Experimental results] Performance of Hive w/ and w/o LLAP
[Figure: per-query execution time of Hive, llap vs. container; panels: 100GB dataset, 1TB dataset; log-scale axis from 1 to 10000]
• LLAP shows improvement over container-based Tez for most queries
All queries : 78.3% reduction
All except 72 : 36.1% reduction
All queries : 44.9% reduction
All except 72 : 27.9% reduction
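The gap between the "All queries" and "All except 72" figures shows how strongly one long-running query can dominate a summed metric. The sketch below illustrates that effect, assuming the reduction is computed over summed running times. The function name `total_reduction` and all timings are hypothetical and are not the measured values.

```python
def total_reduction(before, after, exclude=()):
    """Percent reduction in summed running time over the shared queries.

    before / after map query number -> running time in seconds;
    queries listed in `exclude` are left out of both sums.
    """
    queries = (before.keys() & after.keys()) - set(exclude)
    total_before = sum(before[q] for q in queries)
    total_after = sum(after[q] for q in queries)
    return 100.0 * (total_before - total_after) / total_before


# Hypothetical timings in seconds: query 72 dominates the container-mode total.
container = {3: 20.0, 7: 30.0, 72: 950.0}
llap = {3: 15.0, 7: 20.0, 72: 165.0}

all_queries = total_reduction(container, llap)
without_72 = total_reduction(container, llap, exclude=[72])
print(f"All queries   : {all_queries:.1f}% reduction")
print(f"All except 72 : {without_72:.1f}% reduction")
```

Reporting both numbers, as the slide does, separates the improvement on the outlier from the improvement on the rest of the workload.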
[Experimental results][Query 75] without and with LLAP
[Figure: query 75 without LLAP vs. with LLAP]
[Experimental results][Query 93] without and with LLAP
[Figure: query 93 without LLAP vs. with LLAP]
[Experimental results][Query 94] without and with LLAP
[Figure: query 94 without LLAP vs. with LLAP]
[Experimental results][Query 93] Different patterns of resource utilization
Network → CPU
Presto Hive
CPU → Network
Spark
The end

[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 

Hive, Presto & Spark on TPC-DS: A Comparison of Query Performance

  • 1. Hive, Presto, and Spark on TPC-DS benchmark
      • Dongwon Kim, PhD, SK Telecom
  • 2. Contents
      • Experimental setup
      • Experimental results
  • 3. [Experimental setup] TPC-DS dataset and query
      • Hive
          • Depends entirely on github.com/hortonworks/hive-testbench
          • Distributed data generator
              • A small dataset (100GB)
              • A large dataset (1TB)
          • DDLs
              • External table declaration
              • Partitioned table declaration (ORC)
          • 66 queries provided (out of 99 TPC-DS queries)
      • Presto
          • Uses the hive-hadoop2 connector to read the same partitioned tables
          • Uses the same queries
      • Spark
          • Connected to the Hive MetaStore to read the same partitioned tables
          • Uses the same queries
  • 4. [Experimental setup] Cluster setup
      • A single master node with 5 slave nodes
          • Two 12-core processors with 48 hyper-threads in total
          • 128GB main memory
          • 10 HDDs
      • Hadoop 2.7.3
      • Hive 2.1.1 + Tez 0.8.4
          • An LLAP worker on a node uses 192 cores and 80GB
      • Presto 0.162
          • A Presto worker on a node uses 192 cores and 80GB
          • distributed-joins-enabled = false
      • Spark 2.0.2
          • 4 Spark executors on a node use 192 cores and 80GB
  • 5. [Experimental setup] Performance monitoring tool
      • github.com/eastcirclek/swimlane-graphs
      • Hive/Presto/Spark task swimlane graph + Ganglia resource utilization graph
      • Used to identify the main cause of performance bottlenecks
  • 6. [Experimental results] Characteristics of each engine
      • Hive
          • Improves significantly with LLAP
          • Good for both small and large workloads
          • Especially good for IO-bound workloads
      • Spark
          • Improves CPU performance through Whole Stage Code Generation
          • Especially good for CPU-bound workloads
          • Does not outperform Hive and Presto for IO-bound workloads
      • Presto
          • Pipelined execution reduces unnecessary disk IO
          • Good for simple queries
          • Works well only when the data fits into memory
  • 7. [Experimental results] Query execution time (100GB)
      • Pairwise comparison with query72 (reduction in sum of running times)
          • Spark > Hive: 26.3% (1668s → 1229s)
          • Hive > Presto: 55.6% (2797s → 1241s)
          • Spark > Presto: 62.0% (2932s → 1114s)
          • Overall: Spark > Hive >>> Presto
      • Pairwise comparison without query72 (reduction in sum of running times)
          • Hive > Spark: 19.8% (1143s → 916s)
          • Hive > Presto: 50.2% (982s → 489s)
          • Spark > Presto: 5.2% (1116s → 1057s)
          • Overall: Hive > Spark >= Presto
      • Annotations on the results without query72: "Reversed", "Gap reduced significantly"
      • Hive with LLAP is good even for small workloads
      • * When comparing each pair of engines, I count only the queries completed by both engines.
  • 8. [Experimental results] Query execution time (1TB)
      • Pairwise comparison with query72 (reduction in sum of running times)
          • Hive > Spark: 28.2% (6445s → 4625s)
          • Hive > Presto: 56.4% (5567s → 2426s)
          • Spark > Presto: 29.2% (5685s → 4026s)
          • Overall: Hive > Spark >>> Presto
      • Pairwise comparison without query72 (reduction in sum of running times)
          • Hive > Spark: 41.3% (6165s → 3629s)
          • Hive > Presto: 25.5% (1460s → 1087s)
          • Presto > Spark: 58.6% (3812s → 1578s)
          • Overall: Hive > Presto > Spark
      • Annotation on the results without query72: "Reversed"
      • Hive with LLAP is good for large, I/O-bound workloads
      • Presto works well if the data fits into memory
      • * When comparing each pair of engines, I count only the queries completed by both engines.
  • 9. [Experimental results] Curse of Query72 on Hive and Presto (100GB dataset)
      • [Figure: three bar charts of per-query running time, y-axis 0 to 2000, x-axis TPC-DS query numbers starting with query 72. Panels: Presto vs. Spark, Presto vs. Hive, Spark vs. Hive.]
  • 10. [Experimental results] Curse of Query72 on Hive and Presto (1TB dataset)
      • [Figure: three bar charts of per-query running time, y-axis 0 to 4000, x-axis TPC-DS query numbers starting with query 72. Panels: Presto vs. Hive, Spark vs. Hive, Presto vs. Spark.]
  • 11. [Experimental results] Curse of Query72 on Presto and Hive
      • [Figure: resource utilization graphs for Presto, Hive, and Spark]
      • Presto and Hive: high CPU utilization for a long time (looks like a "plateau")
      • Spark: no plateau observed, thanks to WholeStageCodeGen
  • 12. [Experimental results] Whole Stage Code Generation
      • [Figure: CPU, network, and disk utilization of Spark with and without WholeStageCodeGen, shown next to Presto and Hive]
      • Without WholeStageCodeGen, a "plateau" is observed, as with Presto and Hive
      • The 10th stage takes much longer without WholeStageCodeGen
  • 13. [Experimental results] Whole Stage Code Generation (100GB dataset)
      • [Figure: per-query running time with WSCG vs. without WSCG, logarithmic y-axis from 1 to 1000, x-axis TPC-DS query numbers]
  • 14. [Experimental results] Performance of Hive w/ and w/o LLAP
      • [Figure: two charts (100GB dataset, 1TB dataset) of per-query running time, LLAP vs. container, logarithmic y-axis from 1 to 10000, x-axis TPC-DS query numbers]
      • LLAP shows improvement over container-based Tez for most queries
      • All queries: 78.3% reduction; all except 72: 36.1% reduction
      • All queries: 44.9% reduction; all except 72: 27.9% reduction
  • 15. [Experimental results][Query 75] without and with LLAP
      • [Figure: panels without LLAP and with LLAP]
  • 16. [Experimental results][Query 93] without and with LLAP
      • [Figure: panels without LLAP and with LLAP]
  • 17. [Experimental results][Query 94] without and with LLAP
      • [Figure: panels without LLAP and with LLAP]
  • 18. [Experimental results][Query 93] Different patterns of resource utilization
      • Presto, Hive: Network → CPU
      • Spark: CPU → Network
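
The pairwise figures on slides 7 and 8 are reductions in the sum of running times, counted only over queries that both engines completed. The following is a minimal Python sketch of that calculation, not the author's actual script. The function names and the per-query timings in `engine_a` and `engine_b` are hypothetical and chosen only for illustration; the one figure taken from the slides is the 1668s → 1229s pair.

```python
from typing import Dict, Iterable, List, Optional

# Running time in seconds keyed by TPC-DS query number.
# None marks a query that the engine did not complete.
Times = Dict[int, Optional[float]]


def reduction_from_sums(slower_sum: float, faster_sum: float) -> float:
    """Percentage reduction in total running time going from slower to faster."""
    return 100.0 * (slower_sum - faster_sum) / slower_sum


def common_queries(a: Times, b: Times, exclude: Iterable[int] = ()) -> List[int]:
    """Queries completed by both engines, minus any explicitly excluded ones."""
    skip = set(exclude)
    return sorted(
        q for q in a.keys() & b.keys()
        if q not in skip and a[q] is not None and b[q] is not None
    )


def pairwise_reduction(slower: Times, faster: Times,
                       exclude: Iterable[int] = ()) -> float:
    """Reduction in the sum of running times over the commonly completed queries."""
    queries = common_queries(slower, faster, exclude)
    if not queries:
        raise ValueError("no query was completed by both engines")
    slower_sum = sum(slower[q] for q in queries)
    faster_sum = sum(faster[q] for q in queries)
    return reduction_from_sums(slower_sum, faster_sum)


# Sums reported on slide 7 (100GB, with query72): 1668s -> 1229s.
reported = round(reduction_from_sums(1668, 1229), 1)

# Hypothetical timings: engine_a is very slow on query 72 and fails query 89.
engine_a: Times = {3: 10.0, 7: 20.0, 72: 900.0, 89: None}
engine_b: Times = {3: 12.0, 7: 25.0, 72: 100.0, 89: 15.0}

# With query 72, engine_b looks far faster than engine_a.
with_q72 = pairwise_reduction(engine_a, engine_b)
# Without query 72, the order flips and engine_a is the faster one.
without_q72 = pairwise_reduction(engine_b, engine_a, exclude=[72])

print(f"reported pair: {reported}%")
print(f"with query 72: {with_q72:.1f}%  without query 72: {without_q72:.1f}%")
```

In this made-up data a single long-running query dominates the sum, so excluding it reverses the comparison, which is the kind of effect the "with query72" and "without query72" columns are meant to expose.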