Our Presto use case
and
performance test
Hironori Ogibayashi
Shin Matsuura
About us
● Hironori Ogibayashi (@angostura11)
● Shin Matsuura
○ IT infrastructure team at a Japanese
telecommunications carrier
○ Mainly working on middleware: testing,
installation, deployment.
Today's Topics
● Presto use case
○ Deployment
○ Use case
○ Challenges
○ Future work
● Performance comparison between
Hive+Tez and Presto
Presto use case
Log Collection Flow
Fluentd
Aggregator
Hadoop Cluster Application
WebHDFS
・1500 Fluentd instances
・25,000 msg / sec
・400GB / day
・150 types of logs
Log Usage
● Systems Infrastructure team
○ Checking trends in server performance
○ Performance analysis of Oracle
Database
● Application development team
○ Improving system and business
operations.
Application for Oracle DB Performance Analysis
- Check for existing or potential problems in an
Oracle database for a given system and
period.
- Uses logs stored in HDFS. Queries
were originally executed on Hive.
- However, it took more than one hour to
get the result...
- (So, we migrated to Presto.)
Why Presto?
● Frequent use of Interactive / ad-hoc
queries.
● Of course, faster is better.
Presto Deployment
[Diagram: Application/Client, Presto Coordinator, Hive Metastore, and Hadoop Slaves; each Hadoop Slave runs DataNode, TaskTracker, and Presto Worker]
● A dedicated physical machine serves as the
Coordinator.
● Workers run on each Hadoop slave.
● Logs in HDFS are periodically
converted to RCfiles.
● Presto versions
○ 0.66⇒0.73⇒0.75⇒0.82
Deployment Effect - Elapsed time of a single query
230sec
7sec
- Elapsed time of one of
the queries issued by the
application.
- The query was run on a CDH4
(MRv1) cluster.
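The two figures on this slide can be turned into a speedup ratio. The sketch below assumes that 230 sec is the elapsed time before deploying Presto and 7 sec the elapsed time after; the chart labels that would confirm this did not survive in the transcript. The function name `speedup` is illustrative only.

```python
# Assumption: 230 sec was measured before deploying Presto, 7 sec after.
BEFORE_SEC = 230.0
AFTER_SEC = 7.0


def speedup(before: float, after: float) -> float:
    """Return how many times shorter `after` is compared with `before`."""
    if before <= 0 or after <= 0:
        raise ValueError("elapsed times must be positive")
    return before / after


ratio = speedup(BEFORE_SEC, AFTER_SEC)
print(f"speedup: {ratio:.1f}x")  # prints "speedup: 32.9x"
```

Under that assumption the query became roughly 33 times faster, which is a single data point for one application query rather than a general benchmark.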
Deployment and Operation
● Deployment
○ Easy: just extract the binaries on each server and modify
the configuration file.
○ Automated by Ansible + yum.
● What we use in operation
○ Query history
■ Coordinator Web UI
○ Logs
■ /var/presto/data/logs/{server.log,launcher.log}
○ Metrics
■ presto-metrics (https://github.com/xerial/presto-metrics) ⇒ Fluentd ⇒ Elasticsearch + Kibana
○ sys schema
Challenges
● Worker crash / hang.
○ OutOfMemory errors. When a worker hangs, we resort to “kill -9”.
○ We adjusted the memory parameters so that
task.shard.max-threads × task.max-memory < -Xmx
● At first, we set node-scheduler.include-coordinator=true,
in which case the Coordinator crashed under heavy queries.
● SQL difference from HiveQL
○ At first, our application used both Hive and Presto because we were
using Presto experimentally. Hence the application had to support both
HiveQL and Presto (ANSI SQL).
○ Now, the application no longer uses Hive.
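The memory rule from the worker crash item (task.shard.max-threads × task.max-memory < -Xmx) can be checked before rolling out a configuration. The following is a minimal sketch, not the authors' tooling: the helper names `parse_size` and `fits_in_heap`, the accepted size suffixes, and the example values are all assumptions made for illustration; only the two property names and the inequality come from the slide.

```python
import re

# Assumption: sizes are written like "1GB" or "512MB" (binary multiples).
_UNITS = {"kB": 1024, "MB": 1024**2, "GB": 1024**3, "TB": 1024**4}


def parse_size(text: str) -> int:
    """Convert a size string such as '1GB' into a number of bytes."""
    match = re.fullmatch(r"(\d+(?:\.\d+)?)\s*(kB|MB|GB|TB)", text.strip())
    if match is None:
        raise ValueError(f"unrecognized size: {text!r}")
    return int(float(match.group(1)) * _UNITS[match.group(2)])


def fits_in_heap(max_threads: int, task_max_memory: str, heap: str) -> bool:
    """Check task.shard.max-threads x task.max-memory < heap (strictly)."""
    return max_threads * parse_size(task_max_memory) < parse_size(heap)


# Hypothetical values, not taken from the slides.
print(fits_in_heap(8, "1GB", "16GB"))   # 8 GB of task memory, 16 GB heap
print(fits_in_heap(32, "1GB", "16GB"))  # 32 GB of task memory, 16 GB heap
```

The comparison is strict, matching the "<" on the slide, so a product exactly equal to the heap size is rejected.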
Future work
● Improve the Coordinator’s availability.
● Security
○ Currently, all queries are executed as Presto’s daemon user.
● Resource isolation between Presto and Hadoop daemons.
Presto VS Hive+Tez
Contents
From a Performance perspective
Presto VS Hive+Tez
(without tuning any parameters)
Conclusion
Presto (Win) VS Hive+Tez (Lose)
How Fast??
Presto VS Hive+Tez
2.0~136 times
more details
Testing environment
Configurations: 2p12c, 64GB Mem, 36TB Disk
Hadoop (HDP2.1): NN, NN, Metastore; DN × 3
Presto (0.82): Coordinator; Worker × 3
Master: 3 nodes
Slave: 3 nodes
Sample data
300GB
csv file
50 columns
1.1B records
Performance measurement perspectives
• Query patterns
• Data format patterns
• Repetitive Querying
Query patterns
Queries
Query1: select count(*) from TestTBL
Query2: select * from TestTBL where col1 = 'XXX'
Query3: select * from TestTBL where col1 = 'XXX' and col2 = 'YYY'
Query4: select col1, count(*) from TestTBL group by col1
Query5: select col1, count(*) from TestTBL where col2 = 'YYY' group by col1
Results: Query patterns
※Data format: Txt
100x faster
Presto was faster than Hive+Tez in all queries.
Data format patterns
Data formats
• Text File (Textfile)
• Record Columnar File (RCfile)
• Optimized Row Columnar File (ORCfile)
Results: Data format patterns
※Query: Query2
Presto was faster than Hive+Tez in all data formats.
Repetitive Querying
Change in processing time with repetitions (Presto)
※Query: Query2
※Data format: Txt
Became faster from the second run onward.
Cache?
2.5x faster
Change in processing time with repetitions (Hive+Tez)
※Query: Query2
※Data format: Txt
No real change in processing time
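A repetition measurement like the one above can be sketched as a small timing harness. This is an illustration under stated assumptions, not the procedure the authors used: `run_query` is a hypothetical callable standing in for whatever client submits SQL to Presto or Hive, and `speedup_after_first` is an invented helper that compares the first run with the average of the later runs.

```python
import time
from typing import Callable, List


def time_repetitions(
    run_query: Callable[[str], object],
    sql: str,
    repetitions: int = 5,
    clock: Callable[[], float] = time.perf_counter,
) -> List[float]:
    """Run the same query several times and return each elapsed time in seconds."""
    elapsed = []
    for _ in range(repetitions):
        start = clock()
        run_query(sql)  # hypothetical: submits the query and waits for the result
        elapsed.append(clock() - start)
    return elapsed


def speedup_after_first(elapsed: List[float]) -> float:
    """Ratio of the first run's time to the mean time of all later runs."""
    later = elapsed[1:]
    if not later:
        raise ValueError("need at least two runs")
    return elapsed[0] / (sum(later) / len(later))


# Stand-in for a real client call; replace with an actual query submission.
def fake_run_query(sql: str) -> None:
    time.sleep(0.01)


times = time_repetitions(fake_run_query, "select * from TestTBL where col1 = 'XXX'", 3)
print([f"{t:.3f}" for t in times])
```

A ratio well above 1 from `speedup_after_first` would correspond to the "faster from the second run" pattern, while a ratio near 1 would correspond to "no real change". The `clock` parameter exists so the harness can be exercised deterministically.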
+α
Engine: Presto
Query × Data format
Is RCfile the most stable and fastest
option?
Summary
Result
● Presto was faster than Hive+Tez in all queries.
● Presto was faster than Hive+Tez in all data formats.
● With repeated querying, Presto became faster.
● With RCfile, Presto was the most stable and fastest.
Next
● Benchmark from node scaling and data volume
perspectives.
● Benchmark while using compression functions of
ORCfile.
● Benchmark with HDP2.2.
Appendix
Faster from the second run onward
under almost all conditions

20140120 presto meetup_en
