20140120 presto meetup_en

Our Presto use case and performance test
Hironori Ogibayashi
Shin Matsuura

About us
● Hironori Ogibayashi(@angostura11)
● Shin Matsuura
○ IT infrastructure team at a Japanese telecommunications carrier
○ Mainly working on middleware: testing, installation, deployment.

Today's Topics
● Presto use case
○ Deployment
○ Use case
○ Challenges
○ Future work
● Performance comparison between
Hive+Tez and Presto

Log Collection Flow
Fluentd ⇒ Aggregator ⇒ WebHDFS ⇒ Hadoop Cluster ⇒ Application
・1500 Fluentd instances
・25,000 msg / sec
・400GB / day
・150 types of log
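
As a rough sanity check on these figures, the sketch below derives the daily message count and the average message size they imply. It assumes that the 25,000 msg/sec rate is sustained around the clock and that 1 GB means 10^9 bytes; neither assumption is stated on the slide.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def daily_messages(msgs_per_sec: int) -> int:
    """Messages per day, assuming the rate is constant all day (an assumption)."""
    return msgs_per_sec * SECONDS_PER_DAY


def avg_message_bytes(bytes_per_day: int, msgs_per_sec: int) -> float:
    """Average message size implied by a daily volume and a message rate."""
    return bytes_per_day / daily_messages(msgs_per_sec)


# Figures from the slide; 1 GB is taken as 10**9 bytes (an assumption).
msgs = daily_messages(25_000)
size = avg_message_bytes(400 * 10**9, 25_000)
print(msgs)         # 2160000000
print(round(size))  # 185
```

Under these assumptions the logs average a little under 200 bytes per message.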

Log Usage
● Systems Infrastructure team
○ Checking trends in server performance
○ Performance analysis of Oracle Database
● Application development team
○ Improving system and business operations.

Application for Oracle DB Performance Analysis
- Checks for existing and potential problems in an Oracle database, for a given system and a given period.
- Uses logs stored in HDFS. Queries were executed on Hive.
- But it took more than one hour to get the result...
- (So, we migrated to Presto.)

Why Presto?
● Frequent use of interactive / ad-hoc queries.
● Of course, faster is better.

Presto Deployment
Diagram components: Application/Client; Presto Coordinator; Hive Metastore; Hadoop Slaves (each running DataNode, TaskTracker, and Presto Worker)
● A dedicated physical machine as the Coordinator.
● Workers run on each Hadoop slave.
● Logs in HDFS are periodically converted to RCfiles.
● Presto versions
○ 0.66⇒0.73⇒0.75⇒0.82

Deployment Effect - Elapsed time of a single query
230 sec ⇒ 7 sec
- Elapsed time of one of the queries issued by the application.
- The query was run on a CDH4 (MRv1) cluster.
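
To put the two figures in proportion, the sketch below computes the speedup ratio. It reads 230 sec as the elapsed time before the Presto deployment and 7 sec as the time after it, which is an interpretation of the chart rather than something the slide text labels explicitly.

```python
def speedup(before_sec: float, after_sec: float) -> float:
    """Ratio of the old elapsed time to the new one."""
    if after_sec <= 0:
        raise ValueError("after_sec must be positive")
    return before_sec / after_sec


# Values from the slide; which engine produced which value is assumed.
print(f"{speedup(230, 7):.1f}x")  # 32.9x
```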

Deployment and Operation
● Deployment
○ Easy: just extract the binaries on each server and modify the configuration file.
○ Automated by Ansible + yum.
● What we use in operation
○ Query history
■ Coordinator Web UI
○ Logs
■ /var/presto/data/logs/{server.log,launcher.log}
○ Metrics
■ presto-metrics (https://github.com/xerial/presto-metrics) ⇒ Fluentd ⇒ Elasticsearch + Kibana
○ sys schema

Challenges
● Worker crash / hang.
○ OutOfMemory. When a worker hangs, we resort to “kill -9”.
○ We modified the memory parameters so that task.shard.max-threads × task.max-memory < -Xmx.
● At first, we set node-scheduler.include-coordinator=true. With that setting, the Coordinator crashed under heavy queries.
● SQL differences from HiveQL
○ At first, our application used both Hive and Presto because we were using Presto experimentally. Hence the application had to support both HiveQL and Presto (ANSI SQL).
○ Now, the application no longer uses Hive.
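
The memory rule above (task.shard.max-threads × task.max-memory < -Xmx) can be checked before rolling out a configuration. The following is a minimal sketch, not part of Presto: `parse_size` and `fits_in_heap` are hypothetical helpers, the sizes are assumed to use binary units (1 GB = 1024^3 bytes), and the example values are illustrative rather than the settings used in this deployment.

```python
import re

# Accepts forms such as "1GB" (Presto-style) and "16G" (JVM -Xmx style).
_SIZE = re.compile(r"^(\d+(?:\.\d+)?)([KMGT]?)B?$")
_FACTOR = {"": 1, "K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}


def parse_size(text: str) -> int:
    """Convert a size string to bytes, assuming binary units."""
    match = _SIZE.match(text.strip().upper())
    if match is None:
        raise ValueError(f"unrecognized size: {text!r}")
    number, unit = match.groups()
    return int(float(number) * _FACTOR[unit])


def fits_in_heap(max_threads: int, task_max_memory: str, xmx: str) -> bool:
    """True if max_threads x task_max_memory stays strictly below the heap size."""
    return max_threads * parse_size(task_max_memory) < parse_size(xmx)


# Illustrative values only (assumptions, not the presenters' settings).
print(fits_in_heap(8, "1GB", "16G"))   # True: 8 GB is below a 16 GB heap
print(fits_in_heap(32, "1GB", "16G"))  # False: 32 GB exceeds a 16 GB heap
```

A check like this could run as part of the Ansible rollout mentioned earlier, so that an inconsistent pair of settings is caught before a worker runs out of memory.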

Future work
● Improve Coordinator’s availability.
● Security
○ Now, all queries are executed as Presto’s daemon user.
● Resource isolation between Presto and Hadoop daemons.

Contents
From a Performance perspective
Presto VS Hive+Tez
(without tuning any parameters)

Conclusion
Presto (Win) VS Hive+Tez (Lose)

How Fast??
Presto VS Hive+Tez
Presto was 2.0~136 times faster

Testing environment
● Configurations: 2p12c, 64GB Mem, 36TB Disk
● Hadoop (HDP2.1): NN, NN, Metastore, DN × 3
● Presto (0.82): Coordinator, Worker × 3
● Master: 3 nodes
● Slave: 3 nodes

Sample data
300GB
csv file
50 columns
1.1B records

Performance measurement perspectives
• Query patterns
• Data format patterns
• Repetitive Querying

Queries
Query1: select count(*) from TestTBL
Query2: select * from TestTBL where col1 = 'XXX'
Query3: select * from TestTBL where col1 = 'XXX' and col2 = 'YYY'
Query4: select col1, count(*) from TestTBL group by col1
Query5: select col1, count(*) from TestTBL where col2 = 'YYY' group by col1
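
One way to time these five queries against either engine is a small harness such as the sketch below. The slides do not describe the actual measurement tooling, so this is an assumption: `run_query` is a hypothetical callable standing in for whatever client submits SQL to Presto or Hive+Tez and blocks until the result is complete.

```python
import time
from typing import Callable, Dict, List

QUERIES: Dict[str, str] = {
    "Query1": "select count(*) from TestTBL",
    "Query2": "select * from TestTBL where col1 = 'XXX'",
    "Query3": "select * from TestTBL where col1 = 'XXX' and col2 = 'YYY'",
    "Query4": "select col1, count(*) from TestTBL group by col1",
    "Query5": "select col1, count(*) from TestTBL where col2 = 'YYY' group by col1",
}


def time_query(run_query: Callable[[str], object], sql: str,
               repetitions: int = 3) -> List[float]:
    """Run one query several times and return the elapsed seconds per run."""
    timings = []
    for _ in range(repetitions):
        start = time.perf_counter()
        run_query(sql)  # hypothetical client call; must block until done
        timings.append(time.perf_counter() - start)
    return timings


def benchmark(run_query: Callable[[str], object], queries: Dict[str, str],
              repetitions: int = 3) -> Dict[str, List[float]]:
    """Time every query and return a mapping of query name to timings."""
    return {name: time_query(run_query, sql, repetitions)
            for name, sql in queries.items()}


# Stand-in runner so the sketch executes without a cluster; replace it with
# a real client call to measure an actual engine.
results = benchmark(lambda sql: None, QUERIES, repetitions=2)
for name, timings in results.items():
    print(name, [f"{t:.6f}" for t in timings])
```

Keeping the per-run timings, rather than only an average, makes it possible to compare the first run with later runs of the same query.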

Results: Query patterns
※Data format: Txt
100x faster
Presto was faster than Hive+Tez in all queries.

Data formats
• Text File (Textfile)
• Record Columnar File (RCfile)
• Optimized Row Columnar File (ORCfile)

Results: Data format patterns
※Query: Query2
Presto was faster than Hive+Tez in all data formats.

Change in processing time with repetitions (Presto)
※Query: Query2
※Data format: Txt
Became faster from the second run onward (2.5x faster).
Cache?
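
A ratio like the 2.5x above can be derived from a list of per-run timings by comparing the first run with the average of the later runs. The sketch below shows one such calculation; the slide does not say how its figure was computed, so the method is an assumption, and the sample timings are made-up placeholders rather than measured values.

```python
from statistics import mean
from typing import Sequence


def warm_speedup(timings: Sequence[float]) -> float:
    """First-run time divided by the mean time of all later runs."""
    if len(timings) < 2:
        raise ValueError("need at least two runs")
    return timings[0] / mean(timings[1:])


# Placeholder timings in seconds (not measurements from the benchmark).
print(warm_speedup([10.0, 4.0, 4.0]))  # 2.5
```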

Change in processing time with repetitions (Hive+Tez)
※Query: Query2
※Data format: Txt
No real change in processing time

Engine: Presto
Query × Data format
Is RCfile the most stable and fastest option?

Summary
Result
● Presto was faster than Hive+Tez in all queries.
● Presto was faster than Hive+Tez in all data formats.
● With repeated querying, Presto became faster.
● Presto was most stable and fastest with RCfile.
Next
● Benchmark from node scaling and data volume perspectives.
● Benchmark while using the compression functions of ORCfile.
● Benchmark with HDP2.2.

Faster from the second run onward under almost all conditions
