Presto as a Service - Tips for operation and monitoring
1. Presto as a Service
Tips for operation and monitoring
Taro L. Saito
Treasure Data, Inc.
leo@treasure-data.com
January 20, 2015
Presto Meetup Japan @ FreakOut, Roppongi
2. About Me: @taroleo
• 2007 University of Tokyo. Ph.D.
– XML DBMS, Transaction Processing
• Relational-Style XML Query. ACM SIGMOD 2008
• ~ 2014 Assistant Professor at University of Tokyo
– Genome Science Research
• Distributed Computing
• March 2014 ~ Treasure Data
– Software Engineer, MPP Team Leader
3. My Open Source Projects
• sqlite-jdbc
– SQLite DBMS for Java
– 1 file = 1 DB
• snappy-java
– Fast compression library
– More than 100,000 downloads/month
• Used in Spark, Parquet, etc.
• msgpack-java
• UT Genome Browser (UTGB)
– Visualization of massive amounts of genome science data
4. Topics
• Presto as a Service in Treasure Data
– Error Recovery
– Presto Deployment
• Tips for Monitoring Presto
– JSON API
– Presto + Fluentd
6. [Architecture diagram]
• TD API / Web Console
– Interactive query: Presto
– Batch query: Hive
• td-presto connector
• Treasure Data PlazmaDB: MessagePack Columnar Storage
7. Deployment
• Building Presto takes more than 20 minutes.
• Facebook frequently releases new versions
• Let CircleCI build Presto
– Deploy jar files to private Maven repository
– We sometimes use non-release versions
• for fixing serious bugs
• hot-fix patches
• Integration Test
– td-presto connector
• PlazmaDB, Multi-tenant query scheduler
• Query optimizer
– Run test queries on staging cluster
8. Production: Blue-Green Deployment
• http://martinfowler.com/bliki/BlueGreenDeployment.html
• 2 Presto Coordinators (Blue/Green)
– Route Presto queries to the active cluster
– No down-time upon deployment
• Launch Presto worker instances with Chef (takes less than 5 minutes on AWS)
• The inactive cluster is used for pre-production testing and customer support
– Investigation and tuning of customer query performance
– Troubleshooting
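The routing idea behind the blue/green setup can be sketched as follows. This is a minimal illustration, not Treasure Data's implementation: the `BlueGreenRouter` class and the coordinator URLs are hypothetical names introduced here.

```python
from dataclasses import dataclass


@dataclass
class BlueGreenRouter:
    """Tracks which of two coordinators receives production queries.

    Hypothetical helper for illustration; names and URLs are assumptions.
    """
    blue_url: str
    green_url: str
    active: str = "blue"  # either "blue" or "green"

    def active_url(self) -> str:
        # Production queries go to the active cluster.
        return self.blue_url if self.active == "blue" else self.green_url

    def inactive_url(self) -> str:
        # The inactive cluster is free for pre-production testing.
        return self.green_url if self.active == "blue" else self.blue_url

    def switch(self) -> None:
        # Flip production traffic to the other cluster after a deployment.
        self.active = "green" if self.active == "blue" else "blue"


# Placeholder coordinator addresses (assumed, not from the slides).
router = BlueGreenRouter(
    blue_url="http://presto-blue.example:8080",
    green_url="http://presto-green.example:8080",
)
```

In this sketch a deployment would install the new version on the cluster returned by `inactive_url()`, test it there, and then call `switch()` so that new queries are routed to it without downtime.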
9. Error Recovery
• Presto has no fault tolerance
• Error types
– User error
• Syntax errors
– SQL syntax, missing function
• Semantic errors
– missing tables/columns
– Insufficient resource
• Exceeded task memory size
– Internal failure
• I/O error
– S3/Riak CS
• worker failure
• etc.
Worth A Retry!
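Because Presto itself has no fault tolerance, retries have to happen in the layer that submits queries. The sketch below shows one way to do that, assuming that only internal failures are retried and that user errors and resource errors are returned to the user immediately. The category names, `QueryFailed`, and `run_with_retry` are hypothetical and are not part of Presto.

```python
import time

# Hypothetical error categories mirroring the error types listed above.
# Assumption: only internal failures (I/O errors, worker failures) are retried.
RETRYABLE = {"INTERNAL_FAILURE"}


class QueryFailed(Exception):
    """Raised by the (hypothetical) query runner with an error category."""

    def __init__(self, category, message=""):
        super().__init__(message or category)
        self.category = category


def run_with_retry(run_query, max_attempts=3, backoff_seconds=0.0):
    """Call run_query(), retrying only when the failure category is retryable."""
    attempt = 0
    while True:
        attempt += 1
        try:
            return run_query()
        except QueryFailed as e:
            # User errors and resource errors would fail again, so re-raise.
            if e.category not in RETRYABLE or attempt >= max_attempts:
                raise
            # Simple linear backoff before the next attempt.
            time.sleep(backoff_seconds * attempt)
```

A real classifier would have to map Presto's error messages to these categories; that mapping is left out here because the slides do not describe it.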
13. Monitoring Presto
• REST API for monitoring Presto state
– JSON format
• (presto server IP):8080/v1/query
– List of recent queries (BasicQueryInfo class)
• (presto server IP):8080/v1/query/(query id)
– Detailed query state information
– Query plan, tasks and running worker IDs
– Processed rows/data size
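A small polling client for these two endpoints might look like the sketch below. It uses only the Python standard library. The `queryId` and `state` field names are assumptions about the JSON payload and should be checked against the Presto version in use; the helper function names are hypothetical.

```python
import json
from collections import Counter
from urllib.request import urlopen


def query_list_url(host, port=8080):
    """URL of the recent-query list on the coordinator."""
    return f"http://{host}:{port}/v1/query"


def query_detail_url(host, query_id, port=8080):
    """URL of the detailed state for one query."""
    return f"http://{host}:{port}/v1/query/{query_id}"


def fetch_json(url, timeout=10):
    """Fetch a URL and decode the JSON body (requires a reachable server)."""
    with urlopen(url, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


def count_by_state(queries):
    """Count queries per state; 'state' is an assumed field name."""
    return Counter(q.get("state", "UNKNOWN") for q in queries)


# Hand-written sample in the assumed shape, so the sketch runs offline.
sample = [
    {"queryId": "q1", "state": "RUNNING"},
    {"queryId": "q2", "state": "FAILED"},
    {"queryId": "q3", "state": "RUNNING"},
]
print(count_by_state(sample))
```

Against a live coordinator, `count_by_state(fetch_json(query_list_url(host)))` would give the same kind of summary for the recent queries.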
24. Detecting Anomaly
• Started query rate (over 5 min / 15 min windows)
– If no query has started, the cluster may be down (or may not have started properly)
• Processed rows in a query
– Sum the number of processed rows across all of the sub-stages
– Simple, but the most reliable measure
• Send an alert
– HipChat notification
– PagerDuty call
• JP/US team rotation
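Both checks can be sketched in a few lines. The stage dictionary below uses hypothetical keys (`processedRows`, `subStages`) to stand in for the nested stage information in the query detail JSON; the real field names depend on the Presto version. Timestamps are plain seconds.

```python
def total_processed_rows(stage):
    """Recursively sum processed rows over a stage and all of its sub-stages.

    'processedRows' and 'subStages' are assumed key names for illustration.
    """
    if stage is None:
        return 0
    rows = stage.get("processedRows", 0)
    return rows + sum(total_processed_rows(s) for s in stage.get("subStages", []))


def started_query_count(start_times, now, window_seconds):
    """Number of queries that started within the last window_seconds."""
    return sum(1 for t in start_times if 0 <= now - t <= window_seconds)


def cluster_looks_down(start_times, now, window_seconds=300):
    """True when no query has started in the window (default 5 minutes)."""
    return started_query_count(start_times, now, window_seconds) == 0


# Hand-written example tree in the assumed shape.
example_stage = {
    "processedRows": 10,
    "subStages": [
        {"processedRows": 100, "subStages": []},
        {"processedRows": 1000, "subStages": [{"processedRows": 5}]},
    ],
}
```

An alerting job could call `cluster_looks_down` on a schedule and forward a positive result to the notification channel; that wiring is omitted here.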
25. Benchmarking
• Query performance comparison
– between two versions of Presto
• Benchmark
– Run query set multiple times
– Store the results in TD
– Report the result with Presto
• Aggregation query
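The benchmark loop can be sketched as below. This is an in-memory illustration only: in the setup described above the timings are stored in TD and aggregated with a Presto query, whereas here the median is computed locally. `benchmark`, `compare`, and the injected `run_query` callable are hypothetical names.

```python
import statistics
import time


def benchmark(run_query, queries, runs=3, clock=time.perf_counter):
    """Run each query `runs` times and return the median elapsed time per query.

    run_query is any callable that executes one SQL string (assumed interface).
    """
    results = {}
    for name, sql in queries.items():
        timings = []
        for _ in range(runs):
            start = clock()
            run_query(sql)
            timings.append(clock() - start)
        results[name] = statistics.median(timings)
    return results


def compare(baseline, candidate):
    """Ratio candidate/baseline per query; a value above 1.0 means slower."""
    return {
        name: candidate[name] / baseline[name]
        for name in baseline
        if name in candidate and baseline[name] > 0
    }
```

Running `benchmark` once against each Presto version and passing the two result dictionaries to `compare` gives a per-query slowdown or speedup factor.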