3. 3
History of Presto
FALL 2008
Facebook open sources Hive
FALL 2012
6 developers start Presto development
SPRING 2013
Presto rolled out within Facebook
FALL 2013
Facebook open sources Presto
FALL 2014
88 Releases
41 Contributors
3943 Commits
FALL 2015
132 Releases
105 Contributors
6300 Commits
Teradata is part of the Presto community & offers support
4. 4
What is Presto?
➔ 100% open source distributed ANSI SQL engine for Big Data
➔ Optimized for low-latency, interactive querying
◆ Cross-platform query capability, not only SQL on Hadoop
◆ Distributed under the Apache license, now supported by Teradata
◆ Used by a community of well-known, well-respected technology companies
◆ Modern code base
◆ Proven scalability
5. 5
High level architecture
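[Diagram: a Client talks to the Coordinator, which contains the Parser/analyzer, Planner, and Scheduler, alongside several Workers. Three pluggable APIs connect Presto to data sources: the Metadata API, the Data location API, and the Data stream API (shown next to the Workers).]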
7. 7
Presto Extensibility – connector interfaces
[Diagram: the Coordinator (Parser/analyzer, Planner, Scheduler) and a Worker sit on top of three connector interfaces: the Metadata API, the Data location API, and the Data stream API. Each interface has implementations for Hive, Cassandra, Kafka, MySQL, …]
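The split into three connector interfaces can be illustrated with a minimal Python sketch. Everything named here (`Connector`, `Split`, `InMemoryConnector`, `scan`, and the method names) is hypothetical and is not Presto's actual SPI; which component calls which API is also an assumption made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Iterator, List, Protocol, Tuple

Row = Tuple[object, ...]


@dataclass(frozen=True)
class Split:
    """A unit of work: one chunk of a table that a worker can read."""
    table: str
    start: int
    end: int


class Connector(Protocol):
    """Hypothetical stand-in for the three connector-facing APIs on the slide."""

    def get_columns(self, table: str) -> List[str]: ...              # Metadata API
    def get_splits(self, table: str, size: int) -> List[Split]: ...  # Data location API
    def read_split(self, split: Split) -> Iterator[Row]: ...         # Data stream API


class InMemoryConnector:
    """Toy connector over Python lists, standing in for Hive, MySQL, and so on."""

    def __init__(self, tables: Dict[str, Tuple[List[str], List[Row]]]) -> None:
        self._tables = tables

    def get_columns(self, table: str) -> List[str]:
        return list(self._tables[table][0])

    def get_splits(self, table: str, size: int) -> List[Split]:
        n = len(self._tables[table][1])
        return [Split(table, i, min(i + size, n)) for i in range(0, n, size)]

    def read_split(self, split: Split) -> Iterator[Row]:
        yield from self._tables[split.table][1][split.start:split.end]


def scan(connector: Connector, table: str, split_size: int = 2) -> List[Row]:
    """Ask where the data lives, then stream every split (all in one process here)."""
    rows: List[Row] = []
    for split in connector.get_splits(table, split_size):
        rows.extend(connector.read_split(split))
    return rows


connector = InMemoryConnector(
    {"users": (["id", "name"], [(1, "ann"), (2, "bob"), (3, "cy")])}
)
columns = connector.get_columns("users")
splits = connector.get_splits("users", 2)
all_rows = scan(connector, "users")
```

The engine-side code (`scan`) only depends on the three methods, so swapping in a different data source means supplying another object with the same shape.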
8. 8
Presto Extensibility – plugins
➔ Connectors
➔ Data types
➔ Extra functions
➔ Security providers
9. 9
Some usage facts
➔ Facebook
◆ Multiple production clusters (100s of nodes total)
● Including 300PB Hadoop data warehouse
● Single cluster size on the order of 10s of nodes
◆ 1000s of internal daily active users
◆ Millions of queries each month
◆ Multiple PBs scanned every day
◆ Trillions of rows a day
◆ ORC format
➔ Netflix
◆ Over 250-node production cluster on EC2
◆ Over 15 PB in S3 (Parquet format)
◆ Over 300 users and 2.5K queries daily
◆ presto-cli, R, Python, BI tools
◆ 50% of queries under 4s
10. 10
Netflix Data Pipeline
[Diagram: TVs, mobile, and laptop devices; Suro / Kafka and Ursula (events); Cassandra and Aegisthus (dimensions); Amazon S3; TD]
11. 11
Presto use-cases at Facebook
➔ Three use cases
◆ Data warehouse - big data
◆ User facing - small data
◆ User facing - medium data
13. 13
Presto use-cases at Facebook (data warehouse)
➔ Multiple clusters
➔ O(10^3) users
➔ O(10^6) queries per month
➔ Petabytes of data scanned every day
➔ 100s of concurrent queries
14. 14
Presto use-cases at Facebook (data warehouse)
[Diagram: Loader, Client, and Hive around a cluster of Data Nodes, each labeled with Presto or M/R.]
17. 17
Presto use-cases at Facebook (realtime)
Requirements
➔ User facing
➔ 0.1-5 seconds latency
➔ Support for data updates
➔ Highly available
➔ 10-15 way joins
18. 18
Presto use-cases at Facebook (realtime)
[Diagram: a Loader and a Client, three Presto nodes, and five mysql instances.]
19. 19
Presto use-cases at Facebook (semi realtime)
Requirements
➔ Large data sets (smaller than warehouse)
➔ Seconds to minutes latency
➔ Predictable performance
➔ 5-15 minutes load latency
➔ 100s of concurrent queries
22. 22
Presto use-cases at Facebook (semi realtime)
Raptor
[Diagram: a Client and Loaders; Presto nodes, each paired with Flash storage; a Presto node with mysql; Kafka sources; Gluster as the backup tier.]
Loader annotations:
INSERT INTO raptor_table SELECT * FROM kafka_table WHERE token BETWEEN ${last_token} AND ${next_token}
MARK LOAD in PROGRESS in MySQL
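The loader annotations on this slide (a token-range INSERT ... SELECT plus a progress marker kept in MySQL) can be sketched as a small loop. This is only an illustration under stated assumptions: SQLite stands in for Kafka, Raptor, and MySQL alike; the `load_state` table, its columns, the `DONE` status, and the ordering of the steps are hypothetical and do not come from the slides.

```python
import sqlite3

# One in-memory SQLite database stands in for all three systems (assumption).
db = sqlite3.connect(":memory:")
db.executescript(
    """
    CREATE TABLE kafka_table (token INTEGER, payload TEXT);
    CREATE TABLE raptor_table (token INTEGER, payload TEXT);
    CREATE TABLE load_state (last_token INTEGER, next_token INTEGER, status TEXT);
    """
)
db.executemany(
    "INSERT INTO kafka_table VALUES (?, ?)",
    [(i, f"event-{i}") for i in range(1, 11)],
)
db.commit()


def load_range(conn: sqlite3.Connection, last_token: int, next_token: int) -> None:
    """Copy one inclusive token range and record its progress."""
    # Mark the load as in progress before copying (ordering is an assumption).
    conn.execute(
        "INSERT INTO load_state VALUES (?, ?, 'IN_PROGRESS')",
        (last_token, next_token),
    )
    conn.commit()
    # The statement from the slide, with the ${...} placeholders bound as parameters.
    conn.execute(
        "INSERT INTO raptor_table "
        "SELECT * FROM kafka_table WHERE token BETWEEN ? AND ?",
        (last_token, next_token),
    )
    # Hypothetical final step: flip the marker in the same transaction as the copy.
    conn.execute(
        "UPDATE load_state SET status = 'DONE' "
        "WHERE last_token = ? AND next_token = ?",
        (last_token, next_token),
    )
    conn.commit()


# BETWEEN is inclusive on both ends, so consecutive ranges must not share a token.
load_range(db, 1, 5)
load_range(db, 6, 10)
loaded = db.execute("SELECT COUNT(*) FROM raptor_table").fetchone()[0]
states = [row[0] for row in db.execute("SELECT status FROM load_state ORDER BY last_token")]
```

Keeping the progress marker in a separate store lets a restarted loader see which token range was in flight, which is one plausible reason for the marker on the slide.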
23. 23
Presto use-cases at Facebook (semi realtime)
Extra features
➔ Physical data reorganization
➔ Fully fledged and atomic DDL
➔ Atomic data loading
➔ Tiered architecture
24. 24
Presto = Performance
➔ Data stays in memory during execution and is pipelined across nodes, MPP-style
➔ Vectorized columnar processing
➔ Presto is written in highly tuned Java
◆ Efficient in-memory data structures
◆ Very careful coding of inner loops
◆ Bytecode generation
➔ Optimized ORC reader
➔ Predicate push-down
➔ Query optimizer
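Predicate push-down can be illustrated with a small Python sketch. It assumes a columnar layout in which each chunk of a column carries min/max statistics, as ORC files do; the `Stripe` class and the `make_stripe` and `scan_greater_than` functions are hypothetical names for this example and do not reflect Presto's reader.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class Stripe:
    """A chunk of one integer column plus the statistics stored with it."""
    values: Tuple[int, ...]
    min_value: int
    max_value: int


def make_stripe(values: Iterable[int]) -> Stripe:
    """Build a stripe and compute its min/max statistics once, at write time."""
    data = tuple(values)
    return Stripe(data, min(data), max(data))


def scan_greater_than(stripes: List[Stripe], threshold: int) -> Tuple[List[int], int]:
    """Return values > threshold and the number of stripes actually read."""
    matches: List[int] = []
    stripes_read = 0
    for stripe in stripes:
        if stripe.max_value <= threshold:
            continue  # statistics prove nothing in this stripe can match: skip it
        stripes_read += 1
        matches.extend(v for v in stripe.values if v > threshold)
    return matches, stripes_read


stripes = [make_stripe([1, 2, 3]), make_stripe([4, 5, 6]), make_stripe([7, 8, 9])]
matches, stripes_read = scan_greater_than(stripes, 6)
```

With the predicate `value > 6`, only the last stripe is read; the first two are skipped from their statistics alone, which is the saving that pushing the predicate down to the reader buys.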