2. 2
What is Presto?
100% open source distributed SQL query engine
Originally developed by Facebook
Key Differentiators:
Performance & Scale
Cross-platform query capability, not only SQL on Hadoop
Apache licensed, hosted on GitHub
Certified distro & support from Teradata
3. 3
Brief history of Presto
FALL 2008: Facebook open sources Hive
FALL 2012: 6 developers start Presto development
SPRING 2013: Presto rolled out within Facebook
FALL 2013: Facebook open sources Presto
FALL 2014: 88 Releases, 41 Contributors, 3943 Commits
SPRING 2016: 141 Releases, 116 Contributors, 6879 Commits
4. 4
Presto in Production
• Facebook
– Multiple production clusters (100s of nodes total)
- Massive 300 PB Hadoop data warehouse
- Very large sharded MySQL installation
- Growing usage of Raptor SSD-based storage
– 1000s of internal daily active users
– 10s to 100s of concurrent queries
• Netflix
– Over 200-node production cluster on EC2
– Over 25 PB in S3 (Parquet format)
– Over 350 active users and 3K queries daily
5. 5
Presto Architecture
Diagram: a Client connects to the Coordinator (Parser/analyzer, Planner, Scheduler), which distributes work to the Workers.
Pluggable: Metadata API, Data location API, Data stream API
6. 6
Presto Extensibility – connectors
Diagram: Coordinator (Parser/analyzer, Planner, Scheduler) and Worker, each backed by a pluggable connector API.
Metadata API: HDFS/S3, NoSQL, DBMS, Custom, …
Data location API: HDFS/S3, NoSQL, DBMS, Custom, …
Data stream API: HDFS/S3, NoSQL, DBMS, Custom, …
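The three pluggable APIs on this slide can be pictured as three small interfaces that one connector implements. The sketch below is illustrative only: it is written in Python rather than against Presto's Java SPI, and every name in it (`Connector`, `InMemoryConnector`, `list_tables`, `get_columns`, `get_splits`, `read_split`, `scan`) is hypothetical, not Presto's actual interface.

```python
from abc import ABC, abstractmethod
from typing import Dict, Iterator, List, Tuple

Row = Tuple  # in this sketch a row is a plain tuple of column values


class Connector(ABC):
    """Hypothetical contract mirroring the slide's three APIs."""

    @abstractmethod
    def list_tables(self) -> List[str]:
        """Metadata API: which tables exist."""

    @abstractmethod
    def get_columns(self, table: str) -> List[str]:
        """Metadata API: column names of a table."""

    @abstractmethod
    def get_splits(self, table: str) -> List[int]:
        """Data location API: units of work a scheduler could assign."""

    @abstractmethod
    def read_split(self, table: str, split: int) -> Iterator[Row]:
        """Data stream API: the rows belonging to one split."""


class InMemoryConnector(Connector):
    """Toy 'Custom' connector backed by Python lists."""

    def __init__(self, tables: Dict[str, Tuple[List[str], List[Row]]],
                 split_size: int = 2):
        self._tables = tables
        self._split_size = split_size

    def list_tables(self) -> List[str]:
        return sorted(self._tables)

    def get_columns(self, table: str) -> List[str]:
        return list(self._tables[table][0])

    def get_splits(self, table: str) -> List[int]:
        # Each split is identified by the offset of its first row.
        row_count = len(self._tables[table][1])
        return list(range(0, row_count, self._split_size))

    def read_split(self, table: str, split: int) -> Iterator[Row]:
        rows = self._tables[table][1]
        yield from rows[split:split + self._split_size]


def scan(connector: Connector, table: str) -> List[Row]:
    """Read a whole table split by split, as a single worker would."""
    rows: List[Row] = []
    for split in connector.get_splits(table):
        rows.extend(connector.read_split(table, split))
    return rows
```

A new data source would then only need to supply these methods; the parsing, planning and scheduling code would stay unchanged, which is the point the slide makes about extensibility.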
7. 7
Supported data sources & file formats
• Hadoop/Hive connector & file formats:
– HDFS & S3 + HCatalog
– ORC, RCFile, Parquet, SequenceFile, Text
• Open source data stores:
– MySQL & PostgreSQL (non-parallel)
– Cassandra
– Kafka
– Redis
• In development by community:
– MongoDB
– Elasticsearch
– HBase
8. 8
Presto – Query Execution Performance
• In-memory processing
• MPP-style pipelined execution across nodes
• Vectorized columnar processing
• Multithreaded execution keeps all CPU cores busy
• Presto is written in highly tuned Java
– Efficient flat-memory data structures (minimizes GC)
– Very careful coding of inner loops
– Runtime bytecode generation
• Optimized ORC & Parquet readers
• Excellent performance with interactive SQL analytics
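To make "pipelined" and "columnar" on this slide concrete, here is a minimal sketch in which data moves between operators as small pages, and each page holds one flat typed array per column. It only illustrates the data layout and the operator chaining: it is Python, not Presto's Java implementation, it does no bytecode generation, and all names (`Page`, `scan`, `filter_gt`, `sum_column`) are hypothetical.

```python
from array import array
from typing import Dict, Iterator

# A page is a batch of rows stored column by column in flat typed arrays.
Page = Dict[str, array]


def scan(columns: Dict[str, array], page_size: int) -> Iterator[Page]:
    """Source operator: cut the table's columns into fixed-size pages."""
    row_count = len(next(iter(columns.values())))
    for start in range(0, row_count, page_size):
        yield {name: col[start:start + page_size]
               for name, col in columns.items()}


def filter_gt(pages: Iterator[Page], column: str, threshold: float) -> Iterator[Page]:
    """Filter operator: keep rows whose `column` value exceeds `threshold`."""
    for page in pages:
        keep = [i for i, value in enumerate(page[column]) if value > threshold]
        yield {name: array(col.typecode, (col[i] for i in keep))
               for name, col in page.items()}


def sum_column(pages: Iterator[Page], column: str) -> float:
    """Aggregation operator: consume pages as they arrive and sum one column."""
    total = 0.0
    for page in pages:
        total += sum(page[column])
    return total


table = {
    "price": array("d", [5.0, 20.0, 7.5, 30.0, 12.0]),
    "qty": array("q", [1, 2, 3, 4, 5]),
}
# Operators are chained as generators, so no stage materializes the full table.
total_qty = sum_column(filter_gt(scan(table, page_size=2), "price", 10.0), "qty")
```

Because the operators are generators, the aggregation starts consuming the first page before the scan has produced the last one, which is the pipelining idea in miniature.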
9. 9
ANSI SQL Support
[ WITH with_query [, ...] ]
SELECT [ ALL | DISTINCT ] select_expr [, ...]
[ FROM table1 [ [ INNER | OUTER ] JOIN table2 ON (…) ] ]
[ WHERE condition ]
[ GROUP BY expression [, ...] ]
[ HAVING condition ]
[ UNION [ ALL | DISTINCT ] select ]
[ ORDER BY expression [ ASC | DESC ] [, ...] ]
[ LIMIT [ count | ALL ] ]
In addition:
• Windowing functions
• Statistical and approximate aggregate functions
• UNNEST, TABLESAMPLE
In development:
• Complex subqueries
• EXISTS, INTERSECT, EXCEPT
• ROLLUP, CUBE
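As an illustration of the SELECT grammar on this slide, the query below combines WITH, INNER JOIN, GROUP BY, HAVING, ORDER BY and LIMIT. The `customers` and `orders` tables and their data are made up for the example, and the query is executed against an in-memory SQLite database purely so the sketch is self-contained; it has not been run on Presto, and SQL dialects can differ in details.

```python
import sqlite3

# Hypothetical schema: customers(id, name), orders(customer_id, amount).
QUERY = """
WITH big_orders AS (
    SELECT customer_id, amount FROM orders WHERE amount >= 100
)
SELECT c.name, SUM(b.amount) AS total
FROM customers c
INNER JOIN big_orders b ON (c.id = b.customer_id)
GROUP BY c.name
HAVING SUM(b.amount) > 150
ORDER BY total DESC
LIMIT 10
"""


def run_example():
    """Load a few sample rows into SQLite and run QUERY against them."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript("""
            CREATE TABLE customers (id INTEGER, name TEXT);
            CREATE TABLE orders (customer_id INTEGER, amount INTEGER);
            INSERT INTO customers VALUES (1, 'ann'), (2, 'bob'), (3, 'cy');
            INSERT INTO orders VALUES
                (1, 120), (1, 80), (1, 200), (2, 100), (3, 160), (3, 50);
        """)
        return conn.execute(QUERY).fetchall()
    finally:
        conn.close()
```

With the sample rows above, only orders of at least 100 reach the join, and only customers whose remaining orders sum to more than 150 survive the HAVING clause.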
10. 10
Deployment models
• Cluster deployment models for Presto:
– on-premises (appliance or commodity clusters)
– VM (OpenStack, etc.)
– cloud (Amazon, etc.)
• Types of Hadoop deployments:
– on Hadoop/YARN cluster (all or subset of nodes)
– on a dedicated cluster
– mixed
11. 11
Open source initiative
• Announced in June 2015 at Hadoop Summit
– Growing interest and adoption
• Collaboration with Facebook and Presto community
– Joint development, conference talks, meetups and webinars
• Major commitment from Teradata Labs:
– 20 full-time engineers
– Free and open source contributions
– Enterprise-ready distribution
"A special shout out goes to Teradata — which joined the Presto community this year
with a focus on enhancing enterprise features and providing support — for having
seven of our top 10 external contributors."
— Facebook
12. 12
Teradata Contributions to Presto
Phase 1 (June 8, 2015): Implement
Phase 2 (Q4 2015): Integrate
Phase 3 (2016): Proliferate
• Installer
• Documentation
• Monitoring & Support Tools
• Management Tool Integration
• YARN Integration
• ODBC Driver
• JDBC Driver
• BI Certification
• Security
• Cloud features
Commercial Support
Expanding ANSI SQL Coverage
13. 13
Recent developments and roadmap
• Q1 release:
– Fully-featured ODBC & JDBC drivers
– Kerberos support
– DECIMAL support
• Later 2016:
– BI tools certification
– Run TPC-H and TPC-DS queries unmodified
– Spill to disk