Presentation on Presto (http://prestodb.io) basics, design and Teradata's open source involvement. Presented on Sept 24th 2015 by Wojciech Biela and Łukasz Osipiuk at the #20 Warsaw Hadoop User Group meetup http://www.meetup.com/warsaw-hug/events/224872317
2. 2
➔ History of Teradata Center for Hadoop
◆ Formerly Hadapt, founded in July 2010 by Justin Borgman, Kamil Bajda-Pawlikowski, and Daniel Abadi
◆ Pioneered SQL-on-Hadoop market
◆ Based on work done by the database research group in the Yale Computer Science Department
◆ Hybrid of Hadoop scalability and DBMS performance
➔ Today
◆ Acquired by Teradata in July, 2014, renamed Teradata Center for Hadoop
◆ 20+ developers with deep Hadoop and database expertise
◆ Headquarters in Boston, MA
◆ Teams in US (MA, CA) and Poland (Warsaw)
◆ Contributors to open source project Presto
Who are we? - Teradata Center for Hadoop!
3. 3
➔ What is Presto?
➔ What is Teradata doing?
➔ Can I see a Demo?
➔ How can I contribute?
Talk Agenda
4. 4
➔ 100% open source distributed ANSI SQL engine for Big Data
◆ Modern code base
◆ Proven scalability
➔ Optimized for low-latency, interactive querying
◆ Cross platform query capability, not only SQL on Hadoop
◆ Distributed under the Apache license, now supported by Teradata
◆ Used by a community of well known, well respected technology companies
What is Presto?
5. 5
History of Presto
➔ Fall 2008: Facebook open sources Hive
➔ Fall 2012: 6 developers start Presto development
➔ Spring 2013: Presto rolled out within Facebook
➔ Fall 2013: Facebook open sources Presto
➔ Fall 2014: 88 releases, 41 contributors, 3943 commits
➔ Spring 2015: 98 releases, 65 contributors, 4587 commits; Teradata joins the Presto community and offers support
6. 6
Query Execution
[Diagram: a Client connects to the Coordinator (Parser/analyzer, Planner, Scheduler), which works with three Workers. Pluggable interfaces: Metadata API, Data location API, Data stream API.]
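The components named in the Query Execution diagram can be read as a pipeline: the client submits SQL to the coordinator, which parses, plans, and schedules work onto workers. The following is a minimal Python sketch of that flow, not Presto's implementation. It assumes a toy query form and an in-memory table already divided into chunks; every name in it (Coordinator, Worker, Split, TABLES) is hypothetical.

```python
from dataclasses import dataclass

# Toy storage (assumption): each table is a list of chunks, one chunk per split.
TABLES = {
    "orders": [
        [{"id": 1, "total": 10}, {"id": 2, "total": 25}],
        [{"id": 3, "total": 40}],
        [{"id": 4, "total": 5}, {"id": 5, "total": 60}],
    ]
}


@dataclass
class Split:
    """A unit of work: one chunk of one table."""
    table: str
    index: int


class Worker:
    def __init__(self, name):
        self.name = name

    def run(self, split, min_total):
        # Read the rows of the assigned split and apply the filter locally.
        rows = TABLES[split.table][split.index]
        return [r for r in rows if r["total"] >= min_total]


class Coordinator:
    def __init__(self, workers):
        self.workers = workers

    def parse(self, sql):
        # Toy parser: only understands
        #   SELECT * FROM <table> WHERE total >= <n>
        tokens = sql.split()
        return {"table": tokens[3], "min_total": int(tokens[7])}

    def plan(self, query):
        # Planning here just enumerates the splits of the table.
        count = len(TABLES[query["table"]])
        return [Split(query["table"], i) for i in range(count)]

    def schedule(self, splits):
        # Round-robin assignment of splits to workers.
        return [(self.workers[i % len(self.workers)], s)
                for i, s in enumerate(splits)]

    def execute(self, sql):
        query = self.parse(sql)
        result = []
        for worker, split in self.schedule(self.plan(query)):
            result.extend(worker.run(split, query["min_total"]))
        return result


coordinator = Coordinator([Worker("w1"), Worker("w2")])
rows = coordinator.execute("SELECT * FROM orders WHERE total >= 20")
print([r["id"] for r in rows])
```

The sketch runs the workers one after another for simplicity; in a distributed engine the splits would be processed in parallel.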
14. 14
Query Execution
[Diagram: same components as on the earlier Query Execution slide. Data is shown as a sequence of pages, each made of blocks: page 1 (blockA, blockB), page (blockA, blockB), ...]
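The page and block labels on this slide suggest that data moves through the engine as pages, each holding one block per column. A small Python sketch of that layout follows. It assumes a block is just a list of values for one column and that all blocks in a page have the same length; Page and to_pages are illustrative names, not Presto classes.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Page:
    """A batch of rows stored column-wise: column name -> block of values."""
    blocks: Dict[str, List]

    def __post_init__(self):
        # Every block must describe the same rows, so lengths must agree.
        lengths = {len(block) for block in self.blocks.values()}
        if len(lengths) > 1:
            raise ValueError("all blocks in a page must have the same length")

    @property
    def row_count(self):
        return len(next(iter(self.blocks.values()), []))

    def column_sum(self, name):
        # Operates on a single block without touching the other columns.
        return sum(self.blocks[name])


def to_pages(rows, columns, page_size):
    """Split row tuples into pages of at most page_size rows each."""
    for start in range(0, len(rows), page_size):
        chunk = rows[start:start + page_size]
        yield Page({name: [row[i] for row in chunk]
                    for i, name in enumerate(columns)})


rows = [(1, 10.0), (2, 20.0), (3, 30.0)]
pages = list(to_pages(rows, ["blockA", "blockB"], page_size=2))
total = sum(page.column_sum("blockB") for page in pages)
print(len(pages), total)
```

Keeping each column in its own block means an operator that needs only one column, such as the sum above, never has to read the others.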
17. 17
Presto Extensibility – plugins
➔ Connectors
➔ Data types
➔ Extra functions
➔ (new) Security providers
18. 18
Presto Extensibility – connector interfaces
Coordinator: Parser/analyzer, Planner, Scheduler
Worker
➔ Metadata API: Hive, Cassandra, Kafka, MySQL, …
➔ Data location API: Hive, Cassandra, Kafka, MySQL, …
➔ Data stream API: Hive, Cassandra, Kafka, MySQL, …
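The three connector interfaces on this slide can be illustrated with a toy connector. The Python sketch below is not Presto's connector SPI (which is written in Java). It assumes the coordinator side asks for metadata and data locations while the worker side reads the data stream; the class and method names (MetadataApi, DataLocationApi, DataStreamApi, InMemoryConnector, scan) are hypothetical.

```python
from abc import ABC, abstractmethod


class MetadataApi(ABC):
    @abstractmethod
    def columns(self, table):
        """Return the column names of a table."""


class DataLocationApi(ABC):
    @abstractmethod
    def splits(self, table):
        """Return the units of work (splits) that make up a table."""


class DataStreamApi(ABC):
    @abstractmethod
    def read(self, split):
        """Yield the rows belonging to one split."""


class InMemoryConnector(MetadataApi, DataLocationApi, DataStreamApi):
    """Toy connector backed by a dict: table name -> (columns, rows)."""

    def __init__(self, tables, rows_per_split=2):
        self._tables = tables
        self._rows_per_split = rows_per_split

    def columns(self, table):
        return list(self._tables[table][0])

    def splits(self, table):
        rows = self._tables[table][1]
        step = self._rows_per_split
        # A split is described as (table, start row, end row).
        return [(table, start, min(start + step, len(rows)))
                for start in range(0, len(rows), step)]

    def read(self, split):
        table, start, end = split
        yield from self._tables[table][1][start:end]


def scan(connector, table):
    """Engine-side code that only talks to the three interfaces."""
    columns = connector.columns(table)        # metadata lookup
    for split in connector.splits(table):     # where the data lives
        for row in connector.read(split):     # the data itself
            yield dict(zip(columns, row))


connector = InMemoryConnector(
    {"users": (["id", "name"], [(1, "ann"), (2, "bob"), (3, "cy")])}
)
result = list(scan(connector, "users"))
print(result)
```

Because scan depends only on the abstract interfaces, a second connector for a different store could be swapped in without changing the engine-side code, which is the point of making these APIs pluggable.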
20. 20
➔ Data stays in memory during execution and is pipelined across nodes MPP-style
➔ Vectorized columnar processing
➔ Presto is written in highly tuned Java
◆ Efficient in-memory data structures
◆ Very careful coding of inner loops
◆ Bytecode generation
➔ Optimized ORC reader
➔ Predicate push-down
➔ Query optimizer
Presto = Performance
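Two of the points above, predicate push-down and columnar processing, can be shown together in a small Python sketch. It is an illustration under stated assumptions, not Presto's ORC reader: data is held in "stripes" that carry per-column min/max statistics, the filter is a simple greater-than comparison, and the names Stripe and scan_greater_than are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Stripe:
    """A chunk of columnar data with min/max statistics per column."""
    columns: Dict[str, List[int]]
    stats: Dict[str, Tuple[int, int]] = field(default_factory=dict)

    def __post_init__(self):
        # Computed once at construction; a file format would store these
        # statistics alongside the data instead of recomputing them.
        self.stats = {name: (min(values), max(values))
                      for name, values in self.columns.items()}


def scan_greater_than(stripes, filter_col, threshold, output_col):
    """Return (output values where filter_col > threshold, stripes read)."""
    out = []
    stripes_read = 0
    for stripe in stripes:
        _, highest = stripe.stats[filter_col]
        if highest <= threshold:
            # Push-down: the statistics prove no row matches, skip the data.
            continue
        stripes_read += 1
        # Column-at-a-time: evaluate the filter on one column first,
        # then fetch only the selected positions from the output column.
        selected = [i for i, value in enumerate(stripe.columns[filter_col])
                    if value > threshold]
        output_values = stripe.columns[output_col]
        out.extend(output_values[i] for i in selected)
    return out, stripes_read


stripes = [
    Stripe({"price": [1, 5, 9], "qty": [10, 11, 12]}),
    Stripe({"price": [20, 35, 50], "qty": [13, 14, 15]}),
    Stripe({"price": [8, 40, 3], "qty": [16, 17, 18]}),
]
values, stripes_read = scan_greater_than(stripes, "price", 30, "qty")
print(values, stripes_read)
```

Here the first stripe is never read because its maximum price is below the threshold, so the filter does useful work before any row data is touched.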
21. 21
➔ Facebook
◆ Multiple production clusters (100s of nodes total)
● Including 300PB Hadoop data warehouse
◆ 1000s of internal daily active users
◆ Millions of queries each month
◆ Multiple PBs scanned every day
◆ Trillions of rows a day
➔ Netflix
◆ Over 200-node production cluster on EC2
◆ Over 15 PB in S3 (Parquet format)
◆ Over 300 users and 2.5K queries daily
Presto in Production
22. 22
➔ 100% open source contributions to Presto to increase adoption in the enterprise
➔ A multi-year roadmap commitment to phased enhancements of the open source code
➔ The first ever commercial support offering for Presto
What is Teradata Doing?
Teradata Certified Presto
www.teradata.com/presto
23. 23
➔ Hadoop Distro Agnostic
➔ Modern Code Base
◆ Presto is well-designed open source software with proper database architecture
➔ Strong Like-Minded Community
➔ Push down processing across multiple data platforms
➔ Leverage Teradata expertise to make SQL for Hadoop viable
Why is Teradata Contributing to Presto?
24. 24
Teradata Contributions to Presto
➔ Phase 1, Implement (June 8, 2015)
➔ Phase 2, Integrate (Q4 2015)
➔ Phase 3, Proliferate (2016)
➔ Roadmap items: Installer; Documentation; Monitoring & Support Tools; ODBC / JDBC Drivers; BI Certification; Security; Connectors; Commercial Support; Expanding ANSI SQL Coverage; Management Tools Integration; YARN Integration
25. 25
➔ Ease of install and management via Presto-Admin tool
◆ www.github.com/prestodb/presto-admin
◆ Packaging Presto as an RPM
➔ Testing Framework for Presto
◆ www.github.com/prestodb/tempto
◆ Added large number of tests
➔ JDBC driver for Java 6
➔ Various SQL improvements
Teradata’s Contributions
26. 26
➔ Continued SQL Improvements
➔ Security – Authentication & Authorization
➔ More Connectors – e.g. HBase
➔ ODBC & JDBC Drivers that actually work
➔ BI tool certifications – e.g. Tableau
➔ YARN Integration
➔ Ambari Integration
➔ Open source our Docker-based development environment (work in progress)
➔ Open our Continuous Integration platform to the community
Teradata’s Contribution Product Roadmap
28. 28
“Presto is an integral part of the Airbnb data infrastructure stack with hundreds of employees running queries each day with the technology. We are excited to see Teradata joining the Presto open source community and are encouraged by the direction of their contributions.”
- James Mayfield, product lead, Airbnb
“We are excited to see Teradata's commitment to Presto and adding capabilities in the open source domain. This will create interesting opportunities within our technical and business teams to open up more access options to our critical data. We think this is a positive for Teradata and for the community as a whole.”
- Steve Deasy, vice president of Engineering, Groupon
Early Feedback is Extremely Positive