Presented at ODSC West 2019:
https://odsc.com/training/portfolio/big-fast-queries-with-presto-on-openshift-2/
Abstract:
Next-generation data platforms are embracing the proliferation of technologies that help organizations discover, catalog, process, and derive insight from their data. OpenShift and OpenShift Container Storage are at the forefront of this transition and provide a foundation for building a self-service environment for developers, data engineers, and data scientists. In this demo we'll share how Starburst Presto on OpenShift can power your interactive and ad hoc data discovery. SQL-on-anything means fast, secure access to data in OpenShift Container Storage and federated access to data anywhere. With Starburst on OpenShift you have access to the world’s fastest open source SQL query engine, enterprise-ready, across public and private clouds.
Presto, an open source distributed SQL engine, is widely recognized for its low-latency queries, high concurrency, and native ability to query multiple data sources. It has been proven at scale in a variety of use cases at Airbnb, Comcast, GrubHub, Facebook, FINRA, LinkedIn, Lyft, Netflix, Twitter, and Uber. In the last few years Presto has experienced unprecedented growth in popularity in both on-premises and cloud deployments over object stores, HDFS, NoSQL, and RDBMS data stores.
1. BIG FAST SQL
With Presto on OpenShift
Kamil Bajda-Pawlikowski
Co-founder / CTO
Michael St-Jean
Principal Marketing Manager
ODSCWest-2019@SanFrancisco
5. Why Presto?
Community-driven open source project
High performance ANSI SQL engine
• New Cost-Based Query Optimizer
• Proven scalability
• High concurrency
Separation of compute and storage
• Scale storage and compute independently
• No ETL or data integration necessary to get to insights
• SQL-on-anything
No vendor lock-in
• No Hadoop distro vendor lock-in
• No storage engine vendor lock-in
• No cloud vendor lock-in
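The "SQL-on-anything" point on this slide rests on Presto addressing every table as `catalog.schema.table`, so a single statement can join data held in different sources. A minimal Python sketch that only assembles such a query string; the catalog, schema, table, and column names (`hive`, `postgresql`, `web.page_views`, `crm.customers`, `customer_id`) are hypothetical and not from the talk, and submitting the query to a cluster is deliberately left out:

```python
def qualified(catalog: str, schema: str, table: str) -> str:
    """Return a fully qualified Presto table name: catalog.schema.table."""
    return f"{catalog}.{schema}.{table}"


def federated_join(left: str, right: str, key: str) -> str:
    """Build a SQL join over two tables that may live in different catalogs."""
    return (
        f"SELECT l.{key}, count(*) AS events\n"
        f"FROM {left} AS l\n"
        f"JOIN {right} AS r ON l.{key} = r.{key}\n"
        f"GROUP BY l.{key}"
    )


# Hypothetical names: an object-store table joined with an RDBMS table.
views = qualified("hive", "web", "page_views")
customers = qualified("postgresql", "crm", "customers")
query = federated_join(views, customers, "customer_id")
print(query)
```

Because both sides are ordinary table references, no ETL step is needed before the join; the engine reads each source through its own connector.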
9. Administrative challenges
● Configuring and managing clusters
● Autotuning properties based on the hardware provisioned
● High Availability for Presto Coordinator
● Scaling cluster elastically based on query load
● Gracefully decommissioning Presto Workers to avoid killing queries
● Monitoring of hardware and software layers
https://www.starburstdata.com/technical-blog/presto-on-kubernetes/
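The autotuning item above can be illustrated with a small Python sketch that derives memory settings from the memory provisioned for a worker. `query.max-memory-per-node` and the JVM `-Xmx` flag are real Presto and JVM settings, but the function name `autotune`, the 80% heap share, and the 50% per-query share are illustrative assumptions, not values from the talk:

```python
def autotune(container_memory_gb: int,
             heap_percent: int = 80,
             query_percent: int = 50) -> dict:
    """Derive Presto memory settings from provisioned memory.

    heap_percent and query_percent are assumed ratios for illustration only.
    Integer arithmetic is used so results round down predictably.
    """
    # Leave headroom for the OS and off-heap memory by not using all RAM as heap.
    heap_gb = container_memory_gb * heap_percent // 100
    # Cap the memory a single query may use on one node at a share of the heap.
    per_node_gb = heap_gb * query_percent // 100
    return {
        "jvm.config": f"-Xmx{heap_gb}G",
        "config.properties": f"query.max-memory-per-node={per_node_gb}GB",
    }


settings = autotune(64)
for filename, line in settings.items():
    print(f"{filename}: {line}")
```

Recomputing these values whenever the pod's resource limits change is one way to keep configuration in step with the hardware without manual edits.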
13. ● Massively scalable
○ 10s to 100s of PB
○ Billions of objects
○ 100s of gigabits
● Erasure coding drives storage efficiency
● High level of fidelity with S3 API
● Open source software - LGPL 2.1 / 3
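To make the storage-efficiency point concrete: with k data chunks and m coding chunks, an erasure-coded pool keeps k/(k+m) of raw capacity as usable data, while n-way replication keeps 1/n. A minimal Python sketch of that arithmetic; the 4+2 profile and 3-way replication are example choices, not figures from the talk:

```python
from fractions import Fraction


def erasure_coded_efficiency(k: int, m: int) -> Fraction:
    """Usable fraction of raw capacity with k data and m coding chunks."""
    return Fraction(k, k + m)


def replicated_efficiency(copies: int) -> Fraction:
    """Usable fraction of raw capacity when every object is stored `copies` times."""
    return Fraction(1, copies)


# Example profiles (assumed): both tolerate the loss of two chunks or copies.
ec = erasure_coded_efficiency(4, 2)
rep = replicated_efficiency(3)
print(f"erasure coding 4+2: {ec} of raw capacity is usable")
print(f"3-way replication:  {rep} of raw capacity is usable")
```

Under these assumed profiles the erasure-coded pool holds twice as much data on the same raw capacity, at the cost of extra computation on writes and recovery.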
14. ● Operators: OCS, Rook-Ceph, Rook-NooBaa
● Ceph for block (RWO), file (RWO/RWX), object (S3)
● NooBaa for data federation in the hybrid cloud