Confidential Use Only – Do Not Share
David Phillips
Software Engineer
Facebook
Presto: Fast SQL on Everything
What is Presto?
• Open source distributed SQL query engine
• ANSI SQL compliant
• Originally developed by Facebook
• Used in production at many well known companies
Commercial Offerings
Notable Characteristics
• Adaptive multi-tenant system
• Run hundreds of concurrent queries on thousands of nodes
• Extensible, federated design
• Plugins provide connectors, functions, types, security
• Flexible design supports many different use cases
• High performance
• Many optimizations, code generation, long-lived JVM
Use Cases at Facebook
Interactive Analytics
• Facebook has a massive multi-tenant data warehouse
• Employees need to quickly analyze small data (~50GB-3TB)
• Visualizations, dashboards, notebooks, BI tools
• Clusters run 50-100 concurrent queries w/ diverse shapes
• Queries usually execute in seconds or minutes
• Users are latency sensitive
• Fast improves productivity, slow blocks their work
Batch ETL
• Populate and process data in the warehouse
• Jobs are scheduled using a workflow management system
• Similar to Azkaban or Airflow
• Manages dependencies between jobs
• Queries are typically written by data engineers
• More expensive in CPU and data volume than Interactive
• Throughput and efficiency more important than latency
A/B Testing
• Evaluate product changes via statistical hypothesis testing
• Results need to be available in hours (not days)
• Data must be complete and accurate
• Arbitrary slice and dice at interactive latency (~5-30s)
• Cannot pre-aggregate data, must compute results on the fly
• Producing results requires joining multiple large data sets
• Web interface generates restricted query shapes
App Analytics
• External-user facing custom reporting tools
• Facebook Analytics offers analytics to application developers
• Web interface generates small set of query shapes
• Highly selective queries over large aggregate data volumes
• Application developers can only access their own data
• Very strict latency requirements (~100ms-5s)
• Highly available, hundreds of concurrent queries
System Design
Presto Architecture
[Diagram: a client sends a Query to the Coordinator (Queue, Planner/Optimizer, Scheduler, Metadata API, Data Location API); Workers (Processor, Data Source API) read from the External Storage System and return Results.]
Predicate Pushdown
• Engine provides connectors with a two part constraint:
1. Domain of values: ranges and nullability
2. “Black box” predicate for filtering
• Connectors report the domain they can guarantee
• Engine can elide redundant filtering
• Optimizer can make further use of this information
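The constraint handshake above can be sketched roughly as follows. This is a minimal illustration, not Presto's connector API: the `Domain` class, the `scan` function, and the `connector_enforces` flag are hypothetical names, and the "black box" predicate part of the constraint is left out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Domain:
    """Allowed values for one column: a closed range plus nullability."""
    low: Optional[int]          # None means unbounded below
    high: Optional[int]         # None means unbounded above
    null_allowed: bool = False

    def contains(self, value):
        if value is None:
            return self.null_allowed
        if self.low is not None and value < self.low:
            return False
        if self.high is not None and value > self.high:
            return False
        return True


def scan(rows, column, domain, connector_enforces):
    """Hypothetical connector scan: returns rows plus the domain it guarantees."""
    if connector_enforces:
        return [r for r in rows if domain.contains(r[column])], domain
    return list(rows), None     # no guarantee reported


def engine_filter(rows, column, requested, guaranteed):
    """Skip the engine-side filter when the connector already guarantees it."""
    if guaranteed == requested:
        return rows, False      # filter elided
    return [r for r in rows if requested.contains(r[column])], True
```

The second return value of `engine_filter` only reports whether the engine had to filter; a real optimizer would drop the filter node from the plan instead.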
Data Layouts
• Optimizer takes advantage of physical layout of data
• Properties: partitioning, sorting, grouping, indexes
• Tables can have multiple layouts with different properties
• Layouts can have a subset of columns or data
• Optimizer chooses best layout for query
• Tune queries by adding new physical layouts
[Diagram: two plans for the same query.
Original plan without any data layout properties: Stages 0-4, with Scan, Filter, Hash, partitioned-shuffle, LocalShuffle, LeftJoin, AggregatePartial, AggregateFinal, collecting-shuffle, Output.
Optimized plan using data layout properties: Stages 0-1, with Scan, Filter, Hash, LocalShuffle, LeftJoin, Aggregate, collecting-shuffle, Output.]
Pre-computing Hashes
• Computing hashes can be expensive
• Especially for strings or complex types
• Push computation to the lowest level of the plan tree
• Re-use for aggregations, joins, local or remote shuffles
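As a rough sketch of the idea above, the hash can be computed once at the scan and carried alongside each row, so that the shuffle and the aggregation both reuse it. The function names and the choice of `zlib.crc32` are assumptions made for illustration only.

```python
import zlib


def scan_with_hash(rows, key):
    """Compute the key hash once, at the lowest level, and carry it with the row."""
    for row in rows:
        yield row, zlib.crc32(row[key].encode("utf-8"))


def partition(hashed_rows, num_partitions):
    """Shuffle step: reuses the carried hash instead of re-hashing the string."""
    parts = [[] for _ in range(num_partitions)]
    for row, h in hashed_rows:
        parts[h % num_partitions].append((row, h))
    return parts


def aggregate_counts(hashed_rows, key):
    """Aggregation step: buckets by the carried hash, then compares actual keys."""
    buckets = {}
    for row, h in hashed_rows:
        bucket = buckets.setdefault(h, {})
        bucket[row[key]] = bucket.get(row[key], 0) + 1
    return {k: c for bucket in buckets.values() for k, c in bucket.items()}
```

Rows with the same key always land in the same partition, so per-partition counts can be merged without conflicts.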
Intra-node Parallelism
• Use multiple threads on a single node
• More efficient than parallelism across nodes
• Little latency overhead
• Efficiently share state (e.g., hash tables) between threads
• Needed due to skew or table transforms
Physical Execution Plan
[Diagram: Stage 0 runs as Task 0 and Stage 1 as Tasks 0..n. Operators shown: Scan, Filter, Hash, HashBuild, LocalShuffle, LookupJoin, HashAggregate, grouped into Pipelines 0-2.]
Pipeline 1 is parallelized across multiple threads
Stage Scheduling
• Two scheduling policies:
1. All-at-once: minimize latency
2. Phased: minimize resource usage
Split Scheduling
• Splits are enumerated as the query executes, not up front
• For Hive, both partition metadata and discovering files
• Start executing immediately
• Queries often finish early (LIMIT or interactive)
• Reduces metadata memory usage on coordinator
• Splits are assigned to worker with shortest queue
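A minimal sketch of lazy split enumeration with shortest-queue assignment might look like the following. All names are hypothetical, and the queue length here only counts assignments; in a real system queues also shrink as workers finish splits.

```python
import heapq
from itertools import islice


def enumerate_splits(partitions):
    """Lazily yield splits; nothing is listed until the scheduler asks for it."""
    for partition, files in partitions.items():
        for f in files:
            yield (partition, f)


def assign_splits(splits, workers, max_splits=None):
    """Give each split to the worker that currently has the shortest queue."""
    heap = [(0, w) for w in workers]        # (queue length, worker name)
    heapq.heapify(heap)
    queues = {w: [] for w in workers}
    # max_splits models a query that finishes early (for example, LIMIT).
    for split in islice(splits, max_splits):
        length, worker = heapq.heappop(heap)
        queues[worker].append(split)
        heapq.heappush(heap, (length + 1, worker))
    return queues
```

Because `enumerate_splits` is a generator, stopping early leaves the remaining partitions and files unlisted, which is the memory and latency benefit the bullets describe.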
Operating on Compressed Data
• Process dictionaries directly instead of values
• Shared dictionaries can have more entries than a block has rows, so processing the whole dictionary is speculative
• Use heuristics to determine if speculation is working
• Hash table creation takes advantage of dictionaries
• Joins can produce dictionary encoded data
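To illustrate processing a dictionary directly, here is a small sketch. The `DictionaryBlock` class below is a simplified stand-in, not Presto's block implementation, and the size comparison is an assumed placeholder for the heuristics mentioned above.

```python
from dataclasses import dataclass


@dataclass
class DictionaryBlock:
    dictionary: list    # distinct values
    indices: list       # one dictionary position per row

    def values(self):
        return [self.dictionary[i] for i in self.indices]


def encode(values):
    """Build a dictionary block from plain row values."""
    dictionary, positions, indices = [], {}, []
    for v in values:
        if v not in positions:
            positions[v] = len(dictionary)
            dictionary.append(v)
        indices.append(positions[v])
    return DictionaryBlock(dictionary, indices)


def transform(block, fn):
    """Apply fn once per dictionary entry when that is cheaper than once per row."""
    if len(block.dictionary) <= len(block.indices):
        # Process the dictionary directly; the indices are reused unchanged.
        return DictionaryBlock([fn(v) for v in block.dictionary], block.indices)
    # Dictionary has more entries than the block has rows: evaluate per row.
    return encode([fn(v) for v in block.values()])
```

With a four-entry dictionary and eight rows, `fn` runs four times instead of eight, and the output stays dictionary encoded.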
Page Layout in Memory
Page 0 (8 rows)
  partkey (LongBlock): 52470, 50600, 18866, 72387, 7429, 44077, 148102, 101228
  returnflag (RLEBlock): "F" x 8
  shipinstruct (DictionaryBlock): indices 1, 0, 1, 2, 0, 2, 2, 1
Page 1 (7 rows)
  partkey (LongBlock): 164648, 35173, 139350, 40227, 87261, 184817, 153099
  returnflag (RLEBlock): "O" x 7
  shipinstruct (DictionaryBlock): indices 2, 2, 2, 0, 1, 3, 2
Dictionary (referenced by the shipinstruct blocks):
  0: "IN PERSON"
  1: "COD"
  2: "RETURN"
  3: "NONE"
Writer Scaling
• Write performance dominated by concurrency
• Too few writers causes the query to be slow
• Too many writers creates small files
• Expensive to read later (metadata, IO, latency)
• Inefficient for storage system
• Add writers as needed when producer buffers are full, as
long as data written exceeds a configured threshold
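The scale-up rule in the last bullet could be expressed along these lines. This is a sketch under assumptions: the parameter names are invented, and treating the threshold as a per-writer minimum is one possible reading of "a configured threshold", not something the slide states.

```python
def next_writer_count(writers, buffers_full, bytes_written,
                      min_bytes_per_writer, max_writers):
    """Return the writer count to use after one scheduling check."""
    if writers >= max_writers:
        return writers
    # Scale up only when producers are blocked AND enough data has been
    # written to justify another output file (assumed per-writer threshold).
    if buffers_full and bytes_written >= writers * min_bytes_per_writer:
        return writers + 1
    return writers
```

Requiring both conditions keeps small queries on a single writer, which avoids the small-files problem, while letting large writes gain concurrency over time.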
Code Generation
• SQL → JVM bytecode → machine code
• Filter, project, sort comparators, aggregations
• Auto-vectorization, branch prediction, register use
• Eliminate virtual calls and allow inlining
• Profile each task independently based on data processed
• Avoid profile pollution across tasks and queries
• Profile can change during execution as data changes
CPU Time Improvements for Bytecode Generation
[Bar chart: average CPU time in seconds (axis 0 to 7000) for Generated vs. Naïve code, across Baseline, 1 Transform, 2 Transforms, and 3 Transforms.]
Fault Tolerance
• Node crash causes query failure
• In practice, failures are rare, even on large clusters
• Checkpointing or other recovery mechanisms have a cost
• Re-run failures rather than making everything expensive
• Limit runtime to a few hours to reduce waste and latency
• Clients retry on failure
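Client-side retry, as described in the last bullet, can be sketched as below. `run_query` and `QueryFailed` are hypothetical placeholders for whatever client library and error type are actually in use; the attempt limit and backoff are illustrative defaults.

```python
import time


class QueryFailed(Exception):
    """Raised by the hypothetical run_query callable when a query fails."""


def run_with_retry(run_query, sql, max_attempts=3, backoff_seconds=0.0):
    """Re-run the whole query on failure; there is no partial recovery."""
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return run_query(sql)
        except QueryFailed as error:
            last_error = error
            if attempt < max_attempts:
                time.sleep(backoff_seconds * attempt)   # simple linear backoff
    raise last_error
```

Retrying from scratch is only reasonable because query runtime is bounded, which is why the runtime limit and the retry policy appear together above.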
Presto: Fast SQL on Everything

  • 1. Presto: Fast SQL on Everything
      David Phillips, Software Engineer, Facebook
      Confidential Use Only – Do Not Share
  • 2. What is Presto?
      • Open source distributed SQL query engine
      • ANSI SQL compliant
      • Originally developed by Facebook
      • Used in production at many well-known companies
  • 3.
  • 5. Notable Characteristics
      • Adaptive multi-tenant system
      • Run hundreds of concurrent queries on thousands of nodes
      • Extensible, federated design
      • Plugins provide connectors, functions, types, security
      • Flexible design supports many different use cases
      • High performance
      • Many optimizations, code generation, long-lived JVM
  • 6. Use Cases at Facebook
  • 7. Interactive Analytics
      • Facebook has a massive multi-tenant data warehouse
      • Employees need to quickly analyze small data (~50GB-3TB)
      • Visualizations, dashboards, notebooks, BI tools
      • Clusters run 50-100 concurrent queries w/ diverse shapes
      • Queries usually execute in seconds or minutes
      • Users are latency sensitive
      • Fast improves productivity, slow blocks their work
  • 8. Batch ETL
      • Populate and process data in the warehouse
      • Jobs are scheduled using a workflow management system
      • Similar to Azkaban or Airflow
      • Manages dependencies between jobs
      • Queries are typically written by data engineers
      • More expensive in CPU and data volume than Interactive
      • Throughput and efficiency more important than latency
  • 9. A/B Testing
      • Evaluate product changes via statistical hypothesis testing
      • Results need to be available in hours (not days)
      • Data must be complete and accurate
      • Arbitrary slice and dice at interactive latency (~5-30s)
      • Cannot pre-aggregate data, must compute results on the fly
      • Producing results requires joining multiple large data sets
      • Web interface generates restricted query shapes
  • 10. App Analytics
      • External-user facing custom reporting tools
      • Facebook Analytics offers analytics to application developers
      • Web interface generates small set of query shapes
      • Highly selective queries over large aggregate data volumes
      • Application developers can only access their own data
      • Very strict latency requirements (~100ms-5s)
      • Highly available, hundreds of concurrent queries
  • 12. Presto Architecture
      [Diagram. Labeled components: Coordinator (Queue, Planner/Optimizer, Scheduler, Metadata API, Data Location API); Workers (each with a Processor and a Data Source API); External Storage System; Query and Results.]
  • 13. Predicate Pushdown
      • Engine provides connectors with a two-part constraint:
          1. Domain of values: ranges and nullability
          2. “Black box” predicate for filtering
      • Connectors report the domain they can guarantee
      • Engine can elide redundant filtering
      • Optimizer can make further use of this information
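The exchange on the Predicate Pushdown slide can be sketched in a few lines of Python. This is an illustration only, not Presto's connector API: `Domain`, `PartitionedConnector`, and `engine_scan` are hypothetical names, the domain is reduced to a single closed integer range, and the connector is assumed to prune whole partitions while ignoring the black-box predicate.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Domain:
    """Allowed values for one column: a closed range plus nullability."""
    low: int
    high: int
    nulls_allowed: bool = False

    def contains(self, value: Optional[int]) -> bool:
        if value is None:
            return self.nulls_allowed
        return self.low <= value <= self.high

    def overlaps(self, other: "Domain") -> bool:
        return self.low <= other.high and other.low <= self.high

    def is_subset_of(self, other: "Domain") -> bool:
        return (other.low <= self.low and self.high <= other.high
                and (other.nulls_allowed or not self.nulls_allowed))


class PartitionedConnector:
    """Hypothetical connector whose data is stored in range partitions."""

    def __init__(self, partitions):
        self.partitions = partitions  # list of (Domain, rows)

    def scan(self, requested, predicate):
        # This connector only prunes partitions; it ignores the black-box
        # predicate, so it makes no guarantee about it.
        selected = [(d, rows) for d, rows in self.partitions if d.overlaps(requested)]
        rows = [row for _, part in selected for row in part]
        if not selected:
            return rows, None
        # The domain the connector can guarantee for every returned row.
        guaranteed = Domain(min(d.low for d, _ in selected),
                            max(d.high for d, _ in selected))
        return rows, guaranteed


def engine_scan(connector, column, requested, predicate):
    rows, guaranteed = connector.scan(requested, predicate)
    # Re-apply the domain filter only if the connector's guarantee is weaker.
    needs_domain_filter = guaranteed is not None and not guaranteed.is_subset_of(requested)
    if needs_domain_filter:
        rows = [r for r in rows if requested.contains(r[column])]
    rows = [r for r in rows if predicate(r)]
    return rows, needs_domain_filter
```

When the requested range lines up with partition boundaries, `engine_scan` skips the per-row domain check; otherwise it falls back to filtering the returned rows itself.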
  • 14. Data Layouts
      • Optimizer takes advantage of physical layout of data
      • Properties: partitioning, sorting, grouping, indexes
      • Tables can have multiple layouts with different properties
      • Layouts can have a subset of columns or data
      • Optimizer chooses best layout for query
      • Tune queries by adding new physical layouts
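As a rough illustration of layout selection, the sketch below picks among several layouts of one table. The `Layout` class, the `choose_layout` function, and the scoring rule (prefer a layout already partitioned on the join keys, then one sorted on them, then the narrowest one) are assumptions made for this example; the slides do not describe the optimizer's actual cost model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Layout:
    """One physical layout of a table (hypothetical model)."""
    name: str
    columns: frozenset
    partitioned_by: tuple = ()
    sorted_by: tuple = ()


def choose_layout(layouts, needed_columns, join_keys):
    # A layout may hold only a subset of columns, so it must cover the query.
    candidates = [l for l in layouts if needed_columns <= l.columns]
    if not candidates:
        raise ValueError("no layout covers the requested columns")

    def score(layout):
        return (
            layout.partitioned_by == join_keys,                  # avoids a shuffle
            layout.sorted_by[:len(join_keys)] == join_keys,      # helps ordered ops
            -len(layout.columns),                                # fewer columns to read
        )

    return max(candidates, key=score)


layouts = [
    Layout("full", frozenset({"user_id", "amount", "country", "ts"})),
    Layout("by_user", frozenset({"user_id", "amount"}), partitioned_by=("user_id",)),
    Layout("by_country", frozenset({"country", "amount"}), partitioned_by=("country",)),
]
best = choose_layout(layouts, frozenset({"user_id", "amount"}), ("user_id",))
```

Here `best` is the `by_user` layout, because it covers the needed columns and is already partitioned on the join key. Adding a new layout to the list changes the choice without touching the query, which is the tuning approach the slide describes.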
  • 15. [Diagram: two plans for the same query.]
      • Original plan without any data layout properties: Stages 0-4, with Scan, Filter, Hash, AggregatePartial, AggregateFinal, LocalShuffle, LeftJoin, and Output operators, connected by three partitioned-shuffles and one collecting-shuffle
      • Optimized plan using data layout properties: Stages 0-1, with Scan, Filter, Hash, Aggregate, LocalShuffle, LeftJoin, and Output operators, connected by one collecting-shuffle
  • 16. Pre-computing Hashes
      • Computing hashes can be expensive
      • Especially for strings or complex types
      • Push computation to the lowest level of the plan tree
      • Re-use for aggregations, joins, local or remote shuffles
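The idea of hashing once at the bottom of the plan and carrying the value upward can be shown with a small sketch. Everything here is hypothetical: `CountingHasher` stands in for an expensive hash function (it uses `zlib.crc32` and counts its calls), and `scan_with_hash`, `shuffle`, and `count_by_key` are simplified stand-ins for a scan, a partitioned shuffle, and a hash aggregation.

```python
import zlib


class CountingHasher:
    """Stand-in for an expensive hash; counts how often it is invoked."""

    def __init__(self):
        self.calls = 0

    def __call__(self, value: str) -> int:
        self.calls += 1
        return zlib.crc32(value.encode("utf-8"))


def scan_with_hash(rows, key, hasher):
    # Lowest level of the plan: compute the hash once and attach it to the row.
    return [(row, hasher(row[key])) for row in rows]


def shuffle(hashed_rows, partitions):
    # Re-uses the carried hash to pick a partition.
    out = [[] for _ in range(partitions)]
    for row, h in hashed_rows:
        out[h % partitions].append((row, h))
    return out


def count_by_key(hashed_rows, key):
    # Re-uses the carried hash as the bucket id; real keys resolve collisions.
    buckets = {}
    for row, h in hashed_rows:
        bucket = buckets.setdefault(h, {})
        bucket[row[key]] = bucket.get(row[key], 0) + 1
    counts = {}
    for bucket in buckets.values():
        counts.update(bucket)
    return counts


hasher = CountingHasher()
rows = [{"city": "Paris"}, {"city": "Oslo"}, {"city": "Paris"}, {"city": "Lima"}]
totals = {}
for part in shuffle(scan_with_hash(rows, "city", hasher), 2):
    totals.update(count_by_key(part, "city"))
```

After the shuffle and the aggregation, `hasher.calls` is still equal to the number of rows, since neither downstream step hashes the strings again.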
  • 17. Intra-node Parallelism
      • Use multiple threads on a single node
      • More efficient than parallelism across nodes
      • Little latency overhead
      • Efficiently share state (e.g., hash tables) between threads
      • Needed due to skew or table transforms
  • 18. Physical Execution Plan
      [Diagram. Stage 0 and Stage 1 are split into tasks (Task 0, Task 1, Task 2, Task 3..n); a task is split into Pipeline 0, Pipeline 1, and Pipeline 2, built from Scan, Filter, Hash, LocalShuffle, HashBuild, LookupJoin, and HashAggregate operators. Caption: Pipeline 1 is parallelized across multiple threads.]
  • 19. Stage Scheduling
      • Two scheduling policies:
          1. All-at-once: minimize latency
          2. Phased: minimize resource usage
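The contrast between the two policies can be sketched as two ways of turning a stage graph into scheduling waves. This is a simplified model, not Presto's scheduler: the stage names and the `must_start_first` graph are made up for the example, and the phased policy is approximated by a topological ordering computed with the standard library's `graphlib`.

```python
from graphlib import TopologicalSorter


def all_at_once(must_start_first):
    # Every stage is scheduled in a single wave, so data can flow as soon
    # as it is produced (lower latency, more resources held at once).
    return [sorted(must_start_first)]


def phased(must_start_first):
    # Stages are scheduled in waves; a stage waits until the stages it
    # depends on have been handled (fewer resources held while waiting).
    sorter = TopologicalSorter(must_start_first)
    sorter.prepare()
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves


# Hypothetical three-stage join query: each stage maps to the stages that
# should be scheduled before it under the phased policy.
must_start_first = {
    "build": set(),
    "probe_join": {"build"},
    "output": {"probe_join"},
}
```

With this graph, `all_at_once` returns a single wave containing all three stages, while `phased` returns three waves in dependency order.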
  • 20. Split Scheduling
      • Splits are enumerated as the query executes, not up front
      • For Hive, both partition metadata and discovering files
      • Start executing immediately
      • Queries often finish early (LIMIT or interactive)
      • Reduces metadata memory usage on coordinator
      • Splits are assigned to worker with shortest queue
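Lazy split enumeration maps naturally onto Python generators. The sketch below is an illustration under assumptions: `enumerate_splits`, `assign_splits`, the `list_files` callback, and the partition and worker names are all hypothetical, and a query finishing early is modeled by simply stopping consumption after three splits.

```python
from itertools import islice


def enumerate_splits(partitions, list_files):
    # Files of a partition are listed only when the scheduler reaches it.
    for partition in partitions:
        for path in list_files(partition):
            yield partition, path


def assign_splits(splits, queues):
    # Each split goes to the worker whose queue is currently shortest.
    for split in splits:
        worker = min(queues, key=lambda name: len(queues[name]))
        queues[worker].append(split)
        yield worker, split


listed = []


def list_files(partition):
    listed.append(partition)  # records which partitions were actually listed
    return [f"{partition}/file{i}" for i in range(2)]


queues = {"worker-1": [], "worker-2": []}
splits = enumerate_splits(["ds=1", "ds=2", "ds=3", "ds=4"], list_files)
# Simulate a query that finishes after consuming three splits (e.g. LIMIT).
first_three = list(islice(assign_splits(splits, queues), 3))
```

Only the first two partitions are ever listed in this run; the metadata for `ds=3` and `ds=4` is never fetched because the consumer stopped early.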
  • 21. Operating on Compressed Data
      • Process dictionaries directly instead of values
      • Shared dictionaries can be larger than rows
      • Use heuristics to determine if speculation is working
      • Hash table creation takes advantage of dictionaries
      • Joins can produce dictionary encoded data
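A minimal sketch of processing a dictionary instead of row values follows. The `DictionaryBlock` class here is a toy model that borrows only the name from the slides, `project` is a hypothetical helper, and the size comparison is a deliberately crude stand-in for the heuristics the slide mentions. The sample dictionary and indices are the `shipinstruct` values from the Page Layout slide.

```python
from dataclasses import dataclass


@dataclass
class DictionaryBlock:
    """Toy dictionary-encoded column: row i holds dictionary[indices[i]]."""
    dictionary: list
    indices: list

    def decode(self):
        return [self.dictionary[i] for i in self.indices]


def project(block, fn):
    """Apply fn to a column, touching the dictionary instead of rows when cheaper."""
    if len(block.dictionary) <= len(block.indices):
        # One call per distinct dictionary entry; the indices are re-used,
        # so the output stays dictionary encoded.
        return DictionaryBlock([fn(v) for v in block.dictionary], block.indices)
    # A shared dictionary can be larger than the number of rows; in that
    # case process only the entries the rows actually reference.
    values = [fn(block.dictionary[i]) for i in block.indices]
    return DictionaryBlock(values, list(range(len(values))))


shipinstruct = DictionaryBlock(
    ["IN PERSON", "COD", "RETURN", "NONE"],
    [1, 0, 1, 2, 0, 2, 2, 1],
)
lowered = project(shipinstruct, str.lower)
```

For the eight-row block above, `str.lower` runs four times (once per dictionary entry) rather than eight.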
  • 22. Page Layout in Memory
      • Page 0
          partkey (LongBlock): 52470, 50600, 18866, 72387, 7429, 44077, 148102, 101228
          returnflag (RLEBlock): "F" x 8
          shipinstruct (DictionaryBlock): indices 1, 0, 1, 2, 0, 2, 2, 1
      • Page 1
          partkey (LongBlock): 164648, 35173, 139350, 40227, 87261, 184817, 153099
          returnflag (RLEBlock): "O" x 7
          shipinstruct (DictionaryBlock): indices 2, 2, 2, 0, 1, 3, 2
      • Dictionary for shipinstruct: 0: "IN PERSON", 1: "COD", 2: "RETURN", 3: "NONE"
  • 23. Writer Scaling
      • Write performance dominated by concurrency
      • Too few writers causes the query to be slow
      • Too many writers creates small files
      • Expensive to read later (metadata, IO, latency)
      • Inefficient for storage system
      • Add writers as needed when producer buffers are full, as long as data written exceeds a configured threshold
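The scaling rule in the last bullet can be written as a small decision function. This is a sketch under assumptions: the function name and parameters are hypothetical, and the threshold is interpreted as a minimum amount of data per existing writer, which the slide does not spell out.

```python
MB = 1024 * 1024


def should_add_writer(buffers_full, bytes_written, writers,
                      min_bytes_per_writer, max_writers):
    """Decide whether to add one more writer task (hypothetical rule)."""
    if not buffers_full:
        # Producers are not blocked, so writing is not the bottleneck.
        return False
    if writers >= max_writers:
        return False
    # Assumed reading of the threshold: every existing writer should already
    # have written enough data that another writer will not create small files.
    return bytes_written >= writers * min_bytes_per_writer
```

For example, with a 32 MB threshold and one writer, a full producer buffer triggers a second writer only after at least 32 MB have been written; a third writer would require 64 MB.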
  • 24. Code Generation
      • SQL → JVM bytecode → machine code
      • Filter, project, sort comparators, aggregations
      • Auto-vectorization, branch prediction, register use
      • Eliminate virtual calls and allow inlining
      • Profile each task independently based on data processed
      • Avoid profile pollution across tasks and queries
      • Profile can change during execution as data changes
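Presto generates JVM bytecode; the following Python sketch is only an analogy for the underlying idea of compiling an expression once instead of walking its tree for every row. The tuple-based expression format and the `interpret`, `to_source`, and `generate` functions are invented for this example.

```python
import operator

OPS = {"+": operator.add, "*": operator.mul, ">": operator.gt}


def interpret(expr, row):
    # Naive evaluation: walks the expression tree again for every row.
    kind = expr[0]
    if kind == "col":
        return row[expr[1]]
    if kind == "lit":
        return expr[1]
    op, left, right = expr
    return OPS[op](interpret(left, row), interpret(right, row))


def to_source(expr):
    # Translate the tree into a Python expression string.
    kind = expr[0]
    if kind == "col":
        return f"row[{expr[1]!r}]"
    if kind == "lit":
        return repr(expr[1])
    op, left, right = expr
    return f"({to_source(left)} {op} {to_source(right)})"


def generate(expr):
    # Compile once to Python bytecode; the result is a plain function with
    # no per-row tree walk or operator lookup. The source is built only from
    # our own expression tree, never from untrusted input.
    code = compile(f"lambda row: {to_source(expr)}", "<generated>", "eval")
    return eval(code)


# Filter equivalent to: price * qty > 100
expr = (">", ("*", ("col", "price"), ("col", "qty")), ("lit", 100))
filter_fn = generate(expr)
```

`filter_fn` can then be applied to each row directly, and it returns the same result as `interpret(expr, row)`.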
  • 25. CPU Time Improvements for Bytecode Generation
      [Bar chart. Y-axis: Avg CPU Time (seconds), 0 to 7000. X-axis: Baseline, 1 Transform, 2 Transforms, 3 Transforms. Series: Generated, Naïve.]
  • 26. Fault Tolerance
      • Node crash causes query failure
      • In practice, failures are rare, even on large clusters
      • Checkpointing or other recovery mechanisms have a cost
      • Re-run failures rather than making everything expensive
      • Limit runtime to a few hours to reduce waste and latency
      • Clients retry on failure
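Client-side retry, the recovery approach named on the Fault Tolerance slide, might look roughly like the sketch below. `QueryFailed`, `run_with_retry`, and the `submit` callable are hypothetical names; a real client would raise and catch whatever error type its driver exposes.

```python
import time


class QueryFailed(Exception):
    """Hypothetical error raised when a query fails, e.g. after a node crash."""


def run_with_retry(submit, max_attempts=3, backoff_seconds=0.0):
    """Re-run the whole query from scratch when it fails."""
    for attempt in range(1, max_attempts + 1):
        try:
            return submit()
        except QueryFailed:
            if attempt == max_attempts:
                raise  # give up and surface the last failure
            time.sleep(backoff_seconds * attempt)
```

Because the engine keeps no checkpoints in this model, each retry repeats all of the work, which is why the slide pairs retries with a limit on query runtime.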