Presto: Distributed SQL Query Engine

Kiran Palaka

Problem to solve
• Data is being produced in huge volumes.
• As data grows to the petabyte scale, querying it has become a major challenge.
• We need to be able to run interactive queries and get results faster.

Introduction
• Presto is an open source distributed SQL query engine.
• It runs queries against data sources of all sizes, ranging from gigabytes to petabytes.
• It supports ANSI SQL, including complex queries, aggregations, joins, and window functions.
• It is implemented in Java.

Architecture Explanation
• The client sends SQL to the Presto coordinator.
• The coordinator parses, analyzes, and plans the query execution.
• The scheduler wires together the execution pipeline, assigns work to the nodes closest to the data, and monitors progress.
• The client pulls data from the output stage, which in turn pulls data from the underlying stages.
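The scheduling step above ("assigns work to nodes closest to data") can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the function name `assign_splits`, the split and node names, and the tie-breaking rule (least-loaded worker) are hypothetical and are not taken from Presto's scheduler.

```python
from typing import Dict, List


def assign_splits(split_hosts: Dict[str, List[str]],
                  workers: List[str]) -> Dict[str, List[str]]:
    """Assign each split to a worker, preferring workers that hold the data.

    split_hosts maps a split id to the hosts that store it (hypothetical model).
    If no worker runs on such a host, the least-loaded worker is used instead.
    """
    load = {w: 0 for w in workers}
    assignment: Dict[str, List[str]] = {w: [] for w in workers}
    for split, hosts in split_hosts.items():
        local = [w for w in workers if w in hosts]
        candidates = local or workers          # fall back to any worker
        chosen = min(candidates, key=lambda w: load[w])
        assignment[chosen].append(split)
        load[chosen] += 1
    return assignment


splits = {
    "s1": ["node-a"],
    "s2": ["node-b"],
    "s3": ["node-a", "node-b"],
    "s4": ["node-z"],                          # no worker on this host
}
assignment = assign_splits(splits, ["node-a", "node-b", "node-c"])
print(assignment)
# {'node-a': ['s1', 's3'], 'node-b': ['s2'], 'node-c': ['s4']}
```

A real scheduler would also monitor progress and react to slow or failed nodes; the sketch only covers the locality preference.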

Hive/MapReduce execution model
• Hive translates queries into multiple stages of MapReduce tasks and executes them one after the other.
• Each task reads its input from disk and writes intermediate output back to disk.

Presto Execution
• The Presto engine does not use MapReduce.
• It employs a custom query and execution engine with operators designed to support SQL semantics.
• Processing is in memory and pipelined across the network between stages, which avoids unnecessary I/O and the associated latency overhead.
• The pipelined execution model runs multiple stages at once and streams data from one stage to the next as it becomes available, which reduces end-to-end latency.
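The difference between staged and pipelined execution can be sketched with Python generators. This is only an in-process analogy under stated assumptions: the stage functions `scan` and `keep_even` and the event log are hypothetical, and real stages run on separate nodes and exchange data over the network.

```python
from typing import Iterable, Iterator, List


def scan(rows: Iterable[int], log: List[str]) -> Iterator[int]:
    """Upstream stage: emits rows one at a time."""
    for row in rows:
        log.append(f"scan {row}")
        yield row


def keep_even(rows: Iterable[int], log: List[str]) -> Iterator[int]:
    """Downstream stage: filters rows as soon as they arrive."""
    for row in rows:
        if row % 2 == 0:
            log.append(f"filter {row}")
            yield row


def run_staged(data: List[int], log: List[str]) -> List[int]:
    # Staged model: the whole intermediate result is materialized
    # before the next stage starts.
    scanned = list(scan(data, log))
    return list(keep_even(scanned, log))


def run_pipelined(data: List[int], log: List[str]) -> List[int]:
    # Pipelined model: the filter pulls each row from the scan on demand.
    return list(keep_even(scan(data, log), log))


staged_log: List[str] = []
pipelined_log: List[str] = []
staged_result = run_staged([1, 2, 3, 4], staged_log)
pipelined_result = run_pipelined([1, 2, 3, 4], pipelined_log)

print(staged_log)
# ['scan 1', 'scan 2', 'scan 3', 'scan 4', 'filter 2', 'filter 4']
print(pipelined_log)
# ['scan 1', 'scan 2', 'filter 2', 'scan 3', 'scan 4', 'filter 4']
```

Both runs return the same rows; only the order of work differs. In the pipelined run the first filtered row is available before the scan has finished, which is the source of the latency benefit described above.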

Note
• Presto dynamically compiles certain portions of the query plan to bytecode, which lets the JVM optimize it and generate native machine code.
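As a rough analogy for the note above, the sketch below contrasts interpreting a filter expression for every row with compiling it once and reusing the compiled form. It uses Python's built-in `compile` and `eval`; the tuple-based expression format and the function names are hypothetical. The analogy is loose: CPython bytecode is normally interpreted, whereas bytecode generated for the JVM can be JIT-compiled to machine code.

```python
import operator

# Supported comparison operators (a deliberately tiny, hypothetical subset).
OPS = {">": operator.gt, "<": operator.lt, "==": operator.eq}


def interpret(expr, row):
    """Interpreted path: decode the expression tuple again for every row."""
    column, op, literal = expr
    return OPS[op](row[column], literal)


def compile_predicate(expr):
    """Compiled path: generate source once and compile it to a code object."""
    column, op, literal = expr
    if op not in OPS:
        raise ValueError(f"unsupported operator: {op}")
    source = f"row[{column!r}] {op} {literal!r}"
    code = compile(source, "<predicate>", "eval")
    # The returned function reuses the same code object for every row.
    return lambda row: eval(code, {}, {"row": row})


rows = [{"price": 5}, {"price": 15}]
expr = ("price", ">", 10)
predicate = compile_predicate(expr)

interpreted = [interpret(expr, r) for r in rows]
compiled = [predicate(r) for r in rows]
print(interpreted, compiled)
# [False, True] [False, True]
```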

Extensibility
• Presto was designed with a simple storage abstraction that makes it easy to provide SQL query capability against disparate data sources.
• Connectors only need to provide interfaces for fetching metadata, getting data locations, and accessing the data itself.
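The three responsibilities of a connector listed above can be sketched as an abstract interface. The class and method names below (`Connector`, `get_table_schema`, `get_splits`, `read_split`) are hypothetical and chosen for illustration; they are not Presto's actual connector API, which is defined in Java.

```python
from abc import ABC, abstractmethod
from typing import Dict, Iterator, List, Tuple

Schema = List[Tuple[str, str]]           # (column name, type name) pairs
Row = Dict[str, object]


class Connector(ABC):
    """Hypothetical storage abstraction with the three duties from the slide."""

    @abstractmethod
    def get_table_schema(self, table: str) -> Schema:
        """Fetch metadata for a table."""

    @abstractmethod
    def get_splits(self, table: str) -> List[str]:
        """Return the locations (splits) where the table's data lives."""

    @abstractmethod
    def read_split(self, table: str, split: str) -> Iterator[Row]:
        """Access the data stored in one split."""


class InMemoryConnector(Connector):
    """Toy connector backed by a dict: table -> (schema, {split: rows})."""

    def __init__(self, tables: Dict[str, Tuple[Schema, Dict[str, List[Row]]]]):
        self._tables = tables

    def get_table_schema(self, table: str) -> Schema:
        return self._tables[table][0]

    def get_splits(self, table: str) -> List[str]:
        return list(self._tables[table][1])

    def read_split(self, table: str, split: str) -> Iterator[Row]:
        yield from self._tables[table][1][split]


def scan_table(connector: Connector, table: str) -> Iterator[Row]:
    """Engine-side code depends only on the interface, not on the storage."""
    for split in connector.get_splits(table):
        yield from connector.read_split(table, split)


connector = InMemoryConnector({
    "users": (
        [("id", "bigint"), ("name", "varchar")],
        {"split-0": [{"id": 1, "name": "a"}],
         "split-1": [{"id": 2, "name": "b"}]},
    )
})
rows = list(scan_table(connector, "users"))
print(rows)
# [{'id': 1, 'name': 'a'}, {'id': 2, 'name': 'b'}]
```

Because `scan_table` only calls the three interface methods, a different data source could be queried by supplying another `Connector` subclass without changing the engine-side code.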

Limitations
• The size of joined tables and the cardinality of unique groups are limited.
• It lacks the ability to write output back to tables. Currently, query results are streamed to the client.

Presto developers claim:
• Presto is 10x better than Hive/MapReduce in terms of CPU efficiency and latency for most queries.
• It supports ANSI SQL, including joins, left/right outer joins, subqueries, and most of the common aggregate and scalar functions, including approximate distinct counts and approximate percentiles.
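To give a feel for what an approximate distinct count trades off, here is a minimal sketch of a K-minimum-values estimator. The technique is chosen here for brevity and is an assumption of this example; it is not claimed to be the algorithm Presto uses, and the function names and the default `k` are hypothetical.

```python
import hashlib
import heapq
from typing import Iterable, List, Set


def _hash01(value: object) -> float:
    """Map a value to a pseudo-uniform number in [0, 1)."""
    digest = hashlib.sha256(str(value).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2 ** 64


def approx_distinct(values: Iterable[object], k: int = 1024) -> int:
    """Estimate the number of distinct values using bounded memory.

    Keeps only the k smallest hashes. If the k-th smallest hash is h_k,
    the number of distinct values is estimated as (k - 1) / h_k.
    """
    heap: List[float] = []      # max-heap (negated) of the k smallest hashes
    members: Set[float] = set()
    for value in values:
        h = _hash01(value)
        if h in members:
            continue
        if len(heap) < k:
            heapq.heappush(heap, -h)
            members.add(h)
        elif h < -heap[0]:
            evicted = -heapq.heappushpop(heap, -h)
            members.discard(evicted)
            members.add(h)
    if len(heap) < k:
        return len(heap)        # fewer than k distinct values: count is exact
    return int((k - 1) / -heap[0])


small = approx_distinct(["a", "b", "a"])
large = approx_distinct(i % 50_000 for i in range(200_000))
print(small)   # 2
print(large)   # close to 50000, typically within a few percent
```

The estimator stores at most `k` hashes regardless of input size, which is why such functions can scale to very large inputs at the cost of a small relative error.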
