Presto@Netflix Presto Meetup 03-19-15


Zhenxiao Luo

Presto Meetup 03-19-15


Presto @ Netflix: Interactive Queries
at Petabyte Scale
Nezih Yigitbasi and Zhenxiao Luo
Big Data Platform

Outline
» Big data platform @ Netflix
» Why we love Presto?
» Our contributions
» What are we working on?
» What else we need?

Our Data Pipeline
» Event data: Cloud apps ⟶ Suro ⟶ Ursula ⟶ S3 (15m)
» Dimension data: Cassandra ⟶ SSTables ⟶ Aegisthus ⟶ S3 (daily)

Big Data Platform Architecture
[Architecture diagram. Labels: Clients, Gateways, Service, Tools, Clusters (VPC, Query, Prod, Test, Bonus, Prod), Data Warehouse]

Our Use Cases
» Batch jobs (Pig, Hive)
  » ETL jobs
  » reporting and other analysis
» Ad-hoc queries
  » interactive data exploration
» Looked at Impala, Redshift, Spark, and Presto

Presto @ Netflix
Deployment
» v 0.86
» 1 coordinator (r3.4xlarge)
» 250 workers (m2.4xlarge)
Numbers
» ~2.5K queries/day against our 10PB Hive DW on S3
» 230+ Presto users out of 300+ platform users
Tooling
» presto-cli, Python, R, BI tools (ODBC/JDBC), etc.
» Atlas/Suro for monitoring/logging

Why we love Presto?
» Open source
» Fast
» Scalable
» Works well on AWS
» Good integration with the Hadoop stack
» ANSI SQL

Our Contributions
24 open PRs, 60+ commits
» S3 file system
  » multipart upload, IAM roles, retries, monitoring, etc.
» Functions for complex types
» Parquet
  » name/index-based access, type coercion, etc.
» Query optimization
» Various other bug fixes
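
The retry behavior listed for the S3 file system (the editor's notes mention exponential backoff) can be sketched roughly as below. This is a minimal illustration under assumptions, not Presto's actual implementation: `retry_with_backoff`, its parameters, and `TransientError` are hypothetical names chosen for the example.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical stand-in for a retryable storage error."""


def retry_with_backoff(operation, max_attempts=5, base_delay=0.1,
                       max_delay=2.0, sleep=time.sleep):
    """Call operation(), retrying on TransientError with exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            # The delay doubles on each attempt, is capped, then jittered
            delay = min(max_delay, base_delay * (2 ** attempt))
            sleep(delay * random.uniform(0.5, 1.0))
```

The `sleep` parameter is injectable only so the sketch can be exercised without waiting; the jitter spreads out retries from many workers that fail at the same moment.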

What are we Working on?
Parquet Optimizations
» Vectorized reader*: read based on column vectors
» Predicate pushdown: use statistics to skip data
» Lazy load: postpone loading the data until needed
» Lazy materialization: postpone decoding the data until needed
* PARQUET-
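
As a rough illustration of "use statistics to skip data", the sketch below filters a column by first checking per-row-group min/max statistics. It is a simplified model, not the Parquet reader itself: `RowGroup` and `scan_greater_than` are hypothetical names, and a real reader would consult the statistics stored in the file metadata.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RowGroup:
    """Hypothetical row group: min/max statistics plus the column values."""
    min_value: int
    max_value: int
    values: List[int]


def scan_greater_than(row_groups, threshold):
    """Return values > threshold, skipping row groups via min/max stats."""
    matches, groups_read = [], 0
    for group in row_groups:
        # If even the max fails the predicate, no value in the group can match
        if group.max_value <= threshold:
            continue
        groups_read += 1
        matches.extend(v for v in group.values if v > threshold)
    return matches, groups_read
```

Only the groups whose statistics overlap the predicate are read and decoded; the skipped groups cost a metadata check rather than a data read.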

Netflix Integration
» BI tools integration
  » ODBC driver, Tableau web connector, etc.
» Better monitoring
  » Ganglia ⟶ Atlas
» Data lineage
  » Presto ⟶ Suro ⟶ Charlotte

What else we need?
» Graceful cluster shrink
» Better resource management
» Dynamic type coercion for all file formats
» Support for more Hive types (e.g., decimal)
» Predictable metastore cache behavior
» Big table joins similar to Hive

Editor's Notes

Data comes from apps/services. Event data (200B events): app logs, user activity (search events, movie detail clicks from the website, etc.), and system operational data. Ursula demultiplexes the events into event types (~150 event types right now); the latency of this Ursula pipeline is 15m. Dimension data: subscriber data. Aegisthus extracts data from Cassandra, which is the online backing store for Netflix, and writes it to S3.
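
Demultiplexing a mixed event stream into per-type groups can be pictured with the toy sketch below. It says nothing about how Ursula is actually built: the `demultiplex` function and the `"type"` field are assumptions made for illustration.

```python
from collections import defaultdict


def demultiplex(events):
    """Group a mixed event stream into per-type lists.

    Assumes each event is a dict with a hypothetical "type" key.
    """
    by_type = defaultdict(list)
    for event in events:
        by_type[event["type"]].append(event)
    return dict(by_type)
```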
Mention that we have a single DW on S3 and spin up multiple clusters. Little perf difference on S3 vs. HDFS, as we are mostly CPU bound. http://netflix.github.io/ Sting: reporting. Charlotte: lineage.
Impala: no S3 support. Spark: loads all data, doesn't stream, plus stability issues at that time; it couldn't even handle an hour's worth of data (~2013). Spark SQL recently graduated from alpha with the Spark 1.3 release (https://spark.apache.org/releases/spark-release-1-3-0.html). Redshift: need to copy data from S3 to Redshift.
r3.4xlarge and m2.4xlarge are both memory-optimized instances, where m2 is a previous-generation instance type. 5PB of our 10PB Hive DW is in Parquet format.
Single warehouse on S3; spin up multiple test/prod Presto clusters and query live data, etc.
S3 FS: exponential backoff, exposed various configs for the AWS SDK, multipart upload, IAM roles, and monitoring of PrestoS3FileSystem and the AWS SDK. Parquet: better tooling/community support, and good integration with existing tools (Hive, Spark, etc.). Several bug fixes and new functions to manipulate complex types, to close the gap between Hive and Presto. DDL: alter/create table. Optimization: (2085) Rewrite Single Distinct Aggregation into GroupBy and (1937) Optimize joins with similar subqueries. Complex types: array: contains, concat, sort; map: map_agg and map constructors, map_keys, map_values, etc.
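
The idea behind rewriting a single distinct aggregation into a GROUP BY can be seen at the query level in the sketch below. It uses SQLite through Python's `sqlite3` module purely as a stand-in engine, and the `plays` table is hypothetical; it shows that the two query shapes return the same answer, not how Presto's planner performs the rewrite.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE plays (user_id INTEGER, title TEXT)")
conn.executemany(
    "INSERT INTO plays VALUES (?, ?)",
    [(1, "a"), (1, "b"), (2, "a"), (3, "c"), (3, "c")],
)

# Original form: a single DISTINCT aggregation
distinct_count = conn.execute(
    "SELECT COUNT(DISTINCT user_id) FROM plays"
).fetchone()[0]

# Rewritten form: de-duplicate with GROUP BY, then a plain aggregation
group_by_count = conn.execute(
    "SELECT COUNT(user_id) FROM (SELECT user_id FROM plays GROUP BY user_id)"
).fetchone()[0]
```

In a distributed engine the grouped form is attractive because de-duplication can be partitioned by the grouping key across workers before the final count.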
We log queries to our internal data pipeline (Suro), and another internal tool (Charlotte) analyzes data lineage.
We are pushing reporting to Presto with our Tableau/MS work, not ETL. → Monitoring and scheduling improvements. Presto's distributed join is still memory-limited, as there is no spilling. Hive decimal type: https://github.com/facebook/presto/issues/2417 (at least be able to read it; still open).
