Sadayuki Furuhashi
Founder & Software Architect
Treasure Data, inc.
Prestogres internals
PostgreSQL protocol gateway for Presto
A little about me...
> Sadayuki Furuhashi
> github/twitter: @frsyuki
> Treasure Data, Inc.
> Founder & Software Architect
> Open-source hacker
> MessagePack - Efficient object serializer
> Fluentd - A unified data collection tool
> ServerEngine - A Ruby framework to build multiprocess servers
> Prestogres - PostgreSQL protocol gateway for Presto
> LS4 - A distributed object storage with cross-region replication
> kumofs - A distributed, strongly consistent key-value data store
Today’s talk
1. What’s Presto?
2. Prestogres design
3. Prestogres implementation
4. Prestogres hacks
5. Prestogres future work
1. What’s Presto?
What’s Presto?
A distributed SQL query engine for interactive data analysis against GBs to PBs of data.
What are the problems to solve?
> We couldn’t visualize data in HDFS directly using dashboards or BI tools
> because Hive is too slow (not interactive)
> or ODBC connectivity is unavailable/unstable
> We needed to store daily-batch results in an interactive DB for quick response (PostgreSQL, Redshift, etc.)
> Interactive DB costs more and is far less scalable
> Some data are not stored in HDFS
> We need to copy the data into HDFS to analyze it
(Diagram: two platforms. Batch analysis platform: HDFS and Hive, running daily/hourly batch. Visualization platform: PostgreSQL, etc., serving interactive queries from commercial BI tools and the dashboard.)
(Diagram: the two-platform setup, with HDFS and Hive as the batch analysis platform and PostgreSQL, etc. as the visualization platform for commercial BI tools and the dashboard, annotated with its drawbacks.)
✓ Less scalable
✓ Extra cost
✓ Extra work to manage 2 platforms
✓ Can’t query against “live” data directly
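(Diagram: before and after. Before: HDFS and Hive run the daily/hourly batch, and the dashboard sends interactive queries to PostgreSQL, etc. After: HDFS and Hive still run the daily/hourly batch, and the dashboard sends interactive queries to Presto, forming one data analysis platform.)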
(Diagram: data analysis platform. The dashboard sends interactive queries to Presto, which sits on top of HDFS and Hive (daily/hourly batch), Cassandra, PostgreSQL, and commercial DBs.)
SQL on any data sets
(Diagram: data analysis platform. The dashboard and commercial BI tools send interactive queries to Presto, which sits on top of HDFS and Hive (daily/hourly batch), Cassandra, PostgreSQL, and commercial DBs.)
SQL on any data sets
Commercial BI tools:
✓ IBM Cognos
✓ Tableau
✓ ...
(Diagram: Prestogres sits in front of Presto, which runs interactive queries over HDFS for the dashboard and commercial BI tools.)
Commercial BI tools:
✓ IBM Cognos
✓ Tableau
✓ ...
Prestogres: Today’s topic!
dashboard on chart.io: https://chartio.com/
What can Presto do?
> Query interactively (in milliseconds to minutes)
> MapReduce and Hive are still necessary for ETL
> Query using commercial BI tools or dashboards
> Reliable ODBC/JDBC connectivity through Prestogres
> Query across multiple data sources such as Hive, HBase, Cassandra, or even internal DBs
> Plugin mechanism
> Integrate batch analysis + visualization into a single data analysis platform
Presto’s deployment
> Facebook
> Multiple geographical regions
> scaled to 1,000 nodes
> actively used by 1,000+ employees
> who run 30,000+ queries every day
> processing 1PB/day
> Netflix, Dropbox, Treasure Data, Airbnb, Qubole
> Presto as a Service
Prestogres design of the ODBC/JDBC gateway
The problems of using Presto with BI tools
> BI tools need ODBC or JDBC connectivity
> Tableau, IBM Cognos, QlikView, Chart.IO, ...
> JasperSoft, Pentaho, MotionBoard, ...
> ODBC/JDBC is VERY COMPLICATED
> A mature implementation takes a LONG time
• psqlODBC: 58,000 lines
• postgresql-jdbc: 62,000 lines
• mysql-connector-odbc: 27,000 lines
• mysql-connector-j: 101,000 lines
A solution
> Creates a PostgreSQL protocol gateway
> Uses PostgreSQL’s stable ODBC / JDBC driver
Other possible designs were…
a) MySQL protocol + libdrizzle:
> Drizzle provides a well-designed library to implement a MySQL protocol server.
> Proof-of-concept worked well:
• trd-gateway - MySQL protocol gateway for Hive
> Difficulties: clients assume the server is MySQL, but
• syntax mismatches: MySQL uses `…` while Presto uses "…"
• function mismatches: DAYOFMONTH(…) vs EXTRACT(day…)
b) PostgreSQL + Foreign Data Wrapper (FDW):
> JOIN and aggregation pushdown are not available
Other possible designs were…
c) PostgreSQL + H2 database + patch:
> H2 is an embedded database engine written in Java
> H2 has a PostgreSQL protocol implementation in Java
> Difficulties:
• System catalog implementation is incomplete (pg_class, pg_namespace, pg_proc, etc.)
d) Reusing PostgreSQL protocol impl.:
> Difficulties:
• complete implementation of system catalogs was too difficult
Prestogres design
pgpool-II + PostgreSQL + PL/Python
> pgpool-II is a PostgreSQL protocol middleware for replication, failover, load-balancing, etc.
> pgpool-II already has some useful code (parsing SQL, rewriting SQL, hacking system catalogs, …)
> Basic idea:
• Rewrite queries at pgpool-II and run Presto queries using PL/Python
select count(1) from access
rewrite!
select * from python_func('select count(1) from access')
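The rewrite step above can be sketched in a few lines of Python. This is only an illustration of the idea, not the pgpool-II patch itself: the helper name `rewrite_for_plpython` is hypothetical, `python_func` is the placeholder name used on the slide, and the quote doubling assumes PostgreSQL's standard string-literal rules.

```python
def rewrite_for_plpython(sql: str, func_name: str = "python_func") -> str:
    """Wrap a client query in a call to a PL/Python function.

    Hypothetical helper: the real rewriting happens inside pgpool-II.
    """
    # Drop surrounding whitespace and a trailing semicolon.
    stripped = sql.strip().rstrip(";").strip()
    # Double single quotes so the query survives as a SQL string literal.
    literal = stripped.replace("'", "''")
    return f"select * from {func_name}('{literal}')"


if __name__ == "__main__":
    print(rewrite_for_plpython("select count(1) from access"))
```

Passing the original query as a string argument means PostgreSQL never has to parse Presto's SQL dialect; it only sees a function call.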
Prestogres implementation
Overview
psql / odbc / jdbc → pgpool-II (Patched!) → PostgreSQL → Presto
(pgpool-II and PostgreSQL together form Prestogres)
1. Authentication
2. Create faked system catalogs for meta-queries
3. Rewriting queries
4. Executing queries using PL/Python
psql → pgpool-II → PostgreSQL → Presto
$ psql -U me mydb
StartupPacket {
database = "mydb",
user = "me",
…
}
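For readers unfamiliar with the wire format, here is a minimal sketch of what a protocol 3.0 StartupPacket looks like in bytes: a 4-byte length (including itself), a 4-byte protocol version, then NUL-terminated key/value pairs and a final NUL. The function names are hypothetical and the code only illustrates the message layout, not how pgpool-II handles it.

```python
import struct

PROTOCOL_VERSION_3_0 = 3 << 16  # major version 3, minor version 0


def build_startup_packet(params: dict) -> bytes:
    """Encode startup parameters as a protocol 3.0 StartupPacket."""
    body = struct.pack("!i", PROTOCOL_VERSION_3_0)
    for key, value in params.items():
        body += key.encode() + b"\x00" + value.encode() + b"\x00"
    body += b"\x00"  # terminator after the last pair
    # The length field counts itself (4 bytes) plus the body.
    return struct.pack("!i", len(body) + 4) + body


def parse_startup_packet(packet: bytes) -> dict:
    """Decode a StartupPacket back into its parameters."""
    length, version = struct.unpack("!ii", packet[:8])
    if length != len(packet) or version != PROTOCOL_VERSION_3_0:
        raise ValueError("not a protocol 3.0 StartupPacket")
    # Strip the final terminator, split on NUL, drop the trailing empty field.
    fields = packet[8:-1].split(b"\x00")[:-1]
    return {k.decode(): v.decode() for k, v in zip(fields[0::2], fields[1::2])}


if __name__ == "__main__":
    packet = build_startup_packet({"user": "me", "database": "mydb"})
    print(parse_startup_packet(packet))
```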
psql → pgpool-II → PostgreSQL → Presto
$ psql -U me mydb
StartupPacket {
database = "mydb",
user = "me",
…
}
prestogres.conf
system_db_dbname = 'postgres'
system_db_user = 'prestogres'
prestogres_hba.conf
host mydb me 0.0.0.0/0 trust
presto_server presto.local:8080,
presto_catalog hive,
pg_database hive
psql → pgpool-II → PostgreSQL → Presto
$ psql -U me mydb
StartupPacket {
database = "mydb",
user = "me",
…
}
prestogres.conf
system_db_dbname = 'postgres'
system_db_user = 'prestogres'
prestogres_hba.conf
host mydb me 0.0.0.0/0 trust
presto_server presto.local:8080,
presto_catalog hive,
pg_database hive
libpq host='localhost', dbname='postgres', user='prestogres' (system_db)
> CREATE DATABASE hive;
> CREATE ROLE me;
> CREATE FUNCTION setup_system_catalog;
> CREATE FUNCTION start_presto_query;
psql → pgpool-II → PostgreSQL → Presto
$ psql -U me mydb
prestogres.conf
system_db_dbname = 'postgres'
system_db_user = 'prestogres'
prestogres_hba.conf
host mydb me 0.0.0.0/0 trust
presto_server presto.local:8080,
presto_catalog hive,
pg_database hive
StartupPacket from psql:
StartupPacket {
database = "mydb",
user = "me",
…
}
StartupPacket forwarded to PostgreSQL:
StartupPacket {
database = "hive",
user = "me",
…
}
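The authentication steps above amount to: match the incoming (database, user, client address) against an entry in prestogres_hba.conf, then replace the database name with that entry's pg_database before connecting to PostgreSQL. The sketch below mimics that flow in Python. The parsing rules are an assumption based only on the entry shown on the slide (five whitespace-separated fields, then comma-separated `key value` options); `HbaEntry`, `parse_hba_line`, and `rewrite_startup` are hypothetical names.

```python
import ipaddress
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class HbaEntry:
    database: str
    user: str
    network: object  # ipaddress.IPv4Network or IPv6Network
    method: str
    options: dict = field(default_factory=dict)


def parse_hba_line(line: str) -> HbaEntry:
    """Parse one entry written in the style shown on the slide (assumed format)."""
    parts = line.split(None, 5)
    if len(parts) < 5 or parts[0] != "host":
        raise ValueError(f"unsupported entry: {line!r}")
    _, database, user, cidr, method = parts[:5]
    options = {}
    if len(parts) == 6:
        for item in parts[5].split(","):
            if item.strip():
                key, value = item.split(None, 1)
                options[key] = value.strip()
    return HbaEntry(database, user, ipaddress.ip_network(cidr), method, options)


def rewrite_startup(entries, startup: dict, client_ip: str) -> Optional[dict]:
    """Return startup parameters with the database replaced by pg_database."""
    addr = ipaddress.ip_address(client_ip)
    for entry in entries:
        if (entry.database == startup["database"]
                and entry.user == startup["user"]
                and addr in entry.network):
            rewritten = dict(startup)
            rewritten["database"] = entry.options.get("pg_database", startup["database"])
            return rewritten
    return None  # no matching entry: the connection would be rejected


if __name__ == "__main__":
    entry = parse_hba_line(
        "host mydb me 0.0.0.0/0 trust "
        "presto_server presto.local:8080, presto_catalog hive, pg_database hive"
    )
    print(rewrite_startup([entry], {"database": "mydb", "user": "me"}, "192.0.2.10"))
```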
psql → pgpool-II → PostgreSQL → Presto
$ psql -U me mydb
"Q" SELECT * FROM pg_class;
"Query against a system catalog!"
Meta-query
psql → pgpool-II → PostgreSQL → Presto
$ psql -U me mydb
"Q" SELECT * FROM pg_class;
"Query against a system catalog!"
Meta-query
SELECT setup_system_catalog('presto.local:8080', 'hive')
PL/Python function defined at prestogres.py
psql → pgpool-II → PostgreSQL → Presto
$ psql -U me mydb
"Q" SELECT * FROM pg_class;
"Query against a system catalog!"
Meta-query
SELECT setup_system_catalog('presto.local:8080', 'hive')
SELECT * FROM information_schema.columns
> CREATE TABLE access_logs;
> CREATE TABLE users;
> CREATE TABLE events;
…
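The catalog setup above reads Presto's information_schema.columns and creates matching empty tables in PostgreSQL, so that meta-queries see the Presto schema. A rough sketch of the DDL generation follows. The type mapping and the fallback to `text` are assumptions made for illustration, and `build_create_tables` is a hypothetical name, not a function from prestogres.py.

```python
# Assumed, partial mapping from Presto type names to PostgreSQL type names.
PRESTO_TO_PG = {
    "bigint": "bigint",
    "varchar": "varchar",
    "double": "double precision",
    "boolean": "boolean",
}


def quote_ident(name: str) -> str:
    """Quote an identifier for PostgreSQL."""
    return '"' + name.replace('"', '""') + '"'


def build_create_tables(rows):
    """Turn (table_name, column_name, data_type) rows into CREATE TABLE statements."""
    tables = {}
    for table, column, presto_type in rows:
        pg_type = PRESTO_TO_PG.get(presto_type, "text")  # fallback is an assumption
        tables.setdefault(table, []).append(f"{quote_ident(column)} {pg_type}")
    return [
        f"CREATE TABLE {quote_ident(table)} ({', '.join(columns)});"
        for table, columns in tables.items()
    ]


if __name__ == "__main__":
    rows = [
        ("access_logs", "path", "varchar"),
        ("access_logs", "time", "bigint"),
        ("users", "id", "bigint"),
    ]
    for statement in build_create_tables(rows):
        print(statement)
```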
psql → pgpool-II → PostgreSQL → Presto
$ psql -U me mydb
"Q" SELECT * FROM pg_class; (psql → pgpool-II)
"Q" SELECT * FROM pg_class; (pgpool-II → PostgreSQL)
"Query against a system catalog!"
Meta-query
psql → pgpool-II → PostgreSQL → Presto
$ psql -U me mydb
"Q" select count(*) from access_logs;
"Query against a regular table!"
Presto Query
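The routing decision in these slides (system catalog means meta-query, regular table means Presto query) can be approximated as below. This is a deliberately naive sketch: pgpool-II works from a parsed SQL tree, whereas this example uses a regular expression that can misfire on constructs such as `extract(day from ts)`. The names `classify`, `referenced_relations`, and `SYSTEM_SCHEMAS` are hypothetical.

```python
import re

# Relation names that follow FROM or JOIN, optionally schema-qualified.
_RELATION = re.compile(
    r'\b(?:from|join)\s+((?:"[^"]+"|\w+)(?:\.(?:"[^"]+"|\w+))*)',
    re.IGNORECASE,
)

SYSTEM_SCHEMAS = {"pg_catalog", "information_schema"}


def referenced_relations(sql: str):
    """Return lower-cased relation names found after FROM/JOIN."""
    return [m.group(1).replace('"', "").lower() for m in _RELATION.finditer(sql)]


def is_system_relation(name: str) -> bool:
    parts = name.split(".")
    if len(parts) > 1:
        return parts[-2] in SYSTEM_SCHEMAS
    return parts[0].startswith("pg_")


def classify(sql: str) -> str:
    """Label a query as a meta-query or a Presto query."""
    if any(is_system_relation(r) for r in referenced_relations(sql)):
        return "meta-query"
    return "presto-query"


if __name__ == "__main__":
    print(classify("SELECT * FROM pg_class;"))
    print(classify("select count(*) from access_logs;"))
```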
psql → pgpool-II → PostgreSQL → Presto
$ psql -U me mydb
"Q" select count(*) from access_logs;
"Query against a regular table!"
Presto Query
SELECT start_presto_query(… 'select count(*) from access_logs')
PL/Python function defined at prestogres.py
psql → pgpool-II → PostgreSQL → Presto
$ psql -U me mydb
"Q" select count(*) from access_logs;
"Query against a regular table!"
Presto Query
SELECT start_presto_query(… 'select count(*) from access_logs')
1. start the query on Presto
2. define a function to fetch the result
> CREATE TYPE result_type (c0_ BIGINT);
> CREATE FUNCTION fetch_results RETURNS SETOF result_type
…
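The second step above, defining a result type and a set-returning function once the query's columns are known, could be generated along these lines. This is a sketch under assumptions: it spells out full PostgreSQL syntax (`CREATE TYPE ... AS (...)`) where the slide abbreviates, the function body is supplied by the caller because the slides do not show it, and `build_result_ddl` and the `plpythonu` default are illustrative choices.

```python
def quote_ident(name: str) -> str:
    """Quote an identifier for PostgreSQL."""
    return '"' + name.replace('"', '""') + '"'


def build_result_ddl(columns, body, type_name="result_type",
                     func_name="fetch_results", language="plpythonu"):
    """Build DDL for a composite result type and a function returning it.

    columns: list of (column_name, pg_type) pairs from the started query.
    body:    PL/Python source for the function (not shown on the slides).
    """
    attrs = ", ".join(f"{quote_ident(name)} {pg_type}" for name, pg_type in columns)
    create_type = f"CREATE TYPE {type_name} AS ({attrs});"
    create_func = (
        f"CREATE FUNCTION {func_name}() RETURNS SETOF {type_name} AS $$\n"
        f"{body}\n"
        f"$$ LANGUAGE {language};"
    )
    return create_type, create_func


if __name__ == "__main__":
    type_ddl, func_ddl = build_result_ddl([("c0_", "BIGINT")], body="return []")
    print(type_ddl)
    print(func_ddl)
```

Because the function returns `SETOF result_type`, the client's original query can be replaced by `SELECT * FROM fetch_results()` and still yield correctly typed columns.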
psql → pgpool-II → PostgreSQL → Presto
$ psql -U me mydb
"Q" select count(*) from access_logs; (psql → pgpool-II)
"Q" SELECT * FROM fetch_results(); (pgpool-II → PostgreSQL)
"Query against a regular table!"
Presto Query
PL/Python function defined by start_presto_query
"Q" RAISE EXCEPTION …
Prestogres hacks
Multi-statement queries
BEGIN; SELECT 1; COMMIT;
Supporting Cursors
DECLARE xyz CURSOR FOR select …; FETCH
Security
security definer
Error message handling
raise exception '%', E'…' using errcode = …;
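Faked current_database()
-- remove the built-in definition from the system catalog
delete from pg_catalog.pg_proc where proname='current_database';
-- replace it with a function that returns a fixed name
create function pg_catalog.current_database()
returns name as $$
begin
return 'faked_name'::name;
end
$$ language plpgsql stable strict;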
Future work
Future work
Rewriting CAST syntax
Extended query
CREATE TEMP TABLE
DROP TABLE
Check: www.treasuredata.com
Cloud service for the entire data pipeline,
including Presto. We’re hiring!

Juan Pablo Sugiura - eCommerce Day Bolivia 2024
 
Communication Accommodation Theory Kaylyn Benton.pptx
Communication Accommodation Theory Kaylyn Benton.pptxCommunication Accommodation Theory Kaylyn Benton.pptx
Communication Accommodation Theory Kaylyn Benton.pptx
 

Prestogres internals

  • 1. Sadayuki Furuhashi Founder & Software Architect Treasure Data, inc. internals PostgreSQL protocol gateway for Presto
  • 2. A little about me... > Sadayuki Furuhashi > github/twitter: @frsyuki > Treasure Data, Inc. > Founder & Software Architect > Open-source hacker > MessagePack - Efficient object serializer > Fluentd - A unified data collection tool > ServerEngine - A Ruby framework to build multiprocess servers > Prestogres - PostgreSQL protocol gateway for Presto > LS4 - A distributed object storage with cross-region replication > kumofs - A distributed, strongly consistent key-value data store
  • 3. Today’s talk 1. What’s Presto? 2. Prestogres design 3. Prestogres implementation 4. Prestogres hacks 5. Prestogres future works
  • 5. What’s Presto? A distributed SQL query engine for interactive data analysis against GBs to PBs of data.
  • 6. What are the problems to solve? > We couldn’t visualize data in HDFS directly using dashboards or BI tools > because Hive is too slow (not interactive) > or ODBC connectivity is unavailable/unstable > We needed to store daily-batch results in an interactive DB for quick response (PostgreSQL, Redshift, etc.) > An interactive DB costs more and is far less scalable > Some data is not stored in HDFS > We need to copy the data into HDFS to analyze it
  • 7. HDFS Hive PostgreSQL, etc. Daily/Hourly Batch Interactive query Commercial
 BI Tools Batch analysis platform Visualization platform Dashboard
  • 8. HDFS Hive PostgreSQL, etc. Daily/Hourly Batch Interactive query ✓ Less scalable ✓ Extra cost Commercial
 BI Tools Dashboard ✓ Extra work to manage
 2 platforms ✓ Can’t query against
 “live” data directly Batch analysis platform Visualization platform
  • 9. HDFS Hive Dashboard Presto PostgreSQL, etc. Daily/Hourly Batch HDFS Hive Dashboard Daily/Hourly Batch Interactive query Interactive query Data analysis platform
  • 10. Presto HDFS Hive Dashboard Daily/Hourly Batch Interactive query Cassandra PostgreSQL Commercial DBs SQL on any data sets Data analysis platform
  • 11. Presto HDFS Hive Dashboard Daily/Hourly Batch Interactive query Cassandra PostgreSQL Commercial DBs SQL on any data sets Commercial
 BI Tools ✓ IBM Cognos
 ✓ Tableau
 ✓ ... Data analysis platform Prestogres
  • 12. Presto HDFS Dashboard Interactive query Commercial
 BI Tools ✓ IBM Cognos
 ✓ Tableau
 ✓ ... Prestogres Today’s topic!
  • 13. Dashboard on Chartio: https://chartio.com/
  • 14. What can Presto do? > Query interactively (in milliseconds to minutes) > MapReduce and Hive are still necessary for ETL > Query using commercial BI tools or dashboards > Reliable ODBC/JDBC connectivity through Prestogres > Query across multiple data sources such as
 Hive, HBase, Cassandra, or even internal DBs > Plugin mechanism > Integrate batch analysis + visualization
 into a single data analysis platform
  • 15. Presto’s deployment > Facebook > Multiple geographical regions > scaled to 1,000 nodes > actively used by 1,000+ employees > who run 30,000+ queries every day > processing 1PB/day > Netflix, Dropbox, Treasure Data, Airbnb, Qubole > Presto as a Service
  • 16. Prestogres design of the ODBC/JDBC gateway
  • 17. The problems of using Presto with BI tools > BI tools need ODBC or JDBC connectivity > Tableau, IBM Cognos, QlikView, Chart.IO, ... > JasperSoft, Pentaho, MotionBoard, ... > ODBC/JDBC is VERY COMPLICATED > A mature implementation takes a LONG time • psqlODBC: 58,000 lines • postgresql-jdbc: 62,000 lines • mysql-connector-odbc: 27,000 lines • mysql-connector-j: 101,000 lines
  • 18. A solution > Creates a PostgreSQL protocol gateway > Uses PostgreSQL’s stable ODBC / JDBC driver
  • 19. Other possible designs were… a) MySQL protocol + libdrizzle: > Drizzle provides a well-designed library to implement a MySQL protocol server. > Proof-of-concept worked well: • trd-gateway - MySQL protocol gateway for Hive > Difficulties: clients assume the server is MySQL, but • syntax mismatches: MySQL uses `…` while Presto uses "…" • function mismatches: DAYOFMONTH(…) vs EXTRACT(day…) b) PostgreSQL + Foreign Data Wrapper (FDW): > JOIN and aggregation pushdown are not available
  • 20. Other possible designs were… c) PostgreSQL + H2 database + patch: > H2 is an embedded database engine written in Java > H2 has a PostgreSQL protocol implementation in Java > Difficulties: • System catalog implementation is incomplete
 (pg_class, pg_namespace, pg_proc, etc.) d) Reusing PostgreSQL protocol impl.: > Difficulties: • complete implementation of system catalogs was too difficult
  • 21. Prestogres design pgpool-II + PostgreSQL + PL/Python > pgpool-II is a PostgreSQL protocol middleware for replication, failover, load-balancing, etc. > pgpool-II already has some useful code
 (parsing SQL, rewriting SQL, hacking system catalogs, …) > Basic idea: • Rewrite queries at pgpool-II and run Presto queries using PL/Python
 select count(1) from access
 rewrite! → select * from python_func('select count(1) from access')
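The rewrite on slide 21 can be sketched in a few lines of Python. This is only an illustration of the idea, not Prestogres code: the real rewrite happens inside the patched pgpool-II, and `rewrite_for_presto` is a hypothetical helper name. `python_func` is the placeholder function name used on the slide.

```python
def rewrite_for_presto(query: str, func_name: str = "python_func") -> str:
    """Wrap a client query in a call to a PL/Python function.

    Hypothetical helper: it only shows the shape of the rewritten statement.
    """
    # Double every single quote so the query survives as a SQL string literal
    # (assumes standard_conforming_strings = on, the PostgreSQL default).
    escaped = query.replace("'", "''")
    return f"select * from {func_name}('{escaped}')"


if __name__ == "__main__":
    print(rewrite_for_presto("select count(1) from access"))
```

The quote doubling matters because the original query becomes a string argument; without it, any query containing a literal such as `'GET'` would break the rewritten statement.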
  • 23. Overview psql odbc jdbc pgpool-II (Patched!) PostgreSQL Presto 1. Authentication 2. Create faked system
 catalogs for meta-queries 3. Rewriting queries 4. Executing queries using PL/Python
  • 24. Overview psql odbc jdbc pgpool-II (Patched!) PostgreSQL Presto 1. Authentication 2. Create faked system
 catalogs for meta-queries 3. Rewriting queries 4. Executing queries using PL/Python Prestogres
  • 25. psql pgpool-II PostgreSQL Presto $ psql -U me mydb StartupPacket { database = "mydb", user = "me", … }
  • 26. psql pgpool-II PostgreSQL Presto $ psql -U me mydb prestogres.conf system_db_dbname = 'postgres' system_db_user = 'prestogres' prestogres_hba.conf host mydb me 0.0.0.0/0 trust
 presto_server presto.local:8080, presto_catalog hive,
 pg_database hive StartupPacket { database = "mydb", user = "me", … }
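To make the `prestogres_hba.conf` entry on slide 26 concrete, here is a minimal Python sketch that splits such a line into the usual pg_hba-style fields plus the Presto-specific options. It assumes the layout is exactly as shown on the slide (five whitespace-separated fields, then comma-separated `key value` options); `parse_hba_line` is a hypothetical helper, not part of Prestogres.

```python
def parse_hba_line(line: str) -> dict:
    """Parse one prestogres_hba.conf-style entry (format assumed from the slide)."""
    fields = line.split(None, 5)  # five fixed fields, the rest is options
    conn_type, database, user, address, method = fields[:5]
    options = {}
    if len(fields) == 6:
        for item in fields[5].split(","):
            item = item.strip()
            if item:
                key, value = item.split(None, 1)
                options[key] = value
    return {
        "type": conn_type,
        "database": database,
        "user": user,
        "address": address,
        "method": method,
        "options": options,
    }


if __name__ == "__main__":
    entry = parse_hba_line(
        "host mydb me 0.0.0.0/0 trust "
        "presto_server presto.local:8080, presto_catalog hive, pg_database hive"
    )
    print(entry["options"])
```

Read this way, the entry says: connections to database `mydb` as user `me` from any address are trusted, and are served by the Presto server `presto.local:8080`, catalog `hive`, through the PostgreSQL database `hive`.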
  • 27. psql pgpool-II PostgreSQL Presto StartupPacket { database = "mydb", user = "me", … } $ psql -U me mydb prestogres.conf system_db_dbname = 'postgres' system_db_user = 'prestogres' > CREATE DATABASE hive; > CREATE ROLE me; > CREATE FUNCTION setup_system_catalog; > CREATE FUNCTION start_presto_query; libpq host='localhost', dbname='postgres',
 user='prestogres' (system_db) prestogres_hba.conf host mydb me 0.0.0.0/0 trust
 presto_server presto.local:8080, presto_catalog hive,
 pg_database hive
  • 28. psql pgpool-II PostgreSQL Presto $ psql -U me mydb prestogres_hba.conf host mydb me 0.0.0.0/0 trust
 presto_server presto.local:8080, presto_catalog hive,
 pg_database hive prestogres.conf system_db_dbname = 'postgres' system_db_user = 'prestogres' StartupPacket { database = "hive", user = "me", … } StartupPacket { database = "mydb", user = "me", … }
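Slide 28 shows the StartupPacket arriving with `database = "mydb"` and leaving with `database = "hive"`. The sketch below encodes and decodes a PostgreSQL protocol 3.0 startup message (a length-prefixed list of NUL-terminated key/value strings) and swaps the database name. The wire format is the standard one; the helper names, and the assumption that the gateway simply replaces the `database` parameter, are illustrative and not taken from the Prestogres source.

```python
import struct

PROTOCOL_V3 = 196608  # protocol version 3.0, i.e. (3 << 16) | 0


def build_startup_packet(params: dict) -> bytes:
    """Encode a protocol 3.0 startup message from key/value parameters."""
    body = struct.pack("!I", PROTOCOL_V3)
    for key, value in params.items():
        body += key.encode() + b"\x00" + value.encode() + b"\x00"
    body += b"\x00"  # terminator after the last pair
    # The length field counts itself (4 bytes) plus the body.
    return struct.pack("!I", len(body) + 4) + body


def parse_startup_packet(data: bytes) -> dict:
    """Decode a protocol 3.0 startup message into a dict of parameters."""
    length, version = struct.unpack("!II", data[:8])
    if length != len(data) or version != PROTOCOL_V3:
        raise ValueError("not a protocol 3.0 startup message")
    # Drop the final terminator, split on NUL, discard the trailing empty item.
    parts = data[8:-1].split(b"\x00")[:-1]
    return {k.decode(): v.decode() for k, v in zip(parts[0::2], parts[1::2])}


def rewrite_database(data: bytes, pg_database: str) -> bytes:
    """Hypothetical gateway step: forward the packet with a new database name."""
    params = parse_startup_packet(data)
    params["database"] = pg_database
    return build_startup_packet(params)


if __name__ == "__main__":
    incoming = build_startup_packet({"user": "me", "database": "mydb"})
    outgoing = rewrite_database(incoming, "hive")
    print(parse_startup_packet(outgoing))
```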
  • 29. system catalog! psql pgpool-II PostgreSQL Presto $ psql -U me mydb "Q" SELECT * FROM pg_class; "Query against a system catalog!" Meta-query
  • 30. system catalog! psql pgpool-II PostgreSQL Presto $ psql -U me mydb SELECT setup_system_catalog('presto.local:8080', 'hive') "Q" SELECT * FROM pg_class; "Query against a system catalog!" Meta-query PL/Python function
 defined at prestogres.py
  • 31. psql pgpool-II PostgreSQL Presto $ psql -U me mydb SELECT setup_system_catalog('presto.local:8080', 'hive') "Q" SELECT * FROM pg_class; > CREATE TABLE access_logs; > CREATE TABLE users; > CREATE TABLE events; … Meta-query SELECT * FROM
 information_schema.columns "Query against a system catalog!"
  • 32. psql pgpool-II PostgreSQL Presto $ psql -U me mydb "Q" SELECT * FROM pg_class; "Q" SELECT * FROM pg_class; Meta-query "Query against a system catalog!"
  • 33. psql pgpool-II PostgreSQL Presto $ psql -U me mydb "Q" select count(*) from access_logs; regular table! Presto Query "Query against a regular table!"
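Slides 29 to 33 distinguish meta-queries (against system catalogs) from Presto queries (against regular tables). The slides do not show how the patched pgpool-II makes that decision, so the following is only a naive regular-expression stand-in to illustrate the routing; `is_meta_query` and the pattern are assumptions, and a real implementation would more plausibly inspect the parsed query tree.

```python
import re

# Naive pattern: any reference to pg_catalog, information_schema, or a pg_* relation.
SYSTEM_CATALOG_PATTERN = re.compile(
    r"\b(pg_catalog\.|information_schema\.|pg_[a-z_]+)", re.IGNORECASE
)


def is_meta_query(query: str) -> bool:
    """Return True if the query appears to touch a system catalog."""
    return SYSTEM_CATALOG_PATTERN.search(query) is not None


def route(query: str) -> str:
    """Label a query the way the slides do: meta-query or Presto query."""
    return "meta-query" if is_meta_query(query) else "presto-query"


if __name__ == "__main__":
    for q in ("SELECT * FROM pg_class;", "select count(*) from access_logs;"):
        print(route(q), "<-", q)
```

A text match like this misfires on string literals or comments that merely mention a catalog name, which is one reason to prefer a parse-tree check in practice.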
  • 34. psql pgpool-II PostgreSQL Presto $ psql -U me mydb "Q" select count(*) from access_logs; SELECT start_presto_query(…
 'select count(*) from access_logs') regular table! Presto Query "Query against a regular table!" PL/Python function
 defined at prestogres.py
  • 35. psql pgpool-II PostgreSQL Presto $ psql -U me mydb "Q" select count(*) from access_logs; SELECT start_presto_query(…
 'select count(*) from access_logs') > CREATE TYPE result_type (c0_ BIGINT); > CREATE FUNCTION fetch_results
 RETURNS SETOF result_type … regular table! Presto Query "Query against a regular table!" 1. start the query on Presto 2. define a function
 to fetch the result
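Slide 35 says that `start_presto_query` defines a result type matching the query's columns before defining `fetch_results`. A minimal sketch of generating that DDL from a column list follows. The type mapping, the fallback to `VARCHAR`, and the helper names are assumptions for illustration and are not taken from prestogres.py; the sketch also emits the full PostgreSQL form `CREATE TYPE … AS (…)`, which the slide abbreviates.

```python
# Hypothetical Presto -> PostgreSQL type mapping (illustrative, incomplete).
TYPE_MAP = {
    "bigint": "BIGINT",
    "double": "DOUBLE PRECISION",
    "boolean": "BOOLEAN",
    "varchar": "VARCHAR",
}


def quote_ident(name: str) -> str:
    """Quote an SQL identifier, doubling any embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def build_result_type_ddl(columns, type_name: str = "result_type") -> str:
    """Build a CREATE TYPE statement from (column_name, presto_type) pairs."""
    defs = ", ".join(
        f"{quote_ident(name)} {TYPE_MAP.get(presto_type, 'VARCHAR')}"
        for name, presto_type in columns
    )
    return f"CREATE TYPE {quote_ident(type_name)} AS ({defs});"


if __name__ == "__main__":
    # One BIGINT column, as in the slide's count(*) example.
    print(build_result_type_ddl([("c0_", "bigint")]))
```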
  • 36. psql pgpool-II PostgreSQL Presto $ psql -U me mydb "Q" select count(*) from access_logs; "Q" SELECT * FROM fetch_results(); Presto Query "Query against a regular table!" PL/Python function
 defined by start_presto_query "Q" RAISE EXCEPTION …
  • 39. Supporting Cursors DECLARE CURSOR xyz FOR select …; FETCH
  • 41. Error message handling raise exception '%', E'…' using errcode = …;
  • 42. Faked current_database() delete from pg_catalog.pg_proc where proname='current_database'; create function pg_catalog.current_database()
 returns name as $$
 begin
 return 'faked_name'::name;
 end
 $$ language plpgsql stable strict;
  • 44. Future works > Rewriting CAST syntax > Extended query > CREATE TEMP TABLE > DROP TABLE
  • 45. Check: www.treasuredata.com Cloud service for the entire data pipeline, including Presto. We’re hiring!