• Analytics Infra.
• Recently working on embulk…
• Using Presto for one and a half years.
Our development situation
• We commonly use SQL.
• Marketing staff don't write SQL.
• I often write complicated SQL, that is
• We love OSS.
• We don't use UPDATE, INSERT, or DELETE through Presto.
Our business situation
• We manage and operate a BtoB website.
• Our data lifecycle is long.
• The business side doesn't write SQL.
• They watch re:dash and Adobe Analytics.
• Sales have increased for 15 straight years.
• Cross-server and cross-database.
• A single Presto query can combine data from multiple sources.
• We use queries that join multiple sources.
• This reduces ETL pain.
Collect data in one place?
• Equivalent to being able to get the data with one query.
• I don't want duplicate data (master data, user data).
• Collect the data in one place, high
with mysql_user as (
  ...
),
redshift_user_log as (
  ...
)
select ...
from mysql_user
inner join redshift_user_log on mysql_user.user_id = redshift_user_log.user_id
group by user_id, user_name
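The join shown above can be tried without a Presto cluster. The following is a minimal sketch, not the author's query: it uses Python's built-in sqlite3 module with two attached in-memory databases standing in for the MySQL and Redshift catalogs. The database names (`mysql_db`, `redshift_db`), table names (`users`, `user_log`), the `path` column, and the sample rows are all assumptions made for illustration.

```python
import sqlite3

# Two attached in-memory databases stand in for two separate systems.
# With Presto these would be two catalogs (e.g. MySQL and Redshift);
# every name below is hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("ATTACH DATABASE ':memory:' AS mysql_db")
conn.execute("ATTACH DATABASE ':memory:' AS redshift_db")

conn.execute("CREATE TABLE mysql_db.users (user_id INTEGER, user_name TEXT)")
conn.execute("CREATE TABLE redshift_db.user_log (user_id INTEGER, path TEXT)")
conn.executemany(
    "INSERT INTO mysql_db.users VALUES (?, ?)",
    [(1, "alice"), (2, "bob")],
)
conn.executemany(
    "INSERT INTO redshift_db.user_log VALUES (?, ?)",
    [(1, "/a"), (1, "/b"), (2, "/a")],
)

# One query reads from both databases: each CTE pulls from a different
# source, and the join happens in the query engine.
FEDERATED_QUERY = """
WITH mysql_user AS (
    SELECT user_id, user_name FROM mysql_db.users
),
redshift_user_log AS (
    SELECT user_id, path FROM redshift_db.user_log
)
SELECT mysql_user.user_id, mysql_user.user_name, count(*) AS views
FROM mysql_user
INNER JOIN redshift_user_log
    ON mysql_user.user_id = redshift_user_log.user_id
GROUP BY mysql_user.user_id, mysql_user.user_name
ORDER BY mysql_user.user_id
"""

rows = conn.execute(FEDERATED_QUERY).fetchall()
for row in rows:
    print(row)
```

The point of the sketch is only the shape of the query: no copy of the user table is loaded into the log store first, which is the ETL step the slides want to avoid.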
• We use window functions on MySQL data (Presto on MySQL).
• The data source is MySQL, but it's the Presto world.
• But MySQL's own functions cannot be used.
Rank function on MySQL
select ...,
  rank() over (
    partition by company_id
    order by count(*) desc
  )
from ...
group by company_id, category_id
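To see what this ranking produces, here is a small runnable sketch using Python's sqlite3 module in place of Presto. It assumes an SQLite build of version 3.25 or later (needed for window functions). The table name `items` and the sample rows are hypothetical. Unlike the slide, the count is computed in a CTE first and then ranked, which is a slightly more conservative form of the same idea.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical table: one row per item, tagged with company and category.
conn.execute("CREATE TABLE items (company_id INTEGER, category_id INTEGER)")
conn.executemany(
    "INSERT INTO items VALUES (?, ?)",
    [
        (1, 10), (1, 10), (1, 10),   # company 1, category 10: 3 rows
        (1, 20),                     # company 1, category 20: 1 row
        (1, 30), (1, 30),            # company 1, category 30: 2 rows
        (2, 10),                     # company 2, category 10: 1 row
        (2, 20), (2, 20),            # company 2, category 20: 2 rows
    ],
)

# Count rows per (company, category), then rank the categories inside
# each company by that count, largest first.
RANK_QUERY = """
WITH category_counts AS (
    SELECT company_id, category_id, count(*) AS cnt
    FROM items
    GROUP BY company_id, category_id
)
SELECT company_id, category_id, cnt,
       rank() OVER (PARTITION BY company_id ORDER BY cnt DESC) AS rnk
FROM category_counts
ORDER BY company_id, rnk, category_id
"""

rows = conn.execute(RANK_QUERY).fetchall()
for row in rows:
    print(row)
```

Each output row is (company_id, category_id, count, rank), with the rank restarting at 1 for every company.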
Other window functions
• PostgreSQL protocol gateway for Presto.
• rewrite queries before sending Presto to
• have password-based authentication and
• Other application connectivity.
– pgAdmin, psql command.
– re:dash connects to Presto over the PostgreSQL protocol.
– But it can also connect to Presto directly.
• To connect to Presto, we need a Presto client.
– I don't want to use the Java client.
• Weak security.
– Authentication is handled by Prestogres.
• Prepared statements.
– Presto doesn't support them either.
– So embulk-input-postgresql doesn't work.
• Can't fetch the schema with SQL.
• Temporary table
• DROP TABLE
• Visualization platform, written in Python.
• Supports many data sources.
• Sharing queries with team members.
• Scheduling queries (per day, per hour).
• Very active contributions.
Presto queries increased rapidly with re:dash
• The number of Presto queries increased more than 10 times.
• That won't change with writing ETL on
• re:dash has a good reputation in
Install with RPM
• Presto has an RPM.
– It is not distributed.
– It needs to be built from source.
• It includes an init script.
• But it doesn't support OpenJDK.
– Pull request in progress.
• We built Presto on EC2.
• We don't use EMR.
• Workers are spot instances, multi instance
– Prevents them all going down at once.
• The Presto cluster (coordinator and workers) is placed in the same AZ.
• Across AZs, the traffic cost is very high (and
– Should not be multi-AZ.
Networking on AWS
(Diagram: Availability Zones)
• Very large repository.
• The coordinator is a SPOF.
• run long range query, occur
Very large repository
• Monolithic application.
– I want separate repositories.
• The first build takes 30 minutes.
• From the second build on, it takes 10 minutes.
• All connectors are in the main repository.
– Elasticsearch will be supported soon.
• Hard to contribute.
Big change for JDBC
• Support for predicates on multiple data types
• We used a patched Presto…
• MySQL people, let's try it.
Impressions of Presto I have heard
• It's an extension of Hadoop technology.
=> I don't know Hadoop. Presto has many
• Parallel processing looks difficult.
=> Presto has no storage. There is not so
• I don't have such big data.
=> I'm not such a big player either.
• Presto is great software.
• It's not so difficult.
• Let's use it more.