Sadayuki Furuhashi
Founder & Software Architect
Treasure Data, inc.
Presto: Interactive SQL Query Engine for Big Data
Hadoop Conference in Japan 2014
A little about me...
> Sadayuki Furuhashi
> github/twitter: @frsyuki
> Treasure Data, Inc.
> Founder & Software Architect
> Open-source hacker
> MessagePack - efficient object serializer
> Fluentd - data collection tool
> ServerEngine - Ruby framework to build multiprocess servers
> LS4 - distributed object storage system
> kumofs - distributed key-value data store
0. Background + Intro
What’s Presto?
A distributed SQL query engine
for interactive data analysis
against GBs to PBs of data.
Presto’s history
> 2012 Fall: Project started at Facebook
> Designed for interactive query
> with speed of commercial data warehouse
> and scalability to the size of Facebook
> 2013 Winter: Open sourced!
> 30+ contributors in 6 months
> including people from outside of Facebook
What are the problems to solve?
> We couldn’t visualize data in HDFS directly
using dashboards or BI tools
> because Hive is too slow (not interactive)
> or ODBC connectivity is unavailable/unstable
> We needed to store daily-batch results to an
interactive DB for quick response
(PostgreSQL, Redshift, etc.)
> An interactive DB costs more and is far less scalable
> Some data are not stored in HDFS
> We need to copy the data into HDFS to analyze
[Figure: before Presto, two separate platforms]
> Batch analysis platform: HDFS + Hive
> Daily/Hourly Batch loads results into PostgreSQL, etc.
> Visualization platform: PostgreSQL, etc. serves interactive queries
from dashboards and commercial BI tools
✓ Less scalable
✓ Extra cost
✓ More work to manage 2 platforms
✓ Can’t query against “live” data directly
[Figure: with Presto, a single data analysis platform]
> Hive on HDFS still runs the Daily/Hourly Batch
> Presto runs interactive queries for dashboards and commercial BI tools
✓ IBM Cognos
✓ Tableau
✓ ...
> SQL on any data sets: HDFS, Cassandra, MySQL, commercial DBs
dashboard on chart.io: https://chartio.com/
What can Presto do?
> Query interactively (in milliseconds to minutes)
> MapReduce and Hive are still necessary for ETL
> Query using commercial BI tools or dashboards
> Reliable ODBC/JDBC connectivity
> Query across multiple data sources such as
Hive, HBase, Cassandra, or even commercial DBs
> Plugin mechanism
> Integrate batch analysis + visualization
into a single data analysis platform
Presto’s deployment
> Facebook
> Multiple geographical regions
> scaled to 1,000 nodes
> actively used by 1,000+ employees
> who run 30,000+ queries every day
> processing 1PB/day
> Netflix, Dropbox, Treasure Data, Airbnb, Qubole
> Presto as a Service
Today’s talk
1. Distributed architecture
2. Data visualization - Demo
3. Query Execution - Presto vs. MapReduce
4. Monitoring & Configuration
5. Roadmap - the future
1. Distributed architecture
[Figure: Client, Coordinator with Connector Plugin, Workers, Storage / Metadata, Discovery Service]
1. Find servers in a cluster (Discovery Service)
2. Client sends a query using HTTP
3. Coordinator builds a query plan
Connector plugin provides metadata (table schema, etc.)
4. Coordinator sends tasks to workers
5. Workers read data through connector plugin
6. Workers run tasks in memory
7. Client gets the result from a worker
What are Connectors?
> Connectors are plugins to Presto
> written in Java
> Access to storage and metadata
> provide table schema to coordinators
> provide table rows to workers
> Implementations:
> Hive connector
> Cassandra connector
> MySQL through JDBC connector (prerelease)
> Or your own connector
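The two responsibilities above (schema for the coordinator, rows for the workers) can be pictured with a toy interface. This is a Python sketch for illustration only: real connectors are Java plugins, and `ToyConnector`, `get_table_schema`, and `read_rows` are hypothetical names, not Presto's actual plugin interface.

```python
from typing import Dict, Iterator, List, Tuple

Schema = List[Tuple[str, str]]  # (column name, type name) pairs


class ToyConnector:
    """Hypothetical in-memory connector; not Presto's real plugin interface."""

    def __init__(self, tables: Dict[str, Tuple[Schema, List[tuple]]]):
        # table name -> (schema, rows)
        self._tables = tables

    def get_table_schema(self, table: str) -> Schema:
        # Metadata side: what a coordinator would ask for while planning.
        return self._tables[table][0]

    def read_rows(self, table: str, columns: List[str]) -> Iterator[tuple]:
        # Data side: what a worker would ask for while executing.
        # Only the requested columns are returned.
        schema, rows = self._tables[table]
        names = [name for name, _ in schema]
        indexes = [names.index(column) for column in columns]
        for row in rows:
            yield tuple(row[i] for i in indexes)


connector = ToyConnector({
    "impressions": (
        [("name", "varchar"), ("time", "bigint")],
        [("a", 1), ("b", 2), ("a", 3)],
    )
})
```

Keeping metadata access and row access as separate calls mirrors the split between coordinator and workers described in the list above.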
[Figure: Hive connector. Coordinator and Workers reach HDFS and the Hive Metastore through the Hive Connector; the Discovery Service finds servers in a cluster]
[Figure: Cassandra connector. Coordinator and Workers reach Cassandra through the Cassandra Connector]
[Figure: Multiple connectors in a query. Hive Connector (HDFS / Metastore), Cassandra Connector (Cassandra), and other connectors (other data sources...) used together]
1. Distributed architecture
> 3 types of servers:
> Coordinator, worker, discovery service
> Get data/metadata through connector plugins.
> Presto is NOT a database
> Presto provides SQL to existing data stores
> Client protocol is HTTP + JSON
> Language bindings:
Ruby, Python, PHP, Java (JDBC), R, Node.JS...
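Because the client protocol is HTTP + JSON, a minimal client is short. The sketch below is written from memory of the Presto client protocol: the `/v1/statement` path, the `nextUri` and `data` response fields, and the `X-Presto-User` header are assumptions to verify against your Presto version. `run_query`, `http_json`, and the `send` parameter are hypothetical names introduced here.

```python
import json
import urllib.request
from typing import Callable, Dict, Iterator, List, Optional


def http_json(method: str, url: str, body: Optional[bytes],
              headers: Dict[str, str]) -> dict:
    # Default transport: one HTTP request, JSON response body.
    request = urllib.request.Request(url, data=body, headers=headers, method=method)
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


def run_query(coordinator: str, sql: str, user: str = "demo",
              send: Callable[[str, str, Optional[bytes], Dict[str, str]], dict] = http_json
              ) -> Iterator[List]:
    """Submit a query, then follow nextUri links until the result is exhausted."""
    headers = {"X-Presto-User": user}  # assumed header name
    page = send("POST", coordinator + "/v1/statement", sql.encode("utf-8"), headers)
    while True:
        if "error" in page:
            raise RuntimeError(str(page["error"]))
        for row in page.get("data", []):
            yield row
        next_uri = page.get("nextUri")
        if next_uri is None:
            break
        page = send("GET", next_uri, None, headers)
```

Usage would look like `for row in run_query("http://coordinator.example:8080", "SELECT 1"): ...`, where the host and port are placeholders. The `send` parameter exists only so the polling loop can be exercised without a running cluster.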
[Figure: the same architecture with a second Coordinator, labeled Coordinator HA]
2. Data visualization
The problems with using BI tools
> BI tools need ODBC or JDBC connectivity
> Tableau, IBM Cognos, QlikView, Chart.IO, ...
> JasperSoft, Pentaho, MotionBoard, ...
> ODBC/JDBC is VERY COMPLICATED
> A mature implementation takes a LONG time
A solution: PostgreSQL protocol
> Creating a PostgreSQL protocol gateway
> Using PostgreSQL’s stable ODBC / JDBC driver
https://github.com/treasure-data/prestogres
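How does Prestogres work?
[Figure: client → pgpool-II + patch → PostgreSQL → Presto Coordinator]
1. Client sends: SELECT COUNT(1) FROM tbl1
2. pgpool-II + patch sends to PostgreSQL:
select run_presto_as_temp_table(
'presto_result', 'SELECT COUNT(1) FROM tbl1');
3. The “run_presto_as_temp_table” function runs the query on Presto (Coordinator)
4. pgpool-II + patch then runs: SELECT * FROM presto_result;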
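The rewrite that happens in front of PostgreSQL can be sketched as a small function. This is an illustration in Python, not Prestogres code (the real rewrite lives in the patched pgpool-II); `rewrite_for_presto` is a hypothetical name, and only `run_presto_as_temp_table` and `presto_result` come from the slide.

```python
from typing import List


def rewrite_for_presto(sql: str, temp_table: str = "presto_result") -> List[str]:
    """Return the two statements sent to PostgreSQL in place of the original query."""
    # Drop a trailing semicolon, then double single quotes so the query
    # can be embedded inside a SQL string literal.
    embedded = sql.strip().rstrip(";").replace("'", "''")
    table_literal = temp_table.replace("'", "''")
    return [
        f"select run_presto_as_temp_table('{table_literal}', '{embedded}');",
        f"SELECT * FROM {temp_table};",
    ]


statements = rewrite_for_presto("SELECT COUNT(1) FROM tbl1")
```

The first statement makes PostgreSQL run the original query on Presto and store the result in a temporary table; the second reads that table back through the ordinary PostgreSQL protocol, which is what the ODBC/JDBC driver sees.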
Demo
2. Data visualization with Presto
> Data visualization tools need ODBC/JDBC driver
> but implementation takes a LONG time
> A solution is to use PostgreSQL protocol
> and use PostgreSQL’s ODBC/JDBC driver
> Prestogres is already confirmed to work with
some commercial BI tools
3. Query Execution
Presto’s execution model
> Presto is NOT MapReduce
> Presto’s query plan is based on DAG
> more like Apache Tez or traditional MPP
databases
How does a query run?
> Coordinator
> SQL Parser
> Query Planner
> Execution planner
> Workers
> Task execution scheduler
[Figure: query planning pipeline on the coordinator]
SQL → SQL Parser → AST → Logical Planner → Logical Query Plan → Optimizer
→ Distributed Planner → Distributed Query Plan → Execution Planner → Execution Plan
✓ table schema: Metadata, from the Connector
✓ node list: NodeManager, from the Discovery Service
[Figure: simplified pipeline (today’s talk)]
SQL → SQL Parser → Query Planner → Query Plan → Execution Planner → Execution Plan
Query Planner
SQL:
SELECT
name,
count(*) AS c
FROM impressions
GROUP BY name
+
Table schema:
impressions (
name varchar
time bigint
)
Logical query plan:
Table scan (name:varchar) → GROUP BY (name, count(*)) → Output (name, c)
Distributed query plan:
Table scan → Partial aggregation → Sink
→ Exchange → Final aggregation → Sink
→ Exchange → Output
Query Planner - Stages
Stage-0: Exchange → Output
Stage-1: Exchange → Final aggregation → Sink
Stage-2: Table scan → Partial aggregation → Sink
> Between stages: inter-worker data transfer
> Within a stage: pipelined aggregation
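The partial aggregation, exchange, and final aggregation steps for the `GROUP BY name` query can be sketched on in-memory lists. This is a toy model in Python, not Presto code: the function names are hypothetical, and hash partitioning in the exchange is an assumption about how rows are routed between stages.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def partial_aggregation(split: Iterable[str]) -> Counter:
    # Table scan stage: each split counts the names it sees locally.
    return Counter(split)


def exchange(partials: List[Counter], n_partitions: int) -> List[List[Tuple[str, int]]]:
    # Route (name, count) pairs by hash so equal names land in the same partition.
    partitions: List[List[Tuple[str, int]]] = [[] for _ in range(n_partitions)]
    for partial in partials:
        for name, count in partial.items():
            partitions[hash(name) % n_partitions].append((name, count))
    return partitions


def final_aggregation(partition: List[Tuple[str, int]]) -> Dict[str, int]:
    # Final aggregation stage: sum the partial counts for each name.
    totals: Counter = Counter()
    for name, count in partition:
        totals[name] += count
    return dict(totals)


splits = [["a", "b", "a"], ["b", "c"], ["a"]]
partials = [partial_aggregation(split) for split in splits]

result: Dict[str, int] = {}
for partition in exchange(partials, n_partitions=2):
    # Output stage: collect the per-partition results.
    result.update(final_aggregation(partition))
```

Because every name is routed to exactly one partition, the per-partition results never overlap and can simply be merged.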
Execution Planner
+ Node list
✓ 2 workers
[Figure: the distributed query plan expanded over Worker 1 and Worker 2. Each worker gets a copy of Table scan → Partial aggregation → Sink and of Exchange → Final aggregation → Sink; a single Exchange → Output collects the result]
Execution Planner - Tasks
[Figure: the same plan on Worker 1 and Worker 2; each copy of a stage on a worker is a Task]
1 task / worker / stage
✓ All tasks in parallel
Execution Planner - Split
[Figure: the same plan on Worker 1 and Worker 2; each task is divided into Splits]
> many splits / task
= many threads / worker
(table scan)
> 1 split / task
= 1 thread / worker
> 1 split / worker
= 1 thread / worker
All stages are pipelined
✓ No wait time
✓ No fault-tolerance
MapReduce vs. Presto
MapReduce: map → disk → reduce → disk → map → disk → reduce
> Write data to disk
> Wait between stages
Presto: task → task → task
> memory-to-memory data transfer
✓ No disk IO
✓ Data chunk must fit in memory
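The difference between waiting between stages and pipelining can be shown with Python generators. This is only an analogy under simplifying assumptions (one process, two toy stages named `scan` and `double`, a list standing in for disk); it is not how either system is implemented.

```python
from typing import Iterable, Iterator, List

log: List[str] = []


def scan(rows: Iterable[int]) -> Iterator[int]:
    for row in rows:
        log.append(f"scan {row}")
        yield row


def double(rows: Iterable[int]) -> Iterator[int]:
    for row in rows:
        log.append(f"double {row}")
        yield row * 2


# Pipelined (Presto-like): the downstream stage consumes each row
# as soon as the upstream stage produces it.
pipelined = list(double(scan([1, 2])))
pipelined_log = list(log)

# Staged (MapReduce-like): the first stage runs to completion and its
# output is materialized before the next stage starts.
log.clear()
materialized = list(scan([1, 2]))  # stands in for "write data to disk"
staged = list(double(materialized))
staged_log = list(log)
```

Both variants compute the same result; `pipelined_log` interleaves the two stages row by row, while `staged_log` shows the whole first stage finishing before the second one begins.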
3. Query Execution
> SQL is converted into stages, tasks and splits
> All tasks run in parallel
> No wait time between stages (pipelined)
> If one task fails, all tasks fail at once (query fails)
> Memory-to-memory data transfer
> No disk IO
> If aggregated data doesn’t fit in memory,
query fails
• Note: query dies but worker doesn’t die.
Memory consumption of all queries is fully managed
4. Monitoring & Configuration
Monitoring
> Web UI
> basic query status check
> JMX HTTP API
> GET /v1/jmx/mbean[/{objectName}]
• com.facebook.presto.execution:name=TaskManager
• com.facebook.presto.execution:name=QueryManager
• com.facebook.presto.execution:name=NodeScheduler
> Event notification (remote logging)
> POST http://remote.server/v2/event
• query start, query complete, split complete
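A sketch of calling the JMX HTTP API from Python. The path and the object names come from the slide; the host and port, the JSON response format, and the helper names `jmx_mbean_url` and `fetch_mbean` are assumptions made for this example.

```python
import json
import urllib.parse
import urllib.request


def jmx_mbean_url(base_url: str, object_name: str = "") -> str:
    """Build the URL for GET /v1/jmx/mbean[/{objectName}]."""
    url = base_url.rstrip("/") + "/v1/jmx/mbean"
    if object_name:
        # Keep ':', '=' and ',' readable; they are part of JMX object names.
        url += "/" + urllib.parse.quote(object_name, safe=":=,")
    return url


def fetch_mbean(base_url: str, object_name: str) -> dict:
    # Assumes the server answers with a JSON document.
    with urllib.request.urlopen(jmx_mbean_url(base_url, object_name)) as response:
        return json.loads(response.read().decode("utf-8"))


url = jmx_mbean_url(
    "http://localhost:8080",  # placeholder address of a Presto server
    "com.facebook.presto.execution:name=QueryManager",
)
```

Calling `fetch_mbean` with the same arguments would issue the request; it is left out of the example because it needs a running server.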
Configuration
> Execution planning (for coordinator)
> query.initial-hash-partitions
• max number of hash buckets (=tasks) of a GROUP BY
(default: 8)
> node-scheduler.min-candidates
• max number of workers to run a stage in parallel
(default: 10)
> node-scheduler.include-coordinator
• whether to run tasks only on workers or also on the coordinator
> query.schedule-split-batch-size
• number of splits of a stage to start at once
Configuration
> Task execution (for workers)
> task.cpu-timer-enabled
• enable detailed statistics (causes some overhead)
(default: true)
> task.max-memory
• memory limit of a task especially for hash tables used by
GROUP BY and JOIN operations (default: 256MB)
• enlarge if you get “Task exceeded max memory size” error
> task.shard.max-threads
• max number of threads of a worker to run active splits
(default: number of CPU cores * 4)
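The settings from the two Configuration slides can be written out as properties text (Presto typically reads such settings from `etc/config.properties`). The sketch below uses the defaults quoted on the slides; the value for `node-scheduler.include-coordinator` is an assumption because the slide gives no default, and `render_properties` is a hypothetical helper.

```python
import os
from typing import Dict, Union

Value = Union[bool, int, str]


def render_properties(settings: Dict[str, Value]) -> str:
    """Render key=value lines, writing booleans as true/false."""
    lines = []
    for key, value in settings.items():
        if isinstance(value, bool):
            value = "true" if value else "false"
        lines.append(f"{key}={value}")
    return "\n".join(lines) + "\n"


cores = os.cpu_count() or 1

coordinator_settings: Dict[str, Value] = {
    "query.initial-hash-partitions": 8,           # default from the slide
    "node-scheduler.min-candidates": 10,          # default from the slide
    "node-scheduler.include-coordinator": False,  # assumed value, no default given
}

worker_settings: Dict[str, Value] = {
    "task.cpu-timer-enabled": True,               # default from the slide
    "task.max-memory": "256MB",                   # default from the slide
    "task.shard.max-threads": cores * 4,          # number of CPU cores * 4
}

coordinator_text = render_properties(coordinator_settings)
worker_text = render_properties(worker_settings)
```

Property names change between Presto releases, so check them against the documentation for the version you deploy before copying the output.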
5. Roadmap
A report of Presto Meetup 2014
http://www.slideshare.net/dain1/presto-meetup-20140514-34731104
"Presto, Past, Present, and Future" by Dain Sundstrom at Facebook
Presto’s future
> Huge JOIN and GROUP BY
> Spill to disk
> Task recovery
> CREATE VIEW (※implemented)
> Native store (※implemented)
> Fast data store in Presto workers
> to cache hot data
> Authentication and permissions
Presto’s future
> DDL/DML statements
> CREATE TABLE with partitioning
> DELETE and INSERT
> Plugin repository
> CLI plugin manager
> JOIN and aggregation pushdown
> Custom optimizers
Links
> Web site & document
> http://prestodb.io
> Mailing list
> https://groups.google.com/group/presto-users
> Github
> https://github.com/facebook/presto
> Guidelines for contribution
> https://github.com/facebook/presto/blob/master/CONTRIBUTING.md
Check: www.treasuredata.com
Cloud service for the entire data pipeline,
including Presto. We’re hiring!

More Related Content

What's hot

Managing Millions of Tests Using Databricks
Managing Millions of Tests Using DatabricksManaging Millions of Tests Using Databricks
Managing Millions of Tests Using DatabricksDatabricks
 
Exactly-Once Financial Data Processing at Scale with Flink and Pinot
Exactly-Once Financial Data Processing at Scale with Flink and PinotExactly-Once Financial Data Processing at Scale with Flink and Pinot
Exactly-Once Financial Data Processing at Scale with Flink and PinotFlink Forward
 
Understanding Presto - Presto meetup @ Tokyo #1
Understanding Presto - Presto meetup @ Tokyo #1Understanding Presto - Presto meetup @ Tokyo #1
Understanding Presto - Presto meetup @ Tokyo #1Sadayuki Furuhashi
 
Easy, scalable, fault tolerant stream processing with structured streaming - ...
Easy, scalable, fault tolerant stream processing with structured streaming - ...Easy, scalable, fault tolerant stream processing with structured streaming - ...
Easy, scalable, fault tolerant stream processing with structured streaming - ...Databricks
 
Simplify CDC Pipeline with Spark Streaming SQL and Delta Lake
Simplify CDC Pipeline with Spark Streaming SQL and Delta LakeSimplify CDC Pipeline with Spark Streaming SQL and Delta Lake
Simplify CDC Pipeline with Spark Streaming SQL and Delta LakeDatabricks
 
Introduction to Redis
Introduction to RedisIntroduction to Redis
Introduction to RedisDvir Volk
 
Large Scale Lakehouse Implementation Using Structured Streaming
Large Scale Lakehouse Implementation Using Structured StreamingLarge Scale Lakehouse Implementation Using Structured Streaming
Large Scale Lakehouse Implementation Using Structured StreamingDatabricks
 
Apache kafka 모니터링을 위한 Metrics 이해 및 최적화 방안
Apache kafka 모니터링을 위한 Metrics 이해 및 최적화 방안Apache kafka 모니터링을 위한 Metrics 이해 및 최적화 방안
Apache kafka 모니터링을 위한 Metrics 이해 및 최적화 방안SANG WON PARK
 
Observability for Data Pipelines With OpenLineage
Observability for Data Pipelines With OpenLineageObservability for Data Pipelines With OpenLineage
Observability for Data Pipelines With OpenLineageDatabricks
 
Mastering Java Bytecode With ASM - 33rd degree, 2012
Mastering Java Bytecode With ASM - 33rd degree, 2012Mastering Java Bytecode With ASM - 33rd degree, 2012
Mastering Java Bytecode With ASM - 33rd degree, 2012Anton Arhipov
 
Unifying Stream, SWL and CEP for Declarative Stream Processing with Apache Flink
Unifying Stream, SWL and CEP for Declarative Stream Processing with Apache FlinkUnifying Stream, SWL and CEP for Declarative Stream Processing with Apache Flink
Unifying Stream, SWL and CEP for Declarative Stream Processing with Apache FlinkDataWorks Summit/Hadoop Summit
 
Presto query optimizer: pursuit of performance
Presto query optimizer: pursuit of performancePresto query optimizer: pursuit of performance
Presto query optimizer: pursuit of performanceDataWorks Summit
 
Advanced Streaming Analytics with Apache Flink and Apache Kafka, Stephan Ewen
Advanced Streaming Analytics with Apache Flink and Apache Kafka, Stephan EwenAdvanced Streaming Analytics with Apache Flink and Apache Kafka, Stephan Ewen
Advanced Streaming Analytics with Apache Flink and Apache Kafka, Stephan Ewenconfluent
 
Building a Streaming Microservice Architecture: with Apache Spark Structured ...
Building a Streaming Microservice Architecture: with Apache Spark Structured ...Building a Streaming Microservice Architecture: with Apache Spark Structured ...
Building a Streaming Microservice Architecture: with Apache Spark Structured ...Databricks
 
Deep dive into stateful stream processing in structured streaming by Tathaga...
Deep dive into stateful stream processing in structured streaming  by Tathaga...Deep dive into stateful stream processing in structured streaming  by Tathaga...
Deep dive into stateful stream processing in structured streaming by Tathaga...Databricks
 
Redis cluster
Redis clusterRedis cluster
Redis clusteriammutex
 
Elk devops
Elk devopsElk devops
Elk devopsIdeato
 
Kafka replication apachecon_2013
Kafka replication apachecon_2013Kafka replication apachecon_2013
Kafka replication apachecon_2013Jun Rao
 

What's hot (20)

Managing Millions of Tests Using Databricks
Managing Millions of Tests Using DatabricksManaging Millions of Tests Using Databricks
Managing Millions of Tests Using Databricks
 
Presto: SQL-on-anything
Presto: SQL-on-anythingPresto: SQL-on-anything
Presto: SQL-on-anything
 
Exactly-Once Financial Data Processing at Scale with Flink and Pinot
Exactly-Once Financial Data Processing at Scale with Flink and PinotExactly-Once Financial Data Processing at Scale with Flink and Pinot
Exactly-Once Financial Data Processing at Scale with Flink and Pinot
 
Understanding Presto - Presto meetup @ Tokyo #1
Understanding Presto - Presto meetup @ Tokyo #1Understanding Presto - Presto meetup @ Tokyo #1
Understanding Presto - Presto meetup @ Tokyo #1
 
Easy, scalable, fault tolerant stream processing with structured streaming - ...
Easy, scalable, fault tolerant stream processing with structured streaming - ...Easy, scalable, fault tolerant stream processing with structured streaming - ...
Easy, scalable, fault tolerant stream processing with structured streaming - ...
 
Simplify CDC Pipeline with Spark Streaming SQL and Delta Lake
Simplify CDC Pipeline with Spark Streaming SQL and Delta LakeSimplify CDC Pipeline with Spark Streaming SQL and Delta Lake
Simplify CDC Pipeline with Spark Streaming SQL and Delta Lake
 
Introduction to Redis
Introduction to RedisIntroduction to Redis
Introduction to Redis
 
Large Scale Lakehouse Implementation Using Structured Streaming
Large Scale Lakehouse Implementation Using Structured StreamingLarge Scale Lakehouse Implementation Using Structured Streaming
Large Scale Lakehouse Implementation Using Structured Streaming
 
Apache kafka 모니터링을 위한 Metrics 이해 및 최적화 방안
Apache kafka 모니터링을 위한 Metrics 이해 및 최적화 방안Apache kafka 모니터링을 위한 Metrics 이해 및 최적화 방안
Apache kafka 모니터링을 위한 Metrics 이해 및 최적화 방안
 
Observability for Data Pipelines With OpenLineage
Observability for Data Pipelines With OpenLineageObservability for Data Pipelines With OpenLineage
Observability for Data Pipelines With OpenLineage
 
Mastering Java Bytecode With ASM - 33rd degree, 2012
Mastering Java Bytecode With ASM - 33rd degree, 2012Mastering Java Bytecode With ASM - 33rd degree, 2012
Mastering Java Bytecode With ASM - 33rd degree, 2012
 
Unifying Stream, SWL and CEP for Declarative Stream Processing with Apache Flink
Unifying Stream, SWL and CEP for Declarative Stream Processing with Apache FlinkUnifying Stream, SWL and CEP for Declarative Stream Processing with Apache Flink
Unifying Stream, SWL and CEP for Declarative Stream Processing with Apache Flink
 
Presto query optimizer: pursuit of performance
Presto query optimizer: pursuit of performancePresto query optimizer: pursuit of performance
Presto query optimizer: pursuit of performance
 
Advanced Streaming Analytics with Apache Flink and Apache Kafka, Stephan Ewen
Advanced Streaming Analytics with Apache Flink and Apache Kafka, Stephan EwenAdvanced Streaming Analytics with Apache Flink and Apache Kafka, Stephan Ewen
Advanced Streaming Analytics with Apache Flink and Apache Kafka, Stephan Ewen
 
Building a Streaming Microservice Architecture: with Apache Spark Structured ...
Building a Streaming Microservice Architecture: with Apache Spark Structured ...Building a Streaming Microservice Architecture: with Apache Spark Structured ...
Building a Streaming Microservice Architecture: with Apache Spark Structured ...
 
TiDB Introduction
TiDB IntroductionTiDB Introduction
TiDB Introduction
 
Deep dive into stateful stream processing in structured streaming by Tathaga...
Deep dive into stateful stream processing in structured streaming  by Tathaga...Deep dive into stateful stream processing in structured streaming  by Tathaga...
Deep dive into stateful stream processing in structured streaming by Tathaga...
 
Redis cluster
Redis clusterRedis cluster
Redis cluster
 
Elk devops
Elk devopsElk devops
Elk devops
 
Kafka replication apachecon_2013
Kafka replication apachecon_2013Kafka replication apachecon_2013
Kafka replication apachecon_2013
 

Similar to Presto - Hadoop Conference Japan 2014

SQL for Everything at CWT2014
SQL for Everything at CWT2014SQL for Everything at CWT2014
SQL for Everything at CWT2014N Masahiro
 
Treasure Data and OSS
Treasure Data and OSSTreasure Data and OSS
Treasure Data and OSSN Masahiro
 
Prestogres, ODBC & JDBC connectivity for Presto
Prestogres, ODBC & JDBC connectivity for PrestoPrestogres, ODBC & JDBC connectivity for Presto
Prestogres, ODBC & JDBC connectivity for PrestoSadayuki Furuhashi
 
Big Data and Hadoop
Big Data and HadoopBig Data and Hadoop
Big Data and Hadoopch adnan
 
Speed up Interactive Analytic Queries over Existing Big Data on Hadoop with P...
Speed up Interactive Analytic Queries over Existing Big Data on Hadoop with P...Speed up Interactive Analytic Queries over Existing Big Data on Hadoop with P...
Speed up Interactive Analytic Queries over Existing Big Data on Hadoop with P...viirya
 
Modernizing Global Shared Data Analytics Platform and our Alluxio Journey
Modernizing Global Shared Data Analytics Platform and our Alluxio JourneyModernizing Global Shared Data Analytics Platform and our Alluxio Journey
Modernizing Global Shared Data Analytics Platform and our Alluxio JourneyAlluxio, Inc.
 
Big data-at-detik
Big data-at-detikBig data-at-detik
Big data-at-detikk4ndar
 
Boston Hadoop Meetup: Presto for the Enterprise
Boston Hadoop Meetup: Presto for the EnterpriseBoston Hadoop Meetup: Presto for the Enterprise
Boston Hadoop Meetup: Presto for the EnterpriseMatt Fuller
 
Fluentd - RubyKansai 65
Fluentd - RubyKansai 65Fluentd - RubyKansai 65
Fluentd - RubyKansai 65N Masahiro
 
Lightning Talk: Why and How to Integrate MongoDB and NoSQL into Hadoop Big Da...
Lightning Talk: Why and How to Integrate MongoDB and NoSQL into Hadoop Big Da...Lightning Talk: Why and How to Integrate MongoDB and NoSQL into Hadoop Big Da...
Lightning Talk: Why and How to Integrate MongoDB and NoSQL into Hadoop Big Da...MongoDB
 
Hadoop from Hive with Stinger to Tez
Hadoop from Hive with Stinger to TezHadoop from Hive with Stinger to Tez
Hadoop from Hive with Stinger to TezJan Pieter Posthuma
 
Hadoop Big Data A big picture
Hadoop Big Data A big pictureHadoop Big Data A big picture
Hadoop Big Data A big pictureJ S Jodha
 
Netflix Data Engineering @ Uber Engineering Meetup
Netflix Data Engineering @ Uber Engineering MeetupNetflix Data Engineering @ Uber Engineering Meetup
Netflix Data Engineering @ Uber Engineering MeetupBlake Irvine
 
Automating Workflows for Analytics Pipelines
Automating Workflows for Analytics PipelinesAutomating Workflows for Analytics Pipelines
Automating Workflows for Analytics PipelinesSadayuki Furuhashi
 
Technologies for Data Analytics Platform
Technologies for Data Analytics PlatformTechnologies for Data Analytics Platform
Technologies for Data Analytics PlatformN Masahiro
 

Similar to Presto - Hadoop Conference Japan 2014 (20)

SQL on Hadoop in Taiwan
SQL on Hadoop in TaiwanSQL on Hadoop in Taiwan
SQL on Hadoop in Taiwan
 
SQL for Everything at CWT2014
SQL for Everything at CWT2014SQL for Everything at CWT2014
SQL for Everything at CWT2014
 
Presto+MySQLで分散SQL
Presto+MySQLで分散SQLPresto+MySQLで分散SQL
Presto+MySQLで分散SQL
 
Treasure Data and OSS
Treasure Data and OSSTreasure Data and OSS
Treasure Data and OSS
 
Prestogres internals
Prestogres internalsPrestogres internals
Prestogres internals
 
Prestogres, ODBC & JDBC connectivity for Presto
Prestogres, ODBC & JDBC connectivity for PrestoPrestogres, ODBC & JDBC connectivity for Presto
Prestogres, ODBC & JDBC connectivity for Presto
 
Big Data and Hadoop
Big Data and HadoopBig Data and Hadoop
Big Data and Hadoop
 
Speed up Interactive Analytic Queries over Existing Big Data on Hadoop with P...
Speed up Interactive Analytic Queries over Existing Big Data on Hadoop with P...Speed up Interactive Analytic Queries over Existing Big Data on Hadoop with P...
Speed up Interactive Analytic Queries over Existing Big Data on Hadoop with P...
 
Modernizing Global Shared Data Analytics Platform and our Alluxio Journey
Modernizing Global Shared Data Analytics Platform and our Alluxio JourneyModernizing Global Shared Data Analytics Platform and our Alluxio Journey
Modernizing Global Shared Data Analytics Platform and our Alluxio Journey
 
Big data-at-detik
Big data-at-detikBig data-at-detik
Big data-at-detik
 
Boston Hadoop Meetup: Presto for the Enterprise
Boston Hadoop Meetup: Presto for the EnterpriseBoston Hadoop Meetup: Presto for the Enterprise
Boston Hadoop Meetup: Presto for the Enterprise
 
Fluentd - RubyKansai 65
Fluentd - RubyKansai 65Fluentd - RubyKansai 65
Fluentd - RubyKansai 65
 
Prashanth Kumar_Hadoop_NEW
Prashanth Kumar_Hadoop_NEWPrashanth Kumar_Hadoop_NEW
Prashanth Kumar_Hadoop_NEW
 
Presto
PrestoPresto
Presto
 
Lightning Talk: Why and How to Integrate MongoDB and NoSQL into Hadoop Big Da...
Lightning Talk: Why and How to Integrate MongoDB and NoSQL into Hadoop Big Da...Lightning Talk: Why and How to Integrate MongoDB and NoSQL into Hadoop Big Da...
Lightning Talk: Why and How to Integrate MongoDB and NoSQL into Hadoop Big Da...
 
Hadoop from Hive with Stinger to Tez
Hadoop from Hive with Stinger to TezHadoop from Hive with Stinger to Tez
Hadoop from Hive with Stinger to Tez
 
Hadoop Big Data A big picture
Hadoop Big Data A big pictureHadoop Big Data A big picture
Hadoop Big Data A big picture
 
Netflix Data Engineering @ Uber Engineering Meetup
Netflix Data Engineering @ Uber Engineering MeetupNetflix Data Engineering @ Uber Engineering Meetup
Netflix Data Engineering @ Uber Engineering Meetup
 
Automating Workflows for Analytics Pipelines
Automating Workflows for Analytics PipelinesAutomating Workflows for Analytics Pipelines
Automating Workflows for Analytics Pipelines
 
Technologies for Data Analytics Platform
Technologies for Data Analytics PlatformTechnologies for Data Analytics Platform
Technologies for Data Analytics Platform
 

More from Sadayuki Furuhashi

Performance Optimization Techniques of MessagePack-Ruby - RubyKaigi 2019
Performance Optimization Techniques of MessagePack-Ruby - RubyKaigi 2019Performance Optimization Techniques of MessagePack-Ruby - RubyKaigi 2019
Performance Optimization Techniques of MessagePack-Ruby - RubyKaigi 2019Sadayuki Furuhashi
 
Digdagによる大規模データ処理の自動化とエラー処理
Digdagによる大規模データ処理の自動化とエラー処理Digdagによる大規模データ処理の自動化とエラー処理
Digdagによる大規模データ処理の自動化とエラー処理Sadayuki Furuhashi
 
Fluentd at Bay Area Kubernetes Meetup
Fluentd at Bay Area Kubernetes MeetupFluentd at Bay Area Kubernetes Meetup
Fluentd at Bay Area Kubernetes MeetupSadayuki Furuhashi
 
DigdagはなぜYAMLなのか?
DigdagはなぜYAMLなのか?DigdagはなぜYAMLなのか?
DigdagはなぜYAMLなのか?Sadayuki Furuhashi
 
Logging for Production Systems in The Container Era
Logging for Production Systems in The Container EraLogging for Production Systems in The Container Era
Logging for Production Systems in The Container EraSadayuki Furuhashi
 
分散ワークフローエンジン『Digdag』の実装 at Tokyo RubyKaigi #11
分散ワークフローエンジン『Digdag』の実装 at Tokyo RubyKaigi #11分散ワークフローエンジン『Digdag』の実装 at Tokyo RubyKaigi #11
分散ワークフローエンジン『Digdag』の実装 at Tokyo RubyKaigi #11Sadayuki Furuhashi
 
Fighting Against Chaotically Separated Values with Embulk
Fighting Against Chaotically Separated Values with EmbulkFighting Against Chaotically Separated Values with Embulk
Fighting Against Chaotically Separated Values with EmbulkSadayuki Furuhashi
 
Embulk - 進化するバルクデータローダ
Embulk - 進化するバルクデータローダEmbulk - 進化するバルクデータローダ
Embulk - 進化するバルクデータローダSadayuki Furuhashi
 
Plugin-based software design with Ruby and RubyGems
Plugin-based software design with Ruby and RubyGemsPlugin-based software design with Ruby and RubyGems
Plugin-based software design with Ruby and RubyGemsSadayuki Furuhashi
 
Embulk, an open-source plugin-based parallel bulk data loader
Embulk, an open-source plugin-based parallel bulk data loaderEmbulk, an open-source plugin-based parallel bulk data loader
Embulk, an open-source plugin-based parallel bulk data loaderSadayuki Furuhashi
 
Fluentd - Set Up Once, Collect More
Fluentd - Set Up Once, Collect MoreFluentd - Set Up Once, Collect More
Fluentd - Set Up Once, Collect MoreSadayuki Furuhashi
 
What's new in v11 - Fluentd Casual Talks #3 #fluentdcasual
What's new in v11 - Fluentd Casual Talks #3 #fluentdcasualWhat's new in v11 - Fluentd Casual Talks #3 #fluentdcasual
What's new in v11 - Fluentd Casual Talks #3 #fluentdcasualSadayuki Furuhashi
 
How we use Fluentd in Treasure Data
How we use Fluentd in Treasure DataHow we use Fluentd in Treasure Data
How we use Fluentd in Treasure DataSadayuki Furuhashi
 
How to collect Big Data into Hadoop
How to collect Big Data into HadoopHow to collect Big Data into Hadoop
How to collect Big Data into HadoopSadayuki Furuhashi
 

More from Sadayuki Furuhashi (20)

Scripting Embulk Plugins
Scripting Embulk PluginsScripting Embulk Plugins
Scripting Embulk Plugins
 
Performance Optimization Techniques of MessagePack-Ruby - RubyKaigi 2019
Performance Optimization Techniques of MessagePack-Ruby - RubyKaigi 2019Performance Optimization Techniques of MessagePack-Ruby - RubyKaigi 2019
Performance Optimization Techniques of MessagePack-Ruby - RubyKaigi 2019
 

Presto - Hadoop Conference Japan 2014

  • 1. Sadayuki Furuhashi, Founder & Software Architect, Treasure Data, Inc. Presto: Interactive SQL Query Engine for Big Data. Hadoop Conference in Japan 2014
  • 2. A little about me... > Sadayuki Furuhashi > github/twitter: @frsyuki > Treasure Data, Inc. > Founder & Software Architect > Open-source hacker > MessagePack - efficient object serializer > Fluentd - data collection tool > ServerEngine - Ruby framework to build multiprocess servers > LS4 - distributed object storage system > kumofs - distributed key-value data store
  • 4. What’s Presto? A distributed SQL query engine for interactive data analysis against GBs to PBs of data.
  • 5. Presto’s history > 2012 Fall: Project started at Facebook > Designed for interactive query > with speed of commercial data warehouse > and scalability to the size of Facebook > 2013 Winter: Open sourced! > 30+ contributors in 6 months > including people from outside of Facebook
  • 6. What are the problems to solve? > We couldn’t visualize data in HDFS directly using dashboards or BI tools > because Hive is too slow (not interactive) > or ODBC connectivity is unavailable/unstable > We needed to store daily-batch results to an interactive DB for quick response (PostgreSQL, Redshift, etc.) > An interactive DB costs more and is far less scalable > Some data are not stored in HDFS > We need to copy the data into HDFS to analyze it
  • 10. [Diagram] Batch analysis platform: HDFS -> Hive. Daily/Hourly Batch -> Visualization platform: PostgreSQL, etc., queried interactively by Commercial BI Tools and Dashboard.
  • 11. [Diagram] Same layout as slide 10, with the drawbacks marked. PostgreSQL, etc.: ✓ Less scalable ✓ Extra cost. Overall: ✓ More work to manage 2 platforms ✓ Can’t query against “live” data directly.
  • 12. [Diagram] Two layouts side by side. With PostgreSQL: HDFS, Hive, Daily/Hourly Batch into PostgreSQL, etc., Interactive query from the Dashboard. With Presto: HDFS, Hive, Daily/Hourly Batch, Interactive query from the Dashboard through Presto.
  • 14. [Diagram] Data analysis platform: Presto provides SQL on any data sets (HDFS, Hive with Daily/Hourly Batch, Cassandra, MySQL, commercial DBs) and serves interactive queries from the Dashboard and Commercial BI Tools (✓ IBM Cognos ✓ Tableau ✓ ...).
  • 15. Dashboard on Chartio: https://chartio.com/
  • 16. What can Presto do? > Query interactively (in milliseconds to minutes) > MapReduce and Hive are still necessary for ETL > Query using commercial BI tools or dashboards > Reliable ODBC/JDBC connectivity > Query across multiple data sources such as Hive, HBase, Cassandra, or even commercial DBs > Plugin mechanism > Integrate batch analysis + visualization into a single data analysis platform
  • 17. Presto’s deployment > Facebook > Multiple geographical regions > scaled to 1,000 nodes > actively used by 1,000+ employees > who run 30,000+ queries every day > processing 1PB/day > Netflix, Dropbox, Treasure Data, Airbnb, Qubole > Presto as a Service
  • 18. Today’s talk 1. Distributed architecture 2. Data visualization - Demo 3. Query Execution - Presto vs. MapReduce 4. Monitoring & Configuration 5. Roadmap - the future
  • 21. [Architecture diagram: Client, Coordinator, Connector Plugin, Worker x 3, Storage / Metadata, Discovery Service] 1. Find servers in a cluster
  • 22. [Same diagram] 2. Client sends a query using HTTP
  • 23. [Same diagram] 3. Coordinator builds a query plan. Connector plugin provides metadata (table schema, etc.)
  • 24. [Same diagram] 4. Coordinator sends tasks to workers
  • 25. [Same diagram] 5. Workers read data through connector plugin
  • 26. [Same diagram] 6. Workers run tasks in memory
  • 27. [Same diagram] 7. Client gets the result from a worker
  • 29. What are connectors? > Connectors are plugins to Presto > written in Java > Access to storage and metadata > provide table schema to coordinators > provide table rows to workers > Implementations: > Hive connector > Cassandra connector > MySQL through JDBC connector (prerelease) > Or your own connector
  • 32. [Architecture diagram] Multiple connectors in a query: Hive Connector (HDFS / Metastore), Cassandra Connector (Cassandra), other connectors ... (other data sources...). Client, Coordinator, Worker x 3, and Discovery Service (find servers in a cluster) as before.
  • 33. 1. Distributed architecture > 3 types of servers: > Coordinator, worker, discovery service > Get data/metadata through connector plugins. > Presto is NOT a database > Presto provides SQL to existing data stores > Client protocol is HTTP + JSON > Language bindings: Ruby, Python, PHP, Java (JDBC), R, Node.js...
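The "HTTP + JSON" client protocol on slide 33 can be pictured as a small polling loop. The following is a minimal sketch, not the official client: it assumes the coordinator accepts the SQL text at `POST /v1/statement`, identifies the user through an `X-Presto-User` header, and returns JSON pages carrying optional `data`, `error`, and `nextUri` fields. The transport functions are injected so the sketch runs without a cluster; the host name, the `fake_post`/`fake_get` helpers, and the canned pages are all hypothetical.

```python
import json
from typing import Callable, Dict, Iterator, List, Optional

# Transport callables are injected so the sketch runs without a cluster.
# post(url, body, headers) and get(url) each return the response body as text.
Post = Callable[[str, str, Dict[str, str]], str]
Get = Callable[[str], str]


def run_query(coordinator: str, sql: str, user: str, post: Post, get: Get) -> Iterator[List]:
    """Yield result rows by following nextUri links until none is left."""
    headers = {"X-Presto-User": user}  # assumption: header name expected by the coordinator
    response = json.loads(post(coordinator + "/v1/statement", sql, headers))
    while True:
        if "error" in response:  # assumption: failures are reported in an "error" object
            raise RuntimeError(response["error"].get("message", "query failed"))
        for row in response.get("data", []):
            yield row
        next_uri: Optional[str] = response.get("nextUri")
        if next_uri is None:
            return  # no more pages: the query is finished
        response = json.loads(get(next_uri))


# Hypothetical canned responses standing in for a real coordinator.
BASE = "http://coordinator.example:8080"
PAGES = {
    BASE + "/v1/statement": {"nextUri": BASE + "/page/1"},
    BASE + "/page/1": {"data": [["a", 2]], "nextUri": BASE + "/page/2"},
    BASE + "/page/2": {"data": [["b", 1]]},
}


def fake_post(url: str, body: str, headers: Dict[str, str]) -> str:
    return json.dumps(PAGES[url])


def fake_get(url: str) -> str:
    return json.dumps(PAGES[url])


rows = list(run_query(BASE, "SELECT name, count(*) FROM impressions GROUP BY name",
                      "demo", fake_post, fake_get))
print(rows)
```

Because the protocol is plain HTTP requests with JSON bodies, a language binding only needs an HTTP client and a JSON parser, which is consistent with the range of bindings listed on the slide.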
  • 34. [Same architecture diagram, with multiple Coordinators] Coordinator HA
  • 36. The problems with using BI tools > BI tools need ODBC or JDBC connectivity > Tableau, IBM Cognos, QlikView, Chartio, ... > JasperSoft, Pentaho, MotionBoard, ... > ODBC/JDBC is VERY COMPLICATED > A mature implementation takes a LONG time
  • 37. A solution: PostgreSQL protocol > Creating a PostgreSQL protocol gateway > Using PostgreSQL’s stable ODBC / JDBC driver https://github.com/treasure-data/prestogres
  • 38. How does Prestogres work? [Diagram: client, pgpool-II + patch, PostgreSQL, Presto Coordinator] 1. Client sends: SELECT COUNT(1) FROM tbl1 2. pgpool-II + patch sends to PostgreSQL: select run_presto_as_temp_table('presto_result', 'SELECT COUNT(1) FROM tbl1'); 3. The “run_presto_as_temp_table” function runs the query on the Presto Coordinator 4. SELECT * FROM presto_result;
  • 39. Demo
  • 40. 2. Data visualization with Presto > Data visualization tools need an ODBC/JDBC driver > but implementation takes a LONG time > A solution is to use the PostgreSQL protocol > and use PostgreSQL’s ODBC/JDBC driver > Prestogres is already confirmed to work with some commercial BI tools
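The query rewrite shown on the Prestogres slide can be illustrated with a few lines of string handling. This is only a sketch of the idea: the real gateway is a patched pgpool-II, and the function name `rewrite_for_prestogres` and the quote-escaping rule are assumptions made for this example. Only `run_presto_as_temp_table` and `presto_result` come from the slide.

```python
from typing import List


def rewrite_for_prestogres(sql: str, temp_table: str = "presto_result") -> List[str]:
    """Return the two statements sent to PostgreSQL in place of the client's query."""
    # Drop a trailing semicolon and double single quotes so the query
    # can be embedded in a SQL string literal (assumed escaping rule).
    embedded = sql.strip().rstrip(";").replace("'", "''")
    return [
        # Step 2 on the slide: the function runs the query on Presto
        # and stores the result in a temporary table.
        f"select run_presto_as_temp_table('{temp_table}', '{embedded}');",
        # Step 4 on the slide: read the result back over the PostgreSQL protocol.
        f"SELECT * FROM {temp_table};",
    ]


for statement in rewrite_for_prestogres("SELECT COUNT(1) FROM tbl1"):
    print(statement)
```

The client keeps speaking the PostgreSQL protocol throughout, which is why an unmodified PostgreSQL ODBC/JDBC driver can be used.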
  • 42. Presto’s execution model > Presto is NOT MapReduce > Presto’s query plan is based on a DAG > more like Apache Tez or traditional MPP databases
  • 43. How does a query run? > Coordinator > SQL Parser > Query Planner > Execution planner > Workers > Task execution scheduler
  • 44. [Diagram] SQL -> SQL Parser -> AST -> Logical Planner (with Optimizer) -> Logical Query Plan -> Distributed Planner -> Distributed Query Plan -> Execution Planner -> Execution Plan. Inputs: Metadata (✓ table schema) from the Connector; NodeManager (✓ node list) from the Discovery Server.
  • 45. [Same diagram] The planning steps between the SQL Parser and the Execution Planner are marked as the Query Planner (today’s talk).
  • 46. Query Planner. SQL: SELECT name, count(*) AS c FROM impressions GROUP BY name. Table schema: impressions (name varchar, time bigint). Logical query plan: Table scan (name:varchar) -> GROUP BY (name, count(*)) -> Output (name, c). Distributed query plan: Table scan -> Partial aggregation -> Sink -> Exchange -> Final aggregation -> Sink -> Exchange -> Output.
  • 47. Query Planner - Stages. Stage-2: Table scan -> Partial aggregation -> Sink. (inter-worker data transfer) Stage-1: Exchange -> Final aggregation -> Sink (pipelined aggregation). (inter-worker data transfer) Stage-0: Exchange -> Output.
  • 48. Execution Planner. Distributed query plan + Node list (✓ 2 workers): Worker 1 and Worker 2 each get Table scan -> Partial aggregation -> Sink and Exchange -> Final aggregation -> Sink; a single Exchange -> Output collects the result.
  • 49. Execution Planner - Tasks. [Same plan on Worker 1 and Worker 2] 1 task / worker / stage ✓ All tasks in parallel
  • 50. Execution Planner - Split. [Same plan on Worker 1 and Worker 2] Labels: many splits / task = many threads / worker (table scan); 1 split / task = 1 thread / worker; 1 split / worker = 1 thread / worker.
  • 51. MapReduce vs. Presto. MapReduce (map, reduce, disk): write data to disk; wait between stages. Presto (task to task): memory-to-memory data transfer ✓ No disk IO ✓ Data chunk must fit in memory. All stages are pipelined ✓ No wait time ✓ No fault-tolerance
  • 52. 3. Query Execution > SQL is converted into stages, tasks and splits > All tasks run in parallel > No wait time between stages (pipelined) > If one task fails, all tasks fail at once (query fails) > Memory-to-memory data transfer > No disk IO > If aggregated data doesn’t fit in memory, the query fails • Note: the query dies but the worker doesn’t die. Memory consumption of all queries is fully managed
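The partial/final aggregation plan from the query planner slides (SELECT name, count(*) AS c FROM impressions GROUP BY name) can be mimicked in a few lines. This is a toy, single-process sketch under stated assumptions: Python lists stand in for table splits, a hash of the group key stands in for the exchange between stages, and all function names are invented for the example. It says nothing about how Presto implements these operators.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List


def partial_aggregation(split: Iterable[str]) -> Counter:
    """Table scan + partial aggregation: count names within one worker's rows."""
    return Counter(split)


def exchange(partials: List[Counter], num_final_tasks: int) -> List[Dict[str, List[int]]]:
    """Route every (name, partial count) pair to a final task chosen by hashing the key."""
    buckets: List[Dict[str, List[int]]] = [defaultdict(list) for _ in range(num_final_tasks)]
    for partial in partials:
        for name, count in partial.items():
            buckets[hash(name) % num_final_tasks][name].append(count)
    return buckets


def final_aggregation(bucket: Dict[str, List[int]]) -> Dict[str, int]:
    """Sum the partial counts that arrived for each name."""
    return {name: sum(counts) for name, counts in bucket.items()}


def run(splits: List[List[str]], num_final_tasks: int = 2) -> Dict[str, int]:
    partials = [partial_aggregation(split) for split in splits]
    result: Dict[str, int] = {}
    for bucket in exchange(partials, num_final_tasks):
        result.update(final_aggregation(bucket))  # output stage gathers all final tasks
    return result


# Two workers, each scanning its own rows of the "name" column.
print(run([["a", "b", "a"], ["b", "c"]]))
```

Hashing on the group key guarantees that all partial counts for one name meet in the same final task, so each final task can finish its groups without talking to the others. In this toy version every bucket is held in memory, which mirrors the slide's point that aggregated data must fit in memory.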
  • 53. 4. Monitoring & Configuration
  • 54. Monitoring > Web UI > basic query status check > JMX HTTP API > GET /v1/jmx/mbean[/{objectName}] • com.facebook.presto.execution:name=TaskManager • com.facebook.presto.execution:name=QueryManager • com.facebook.presto.execution:name=NodeScheduler > Event notification (remote logging) > POST http://remote.server/v2/event • query start, query complete, split complete
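The JMX HTTP API on the monitoring slide is a plain GET per MBean, so a polling script needs little more than URL construction. A minimal sketch follows, assuming the object name can be placed in the path with `:`, `=`, and `,` left unescaped. The host name, the `fetch_mbean` helper, and the shape of the fake response are hypothetical; only the path `/v1/jmx/mbean/{objectName}` and the three MBean names come from the slide.

```python
import json
from typing import Callable
from urllib.parse import quote

# MBean names listed on the slide.
MBEANS = [
    "com.facebook.presto.execution:name=TaskManager",
    "com.facebook.presto.execution:name=QueryManager",
    "com.facebook.presto.execution:name=NodeScheduler",
]


def mbean_url(base_url: str, object_name: str) -> str:
    """Build the GET /v1/jmx/mbean/{objectName} URL for one MBean."""
    # Assumption: ':', '=' and ',' may stay literal; anything else unsafe is percent-encoded.
    return f"{base_url.rstrip('/')}/v1/jmx/mbean/{quote(object_name, safe=':=,')}"


def fetch_mbean(base_url: str, object_name: str, http_get: Callable[[str], str]) -> dict:
    """Fetch one MBean as parsed JSON; http_get is injected so this runs offline."""
    return json.loads(http_get(mbean_url(base_url, object_name)))


def fake_get(url: str) -> str:
    # Hypothetical response body; the real attribute layout is not shown in the slides.
    return json.dumps({"requested": url})


for name in MBEANS:
    print(fetch_mbean("http://coordinator.example:8080", name, fake_get)["requested"])
```

In a real deployment `http_get` would be replaced by an HTTP client call, and the returned JSON would be inspected to find the attributes worth graphing.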
  • 55. Configuration > Execution planning (for coordinator) > query.initial-hash-partitions • max number of hash buckets (=tasks) of a GROUP BY (default: 8) > node-scheduler.min-candidates • max number of workers to run a stage in parallel (default: 10) > node-scheduler.include-coordinator • whether to run tasks only on workers or also on the coordinator > query.schedule-split-batch-size • number of splits of a stage to start at once
  • 56. Configuration > Task execution (for workers) > task.cpu-timer-enabled • enable detailed statistics (causes some overhead) (default: true) > task.max-memory • memory limit of a task, especially for hash tables used by GROUP BY and JOIN operations (default: 256MB) • enlarge if you get a “Task exceeded max memory size” error > task.shard.max-threads • max number of threads of a worker to run active splits (default: number of CPU cores * 4)
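To make the settings on the two configuration slides concrete, the sketch below renders them as `key=value` lines of the kind found in a Presto `etc/config.properties` file. The values are simply the defaults quoted on the slides, not tuning advice, and the helper names (`worker_settings`, `render_properties`) are invented for this example. Settings for which the slides give no default are left out rather than guessed.

```python
import os
from typing import Dict, Optional

# Coordinator-side planning settings with the defaults quoted on slide 55.
COORDINATOR_SETTINGS: Dict[str, str] = {
    "query.initial-hash-partitions": "8",
    "node-scheduler.min-candidates": "10",
}


def worker_settings(cpu_cores: Optional[int] = None) -> Dict[str, str]:
    """Worker-side task settings with the defaults quoted on slide 56."""
    cores = cpu_cores or os.cpu_count() or 1
    return {
        "task.cpu-timer-enabled": "true",
        "task.max-memory": "256MB",  # enlarge on "Task exceeded max memory size" errors
        "task.shard.max-threads": str(cores * 4),  # slide default: CPU cores * 4
    }


def render_properties(settings: Dict[str, str]) -> str:
    """Render settings as key=value lines, one per line."""
    return "\n".join(f"{key}={value}" for key, value in settings.items()) + "\n"


print(render_properties(COORDINATOR_SETTINGS))
print(render_properties(worker_settings(cpu_cores=8)))
```

Writing the defaults out explicitly has no effect on behavior, but it gives a starting file in which a single value such as `task.max-memory` can be raised when a query hits the memory limit.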
  • 57. 5. Roadmap A report of Presto Meetup 2014 http://www.slideshare.net/dain1/presto-meetup-20140514-34731104 "Presto, Past, Present, and Future" by Dain Sundstrom at Facebook
  • 58. Presto’s future > Huge JOIN and GROUP BY > Spill to disk > Task recovery > CREATE VIEW (※implemented) > Native store (※implemented) > Fast data store in Presto workers > to cache hot data > Authentication and permissions
  • 59. Presto’s future > DDL/DML statements > CREATE TABLE with partitioning > DELETE and INSERT > Plugin repository > CLI plugin manager > JOIN and aggregation pushdown > Custom optimizers
  • 60. Links > Web site & document > http://prestodb.io > Mailing list > https://groups.google.com/group/presto-users > Github > https://github.com/facebook/presto > Guidelines for contribution > https://github.com/facebook/presto/blob/master/CONTRIBUTING.md
  • 61. Check: www.treasuredata.com Cloud service for the entire data pipeline, including Presto. We’re hiring!