Internals of Presto Service
Taro L. Saito, Treasure Data
leo@treasure-data.com
March 11-12th, 2015
Treasure Data Tech Talk #1 at Tokyo
Taro L. Saito @taroleo
•  2007 University of Tokyo. Ph.D.
–  XML DBMS, Transaction Processing
•  Relational-Style XML Query [SIGMOD 2008]
•  ~ 2014 Assistant Professor at University of Tokyo
–  Genome Science Research
•  Distributed Computing, Personal Genome Analysis
•  March 2014 ~ Treasure Data
–  Software Engineer, MPP Team Leader
•  Open source projects at GitHub
–  snappy-java, msgpack-java, sqlite-jdbc
–  sbt-pack, sbt-sonatype, larray
–  silk
•  Distributed workflow engine
[Figure: Treasure Data architecture. The TD API / Web Console submits batch queries to Hive and interactive queries to Presto. Presto reads PlazmaDB (MessagePack columnar storage) through the td-presto connector.]
What is Presto?
•  A distributed SQL engine developed by Facebook
–  For interactive analysis on petabyte-scale datasets
•  As a replacement for Hive
–  Nov. 2013: Open sourced on GitHub
•  Presto
–  Written in Java
–  In-memory query layer
–  CPU efficient for ad-hoc analysis
–  Based on ANSI SQL
–  Isolation of query layer and storage access layer
•  A connector provides data access (reading schema and records)
Presto: Distributed SQL Engine
[Figure labels: "Fault tolerant"; "Tailored to throughput"; "CPU-intensive, faster response time". Callout: TD Presto has its own query retry mechanism.]
Treasure Data: Presto as a Service
[Figure label: Presto public release]
Topics
•  Challenges in providing Database as a Service
•  TD Presto Connector
–  Optimizing Scan Performance
–  Multi-tenancy Cluster Management
•  Resource allocation
•  Monitoring
•  Query Tuning
Optimizing Scan Performance
•  Fully utilize the network bandwidth from S3
•  CPU then becomes the bottleneck in TD Presto
[Figure: scan pipeline of the TD Presto connector, reading MPC1 files from S3 / Riak CS]
•  TableScanOperator
–  Input: S3 file list, table schema
–  Pulls records from MessageUnpacker instances
•  Request Queue
–  Priority queue
–  Max connections limit
•  Buffer
–  Buffer size limit
–  Reuse allocated buffers: release(Buffer)
•  MPC1 file
–  Header
–  Column Block 0 (column names)
–  Column Block 1, ..., Column Block i, ..., Column Block m
•  HeaderReader
–  Header request to S3 / Riak CS
–  Callback to HeaderParser
•  HeaderParser
–  Parses the MPC file header: column block offsets, column names
–  Prepares column block requests
•  ColumnBlockReader
–  Column block requests, served by multiple S3 reads
–  Retry GET request on 500 (internal error), 503 (slow down), 404 (not found), and eventual consistency
•  MessageUnpacker
–  Decompression
–  msgpack-java v07
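The GET retry rule from the scan pipeline (retry on 500, 503 and 404) could look roughly like the following Python sketch. It is an illustration only, not the connector's actual code: the `fetch` callable, the attempt limit and the exponential backoff are all assumptions that the slides do not specify.

```python
import time

# Status codes the slide lists as retryable. 404 is presumably retried because
# an eventually consistent object store may not show a new object right away.
RETRYABLE_STATUS = {500, 503, 404}


class RetriesExhausted(Exception):
    """Raised when every attempt returned a retryable status."""


def get_with_retry(fetch, max_attempts=5, base_delay=0.1, sleep=time.sleep):
    """Call fetch() until it returns a non-retryable status.

    fetch is a hypothetical callable returning (status_code, body).
    The backoff schedule is an assumption, not taken from the slides.
    """
    status = None
    for attempt in range(max_attempts):
        status, body = fetch()
        if status not in RETRYABLE_STATUS:
            return status, body
        if attempt + 1 < max_attempts:
            sleep(base_delay * (2 ** attempt))  # exponential backoff
    raise RetriesExhausted(
        f"gave up after {max_attempts} attempts (last status {status})"
    )
```

Passing `sleep` as a parameter keeps the function easy to exercise without real delays.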
MessageBuffer
•  msgpack-java v06 was the bottleneck
–  Inefficient buffer access
•  v07
•  Fast memory access
•  sun.misc.Unsafe
•  Direct access to heap memory
•  extract primitive type value from byte[]
•  cast
•  No boxing
Unsafe memory access performance is comparable to C
•  http://frsyuki.hatenablog.com/entry/2014/03/12/155231
Why is ByteBuffer slow?
•  Following good programming practice
–  Define an interface, then implement classes
•  ByteBuffer has HeapByteBuffer and DirectByteBuffer implementations
•  In reality: TypeProfile slows down method access
–  The JVM generates a look-up table of method implementations
–  Simply importing one or more classes generates a TypeProfile
•  v07 avoids TypeProfile generation
–  Loads an implementation class through reflection
Format Type Detection
•  MessageUnpacker
–  Read prefix: 1 byte
–  Detect format type
•  switch-case
–  ANTLR generates this type of code
Format Type Detection
•  Using cache-efficient lookup table: 20000x faster
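A minimal Python sketch of the lookup-table idea: classify all 256 possible prefix bytes once, then detect the format with a single index operation. The byte ranges follow the public MessagePack format specification. The function and table names are hypothetical, and this does not reproduce msgpack-java's implementation or the speedup measured on the JVM.

```python
# Prefix bytes that each map to exactly one format (0xC0..0xDF).
_SINGLE = {
    0xC0: "nil", 0xC1: "never used", 0xC2: "false", 0xC3: "true",
    0xC4: "bin 8", 0xC5: "bin 16", 0xC6: "bin 32",
    0xC7: "ext 8", 0xC8: "ext 16", 0xC9: "ext 32",
    0xCA: "float 32", 0xCB: "float 64",
    0xCC: "uint 8", 0xCD: "uint 16", 0xCE: "uint 32", 0xCF: "uint 64",
    0xD0: "int 8", 0xD1: "int 16", 0xD2: "int 32", 0xD3: "int 64",
    0xD4: "fixext 1", 0xD5: "fixext 2", 0xD6: "fixext 4",
    0xD7: "fixext 8", 0xD8: "fixext 16",
    0xD9: "str 8", 0xDA: "str 16", 0xDB: "str 32",
    0xDC: "array 16", 0xDD: "array 32", 0xDE: "map 16", 0xDF: "map 32",
}


def classify_prefix(b):
    """Slow path: map one prefix byte to a format name with range checks."""
    if b <= 0x7F:
        return "positive fixint"
    if b <= 0x8F:
        return "fixmap"
    if b <= 0x9F:
        return "fixarray"
    if b <= 0xBF:
        return "fixstr"
    if b >= 0xE0:
        return "negative fixint"
    return _SINGLE[b]


# Built once at import time: 256 entries, one per possible prefix byte.
FORMAT_TABLE = tuple(classify_prefix(b) for b in range(256))


def detect_format(prefix_byte):
    """Fast path: a single table lookup instead of a chain of branches."""
    return FORMAT_TABLE[prefix_byte]
```

For example, `detect_format(data[0])` on a `bytes` payload returns the format name of its first value.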
2x performance improvement in v07
Database As A Service
Claremont Report on Database Research
•  Discussion on the future of DBMS
–  Top researchers, vendors and practitioners
–  CACM, Vol. 52 No. 6, 2009
•  Predicts the emergence of cloud data services
–  SQL has an important role
•  Limited functionality
•  Suited for service providers
–  A difficult example: Spark
•  Needs a secure application container to run arbitrary Scala code
Beckman Report on Database Research
•  2013
–  http://beckman.cs.wisc.edu/beckman-report2013.pdf
–  Topics on Big Data
•  End-to-end service
–  From data collection to knowledge
•  Cloud services have become popular
–  IaaS, PaaS, SaaS
–  The challenge is to migrate all DBMS functionality into the cloud
Big Data Simplified: The Treasure Data Approach
[Figure: Treasure Data pipeline]
•  Sources: app servers (server-side Treasure Agents), Mobile SDKs, Web SDK, embedded SDKs
•  Multi-structured events: register, login, start_event, purchase, etc.
•  Infinite & economical cloud data store: app log data, mobile event data, sensor data, telemetry
•  Familiar & table-oriented access through SQL: SQL-based ad-hoc queries, SQL-based dashboards
•  Results push: DBs & data marts, other apps
Challenges in Database as a Service
•  Tradeoffs
–  Cost and service level objectives (SLOs)
•  Reference
–  Workload Management for Big Data Analytics. A. Aboulnaga [SIGMOD2013 Tutorial]
[Figure: the two ends of the tradeoff]
•  Run each query set on an independent cluster: fast, but $$$
•  Run all queries together on the smallest possible cluster: reasonable price, but limited performance guarantee
Shift of Presto Query Usage
•  Initial phase
–  Trial and error with queries
•  Many syntax errors, semantic errors
•  Next phase
–  Scheduled query execution
•  Increased Presto query usage
–  Some customers submit more than 1,000 Presto queries / day
–  Establishing typical query patterns
•  hourly, daily reports
•  query templates
•  Advanced phase: More elaborate data analysis
–  Complex queries
•  by data scientists and data analysts
–  High resource usage
Usage Shift: Simple to Complex queries
Monitoring Presto Usage with Fluentd
[Figure: Hive and Presto usage logs collected with Fluentd]
DataDog
•  Monitoring CPU, memory and network usage
•  Query stats
Query Collection in TD
•  SQL query logs
–  query, detailed query plan, elapsed time, processed rows, etc.
•  Presto is used for analyzing the query history
Daily/Hourly Query Usage
Query Running Time
•  More than 90% of queries finish within 2 min.
–  The expected response time for interactive queries
Processed Rows of Queries
Performance
•  Processed rows / sec. of a query
Collecting Recoverable Error Patterns
•  Presto has no fault tolerance
•  Error types
–  User error
•  Syntax errors
–  SQL syntax, missing function
•  Semantic errors
–  missing tables/columns
–  Insufficient resource
•  Exceeded task memory size
–  Internal failure
•  I/O error
–  S3/Riak CS
•  worker failure
•  etc.
•  TD Presto retries queries that fail with internal failures
Query Retry on Internal Errors
•  More than 99.8% of queries finish without errors
Query Retry on Internal Errors (log scale)
•  Queries succeed eventually
Multi-tenancy: Resource Allocation
•  Price-plan based resource allocation
•  Parameters
–  The number of worker nodes to use (min-candidates)
–  The number of hash partitions (initial-hash-partitions)
–  The maximum number of running tasks per account
•  If running queries exceed the allowed number of tasks, subsequent queries must wait (queued)
•  Presto: SqlQueryExecution class
–  Controls query execution state: planning -> running -> finished
•  No resource allocation policy
–  The extended TDSqlQueryExecution class monitors running tasks and limits resource usage
•  Rewriting SqlQueryExecutionFactory at run-time using the ASM library
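The per-account task limit described above can be sketched as a small admission queue. This is a simplified Python model, not the extended SqlQueryExecution logic itself: the class name, the FIFO ordering, and the rule that an oversized query may start when nothing else is running are assumptions made for illustration.

```python
from collections import deque


class AccountTaskLimiter:
    """Admit queries while an account stays under its running-task limit."""

    def __init__(self, max_running_tasks):
        self.max_running_tasks = max_running_tasks
        self.running = {}      # query_id -> number of tasks
        self.queued = deque()  # (query_id, number of tasks), FIFO order

    def _fits(self, tasks):
        used = sum(self.running.values())
        # Assumption: an oversized query may still run alone, so it never starves.
        return not self.running or used + tasks <= self.max_running_tasks

    def submit(self, query_id, tasks):
        """Return 'running' if admitted now, otherwise 'queued'."""
        if not self.queued and self._fits(tasks):
            self.running[query_id] = tasks
            return "running"
        self.queued.append((query_id, tasks))
        return "queued"

    def finish(self, query_id):
        """Release a query's tasks and start queued queries that now fit."""
        del self.running[query_id]
        started = []
        while self.queued and self._fits(self.queued[0][1]):
            qid, tasks = self.queued.popleft()
            self.running[qid] = tasks
            started.append(qid)
        return started
```

A strict FIFO queue like this one lets a large query block smaller ones behind it, which is the kind of behavior the later customer-feedback and admission-control slides are concerned with.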
Query Queue
•  Presto 0.97
–  Introduces user-wise query queues
•  Can limit the number of concurrent queries per user
•  Problem
–  Running too many queries at once degrades overall query performance
Customer Feedback
•  Feedback from a customer:
–  We don’t care if large queries take a long time
–  But interactive queries should run immediately
•  Challenges
–  How do we allocate resources when preceding queries already occupy the customer's share of resources?
–  How do we know whether a submitted query is an interactive one?
Admission control is necessary
•  Adjust resource utilization
–  Running Drivers (Splits)
–  MPL (Multi-Programming Level)
Challenge: Auto Scaling
•  Setting the cluster size based on the peak usage is expensive
•  But predicting customer usage is difficult
Typical Query Patterns [Li Juang]
•  Q: What are typical queries of a customer?
–  Customers feel some queries are slow
–  But we don’t know what to compare with, except scheduled queries
•  Approach: Clustering Customer SQLs
•  TF/IDF measure: TF x IDF vector
–  Split SQL statements into tokens
–  Term frequency (TF) = the number of occurrences of each term in a query
–  Inverse document frequency (IDF) = log (# of queries / # of queries that have the token)
•  k-means clustering
–  TF/IDF vector
–  Generates clusters of similar queries
•  x-means clustering for deciding number of clusters automatically
–  D. Pelleg [ICML2000]
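The TF x IDF vectors described above can be computed with a few lines of Python. This is a sketch under stated assumptions: the regular-expression tokenizer and the sample queries are made up for illustration, and the clustering step (k-means or x-means) is left out. Cosine similarity is added only to show that similar queries end up close together.

```python
import math
import re
from collections import Counter


def tokenize(sql):
    """Split a SQL statement into lowercase word, number and symbol tokens."""
    return re.findall(r"[a-z_][a-z0-9_]*|\d+|[^\sa-z0-9_]", sql.lower())


def tfidf_vectors(queries):
    """Return one sparse {token: TF x IDF} vector per query."""
    docs = [Counter(tokenize(q)) for q in queries]
    n = len(docs)
    # Number of queries that contain each token at least once.
    doc_freq = Counter(tok for doc in docs for tok in doc)
    idf = {tok: math.log(n / df) for tok, df in doc_freq.items()}
    return [{tok: tf * idf[tok] for tok, tf in doc.items()} for doc in docs]


def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(tok, 0.0) for tok, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


queries = [
    "select count(*) from access where time > 10",
    "select count(*) from access where time > 20",
    "select user, sum(price) from purchase group by user",
]
vectors = tfidf_vectors(queries)
print(cosine(vectors[0], vectors[1]), cosine(vectors[0], vectors[2]))
```

Tokens that occur in every query, such as `select` and `from`, get an IDF of log(1) = 0 and so do not contribute to the similarity.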
Problematic Queries
•  90% of queries finish within 2 min.
–  But the remaining 10% is still large
•  10% of 10,000 queries is 1,000.
•  Long-running queries
•  Hog queries
Long Running Queries
•  Typical bottlenecks
–  Cross joins
–  IN (a, b, c, …)
•  semi-join filtering process is slow
–  Complex scan condition
•  pushing down selection
•  but delays column scan
–  Tuple materialization
•  coordinator generates JSON data
–  Many aggregation columns
•  group by 1, 2, 3, 4, 5, 6, …
–  Full scan
•  Scanning 100 billion rows…
•  Adding more resources does not always make a query faster
•  Storing intermediate data on disk is necessary
[Figure: fast stages feed a slow process; results are buffered while waiting to be fetched]
Hog Query
•  Queries consuming a lot of CPU/memory resources
–  Coined in S. Krompass et al. [EDBT2009]
•  Example:
–  select 1 as day, count(…) from … where time <= current_date - interval 1 day
union all
select 2 as day, count(…) from … where time <= current_date - interval 2 day
union all
–  …
–  (up to 190 days)
•  More than 1000 query stages.
•  Presto tries to run all of the stages at once.
–  High CPU usage at coordinator
Query Rewriting? Plan Optimization?
•  Query rewriting (better)
–  With group by and window functions
–  Not a perfect solution
•  Need to understand the meaning of the query
•  Semantic change is not allowed
–  e.g., we cannot rewrite UNION to UNION ALL
–  UNION includes duplicate elimination
•  Workaround idea
–  Bushy plan -> Deep plan
–  Introduce stage-wise resource assignment
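One way to picture the group-by and window-function rewrite of the day-by-day UNION ALL hog query is the Python sketch below, run against an in-memory SQLite database rather than Presto. Everything in it is hypothetical: the `events` table, the `age_days` column (days before the current date, so `time <= current_date - interval N day` becomes `age_days >= N`) and the SQL itself. It needs an SQLite build with window functions (3.25 or later).

```python
import sqlite3


def union_all_counts(conn, k):
    """Hog-style query: one UNION ALL branch, and one table scan, per day."""
    branches = [
        f"SELECT {n} AS day, COUNT(*) AS cnt FROM events WHERE age_days >= {n}"
        for n in range(1, k + 1)
    ]
    sql = " UNION ALL ".join(branches) + " ORDER BY day"
    return conn.execute(sql).fetchall()


# Rewritten form: scan once, group rows by day bucket, then take a running sum
# from the oldest bucket down. The generated day list keeps empty buckets.
SINGLE_SCAN_SQL = """
WITH RECURSIVE days(day) AS (
    SELECT 1 UNION ALL SELECT day + 1 FROM days WHERE day < :k
),
per_age AS (
    SELECT MIN(age_days, :k) AS bucket, COUNT(*) AS cnt
    FROM events WHERE age_days >= 1 GROUP BY bucket
)
SELECT d.day, COALESCE(SUM(p.cnt) OVER (ORDER BY d.day DESC), 0) AS cnt
FROM days AS d LEFT JOIN per_age AS p ON p.bucket = d.day
ORDER BY d.day
"""


def single_scan_counts(conn, k):
    """Same result as union_all_counts, using one scan of events."""
    return conn.execute(SINGLE_SCAN_SQL, {"k": k}).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (age_days INTEGER)")
conn.executemany(
    "INSERT INTO events VALUES (?)", [(a,) for a in [0, 1, 1, 3, 5, 5, 5, 9]]
)
print(union_all_counts(conn, 4))
print(single_scan_counts(conn, 4))
```

The rewrite only works because the meaning of the original query (cumulative counts per day cutoff) is known, which matches the point above that automatic rewriting is not a perfect solution.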
Future Work
•  Reducing Queuing/Response Time
–  Introducing a shared queue between customers
•  For utilizing remaining cluster resources
–  Fair scheduling: C. Gupta [EDBT2009]
–  Self-tuning DBMS: S. Chaudhuri [VLDB2007]
•  Adjusting Running Query Size (hard)
–  Keeping driver resources as small as possible for hog queries
–  Query plan based cost estimation
•  Predicting Query Running Time
–  J. Duggan [SIGMOD2011], A.C. Konig [VLDB2011]
Summary: Treasures in Treasure Data
•  Treasures for our customers
–  Data collected by fluentd (td-agent)
–  Query analysis platform
–  Query results - values
•  For Treasure Data
–  SQL query logs
•  Stored in Treasure Data
–  We know how customers use SQL
•  Typical queries and failures
–  We know which part of a query can be improved
43

More Related Content

What's hot

Cassandra concepts, patterns and anti-patterns
Cassandra concepts, patterns and anti-patternsCassandra concepts, patterns and anti-patterns
Cassandra concepts, patterns and anti-patternsDave Gardner
 
Apache Cassandra at the Geek2Geek Berlin
Apache Cassandra at the Geek2Geek BerlinApache Cassandra at the Geek2Geek Berlin
Apache Cassandra at the Geek2Geek BerlinChristian Johannsen
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EngineDataWorks Summit
 
A Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiA Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiDatabricks
 
Cassandra Introduction & Features
Cassandra Introduction & FeaturesCassandra Introduction & Features
Cassandra Introduction & FeaturesDataStax Academy
 
Time Series Data with InfluxDB
Time Series Data with InfluxDBTime Series Data with InfluxDB
Time Series Data with InfluxDBTuri, Inc.
 
Using ClickHouse for Experimentation
Using ClickHouse for ExperimentationUsing ClickHouse for Experimentation
Using ClickHouse for ExperimentationGleb Kanterov
 
What is in a Lucene index?
What is in a Lucene index?What is in a Lucene index?
What is in a Lucene index?lucenerevolution
 
Apache Spark Listeners: A Crash Course in Fast, Easy Monitoring
Apache Spark Listeners: A Crash Course in Fast, Easy MonitoringApache Spark Listeners: A Crash Course in Fast, Easy Monitoring
Apache Spark Listeners: A Crash Course in Fast, Easy MonitoringDatabricks
 
Treasure Data Cloud Data Platform
Treasure Data Cloud Data PlatformTreasure Data Cloud Data Platform
Treasure Data Cloud Data Platforminside-BigData.com
 
Introduction to the Disruptor
Introduction to the DisruptorIntroduction to the Disruptor
Introduction to the DisruptorTrisha Gee
 
Introduction to MongoDB
Introduction to MongoDBIntroduction to MongoDB
Introduction to MongoDBMike Dirolf
 
Large Scale Lakehouse Implementation Using Structured Streaming
Large Scale Lakehouse Implementation Using Structured StreamingLarge Scale Lakehouse Implementation Using Structured Streaming
Large Scale Lakehouse Implementation Using Structured StreamingDatabricks
 
Presto - Hadoop Conference Japan 2014
Presto - Hadoop Conference Japan 2014Presto - Hadoop Conference Japan 2014
Presto - Hadoop Conference Japan 2014Sadayuki Furuhashi
 
Presto At Treasure Data
Presto At Treasure DataPresto At Treasure Data
Presto At Treasure DataTaro L. Saito
 
Introduction to Cassandra Basics
Introduction to Cassandra BasicsIntroduction to Cassandra Basics
Introduction to Cassandra Basicsnickmbailey
 
Cloudera Impala Internals
Cloudera Impala InternalsCloudera Impala Internals
Cloudera Impala InternalsDavid Groozman
 

What's hot (20)

Cassandra concepts, patterns and anti-patterns
Cassandra concepts, patterns and anti-patternsCassandra concepts, patterns and anti-patterns
Cassandra concepts, patterns and anti-patterns
 
Flink vs. Spark
Flink vs. SparkFlink vs. Spark
Flink vs. Spark
 
Apache Cassandra at the Geek2Geek Berlin
Apache Cassandra at the Geek2Geek BerlinApache Cassandra at the Geek2Geek Berlin
Apache Cassandra at the Geek2Geek Berlin
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything Engine
 
A Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiA Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and Hudi
 
Presto
PrestoPresto
Presto
 
Cassandra Introduction & Features
Cassandra Introduction & FeaturesCassandra Introduction & Features
Cassandra Introduction & Features
 
Time Series Data with InfluxDB
Time Series Data with InfluxDBTime Series Data with InfluxDB
Time Series Data with InfluxDB
 
Using ClickHouse for Experimentation
Using ClickHouse for ExperimentationUsing ClickHouse for Experimentation
Using ClickHouse for Experimentation
 
What is in a Lucene index?
What is in a Lucene index?What is in a Lucene index?
What is in a Lucene index?
 
Apache Spark Listeners: A Crash Course in Fast, Easy Monitoring
Apache Spark Listeners: A Crash Course in Fast, Easy MonitoringApache Spark Listeners: A Crash Course in Fast, Easy Monitoring
Apache Spark Listeners: A Crash Course in Fast, Easy Monitoring
 
Treasure Data Cloud Data Platform
Treasure Data Cloud Data PlatformTreasure Data Cloud Data Platform
Treasure Data Cloud Data Platform
 
Introduction to the Disruptor
Introduction to the DisruptorIntroduction to the Disruptor
Introduction to the Disruptor
 
Introduction to MongoDB
Introduction to MongoDBIntroduction to MongoDB
Introduction to MongoDB
 
Large Scale Lakehouse Implementation Using Structured Streaming
Large Scale Lakehouse Implementation Using Structured StreamingLarge Scale Lakehouse Implementation Using Structured Streaming
Large Scale Lakehouse Implementation Using Structured Streaming
 
Cassandra 101
Cassandra 101Cassandra 101
Cassandra 101
 
Presto - Hadoop Conference Japan 2014
Presto - Hadoop Conference Japan 2014Presto - Hadoop Conference Japan 2014
Presto - Hadoop Conference Japan 2014
 
Presto At Treasure Data
Presto At Treasure DataPresto At Treasure Data
Presto At Treasure Data
 
Introduction to Cassandra Basics
Introduction to Cassandra BasicsIntroduction to Cassandra Basics
Introduction to Cassandra Basics
 
Cloudera Impala Internals
Cloudera Impala InternalsCloudera Impala Internals
Cloudera Impala Internals
 

Viewers also liked

Plazma - Treasure Data’s distributed analytical database -
Plazma - Treasure Data’s distributed analytical database -Plazma - Treasure Data’s distributed analytical database -
Plazma - Treasure Data’s distributed analytical database -Treasure Data, Inc.
 
Understanding Presto - Presto meetup @ Tokyo #1
Understanding Presto - Presto meetup @ Tokyo #1Understanding Presto - Presto meetup @ Tokyo #1
Understanding Presto - Presto meetup @ Tokyo #1Sadayuki Furuhashi
 
爆速クエリエンジン”Presto”を使いたくなる話
爆速クエリエンジン”Presto”を使いたくなる話爆速クエリエンジン”Presto”を使いたくなる話
爆速クエリエンジン”Presto”を使いたくなる話Kentaro Yoshida
 
Presto as a Service - Tips for operation and monitoring
Presto as a Service - Tips for operation and monitoringPresto as a Service - Tips for operation and monitoring
Presto as a Service - Tips for operation and monitoringTaro L. Saito
 
Presto: Distributed sql query engine
Presto: Distributed sql query engine Presto: Distributed sql query engine
Presto: Distributed sql query engine kiran palaka
 
Presto As A Service - Treasure DataでのPresto運用事例
Presto As A Service - Treasure DataでのPresto運用事例Presto As A Service - Treasure DataでのPresto運用事例
Presto As A Service - Treasure DataでのPresto運用事例Taro L. Saito
 
Presto in my_use_case
Presto in my_use_casePresto in my_use_case
Presto in my_use_casewyukawa
 
トレジャーデータ流,データ分析の始め方
トレジャーデータ流,データ分析の始め方トレジャーデータ流,データ分析の始め方
トレジャーデータ流,データ分析の始め方Takahiro Inoue
 
(BDT303) Running Spark and Presto on the Netflix Big Data Platform
(BDT303) Running Spark and Presto on the Netflix Big Data Platform(BDT303) Running Spark and Presto on the Netflix Big Data Platform
(BDT303) Running Spark and Presto on the Netflix Big Data PlatformAmazon Web Services
 
Pentaho CTools 20140902
Pentaho CTools 20140902Pentaho CTools 20140902
Pentaho CTools 20140902Teruo Kawasaki
 
Presto anatomy
Presto anatomyPresto anatomy
Presto anatomyDongmin Yu
 
Building Physical in a Virtual World
Building Physical in a Virtual WorldBuilding Physical in a Virtual World
Building Physical in a Virtual WorldChris Maxwell
 
Lambda in java_20160121
Lambda in java_20160121Lambda in java_20160121
Lambda in java_20160121Teruo Kawasaki
 

Viewers also liked (20)

Plazma - Treasure Data’s distributed analytical database -
Plazma - Treasure Data’s distributed analytical database -Plazma - Treasure Data’s distributed analytical database -
Plazma - Treasure Data’s distributed analytical database -
 
Diary of Support Engineer
Diary of Support EngineerDiary of Support Engineer
Diary of Support Engineer
 
Treasure Data and Fluentd
Treasure Data and FluentdTreasure Data and Fluentd
Treasure Data and Fluentd
 
HDP2 and YARN operations point
HDP2 and YARN operations pointHDP2 and YARN operations point
HDP2 and YARN operations point
 
hotdog a TD tool for DD
hotdog a TD tool for DDhotdog a TD tool for DD
hotdog a TD tool for DD
 
Treasure Data Mobile SDK
Treasure Data Mobile SDKTreasure Data Mobile SDK
Treasure Data Mobile SDK
 
Understanding Presto - Presto meetup @ Tokyo #1
Understanding Presto - Presto meetup @ Tokyo #1Understanding Presto - Presto meetup @ Tokyo #1
Understanding Presto - Presto meetup @ Tokyo #1
 
爆速クエリエンジン”Presto”を使いたくなる話
爆速クエリエンジン”Presto”を使いたくなる話爆速クエリエンジン”Presto”を使いたくなる話
爆速クエリエンジン”Presto”を使いたくなる話
 
Presto as a Service - Tips for operation and monitoring
Presto as a Service - Tips for operation and monitoringPresto as a Service - Tips for operation and monitoring
Presto as a Service - Tips for operation and monitoring
 
Presto: Distributed sql query engine
Presto: Distributed sql query engine Presto: Distributed sql query engine
Presto: Distributed sql query engine
 
Presto As A Service - Treasure DataでのPresto運用事例
Presto As A Service - Treasure DataでのPresto運用事例Presto As A Service - Treasure DataでのPresto運用事例
Presto As A Service - Treasure DataでのPresto運用事例
 
Presto in my_use_case
Presto in my_use_casePresto in my_use_case
Presto in my_use_case
 
トレジャーデータ流,データ分析の始め方
トレジャーデータ流,データ分析の始め方トレジャーデータ流,データ分析の始め方
トレジャーデータ流,データ分析の始め方
 
(BDT303) Running Spark and Presto on the Netflix Big Data Platform
(BDT303) Running Spark and Presto on the Netflix Big Data Platform(BDT303) Running Spark and Presto on the Netflix Big Data Platform
(BDT303) Running Spark and Presto on the Netflix Big Data Platform
 
pagecache-memo
pagecache-memopagecache-memo
pagecache-memo
 
Pentaho CTools 20140902
Pentaho CTools 20140902Pentaho CTools 20140902
Pentaho CTools 20140902
 
Presto anatomy
Presto anatomyPresto anatomy
Presto anatomy
 
Building Physical in a Virtual World
Building Physical in a Virtual WorldBuilding Physical in a Virtual World
Building Physical in a Virtual World
 
Pentaho
PentahoPentaho
Pentaho
 
Lambda in java_20160121
Lambda in java_20160121Lambda in java_20160121
Lambda in java_20160121
 

Similar to Internals of Presto Service

Presto meetup 2015-03-19 @Facebook
Presto meetup 2015-03-19 @FacebookPresto meetup 2015-03-19 @Facebook
Presto meetup 2015-03-19 @FacebookTreasure Data, Inc.
 
Ruby and Distributed Storage Systems
Ruby and Distributed Storage SystemsRuby and Distributed Storage Systems
Ruby and Distributed Storage SystemsSATOSHI TAGOMORI
 
Cloud Security Monitoring and Spark Analytics
Cloud Security Monitoring and Spark AnalyticsCloud Security Monitoring and Spark Analytics
Cloud Security Monitoring and Spark Analyticsamesar0
 
Introduction to .NET Performance Measurement
Introduction to .NET Performance MeasurementIntroduction to .NET Performance Measurement
Introduction to .NET Performance MeasurementSasha Goldshtein
 
ElasticSearch as (only) datastore
ElasticSearch as (only) datastoreElasticSearch as (only) datastore
ElasticSearch as (only) datastoreTomas Sirny
 
Building a near real time search engine & analytics for logs using solr
Building a near real time search engine & analytics for logs using solrBuilding a near real time search engine & analytics for logs using solr
Building a near real time search engine & analytics for logs using solrlucenerevolution
 
AWS re:Invent presentation: Unmeltable Infrastructure at Scale by Loggly
AWS re:Invent presentation: Unmeltable Infrastructure at Scale by Loggly AWS re:Invent presentation: Unmeltable Infrastructure at Scale by Loggly
AWS re:Invent presentation: Unmeltable Infrastructure at Scale by Loggly SolarWinds Loggly
 
Hardware Provisioning
Hardware ProvisioningHardware Provisioning
Hardware ProvisioningMongoDB
 
Automated product categorization
Automated product categorizationAutomated product categorization
Automated product categorizationAndreas Loupasakis
 
Automated product categorization
Automated product categorization   Automated product categorization
Automated product categorization Warply
 
Infrastructure at Scale: Apache Kafka, Twitter Storm & Elastic Search (ARC303...
Infrastructure at Scale: Apache Kafka, Twitter Storm & Elastic Search (ARC303...Infrastructure at Scale: Apache Kafka, Twitter Storm & Elastic Search (ARC303...
Infrastructure at Scale: Apache Kafka, Twitter Storm & Elastic Search (ARC303...Amazon Web Services
 
John adams talk cloudy
John adams   talk cloudyJohn adams   talk cloudy
John adams talk cloudyJohn Adams
 
Solving Office 365 Big Challenges using Cassandra + Spark
Solving Office 365 Big Challenges using Cassandra + Spark Solving Office 365 Big Challenges using Cassandra + Spark
Solving Office 365 Big Challenges using Cassandra + Spark Anubhav Kale
 
Bullet: A Real Time Data Query Engine
Bullet: A Real Time Data Query EngineBullet: A Real Time Data Query Engine
Bullet: A Real Time Data Query EngineDataWorks Summit
 
Apache con 2020 use cases and optimizations of iotdb
Apache con 2020 use cases and optimizations of iotdbApache con 2020 use cases and optimizations of iotdb
Apache con 2020 use cases and optimizations of iotdbZhangZhengming
 
Sumo Logic QuickStart Webinar - Jan 2016
Sumo Logic QuickStart Webinar - Jan 2016Sumo Logic QuickStart Webinar - Jan 2016
Sumo Logic QuickStart Webinar - Jan 2016Sumo Logic
 
Overview of data analytics service: Treasure Data Service
Overview of data analytics service: Treasure Data ServiceOverview of data analytics service: Treasure Data Service
Overview of data analytics service: Treasure Data ServiceSATOSHI TAGOMORI
 
OSMC 2016 - Monitor your infrastructure with Elastic Beats by Monica Sarbu
OSMC 2016 - Monitor your infrastructure with Elastic Beats by Monica SarbuOSMC 2016 - Monitor your infrastructure with Elastic Beats by Monica Sarbu
OSMC 2016 - Monitor your infrastructure with Elastic Beats by Monica SarbuNETWAYS
 
OSMC 2016 | Monitor your Infrastructure with Elastic Beats by Monica Sarbu
OSMC 2016 | Monitor your Infrastructure with Elastic Beats by Monica SarbuOSMC 2016 | Monitor your Infrastructure with Elastic Beats by Monica Sarbu
OSMC 2016 | Monitor your Infrastructure with Elastic Beats by Monica SarbuNETWAYS
 
The Autobahn Has No Speed Limit - Your XPages Shouldn't Either!
The Autobahn Has No Speed Limit - Your XPages Shouldn't Either!The Autobahn Has No Speed Limit - Your XPages Shouldn't Either!
The Autobahn Has No Speed Limit - Your XPages Shouldn't Either!Teamstudio
 

Similar to Internals of Presto Service (20)

Presto meetup 2015-03-19 @Facebook
Presto meetup 2015-03-19 @FacebookPresto meetup 2015-03-19 @Facebook
Presto meetup 2015-03-19 @Facebook
 
Ruby and Distributed Storage Systems
Ruby and Distributed Storage SystemsRuby and Distributed Storage Systems
Ruby and Distributed Storage Systems
 
Cloud Security Monitoring and Spark Analytics
Cloud Security Monitoring and Spark AnalyticsCloud Security Monitoring and Spark Analytics
Cloud Security Monitoring and Spark Analytics
 
Introduction to .NET Performance Measurement
Introduction to .NET Performance MeasurementIntroduction to .NET Performance Measurement
Introduction to .NET Performance Measurement
 
ElasticSearch as (only) datastore
ElasticSearch as (only) datastoreElasticSearch as (only) datastore
ElasticSearch as (only) datastore
 
Building a near real time search engine & analytics for logs using solr
Building a near real time search engine & analytics for logs using solrBuilding a near real time search engine & analytics for logs using solr
Building a near real time search engine & analytics for logs using solr
 
AWS re:Invent presentation: Unmeltable Infrastructure at Scale by Loggly
AWS re:Invent presentation: Unmeltable Infrastructure at Scale by Loggly AWS re:Invent presentation: Unmeltable Infrastructure at Scale by Loggly
AWS re:Invent presentation: Unmeltable Infrastructure at Scale by Loggly
 
Hardware Provisioning
Hardware ProvisioningHardware Provisioning
Hardware Provisioning
 
Automated product categorization
Automated product categorizationAutomated product categorization
Automated product categorization
 
Automated product categorization
Automated product categorization   Automated product categorization
Automated product categorization
 
Infrastructure at Scale: Apache Kafka, Twitter Storm & Elastic Search (ARC303...
Infrastructure at Scale: Apache Kafka, Twitter Storm & Elastic Search (ARC303...Infrastructure at Scale: Apache Kafka, Twitter Storm & Elastic Search (ARC303...
Infrastructure at Scale: Apache Kafka, Twitter Storm & Elastic Search (ARC303...
 
John adams talk cloudy
John adams   talk cloudyJohn adams   talk cloudy
John adams talk cloudy
 
Solving Office 365 Big Challenges using Cassandra + Spark
Solving Office 365 Big Challenges using Cassandra + Spark Solving Office 365 Big Challenges using Cassandra + Spark
Solving Office 365 Big Challenges using Cassandra + Spark
 
Bullet: A Real Time Data Query Engine
Bullet: A Real Time Data Query EngineBullet: A Real Time Data Query Engine
Bullet: A Real Time Data Query Engine
 
Apache con 2020 use cases and optimizations of iotdb
Apache con 2020 use cases and optimizations of iotdbApache con 2020 use cases and optimizations of iotdb
Apache con 2020 use cases and optimizations of iotdb
 
Sumo Logic QuickStart Webinar - Jan 2016
Sumo Logic QuickStart Webinar - Jan 2016Sumo Logic QuickStart Webinar - Jan 2016
Sumo Logic QuickStart Webinar - Jan 2016
 
Overview of data analytics service: Treasure Data Service
Overview of data analytics service: Treasure Data ServiceOverview of data analytics service: Treasure Data Service
Overview of data analytics service: Treasure Data Service
 
OSMC 2016 - Monitor your infrastructure with Elastic Beats by Monica Sarbu
OSMC 2016 - Monitor your infrastructure with Elastic Beats by Monica SarbuOSMC 2016 - Monitor your infrastructure with Elastic Beats by Monica Sarbu
OSMC 2016 - Monitor your infrastructure with Elastic Beats by Monica Sarbu
 
OSMC 2016 | Monitor your Infrastructure with Elastic Beats by Monica Sarbu
OSMC 2016 | Monitor your Infrastructure with Elastic Beats by Monica SarbuOSMC 2016 | Monitor your Infrastructure with Elastic Beats by Monica Sarbu
OSMC 2016 | Monitor your Infrastructure with Elastic Beats by Monica Sarbu
 
The Autobahn Has No Speed Limit - Your XPages Shouldn't Either!
The Autobahn Has No Speed Limit - Your XPages Shouldn't Either!The Autobahn Has No Speed Limit - Your XPages Shouldn't Either!
Internals of Presto Service

  • 1. Internals of Presto Service Taro L. Saito, Treasure Data leo@treasure-data.com March 11-12th, 2015 Treasure Data Tech Talk #1 at Tokyo
  • 2. Taro L. Saito @taroleo •  2007 University of Tokyo. Ph.D. –  XML DBMS, Transaction Processing •  Relational-Style XML Query [SIGMOD 2008] •  ~ 2014 Assistant Professor at University of Tokyo –  Genome Science Research •  Distributed Computing, Personal Genome Analysis •  March 2014 ~ Treasure Data –  Software Engineer, MPP Team Leader •  Open source projects at GitHub –  snappy-java, msgpack-java, sqlite-jdbc –  sbt-pack, sbt-sonatype, larray –  silk •  Distributed workflow engine
  • 3. (Architecture diagram) TD API / Web Console: batch query to Hive, interactive query to Presto. Presto reaches Treasure Data PlazmaDB (MessagePack Columnar Storage) through the td-presto connector.
  • 4. What is Presto? •  A distributed SQL Engine developed by Facebook –  For interactive analysis on peta-scale datasets •  As a replacement for Hive –  Nov. 2013: Open sourced at GitHub •  Presto –  Written in Java –  In-memory query layer –  CPU efficient for ad-hoc analysis –  Based on ANSI SQL –  Isolation of query layer and storage access layer •  A connector provides data access (reading schema and records)
  • 5. Presto: Distributed SQL Engine (comparison figure) •  Hive: tailored to throughput, fault tolerant •  Presto: CPU-intensive, faster response time •  TD Presto has its own query retry mechanism
  • 6. Treasure Data: Presto as a Service (timeline figure; label: Presto Public Release)
  • 7. Topics •  Challenges in providing Database as a Service •  TD Presto Connector –  Optimizing Scan Performance –  Multi-tenancy Cluster Management •  Resource allocation •  Monitoring •  Query Tuning
  • 8. Optimizing Scan Performance •  Fully utilize the network bandwidth from S3 •  TD Presto becomes CPU bottleneck •  Pipeline (diagram): TableScanOperator (s3 file list, table schema) → header request → Request Queue (priority queue, max connections limit) → S3 / RiakCS → HeaderReader (callback to HeaderParser) → HeaderParser (parse MPC file header: column block offsets, column names) → column block requests → ColumnBlockReader → prepare MessageUnpacker → pull records •  Buffers: buffer size limit, reuse allocated buffers, release(Buffer) •  MPC1 file layout: Header, Column Block 0 (column names), Column Block 1, …, Column Block i, …, Column Block m •  S3 read: decompression, msgpack-java v07 •  Retry GET request on 500 (internal error), 503 (slow down), 404 (not found; eventual consistency)
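The retry rule on this slide (retry the GET request on 500, 503, and 404) can be sketched as below. This is a minimal illustration, not the connector's code: the `fetch` callable, `max_attempts`, `base_delay`, and the exponential backoff are all assumptions, since the slides do not describe the actual retry policy.

```python
import time

# Status codes the slide lists as retryable. 404 is included because an
# eventually consistent store may not yet expose a freshly written object.
RETRYABLE_STATUS = {500, 503, 404}


def get_with_retry(fetch, key, max_attempts=5, base_delay=0.1, sleep=time.sleep):
    """Call fetch(key) -> (status, body) and retry on retryable statuses.

    fetch, max_attempts, base_delay and the backoff schedule are hypothetical.
    """
    for attempt in range(max_attempts):
        status, body = fetch(key)
        if status == 200:
            return body
        last_attempt = attempt == max_attempts - 1
        if status not in RETRYABLE_STATUS or last_attempt:
            raise IOError(f"GET {key} failed with status {status}")
        # Exponential backoff between attempts (an assumption of this sketch).
        sleep(base_delay * (2 ** attempt))
```

Passing `sleep` as a parameter keeps the function easy to exercise without real delays.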
  • 9. MessageBuffer •  msgpack-java v06 was the bottleneck –  Inefficient buffer access •  v07 •  Fast memory access •  sun.misc.Unsafe •  Direct access to heap memory •  extract primitive type value from byte[] •  cast •  No boxing
  • 10. Unsafe memory access performance is comparable to C •  http://frsyuki.hatenablog.com/entry/2014/03/12/155231
  • 11. Why is ByteBuffer slow? •  Following good programming practice –  Define interface, then implement classes •  ByteBuffer (an abstract class) has HeapByteBuffer and DirectByteBuffer implementations •  In reality: TypeProfile slows down method access –  JVM generates look-up table of method implementations –  Simply importing one or more classes generates TypeProfile •  v07 avoids TypeProfile generation –  Load an implementation class through Reflection
  • 12. Format Type Detection •  MessageUnpacker –  read prefix: 1 byte –  detect format type •  switch-case –  ANTLR generates this type of code
  • 13. Format Type Detection •  Using a cache-efficient lookup table: 20000x faster
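The two format-detection slides contrast branching on the 1-byte prefix with a precomputed lookup table. The sketch below shows the structure of both, using the prefix ranges from the MessagePack specification. The function and constant names are hypothetical and do not come from msgpack-java. In Python it only illustrates the technique; the speedup quoted on the slide concerns the JVM implementation.

```python
# Single-byte formats 0xC0..0xDF, in prefix order (MessagePack specification).
_SINGLE = [
    "NIL", "NEVER_USED", "FALSE", "TRUE", "BIN8", "BIN16", "BIN32",
    "EXT8", "EXT16", "EXT32", "FLOAT32", "FLOAT64",
    "UINT8", "UINT16", "UINT32", "UINT64",
    "INT8", "INT16", "INT32", "INT64",
    "FIXEXT1", "FIXEXT2", "FIXEXT4", "FIXEXT8", "FIXEXT16",
    "STR8", "STR16", "STR32", "ARRAY16", "ARRAY32", "MAP16", "MAP32",
]


def detect_by_branches(prefix):
    """Branch-based detection, analogous to a switch-case on the prefix."""
    if prefix <= 0x7F:
        return "POSFIXINT"
    if prefix <= 0x8F:
        return "FIXMAP"
    if prefix <= 0x9F:
        return "FIXARRAY"
    if prefix <= 0xBF:
        return "FIXSTR"
    if prefix >= 0xE0:
        return "NEGFIXINT"
    return _SINGLE[prefix - 0xC0]


# Precompute the answer for every possible prefix byte once.
FORMAT_TABLE = [detect_by_branches(b) for b in range(256)]


def detect(prefix):
    """Table-based detection: a single array index per value read."""
    return FORMAT_TABLE[prefix]
```

The table has only 256 entries, so it stays small enough to remain in cache while records are unpacked.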
  • 15. Database As A Service
  • 16. Claremont Report on Database Research •  Discussion on future of DBMS –  Top researchers, vendors and practitioners. –  CACM, Vol. 52 No. 6, 2009 •  Predicts emergence of Cloud Data Service –  SQL has an important role •  limited functionality •  suited for service provider –  A difficult example: Spark •  Need a secure application container to run arbitrary Scala code.
  • 17. Beckman Report on Database Research •  2013 –  http://beckman.cs.wisc.edu/beckman-report2013.pdf –  Topics of Big-Data •  End-to-end service –  From data collection to knowledge •  Cloud Service has become popular –  IaaS, PaaS, SaaS –  Challenge is to migrate all DBMS functionality to the cloud
  • 18. Big Data Simplified: The Treasure Data Approach (architecture diagram) •  Sources: AppServers, Mobile SDKs, Web SDK, Embedded SDKs, Server-side Agents (Treasure Agent) sending multi-structured events (register, login, start_event, purchase, etc.) •  Infinite & Economical Cloud Data Store, Familiar & Table-oriented: ✓ App log data ✓ Mobile event data ✓ Sensor data ✓ Telemetry •  SQL: SQL-based Ad-hoc Queries, SQL-based Dashboards •  Results Push: DBs & Data Marts, Other Apps
  • 19. Challenges in Database as a Service •  Tradeoffs –  Cost and service level objectives (SLOs) •  Run each query set on an independent cluster: fast, $$$ •  Run all queries together on the smallest possible cluster: limited performance guarantee, reasonable price •  Reference –  Workload Management for Big Data Analytics. A. Aboulnaga [SIGMOD2013 Tutorial]
  • 20. Shift of Presto Query Usage •  Initial phase –  Trial and error with queries •  Many syntax errors, semantic errors •  Next phase –  Scheduled query execution •  Increased Presto query usage –  Some customers submit more than 1,000 Presto queries / day –  Establishing typical query patterns •  hourly, daily reports •  query templates •  Advanced phase: More elaborate data analysis –  Complex queries •  via data scientists and data analysts –  High resource usage
  • 21. Usage Shift: Simple to Complex queries
  • 22. Monitoring Presto Usage with Fluentd (figure; panels: Hive, Presto)
  • 23. DataDog •  Monitoring CPU, memory and network usage •  Query stats
  • 24. Query Collection in TD •  SQL query logs –  query, detailed query plan, elapsed time, processed rows, etc. •  Presto is used for analyzing the query history
  • 26. Query Running Time •  More than 90% of queries finish within 2 min., the expected response time for interactive queries
  • 27. Processed Rows of Queries
  • 28. Performance •  Processed rows / sec. of a query
  • 29. Collecting Recoverable Error Patterns •  Presto has no fault tolerance •  Error types –  User error •  Syntax errors –  SQL syntax, missing function •  Semantic errors –  missing tables/columns –  Insufficient resource •  Exceeded task memory size –  Internal failure •  I/O error –  S3/Riak CS •  worker failure •  etc. •  TD Presto retries these queries
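The error taxonomy above suggests a classify-then-retry loop. The sketch below assumes that the retried class is the internal failures, which is what the title of the following slide ("Query Retry on Internal Errors") indicates. The message patterns, `run_query`, and `max_retries` are hypothetical; real Presto error messages and TD's retry policy are not given in the slides.

```python
from enum import Enum


class ErrorKind(Enum):
    USER = "user error"
    INSUFFICIENT_RESOURCE = "insufficient resource"
    INTERNAL = "internal failure"


# Hypothetical substrings used to recognise each error type.
PATTERNS = [
    ("syntax error", ErrorKind.USER),
    ("does not exist", ErrorKind.USER),
    ("exceeded max memory", ErrorKind.INSUFFICIENT_RESOURCE),
    ("i/o error", ErrorKind.INTERNAL),
    ("worker failure", ErrorKind.INTERNAL),
]


def classify(message):
    """Map an error message to an ErrorKind, or None if unrecognised."""
    lowered = message.lower()
    for needle, kind in PATTERNS:
        if needle in lowered:
            return kind
    return None


def run_with_retry(run_query, sql, max_retries=3):
    """Re-run a query only when it failed with an internal error."""
    for attempt in range(max_retries + 1):
        try:
            return run_query(sql)
        except RuntimeError as error:
            retryable = classify(str(error)) is ErrorKind.INTERNAL
            if not retryable or attempt == max_retries:
                raise
```

User errors and resource errors are re-raised immediately, since running the same query again would fail in the same way.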
  • 30. Query Retry on Internal Errors •  More than 99.8% of queries finish without errors
  • 31. Query Retry on Internal Errors (log scale) •  Queries succeed eventually
  • 32. Multi-tenancy: Resource Allocation •  Price-plan based resource allocation •  Parameters –  The number of worker nodes to use (min-candidates) –  The number of hash partitions (initial-hash-partitions) –  The maximum number of running tasks per account •  If running queries exceed the allowed number of tasks, subsequent queries must wait (queued) •  Presto: SqlQueryExecution class –  Controls query execution state: planning -> running -> finished •  No resource allocation policy –  Extended TDSqlQueryExection class monitors running tasks and limits resource usage •  Rewriting SqlQueryExecutionFactory at run-time by using ASM library
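The per-account task limit described above (queries wait in a queue once the allowed number of running tasks is reached) can be sketched as follows. `AccountScheduler`, its methods, and the FIFO ordering are assumptions made for illustration; they are not the TDSqlQueryExection implementation.

```python
from collections import deque


class AccountScheduler:
    """Admit queries for one account up to a maximum number of running tasks."""

    def __init__(self, max_running_tasks):
        self.max_running_tasks = max_running_tasks
        self.running = {}      # query_id -> number of tasks
        self.queued = deque()  # (query_id, tasks) in arrival order

    def _fits(self, tasks):
        # Always admit when nothing is running, so a query larger than the
        # limit cannot block the queue forever (a choice made for this sketch).
        used = sum(self.running.values())
        return not self.running or used + tasks <= self.max_running_tasks

    def submit(self, query_id, tasks):
        """Start the query if it fits, otherwise queue it."""
        if not self.queued and self._fits(tasks):
            self.running[query_id] = tasks
            return "running"
        self.queued.append((query_id, tasks))
        return "queued"

    def finish(self, query_id):
        """Release a query's tasks and start queued queries that now fit."""
        del self.running[query_id]
        started = []
        while self.queued and self._fits(self.queued[0][1]):
            next_id, tasks = self.queued.popleft()
            self.running[next_id] = tasks
            started.append(next_id)
        return started
```

With a limit of 10 tasks, a 6-task query runs, a second 6-task query is queued, and it starts as soon as the first one finishes.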
  • 33. Query Queue •  Presto 0.97 –  Introduces per-user query queues •  Can limit the number of concurrent queries per user •  Problem –  Running too many queries degrades overall query performance
  • 34. Customer Feedback •  Feedback: –  We don’t care if large queries take a long time –  But interactive queries should run immediately •  Challenges –  How do we allocate resources even if preceding queries occupy the customer's share of resources? –  How do we know whether a submitted query is an interactive one?
  • 35. Admission control is necessary •  Adjust resource utilization –  Running Drivers (Splits) –  MPL (Multi-Programming Level)
  • 36. Challenge: Auto Scaling •  Setting the cluster size based on the peak usage is expensive •  But predicting customer usage is difficult
  • 37. Typical Query Patterns [Li Juang] •  Q: What are typical queries of a customer? –  Customer feels some queries are slow –  But we don’t know what to compare with, except scheduled queries •  Approach: Clustering Customer SQLs •  TF/IDF measure: TF x IDF vector –  Split SQL statements into tokens –  Term frequency (TF) = the number of occurrences of each term in a query –  Inverse document frequency (IDF) = log (# of queries / # of queries that have a token) •  k-means clustering –  TF/IDF vector –  Generates clusters of similar queries •  x-means clustering for deciding number of clusters automatically –  D. Pelleg [ICML2000]
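The TF x IDF vector defined above can be computed directly from tokenized SQL. The sketch below follows the slide's definitions (TF as the raw count of a token in a query, IDF as log(# of queries / # of queries that have the token)). The tokenizer regex and the function names are assumptions, and the clustering step (k-means or x-means) is left out.

```python
import math
import re
from collections import Counter


def tokenize(sql):
    """Split a SQL statement into identifiers, numbers and punctuation."""
    return re.findall(r"[a-z_][a-z0-9_]*|\d+|[^\sa-z0-9_]", sql.lower())


def tfidf_vectors(queries):
    """Return one sparse {token: tf * idf} vector per query."""
    docs = [Counter(tokenize(q)) for q in queries]
    # Document frequency: in how many queries each token appears.
    df = Counter()
    for doc in docs:
        df.update(doc.keys())
    n = len(docs)
    idf = {token: math.log(n / count) for token, count in df.items()}
    return [{token: tf * idf[token] for token, tf in doc.items()} for doc in docs]
```

Tokens that occur in every query, such as `select` and `from`, receive an IDF of zero, so the vectors are dominated by the table names, columns, and functions that distinguish one query pattern from another.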
  • 38. Problematic Queries •  90% of queries finish within 2 min. –  But the remaining 10% is still large •  10% of 10,000 queries is 1,000. •  Long-running queries •  Hog queries
  • 39. Long Running Queries •  Typical bottlenecks –  Cross joins –  IN (a, b, c, …) •  semi-join filtering process is slow –  Complex scan condition •  pushing down selection •  but delays column scan –  Tuple materialization •  coordinator generates json data –  Many aggregation columns •  group by 1, 2, 3, 4, 5, 6, … –  Full scan •  Scanning 100 billion rows… •  Adding more resources does not always make a query faster •  Storing intermediate data to disks is necessary •  (Figure labels) Results are buffered (waiting for fetch); slow process; fast; fast
  • 40. Hog Query •  Queries consuming a lot of CPU/memory resources –  Coined in S. Krompass et al. [EDBT2009] •  Example: –  select 1 as day, count(…) from … where time <= current_date - interval 1 day union all select 2 as day, count(…) from … where time <= current_date - interval 2 day union all –  … –  (up to 190 days) •  More than 1000 query stages. •  Presto tries to run all of the stages at once. –  High CPU usage at coordinator
  • 41. Query Rewriting? Plan Optimization? •  Query rewriting (better) –  With group by and window functions –  Not a perfect solution •  Need to understand the meaning of the query •  Semantic change is not allowed –  e.g., We cannot rewrite UNION to UNION ALL –  UNION includes duplicate elimination •  Workaround Idea –  Bushy plan -> Deep plan –  Introduce stage-wise resource assignment
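To see why a rewrite with group by and a window function can replace the UNION ALL hog query, the sketch below computes the same per-day counts in two ways on an in-memory list. It assumes each row is reduced to its age in whole days relative to `current_date`, so that `time <= current_date - interval k day` becomes `age >= k`. That reduction and both function names are illustrative only.

```python
from collections import Counter
from itertools import accumulate


def counts_union_all(ages_in_days, max_days):
    """One full scan per day, like one UNION ALL branch per day."""
    return [
        sum(1 for age in ages_in_days if age >= k)
        for k in range(1, max_days + 1)
    ]


def counts_group_then_cumulate(ages_in_days, max_days):
    """Single scan: group rows by day, then take a running total."""
    # GROUP BY day; rows older than max_days fall into the last bucket.
    per_day = Counter(min(age, max_days) for age in ages_in_days if age >= 1)
    # Running total from the oldest bucket to the newest, which is what a
    # cumulative SUM(...) OVER (ORDER BY day DESC) window would compute.
    running = list(accumulate(per_day[k] for k in range(max_days, 0, -1)))
    return running[::-1]
```

Both functions return the same list, but the second touches each row once instead of once per day, which is the motivation for preferring the rewrite when the query's meaning allows it.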
  • 42. Future Work •  Reducing Queuing/Response Time –  Introducing shared queue between customers •  For utilizing remaining cluster resources –  Fair-Scheduling: C. Gupta [EDBT2009] –  Self-tuning DBMS. S. Chaudhuri [VLDB2007] •  Adjusting Running Query Size (hard) –  Limiting driver resources to be as small as possible for hog queries –  Query plan based cost estimation •  Predicting Query Running Time –  J. Duggan [SIGMOD2011], A.C. Konig [VLDB2011]
  • 43. Summary: Treasures in Treasure Data •  Treasures for our customers –  Data collected by fluentd (td-agent) –  Query analysis platform –  Query results - values •  For Treasure Data –  SQL query logs •  Stored in Treasure Data –  We know how customers use SQL •  Typical queries and failures –  We know which part of a query can be improved