Page1 © Hortonworks Inc. 2014
Cost-based query optimization in
Apache Hive 0.14
Julian Hyde
Seattle
September 24th, 2014
Page2 © Hortonworks Inc. 2014
About me
Julian Hyde
Architect at Hortonworks
Open source:
• Founder & lead, Apache Optiq (query optimization framework)
• Founder & lead, Pentaho Mondrian (analysis engine)
• Committer, Apache Drill
• Contributor, Apache Hive
• Contributor, Cascading Lingual (SQL interface to Cascading)
Past:
• SQLstream (streaming SQL)
• Broadbase (data warehouse)
• Oracle (SQL kernel development)
Page3 © Hortonworks Inc. 2014
Hadoop - A New Data Architecture for New Data
APPLICATIONS
DATA SYSTEM
REPOSITORIES
SOURCES
Existing Sources
(CRM, ERP, Clickstream, Logs)
RDBMS EDW MPP
Business Analytics
Custom Applications
Packaged Applications
OLTP, ERP, CRM Systems
Unstructured documents, emails
Clickstream
Server logs
Sentiment, Web Data
Sensor, Machine Data
Geolocation
New Data Requirements:
• Scale
• Economics
• Flexibility
Traditional Data Architecture
Page4 © Hortonworks Inc. 2014
Interactive SQL-IN-Hadoop Delivered
Stinger Initiative – DELIVERED
Next generation SQL based
interactive query in Hadoop
Speed
Improve Hive query performance by 100X to allow for interactive query times (seconds)
Scale
The only SQL interface to Hadoop designed for queries that scale
from TB to PB
SQL
Support broadest range of SQL semantics for analytic applications
running against Hadoop
Apache Hive Contribution… an Open Community at its finest
1,672 Jira Tickets Closed
145 Developers
44 Companies
~390,000 Lines Of Code Added… (2x)
Business Analytics, Custom Apps
SQL: Apache Hive
Apache MapReduce, Apache Tez
Apache YARN
HDFS (Hadoop Distributed File System), nodes 1 … N
Stinger Project
Stinger Phase 1:
• Base Optimizations
• SQL Types
• SQL Analytic Functions
• ORCFile Modern File Format
Stinger Phase 2:
• SQL Types
• SQL Analytic Functions
• Advanced Optimizations
• Performance Boosts via YARN
Stinger Phase 3
• Hive on Apache Tez
• Query Service (always on)
• Buffer Cache
• Cost Based Optimizer (Optiq)
13 Months
Governance & Integration
Security
Operations
Data Access
Data Management
HDP 2.1
ORC File
Window Functions
Page5 © Hortonworks Inc. 2014
Hive – Single tool for all SQL use cases
OLTP, ERP, CRM Systems
Unstructured documents, emails
Clickstream
Server logs
Sentiment, Web Data
Sensor, Machine Data
Geolocation
Interactive Analytics
Batch Reports / Deep Analytics
Hive - SQL
ETL / ELT
Page6 © Hortonworks Inc. 2014
Incremental cutover to cost-based optimization
Release | Date | Remarks
Apache Hive 0.12 | October 2013 | Rule-based optimizations; no join reordering; main optimizations: predicate push-down & partition pruning; semantic info and operator tree tightly coupled
Apache Hive 0.13 | April 2014 | “Old-style” JOIN & push-down conditions: … FROM t1, t2 WHERE …
HDP 2.1 | April 2014 | Cost-based ordering of joins
Apache Hive 0.14 | soon | Bushy joins, large joins, better operator coverage, better statistics, …
Page7 © Hortonworks Inc. 2014
Apache Optiq
(incubating)
Page8 © Hortonworks Inc. 2014
Apache Optiq
Apache incubator project since May, 2014
Query planning framework
• Relational algebra, rewrite rules, cost model
• Extensible
• Usable standalone (JDBC) or embedded
Adoption
• Lingual – SQL interface to Cascading
• Apache Drill
• Apache Hive
Adapters: Splunk, Spark, MongoDB, JDBC, CSV, JSON, Web tables, In-memory data
Page9 © Hortonworks Inc. 2014
Conventional DB architecture
Page10 © Hortonworks Inc. 2014
Optiq architecture
Page11 © Hortonworks Inc. 2014
MySQL
Splunk
Expression tree
SELECT p."product_name", COUNT(*) AS c
FROM "splunk"."splunk" AS s
JOIN "mysql"."products" AS p
ON s."product_id" = p."product_id"
WHERE s."action" = 'purchase'
GROUP BY p."product_name"
ORDER BY c DESC
join (Key: product_id)
group (Key: product_name, Agg: count)
filter (Condition: action = 'purchase')
sort (Key: c DESC)
scan (Table: splunk)
scan (Table: products)
Page12 © Hortonworks Inc. 2014
Splunk
Expression tree
(optimized)
SELECT p."product_name", COUNT(*) AS c
FROM "splunk"."splunk" AS s
JOIN "mysql"."products" AS p
ON s."product_id" = p."product_id"
WHERE s."action" = 'purchase'
GROUP BY p."product_name"
ORDER BY c DESC
join (Key: product_id)
group (Key: product_name, Agg: count)
filter (Condition: action = 'purchase')
sort (Key: c DESC)
scan (Table: splunk)
MySQL
scan (Table: products)
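One rewrite an optimizer can apply to a tree like this is to push the filter below the join, so that it runs next to the scan of the table it refers to. The sketch below is a toy model in Python, not Optiq's API: the `Scan`, `Filter` and `Join` classes and the `push_filter_below_join` helper are hypothetical names, and the assumption that the optimized tree on this slide is the result of such a push-down is ours, since the diagram itself did not survive in the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scan:
    table: str


@dataclass(frozen=True)
class Filter:
    column: str   # qualified as "table.column" (a convention of this sketch)
    value: str
    child: object


@dataclass(frozen=True)
class Join:
    key: str
    left: object
    right: object


def tables(node):
    """Return the set of table names scanned beneath a node."""
    if isinstance(node, Scan):
        return {node.table}
    if isinstance(node, Filter):
        return tables(node.child)
    return tables(node.left) | tables(node.right)


def push_filter_below_join(node):
    """Rewrite Filter(Join(l, r)) so the filter sits on the side that owns its column."""
    if isinstance(node, Filter) and isinstance(node.child, Join):
        join = node.child
        table = node.column.split(".")[0]
        if table in tables(join.left):
            return Join(join.key, Filter(node.column, node.value, join.left), join.right)
        if table in tables(join.right):
            return Join(join.key, join.left, Filter(node.column, node.value, join.right))
    return node  # rule does not match: leave the tree unchanged


# Filter above the join, as written in the SQL.
before = Filter("splunk.action", "purchase",
                Join("product_id", Scan("splunk"), Scan("products")))
# Filter moved next to the splunk scan.
after = push_filter_below_join(before)
```

Both trees return the same rows; the second one joins fewer of them.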
Page13 © Hortonworks Inc. 2014
Optiq – APIs and SPIs
Cost, statistics
RelOptCost
RelOptCostFactory
RelMetadataProvider
• RelMdColumnUniqueness
• RelMdDistinctRowCount
• RelMdSelectivity
SQL parser
SqlNode
SqlParser
SqlValidator
Transformation rules
RelOptRule
• MergeFilterRule
• PushAggregateThroughUnionRule
• RemoveCorrelationForScalarProjectRule
• 100+ more
Unification (materialized view)
Column trimming
Relational algebra
RelNode (operator)
• TableScan
• Filter
• Project
• Union
• Aggregate
• …
RelDataType (type)
RexNode (expression)
RelTrait (physical property)
• RelConvention (calling-convention)
• RelCollation (sortedness)
• TBD (bucketedness/distribution)
JDBC driver
Metadata
Schema
Table
Function
• TableFunction
• TableMacro
Page14 © Hortonworks Inc. 2014
Under development in Optiq - materialized views
Query: SELECT x, SUM(y) FROM t GROUP BY x
In-memory materialized queries
Tables on disk
http://hortonworks.com/blog/dmmq/
Page15 © Hortonworks Inc. 2014
Now… back to Hive
Page16 © Hortonworks Inc. 2014
CBO in Hive
Why cost-based optimization?
Ease of Use – Join Reordering
View Chaining
Ad hoc queries involving multiple views
Enables BI Tools as front ends to Hive
More efficient & maintainable query preparation process
Laying the groundwork for deeper optimizations, e.g. materialized views
Page17 © Hortonworks Inc. 2014
Query preparation – Hive 0.13
SQL parser
Semantic analyzer
Logical Optimizer
Physical Optimizer
Abstract Syntax Tree (AST)
Hive SQL
Annotated AST
Plan
Tez
Tuned Plan
Page18 © Hortonworks Inc. 2014
Query preparation – full CBO
SQL parser
Semantic analyzer
Translate to algebra
Physical Optimizer
Abstract Syntax Tree (AST)
Hive SQL
Tez
Tuned Plan
Optiq optimizer
RelNode
Annotated AST
Page19 © Hortonworks Inc. 2014
Query preparation – Hive 0.14
SQL parser
Semantic analyzer
Logical Optimizer
Physical Optimizer
Hive SQL
AST with optimized join-ordering
Tez
Tuned Plan
Translate to algebra
Optiq optimizer
Page20 © Hortonworks Inc. 2014
Query Execution – The basics
SELECT R1.x
FROM R1
JOIN R2 ON R1.x = R2.x
JOIN R3 on R1.x = R3.x AND R2.x = R3.x
WHERE R1.z > 10;
Relational algebra: π, σ, ⋈, ⋈ over R1, R2, R3
TS [R1]
TS [R2]
RS
RS
Shuffle Join
TS [R3]
Map Join
Filter
FS
Page21 © Hortonworks Inc. 2014
Optiq Planner Process
Hive Plan
Planner
RelNode Graph
RelNode Converter: Hive Op → RelNode
RexNode Converter: Hive Expr → RexNode
1. Plan Graph
• Node for each node in input plan
• Each node is a set of alternate sub-plans
• Set further divided into subsets, based on traits like sortedness
2. Rules
• Rule: specifies an operator sub-graph to match and logic to generate an equivalent ‘better’ sub-graph
• We only have join-reordering rules
3. Cost Model
• RelNodes have cost (& cumulative cost)
• We only use cardinality for cost
4. Metadata Providers
• Used to plug in schema and cost formulas: selectivity, NDV calculations, etc.
• We only added selectivity and NDV formulas; schema is only available at the node level
Rule Match Queue
• Add rule matches to queue
• Apply rule-match transformations to plan graph
• Iterate for a fixed number of iterations or until cost doesn’t change
• Match importance based on cost of RelNode and height
Best RelNode Graph
AST Converter
Revised AST
Logical Plan
Physical traits: Table Part./Buckets; RedSink Ops removed
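The loop described above (keep a set of alternative plans, apply rules, stop after a fixed number of iterations or when the cost stops changing) can be sketched in Python. This is a deliberately small model, not Optiq's planner: a plan is just a left-deep join order, the only rule swaps neighbouring tables, cost is the sum of intermediate cardinalities, and every name and number (`ROWS`, `JOIN_DIVISOR`, `swap_adjacent`, `optimize`) is made up for illustration.

```python
ROWS = {"a": 1000, "b": 10, "c": 100}  # made-up table cardinalities
JOIN_DIVISOR = 100                     # made-up: each join keeps 1 row in 100 of the cross product


def swap_adjacent(order):
    """Rule: yield every join order obtained by swapping two neighbouring tables."""
    for i in range(len(order) - 1):
        swapped = list(order)
        swapped[i], swapped[i + 1] = swapped[i + 1], swapped[i]
        yield tuple(swapped)


def cardinality_cost(order):
    """Cost model: sum of intermediate result sizes of a left-deep join in this order."""
    rows, total = ROWS[order[0]], 0
    for table in order[1:]:
        rows = rows * ROWS[table] // JOIN_DIVISOR
        total += rows
    return total


def optimize(plan, rules, cost, max_iterations=10):
    """Grow the set of equivalent plans until the best cost stops changing."""
    known = {plan}
    best = plan
    for _ in range(max_iterations):
        # Apply every rule to every known plan and remember the new alternatives.
        known |= {alt for p in known for rule in rules for alt in rule(p)}
        candidate = min(known, key=cost)
        if cost(candidate) == cost(best):
            break  # cost did not change in this iteration
        best = candidate
    return best, cost(best)


best_order, best_cost = optimize(("a", "c", "b"), [swap_adjacent], cardinality_cost)
```

Starting from the order a, c, b, the sketch ends up joining the two small tables first and the large one last.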
Page22 © Hortonworks Inc. 2014
Star schema
Sales Inventory Time
Product
Customer
Warehouse
Key
Fact table
Dimension table
Many-to-one relationship
Page23 © Hortonworks Inc. 2014
Query combining two stars
SELECT product.id, sum(sales.units), sum(inventory.on_hand)
FROM sales
JOIN customer ON …
JOIN time ON …
JOIN product ON …
JOIN inventory ON …
JOIN warehouse ON …
WHERE time.year = 2014
AND time.quarter = 'Q1'
AND product.color = 'Red'
AND warehouse.state = 'WA'
GROUP BY …
Sales Inventory Time
Product
Customer
Warehouse
Page24 © Hortonworks Inc. 2014
Left-deep tree
Initial tree is “left-deep”
No join node is the right child of its parent
Join-ordering algorithm chooses what order to join tables; it does not re-shape the tree
Typical plan:
• Start with largest table at bottom left
• Join tables with more selective conditions first
Sales Customer
Time
Product
Inventory
Warehouse
Page25 © Hortonworks Inc. 2014
Bushy tree
No restrictions on where join nodes can occur
“Bushes” consist of fact tables (Sales and Inventory) surrounded by many-to-one related dimension tables
Dimension tables have a filtering effect
This tree produces the same result as the previous plan but is more efficient
Sales Customer
Time
Product
Inventory Warehouse
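To see why the bushy shape can be cheaper, the sketch below estimates both plans with a simple cardinality model in Python. Everything here is an assumption made for illustration: the row counts, the key-domain sizes, the `scan`/`join`/`estimate` helpers, and the rule that a join produces `left_rows * right_rows // key_domain` rows. It is not Hive's cost model.

```python
def scan(name, rows):
    """Leaf node; rows is the (made-up) row count after any filter on the table."""
    return ("scan", name, rows)


def join(left, right, key_domain):
    """Join node; key_domain is the (made-up) number of distinct join-key values."""
    return ("join", left, right, key_domain)


def estimate(node):
    """Return (output_rows, total_intermediate_rows) for a plan tree."""
    if node[0] == "scan":
        return node[2], 0
    _, left, right, key_domain = node
    left_rows, left_cost = estimate(left)
    right_rows, right_cost = estimate(right)
    rows = left_rows * right_rows // key_domain
    return rows, left_cost + right_cost + rows


sales = scan("sales", 1_000_000)
customer = scan("customer", 10_000)   # unfiltered, 10,000 customers
time = scan("time", 90)               # 90 of 360 days pass the filter
product = scan("product", 10)         # 10 of 100 products pass the filter
inventory = scan("inventory", 100_000)
warehouse = scan("warehouse", 5)      # 5 of 50 warehouses pass the filter

# One fact table with its dimensions, shared by both plans.
sales_bush = join(join(join(sales, customer, 10_000), time, 360), product, 100)

# Left-deep: inventory is joined before its own dimension has filtered it.
left_deep = join(join(sales_bush, inventory, 100), warehouse, 50)

# Bushy: inventory is filtered by warehouse first, then the two bushes are joined.
bushy = join(sales_bush, join(inventory, warehouse, 50), 100)
```

Under these numbers both trees produce 2,500,000 rows, but the left-deep plan materializes 28,775,000 intermediate rows against 3,785,000 for the bushy plan.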
Page26 © Hortonworks Inc. 2014
Two approaches to join optimization
Algorithm #1 “exhaustive search”
Apply 3 transformation rules exhaustively:
• SwapJoinRule: A join B → B join A
• PushJoinThroughJoinRule: (A join B) join C → (A join C) join B
• CommutativeJoinRule: (A join B) join C → A join (B join C)
Finds every possible plan, but not practical for more than ~8 joins
Algorithm #2 “greedy”
Build a graph iteratively
Use heuristics to choose the “best” node to add next
Applying them to Hive
We can use both algorithms – and we do!
Both are sensitive to bad statistics – e.g. poor estimation of intermediate result set sizes
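A minimal Python sketch of the two approaches, under loud assumptions: it only orders dimension joins around one fact table, the row count and the fractions in `KEEP` are invented, and `exhaustive` and `greedy` are hypothetical helpers rather than the algorithms as implemented in Optiq or Hive. The exhaustive version tries every order; the greedy version picks the most selective join next.

```python
from itertools import permutations

ROWS = {"sales": 1_000_000}  # made-up fact table size
# Made-up fraction of fact rows kept by joining each filtered dimension: (numerator, denominator).
KEEP = {"customer": (1, 1), "time": (1, 4), "product": (1, 10)}


def plan_cost(fact, dims):
    """Sum of intermediate result sizes when joining dims to fact in the given order."""
    rows, total = ROWS[fact], 0
    for dim in dims:
        numerator, denominator = KEEP[dim]
        rows = rows * numerator // denominator
        total += rows
    return total


def exhaustive(fact, dims):
    """Algorithm #1: cost every possible order and keep the cheapest."""
    return min(permutations(dims), key=lambda order: plan_cost(fact, order))


def greedy(fact, dims):
    """Algorithm #2: repeatedly add the join that keeps the smallest fraction of rows."""
    remaining, order = set(dims), []
    while remaining:
        nxt = min(remaining, key=lambda d: KEEP[d][0] / KEEP[d][1])
        order.append(nxt)
        remaining.remove(nxt)
    return tuple(order)


dims = ["customer", "time", "product"]
best_exhaustive = exhaustive("sales", dims)
best_greedy = greedy("sales", dims)
```

With three dimensions the exhaustive search looks at 6 orders; the count grows factorially with the number of joins, which is why a heuristic is needed for large queries. Both functions depend entirely on the fractions in `KEEP`, so a bad estimate there misleads both.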
Page27 © Hortonworks Inc. 2014
Statistics
Feeding the beast
CBO disabled if your tables don’t have statistics
• No longer require statistics on all columns, just join columns
Better optimizations need ever-better statistics… so, statistics are getting better
Kinds of statistics
Raw statistics on stored data: row counts, number-of-distinct-values (NDV)
Statistics on intermediate operators, computed using selectivity estimates
• Much improved selectivity estimates this release, based on NDVs
• Planned improvements to raw statistics (e.g. histograms, unique keys, sort order) will help
• Materialized views
Run-time statistics
• Example 1: 90% of the rows in this join have the same key → use skew join
• Example 2: Only 10 distinct values of GROUP BY key → auto-reduce parallelism
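As an illustration of how NDVs feed selectivity estimates, here is a Python sketch of the usual textbook formulas: an equality predicate keeps about 1/NDV of the rows, and an equi-join produces about |R|·|S| / max(NDV) rows. The slides do not spell out the formulas Hive uses, so treat these functions and their names as assumptions, not as Hive's implementation.

```python
def equality_selectivity(ndv):
    """Fraction of rows expected to match `column = constant`, assuming uniform values."""
    return 1.0 / ndv


def filtered_rows(rows, ndv):
    """Estimated row count after an equality filter on a column with `ndv` distinct values."""
    return int(rows * equality_selectivity(ndv))


def join_cardinality(left_rows, left_ndv, right_rows, right_ndv):
    """Estimated output rows of an equi-join on a key with the given NDV on each side."""
    return left_rows * right_rows // max(left_ndv, right_ndv)


# Made-up example: a fact table joined to a dimension on a key with 1,000 distinct values.
fact_rows, dim_rows, key_ndv = 1_000_000, 1_000, 1_000
joined = join_cardinality(fact_rows, key_ndv, dim_rows, key_ndv)

# Made-up example: filter the dimension on a column with 4 distinct values before joining.
dim_after_filter = filtered_rows(dim_rows, 4)
```

These estimates compound through a plan, so an NDV that is off by a factor of ten near the leaves can distort every intermediate cardinality above it.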
Page28 © Hortonworks Inc. 2014
Stored statistics – recent improvements
ANALYZE
Clean up command syntax
Faster computation
Table vs partition statistics
All statistics now stored per partition
Statistics retrieval
Faster retrieval
Merge partition statistics
Extrapolate for missing statistics
Extrapolation
SQL:
SELECT productId, COUNT(*)
FROM Sales
WHERE year = 2014
GROUP BY productId
Required statistic: NDV(productId)
Partitions: 2013 Q1, 2013 Q2, 2013 Q3, 2013 Q4, 2014 Q1, 2014 Q2, 2014 Q3, 2014 Q4
Legend: statistics available; used in query
Extrapolate {Q1, Q2, Q3} stats for Q4
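The slide does not say how the missing statistic is computed. One simple possibility, shown here purely as an assumption, is to fit a straight line through the values of the partitions that do have statistics and read off the value for the missing one. The `extrapolate` helper and the sample numbers are hypothetical.

```python
def extrapolate(known, target):
    """Fit y = a + b*x by least squares to (x, y) pairs and evaluate the line at target."""
    n = len(known)
    mean_x = sum(x for x, _ in known) / n
    mean_y = sum(y for _, y in known) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in known)
    if sxx == 0:
        return mean_y  # only one distinct x: fall back to the average
    slope = sum((x - mean_x) * (y - mean_y) for x, y in known) / sxx
    return mean_y + slope * (target - mean_x)


# Made-up per-partition values of some statistic for quarters 1 to 3.
known_stats = [(1, 100), (2, 110), (3, 120)]
q4_estimate = extrapolate(known_stats, 4)
```

A linear trend is only a reasonable guess for statistics that change smoothly from one partition to the next; it says nothing about how distinct values overlap between partitions.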
Page29 © Hortonworks Inc. 2014
Dynamic partition pruning
Consider a query with a partitioned fact table and filters on the dimension table:
SELECT … FROM Sales
JOIN Time ON Sales.time_id = Time.time_id
WHERE time.year = 2014 AND time.quarter IN ('Q1', 'Q2')
At execute time, DAG figures out which partitions could possibly match, and cancels scans of the others
Partitions: 2013 Q1, 2013 Q2, 2013 Q3, 2013 Q4, 2014 Q1, 2014 Q2, 2014 Q3, 2014 Q4
Time
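The idea can be sketched in Python: evaluate the dimension filter first, collect the join keys that survive, and scan only the fact partitions whose key is in that set. The data layout (one Sales partition per `time_id`) and the helper names `matching_keys` and `prune` are assumptions for illustration; Hive does this inside the running DAG, not in user code.

```python
# Made-up Time dimension: time_id 1..8 covers 2013 Q1 through 2014 Q4.
TIME = [
    {"time_id": i + 1, "year": 2013 + i // 4, "quarter": f"Q{i % 4 + 1}"}
    for i in range(8)
]

# Assumption of this sketch: Sales has one partition per time_id.
SALES_PARTITIONS = [1, 2, 3, 4, 5, 6, 7, 8]


def matching_keys(dimension_rows, predicate, key):
    """Run the dimension-side filter and collect the join keys that survive it."""
    return {row[key] for row in dimension_rows if predicate(row)}


def prune(partitions, keys):
    """Keep only the fact partitions whose key can possibly match."""
    return [p for p in partitions if p in keys]


keys = matching_keys(
    TIME,
    lambda row: row["year"] == 2014 and row["quarter"] in ("Q1", "Q2"),
    "time_id",
)
to_scan = prune(SALES_PARTITIONS, keys)
```

Here only two of the eight partitions are scanned. The pruning cannot be done at compile time because the set of matching keys is only known once the filter on Time has run.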
Page30 © Hortonworks Inc. 2014
Summary
Join-ordering: (exhaustive & heuristic), scalability, bushy joins
Statistics – faster, better, extrapolate if stats missing
Very few operators that CBO can’t handle: TABLESAMPLE, SCRIPT, multi-INSERT
Dynamic partition pruning
Auto-reduce parallelism
Page31 © Hortonworks Inc. 2014
Show me the numbers…
Page32 © Hortonworks Inc. 2014
TPC-DS (30TB) Q17
Joins Store Sales, Store Returns and Catalog Sales fact tables.
Each of the fact tables is independently restricted by time.
Analysis at Item and Store grain, so these dimensions are also joined in.
As specified, the query starts by joining the 3 fact tables.
SELECT i_item_id
,i_item_desc
,s_state
,count(ss_quantity) as store_sales_quantitycount
,….
FROM store_sales ss, store_returns sr, catalog_sales cs,
date_dim d1, date_dim d2, date_dim d3, store s, item i
WHERE d1.d_quarter_name = '2000Q1'
AND d1.d_date_sk = ss.ss_sold_date_sk
AND i.i_item_sk = ss.ss_item_sk AND …
GROUP BY i_item_id, i_item_desc, s_state
ORDER BY i_item_id, i_item_desc, s_state
LIMIT 100;
CBO | Elapsed (s) | Intermediate data (GB)
Off | 10,683 | 5,017
On | 1,284 | 275
Page33 © Hortonworks Inc. 2014
TPC-DS (200G) queries
Query | Hive 14 CBO off | Hive 14 CBO on | Gain
Q15 | 84 | 44 | 91%
Q22 | 123 | 99 | 24%
Q29 | 1677 | 48 | 3,394%
Q40 | 118 | 29 | 307%
Q51 | 276 | 80 | 245%
Q80 | 842 | 70 | 1,103%
Q82 | 278 | 23 | 1,109%
Q87 | 275 | 51 | 439%
Q92 | 511 | 80 | 539%
Q93 | 160 | 69 | 132%
Q97 | 483 | 79 | 511%
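The Gain column is consistent with (CBO off / CBO on − 1) × 100, rounded to the nearest percent. That reading is inferred from the numbers rather than stated on the slide; the Python check below (with a hypothetical `gain_percent` helper) reproduces every row of the table.

```python
def gain_percent(cbo_off, cbo_on):
    """Speed-up of CBO on relative to CBO off, as a rounded percentage."""
    return round((cbo_off / cbo_on - 1) * 100)


# (query, CBO off, CBO on, gain as printed on the slide)
RESULTS = [
    ("Q15", 84, 44, 91),
    ("Q22", 123, 99, 24),
    ("Q29", 1677, 48, 3394),
    ("Q40", 118, 29, 307),
    ("Q51", 276, 80, 245),
    ("Q80", 842, 70, 1103),
    ("Q82", 278, 23, 1109),
    ("Q87", 275, 51, 439),
    ("Q92", 511, 80, 539),
    ("Q93", 160, 69, 132),
    ("Q97", 483, 79, 511),
]

# Collect any row where the formula disagrees with the slide.
mismatches = [q for q, off, on, gain in RESULTS if gain_percent(off, on) != gain]
```

So a gain of 100% means the query ran twice as fast with CBO on, and Q29's 3,394% corresponds to roughly a 35x speed-up.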
Page34 © Hortonworks Inc. 2014
Stinger.next
• SQL compliance: interval data type, non-equi joins, set operators, more
sub-queries
• Transactions (COMMIT, savepoint, rollback)
• LLAP
• Materialized views
• In-memory
• Automatic or manual
http://hortonworks.com/blog/stinger-next-enterprise-sql-hadoop-scale-apache-hive/
Page35 © Hortonworks Inc. 2014
Thank you!
@julianhyde
http://hive.apache.org/
http://optiq.incubator.apache.org/

More Related Content

What's hot

PySpark dataframe
PySpark dataframePySpark dataframe
PySpark dataframeJaemun Jung
 
Apache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
Apache Spark Data Source V2 with Wenchen Fan and Gengliang WangApache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
Apache Spark Data Source V2 with Wenchen Fan and Gengliang WangDatabricks
 
Introduction to Apache Calcite
Introduction to Apache CalciteIntroduction to Apache Calcite
Introduction to Apache CalciteJordan Halterman
 
Parallelizing with Apache Spark in Unexpected Ways
Parallelizing with Apache Spark in Unexpected WaysParallelizing with Apache Spark in Unexpected Ways
Parallelizing with Apache Spark in Unexpected WaysDatabricks
 
Lessons for the optimizer from running the TPC-DS benchmark
Lessons for the optimizer from running the TPC-DS benchmarkLessons for the optimizer from running the TPC-DS benchmark
Lessons for the optimizer from running the TPC-DS benchmarkSergey Petrunya
 
Performance Analysis of Apache Spark and Presto in Cloud Environments
Performance Analysis of Apache Spark and Presto in Cloud EnvironmentsPerformance Analysis of Apache Spark and Presto in Cloud Environments
Performance Analysis of Apache Spark and Presto in Cloud EnvironmentsDatabricks
 
The evolution of Apache Calcite and its Community
The evolution of Apache Calcite and its CommunityThe evolution of Apache Calcite and its Community
The evolution of Apache Calcite and its CommunityJulian Hyde
 
Unique ID generation in distributed systems
Unique ID generation in distributed systemsUnique ID generation in distributed systems
Unique ID generation in distributed systemsDave Gardner
 
Efficient Data Storage for Analytics with Apache Parquet 2.0
Efficient Data Storage for Analytics with Apache Parquet 2.0Efficient Data Storage for Analytics with Apache Parquet 2.0
Efficient Data Storage for Analytics with Apache Parquet 2.0Cloudera, Inc.
 
Introducing DataFrames in Spark for Large Scale Data Science
Introducing DataFrames in Spark for Large Scale Data ScienceIntroducing DataFrames in Spark for Large Scale Data Science
Introducing DataFrames in Spark for Large Scale Data ScienceDatabricks
 
Introduction to Spark Internals
Introduction to Spark InternalsIntroduction to Spark Internals
Introduction to Spark InternalsPietro Michiardi
 
File Format Benchmarks - Avro, JSON, ORC, & Parquet
File Format Benchmarks - Avro, JSON, ORC, & ParquetFile Format Benchmarks - Avro, JSON, ORC, & Parquet
File Format Benchmarks - Avro, JSON, ORC, & ParquetOwen O'Malley
 
ORC File - Optimizing Your Big Data
ORC File - Optimizing Your Big DataORC File - Optimizing Your Big Data
ORC File - Optimizing Your Big DataDataWorks Summit
 
Apache Calcite: One planner fits all
Apache Calcite: One planner fits allApache Calcite: One planner fits all
Apache Calcite: One planner fits allJulian Hyde
 
Parquet Hadoop Summit 2013
Parquet Hadoop Summit 2013Parquet Hadoop Summit 2013
Parquet Hadoop Summit 2013Julien Le Dem
 
Building a SIMD Supported Vectorized Native Engine for Spark SQL
Building a SIMD Supported Vectorized Native Engine for Spark SQLBuilding a SIMD Supported Vectorized Native Engine for Spark SQL
Building a SIMD Supported Vectorized Native Engine for Spark SQLDatabricks
 
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...Spark Summit
 
Deep Dive into the New Features of Apache Spark 3.0
Deep Dive into the New Features of Apache Spark 3.0Deep Dive into the New Features of Apache Spark 3.0
Deep Dive into the New Features of Apache Spark 3.0Databricks
 
Apache Spark Core – Practical Optimization
Apache Spark Core – Practical OptimizationApache Spark Core – Practical Optimization
Apache Spark Core – Practical OptimizationDatabricks
 

What's hot (20)

PySpark dataframe
PySpark dataframePySpark dataframe
PySpark dataframe
 
Apache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
Apache Spark Data Source V2 with Wenchen Fan and Gengliang WangApache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
Apache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
 
Introduction to Apache Calcite
Introduction to Apache CalciteIntroduction to Apache Calcite
Introduction to Apache Calcite
 
Parallelizing with Apache Spark in Unexpected Ways
Parallelizing with Apache Spark in Unexpected WaysParallelizing with Apache Spark in Unexpected Ways
Parallelizing with Apache Spark in Unexpected Ways
 
Lessons for the optimizer from running the TPC-DS benchmark
Lessons for the optimizer from running the TPC-DS benchmarkLessons for the optimizer from running the TPC-DS benchmark
Lessons for the optimizer from running the TPC-DS benchmark
 
Rds data lake @ Robinhood
Rds data lake @ Robinhood Rds data lake @ Robinhood
Rds data lake @ Robinhood
 
Performance Analysis of Apache Spark and Presto in Cloud Environments
Performance Analysis of Apache Spark and Presto in Cloud EnvironmentsPerformance Analysis of Apache Spark and Presto in Cloud Environments
Performance Analysis of Apache Spark and Presto in Cloud Environments
 
The evolution of Apache Calcite and its Community
The evolution of Apache Calcite and its CommunityThe evolution of Apache Calcite and its Community
The evolution of Apache Calcite and its Community
 
Unique ID generation in distributed systems
Unique ID generation in distributed systemsUnique ID generation in distributed systems
Unique ID generation in distributed systems
 
Efficient Data Storage for Analytics with Apache Parquet 2.0
Efficient Data Storage for Analytics with Apache Parquet 2.0Efficient Data Storage for Analytics with Apache Parquet 2.0
Efficient Data Storage for Analytics with Apache Parquet 2.0
 
Introducing DataFrames in Spark for Large Scale Data Science
Introducing DataFrames in Spark for Large Scale Data ScienceIntroducing DataFrames in Spark for Large Scale Data Science
Introducing DataFrames in Spark for Large Scale Data Science
 
Introduction to Spark Internals
Introduction to Spark InternalsIntroduction to Spark Internals
Introduction to Spark Internals
 
File Format Benchmarks - Avro, JSON, ORC, & Parquet
File Format Benchmarks - Avro, JSON, ORC, & ParquetFile Format Benchmarks - Avro, JSON, ORC, & Parquet
File Format Benchmarks - Avro, JSON, ORC, & Parquet
 
ORC File - Optimizing Your Big Data
ORC File - Optimizing Your Big DataORC File - Optimizing Your Big Data
ORC File - Optimizing Your Big Data
 
Apache Calcite: One planner fits all
Apache Calcite: One planner fits allApache Calcite: One planner fits all
Apache Calcite: One planner fits all
 
Parquet Hadoop Summit 2013
Parquet Hadoop Summit 2013Parquet Hadoop Summit 2013
Parquet Hadoop Summit 2013
 
Building a SIMD Supported Vectorized Native Engine for Spark SQL
Building a SIMD Supported Vectorized Native Engine for Spark SQLBuilding a SIMD Supported Vectorized Native Engine for Spark SQL
Building a SIMD Supported Vectorized Native Engine for Spark SQL
 
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...
 
Deep Dive into the New Features of Apache Spark 3.0
Deep Dive into the New Features of Apache Spark 3.0Deep Dive into the New Features of Apache Spark 3.0
Deep Dive into the New Features of Apache Spark 3.0
 
Apache Spark Core – Practical Optimization
Apache Spark Core – Practical OptimizationApache Spark Core – Practical Optimization
Apache Spark Core – Practical Optimization
 

Viewers also liked

The twins that everyone loved too much
The twins that everyone loved too muchThe twins that everyone loved too much
The twins that everyone loved too muchJulian Hyde
 
Apache Calcite overview
Apache Calcite overviewApache Calcite overview
Apache Calcite overviewJulian Hyde
 
Cost-based Query Optimization in Hive
Cost-based Query Optimization in HiveCost-based Query Optimization in Hive
Cost-based Query Optimization in HiveDataWorks Summit
 
How Pony ORM translates Python generators to SQL queries
How Pony ORM translates Python generators to SQL queriesHow Pony ORM translates Python generators to SQL queries
How Pony ORM translates Python generators to SQL queriesponyorm
 
Hive join optimizations
Hive join optimizationsHive join optimizations
Hive join optimizationsSzehon Ho
 
Discardable In-Memory Materialized Queries With Hadoop
Discardable In-Memory Materialized Queries With HadoopDiscardable In-Memory Materialized Queries With Hadoop
Discardable In-Memory Materialized Queries With HadoopJulian Hyde
 
What's new in Mondrian 4?
What's new in Mondrian 4?What's new in Mondrian 4?
What's new in Mondrian 4?Julian Hyde
 
Optiq: A dynamic data management framework
Optiq: A dynamic data management frameworkOptiq: A dynamic data management framework
Optiq: A dynamic data management frameworkJulian Hyde
 
Why you care about
 relational algebra (even though you didn’t know it)
Why you care about
 relational algebra (even though you didn’t know it)Why you care about
 relational algebra (even though you didn’t know it)
Why you care about
 relational algebra (even though you didn’t know it)Julian Hyde
 
Cost-based query optimization in Apache Hive 0.14
Cost-based query optimization in Apache Hive 0.14Cost-based query optimization in Apache Hive 0.14
Cost-based query optimization in Apache Hive 0.14Julian Hyde
 
Planning with Polyalgebra: Bringing Together Relational, Complex and Machine ...
Planning with Polyalgebra: Bringing Together Relational, Complex and Machine ...Planning with Polyalgebra: Bringing Together Relational, Complex and Machine ...
Planning with Polyalgebra: Bringing Together Relational, Complex and Machine ...Julian Hyde
 
Querying the Internet of Things: Streaming SQL on Kafka/Samza and Storm/Trident
 Querying the Internet of Things: Streaming SQL on Kafka/Samza and Storm/Trident Querying the Internet of Things: Streaming SQL on Kafka/Samza and Storm/Trident
Querying the Internet of Things: Streaming SQL on Kafka/Samza and Storm/TridentJulian Hyde
 
SQL on everything, in memory
SQL on everything, in memorySQL on everything, in memory
SQL on everything, in memoryJulian Hyde
 
Streaming SQL with Apache Calcite
Streaming SQL with Apache CalciteStreaming SQL with Apache Calcite
Streaming SQL with Apache CalciteJulian Hyde
 
Spark and Shark: Lightning-Fast Analytics over Hadoop and Hive Data
Spark and Shark: Lightning-Fast Analytics over Hadoop and Hive DataSpark and Shark: Lightning-Fast Analytics over Hadoop and Hive Data
Spark and Shark: Lightning-Fast Analytics over Hadoop and Hive DataJetlore
 
Streaming SQL (at FlinkForward, Berlin, 2016/09/12)
Streaming SQL (at FlinkForward, Berlin, 2016/09/12)Streaming SQL (at FlinkForward, Berlin, 2016/09/12)
Streaming SQL (at FlinkForward, Berlin, 2016/09/12)Julian Hyde
 

Viewers also liked (20)

The twins that everyone loved too much
The twins that everyone loved too muchThe twins that everyone loved too much
The twins that everyone loved too much
 
Apache Calcite overview
Apache Calcite overviewApache Calcite overview
Apache Calcite overview
 
Cost-based Query Optimization in Hive
Cost-based Query Optimization in HiveCost-based Query Optimization in Hive
Cost-based Query Optimization in Hive
 
How Pony ORM translates Python generators to SQL queries
How Pony ORM translates Python generators to SQL queriesHow Pony ORM translates Python generators to SQL queries
How Pony ORM translates Python generators to SQL queries
 
Hive join optimizations
Hive join optimizationsHive join optimizations
Hive join optimizations
 
Streaming SQL
Streaming SQLStreaming SQL
Streaming SQL
 
Discardable In-Memory Materialized Queries With Hadoop
Discardable In-Memory Materialized Queries With HadoopDiscardable In-Memory Materialized Queries With Hadoop
Discardable In-Memory Materialized Queries With Hadoop
 
What's new in Mondrian 4?
What's new in Mondrian 4?What's new in Mondrian 4?
What's new in Mondrian 4?
 
Optiq: A dynamic data management framework
Optiq: A dynamic data management frameworkOptiq: A dynamic data management framework
Optiq: A dynamic data management framework
 
Why you care about
 relational algebra (even though you didn’t know it)
Why you care about
 relational algebra (even though you didn’t know it)Why you care about
 relational algebra (even though you didn’t know it)
Why you care about
 relational algebra (even though you didn’t know it)
 
Cost-based query optimization in Apache Hive 0.14
Cost-based query optimization in Apache Hive 0.14Cost-based query optimization in Apache Hive 0.14
Cost-based query optimization in Apache Hive 0.14
 
Planning with Polyalgebra: Bringing Together Relational, Complex and Machine ...
Planning with Polyalgebra: Bringing Together Relational, Complex and Machine ...Planning with Polyalgebra: Bringing Together Relational, Complex and Machine ...
Planning with Polyalgebra: Bringing Together Relational, Complex and Machine ...
 
Querying the Internet of Things: Streaming SQL on Kafka/Samza and Storm/Trident
 Querying the Internet of Things: Streaming SQL on Kafka/Samza and Storm/Trident Querying the Internet of Things: Streaming SQL on Kafka/Samza and Storm/Trident
Querying the Internet of Things: Streaming SQL on Kafka/Samza and Storm/Trident
 
SQL on everything, in memory
SQL on everything, in memorySQL on everything, in memory
SQL on everything, in memory
 
Streaming SQL
Streaming SQLStreaming SQL
Streaming SQL
 
Optimizing Hive Queries
Optimizing Hive QueriesOptimizing Hive Queries
Optimizing Hive Queries
 
Streaming SQL
Streaming SQLStreaming SQL
Streaming SQL
 
Streaming SQL with Apache Calcite
Streaming SQL with Apache CalciteStreaming SQL with Apache Calcite
Streaming SQL with Apache Calcite
 
Spark and Shark: Lightning-Fast Analytics over Hadoop and Hive Data
Spark and Shark: Lightning-Fast Analytics over Hadoop and Hive DataSpark and Shark: Lightning-Fast Analytics over Hadoop and Hive Data
Spark and Shark: Lightning-Fast Analytics over Hadoop and Hive Data
 
Streaming SQL (at FlinkForward, Berlin, 2016/09/12)
Streaming SQL (at FlinkForward, Berlin, 2016/09/12)Streaming SQL (at FlinkForward, Berlin, 2016/09/12)
Streaming SQL (at FlinkForward, Berlin, 2016/09/12)
 

Similar to Cost-based query optimization in Apache Hive 0.14

An Overview on Optimization in Apache Hive: Past, Present Future
An Overview on Optimization in Apache Hive: Past, Present FutureAn Overview on Optimization in Apache Hive: Past, Present Future
An Overview on Optimization in Apache Hive: Past, Present FutureDataWorks Summit/Hadoop Summit
 
What's New in Apache Hive 3.0?
What's New in Apache Hive 3.0?What's New in Apache Hive 3.0?
What's New in Apache Hive 3.0?DataWorks Summit
 
What's New in Apache Hive 3.0 - Tokyo
What's New in Apache Hive 3.0 - TokyoWhat's New in Apache Hive 3.0 - Tokyo
What's New in Apache Hive 3.0 - TokyoDataWorks Summit
 
An Overview on Optimization in Apache Hive: Past, Present, Future
An Overview on Optimization in Apache Hive: Past, Present, FutureAn Overview on Optimization in Apache Hive: Past, Present, Future
An Overview on Optimization in Apache Hive: Past, Present, FutureDataWorks Summit
 
Experimentation Platform on Hadoop
Experimentation Platform on HadoopExperimentation Platform on Hadoop
Experimentation Platform on HadoopDataWorks Summit
 
eBay Experimentation Platform on Hadoop
eBay Experimentation Platform on HadoopeBay Experimentation Platform on Hadoop
eBay Experimentation Platform on HadoopTony Ng
 
Use dependency injection to get Hadoop *out* of your application code
Use dependency injection to get Hadoop *out* of your application codeUse dependency injection to get Hadoop *out* of your application code
Use dependency injection to get Hadoop *out* of your application codeDataWorks Summit
 
SQL Engines for Hadoop - The case for Impala
SQL Engines for Hadoop - The case for ImpalaSQL Engines for Hadoop - The case for Impala
SQL Engines for Hadoop - The case for Impalamarkgrover
 
Data & Analytics - Session 1 - Big Data Analytics
Data & Analytics - Session 1 -  Big Data AnalyticsData & Analytics - Session 1 -  Big Data Analytics
Data & Analytics - Session 1 - Big Data AnalyticsAmazon Web Services
 
Stinger hadoop summit june 2013
Stinger hadoop summit june 2013Stinger hadoop summit june 2013
Stinger hadoop summit june 2013alanfgates
 
An In-Depth Look at Putting the Sting in Hive
An In-Depth Look at Putting the Sting in HiveAn In-Depth Look at Putting the Sting in Hive
An In-Depth Look at Putting the Sting in HiveDataWorks Summit
 
Fast SQL on Hadoop, really?
Fast SQL on Hadoop, really?Fast SQL on Hadoop, really?
Fast SQL on Hadoop, really?DataWorks Summit
 
Etl with apache impala by athemaster
Etl with apache impala by athemasterEtl with apache impala by athemaster
Etl with apache impala by athemasterAthemaster Co., Ltd.
 
Eliminating the Challenges of Big Data Management Inside Hadoop
Eliminating the Challenges of Big Data Management Inside HadoopEliminating the Challenges of Big Data Management Inside Hadoop
Eliminating the Challenges of Big Data Management Inside HadoopHortonworks
 
Eliminating the Challenges of Big Data Management Inside Hadoop
Eliminating the Challenges of Big Data Management Inside HadoopEliminating the Challenges of Big Data Management Inside Hadoop
Eliminating the Challenges of Big Data Management Inside HadoopHortonworks
 
ONE FOR ALL! Using Apache Calcite to make SQL smart
ONE FOR ALL! Using Apache Calcite to make SQL smartONE FOR ALL! Using Apache Calcite to make SQL smart
ONE FOR ALL! Using Apache Calcite to make SQL smartEvans Ye
 
Optimizing your Modern Data Architecture - with Attunity, RCG Global Services...
Optimizing your Modern Data Architecture - with Attunity, RCG Global Services...Optimizing your Modern Data Architecture - with Attunity, RCG Global Services...
Optimizing your Modern Data Architecture - with Attunity, RCG Global Services...Hortonworks
 

Cost-based query optimization in Apache Hive 0.14

  • 1. Page1 © Hortonworks Inc. 2014 Cost-based query optimization in Apache Hive 0.14. Julian Hyde, Seattle, September 24th, 2014
  • 2. About me: Julian Hyde, Architect at Hortonworks
      Open source:
      • Founder & lead, Apache Optiq (query optimization framework)
      • Founder & lead, Pentaho Mondrian (analysis engine)
      • Committer, Apache Drill
      • Contributor, Apache Hive
      • Contributor, Cascading Lingual (SQL interface to Cascading)
      Past:
      • SQLstream (streaming SQL)
      • Broadbase (data warehouse)
      • Oracle (SQL kernel development)
  • 3. Hadoop - A New Data Architecture for New Data
      [Diagram: Traditional Data Architecture]
      Applications: Business Analytics, Custom Applications, Packaged Applications
      Data system (repositories): RDBMS, EDW, MPP
      Sources: Existing Sources (CRM, ERP, Clickstream, Logs); OLTP, ERP, CRM Systems; Unstructured documents, emails; Clickstream; Server logs; Sentiment, Web Data; Sensor, Machine Data; Geolocation
      New Data Requirements:
      • Scale
      • Economics
      • Flexibility
  • 4. Interactive SQL-IN-Hadoop Delivered
      Stinger Initiative – DELIVERED: next-generation SQL-based interactive query in Hadoop
      Speed: Hive query performance has increased by 100X, allowing interactive query times (seconds)
      Scale: The only SQL interface to Hadoop designed for queries that scale from TB to PB
      SQL: Supports the broadest range of SQL semantics for analytic applications running against Hadoop
      Apache Hive Contribution… an Open Community at its finest:
      • 1,672 Jira Tickets Closed
      • 145 Developers
      • 44 Companies
      • ~390,000 Lines Of Code Added… (2x)
      • 13 Months
      Stinger Phase 1:
      • Base Optimizations
      • SQL Types
      • SQL Analytic Functions
      • ORCFile Modern File Format
      Stinger Phase 2:
      • SQL Types
      • SQL Analytic Functions
      • Advanced Optimizations
      • Performance Boosts via YARN
      Stinger Phase 3:
      • Hive on Apache Tez
      • Query Service (always on)
      • Buffer Cache
      • Cost Based Optimizer (Optiq)
      [Diagram: Business Analytics and Custom Apps issue SQL to Apache Hive, shown with Apache MapReduce, Apache Tez, Apache YARN and HDFS (Hadoop Distributed File System) nodes 1 to N. Stack labels: Governance & Integration, Security, Operations, Data Access, Data Management. HDP 2.1: ORC File, Window Functions]
  • 5. Hive – Single tool for all SQL use cases
      Sources: OLTP, ERP, CRM Systems; Unstructured documents, emails; Clickstream; Server logs; Sentiment, Web Data; Sensor, Machine Data; Geolocation
      Hive - SQL use cases: Interactive Analytics; Batch Reports / Deep Analytics; ETL / ELT
  • 6. Incremental cutover to cost-based optimization
      Release          | Date         | Remarks
      Apache Hive 0.12 | October 2013 | Rule-based optimizations; no join reordering; main optimizations: predicate push-down & partition pruning; semantic info and operator tree tightly coupled
      Apache Hive 0.13 | April 2014   | “Old-style” JOIN & push-down conditions: … FROM t1, t2 WHERE …
      HDP 2.1          | April 2014   | Cost-based ordering of joins
      Apache Hive 0.14 | soon         | Bushy joins, large joins, better operator coverage, better statistics, …
  • 7. Apache Optiq (incubating)
  • 8. Apache Optiq
      Apache incubator project since May 2014
      Query planning framework:
      • Relational algebra, rewrite rules, cost model
      • Extensible
      • Usable standalone (JDBC) or embedded
      Adoption:
      • Lingual – SQL interface to Cascading
      • Apache Drill
      • Apache Hive
      Adapters: Splunk, Spark, MongoDB, JDBC, CSV, JSON, Web tables, In-memory data
  • 9. Conventional DB architecture
  • 10. Optiq architecture
  • 11. Expression tree
      SELECT p."product_name", COUNT(*) AS c
      FROM "splunk"."splunk" AS s
        JOIN "mysql"."products" AS p ON s."product_id" = p."product_id"
      WHERE s."action" = 'purchase'
      GROUP BY p."product_name"
      ORDER BY c DESC
      [Diagram: operator tree spanning Splunk and MySQL. Operators: join (Key: product_id); group (Key: product_name, Agg: count); filter (Condition: action = 'purchase'); sort (Key: c DESC); scan (Table: splunk); scan (Table: products)]
  • 12. Expression tree (optimized)
      SELECT p."product_name", COUNT(*) AS c
      FROM "splunk"."splunk" AS s
        JOIN "mysql"."products" AS p ON s."product_id" = p."product_id"
      WHERE s."action" = 'purchase'
      GROUP BY p."product_name"
      ORDER BY c DESC
      [Diagram: optimized operator tree spanning Splunk and MySQL. Operators: join (Key: product_id); group (Key: product_name, Agg: count); filter (Condition: action = 'purchase'); sort (Key: c DESC); scan (Table: splunk); scan (Table: products)]
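One way to read the step from the original expression tree to the optimized one is as a single rewrite rule firing: a filter that references columns from only one join input moves below the join, so fewer rows reach the join. That the optimized figure shows exactly this push-down is an assumption, since the diagram itself did not survive extraction. The Python sketch below is illustrative only. The `Scan`, `Filter` and `Join` classes and the `push_filter_through_join` function are hypothetical names, not Optiq's API (Optiq expresses rules as `RelOptRule` implementations over `RelNode` trees in Java).

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Scan:
    table: str
    columns: frozenset  # column names the table provides


@dataclass(frozen=True)
class Filter:
    column: str   # column referenced by the condition
    value: str    # the condition is: column = value
    child: "Node"


@dataclass(frozen=True)
class Join:
    key: str
    left: "Node"
    right: "Node"


Node = Union[Scan, Filter, Join]


def columns_of(node: Node) -> frozenset:
    """Columns produced by a (sub)plan."""
    if isinstance(node, Scan):
        return node.columns
    if isinstance(node, Filter):
        return columns_of(node.child)
    return columns_of(node.left) | columns_of(node.right)


def push_filter_through_join(node: Node) -> Node:
    """Rewrite Filter(Join(l, r)) to Join(Filter(l), r) when the filter
    column comes only from l (and symmetrically for r)."""
    if isinstance(node, Filter) and isinstance(node.child, Join):
        join = node.child
        in_left = node.column in columns_of(join.left)
        in_right = node.column in columns_of(join.right)
        if in_left and not in_right:
            return Join(join.key, Filter(node.column, node.value, join.left), join.right)
        if in_right and not in_left:
            return Join(join.key, join.left, Filter(node.column, node.value, join.right))
    return node  # the rule does not match; the plan is unchanged


# Hypothetical column sets for the two tables in the slide's query.
splunk = Scan("splunk", frozenset({"product_id", "action"}))
products = Scan("products", frozenset({"product_id", "product_name"}))
before = Filter("action", "purchase", Join("product_id", splunk, products))
after = push_filter_through_join(before)
```

Here `after` is the join with the `action = 'purchase'` filter sitting directly on the `splunk` scan. A filter on `product_id`, which both inputs provide, is left where it is by this simplified rule.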
  • 13. Optiq – APIs and SPIs
      Cost, statistics: RelOptCost, RelOptCostFactory, RelMetadataProvider
      • RelMdColumnUniqueness
      • RelMdDistinctRowCount
      • RelMdSelectivity
      SQL parser: SqlNode, SqlParser, SqlValidator
      Transformation rules: RelOptRule
      • MergeFilterRule
      • PushAggregateThroughUnionRule
      • RemoveCorrelationForScalarProjectRule
      • 100+ more
      Unification (materialized view)
      Column trimming
      Relational algebra:
      RelNode (operator)
      • TableScan
      • Filter
      • Project
      • Union
      • Aggregate
      • …
      RelDataType (type)
      RexNode (expression)
      RelTrait (physical property)
      • RelConvention (calling-convention)
      • RelCollation (sortedness)
      • TBD (bucketedness/distribution)
      JDBC driver
      Metadata: Schema, Table, Function
      • TableFunction
      • TableMacro
  • 14. Under development in Optiq - materialized views
      Query: SELECT x, SUM(y) FROM t GROUP BY x
      [Diagram: in-memory materialized queries; tables on disk]
      http://hortonworks.com/blog/dmmq/
  • 15. Now… back to Hive
  • 16. CBO in Hive: why cost-based optimization?
      • Ease of Use – Join Reordering
      • View Chaining
      • Ad hoc queries involving multiple views
      • Enables BI Tools as front ends to Hive
      • More efficient & maintainable query preparation process
      • Laying the groundwork for deeper optimizations, e.g. materialized views
  • 17. Query preparation – Hive 0.13
      Stages: SQL parser, Semantic analyzer, Logical Optimizer, Physical Optimizer
      Artifacts passed between stages: Hive SQL, Abstract Syntax Tree (AST), Annotated AST, Plan, Tuned Plan (executed on Tez)
  • 18. Query preparation – full CBO
      Stages: SQL parser, Semantic analyzer, Translate to algebra, Optiq optimizer, Physical Optimizer
      Artifacts passed between stages: Hive SQL, Abstract Syntax Tree (AST), Annotated AST, RelNode, Tuned Plan (executed on Tez)
  • 19. Query preparation – Hive 0.14
      Stages: SQL parser, Semantic analyzer, Translate to algebra, Optiq optimizer, Logical Optimizer, Physical Optimizer
      Artifacts passed between stages: Hive SQL, AST with optimized join-ordering, Tuned Plan (executed on Tez)
  • 20. Query Execution – The basics
      SELECT R1.x
      FROM R1
        JOIN R2 ON R1.x = R2.x
        JOIN R3 ON R1.x = R3.x AND R2.x = R3.x
      WHERE R1.z > 10;
      [Diagram: relational algebra tree with a projection (π) and a selection (σ) over two joins (⋈) of R1, R2 and R3]
      [Diagram: Hive operator plan with TS [R1], TS [R2], RS, RS, Shuffle Join, TS [R3], Map Join, Filter, FS]
  • 21. Optiq Planner Process
      Flow: Hive Plan → RelNode Converter (Hive Op → RelNode) and RexNode Converter (Hive Expr → RexNode) → RelNode Graph → Planner → Best RelNode Graph → AST Converter → Revised AST
      1. Plan Graph
      • Node for each node in Input Plan
      • Each node is a Set of alternate Sub Plans
      • Set further divided into Subsets: based on traits like sortedness
      2. Rules
      • Rule: specifies an Operator sub-graph to match and logic to generate an equivalent ‘better’ sub-graph.
      • We only have Join Reordering Rules.
      3. Cost Model
      • RelNodes have Cost (& Cumulative Cost)
      • We only use Cardinality for Cost.
      4. Metadata Providers
      - Used to plug in Schema, Cost Formulas: Selectivity, NDV calculations etc.
      - We only added Selectivity and NDV formulas; Schema is only available at the Node level
      Rule Match Queue
      - Add Rule matches to Queue
      - Apply Rule match transformations to Plan Graph
      - Iterate for fixed iterations or until Cost doesn’t change.
      - Match importance based on Cost of RelNode and height.
      Logical Plan: Physical traits: Table Part./Buckets; RedSink Ops removed
  • 22. Star schema
      [Diagram: fact tables Sales and Inventory; dimension tables Time, Product, Customer, Warehouse. Key: fact table, dimension table, many-to-one relationship]
  • 23. Query combining two stars
      SELECT product.id, sum(sales.units), sum(inventory.on_hand)
      FROM sales
        JOIN customer ON …
        JOIN time ON …
        JOIN product ON …
        JOIN inventory ON …
        JOIN warehouse ON …
      WHERE time.year = 2014 AND time.quarter = 'Q1'
        AND product.color = 'Red'
        AND warehouse.state = 'WA'
      GROUP BY …
      [Diagram: star schema with Sales, Inventory, Time, Product, Customer, Warehouse]
  • 24. Left-deep tree
      • Initial tree is “left-deep”: no join node is the right child of its parent
      • Join-ordering algorithm chooses what order to join tables – does not re-shape the tree
      Typical plan:
      • Start with largest table at bottom left
      • Join tables with more selective conditions first
      [Diagram: left-deep join tree over Sales, Customer, Time, Product, Inventory, Warehouse]
  • 25. Bushy tree
      • No restrictions on where join nodes can occur
      • “Bushes” consist of fact tables (Sales and Inventory) surrounded by many-to-one related dimension tables
      • Dimension tables have a filtering effect
      • This tree produces the same result as the previous plan but is more efficient
      [Diagram: bushy join tree over Sales, Customer, Time, Product, Inventory, Warehouse]
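The two tree shapes can be made concrete with a small sketch. Join trees are modeled as nested Python tuples, and the left-deep property quoted above ("no join node is the right child of its parent") becomes a one-line recursive check. The specific trees are illustrative guesses at the shape of the figures, not transcriptions of them. The counting functions use standard combinatorics: n! left-deep orders, and (2n-2)!/(n-1)! ordered binary trees once bushy shapes are allowed.

```python
from math import factorial

# A join tree is either a table name (str) or a pair (left, right).
# Illustrative shapes only; the slide figures are not reproduced exactly.
left_deep = ((((("Sales", "Customer"), "Time"), "Product"), "Inventory"), "Warehouse")
bushy = (((("Sales", "Customer"), "Time"), "Product"), ("Inventory", "Warehouse"))


def is_left_deep(tree) -> bool:
    """True if no join node is the right child of its parent."""
    if isinstance(tree, str):
        return True
    left, right = tree
    return isinstance(right, str) and is_left_deep(left)


def count_left_deep(n: int) -> int:
    """Left-deep trees over n tables: one per ordering of the tables."""
    return factorial(n)


def count_bushy(n: int) -> int:
    """Ordered binary join trees over n tables: (2n-2)! / (n-1)!."""
    return factorial(2 * n - 2) // factorial(n - 1)
```

For the six tables in this example, `count_left_deep(6)` is 720 while `count_bushy(6)` is 30,240, which gives a sense of why allowing bushy shapes enlarges the search space so quickly.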
  • 26. Two approaches to join optimization
      Algorithm #1 “exhaustive search”
      Apply 3 transformation rules exhaustively:
      • SwapJoinRule: A join B → B join A
      • PushJoinThroughJoinRule: (A join B) join C → (A join C) join B
      • CommutativeJoinRule: (A join B) join C → A join (B join C)
      Finds every possible plan, but not practical for more than ~8 joins
      Algorithm #2 “greedy”
      • Build a graph iteratively
      • Use heuristics to choose the “best” node to add next
      Applying them to Hive
      • We can use both algorithms – and we do!
      • Both are sensitive to bad statistics – e.g. poor estimation of intermediate result set sizes
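The trade-off between the two approaches can be sketched in a few lines of Python. This is a simplified illustration, not Hive's or Optiq's implementation: the "exhaustive" variant enumerates left-deep permutations directly instead of applying transformation rules, the greedy heuristic (start from the largest table, then add whichever table gives the smallest intermediate result) is an assumed one, and the row counts and selectivities are made-up numbers. Cost is the sum of intermediate row counts, in the spirit of a cardinality-only cost model.

```python
from itertools import permutations

# Hypothetical row counts and pairwise join selectivities (made-up numbers).
ROWS = {"sales": 1_000_000, "customer": 10_000, "time": 1_000, "product": 500}
SELECTIVITY = {
    frozenset({"sales", "customer"}): 1 / 10_000,
    frozenset({"sales", "time"}): 1 / 4_000,
    frozenset({"sales", "product"}): 1 / 1_000,
}


def join_rows(joined, rows, table):
    """Estimated rows after joining `table` to the tables already joined.
    Pairs with no listed selectivity behave as a cross product (factor 1)."""
    result = rows * ROWS[table]
    for other in joined:
        result *= SELECTIVITY.get(frozenset({other, table}), 1.0)
    return result


def plan_cost(order):
    """Cumulative cost of a left-deep plan: sum of intermediate row counts."""
    joined, rows, cost = [order[0]], ROWS[order[0]], 0.0
    for table in order[1:]:
        rows = join_rows(joined, rows, table)
        cost += rows
        joined.append(table)
    return cost


def exhaustive(tables):
    """Try every left-deep order; the number of orders grows factorially."""
    return min(permutations(tables), key=plan_cost)


def greedy(tables):
    """Start from the largest table, then repeatedly add the table that
    yields the smallest intermediate result."""
    remaining = set(tables)
    first = max(remaining, key=ROWS.get)
    order, rows = [first], ROWS[first]
    remaining.remove(first)
    while remaining:
        best = min(sorted(remaining), key=lambda t: join_rows(order, rows, t))
        rows = join_rows(order, rows, best)
        order.append(best)
        remaining.remove(best)
    return tuple(order)
```

With these numbers, `greedy(list(ROWS))` joins `time` first, then `product`, then `customer`, because the `time` join shrinks the intermediate result the most. Both functions rely entirely on the estimates in `ROWS` and `SELECTIVITY`, which is why poor statistics mislead either approach.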
  • 27. Statistics
      Feeding the beast
      • CBO disabled if your tables don’t have statistics
        • No longer require statistics on all columns, just join columns
      • Better optimizations need ever-better statistics… so, statistics are getting better
      Kinds of statistics
      • Raw statistics on stored data: row counts, number-of-distinct-values (NDV)
      • Statistics on intermediate operators, computed using selectivity estimates
        • Much improved selectivity estimates this release, based on NDVs
        • Planned improvements to raw statistics (e.g. histograms, unique keys, sort order) will help
        • Materialized views
      • Run-time statistics
        • Example 1: 90% of the rows in this join have the same key → use skew join
        • Example 2: Only 10 distinct values of GROUP BY key → auto-reduce parallelism
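To show how row counts and NDVs can turn into estimates for intermediate operators, here is a minimal sketch using the usual textbook formulas under a uniform-distribution assumption. The slides say Hive's selectivity estimates are based on NDVs but do not give the formulas, so these functions are assumptions for illustration, not Hive's actual code, and the example numbers are made up.

```python
def equality_selectivity(ndv: int) -> float:
    """Fraction of rows matching `col = constant`, assuming values are
    uniformly distributed over `ndv` distinct values."""
    return 1.0 / ndv if ndv > 0 else 1.0


def join_cardinality(rows_left: int, rows_right: int,
                     ndv_left: int, ndv_right: int) -> float:
    """Textbook estimate for an equi-join: |L| * |R| / max(NDV_L, NDV_R)."""
    return rows_left * rows_right / max(ndv_left, ndv_right, 1)


# Hypothetical statistics: 1,000,000 sales rows over 500 products,
# and a filter product.color = 'Red' with NDV(color) = 10.
red_products = 500 * equality_selectivity(10)
joined = join_cardinality(1_000_000, int(red_products), 500, int(red_products))
```

With these inputs the filter keeps an estimated 50 products, and the join estimate drops from 1,000,000 rows to 100,000, which is the "filtering effect" of a dimension table expressed as arithmetic.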
  • 28. Stored statistics – recent improvements
      ANALYZE
      • Clean up command syntax
      • Faster computation
      Table vs partition statistics
      • All statistics now stored per partition
      Statistics retrieval
      • Faster retrieval
      • Merge partition statistics
      • Extrapolate for missing statistics
      Extrapolation
      SQL: SELECT productId, COUNT(*) FROM Sales WHERE year = 2014 GROUP BY productId
      Required statistic: NDV(productId)
      [Diagram: partitions 2013 Q1 to 2014 Q4, marking which have statistics available and which are used in the query; extrapolate {Q1, Q2, Q3} stats for Q4]
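The extrapolation idea can be sketched as follows: when a partition used by the query has no stored statistics, estimate its value from the partitions that do. The slide does not say which method Hive uses, so the linear extrapolation below is purely an assumption, the function name `extrapolate` is hypothetical, and the NDV values are made up.

```python
def extrapolate(known: dict, missing: int) -> float:
    """Linearly extrapolate a per-partition statistic from the first and last
    known partitions. `known` maps partition index -> statistic value."""
    if not known:
        raise ValueError("need at least one partition with statistics")
    first, last = min(known), max(known)
    if first == last:
        return float(known[first])  # single data point: assume no change
    slope = (known[last] - known[first]) / (last - first)
    return max(0.0, known[last] + slope * (missing - last))


# Hypothetical NDV(productId) for 2014 Q1..Q3; Q4 has no statistics.
ndv_by_quarter = {1: 400, 2: 420, 3: 440}
ndv_q4 = extrapolate(ndv_by_quarter, 4)
```

Here the estimate for Q4 continues the trend of the known quarters. A real implementation would also have to decide how to merge per-partition NDVs into one figure for the query, which this sketch leaves out.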
  • 29. Dynamic partition pruning. Consider a query with a partitioned fact table and filters on the dimension table: SELECT … FROM Sales JOIN Time ON Sales.time_id = Time.time_id WHERE Time.year = 2014 AND Time.quarter IN ('Q1', 'Q2'). At execution time, the DAG figures out which partitions could possibly match, and cancels scans of the others. [Diagram: Sales partitions 2013 Q1 through 2014 Q4; Time table]
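The idea can be sketched in a few lines: evaluate the dimension filter first, collect the partition keys that survive, and scan only those fact partitions. This sketch assumes, for simplicity, that each Sales partition is identified by a (year, quarter) pair; the function and variable names are hypothetical and do not correspond to Hive or Tez APIs.

```python
def prune_partitions(fact_partitions, dim_rows, predicate, partition_key):
    """Keep only the fact partitions whose key appears among the dimension
    rows that pass the filter; the other partitions are never scanned."""
    matching = {partition_key(row) for row in dim_rows if predicate(row)}
    return [p for p in fact_partitions if p in matching]


quarters = ["Q1", "Q2", "Q3", "Q4"]
sales_partitions = [(year, q) for year in (2013, 2014) for q in quarters]
time_rows = [{"year": year, "quarter": q} for year in (2013, 2014) for q in quarters]

to_scan = prune_partitions(
    sales_partitions,
    time_rows,
    predicate=lambda r: r["year"] == 2014 and r["quarter"] in ("Q1", "Q2"),
    partition_key=lambda r: (r["year"], r["quarter"]),
)
print(to_scan)  # [(2014, 'Q1'), (2014, 'Q2')]
```

The set of matching keys is only known once the dimension side has been evaluated, which is why this pruning happens at execution time rather than at planning time.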
  • 30. Summary • Join-ordering (exhaustive & heuristic), scalability, bushy joins • Statistics: faster, better, extrapolate if stats missing • Very few operators that CBO can’t handle: TABLESAMPLE, SCRIPT, multi-INSERT • Dynamic partition pruning • Auto-reduce parallelism
  • 31. Show me the numbers…
  • 32. TPC-DS (30TB) Q17: Joins the Store Sales, Store Returns and Catalog Sales fact tables. Each of the fact tables is independently restricted by time. Analysis is at Item and Store grain, so these dimensions are also joined in. As specified, the query starts by joining the 3 fact tables. SELECT i_item_id, i_item_desc, s_state, count(ss_quantity) as store_sales_quantitycount, … FROM store_sales ss, store_returns sr, catalog_sales cs, date_dim d1, date_dim d2, date_dim d3, store s, item i WHERE d1.d_quarter_name = '2000Q1' AND d1.d_date_sk = ss.ss_sold_date_sk AND i.i_item_sk = ss.ss_item_sk AND … GROUP BY i_item_id, i_item_desc, s_state ORDER BY i_item_id, i_item_desc, s_state LIMIT 100;
    CBO   Elapsed (s)   Intermediate data (GB)
    Off        10,683                    5,017
    On          1,284                      275
  • 33. TPC-DS (200G) queries
    Query   Hive 14 CBO off   Hive 14 CBO on     Gain
    Q15                  84               44      91%
    Q22                 123               99      24%
    Q29                1677               48   3,394%
    Q40                 118               29     307%
    Q51                 276               80     245%
    Q80                 842               70   1,103%
    Q82                 278               23   1,109%
    Q87                 275               51     439%
    Q92                 511               80     539%
    Q93                 160               69     132%
    Q97                 483               79     511%
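The Gain column is consistent with (CBO off / CBO on − 1) × 100, rounded to the nearest percent. That formula is inferred from the numbers rather than stated on the slide; the short check below reproduces every row under that assumption (the function name is hypothetical).

```python
def gain_percent(cbo_off, cbo_on):
    """Relative speed-up of the CBO-on run over the CBO-off run, in percent."""
    return round((cbo_off / cbo_on - 1) * 100)


# (query, CBO off, CBO on, gain in % as reported on the slide)
results = [
    ("Q15", 84, 44, 91), ("Q22", 123, 99, 24), ("Q29", 1677, 48, 3394),
    ("Q40", 118, 29, 307), ("Q51", 276, 80, 245), ("Q80", 842, 70, 1103),
    ("Q82", 278, 23, 1109), ("Q87", 275, 51, 439), ("Q92", 511, 80, 539),
    ("Q93", 160, 69, 132), ("Q97", 483, 79, 511),
]
mismatches = [q for q, off, on, gain in results if gain_percent(off, on) != gain]
print(mismatches)  # []
```

Read this way, a gain of 100% means the query ran twice as fast with CBO on, and Q29's 3,394% corresponds to roughly a 35× speed-up.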
  • 34. Stinger.next • SQL compliance: interval data type, non-equi joins, set operators, more sub-queries • Transactions: COMMIT, SAVEPOINT, ROLLBACK • LLAP • Materialized views: in-memory; automatic or manual. http://hortonworks.com/blog/stinger-next-enterprise-sql-hadoop-scale-apache-hive/
  • 35. Thank you! @julianhyde http://hive.apache.org/ http://optiq.incubator.apache.org/