SlideShare a Scribd company logo
1 of 29
Page1 © Hortonworks Inc. 2014
Cost-based query optimization in
Apache Hive
Julian Hyde Julian Hyde
June 4th, 2014
Page2 © Hortonworks Inc. 2014
About me
Julian Hyde
Architect at Hortonworks
Open source:
• Founder & lead, Apache Optiq (query optimization framework)
• Founder & lead, Pentaho Mondrian (analysis engine)
• Committer, Apache Drill
• Contributor, Apache Hive
• Contributor, Cascading Lingual (SQL interface to Cascading)
Past:
• SQLstream (streaming SQL)
• Broadbase (data warehouse)
• Oracle (SQL kernel development)
Page3 © Hortonworks Inc. 2014
(Thanks to
John Pullokkaran,
Harish Butani
for presentation content
and actually doing the work.)
Page4 © Hortonworks Inc. 2014
Apache Hive
The original “SQL on Hadoop”
Undergoing extensive renovation
• Tez execution engine
• YARN execution environment
• Vectorized data representation
• Column-oriented data storage (ORC)
• ACID transactions
• SQL standards compliance
• SQL authorization model
• Cost-based query optimization (CBO) What? Why? How? When?
“Stinger
Initiative”
Page5 © Hortonworks Inc. 2014
Incremental cutover to cost-based optimization
Release Date Remarks
Apache Hive 0.12 October 2013 • Rule-based Optimizations
• No join reordering
• Main optimizations: predicate push-
down & partition pruning
• Semantic info and operator tree tightly
coupled
Apache Hive 0.13 April 2014 “Old-style” JOIN & push-down conditions:
… FROM t1, t2 WHERE …
CBO just missed the deadline 
HDP 2.1 April 2014 Cost-based ordering of joins
• HIVE-6439 “Introduce CBO step in
Semantic Analyzer”
• HIVE-5775 “Introduce Cost Based
Optimizer in Hive”
Apache Hive 0.14 ? CBO patches
More rework of internals
More cost-based features…
Page6 © Hortonworks Inc. 2014
Apache Optiq
(incubating)
Page7 © Hortonworks Inc. 2014
Apache Optiq
Apache incubator project since May, 2014
Query planning framework
• Extensible
• Usable standalone (JDBC) or embedded
Adoption
Lingual – SQL interface to Cascading
Apache Drill
Apache Hive
Adapters: Splunk, Spark, MongoDB, JDBC, CSV, Web tables, In-memory
data
Page8 © Hortonworks Inc. 2014
Conventional DB architecture
Page9 © Hortonworks Inc. 2014
Optiq architecture
Page10 © Hortonworks Inc. 2014
Optiq – APIs and SPIs
Cost, statistics
RelOptCost
RelOptCostFactory
RelMetadataProvider
• RelMdColumnUniquensss
• RelMdDistinctRowCount
• RelMdSelectivity
SQL parser
SqlNode
SqlParser
SqlValidator
Transformation rules
RelOptRule
• MergeFilterRule
• PushAggregateThroughUni
onRule
• RemoveCorrelationForScal
arProjectRule
• 100+ more
Unification (materialized view)
Column trimming
Relational algebra
RelNode (operator)
• TableScan
• Filter
• Project
• Union
• Aggregate
• …
RelDataType (type)
RexNode (expression)
RelTrait (physical property)
• RelConvention (calling-convention)
• RelCollation (sortedness)
• TBD (bucketedness/distribution) JDBC driver
Metadata
Schema
Table
Function
• TableFunction
• TableMacro
Page11 © Hortonworks Inc. 2014
Now… back to Hive
Page12 © Hortonworks Inc. 2014
CBO in Hive
Why cost-based optimization?
Ease of Use – Join Reordering
View Chaining
Ad hoc queries involving multiple views
Enables BI Tools as front ends to Hive
First version
Modest goal
Concrete results
Join re-ordering
Page 12
Page13 © Hortonworks Inc. 2014
Query preparation – Hive 0.13
SQL
parser
Semantic
analyzer
Logical
Optimizer
Physical
Optimizer
Abstract
Syntax
Tree (AST)
Hive SQL
Annotated
AST
Plan
Tez
Tuned
Plan
Page14 © Hortonworks Inc. 2014
Query preparation – full CBO
SQL
parser
Semantic
analyzer
Translate
to algebra
Physical
Optimizer
Abstract
Syntax
Tree (AST)
Hive SQL
Tez
Tuned
Plan
Optiq
optimizer
RelNode
Annotated
AST
Page15 © Hortonworks Inc. 2014
Query preparation – initial CBO
SQL
parser
Semantic
analyzer
Logical
Optimizer
Physical
Optimizer
Hive SQL
AST with optimized
join-ordering
Tez
Tuned
Plan
Translate
to algebra
Optiq
optimizer
Page16 © Hortonworks Inc. 2014
© Hortonworks Inc. 2013
Query Execution – The basics
Page 16
SELECT R1.x
FROM R1
JOIN R2 ON R1.x = R2.x
JOIN R3 on R1.x = R3.x AND R2.x = R3.x
WHERE R1.z > 10;
p
s


R1 R2
R3
TS [R1]
TS [R2]
RS
RS
Shuffle
Join
TS [R3]
Map
Join
Filter FS
Page17 © Hortonworks Inc. 2014
© Hortonworks Inc. 2013
Query Optimization – Rule Based vs. Cost Based
Page 17
p
s


R1 R2
R3
p
s


R1
R2
R3
p
s


R1
R3
R2
p
s


R2
R3
R1
Page18 © Hortonworks Inc. 2014
Introduction of CBO into Hive Planning
cbo
enabled?
No
Generate Plan w/o
multi-way joins
Can
cbo handle
plan?
No
- Predicate Pushdown
- Part. Pruning
- Column Pruning
- Stats Annotation
Pre CBO Optimizer
Col stats
available?
No
Optiq-based
Planner
Hive Plan
Revised AST
Regular Planning route on
new AST with CBO
turned off.
Fallback to Regular
planning: as though cbo
is disabled.
- < 10 total Join
Ops
- No Outer Joins
- No Windowing,
Lateral Views,
Script Op.
Series of gating
factors to get a CBO
Plan.
Page19 © Hortonworks Inc. 2014
Optiq Planner Process
Hive
Plan
Planner
RelNode
GraphRelNode Converter
RexNode Converter
Hive Op  RelNode
Hive Expr  RexNode
• Node for each node in
Input Plan
• Each node is a Set of
alternate Sub Plans
• Set further divided into
Subsets: based on
traits like sortedness
1. Plan Graph
• Rule: specifies a Operator
sub-graph to match and
logic to generate equivalent
‘better’ sub-graph.
• We only have Join
Reordering Rules.
2. Rules
• RelNodes have Cost (&
Cumulative Cost)
• We only use Cardinality
for Cost.
3. Cost Model
- Used to Plugin Schema,
Cost Formulas:
Selectivity, NDV
calculations etc.
- We only added
Selectivity and NDV
formulas; Schema is
only available at the
Node level
4. Metadata Providers
Rule Match Queue
- Add Rule matches to Queue
- Apply Rule match
transformations to Plan Graph
- Iterate for fixed iterations or
until Cost doesn’t change.
- Match importance based on
Cost of RelNode and height.
Best
RelNode
Graph
AST Converter
Revised
AST
Logical Plan
Physical traits:
Table Part./Buckets;
RedSink Ops
removed
Page20 © Hortonworks Inc. 2014
Join Reordering Rules
a b
=
b a
1. Swap Join Rule
a b
=
2. Push Join Through Join Rule
c
a c
b
c b
a=
but is really:
Optiq
schema is
position
based
b
a c
3. So
a b
c
d
≠
a c
d
b
4. Pull Up Project above Join
b
a c
d
a c
b
d
=
Added bonus
Join permutations
across sub-query
blocks
5. Merge Projects
Page21 © Hortonworks Inc. 2014
Summary
Join re-ordering
Join cardinality is used for cost
All other operators are assumed to have tiny cost
Cardinality of filter, join, group-by is based on selectivity
Selectivity is computed based on number-of-distinct-values (NDV)
Table Stats and Column stats are required
Current limitations
Only supports: filter, inner join, group-by, project, order-by, limit
Not all UDFs
Does not attempt all join permutations (e.g. bushy trees; 10-way joins or more)
May not work well for Bucket, SMB & Skew Joins
Page 21
Page22 © Hortonworks Inc. 2014
TPC-DS Query 50
Joins Store Sales, and Store Returns fact tables.
Each of the fact tables are independently restricted by date.
Analysis at Store grain, so this dimension also joined in.
As specified Query starts by joining the 2 Fact tables.
select
s_store_name , .. other store details
,sum(case when (sr_returned_date_sk - ss_sold_date_sk <= 30 ) then 1 else 0 end) as `30 days`, …
from
store_sales ss,store_returns sr,store s ,date_dim d1 ,date_dim d2
where
d2.d_year = 2000 and d2.d_moy = 9
and ss.ss_ticket_number = sr.sr_ticket_number and ss.ss_item_sk = sr.sr_item_sk
and ss.ss_sold_date_sk = d1.d_date_sk and sr.sr_returned_date_sk = d2.d_date_sk
and ss.ss_customer_sk = sr.sr_customer_sk and ss.ss_store_sk = s.s_store_sk
group by store details
order by store details limit 100;
Join Graph
Page23 © Hortonworks Inc. 2014
TPC-DS Query 50
Specified
Join Tree
Non CBO Plan
CBO Plan
Page24 © Hortonworks Inc. 2014
TPC-DS Query 50
Run 1 Run 2
Non CBO 53.1 53.4
CBO 22.5 21.9
 1 year test
 > 10 mins for Non CBO
 CBO time was about the same
 Fact tables
 partitioned by Day,
 bucketed by Item
 Bucketing off
 Bucketing should help CBO plan.
 SR table much smaller. Better chance of Bucket Join in place of Shuffle
Join.
Join Ordering Cost Estimate
['d2', [[['store_sales', 'd1'], 'store_returns'], 'store']] 515074768.659
['d1', [[['store_sales', 'store'], 'store_returns'], 'd2']] 448155.355
…
['store_returns', 'd2'] 9938.93
['store_sales', 'store_returns'] 156727295.634
['d1', 'store_sales'] 123675664.449
Facts restricted to 3 months
Orderings considered by Planner
Page25 © Hortonworks Inc. 2014
TPC-DS Query 17
Joins Store Sales, Store Returns and Catalog
Sales fact tables.
Each of the fact tables are independently
restricted by time.
Analysis at Item and Store grain, so these
dimensions are also joined in.
As specified Query starts by joining the 3 Fact
tables.
select i_item_id
,i_item_desc
,s_state
,count(ss_quantity) as store_sales_quantitycount
,….
from store_sales ss ,store_returns sr, catalog_sales cs,
date_dim d1, date_dim d2, date_dim d3, store s, item I
where d1.d_quarter_name = '2000Q1’
and d1.d_date_sk = ss.ss_sold_date_sk
and i.i_item_sk = ss.ss_item_sk and …
group by i_item_id ,i_item_desc, ,s_state
order by i_item_id ,i_item_desc, s_state
limit 100;
Page26 © Hortonworks Inc. 2014
TPC-DS Query 17
Specified
Join Tree
Non CBO Plan
CBO Plan
Page27 © Hortonworks Inc. 2014
TPC-DS Query 17
Run 1 Run 2
Non CBO 100.71 127.53
CBO 50.9 44.52
 1 year test
 > 10 mins for Non CBO
 CBO time was about the same
 Fact tables
 partitioned by Day,
 bucketed by Item
 Bucketing off
 Bucketing should help CBO plan.
 SR table much smaller. Better chance of Bucket Join in place of Shuffle
Join.
Join Ordering Cost Estimate
['item', [[[[[['d2', 'store_returns'], 'store_sales'], 'catalog_sales'], 'd1'], 'd3'], 'store']] 3547898.061
…
['store_returns', 'd2’] 19224.71
['store_sales', 'store_returns’] 23057497.991
['d1', 'store_sales'] 26142.943
Facts restricted to 3 months
Orderings considered by Planner
Page28 © Hortonworks Inc. 2014
Next?
Outer joins
Scale to larger numbers of joins
Support all expressions (UDFs)
Join algorithm selection
Sortedness & distribution as a trait
Trait propagation
Better cost model
More statistics
Move all pre-planning and logical planning to Optiq
Use Optiq costs/statistics to help physical planning
Constant reduction & tree pruning
Rewrite query to use materialized view
Page29 © Hortonworks Inc. 2014
Thank you!
@julianhyde
http://hive.apache.org/
http://incubator.apache.org/projects/optiq.html

More Related Content

What's hot

簡単!AWRをEXCELピボットグラフで分析しよう♪
簡単!AWRをEXCELピボットグラフで分析しよう♪簡単!AWRをEXCELピボットグラフで分析しよう♪
簡単!AWRをEXCELピボットグラフで分析しよう♪Yohei Azekatsu
 
SQL Server 2016 - Always On.pptx
SQL Server 2016 - Always On.pptxSQL Server 2016 - Always On.pptx
SQL Server 2016 - Always On.pptxQuyVo27
 
Using Apache Arrow, Calcite, and Parquet to Build a Relational Cache
Using Apache Arrow, Calcite, and Parquet to Build a Relational CacheUsing Apache Arrow, Calcite, and Parquet to Build a Relational Cache
Using Apache Arrow, Calcite, and Parquet to Build a Relational CacheDremio Corporation
 
Securing Hadoop with Apache Ranger
Securing Hadoop with Apache RangerSecuring Hadoop with Apache Ranger
Securing Hadoop with Apache RangerDataWorks Summit
 
Intro to Data Vault 2.0 on Snowflake
Intro to Data Vault 2.0 on SnowflakeIntro to Data Vault 2.0 on Snowflake
Intro to Data Vault 2.0 on SnowflakeKent Graziano
 
Oracle db performance tuning
Oracle db performance tuningOracle db performance tuning
Oracle db performance tuningSimon Huang
 
HBase and HDFS: Understanding FileSystem Usage in HBase
HBase and HDFS: Understanding FileSystem Usage in HBaseHBase and HDFS: Understanding FileSystem Usage in HBase
HBase and HDFS: Understanding FileSystem Usage in HBaseenissoz
 
Apache Iceberg Presentation for the St. Louis Big Data IDEA
Apache Iceberg Presentation for the St. Louis Big Data IDEAApache Iceberg Presentation for the St. Louis Big Data IDEA
Apache Iceberg Presentation for the St. Louis Big Data IDEAAdam Doyle
 
Exadata
ExadataExadata
Exadatatalek
 
Deep Dive on ClickHouse Sharding and Replication-2202-09-22.pdf
Deep Dive on ClickHouse Sharding and Replication-2202-09-22.pdfDeep Dive on ClickHouse Sharding and Replication-2202-09-22.pdf
Deep Dive on ClickHouse Sharding and Replication-2202-09-22.pdfAltinity Ltd
 
InfluxDB IOx Tech Talks: Replication, Durability and Subscriptions in InfluxD...
InfluxDB IOx Tech Talks: Replication, Durability and Subscriptions in InfluxD...InfluxDB IOx Tech Talks: Replication, Durability and Subscriptions in InfluxD...
InfluxDB IOx Tech Talks: Replication, Durability and Subscriptions in InfluxD...InfluxData
 
Oracle statistics by example
Oracle statistics by exampleOracle statistics by example
Oracle statistics by exampleMauro Pagano
 
EDB Postgres DBA Best Practices
EDB Postgres DBA Best PracticesEDB Postgres DBA Best Practices
EDB Postgres DBA Best PracticesEDB
 
Introduction to Greenplum
Introduction to GreenplumIntroduction to Greenplum
Introduction to GreenplumDave Cramer
 
HBase Application Performance Improvement
HBase Application Performance ImprovementHBase Application Performance Improvement
HBase Application Performance ImprovementBiju Nair
 
Snowflake for Data Engineering
Snowflake for Data EngineeringSnowflake for Data Engineering
Snowflake for Data EngineeringHarald Erb
 
Apache Tez: Accelerating Hadoop Query Processing
Apache Tez: Accelerating Hadoop Query Processing Apache Tez: Accelerating Hadoop Query Processing
Apache Tez: Accelerating Hadoop Query Processing DataWorks Summit
 

What's hot (20)

What's New in Apache Hive
What's New in Apache HiveWhat's New in Apache Hive
What's New in Apache Hive
 
簡単!AWRをEXCELピボットグラフで分析しよう♪
簡単!AWRをEXCELピボットグラフで分析しよう♪簡単!AWRをEXCELピボットグラフで分析しよう♪
簡単!AWRをEXCELピボットグラフで分析しよう♪
 
SQL Server 2016 - Always On.pptx
SQL Server 2016 - Always On.pptxSQL Server 2016 - Always On.pptx
SQL Server 2016 - Always On.pptx
 
Using Apache Arrow, Calcite, and Parquet to Build a Relational Cache
Using Apache Arrow, Calcite, and Parquet to Build a Relational CacheUsing Apache Arrow, Calcite, and Parquet to Build a Relational Cache
Using Apache Arrow, Calcite, and Parquet to Build a Relational Cache
 
Securing Hadoop with Apache Ranger
Securing Hadoop with Apache RangerSecuring Hadoop with Apache Ranger
Securing Hadoop with Apache Ranger
 
Intro to Data Vault 2.0 on Snowflake
Intro to Data Vault 2.0 on SnowflakeIntro to Data Vault 2.0 on Snowflake
Intro to Data Vault 2.0 on Snowflake
 
Oracle db performance tuning
Oracle db performance tuningOracle db performance tuning
Oracle db performance tuning
 
Snowflake Datawarehouse Architecturing
Snowflake Datawarehouse ArchitecturingSnowflake Datawarehouse Architecturing
Snowflake Datawarehouse Architecturing
 
HBase and HDFS: Understanding FileSystem Usage in HBase
HBase and HDFS: Understanding FileSystem Usage in HBaseHBase and HDFS: Understanding FileSystem Usage in HBase
HBase and HDFS: Understanding FileSystem Usage in HBase
 
Apache Iceberg Presentation for the St. Louis Big Data IDEA
Apache Iceberg Presentation for the St. Louis Big Data IDEAApache Iceberg Presentation for the St. Louis Big Data IDEA
Apache Iceberg Presentation for the St. Louis Big Data IDEA
 
Exadata
ExadataExadata
Exadata
 
Deep Dive on ClickHouse Sharding and Replication-2202-09-22.pdf
Deep Dive on ClickHouse Sharding and Replication-2202-09-22.pdfDeep Dive on ClickHouse Sharding and Replication-2202-09-22.pdf
Deep Dive on ClickHouse Sharding and Replication-2202-09-22.pdf
 
InfluxDB IOx Tech Talks: Replication, Durability and Subscriptions in InfluxD...
InfluxDB IOx Tech Talks: Replication, Durability and Subscriptions in InfluxD...InfluxDB IOx Tech Talks: Replication, Durability and Subscriptions in InfluxD...
InfluxDB IOx Tech Talks: Replication, Durability and Subscriptions in InfluxD...
 
Oracle statistics by example
Oracle statistics by exampleOracle statistics by example
Oracle statistics by example
 
Informatica Cloud Overview
Informatica Cloud OverviewInformatica Cloud Overview
Informatica Cloud Overview
 
EDB Postgres DBA Best Practices
EDB Postgres DBA Best PracticesEDB Postgres DBA Best Practices
EDB Postgres DBA Best Practices
 
Introduction to Greenplum
Introduction to GreenplumIntroduction to Greenplum
Introduction to Greenplum
 
HBase Application Performance Improvement
HBase Application Performance ImprovementHBase Application Performance Improvement
HBase Application Performance Improvement
 
Snowflake for Data Engineering
Snowflake for Data EngineeringSnowflake for Data Engineering
Snowflake for Data Engineering
 
Apache Tez: Accelerating Hadoop Query Processing
Apache Tez: Accelerating Hadoop Query Processing Apache Tez: Accelerating Hadoop Query Processing
Apache Tez: Accelerating Hadoop Query Processing
 

Similar to Cost-based Query Optimization in Hive

Cost-based query optimization in Apache Hive 0.14
Cost-based query optimization in Apache Hive 0.14Cost-based query optimization in Apache Hive 0.14
Cost-based query optimization in Apache Hive 0.14Julian Hyde
 
Cost-based query optimization in Apache Hive 0.14
Cost-based query optimization in Apache Hive 0.14Cost-based query optimization in Apache Hive 0.14
Cost-based query optimization in Apache Hive 0.14Julian Hyde
 
SQL on everything, in memory
SQL on everything, in memorySQL on everything, in memory
SQL on everything, in memoryJulian Hyde
 
Apache Kylin: OLAP Engine on Hadoop - Tech Deep Dive
Apache Kylin: OLAP Engine on Hadoop - Tech Deep DiveApache Kylin: OLAP Engine on Hadoop - Tech Deep Dive
Apache Kylin: OLAP Engine on Hadoop - Tech Deep DiveXu Jiang
 
Smarter Together - Bringing Relational Algebra, Powered by Apache Calcite, in...
Smarter Together - Bringing Relational Algebra, Powered by Apache Calcite, in...Smarter Together - Bringing Relational Algebra, Powered by Apache Calcite, in...
Smarter Together - Bringing Relational Algebra, Powered by Apache Calcite, in...Julian Hyde
 
What's New in Apache Hive 3.0?
What's New in Apache Hive 3.0?What's New in Apache Hive 3.0?
What's New in Apache Hive 3.0?DataWorks Summit
 
What's New in Apache Hive 3.0 - Tokyo
What's New in Apache Hive 3.0 - TokyoWhat's New in Apache Hive 3.0 - Tokyo
What's New in Apache Hive 3.0 - TokyoDataWorks Summit
 
Fast federated SQL with Apache Calcite
Fast federated SQL with Apache CalciteFast federated SQL with Apache Calcite
Fast federated SQL with Apache CalciteChris Baynes
 
Self-serve analytics journey at Celtra: Snowflake, Spark, and Databricks
Self-serve analytics journey at Celtra: Snowflake, Spark, and DatabricksSelf-serve analytics journey at Celtra: Snowflake, Spark, and Databricks
Self-serve analytics journey at Celtra: Snowflake, Spark, and DatabricksGrega Kespret
 
Presentation v mware roi tco calculator
Presentation   v mware roi tco calculatorPresentation   v mware roi tco calculator
Presentation v mware roi tco calculatorsolarisyourep
 
2018 data warehouse features in spark
2018   data warehouse features in spark2018   data warehouse features in spark
2018 data warehouse features in sparkChester Chen
 
Querying Druid in SQL with Superset
Querying Druid in SQL with SupersetQuerying Druid in SQL with Superset
Querying Druid in SQL with SupersetDataWorks Summit
 
Apache Calcite: One Frontend to Rule Them All
Apache Calcite: One Frontend to Rule Them AllApache Calcite: One Frontend to Rule Them All
Apache Calcite: One Frontend to Rule Them AllMichael Mior
 
Use dependency injection to get Hadoop *out* of your application code
Use dependency injection to get Hadoop *out* of your application codeUse dependency injection to get Hadoop *out* of your application code
Use dependency injection to get Hadoop *out* of your application codeDataWorks Summit
 
Building a Data Warehouse for Business Analytics using Spark SQL-(Blagoy Kalo...
Building a Data Warehouse for Business Analytics using Spark SQL-(Blagoy Kalo...Building a Data Warehouse for Business Analytics using Spark SQL-(Blagoy Kalo...
Building a Data Warehouse for Business Analytics using Spark SQL-(Blagoy Kalo...Spark Summit
 
Fast SQL on Hadoop, really?
Fast SQL on Hadoop, really?Fast SQL on Hadoop, really?
Fast SQL on Hadoop, really?DataWorks Summit
 
Maximizing Database Tuning in SAP SQL Anywhere
Maximizing Database Tuning in SAP SQL AnywhereMaximizing Database Tuning in SAP SQL Anywhere
Maximizing Database Tuning in SAP SQL AnywhereSAP Technology
 
Presto query optimizer: pursuit of performance
Presto query optimizer: pursuit of performancePresto query optimizer: pursuit of performance
Presto query optimizer: pursuit of performanceDataWorks Summit
 
Powerpivot web wordpress present
Powerpivot web wordpress presentPowerpivot web wordpress present
Powerpivot web wordpress presentMariAnne Woehrle
 

Similar to Cost-based Query Optimization in Hive (20)

Cost-based query optimization in Apache Hive 0.14
Cost-based query optimization in Apache Hive 0.14Cost-based query optimization in Apache Hive 0.14
Cost-based query optimization in Apache Hive 0.14
 
Cost-based query optimization in Apache Hive 0.14
Cost-based query optimization in Apache Hive 0.14Cost-based query optimization in Apache Hive 0.14
Cost-based query optimization in Apache Hive 0.14
 
SQL on everything, in memory
SQL on everything, in memorySQL on everything, in memory
SQL on everything, in memory
 
Apache Kylin: OLAP Engine on Hadoop - Tech Deep Dive
Apache Kylin: OLAP Engine on Hadoop - Tech Deep DiveApache Kylin: OLAP Engine on Hadoop - Tech Deep Dive
Apache Kylin: OLAP Engine on Hadoop - Tech Deep Dive
 
Smarter Together - Bringing Relational Algebra, Powered by Apache Calcite, in...
Smarter Together - Bringing Relational Algebra, Powered by Apache Calcite, in...Smarter Together - Bringing Relational Algebra, Powered by Apache Calcite, in...
Smarter Together - Bringing Relational Algebra, Powered by Apache Calcite, in...
 
What's New in Apache Hive 3.0?
What's New in Apache Hive 3.0?What's New in Apache Hive 3.0?
What's New in Apache Hive 3.0?
 
What's New in Apache Hive 3.0 - Tokyo
What's New in Apache Hive 3.0 - TokyoWhat's New in Apache Hive 3.0 - Tokyo
What's New in Apache Hive 3.0 - Tokyo
 
Fast federated SQL with Apache Calcite
Fast federated SQL with Apache CalciteFast federated SQL with Apache Calcite
Fast federated SQL with Apache Calcite
 
Self-serve analytics journey at Celtra: Snowflake, Spark, and Databricks
Self-serve analytics journey at Celtra: Snowflake, Spark, and DatabricksSelf-serve analytics journey at Celtra: Snowflake, Spark, and Databricks
Self-serve analytics journey at Celtra: Snowflake, Spark, and Databricks
 
Presentation v mware roi tco calculator
Presentation   v mware roi tco calculatorPresentation   v mware roi tco calculator
Presentation v mware roi tco calculator
 
2018 data warehouse features in spark
2018   data warehouse features in spark2018   data warehouse features in spark
2018 data warehouse features in spark
 
Querying Druid in SQL with Superset
Querying Druid in SQL with SupersetQuerying Druid in SQL with Superset
Querying Druid in SQL with Superset
 
Apache Calcite: One Frontend to Rule Them All
Apache Calcite: One Frontend to Rule Them AllApache Calcite: One Frontend to Rule Them All
Apache Calcite: One Frontend to Rule Them All
 
Use dependency injection to get Hadoop *out* of your application code
Use dependency injection to get Hadoop *out* of your application codeUse dependency injection to get Hadoop *out* of your application code
Use dependency injection to get Hadoop *out* of your application code
 
Building a Data Warehouse for Business Analytics using Spark SQL-(Blagoy Kalo...
Building a Data Warehouse for Business Analytics using Spark SQL-(Blagoy Kalo...Building a Data Warehouse for Business Analytics using Spark SQL-(Blagoy Kalo...
Building a Data Warehouse for Business Analytics using Spark SQL-(Blagoy Kalo...
 
Fast SQL on Hadoop, really?
Fast SQL on Hadoop, really?Fast SQL on Hadoop, really?
Fast SQL on Hadoop, really?
 
Maximizing Database Tuning in SAP SQL Anywhere
Maximizing Database Tuning in SAP SQL AnywhereMaximizing Database Tuning in SAP SQL Anywhere
Maximizing Database Tuning in SAP SQL Anywhere
 
Presto query optimizer: pursuit of performance
Presto query optimizer: pursuit of performancePresto query optimizer: pursuit of performance
Presto query optimizer: pursuit of performance
 
Powerpivot web wordpress present
Powerpivot web wordpress presentPowerpivot web wordpress present
Powerpivot web wordpress present
 
Quark Virtualization Engine for Analytics
Quark Virtualization Engine for Analytics Quark Virtualization Engine for Analytics
Quark Virtualization Engine for Analytics
 

More from DataWorks Summit

Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisDataWorks Summit
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiDataWorks Summit
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...DataWorks Summit
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...DataWorks Summit
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal SystemDataWorks Summit
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExampleDataWorks Summit
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberDataWorks Summit
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixDataWorks Summit
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiDataWorks Summit
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsDataWorks Summit
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureDataWorks Summit
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EngineDataWorks Summit
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...DataWorks Summit
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudDataWorks Summit
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiDataWorks Summit
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerDataWorks Summit
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...DataWorks Summit
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouDataWorks Summit
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkDataWorks Summit
 

More from DataWorks Summit (20)

Data Science Crash Course
Data Science Crash CourseData Science Crash Course
Data Science Crash Course
 
Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache Ratis
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal System
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist Example
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at Uber
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability Improvements
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant Architecture
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything Engine
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google Cloud
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near You
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
 

Recently uploaded

How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfPrecisely
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsMiki Katsuragi
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxhariprasad279825
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfRankYa
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 

Recently uploaded (20)

How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptx
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdf
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 

Cost-based Query Optimization in Hive

  • 1. Page1 © Hortonworks Inc. 2014 Cost-based query optimization in Apache Hive Julian Hyde Julian Hyde June 4th, 2014
  • 2. Page2 © Hortonworks Inc. 2014 About me Julian Hyde Architect at Hortonworks Open source: • Founder & lead, Apache Optiq (query optimization framework) • Founder & lead, Pentaho Mondrian (analysis engine) • Committer, Apache Drill • Contributor, Apache Hive • Contributor, Cascading Lingual (SQL interface to Cascading) Past: • SQLstream (streaming SQL) • Broadbase (data warehouse) • Oracle (SQL kernel development)
  • 3. Page3 © Hortonworks Inc. 2014 (Thanks to John Pullokkaran, Harish Butani for presentation content and actually doing the work.)
  • 4. Page4 © Hortonworks Inc. 2014 Apache Hive The original “SQL on Hadoop” Undergoing extensive renovation • Tez execution engine • YARN execution environment • Vectorized data representation • Column-oriented data storage (ORC) • ACID transactions • SQL standards compliance • SQL authorization model • Cost-based query optimization (CBO) What? Why? How? When? “Stinger Initiative”
  • 5. Page5 © Hortonworks Inc. 2014 Incremental cutover to cost-based optimization Release Date Remarks Apache Hive 0.12 October 2013 • Rule-based Optimizations • No join reordering • Main optimizations: predicate push- down & partition pruning • Semantic info and operator tree tightly coupled Apache Hive 0.13 April 2014 “Old-style” JOIN & push-down conditions: … FROM t1, t2 WHERE … CBO just missed the deadline  HDP 2.1 April 2014 Cost-based ordering of joins • HIVE-6439 “Introduce CBO step in Semantic Analyzer” • HIVE-5775 “Introduce Cost Based Optimizer in Hive” Apache Hive 0.14 ? CBO patches More rework of internals More cost-based features…
  • 6. Page6 © Hortonworks Inc. 2014 Apache Optiq (incubating)
  • 7. Page7 © Hortonworks Inc. 2014 Apache Optiq Apache incubator project since May, 2014 Query planning framework • Extensible • Usable standalone (JDBC) or embedded Adoption Lingual – SQL interface to Cascading Apache Drill Apache Hive Adapters: Splunk, Spark, MongoDB, JDBC, CSV, Web tables, In-memory data
  • 8. Page8 © Hortonworks Inc. 2014 Conventional DB architecture
  • 9. Page9 © Hortonworks Inc. 2014 Optiq architecture
  • 10. Page10 © Hortonworks Inc. 2014 Optiq – APIs and SPIs Cost, statistics RelOptCost RelOptCostFactory RelMetadataProvider • RelMdColumnUniquensss • RelMdDistinctRowCount • RelMdSelectivity SQL parser SqlNode SqlParser SqlValidator Transformation rules RelOptRule • MergeFilterRule • PushAggregateThroughUni onRule • RemoveCorrelationForScal arProjectRule • 100+ more Unification (materialized view) Column trimming Relational algebra RelNode (operator) • TableScan • Filter • Project • Union • Aggregate • … RelDataType (type) RexNode (expression) RelTrait (physical property) • RelConvention (calling-convention) • RelCollation (sortedness) • TBD (bucketedness/distribution) JDBC driver Metadata Schema Table Function • TableFunction • TableMacro
  • 11. Page11 © Hortonworks Inc. 2014 Now… back to Hive
  • 12. Page12 © Hortonworks Inc. 2014 CBO in Hive Why cost-based optimization? Ease of Use – Join Reordering View Chaining Ad hoc queries involving multiple views Enables BI Tools as front ends to Hive First version Modest goal Concrete results Join re-ordering Page 12
  • 13. Page13 © Hortonworks Inc. 2014 Query preparation – Hive 0.13 SQL parser Semantic analyzer Logical Optimizer Physical Optimizer Abstract Syntax Tree (AST) Hive SQL Annotated AST Plan Tez Tuned Plan
  • 14. Page14 © Hortonworks Inc. 2014 Query preparation – full CBO SQL parser Semantic analyzer Translate to algebra Physical Optimizer Abstract Syntax Tree (AST) Hive SQL Tez Tuned Plan Optiq optimizer RelNode Annotated AST
  • 15. Page15 © Hortonworks Inc. 2014 Query preparation – initial CBO SQL parser Semantic analyzer Logical Optimizer Physical Optimizer Hive SQL AST with optimized join-ordering Tez Tuned Plan Translate to algebra Optiq optimizer
  • 16. Page16 © Hortonworks Inc. 2014 © Hortonworks Inc. 2013 Query Execution – The basics Page 16 SELECT R1.x FROM R1 JOIN R2 ON R1.x = R2.x JOIN R3 on R1.x = R3.x AND R2.x = R3.x WHERE R1.z > 10; p s   R1 R2 R3 TS [R1] TS [R2] RS RS Shuffle Join TS [R3] Map Join Filter FS
  • 17. Page17 © Hortonworks Inc. 2014 © Hortonworks Inc. 2013 Query Optimization – Rule Based vs. Cost Based Page 17 p s   R1 R2 R3 p s   R1 R2 R3 p s   R1 R3 R2 p s   R2 R3 R1
  • 18. Page18 © Hortonworks Inc. 2014 Introduction of CBO into Hive Planning cbo enabled? No Generate Plan w/o multi-way joins Can cbo handle plan? No - Predicate Pushdown - Part. Pruning - Column Pruning - Stats Annotation Pre CBO Optimizer Col stats available? No Optiq-based Planner Hive Plan Revised AST Regular Planning route on new AST with CBO turned off. Fallback to Regular planning: as though cbo is disabled. - < 10 total Join Ops - No Outer Joins - No Windowing, Lateral Views, Script Op. Series of gating factors to get a CBO Plan.
  • 19. Page19 © Hortonworks Inc. 2014 Optiq Planner Process Hive Plan Planner RelNode GraphRelNode Converter RexNode Converter Hive Op  RelNode Hive Expr  RexNode • Node for each node in Input Plan • Each node is a Set of alternate Sub Plans • Set further divided into Subsets: based on traits like sortedness 1. Plan Graph • Rule: specifies a Operator sub-graph to match and logic to generate equivalent ‘better’ sub-graph. • We only have Join Reordering Rules. 2. Rules • RelNodes have Cost (& Cumulative Cost) • We only use Cardinality for Cost. 3. Cost Model - Used to Plugin Schema, Cost Formulas: Selectivity, NDV calculations etc. - We only added Selectivity and NDV formulas; Schema is only available at the Node level 4. Metadata Providers Rule Match Queue - Add Rule matches to Queue - Apply Rule match transformations to Plan Graph - Iterate for fixed iterations or until Cost doesn’t change. - Match importance based on Cost of RelNode and height. Best RelNode Graph AST Converter Revised AST Logical Plan Physical traits: Table Part./Buckets; RedSink Ops removed
  • 20. Page20 © Hortonworks Inc. 2014 Join Reordering Rules a b = b a 1. Swap Join Rule a b = 2. Push Join Through Join Rule c a c b c b a= but is really: Optiq schema is position based b a c 3. So a b c d ≠ a c d b 4. Pull Up Project above Join b a c d a c b d = Added bonus Join permutations across sub-query blocks 5. Merge Projects
  • 21. Page21 © Hortonworks Inc. 2014 Summary Join re-ordering Join cardinality is used for cost All other operators are assumed to have tiny cost Cardinality of filter, join, group-by is based on selectivity Selectivity is computed based on number-of-distinct-values (NDV) Table Stats and Column stats are required Current limitations Only supports: filter, inner join, group-by, project, order-by, limit Not all UDFs Does not attempt all join permutations (e.g. bushy trees; 10-way joins or more) May not work well for Bucket, SMB & Skew Joins Page 21
  • 22. Page22 © Hortonworks Inc. 2014 TPC-DS Query 50 Joins Store Sales, and Store Returns fact tables. Each of the fact tables are independently restricted by date. Analysis at Store grain, so this dimension also joined in. As specified Query starts by joining the 2 Fact tables. select s_store_name , .. other store details ,sum(case when (sr_returned_date_sk - ss_sold_date_sk <= 30 ) then 1 else 0 end) as `30 days`, … from store_sales ss,store_returns sr,store s ,date_dim d1 ,date_dim d2 where d2.d_year = 2000 and d2.d_moy = 9 and ss.ss_ticket_number = sr.sr_ticket_number and ss.ss_item_sk = sr.sr_item_sk and ss.ss_sold_date_sk = d1.d_date_sk and sr.sr_returned_date_sk = d2.d_date_sk and ss.ss_customer_sk = sr.sr_customer_sk and ss.ss_store_sk = s.s_store_sk group by store details order by store details limit 100; Join Graph
  • 23. Page23 © Hortonworks Inc. 2014 TPC-DS Query 50 Specified Join Tree Non CBO Plan CBO Plan
  • 24. Page24 © Hortonworks Inc. 2014 TPC-DS Query 50 Run 1 Run 2 Non CBO 53.1 53.4 CBO 22.5 21.9  1 year test  > 10 mins for Non CBO  CBO time was about the same  Fact tables  partitioned by Day,  bucketed by Item  Bucketing off  Bucketing should help CBO plan.  SR table much smaller. Better chance of Bucket Join in place of Shuffle Join. Join Ordering Cost Estimate ['d2', [[['store_sales', 'd1'], 'store_returns'], 'store']] 515074768.659 ['d1', [[['store_sales', 'store'], 'store_returns'], 'd2']] 448155.355 … ['store_returns', 'd2'] 9938.93 ['store_sales', 'store_returns'] 156727295.634 ['d1', 'store_sales'] 123675664.449 Facts restricted to 3 months Orderings considered by Planner
  • 25. Page25 © Hortonworks Inc. 2014 TPC-DS Query 17 Joins Store Sales, Store Returns and Catalog Sales fact tables. Each of the fact tables are independently restricted by time. Analysis at Item and Store grain, so these dimensions are also joined in. As specified Query starts by joining the 3 Fact tables. select i_item_id ,i_item_desc ,s_state ,count(ss_quantity) as store_sales_quantitycount ,…. from store_sales ss ,store_returns sr, catalog_sales cs, date_dim d1, date_dim d2, date_dim d3, store s, item I where d1.d_quarter_name = '2000Q1’ and d1.d_date_sk = ss.ss_sold_date_sk and i.i_item_sk = ss.ss_item_sk and … group by i_item_id ,i_item_desc, ,s_state order by i_item_id ,i_item_desc, s_state limit 100;
  • 26. Page26 © Hortonworks Inc. 2014 TPC-DS Query 17 Specified Join Tree Non CBO Plan CBO Plan
  • 27. Page27 © Hortonworks Inc. 2014 TPC-DS Query 17 Run 1 Run 2 Non CBO 100.71 127.53 CBO 50.9 44.52  1 year test  > 10 mins for Non CBO  CBO time was about the same  Fact tables  partitioned by Day,  bucketed by Item  Bucketing off  Bucketing should help CBO plan.  SR table much smaller. Better chance of Bucket Join in place of Shuffle Join. Join Ordering Cost Estimate ['item', [[[[[['d2', 'store_returns'], 'store_sales'], 'catalog_sales'], 'd1'], 'd3'], 'store']] 3547898.061 … ['store_returns', 'd2’] 19224.71 ['store_sales', 'store_returns’] 23057497.991 ['d1', 'store_sales'] 26142.943 Facts restricted to 3 months Orderings considered by Planner
  • 28. Page28 © Hortonworks Inc. 2014 Next? Outer joins Scale to larger numbers of joins Support all expressions (UDFs) Join algorithm selection Sortedness & distribution as a trait Trait propagation Better cost model More statistics Move all pre-planning and logical planning to Optiq Use Optiq costs/statistics to help physical planning Constant reduction & tree pruning Rewrite query to use materialized view
  • 29. Page29 © Hortonworks Inc. 2014 Thank you! @julianhyde http://hive.apache.org/ http://incubator.apache.org/projects/optiq.html

Editor's Notes

  1. Hive CBO didn’t quite make it into Apache Hive 0.13. This talk: What is CBO? Why are we putting it in Hive? How did we do it? When is it released? And what next?
  2. 0. Converters convert a Hive Op. Graph to an Optiq representation. In Optiq we have RelNodes and RexNodes in place of Operators and ExprNodes. The conversion creates a ‘Logical’ plan. RedSinks are dropped; Physical traits like Partitioning/Bucketness is lost. The Plan Graph is the central data structure of the Planner. There is a Node for each Node in the input Plan. A Node represents a Set of equivalent Sub Graphs(Plans). Each Set is further divided into Subsets based on traits: traits capture physical attributes like sortedness/bucketness Rules comprise of a Match Graph Template and an onMatch action. Action generates a ‘better’ equivalent Plan. So Rule match actions populates Plan Graph Sets. Metadata Providers provide all Metadata information to the Planner: Schema, but also Cost Formulas like Selectivity and NDV calculations. RelNodes have Cost. The Cost model encapsulates Cost calculations. Rule Match Queue is a Queue of Rule Matches. Planner runs until the Queue is empty for a fixed number of iterations. The application of a RuleMatch adds to the Plan Graph and also adds new Rule Matches to the Queue. RuleMatches are ordered based on importance: which is based on RelNode cost and distance of Node in Plan from Root.