© Hortonworks Inc. 2011
Hive Correlation Optimizer
Yin Huai
yhuai@hortonworks.com
huai@cse.ohio-state.edu
Hadoop Summit 2013 Hive User Group Meetup
About me
•Hive contributor
•Summer intern at Hortonworks
•4th year Ph.D. student at The Ohio State University
•Research interests: query optimizations, file formats, distributed systems, and storage systems
Architecting the Future of Big Data
Outline
•Query planning in Hive
•Correlations in a query (Intra-query correlations)
•Case studies
•Automatically exploiting correlations (HIVE-2206: Correlation Optimizer)
Query planning
SELECT t1.c2, count(*)
FROM t1 JOIN t2 ON (t1.c1=t2.c1)
GROUP BY t1.c2
[Diagram: t1 and t2 feed JOIN (t1.c1=t2.c1), whose output feeds AGG, which calculates count(*) for every group of t1.c2.]
Evaluate this query in distributed systems
[Diagram: t1 and t2 are shuffled on c1 into JOIN; the JOIN output is shuffled on c2 into AGG.]
How to shuffle? Use the key column(s)
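Shuffling on the key column(s) can be sketched in Python as follows. This is a minimal illustration, assuming hash partitioning; the `shuffle` helper, the sample rows, and the partition count are hypothetical and are not Hive code.

```python
from collections import defaultdict

def shuffle(rows, key_column, num_partitions):
    """Route each row to a partition chosen by hashing its key column.

    Rows with equal key values always land in the same partition, which is
    what lets a JOIN or AGG see all matching rows together.
    """
    partitions = defaultdict(list)
    for row in rows:
        partitions[hash(row[key_column]) % num_partitions].append(row)
    return dict(partitions)

# Hypothetical sample rows for t1 (columns c1 and c2).
t1 = [
    {"c1": 1, "c2": "a"},
    {"c1": 2, "c2": "b"},
    {"c1": 1, "c2": "b"},
]

by_c1 = shuffle(t1, "c1", num_partitions=4)  # shuffle for the JOIN on c1
by_c2 = shuffle(t1, "c2", num_partitions=4)  # shuffle for the AGG on c2
```

The same rows end up grouped differently depending on the key, which is why the JOIN and the AGG in this query each need their own shuffle.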
Generating MapReduce jobs
[Diagram: the plan is cut at each shuffle. Job 1 shuffles t1 and t2 on c1, evaluates JOIN, and writes the result to tmp. Job 2 reads tmp, shuffles on c2, and evaluates AGG.]
A single MapReduce job can shuffle data only once.
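The way a plan is cut into jobs can be sketched as below. This is a simplified model, assuming every operator that needs shuffled input gets its own MapReduce job; the `Op` class and `assign_jobs` function are hypothetical names for illustration, not Hive classes.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Op:
    name: str
    shuffle_key: Optional[str] = None  # set when the operator needs shuffled input
    inputs: List["Op"] = field(default_factory=list)

def assign_jobs(root):
    """Return (operator, shuffle key) pairs, one per job, in execution order.

    Assumption of this sketch: one job shuffles only once, so each operator
    that needs a shuffle starts a new job.
    """
    jobs = []

    def visit(op):
        for child in op.inputs:  # inputs must be computed first
            visit(child)
        if op.shuffle_key is not None:
            jobs.append((op.name, op.shuffle_key))

    visit(root)
    return jobs

t1, t2 = Op("t1"), Op("t2")
join = Op("JOIN", shuffle_key="c1", inputs=[t1, t2])
agg = Op("AGG", shuffle_key="c2", inputs=[join])

jobs = assign_jobs(agg)  # first entry is Job 1 (JOIN), second is Job 2 (AGG)
```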
MapReduce will shuffle the data for us; we just need to emit outputs from the Map phase.
We use ReduceSinkOperator (RS) to emit Map outputs. An RS marks the end of a Map phase.
[Diagram: Job 1 Map scans t1 and t2 and ends at RS1 and RS2; Job 1 Reduce evaluates JOIN and writes tmp. Job 2 Map scans tmp and ends at another RS; Job 2 Reduce evaluates AGG.]
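The two jobs can be simulated in a few lines of Python. This is only a sketch under stated assumptions: `run_job` is a hypothetical stand-in for the MapReduce framework, the map functions play the role of the RSs by emitting (key, value) pairs, and the sample rows are made up.

```python
from collections import defaultdict

def run_job(inputs, map_fn, reduce_fn):
    """Run one simulated MapReduce job.

    map_fn yields (key, value) pairs, playing the role of a ReduceSink;
    grouping the pairs by key stands in for the shuffle done by MapReduce.
    """
    groups = defaultdict(list)
    for tag, rows in inputs.items():
        for row in rows:
            for key, value in map_fn(tag, row):
                groups[key].append(value)
    output = []
    for key, values in groups.items():
        output.extend(reduce_fn(key, values))
    return output

# Hypothetical sample data: rows are (c1, c2) tuples.
t1 = [(1, "a"), (1, "b"), (2, "a")]
t2 = [(1, "x"), (2, "y"), (2, "z")]

# Job 1: emit rows keyed by c1 (RS1 for t1, RS2 for t2), then JOIN in Reduce.
def job1_map(tag, row):
    yield row[0], (tag, row)

def job1_reduce(c1, tagged_rows):
    left = [r for tag, r in tagged_rows if tag == "t1"]
    right = [r for tag, r in tagged_rows if tag == "t2"]
    for left_row in left:
        for _ in right:
            yield {"c1": c1, "c2": left_row[1]}  # only t1.c2 is needed downstream

tmp = run_job({"t1": t1, "t2": t2}, job1_map, job1_reduce)

# Job 2: emit tmp rows keyed by c2, then AGG (count(*)) in Reduce.
def job2_map(tag, row):
    yield row["c2"], 1

def job2_reduce(c2, ones):
    yield (c2, sum(ones))

result = dict(run_job({"tmp": tmp}, job2_map, job2_reduce))
```

Here `tmp` plays the part of the intermediate table that Job 1 materializes and Job 2 reads back.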
Intra-query correlations
SELECT x.c1, count(*)
FROM t1 x JOIN t1 y ON (x.c1=y.c1)
GROUP BY x.c1
[Diagram: t1 as x and t1 as y feed JOIN (x.c1=y.c1), whose output feeds AGG, which calculates count(*) for every group of x.c1.]
Correlations:
1. Same input tables
2. JOIN and AGG using the same key
[Diagram: t1 as x and t2 as y feed JOIN1 (x.c1=y.c1); t1 as z feeds AGG1, which calculates count(*) for every group of z.c1; JOIN1 and AGG1 feed JOIN2 (p.c1=q.c1).]
SELECT p.c1, q.cnt
FROM (SELECT x.c1 AS c1 FROM t1 x JOIN t2 y ON (x.c1=y.c1)) p
JOIN (SELECT z.c1 AS c1, count(*) AS cnt FROM t1 z GROUP BY z.c1) q
ON (p.c1=q.c1)
Correlations:
1. Same input tables (t1)
2. JOIN1 and AGG1 using the same key
3. JOIN2 and all of its parents using the same key
Intra-query correlations
• Defined in “YSmart: Yet Another SQL-to-MapReduce Translator”
– http://ysmart.cse.ohio-state.edu/
– http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-7.pdf
• Targets operators that need to shuffle data, and the inputs of those operators
• Three kinds of correlations
– Input correlation (IC): independent operators share the same input tables
– Transit correlation (TC): independent operators have input correlation and also shuffle the data in the same way (e.g. using the same keys)
– Job flow correlation (JFC): two dependent operators shuffle the data in the same way
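[Diagram, IC: JOIN1 (over t1 as x and t2 as y) and AGG1 (over t1 as z) both read t1.]
[Diagram, TC: JOIN1 (x.c1=y.c1) and AGG1 (group by z.c1) both read t1 and shuffle on the same key.]
[Diagram, JFC: a JOIN (x.c1=y.c1) and an AGG (group by z.c1), where one consumes the other's output, shuffle on the same key.]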
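The three definitions can be expressed as a small classifier. This is a sketch only: the `ShuffleOp` record and `correlations` function are hypothetical, and the sketch assumes shuffle keys have already been resolved to comparable column names (so x.c1 and z.c1 both appear as "c1").

```python
from dataclasses import dataclass
from typing import FrozenSet, Tuple

@dataclass(frozen=True)
class ShuffleOp:
    name: str
    input_tables: FrozenSet[str]   # base tables read directly by this operator
    shuffle_keys: Tuple[str, ...]  # columns the operator shuffles on
    depends_on: FrozenSet[str] = frozenset()  # names of operators it consumes

def correlations(a, b):
    """Return the set of correlation kinds ("IC", "TC", "JFC") between a and b."""
    found = set()
    dependent = a.name in b.depends_on or b.name in a.depends_on
    same_shuffle = a.shuffle_keys == b.shuffle_keys
    if not dependent and a.input_tables & b.input_tables:
        found.add("IC")          # independent operators sharing an input table
        if same_shuffle:
            found.add("TC")      # ... and shuffling in the same way
    if dependent and same_shuffle:
        found.add("JFC")         # dependent operators shuffling in the same way
    return found

join1 = ShuffleOp("JOIN1", frozenset({"t1", "t2"}), ("c1",))
agg1 = ShuffleOp("AGG1", frozenset({"t1"}), ("c1",))
join2 = ShuffleOp("JOIN2", frozenset(), ("c1",), frozenset({"JOIN1", "AGG1"}))
```

With these inputs, JOIN1 and AGG1 have both input and transit correlation, while JOIN2 has job flow correlation with each of its parents.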
Correlation-unaware query planning
[Diagram: t1 is scanned twice and shuffled on c1 into JOIN; the JOIN output is shuffled on c1 again into AGG.]
Hive does not check whether:
1. A table has been used multiple times
2. Data really needs to be shuffled
[Diagram: Job 1 shuffles both scans of t1 on c1, evaluates JOIN, and writes tmp. Job 2 reads tmp, shuffles on c1 again, and evaluates AGG.]
Drawbacks:
1. Unnecessary data loading
2. Unnecessary data shuffling
3. Unnecessary data materialization
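The cost of ignoring these correlations can be seen by evaluating the self-join query (SELECT x.c1, count(*) FROM t1 x JOIN t1 y ON (x.c1=y.c1) GROUP BY x.c1) in two ways. The sketch below is illustrative only: both functions and the sample data are hypothetical, and t1 is reduced to its c1 column.

```python
from collections import Counter, defaultdict

t1 = [1, 1, 2, 3, 3, 3]  # hypothetical values of t1.c1

def two_job_plan(rows):
    """Correlation-unaware plan: join in Job 1, write tmp, aggregate in Job 2."""
    # Job 1: both scans of t1 are shuffled on c1, then joined.
    by_key_x, by_key_y = defaultdict(list), defaultdict(list)
    for c1 in rows:          # first load of t1 (as x)
        by_key_x[c1].append(c1)
    for c1 in rows:          # second load of t1 (as y)
        by_key_y[c1].append(c1)
    tmp = [k for k in by_key_x for _ in by_key_x[k] for _ in by_key_y.get(k, [])]
    # Job 2: tmp is shuffled on c1 again, then counted.
    return dict(Counter(tmp))

def one_job_plan(rows):
    """Correlation-aware plan: one load, one shuffle, join and count together."""
    groups = Counter(rows)   # single load and single shuffle on c1
    # Within one key group, a self-join of n rows yields n * n rows.
    return {k: n * n for k, n in groups.items()}
```

Both plans return the same counts, but the second one loads t1 once, shuffles once, and never materializes tmp.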
Case studies: TPC-H Q17 (Flattened)
SELECT
sum(l_extendedprice) / 7.0 as avg_yearly
FROM
(SELECT l_partkey, l_quantity, l_extendedprice
FROM lineitem JOIN part ON (p_partkey=l_partkey)
WHERE p_brand='Brand#35' AND
p_container = 'MED PKG') touter
JOIN
(SELECT l_partkey as lp, 0.2 * avg(l_quantity) as lq
FROM lineitem
GROUP BY l_partkey) tinner
ON (touter.l_partkey = tinner.lp)
WHERE touter.l_quantity < tinner.lq
[Diagram: lineitem and part feed JOIN1; lineitem feeds AGG1; JOIN1 and AGG1 feed JOIN2, whose output feeds AGG2.]
lineitem is used by both JOIN1 and AGG1
JOIN1, AGG1, and JOIN2 share the same key (l_partkey)
[Diagram: Without Correlation Optimizer, the plan runs as 4 MapReduce jobs (Job 1 to Job 4) and lineitem appears twice. With Correlation Optimizer, the plan runs as 2 MapReduce jobs (Job 1 and Job 2) and lineitem appears once.]
Case studies: TPC-DS Q95 (Flattened)
SELECT count(distinct ws1.ws_order_number) as order_count,
sum(ws1.ws_ext_ship_cost) as total_shipping_cost,
sum(ws1.ws_net_profit) as total_net_profit
FROM web_sales ws1
JOIN customer_address ca ON (ws1.ws_ship_addr_sk = ca.ca_address_sk)
JOIN web_site s ON (ws1.ws_web_site_sk = s.web_site_sk)
JOIN date_dim d ON (ws1.ws_ship_date_sk = d.d_date_sk)
LEFT SEMI JOIN (SELECT ws2.ws_order_number as ws_order_number
FROM web_sales ws2 JOIN web_sales ws3
ON(ws2.ws_order_number = ws3.ws_order_number)
WHERE ws2.ws_warehouse_sk <> ws3.ws_warehouse_sk) ws_wh1
ON (ws1.ws_order_number = ws_wh1.ws_order_number)
LEFT SEMI JOIN (SELECT wr_order_number
FROM web_returns wr
JOIN (SELECT ws4.ws_order_number as ws_order_number
FROM web_sales ws4 JOIN web_sales ws5
ON (ws4.ws_order_number = ws5.ws_order_number)
WHERE ws4.ws_warehouse_sk <> ws5.ws_warehouse_sk) ws_wh2
ON (wr.wr_order_number = ws_wh2.ws_order_number)) tmp1
ON (ws1.ws_order_number = tmp1.wr_order_number)
WHERE d.d_date >= '2001-05-01' AND
d.d_date <= '2001-06-30' AND
ca.ca_state = 'NC' AND
s.web_company_name = 'pri'
[Diagram: query plan for Q95. web_sales, customer_address, web_site, and date_dim feed a Map Join; two JOIN1 nodes each read two web_sales inputs; JOIN2 reads web_returns and one JOIN1; a Semi Join and a final AGG complete the plan. Some web_sales nodes are drawn in black.]
Without Correlation Optimizer
• 6 MapReduce jobs
• Unnecessary data loading (black web_sales nodes)
• Unnecessary data shuffling
With Correlation Optimizer
• Black web_sales nodes share the same data loading
• 3 MapReduce jobs
Follow-up work
• Evaluate JOIN1 only once without materializing a temporary table
• Only use 2 MapReduce jobs
Objectives
• Eliminate unnecessary data loading
– The query planner will be aware of what data will be loaded
– Do as many things as possible with the loaded data
• Eliminate unnecessary data shuffling
– The query planner will be aware of when data really needs to be shuffled
– Do as many things as possible before shuffling the data again
ReduceSink Deduplication
• HIVE-2340
• Handle chained Job Flow Correlations
– e.g. Generating a single job for both Group By and Order By
• Cannot handle complex patterns
– e.g. patterns involving multiple joins
• Need a fundamental solution
• Need to exploit shared input tables
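[Diagram: before deduplication, t1 feeds RS1, then AGG1, then RS2, then the rest of the plan; after deduplication, RS2 is removed and AGG1 feeds the rest of the plan directly.]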
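The idea behind removing a redundant RS on a linear chain can be sketched as follows. This is a deliberately simplified model, not the HIVE-2340 implementation: `dedup_reduce_sinks` is a hypothetical function, it only compares shuffle keys, and a real optimizer would need to check more than key equality.

```python
def dedup_reduce_sinks(chain):
    """Drop any RS whose keys match the nearest RS upstream of it.

    `chain` is a list of (operator_name, keys) pairs in data-flow order;
    keys is None for operators that do not shuffle.
    """
    result = []
    last_rs_keys = None
    for name, keys in chain:
        if name.startswith("RS"):
            if keys == last_rs_keys:
                continue  # data is already partitioned on these keys
            last_rs_keys = keys
        result.append((name, keys))
    return result

# Hypothetical chain matching the slide: t1, RS1, AGG1, RS2, then more operators.
plan = [("t1", None), ("RS1", ("c1",)), ("AGG1", None), ("RS2", ("c1",)), ("NEXT", None)]
optimized = dedup_reduce_sinks(plan)

# When the second shuffle uses a different key, it has to stay.
other = [("t1", None), ("RS1", ("c1",)), ("AGG1", None), ("RS2", ("c2",)), ("NEXT", None)]
unchanged = dedup_reduce_sinks(other)
```

A chain-based rule like this has no way to describe several branches meeting at a join, which is the limitation the bullets above point out.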
Correlation Optimizer
• 2-phase optimizer
– Phase 1: Correlation Detection
– Phase 2: Query plan tree transformation
• This work is not just about the optimizer
– New operators to support the execution of an optimized plan
– A mechanism to coordinate the operator tree inside the Reduce phase
Correlation detection
SELECT p.c1, q.cnt
FROM (SELECT x.c1 AS c1 FROM t1 x JOIN t2 y ON (x.c1=y.c1)) p
JOIN (SELECT z.c1 AS c1, count(*) AS cnt FROM t1 z GROUP BY z.c1) q
ON (p.c1=q.c1)
1. Traverse the tree all the way down to find matching keys in ReduceSinkOperators
2. Then, check input tables to find shared data loading opportunities
[Diagram: t1 as x feeds RS1 (Key: x.c1) and t2 as y feeds RS2 (Key: y.c1), both into JOIN1; t1 as z feeds RS3 (Key: z.c1) into AGG1; JOIN1 feeds RS4 (Key: p.c1) and AGG1 feeds RS5 (Key: q.c1), both into JOIN2.]
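The two detection steps can be sketched on a toy operator tree. This is a minimal model, not the optimizer's code: the `Node` class and `detect` function are hypothetical, and the sketch assumes every RS key has already been resolved to a common label ("c1"), so that x.c1, y.c1, z.c1, p.c1, and q.c1 compare as equal.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Node:
    name: str
    kind: str                      # "TS" (table scan), "RS", "JOIN", or "AGG"
    key: Optional[str] = None      # for RS: the resolved shuffle key
    table: Optional[str] = None    # for TS: the table being scanned
    children: List["Node"] = field(default_factory=list)  # upstream inputs

def detect(top):
    """Return (names of correlated RSs, tables loaded more than once) below `top`."""
    sinks, tables = [], []

    def walk(node, key):
        for child in node.children:
            if child.kind == "RS":
                if key is not None and child.key != key:
                    continue       # a different shuffle is needed below here
                sinks.append(child.name)
                walk(child, child.key)
            elif child.kind == "TS":
                tables.append(child.table)
            else:
                walk(child, key)

    walk(top, None)
    shared = sorted({t for t in tables if tables.count(t) > 1})
    return sinks, shared

x = Node("t1 as x", "TS", table="t1")
y = Node("t2 as y", "TS", table="t2")
z = Node("t1 as z", "TS", table="t1")
rs1 = Node("RS1", "RS", key="c1", children=[x])
rs2 = Node("RS2", "RS", key="c1", children=[y])
rs3 = Node("RS3", "RS", key="c1", children=[z])
join1 = Node("JOIN1", "JOIN", children=[rs1, rs2])
agg1 = Node("AGG1", "AGG", children=[rs3])
rs4 = Node("RS4", "RS", key="c1", children=[join1])
rs5 = Node("RS5", "RS", key="c1", children=[agg1])
join2 = Node("JOIN2", "JOIN", children=[rs4, rs5])

sinks, shared = detect(join2)
```

For this plan all five RSs use matching keys, and t1 shows up as a shared data loading opportunity.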
Query plan tree transformation
SELECT p.c1, q.cnt
FROM (SELECT x.c1 AS c1 FROM t1 x JOIN t2 y ON (x.c1=y.c1)) p
JOIN (SELECT z.c1 AS c1, count(*) AS cnt FROM t1 z GROUP BY z.c1) q
ON (p.c1=q.c1)
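[Diagram, before: t1 as x feeds RS1 (Key: x.c1) and t2 as y feeds RS2 (Key: y.c1), both into JOIN1; t1 as z feeds RS3 (Key: z.c1) into AGG1; JOIN1 feeds RS4 (Key: p.c1) and AGG1 feeds RS5 (Key: q.c1), both into JOIN2.]
[Diagram, after: a single scan of t1 serves both x and z; only RS1, RS2, and RS3 remain, and JOIN1, AGG1, and JOIN2 follow them.]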
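What the transformed plan computes can be sketched as one shuffle followed by all three operators in the same reduce step. This is an illustration under stated assumptions, not Hive's operator implementation: `run_optimized_plan` is a hypothetical function, the tables are reduced to their c1 column, and the output rows are (p.c1, q.cnt) pairs.

```python
from collections import defaultdict

def run_optimized_plan(t1_c1, t2_c1):
    """Evaluate JOIN1, AGG1, and JOIN2 after a single shuffle on c1."""
    # Shuffle: group both inputs by c1. Only row counts are kept because
    # the rows in this sketch carry no column other than c1.
    groups = defaultdict(lambda: {"t1": 0, "t2": 0})
    for c1 in t1_c1:             # one scan of t1 serves both x and z
        groups[c1]["t1"] += 1
    for c1 in t2_c1:
        groups[c1]["t2"] += 1

    output = []
    for c1, n in groups.items():        # one reduce call per key group
        join1_rows = n["t1"] * n["t2"]  # JOIN1: x.c1 = y.c1
        cnt = n["t1"]                   # AGG1: count(*) for this z.c1
        output.extend([(c1, cnt)] * join1_rows)  # JOIN2: p.c1 = q.c1
    return sorted(output)

# Hypothetical sample data.
rows = run_optimized_plan(t1_c1=[1, 1, 2], t2_c1=[1, 2, 2, 3])
```

Because every operator uses c1, each key group holds everything JOIN1, AGG1, and JOIN2 need, so no second shuffle or temporary table is required.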
Thanks
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clash
 

Hive Correlation Optimizer

  • 1. © Hortonworks Inc. 2011 Hive Correlation Optimizer Yin Huai yhuai@hortonworks.com huai@cse.ohio-state.edu Page 1 Hadoop Summit 2013 Hive User Group Meetup
  • 2. About me: Hive contributor; summer intern at Hortonworks; 4th-year Ph.D. student at The Ohio State University. Research interests: query optimization, file formats, distributed systems, and storage systems. Architecting the Future of Big Data
  • 3. Outline: query planning in Hive; correlations in a query (intra-query correlations); case studies; automatically exploiting correlations (HIVE-2206: Correlation Optimizer).
  • 4. Query planning
    Query: SELECT t1.c2, count(*) FROM t1 JOIN t2 ON (t1.c1=t2.c1) GROUP BY t1.c2
    Plan: t1 and t2 feed JOIN (t1.c1=t2.c1), which feeds AGG (calculate count(*) for every group of t1.c2).
  • 5. Query planning
    Query: SELECT t1.c2, count(*) FROM t1 JOIN t2 ON (t1.c1=t2.c1) GROUP BY t1.c2
    To evaluate this query in a distributed system, the input of each operator must be shuffled: t1 and t2 are shuffled on c1 for JOIN, and the JOIN output is shuffled on c2 for AGG.
    How to shuffle? Use the key column(s).
  • 6. Generating MapReduce jobs
    One MR job can shuffle data only once, so the plan (shuffle on c1 for JOIN, shuffle on c2 for AGG) is split into two jobs.
    Job 1: t1 and t2 are shuffled on c1 and joined; the result is written to tmp.
    Job 2: tmp is shuffled on c2 and aggregated.
  • 7. Generating MapReduce jobs
    MapReduce will shuffle the data for us; we just need to emit outputs from the Map phase.
    We use ReduceSinkOperator (RS) to emit Map outputs. RSs are the end of a Map phase.
    Job 1 Map: scans of t1 and t2, each ending in an RS. Job 1 Reduce: JOIN, with its output written to tmp.
    Job 2 Map: scan of tmp, ending in an RS. Job 2 Reduce: AGG.
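
The job-generation rule on slides 6 and 7 (one shuffle per MapReduce job) can be sketched in a few lines of Python. This is a toy model, not Hive's planner: the `Op` class, its `shuffle_key` field, and the `plan_jobs` function are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Op:
    """A node of a simplified operator tree (hypothetical model, not Hive's classes)."""
    name: str
    shuffle_key: Optional[str] = None  # None: no shuffle needed (e.g. a table scan)
    inputs: List["Op"] = field(default_factory=list)


def plan_jobs(root: Op) -> List[Tuple[int, str, str]]:
    """Assign one MapReduce job to every operator that needs a shuffle.

    Inputs are planned before the operator that consumes them, so the job
    numbers form a valid execution order.
    """
    jobs: List[Tuple[int, str, str]] = []

    def visit(op: Op) -> None:
        for child in op.inputs:
            visit(child)
        if op.shuffle_key is not None:
            jobs.append((len(jobs) + 1, op.name, op.shuffle_key))

    visit(root)
    return jobs


# The example query: JOIN on c1, then GROUP BY c2.
t1, t2 = Op("t1"), Op("t2")
join = Op("JOIN", shuffle_key="c1", inputs=[t1, t2])
agg = Op("AGG", shuffle_key="c2", inputs=[join])

for job_id, op_name, key in plan_jobs(agg):
    print(f"Job {job_id}: shuffle on {key}, then run {op_name} in the Reduce phase")
```

In this model the Map phase of each job ends where the shuffle begins, which is the role the slides give to ReduceSinkOperator.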
  • 8. Outline: query planning in Hive; correlations in a query (intra-query correlations); case studies; automatically exploiting correlations (HIVE-2206: Correlation Optimizer).
  • 9. Intra-query correlations
    Query: SELECT x.c1, count(*) FROM t1 x JOIN t1 y ON (x.c1=y.c1) GROUP BY x.c1
    Plan: t1 as x and t1 as y feed JOIN (x.c1=y.c1), which feeds AGG (calculate count(*) for every group of x.c1).
    Correlations: 1. Same input tables. 2. JOIN and AGG use the same key.
  • 10. Intra-query correlations
    Query: SELECT p.c1, q.c2, q.cnt FROM (SELECT x.c1 AS c1 FROM t1 x JOIN t2 y ON (x.c1=y.c1)) p JOIN (SELECT z.c1 AS c1, count(*) AS cnt FROM t1 z GROUP BY z.c1) q ON (p.c1=q.c1)
    Plan: t1 as x and t2 as y feed JOIN1 (x.c1=y.c1); t1 as z feeds AGG1 (calculate count(*) for every group of z.c1); JOIN1 and AGG1 feed JOIN2 (p.c1=q.c1).
    Correlations: 1. Same input table (t1). 2. JOIN1 and AGG1 use the same key. 3. JOIN2 and all of its parents use the same key.
  • 11. Intra-query correlations
    Defined in “YSmart: Yet Another SQL-to-MapReduce Translator”:
      – http://ysmart.cse.ohio-state.edu/
      – http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-7.pdf
    The definitions target operators that need to shuffle data, and the inputs of those operators.
    Three kinds of correlations:
      – Input correlation (IC): independent operators share the same input tables. Example: JOIN1 (over t1 as x and t2 as y) and AGG1 (over t1 as z) both read t1.
      – Transit correlation (TC): independent operators have input correlation and also shuffle the data in the same way (e.g. using the same keys). Example: JOIN1 on x.c1=y.c1 and AGG1 with group by z.c1.
      – Job flow correlation (JFC): two dependent operators shuffle the data in the same way. Example: dependent operators JOIN (x.c1=y.c1) and AGG (group by z.c1).
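
The three definitions on slide 11 can be written as small predicates. The sketch below is a simplification under stated assumptions: `ShuffleOp` and the three functions are hypothetical names, shuffle keys are compared as plain column names (so x.c1 and z.c1 both become `c1`), and "independent" is checked only against direct inputs.

```python
from dataclasses import dataclass
from typing import FrozenSet, Tuple


@dataclass(frozen=True)
class ShuffleOp:
    """An operator that needs a shuffle (hypothetical model, not a Hive class)."""
    name: str
    tables: FrozenSet[str]         # base tables this operator reads directly
    key: str                       # shuffle key, normalized to a column name
    inputs: Tuple[str, ...] = ()   # names of shuffle operators it consumes


def independent(a: ShuffleOp, b: ShuffleOp) -> bool:
    # Simplification: only direct producer/consumer links are checked.
    return a.name not in b.inputs and b.name not in a.inputs


def input_correlation(a: ShuffleOp, b: ShuffleOp) -> bool:
    """IC: independent operators share at least one input table."""
    return independent(a, b) and bool(a.tables & b.tables)


def transit_correlation(a: ShuffleOp, b: ShuffleOp) -> bool:
    """TC: input correlation plus the same shuffle key."""
    return input_correlation(a, b) and a.key == b.key


def job_flow_correlation(producer: ShuffleOp, consumer: ShuffleOp) -> bool:
    """JFC: the consumer depends on the producer and shuffles on the same key."""
    return producer.name in consumer.inputs and producer.key == consumer.key


# The plan from slide 10: JOIN1 and AGG1 feed JOIN2, all keyed on c1.
join1 = ShuffleOp("JOIN1", frozenset({"t1", "t2"}), "c1")
agg1 = ShuffleOp("AGG1", frozenset({"t1"}), "c1")
join2 = ShuffleOp("JOIN2", frozenset(), "c1", inputs=("JOIN1", "AGG1"))

print(input_correlation(join1, agg1))     # both read t1
print(transit_correlation(join1, agg1))   # and both shuffle on c1
print(job_flow_correlation(join1, join2))
print(job_flow_correlation(agg1, join2))
```

All four calls return `True` for this plan, matching the three correlations listed on slide 10.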
  • 12. Correlation-unaware query planning
    Hive does not check: 1. whether a table is used multiple times; 2. whether the data really needs to be shuffled.
    Example: t1 is loaded twice and shuffled on c1 for JOIN (Job 1); the result is written to tmp, which is shuffled on c1 again for AGG (Job 2).
    Drawbacks: 1. Unnecessary data loading. 2. Unnecessary data shuffling. 3. Unnecessary data materialization.
  • 13. Outline: query planning in Hive; correlations in a query (intra-query correlations); case studies; automatically exploiting correlations (HIVE-2206: Correlation Optimizer).
  • 14. Case studies: TPC-H Q17 (Flattened)
    SELECT sum(l_extendedprice) / 7.0 AS avg_yearly
    FROM (SELECT l_partkey, l_quantity, l_extendedprice
          FROM lineitem JOIN part ON (p_partkey=l_partkey)
          WHERE p_brand='Brand#35' AND p_container = 'MED PKG') touter
    JOIN (SELECT l_partkey AS lp, 0.2 * avg(l_quantity) AS lq
          FROM lineitem GROUP BY l_partkey) tinner
    ON (touter.l_partkey = tinner.lp)
    WHERE touter.l_quantity < tinner.lq
  • 15. Case studies: TPC-H Q17 (Flattened)
    Plan: lineitem and part feed JOIN1; lineitem feeds AGG1; JOIN1 and AGG1 feed JOIN2; JOIN2 feeds AGG2.
    lineitem is used by both JOIN1 and AGG1. JOIN1, AGG1, and JOIN2 share the same key.
  • 16. Case studies: TPC-H Q17 (Flattened)
    Without Correlation Optimizer: four MapReduce jobs (Job 1 to Job 4) for JOIN1, AGG1, JOIN2, and AGG2.
  • 17. Case studies: TPC-H Q17 (Flattened)
    Without Correlation Optimizer: four MapReduce jobs (Job 1 to Job 4), with lineitem loaded twice.
    With Correlation Optimizer: two MapReduce jobs (Job 1 and Job 2), with lineitem appearing only once in the plan.
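
The drop from four jobs to two for Q17 can be reproduced with a toy grouping rule: an operator joins the job of any shuffle input that uses the same key, and otherwise starts a new job. This is a hedged sketch, not the algorithm implemented in HIVE-2206; `Node`, `group_jobs`, and the empty string used as the key of the global aggregation AGG2 are assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """Simplified plan node (hypothetical). Nodes without inputs are table scans."""
    name: str
    key: Optional[str] = None
    inputs: List["Node"] = field(default_factory=list)


def group_jobs(root: Node, correlation_aware: bool) -> List[List[str]]:
    """Group shuffle operators into MapReduce jobs, returned in creation order."""
    jobs: List[List[str]] = []

    def visit(op: Node) -> Optional[List[str]]:
        if not op.inputs:  # table scan: no shuffle, no job of its own
            return None
        input_jobs = [(inp, visit(inp)) for inp in op.inputs]
        job: Optional[List[str]] = None
        if correlation_aware:
            for inp, inp_job in input_jobs:
                if inp_job is None or inp.key != op.key:
                    continue
                if job is None:
                    job = inp_job              # reuse the input's job
                elif inp_job is not job:
                    job.extend(inp_job)        # merge sibling jobs on the same key
                    jobs[:] = [j for j in jobs if j is not inp_job]
        if job is None:
            job = []
            jobs.append(job)
        job.append(op.name)
        return job

    visit(root)
    return jobs


# Flattened TPC-H Q17: JOIN1, AGG1, and JOIN2 all use l_partkey; AGG2 is a global sum.
join1 = Node("JOIN1", "l_partkey", [Node("lineitem"), Node("part")])
agg1 = Node("AGG1", "l_partkey", [Node("lineitem")])
join2 = Node("JOIN2", "l_partkey", [join1, agg1])
agg2 = Node("AGG2", "", [join2])

print(group_jobs(agg2, correlation_aware=False))
print(group_jobs(agg2, correlation_aware=True))
```

The first call yields one job per operator; the second places JOIN1, AGG1, and JOIN2 in one job and AGG2 in another. The sketch only counts jobs; sharing the single scan of lineitem between JOIN1 and AGG1 is a separate step that it does not model.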
  • 18. Case studies: TPC-DS Q95 (Flattened)
    SELECT count(distinct ws1.ws_order_number) AS order_count,
           sum(ws1.ws_ext_ship_cost) AS total_shipping_cost,
           sum(ws1.ws_net_profit) AS total_net_profit
    FROM web_sales ws1
    JOIN customer_address ca ON (ws1.ws_ship_addr_sk = ca.ca_address_sk)
    JOIN web_site s ON (ws1.ws_web_site_sk = s.web_site_sk)
    JOIN date_dim d ON (ws1.ws_ship_date_sk = d.d_date_sk)
    LEFT SEMI JOIN (SELECT ws2.ws_order_number AS ws_order_number
                    FROM web_sales ws2 JOIN web_sales ws3
                    ON (ws2.ws_order_number = ws3.ws_order_number)
                    WHERE ws2.ws_warehouse_sk <> ws3.ws_warehouse_sk) ws_wh1
    ON (ws1.ws_order_number = ws_wh1.ws_order_number)
    LEFT SEMI JOIN (SELECT wr_order_number
                    FROM web_returns wr
                    JOIN (SELECT ws4.ws_order_number AS ws_order_number
                          FROM web_sales ws4 JOIN web_sales ws5
                          ON (ws4.ws_order_number = ws5.ws_order_number)
                          WHERE ws4.ws_warehouse_sk <> ws5.ws_warehouse_sk) ws_wh2
                    ON (wr.wr_order_number = ws_wh2.ws_order_number)) tmp1
    ON (ws1.ws_order_number = tmp1.wr_order_number)
    WHERE d.d_date >= '2001-05-01' AND d.d_date <= '2001-06-30'
      AND ca.ca_state = 'NC' AND s.web_company_name = 'pri'
  • 19. Case studies: TPC-DS Q95 (Flattened)
    Plan diagram: web_sales is combined with customer_address, web_site, and date_dim by Map Join. The result goes through Semi Join with two branches: JOIN1 over web_sales and web_sales, and JOIN2 over web_returns and a second JOIN1 over web_sales and web_sales. AGG runs last.
  • 20. Case studies: TPC-DS Q95 (Flattened)
    Plan diagram: same operators as the previous slide, annotated with Job 1 to Job 6.
    Without Correlation Optimizer:
      – 6 MapReduce jobs
      – Unnecessary data loading (black web_sales nodes)
      – Unnecessary data shuffling
  • 21. Case studies: TPC-DS Q95 (Flattened)
    Plan diagram: a single web_sales node now feeds both JOIN1 operators; web_returns and one JOIN1 feed JOIN2.
    With Correlation Optimizer:
      – Black web_sales nodes share the same data loading
  • 22. Case studies: TPC-DS Q95 (Flattened)
    Plan diagram: same plan as the previous slide, annotated with Job 1 to Job 3.
    With Correlation Optimizer:
      – Black web_sales nodes share the same data loading
      – 3 MapReduce jobs
  • 23. Case studies: TPC-DS Q95 (Flattened)
    Plan diagram: only one JOIN1 remains; it is fed by the shared web_sales node.
    Follow-up work:
      – Evaluate JOIN1 only once without materializing a temporary table
  • 24. Case studies: TPC-DS Q95 (Flattened)
    Plan diagram: same plan as the previous slide, annotated with Job 1 and Job 2.
    Follow-up work:
      – Evaluate JOIN1 only once without materializing a temporary table
      – Use only 2 MapReduce jobs
  • 25. Outline: query planning in Hive; correlations in a query (intra-query correlations); case studies; automatically exploiting correlations (HIVE-2206: Correlation Optimizer).
  • 26. Objectives
    Eliminate unnecessary data loading:
      – The query planner will be aware of what data will be loaded
      – Do as much work as possible on data that is already loaded
    Eliminate unnecessary data shuffling:
      – The query planner will be aware of when data really needs to be shuffled
      – Do as much work as possible before shuffling the data again
  • 27. ReduceSink Deduplication (HIVE-2340)
    Handles chained job flow correlations, e.g. generating a single job for both Group By and Order By.
    Cannot handle complex patterns, e.g. patterns that involve multiple joins.
    A fundamental solution is needed, and it also needs to exploit shared input tables.
    Diagram: t1 → RS1 → AGG1 → RS2 → … becomes t1 → RS1 → AGG1 → …
  • 28. Correlation Optimizer
    2-phase optimizer:
      – Phase 1: Correlation detection
      – Phase 2: Query plan tree transformation
    This work is not just about the optimizer:
      – New operators to support the execution of an optimized plan
      – A mechanism to coordinate the operator tree inside the Reduce phase
  • 29. Correlation detection
    Query: SELECT p.c1, q.c2, q.cnt FROM (SELECT x.c1 AS c1 FROM t1 x JOIN t2 y ON (x.c1=y.c1)) p JOIN (SELECT z.c1 AS c1, count(*) AS cnt FROM t1 z GROUP BY z.c1) q ON (p.c1=q.c1)
    1. Traverse the tree all the way down to find matching keys in ReduceSinkOperators.
    2. Then check input tables to find shared data loading opportunities.
    Plan: t1 as x and t2 as y feed JOIN1; t1 as z feeds AGG1; JOIN1 and AGG1 feed JOIN2. Five ReduceSinkOperators (RS1 to RS5) carry the keys p.c1, q.c1, x.c1, y.c1, and z.c1.
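
The two detection steps on slide 29 can be approximated with a union-find over key columns. This is a simplified sketch, not the traversal Hive performs: the function names, the `rs_*` labels (the slide does not say which of RS1 to RS5 carries which key), and the hand-written list of column equalities are all assumptions.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def find(parent: Dict[str, str], col: str) -> str:
    """Return the representative of col's equivalence class (with path halving)."""
    parent.setdefault(col, col)
    while parent[col] != col:
        parent[col] = parent[parent[col]]
        col = parent[col]
    return col


def union(parent: Dict[str, str], a: str, b: str) -> None:
    parent[find(parent, a)] = find(parent, b)


def detect(rs_keys: Dict[str, str],
           equalities: List[Tuple[str, str]],
           scans: Dict[str, str]):
    """Step 1: group ReduceSinks whose keys are equivalent.
    Step 2: report tables that are scanned under more than one alias."""
    parent: Dict[str, str] = {}
    for a, b in equalities:
        union(parent, a, b)

    groups = defaultdict(list)
    for rs, key in rs_keys.items():
        groups[find(parent, key)].append(rs)
    correlated = [sorted(g) for g in groups.values() if len(g) > 1]

    by_table = defaultdict(list)
    for alias, table in scans.items():
        by_table[table].append(alias)
    shared = {t: sorted(a) for t, a in by_table.items() if len(a) > 1}
    return correlated, shared


# Hypothetical labels for the five ReduceSinks of the example query.
rs_keys = {"rs_p": "p.c1", "rs_q": "q.c1", "rs_x": "x.c1", "rs_y": "y.c1", "rs_z": "z.c1"}
equalities = [
    ("x.c1", "y.c1"),  # JOIN1 condition
    ("p.c1", "q.c1"),  # JOIN2 condition
    ("p.c1", "x.c1"),  # p.c1 is x.c1 AS c1
    ("q.c1", "z.c1"),  # q.c1 is z.c1 AS c1
]
scans = {"x": "t1", "y": "t2", "z": "t1"}

correlated, shared = detect(rs_keys, equalities, scans)
print(correlated)
print(shared)
```

For this query all five ReduceSinks land in one group, and t1 is reported as scanned under both x and z, which is the input the plan transformation needs.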
  • 30. Query plan tree transformation
    Query: SELECT p.c1, q.c2, q.cnt FROM (SELECT x.c1 AS c1 FROM t1 x JOIN t2 y ON (x.c1=y.c1)) p JOIN (SELECT z.c1 AS c1, count(*) AS cnt FROM t1 z GROUP BY z.c1) q ON (p.c1=q.c1)
    Before: t1 as x, t2 as y, and t1 as z are loaded separately; five ReduceSinkOperators (RS1 to RS5, with keys p.c1, q.c1, x.c1, y.c1, and z.c1) feed JOIN1, AGG1, and JOIN2.
    After: t1 is loaded once as both x and z, t2 is loaded as y, and only RS1, RS2, and RS3 remain in front of JOIN1, AGG1, and JOIN2.
  • 31. Thanks