SlideShare a Scribd company logo
1 of 33
How to understand and analyze
Apache Hive query execution plan
for performance debugging
© Hortonworks Inc. 2011 – 2015. All Rights Reserved
Pengcheng Xiong and Ashutosh Chauhan
Hortonworks Inc., Apache Hive
Community
{pxiong,ashutosh}@hortonworks.com
Page2 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
Goals:
• WHY old style Hive explain plan is hard to read
• Compare the old style explain with Postgres over a body of 500+ realistic SQL queries.
• WHAT is new style Hive explain plan
• Show orchestration of the tasks and operator trees, join sequences and algorithms, operator
execution costs
• HOW to performance debug a real query by analyzing the new Hive
explain plan
• Identify the potential improvement by changing join sequence, join algorithm and etc
• Show the real improvement by running the query in real cluster
• Integration/interaction with other system/tools
• Future work
Page3 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
WHY old style Hive explain plan is hard to read
• M**** company’s schema and queries.
• Comparison of explain plans between Hive and Postgres for the 528
queries they can both execute.
• Hive: “explain”, Postgres: “explain verbose”
Hive “old
style”, 233.5
postgres,
53.8
Hive “old
style”, 1289
postgres,
328.6
Average Lines Per Explain Plan Average Words Per Explain Plan
N = 528 queries
Page4 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
We can see that Hive old style explain is quite
verbose, is it necessary?
select a11.PBTNAME PBTNAME
from PMT_INVENTORY a11
join LU_MONTH a12
on (a11.QUARTER_ID = a12.QUARTER_ID)
where a12.MONTH_ID in (200607, 200606)
group by a11.PBTNAME;
Page5 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
High level plan comparison: Postgres
QUERY PLAN
---------------------------------------------------------------------------------
Group (cost=3.83..3.84 rows=2 width=18)
Output: a11.pbtname
Group Key: a11.pbtname
-> Sort (cost=3.83..3.84 rows=2 width=18)
Output: a11.pbtname
Sort Key: a11.pbtname
-> Hash Join (cost=2.62..3.83 rows=2 width=18)
Output: a11.pbtname
Hash Cond: (a11.quarter_id = a12.quarter_id)
-> Seq Scan on public.pmt_inventory a11 (cost=0.00..1.12 rows=12 width=22)
Output: a11.quarter_id, a11.pbtname
-> Hash (cost=2.60..2.60 rows=2 width=4)
Output: a12.quarter_id
-> Seq Scan on public.lu_month a12 (cost=0.00..2.60 rows=2 width=4)
Output: a12.quarter_id
Filter: (a12.month_id = ANY ('{200607,200606}'::integer[]))
16 lines
Page6 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
High level plan comparison: Hive
STAGE DEPENDENCIES:
Stage-1 is a root stage
Stage-0 depends on stages: Stage-1
STAGE PLANS:
Stage: Stage-1
Tez
Edges:
Map 2 <- Map 1 (BROADCAST_EDGE)
Reducer 3 <- Map 2 (SIMPLE_EDGE)
DagName: carter_20151114133018_2f2f0101-d14d-4688-bbb1-db67055016c3:946
Vertices:
Map 1
Map Operator Tree:
TableScan
alias: a11
filterExpr: quarter_id is not null (type: boolean)
Statistics: Num rows: 12 Data size: 1260 Basic stats: COMPLETE Column stats: NONE
Filter Operator
predicate: quarter_id is not null (type: boolean)
Statistics: Num rows: 6 Data size: 630 Basic stats: COMPLETE Column stats: NONE
Reduce Output Operator
key expressions: quarter_id (type: int)
sort order: +
Map-reduce partition columns: quarter_id (type: int)
Statistics: Num rows: 6 Data size: 630 Basic stats: COMPLETE Column stats: NONE
value expressions: pbtname (type: string)
Execution mode: vectorized
Map 2
Map Operator Tree:
TableScan
alias: a12
filterExpr: (quarter_id is not null and (month_id) IN (200607, 200606)) (type: boolean)
Statistics: Num rows: 48 Data size: 46752 Basic stats: COMPLETE Column stats: NONE
Filter Operator
predicate: (quarter_id is not null and (month_id) IN (200607, 200606)) (type: boolean)
Statistics: Num rows: 12 Data size: 11688 Basic stats: COMPLETE Column stats: NONE
Map Join Operator
condition map:
Inner Join 0 to 1
keys:
0 quarter_id (type: int)
1 quarter_id (type: int)
outputColumnNames: _col1
input vertices:
0 Map 1
Statistics: Num rows: 13 Data size: 12856 Basic stats: COMPLETE Column stats: NONE
HybridGraceHashJoin: true
Group By Operator
keys: _col1 (type: string)
mode: hash
outputColumnNames: _col0
Statistics: Num rows: 13 Data size: 12856 Basic stats: COMPLETE Column stats: NONE
Reduce Output Operator
key expressions: _col0 (type: string)
sort order: +
Map-reduce partition columns: _col0 (type: string)
Statistics: Num rows: 13 Data size: 12856 Basic stats: COMPLETE Column stats: NONE
Execution mode: vectorized
Reducer 3
Reduce Operator Tree:
Group By Operator
keys: KEY._col0 (type: string)
mode: mergepartial
outputColumnNames: _col0
Statistics: Num rows: 6 Data size: 5933 Basic stats: COMPLETE Column stats: NONE
File Output Operator
compressed: false
Statistics: Num rows: 6 Data size: 5933 Basic stats: COMPLETE Column stats: NONE
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
Stage: Stage-0
Fetch Operator
limit: -1
Processor Tree:
ListSink
Too verbose, need a magnifier!
80+ lines
Page7 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
Map 2 Operator
Map 2
Map Operator Tree:
TableScan
alias: a12
filterExpr: (quarter_id is not null and (month_id) IN (200607, 200606)) (type: boolean)
Statistics: Num rows: 48 Data size: 46752 Basic stats: COMPLETE Column stats: NONE
Filter Operator
predicate: (quarter_id is not null and (month_id) IN (200607, 200606)) (type: boolean)
Statistics: Num rows: 12 Data size: 11688 Basic stats: COMPLETE Column stats: NONE
Map Join Operator
condition map:
Inner Join 0 to 1
keys:
0 quarter_id (type: int)
1 quarter_id (type: int)
outputColumnNames: _col1
input vertices:
0 Map 1
Statistics: Num rows: 13 Data size: 12856 Basic stats: COMPLETE Column stats: NONE
HybridGraceHashJoin: true
Group By Operator
keys: _col1 (type: string)
mode: hash
outputColumnNames: _col0
Statistics: Num rows: 13 Data size: 12856 Basic stats: COMPLETE Column stats: NONE
Reduce Output Operator
key expressions: _col0 (type: string)
sort order: +
Map-reduce partition columns: _col0 (type: string)
Statistics: Num rows: 13 Data size: 12856 Basic stats: COMPLETE Column stats: NONE
Execution mode: vectorized
Data flows from top to bottom
Each operator has 0 or 1
child
Page8 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
Map 2 Operator
Map 2
Map Operator Tree:
TableScan
alias: a12
filterExpr: (quarter_id is not null and (month_id) IN (200607, 200606)) (type: boolean)
Statistics: Num rows: 48 Data size: 46752 Basic stats: COMPLETE Column stats: NONE
Filter Operator
predicate: (quarter_id is not null and (month_id) IN (200607, 200606)) (type: boolean)
Statistics: Num rows: 12 Data size: 11688 Basic stats: COMPLETE Column stats: NONE
Map Join Operator
condition map:
Inner Join 0 to 1
keys:
0 quarter_id (type: int)
1 quarter_id (type: int)
outputColumnNames: _col1
input vertices:
0 Map 1
Statistics: Num rows: 13 Data size: 12856 Basic stats: COMPLETE Column stats: NONE
HybridGraceHashJoin: true
Group By Operator
keys: _col1 (type: string)
mode: hash
outputColumnNames: _col0
Statistics: Num rows: 13 Data size: 12856 Basic stats: COMPLETE Column stats: NONE
Reduce Output Operator
key expressions: _col0 (type: string)
sort order: +
Map-reduce partition columns: _col0 (type: string)
Statistics: Num rows: 13 Data size: 12856 Basic stats: COMPLETE Column stats: NONE
Execution mode: vectorized
Must scroll to another part
of the plan to see what this is
Page9 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
Map 1 Operator
Map 1
Map Operator Tree:
TableScan
alias: a11
filterExpr: quarter_id is not null (type: boolean)
Statistics: Num rows: 12 Data size: 1260 Basic stats: COMPLETE Column stats: NONE
Filter Operator
predicate: quarter_id is not null (type: boolean)
Statistics: Num rows: 6 Data size: 630 Basic stats: COMPLETE Column stats: NONE
Reduce Output Operator
key expressions: quarter_id (type: int)
sort order: +
Map-reduce partition columns: quarter_id (type: int)
Statistics: Num rows: 6 Data size: 630 Basic stats: COMPLETE Column stats: NONE
value expressions: pbtname (type: string)
Execution mode: vectorized
Actual table name (PMT_INVENTORY)
not mentioned anywhere, only
the alias
Page10 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
Map 1 Operator
Map 1
Map Operator Tree:
TableScan
alias: a11
filterExpr: quarter_id is not null (type: boolean)
Statistics: Num rows: 12 Data size: 1260 Basic stats: COMPLETE Column stats: NONE
Filter Operator
predicate: quarter_id is not null (type: boolean)
Statistics: Num rows: 6 Data size: 630 Basic stats: COMPLETE Column stats: NONE
Reduce Output Operator
key expressions: quarter_id (type: int)
sort order: +
Map-reduce partition columns: quarter_id (type: int)
Statistics: Num rows: 6 Data size: 630 Basic stats: COMPLETE Column stats: NONE
value expressions: pbtname (type: string)
Execution mode: vectorized
How much of this information is really
necessary to SQL users?
Page11 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
Back to Postgres
QUERY PLAN
---------------------------------------------------------------------------------
Group (cost=3.83..3.84 rows=2 width=18)
Output: a11.pbtname
Group Key: a11.pbtname
-> Sort (cost=3.83..3.84 rows=2 width=18)
Output: a11.pbtname
Sort Key: a11.pbtname
-> Hash Join (cost=2.62..3.83 rows=2 width=18)
Output: a11.pbtname
Hash Cond: (a11.quarter_id = a12.quarter_id)
-> Seq Scan on public.pmt_inventory a11 (cost=0.00..1.12 rows=12 width=22)
Output: a11.quarter_id, a11.pbtname
-> Hash (cost=2.60..2.60 rows=2 width=4)
Output: a12.quarter_id
-> Seq Scan on public.lu_month a12 (cost=0.00..2.60 rows=2 width=4)
Output: a12.quarter_id
Filter: (a12.month_id = ANY ('{200607,200606}'::integer[]))
Data flows bottom to top
Page12 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
Back to Postgres
QUERY PLAN
---------------------------------------------------------------------------------
Group (cost=3.83..3.84 rows=2 width=18)
Output: a11.pbtname
Group Key: a11.pbtname
-> Sort (cost=3.83..3.84 rows=2 width=18)
Output: a11.pbtname
Sort Key: a11.pbtname
-> Hash Join (cost=2.62..3.83 rows=2 width=18)
Output: a11.pbtname
Hash Cond: (a11.quarter_id = a12.quarter_id)
-> Seq Scan on public.pmt_inventory a11 (cost=0.00..1.12 rows=12 width=22)
Output: a11.quarter_id, a11.pbtname
-> Hash (cost=2.60..2.60 rows=2 width=4)
Output: a12.quarter_id
-> Seq Scan on public.lu_month a12 (cost=0.00..2.60 rows=2 width=4)
Output: a12.quarter_id
Filter: (a12.month_id = ANY ('{200607,200606}'::integer[]))
Operators have multiple
children when it makes
sense
Page13 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
Back to Postgres
QUERY PLAN
---------------------------------------------------------------------------------
Group (cost=3.83..3.84 rows=2 width=18)
Output: a11.pbtname
Group Key: a11.pbtname
-> Sort (cost=3.83..3.84 rows=2 width=18)
Output: a11.pbtname
Sort Key: a11.pbtname
-> Hash Join (cost=2.62..3.83 rows=2 width=18)
Output: a11.pbtname
Hash Cond: (a11.quarter_id = a12.quarter_id)
-> Seq Scan on public.pmt_inventory a11 (cost=0.00..1.12 rows=12 width=22)
Output: a11.quarter_id, a11.pbtname
-> Hash (cost=2.60..2.60 rows=2 width=4)
Output: a12.quarter_id
-> Seq Scan on public.lu_month a12 (cost=0.00..2.60 rows=2 width=4)
Output: a12.quarter_id
Filter: (a12.month_id = ANY ('{200607,200606}'::integer[]))
Join is done using a scan of pmt_inventory and a hash
following a scan of lu_month.
All this info is available without referring to a stage plan.
IOW you don’t have to jump around in the plan.
Page14 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
Back to Postgres
QUERY PLAN
---------------------------------------------------------------------------------
Group (cost=3.83..3.84 rows=2 width=18)
Output: a11.pbtname
Group Key: a11.pbtname
-> Sort (cost=3.83..3.84 rows=2 width=18)
Output: a11.pbtname
Sort Key: a11.pbtname
-> Hash Join (cost=2.62..3.83 rows=2 width=18)
Output: a11.pbtname
Hash Cond: (a11.quarter_id = a12.quarter_id)
-> Seq Scan on public.pmt_inventory a11 (cost=0.00..1.12 rows=12 width=22)
Output: a11.quarter_id, a11.pbtname
-> Hash (cost=2.60..2.60 rows=2 width=4)
Output: a12.quarter_id
-> Seq Scan on public.lu_month a12 (cost=0.00..2.60 rows=2 width=4)
Output: a12.quarter_id
Filter: (a12.month_id = ANY ('{200607,200606}'::integer[]))
Actual schema / table names
visible
Page15 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
Back to Postgres
QUERY PLAN
---------------------------------------------------------------------------------
Group (cost=3.83..3.84 rows=2 width=18)
Output: a11.pbtname
Group Key: a11.pbtname
-> Sort (cost=3.83..3.84 rows=2 width=18)
Output: a11.pbtname
Sort Key: a11.pbtname
-> Hash Join (cost=2.62..3.83 rows=2 width=18)
Output: a11.pbtname
Hash Cond: (a11.quarter_id = a12.quarter_id)
-> Seq Scan on public.pmt_inventory a11 (cost=0.00..1.12 rows=12 width=22)
Output: a11.quarter_id, a11.pbtname
-> Hash (cost=2.60..2.60 rows=2 width=4)
Output: a12.quarter_id
-> Seq Scan on public.lu_month a12 (cost=0.00..2.60 rows=2 width=4)
Output: a12.quarter_id
Filter: (a12.month_id = ANY ('{200607,200606}'::integer[]))
Cost mentioned once per operator
Cost monotonically increases
as you go up
Page16 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
WHAT is new style Hive explain plan (HIVE-9780)
• Set hive.explain.user=true; (by default). Use Tez, LLAP, etc
Stage-1
Reducer 3
File Output Operator [FS_14]
Group By Operator [GBY_12] (rows=8 width=101)
Output:["_col0"],keys:KEY._col0
<-Map 2 [SIMPLE_EDGE]
SHUFFLE [RS_11]
PartitionCols:_col0
Group By Operator [GBY_10] (rows=8 width=101)
Output:["_col0"],keys:_col1
Map Join Operator [MAPJOIN_19] (rows=33 width=101)
Conds:RS_6._col0=SEL_5._col1(Inner),HybridGraceHashJoin:true,Output:["_col1"]
<-Map 1 [BROADCAST_EDGE]
BROADCAST [RS_6]
PartitionCols:_col0
Select Operator [SEL_2] (rows=12 width=105)
Output:["_col0","_col1"]
Filter Operator [FIL_17] (rows=12 width=105)
predicate:quarter_id is not null
TableScan [TS_0] (rows=12 width=105)
m****@pmt_inventory,a11,Tbl:COMPLETE,Col:COMPLETE,Output:["quarter_id","pbtname"]
<-Select Operator [SEL_5] (rows=48 width=8)
Output:["_col1"]
Filter Operator [FIL_18] (rows=48 width=8)
predicate:((month_id) IN (200607, 200606) and quarter_id is not null)
TableScan [TS_3] (rows=48 width=8)
m****@lu_month,a12,Tbl:COMPLETE,Col:COMPLETE,Output:["month_id","quarter_id"]
Immediate Notes:
1. Much smaller
2. Can be read in order
Page17 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
WHAT is new style Hive explain plan (HIVE-9780)
Stage-1
Reducer 3
File Output Operator [FS_14]
Group By Operator [GBY_12] (rows=8 width=101)
Output:["_col0"],keys:KEY._col0
<-Map 2 [SIMPLE_EDGE]
SHUFFLE [RS_11]
PartitionCols:_col0
Group By Operator [GBY_10] (rows=8 width=101)
Output:["_col0"],keys:_col1
Map Join Operator [MAPJOIN_19] (rows=33 width=101)
Conds:RS_6._col0=SEL_5._col1(Inner),HybridGraceHashJoin:true,Output:["_col1"]
<-Map 1 [BROADCAST_EDGE]
BROADCAST [RS_6]
PartitionCols:_col0
Select Operator [SEL_2] (rows=12 width=105)
Output:["_col0","_col1"]
Filter Operator [FIL_17] (rows=12 width=105)
predicate:quarter_id is not null
TableScan [TS_0] (rows=12 width=105)
m****@pmt_inventory,a11,Tbl:COMPLETE,Col:COMPLETE,Output:["quarter_id","pbtname"]
<-Select Operator [SEL_5] (rows=48 width=8)
Output:["_col1"]
Filter Operator [FIL_18] (rows=48 width=8)
predicate:((month_id) IN (200607, 200606) and quarter_id is not null)
TableScan [TS_3] (rows=48 width=8)
m****@lu_month,a12,Tbl:COMPLETE,Col:COMPLETE,Output:["month_id","quarter_id"]
Data flows bottom to top
Page18 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
WHAT is new style Hive explain plan (HIVE-9780)
Stage-1
Reducer 3
File Output Operator [FS_14]
Group By Operator [GBY_12] (rows=8 width=101)
Output:["_col0"],keys:KEY._col0
<-Map 2 [SIMPLE_EDGE]
SHUFFLE [RS_11]
PartitionCols:_col0
Group By Operator [GBY_10] (rows=8 width=101)
Output:["_col0"],keys:_col1
Map Join Operator [MAPJOIN_19] (rows=33 width=101)
Conds:RS_6._col0=SEL_5._col1(Inner),HybridGraceHashJoin:true,Output:["_col1"]
<-Map 1 [BROADCAST_EDGE]
BROADCAST [RS_6]
PartitionCols:_col0
Select Operator [SEL_2] (rows=12 width=105)
Output:["_col0","_col1"]
Filter Operator [FIL_17] (rows=12 width=105)
predicate:quarter_id is not null
TableScan [TS_0] (rows=12 width=105)
m****@pmt_inventory,a11,Tbl:COMPLETE,Col:COMPLETE,Output:["quarter_id","pbtname"]
<-Select Operator [SEL_5] (rows=48 width=8)
Output:["_col1"]
Filter Operator [FIL_18] (rows=48 width=8)
predicate:((month_id) IN (200607, 200606) and quarter_id is not null)
TableScan [TS_3] (rows=48 width=8)
m****@lu_month,a12,Tbl:COMPLETE,Col:COMPLETE,Output:["month_id","quarter_id"]
Operators have multiple
children when it makes
sense
Page19 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
WHAT is new style Hive explain plan (HIVE-9780)
Stage-1
Reducer 3
File Output Operator [FS_14]
Group By Operator [GBY_12] (rows=8 width=101)
Output:["_col0"],keys:KEY._col0
<-Map 2 [SIMPLE_EDGE]
SHUFFLE [RS_11]
PartitionCols:_col0
Group By Operator [GBY_10] (rows=8 width=101)
Output:["_col0"],keys:_col1
Map Join Operator [MAPJOIN_19] (rows=33 width=101)
Conds:RS_6._col0=SEL_5._col1(Inner),HybridGraceHashJoin:true,Output:["_col1"]
<-Map 1 [BROADCAST_EDGE]
BROADCAST [RS_6]
PartitionCols:_col0
Select Operator [SEL_2] (rows=12 width=105)
Output:["_col0","_col1"]
Filter Operator [FIL_17] (rows=12 width=105)
predicate:quarter_id is not null
TableScan [TS_0] (rows=12 width=105)
m****@pmt_inventory,a11,Tbl:COMPLETE,Col:COMPLETE,Output:["quarter_id","pbtname"]
<-Select Operator [SEL_5] (rows=48 width=8)
Output:["_col1"]
Filter Operator [FIL_18] (rows=48 width=8)
predicate:((month_id) IN (200607, 200606) and quarter_id is not null)
TableScan [TS_3] (rows=48 width=8)
m****@lu_month,a12,Tbl:COMPLETE,Col:COMPLETE,Output:["month_id","quarter_id"]
Join’s information is clear
pmt_inventory is broadcasted to lu_month
and a MapJoin is done
Page20 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
WHAT is new style Hive explain plan (HIVE-9780)
Stage-1
Reducer 3
File Output Operator [FS_14]
Group By Operator [GBY_12] (rows=8 width=101)
Output:["_col0"],keys:KEY._col0
<-Map 2 [SIMPLE_EDGE]
SHUFFLE [RS_11]
PartitionCols:_col0
Group By Operator [GBY_10] (rows=8 width=101)
Output:["_col0"],keys:_col1
Map Join Operator [MAPJOIN_19] (rows=33 width=101)
Conds:RS_6._col0=SEL_5._col1(Inner),HybridGraceHashJoin:true,Output:["_col1"]
<-Map 1 [BROADCAST_EDGE]
BROADCAST [RS_6]
PartitionCols:_col0
Select Operator [SEL_2] (rows=12 width=105)
Output:["_col0","_col1"]
Filter Operator [FIL_17] (rows=12 width=105)
predicate:quarter_id is not null
TableScan [TS_0] (rows=12 width=105)
m****@pmt_inventory,a11,Tbl:COMPLETE,Col:COMPLETE,Output:["quarter_id","pbtname"]
<-Select Operator [SEL_5] (rows=48 width=8)
Output:["_col1"]
Filter Operator [FIL_18] (rows=48 width=8)
predicate:((month_id) IN (200607, 200606) and quarter_id is not null)
TableScan [TS_3] (rows=48 width=8)
m****@lu_month,a12,Tbl:COMPLETE,Col:COMPLETE,Output:["month_id","quarter_id"]
Cost mentioned once per operator
Page21 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
HOW to performance debug a real query (TPC-DS Q3, 1TB)
select
dt.d_year, item.i_brand_id brand_id, item.i_brand brand, sum(ss_ext_sales_price) sum_agg
from
date_dim dt, store_sales, item
where
dt.d_date_sk = store_sales.ss_sold_date_sk
and store_sales.ss_item_sk = item.i_item_sk
and item.i_manufact_id = 436
and dt.d_moy = 12
group by dt.d_year , item.i_brand , item.i_brand_id
order by dt.d_year , sum_agg desc , brand_id
limit 10
partitioned by ss_sold_date_sk
Page22 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
<-Reducer 3 [SIMPLE_EDGE]
SHUFFLE [RS_15]
PartitionCols:_col0, _col1, _col2
Group By Operator [GBY_14] (rows=1 width=116)
Output:["_col0","_col1","_col2","_col3"],aggregations:["sum(_col45)"],keys:_col6, _col65, _col64
Select Operator [SEL_13] (rows=76515 width=128)
Output:["_col6","_col65","_col64","_col45"]
Filter Operator [FIL_23] (rows=76515 width=128)
predicate:((_col0 = _col53) and (_col32 = _col57))
Merge Join Operator [MERGEJOIN_28] (rows=306061 width=128)
Conds:RS_8._col32=RS_34.i_item_sk(Inner),Output:["_col0","_col6","_col32","_col45","_col53","_col57","_col64","_col65"]
<-Map 7 [SIMPLE_EDGE] vectorized
SHUFFLE [RS_34]
PartitionCols:i_item_sk
Filter Operator [FIL_33] (rows=434 width=111)
predicate:(i_item_sk is not null and (i_manufact_id = 436))
TableScan [TS_2] (rows=300000 width=111)
tpcds_bin_partitioned_orc_1000@item,item,Tbl:COMPLETE,Col:COMPLETE,Output:["i_item_sk","i_brand_id","i_brand","i_manufact_id"]
<-Reducer 2 [SIMPLE_EDGE]
SHUFFLE [RS_8]
PartitionCols:_col32
Merge Join Operator [MERGEJOIN_27] (rows=211562452 width=20)
Conds:RS_30.d_date_sk=RS_32.ss_sold_date_sk(Inner),Output:["_col0","_col6","_col32","_col45","_col53"]
<-Map 1 [SIMPLE_EDGE] vectorized
SHUFFLE [RS_30]
PartitionCols:d_date_sk
Filter Operator [FIL_29] (rows=5619 width=12)
predicate:(d_date_sk is not null and (d_moy = 12))
TableScan [TS_0] (rows=73049 width=12)
tpcds_bin_partitioned_orc_1000@date_dim,dt,Tbl:COMPLETE,Col:COMPLETE,Output:["d_date_sk","d_year","d_moy"]
<-Map 6 [SIMPLE_EDGE] vectorized
SHUFFLE [RS_32]
PartitionCols:ss_sold_date_sk
Filter Operator [FIL_31] (rows=2750387156 width=11)
predicate:ss_item_sk is not null
TableScan [TS_1] (rows=2750387156 width=11)
tpcds_bin_partitioned_orc_1000@store_sales,store_sales,Tbl:COMPLETE,Col:COMPLETE,Output:["ss_item_sk","ss_ext_sales_price"]
Original plan runs 163.33s. Sounds
like column pruning and predicate
push down are working fine.
However, the join sequence
store_sales✖date_dim✖item is not
good enough. A better one is
store_sales✖item✖date_dim
Table Cardinality Cardinality after filter Selectivity
date_dim 73K 5619 7.6%
item 300K 434 0.14%
Page23 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
<-Reducer 3 [SIMPLE_EDGE]
SHUFFLE [RS_17]
PartitionCols:_col0, _col1, _col2
Group By Operator [GBY_16] (rows=9 width=116)
Output:["_col0","_col1","_col2","_col3"],aggregations:["sum(_col1)"],keys:_col8, _col4, _col5
Select Operator [SEL_15] (rows=306061 width=112)
Output:["_col8","_col4","_col5","_col1"]
Merge Join Operator [MERGEJOIN_29] (rows=306061 width=112)
Conds:RS_12._col2=RS_38._col0(Inner),Output:["_col1","_col4","_col5","_col8"]
<-Map 7 [SIMPLE_EDGE] vectorized
SHUFFLE [RS_38]
PartitionCols:_col0
Select Operator [SEL_37] (rows=5619 width=12)
Output:["_col0","_col1"]
Filter Operator [FIL_36] (rows=5619 width=12)
predicate:((d_moy = 12) and d_date_sk is not null)
TableScan [TS_6] (rows=73049 width=12)
tpcds_bin_partitioned_orc_1000@date_dim,dt,Tbl:COMPLETE,Col:COMPLETE,Output:["d_date_sk","d_year","d_moy"]
<-Reducer 2 [SIMPLE_EDGE]
SHUFFLE [RS_12]
PartitionCols:_col2
Merge Join Operator [MERGEJOIN_28] (rows=3978894 width=112)
Conds:RS_32._col0=RS_35._col0(Inner),Output:["_col1","_col2","_col4","_col5"]
<-Map 1 [SIMPLE_EDGE] vectorized
SHUFFLE [RS_32]
PartitionCols:_col0
Select Operator [SEL_31] (rows=2750387156 width=11)
Output:["_col0","_col1","_col2"]
Filter Operator [FIL_30] (rows=2750387156 width=11)
predicate:ss_item_sk is not null
TableScan [TS_0] (rows=2750387156 width=11)
tpcds_bin_partitioned_orc_1000@store_sales,store_sales,Tbl:COMPLETE,Col:COMPLETE,Output:["ss_item_sk","ss_ext_sales_price"]
<-Map 6 [SIMPLE_EDGE] vectorized
SHUFFLE [RS_35]
PartitionCols:_col0
Select Operator [SEL_34] (rows=434 width=111)
Output:["_col0","_col1","_col2"]
Filter Operator [FIL_33] (rows=434 width=111)
predicate:((i_manufact_id = 436) and i_item_sk is not null)
TableScan [TS_3] (rows=300000 width=111)
tpcds_bin_partitioned_orc_1000@item,item,Tbl:COMPLETE,Col:COMPLETE,Output:["i_item_sk","i_brand_id","i_brand","i_manufact_id"]
CBO on, new plan runs 143.97s
with new join sequence
store_sales✖item✖date_dim.
The input data size of one branch of
join is pretty small, should use map
join, rather than merge join.
Page24 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
SHUFFLE [RS_44]
PartitionCols:_col0, _col1, _col2
Group By Operator [GBY_43] (rows=9 width=116)
Output:["_col0","_col1","_col2","_col3"],aggregations:["sum(_col1)"],keys:_col8, _col4, _col5
Select Operator [SEL_42] (rows=306061 width=112)
Output:["_col8","_col4","_col5","_col1"]
Map Join Operator [MAPJOIN_41] (rows=306061 width=112)
Conds:MAPJOIN_40._col2=RS_37._col0(Inner),HybridGraceHashJoin:true,Output:["_col1","_col4","_col5","_col8"]
<-Map 5 [BROADCAST_EDGE] vectorized
BROADCAST [RS_37]
PartitionCols:_col0
Select Operator [SEL_36] (rows=5619 width=12)
Output:["_col0","_col1"]
Filter Operator [FIL_35] (rows=5619 width=12)
predicate:((d_moy = 12) and d_date_sk is not null)
TableScan [TS_6] (rows=73049 width=12)
tpcds_bin_partitioned_orc_1000@date_dim,dt,Tbl:COMPLETE,Col:COMPLETE,Output:["d_date_sk","d_year","d_moy"]
<-Map Join Operator [MAPJOIN_40] (rows=3978894 width=112)
Conds:SEL_39._col0=RS_34._col0(Inner),HybridGraceHashJoin:true,Output:["_col1","_col2","_col4","_col5"]
<-Map 4 [BROADCAST_EDGE] vectorized
BROADCAST [RS_34]
PartitionCols:_col0
Select Operator [SEL_33] (rows=434 width=111)
Output:["_col0","_col1","_col2"]
Filter Operator [FIL_32] (rows=434 width=111)
predicate:((i_manufact_id = 436) and i_item_sk is not null)
TableScan [TS_3] (rows=300000 width=111)
tpcds_bin_partitioned_orc_1000@item,item,Tbl:COMPLETE,Col:COMPLETE,Output:["i_item_sk","i_brand_id","i_brand","i_manufact_id"]
<-Select Operator [SEL_39] (rows=2750387156 width=11)
Output:["_col0","_col1","_col2"]
Filter Operator [FIL_38] (rows=2750387156 width=11)
predicate:ss_item_sk is not null
TableScan [TS_0] (rows=2750387156 width=11)
tpcds_bin_partitioned_orc_1000@store_sales,store_sales,Tbl:COMPLETE,Col:COMPLETE,Output:["ss_item_sk","ss_ext_sales_price"]
Increase
hive.auto.convert.join.noconditionaltask.size=
1,359,688,499, we can see it is now using map
join operators. New plan runs 45.84s.
store_sales is a partitioned table on
the join key ss_sold_date_sk with
date_dim table.
Page25 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
SHUFFLE [RS_55]
PartitionCols:_col0, _col1, _col2
Group By Operator [GBY_54] (rows=9 width=116)
Output:["_col0","_col1","_col2","_col3"],aggregations:["sum(_col1)"],keys:_col8, _col4, _col5
Select Operator [SEL_53] (rows=306061 width=112)
Output:["_col8","_col4","_col5","_col1"]
Map Join Operator [MAPJOIN_52] (rows=306061 width=112)
Conds:MAPJOIN_51._col2=RS_45._col0(Inner),HybridGraceHashJoin:true,Output:["_col1","_col4","_col5","_col8"]
<-Map 5 [BROADCAST_EDGE] vectorized
BROADCAST [RS_45]
PartitionCols:_col0
Select Operator [SEL_44] (rows=5619 width=12)
Output:["_col0","_col1"]
Filter Operator [FIL_43] (rows=5619 width=12)
predicate:((d_moy = 12) and d_date_sk is not null)
TableScan [TS_6] (rows=73049 width=12)
tpcds_bin_partitioned_orc_1000@date_dim,dt,Tbl:COMPLETE,Col:COMPLETE,Output:["d_date_sk","d_year","d_moy"]
Dynamic Partitioning Event Operator [EVENT_48] (rows=2809 width=12)
Group By Operator [GBY_47] (rows=2809 width=12)
Output:["_col0"],keys:_col0
Select Operator [SEL_46] (rows=5619 width=12)
Output:["_col0"]
Please refer to the previous Select Operator [SEL_44]
<-Map Join Operator [MAPJOIN_51] (rows=3978894 width=112)
Conds:SEL_50._col0=RS_42._col0(Inner),HybridGraceHashJoin:true,Output:["_col1","_col2","_col4","_col5"]
<-Map 4 [BROADCAST_EDGE] vectorized
BROADCAST [RS_42]
PartitionCols:_col0
Select Operator [SEL_41] (rows=434 width=111)
Output:["_col0","_col1","_col2"]
Filter Operator [FIL_40] (rows=434 width=111)
predicate:((i_manufact_id = 436) and i_item_sk is not null)
TableScan [TS_3] (rows=300000 width=111)
tpcds_bin_partitioned_orc_1000@item,item,Tbl:COMPLETE,Col:COMPLETE,Output:["i_item_sk","i_brand_id","i_brand","i_manufact_id"]
<-Select Operator [SEL_50] (rows=2750387156 width=11)
Output:["_col0","_col1","_col2"]
Filter Operator [FIL_49] (rows=2750387156 width=11)
predicate:ss_item_sk is not null
TableScan [TS_0] (rows=2750387156 width=11)
tpcds_bin_partitioned_orc_1000@store_sales,store_sales,Tbl:COMPLETE,Col:COMPLETE,Output:["ss_item_sk","ss_ext_sales_price"]
By setting
hive.tez.dynamic.partition.pruning=true,
we can see dynamic partitioning
event operators. See more about this
in HIVE-7826. New plan runs 31.35s.
In the run time, dynamic partition event
operator will send values needed to
prune to the application master - where
splits are generated and tasks are
submitted. Using these values we can
strip out any unneeded partitions
dynamically, while the query is running.
Page26 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
Performance debugging summary (TPC-DS Q3, 1TB)
0
20
40
60
80
100
120
140
160
180
Original Join re-order Join selection Dynamic partition pruning
Queryexecutiontime(s)
Query execution time
Continuous improvement
Page27 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
Integration with Apache Ambari
1. Type the query here
2. Click “explain”
3. explain plan will be shown
Page28 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
 “set hive.tez.exec.print.summary=true;”
 Get more insights on query performance
Can be used along with Tez vertex runtime stats
Runtime stats
Page29 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
Summary
• Show old style Hive explain plan is hard to read.
• Verbose with too much redundant information, hard to follow how data
flows, cost of operator is unclear
• Compare with Postgres over a body of 500+ realistic SQL queries and
identify the candidate improving points
• Introduce new style Hive explain plan
• Use a concrete example to help understand the explain: execution cost, join
sequence and orchestration of the operator tree
• Use the new Hive explain plan to performance debug TPC-DS Q3
• Show the improvement after join re-ordering, join selection, and dynamic
partition pruning
• Integration/interaction with other system/tools
Page30 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
Future work -- Some gaps remain after HIVE-9780
• Put the real schema, table and column names in the
explain plan, e.g., no more _col0 etc.
• This will help users to understand the plan better
• HIVE-8681: CBO: Column names are missing from join expression
in Map join with CBO enabled
• Get an equivalent of “EXPLAIN ANALYZE” – such as
operator level runtime stats and warnings.
• This will help users to find out the gap between estimated cost and
real cost
• HIVE-14362: Support explain analyze in Hive
Page31 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
SHUFFLE [RS_55]
PartitionCols:d_year, i_brand, i_brand_id
Group By Operator [GBY_54] (rows=9/10 width=116)
Output:["d_year","sum_agg","i_brand","i_brand_id"],aggregations:["sum(ss_ext_sales_price)"],keys:d_year, i_brand, i_brand_id
Select Operator [SEL_53] (rows=306061/324651 width=112)
Output:["d_year","i_brand","i_brand_id","ss_ext_sales_price"]
Map Join Operator [MAPJOIN_52] (rows=306061/324651 width=112)
Conds:MAPJOIN_51. ss_sold_date_sk=RS_45. d_date_sk(Inner),HybridGraceHashJoin:true,Output:["d_year","i_brand","i_brand_id","ss_ext_sales_price"]
<-Map 5 [BROADCAST_EDGE] vectorized
BROADCAST [RS_45]
PartitionCols:d_date_sk
Select Operator [SEL_44] (rows=5619/6034 width=12)
Output:["d_date_sk","d_year"]
Filter Operator [FIL_43] (rows=5619/6034 width=12)
predicate:((d_moy = 12) and d_date_sk is not null)
TableScan [TS_6] (rows=73049/73049 width=12)
tpcds_bin_partitioned_orc_1000@date_dim,dt,Tbl:COMPLETE,Col:COMPLETE,Output:["d_date_sk","d_year","d_moy"]
Dynamic Partitioning Event Operator [EVENT_48] (rows=2809 width=12)
Group By Operator [GBY_47] (rows=2809/2324 width=12)
Output:["d_date_sk"],keys:d_date_sk
Select Operator [SEL_46] (rows=5619/6034 width=12)
Output:["d_date_sk"]
Please refer to the previous Select Operator [SEL_44]
<-Map Join Operator [MAPJOIN_51] (rows=3978894/4202377 width=112)
Conds:SEL_50.ss_item_sk=RS_42. i_item_sk(Inner),HybridGraceHashJoin:true,Output:["ss_ext_sales_price",” ss_sold_date_sk","i_brand","i_brand_id"]
<-Map 4 [BROADCAST_EDGE] vectorized
BROADCAST [RS_42]
PartitionCols:i_item_sk
Select Operator [SEL_41] (rows=434/453 width=111)
Output:[” i_item_sk","i_brand_id","i_brand"]
Filter Operator [FIL_40] (rows=434/453 width=111)
predicate:((i_manufact_id = 436) and i_item_sk is not null)
TableScan [TS_3] (rows=300000/300000 width=111)
tpcds_bin_partitioned_orc_1000@item,item,Tbl:COMPLETE,Col:COMPLETE,Output:["i_item_sk","i_brand_id","i_brand","i_manufact_id"]
<-Select Operator [SEL_50] (rows=2750387156/2750387156 width=11)
Output:["ss_item_sk","ss_ext_sales_price","ss_sold_date_sk"]
Filter Operator [FIL_49] (rows=2750387156/2750387156 width=11)
predicate:ss_item_sk is not null
TableScan [TS_0] (rows=2750387156/2750387156 width=11)
tpcds_bin_partitioned_orc_1000@store_sales,store_sales,Tbl:COMPLETE,Col:COMPLETE,Output:["ss_item_sk","ss_ext_sales_price"]
Page32 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
Acknowledgement
• We thank all the anonymous reviewers’ votes to give us this
opportunity to share our work.
• Part of the slides are borrowed from or modified based on Carter
Shanklin and Rajesh Balamohan’s slides.
• We thank Gunther Hagleitner for all the support and inputs.
• We thank Sapin Amin for setting up the testing cluster.
Page33 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
Thank you! Questions?

More Related Content

What's hot

Performance Optimizations in Apache Impala
Performance Optimizations in Apache ImpalaPerformance Optimizations in Apache Impala
Performance Optimizations in Apache Impala
Cloudera, Inc.
 

What's hot (20)

Hive: Loading Data
Hive: Loading DataHive: Loading Data
Hive: Loading Data
 
Hive+Tez: A performance deep dive
Hive+Tez: A performance deep diveHive+Tez: A performance deep dive
Hive+Tez: A performance deep dive
 
Apache Tez – Present and Future
Apache Tez – Present and FutureApache Tez – Present and Future
Apache Tez – Present and Future
 
Productizing Structured Streaming Jobs
Productizing Structured Streaming JobsProductizing Structured Streaming Jobs
Productizing Structured Streaming Jobs
 
Hive Bucketing in Apache Spark with Tejas Patil
Hive Bucketing in Apache Spark with Tejas PatilHive Bucketing in Apache Spark with Tejas Patil
Hive Bucketing in Apache Spark with Tejas Patil
 
Introduction to Apache Calcite
Introduction to Apache CalciteIntroduction to Apache Calcite
Introduction to Apache Calcite
 
Performance Optimizations in Apache Impala
Performance Optimizations in Apache ImpalaPerformance Optimizations in Apache Impala
Performance Optimizations in Apache Impala
 
Understanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIsUnderstanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIs
 
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
 
Parquet performance tuning: the missing guide
Parquet performance tuning: the missing guideParquet performance tuning: the missing guide
Parquet performance tuning: the missing guide
 
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the CloudAmazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
 
Parquet Strata/Hadoop World, New York 2013
Parquet Strata/Hadoop World, New York 2013Parquet Strata/Hadoop World, New York 2013
Parquet Strata/Hadoop World, New York 2013
 
HBaseCon 2012 | Lessons learned from OpenTSDB - Benoit Sigoure, StumbleUpon
HBaseCon 2012 | Lessons learned from OpenTSDB - Benoit Sigoure, StumbleUponHBaseCon 2012 | Lessons learned from OpenTSDB - Benoit Sigoure, StumbleUpon
HBaseCon 2012 | Lessons learned from OpenTSDB - Benoit Sigoure, StumbleUpon
 
Choosing an HDFS data storage format- Avro vs. Parquet and more - StampedeCon...
Choosing an HDFS data storage format- Avro vs. Parquet and more - StampedeCon...Choosing an HDFS data storage format- Avro vs. Parquet and more - StampedeCon...
Choosing an HDFS data storage format- Avro vs. Parquet and more - StampedeCon...
 
HBase Storage Internals
HBase Storage InternalsHBase Storage Internals
HBase Storage Internals
 
Deep Dive into the New Features of Apache Spark 3.0
Deep Dive into the New Features of Apache Spark 3.0Deep Dive into the New Features of Apache Spark 3.0
Deep Dive into the New Features of Apache Spark 3.0
 
Deep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache SparkDeep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache Spark
 
Intro to HBase
Intro to HBaseIntro to HBase
Intro to HBase
 
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
 
Optimizing Hive Queries
Optimizing Hive QueriesOptimizing Hive Queries
Optimizing Hive Queries
 

Viewers also liked

Analytical Queries with Hive: SQL Windowing and Table Functions
Analytical Queries with Hive: SQL Windowing and Table FunctionsAnalytical Queries with Hive: SQL Windowing and Table Functions
Analytical Queries with Hive: SQL Windowing and Table Functions
DataWorks Summit
 

Viewers also liked (10)

Analytical Queries with Hive: SQL Windowing and Table Functions
Analytical Queries with Hive: SQL Windowing and Table FunctionsAnalytical Queries with Hive: SQL Windowing and Table Functions
Analytical Queries with Hive: SQL Windowing and Table Functions
 
Benchmarking Top NoSQL Databases: Apache Cassandra, Apache HBase and MongoDB
Benchmarking Top NoSQL Databases: Apache Cassandra, Apache HBase and MongoDBBenchmarking Top NoSQL Databases: Apache Cassandra, Apache HBase and MongoDB
Benchmarking Top NoSQL Databases: Apache Cassandra, Apache HBase and MongoDB
 
Hive tuning
Hive tuningHive tuning
Hive tuning
 
Data Storage Tips for Optimal Spark Performance-(Vida Ha, Databricks)
Data Storage Tips for Optimal Spark Performance-(Vida Ha, Databricks)Data Storage Tips for Optimal Spark Performance-(Vida Ha, Databricks)
Data Storage Tips for Optimal Spark Performance-(Vida Ha, Databricks)
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
 
SQL to Hive Cheat Sheet
SQL to Hive Cheat SheetSQL to Hive Cheat Sheet
SQL to Hive Cheat Sheet
 
Apache Spark 2.0: Faster, Easier, and Smarter
Apache Spark 2.0: Faster, Easier, and SmarterApache Spark 2.0: Faster, Easier, and Smarter
Apache Spark 2.0: Faster, Easier, and Smarter
 
Introduction to Spark Internals
Introduction to Spark InternalsIntroduction to Spark Internals
Introduction to Spark Internals
 
Apache Spark Architecture
Apache Spark ArchitectureApache Spark Architecture
Apache Spark Architecture
 
HBase Vs Cassandra Vs MongoDB - Choosing the right NoSQL database
HBase Vs Cassandra Vs MongoDB - Choosing the right NoSQL databaseHBase Vs Cassandra Vs MongoDB - Choosing the right NoSQL database
HBase Vs Cassandra Vs MongoDB - Choosing the right NoSQL database
 

Similar to How to understand and analyze Apache Hive query execution plan for performance debugging

Query Compilation in Impala
Query Compilation in ImpalaQuery Compilation in Impala
Query Compilation in Impala
Cloudera, Inc.
 
Myth busters - performance tuning 101 2007
Myth busters - performance tuning 101 2007Myth busters - performance tuning 101 2007
Myth busters - performance tuning 101 2007
paulguerin
 
Flink Forward San Francisco 2019: Build a Table-centric Apache Flink Ecosyste...
Flink Forward San Francisco 2019: Build a Table-centric Apache Flink Ecosyste...Flink Forward San Francisco 2019: Build a Table-centric Apache Flink Ecosyste...
Flink Forward San Francisco 2019: Build a Table-centric Apache Flink Ecosyste...
Flink Forward
 
Flink Forward San Francisco 2019: Build a Table-centric Apache Flink Ecosyste...
Flink Forward San Francisco 2019: Build a Table-centric Apache Flink Ecosyste...Flink Forward San Francisco 2019: Build a Table-centric Apache Flink Ecosyste...
Flink Forward San Francisco 2019: Build a Table-centric Apache Flink Ecosyste...
Flink Forward
 
Presentation_BigData_NenaMarin
Presentation_BigData_NenaMarinPresentation_BigData_NenaMarin
Presentation_BigData_NenaMarin
n5712036
 
Project Management System
Project Management SystemProject Management System
Project Management System
Divyen Patel
 

Similar to How to understand and analyze Apache Hive query execution plan for performance debugging (20)

PPT on Data Science Using Python
PPT on Data Science Using PythonPPT on Data Science Using Python
PPT on Data Science Using Python
 
Query Compilation in Impala
Query Compilation in ImpalaQuery Compilation in Impala
Query Compilation in Impala
 
Lecture 9.pptx
Lecture 9.pptxLecture 9.pptx
Lecture 9.pptx
 
Enhancing Spark SQL Optimizer with Reliable Statistics
Enhancing Spark SQL Optimizer with Reliable StatisticsEnhancing Spark SQL Optimizer with Reliable Statistics
Enhancing Spark SQL Optimizer with Reliable Statistics
 
interenship.pptx
interenship.pptxinterenship.pptx
interenship.pptx
 
Myth busters - performance tuning 101 2007
Myth busters - performance tuning 101 2007Myth busters - performance tuning 101 2007
Myth busters - performance tuning 101 2007
 
Cost-Based Optimizer in Apache Spark 2.2 Ron Hu, Sameer Agarwal, Wenchen Fan ...
Cost-Based Optimizer in Apache Spark 2.2 Ron Hu, Sameer Agarwal, Wenchen Fan ...Cost-Based Optimizer in Apache Spark 2.2 Ron Hu, Sameer Agarwal, Wenchen Fan ...
Cost-Based Optimizer in Apache Spark 2.2 Ron Hu, Sameer Agarwal, Wenchen Fan ...
 
Basic Analysis using Python
Basic Analysis using PythonBasic Analysis using Python
Basic Analysis using Python
 
R programming & Machine Learning
R programming & Machine LearningR programming & Machine Learning
R programming & Machine Learning
 
Cost-Based Optimizer in Apache Spark 2.2
Cost-Based Optimizer in Apache Spark 2.2 Cost-Based Optimizer in Apache Spark 2.2
Cost-Based Optimizer in Apache Spark 2.2
 
Flink Forward San Francisco 2019: Build a Table-centric Apache Flink Ecosyste...
Flink Forward San Francisco 2019: Build a Table-centric Apache Flink Ecosyste...Flink Forward San Francisco 2019: Build a Table-centric Apache Flink Ecosyste...
Flink Forward San Francisco 2019: Build a Table-centric Apache Flink Ecosyste...
 
Flink Forward San Francisco 2019: Build a Table-centric Apache Flink Ecosyste...
Flink Forward San Francisco 2019: Build a Table-centric Apache Flink Ecosyste...Flink Forward San Francisco 2019: Build a Table-centric Apache Flink Ecosyste...
Flink Forward San Francisco 2019: Build a Table-centric Apache Flink Ecosyste...
 
Presentation_BigData_NenaMarin
Presentation_BigData_NenaMarinPresentation_BigData_NenaMarin
Presentation_BigData_NenaMarin
 
AWS July Webinar Series: Amazon Redshift Optimizing Performance
AWS July Webinar Series: Amazon Redshift Optimizing PerformanceAWS July Webinar Series: Amazon Redshift Optimizing Performance
AWS July Webinar Series: Amazon Redshift Optimizing Performance
 
Chapter 5
Chapter 5Chapter 5
Chapter 5
 
Python-for-Data-Analysis.pptx
Python-for-Data-Analysis.pptxPython-for-Data-Analysis.pptx
Python-for-Data-Analysis.pptx
 
Tactical data engineering
Tactical data engineeringTactical data engineering
Tactical data engineering
 
Meetup Junio Data Analysis with python 2018
Meetup Junio Data Analysis with python 2018Meetup Junio Data Analysis with python 2018
Meetup Junio Data Analysis with python 2018
 
DATA REPRESENTATION
DATA  REPRESENTATIONDATA  REPRESENTATION
DATA REPRESENTATION
 
Project Management System
Project Management SystemProject Management System
Project Management System
 

More from DataWorks Summit/Hadoop Summit

How Hadoop Makes the Natixis Pack More Efficient
How Hadoop Makes the Natixis Pack More Efficient How Hadoop Makes the Natixis Pack More Efficient
How Hadoop Makes the Natixis Pack More Efficient
DataWorks Summit/Hadoop Summit
 
Breaking the 1 Million OPS/SEC Barrier in HOPS Hadoop
Breaking the 1 Million OPS/SEC Barrier in HOPS HadoopBreaking the 1 Million OPS/SEC Barrier in HOPS Hadoop
Breaking the 1 Million OPS/SEC Barrier in HOPS Hadoop
DataWorks Summit/Hadoop Summit
 

More from DataWorks Summit/Hadoop Summit (20)

Running Apache Spark & Apache Zeppelin in Production
Running Apache Spark & Apache Zeppelin in ProductionRunning Apache Spark & Apache Zeppelin in Production
Running Apache Spark & Apache Zeppelin in Production
 
State of Security: Apache Spark & Apache Zeppelin
State of Security: Apache Spark & Apache ZeppelinState of Security: Apache Spark & Apache Zeppelin
State of Security: Apache Spark & Apache Zeppelin
 
Unleashing the Power of Apache Atlas with Apache Ranger
Unleashing the Power of Apache Atlas with Apache RangerUnleashing the Power of Apache Atlas with Apache Ranger
Unleashing the Power of Apache Atlas with Apache Ranger
 
Enabling Digital Diagnostics with a Data Science Platform
Enabling Digital Diagnostics with a Data Science PlatformEnabling Digital Diagnostics with a Data Science Platform
Enabling Digital Diagnostics with a Data Science Platform
 
Revolutionize Text Mining with Spark and Zeppelin
Revolutionize Text Mining with Spark and ZeppelinRevolutionize Text Mining with Spark and Zeppelin
Revolutionize Text Mining with Spark and Zeppelin
 
Double Your Hadoop Performance with Hortonworks SmartSense
Double Your Hadoop Performance with Hortonworks SmartSenseDouble Your Hadoop Performance with Hortonworks SmartSense
Double Your Hadoop Performance with Hortonworks SmartSense
 
Hadoop Crash Course
Hadoop Crash CourseHadoop Crash Course
Hadoop Crash Course
 
Data Science Crash Course
Data Science Crash CourseData Science Crash Course
Data Science Crash Course
 
Apache Spark Crash Course
Apache Spark Crash CourseApache Spark Crash Course
Apache Spark Crash Course
 
Dataflow with Apache NiFi
Dataflow with Apache NiFiDataflow with Apache NiFi
Dataflow with Apache NiFi
 
Schema Registry - Set you Data Free
Schema Registry - Set you Data FreeSchema Registry - Set you Data Free
Schema Registry - Set you Data Free
 
Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...
Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...
Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...
 
Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...
Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...
Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...
 
Mool - Automated Log Analysis using Data Science and ML
Mool - Automated Log Analysis using Data Science and MLMool - Automated Log Analysis using Data Science and ML
Mool - Automated Log Analysis using Data Science and ML
 
How Hadoop Makes the Natixis Pack More Efficient
How Hadoop Makes the Natixis Pack More Efficient How Hadoop Makes the Natixis Pack More Efficient
How Hadoop Makes the Natixis Pack More Efficient
 
HBase in Practice
HBase in Practice HBase in Practice
HBase in Practice
 
The Challenge of Driving Business Value from the Analytics of Things (AOT)
The Challenge of Driving Business Value from the Analytics of Things (AOT)The Challenge of Driving Business Value from the Analytics of Things (AOT)
The Challenge of Driving Business Value from the Analytics of Things (AOT)
 
Breaking the 1 Million OPS/SEC Barrier in HOPS Hadoop
Breaking the 1 Million OPS/SEC Barrier in HOPS HadoopBreaking the 1 Million OPS/SEC Barrier in HOPS Hadoop
Breaking the 1 Million OPS/SEC Barrier in HOPS Hadoop
 
From Regulatory Process Verification to Predictive Maintenance and Beyond wit...
From Regulatory Process Verification to Predictive Maintenance and Beyond wit...From Regulatory Process Verification to Predictive Maintenance and Beyond wit...
From Regulatory Process Verification to Predictive Maintenance and Beyond wit...
 
Backup and Disaster Recovery in Hadoop
Backup and Disaster Recovery in Hadoop Backup and Disaster Recovery in Hadoop
Backup and Disaster Recovery in Hadoop
 

Recently uploaded

IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
Enterprise Knowledge
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
giselly40
 

Recently uploaded (20)

08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your Business
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 

How to understand and analyze Apache Hive query execution plan for performance debugging

  • 1. How to understand and analyze Apache Hive query execution plan for performance debugging © Hortonworks Inc. 2011 – 2015. All Rights Reserved Pengcheng Xiong and Ashutosh Chauhan Hortonworks Inc., Apache Hive Community {pxiong,ashutosh}@hortonworks.com
  • 2. Page2 © Hortonworks Inc. 2011 – 2014. All Rights Reserved Goals: • WHY old style Hive explain plan is hard to read • Compare the old style explain with Postgres over a body of 500+ realistic SQL queries. • WHAT is new style Hive explain plan • Show orchestration of the tasks and operator trees, join sequences and algorithms, operator execution costs • HOW to performance debug a real query by analyzing the new Hive explain plan • Identify the potential improvement by changing join sequence, join algorithm and etc • Show the real improvement by running the query in real cluster • Integration/interaction with other system/tools • Future work
  • 3. Page3 © Hortonworks Inc. 2011 – 2014. All Rights Reserved WHY old style Hive explain plan is hard to read • M**** company’s schema and queries. • Comparison of explain plans between Hive and Postgres for the 528 queries they can both execute. • Hive: “explain”, Postgres: “explain verbose” Hive “old style”, 233.5 postgres, 53.8 Hive “old style”, 1289 postgres, 328.6 Average Lines Per Explain Plan Average Words Per Explain Plan N = 528 queries
  • 4. Page4 © Hortonworks Inc. 2011 – 2014. All Rights Reserved We can see that Hive old style explain is quite verbose, is it necessary? select a11.PBTNAME PBTNAME from PMT_INVENTORY a11 join LU_MONTH a12 on (a11.QUARTER_ID = a12.QUARTER_ID) where a12.MONTH_ID in (200607, 200606) group by a11.PBTNAME;
  • 5. Page5 © Hortonworks Inc. 2011 – 2014. All Rights Reserved High level plan comparison: Postgres QUERY PLAN --------------------------------------------------------------------------------- Group (cost=3.83..3.84 rows=2 width=18) Output: a11.pbtname Group Key: a11.pbtname -> Sort (cost=3.83..3.84 rows=2 width=18) Output: a11.pbtname Sort Key: a11.pbtname -> Hash Join (cost=2.62..3.83 rows=2 width=18) Output: a11.pbtname Hash Cond: (a11.quarter_id = a12.quarter_id) -> Seq Scan on public.pmt_inventory a11 (cost=0.00..1.12 rows=12 width=22) Output: a11.quarter_id, a11.pbtname -> Hash (cost=2.60..2.60 rows=2 width=4) Output: a12.quarter_id -> Seq Scan on public.lu_month a12 (cost=0.00..2.60 rows=2 width=4) Output: a12.quarter_id Filter: (a12.month_id = ANY ('{200607,200606}'::integer[])) 16 lines
  • 6. Page6 © Hortonworks Inc. 2011 – 2014. All Rights Reserved High level plan comparison: Hive STAGE DEPENDENCIES: Stage-1 is a root stage Stage-0 depends on stages: Stage-1 STAGE PLANS: Stage: Stage-1 Tez Edges: Map 2 <- Map 1 (BROADCAST_EDGE) Reducer 3 <- Map 2 (SIMPLE_EDGE) DagName: carter_20151114133018_2f2f0101-d14d-4688-bbb1-db67055016c3:946 Vertices: Map 1 Map Operator Tree: TableScan alias: a11 filterExpr: quarter_id is not null (type: boolean) Statistics: Num rows: 12 Data size: 1260 Basic stats: COMPLETE Column stats: NONE Filter Operator predicate: quarter_id is not null (type: boolean) Statistics: Num rows: 6 Data size: 630 Basic stats: COMPLETE Column stats: NONE Reduce Output Operator key expressions: quarter_id (type: int) sort order: + Map-reduce partition columns: quarter_id (type: int) Statistics: Num rows: 6 Data size: 630 Basic stats: COMPLETE Column stats: NONE value expressions: pbtname (type: string) Execution mode: vectorized Map 2 Map Operator Tree: TableScan alias: a12 filterExpr: (quarter_id is not null and (month_id) IN (200607, 200606)) (type: boolean) Statistics: Num rows: 48 Data size: 46752 Basic stats: COMPLETE Column stats: NONE Filter Operator predicate: (quarter_id is not null and (month_id) IN (200607, 200606)) (type: boolean) Statistics: Num rows: 12 Data size: 11688 Basic stats: COMPLETE Column stats: NONE Map Join Operator condition map: Inner Join 0 to 1 keys: 0 quarter_id (type: int) 1 quarter_id (type: int) outputColumnNames: _col1 input vertices: 0 Map 1 Statistics: Num rows: 13 Data size: 12856 Basic stats: COMPLETE Column stats: NONE HybridGraceHashJoin: true Group By Operator keys: _col1 (type: string) mode: hash outputColumnNames: _col0 Statistics: Num rows: 13 Data size: 12856 Basic stats: COMPLETE Column stats: NONE Reduce Output Operator key expressions: _col0 (type: string) sort order: + Map-reduce partition columns: _col0 (type: string) Statistics: Num rows: 13 Data size: 12856 Basic stats: COMPLETE Column stats: NONE Execution mode: vectorized Reducer 3 Reduce Operator Tree: Group By Operator keys: KEY._col0 (type: string) mode: mergepartial outputColumnNames: _col0 Statistics: Num rows: 6 Data size: 5933 Basic stats: COMPLETE Column stats: NONE File Output Operator compressed: false Statistics: Num rows: 6 Data size: 5933 Basic stats: COMPLETE Column stats: NONE table: input format: org.apache.hadoop.mapred.TextInputFormat output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe Stage: Stage-0 Fetch Operator limit: -1 Processor Tree: ListSink Too verbose, need a magnifier! 80+ lines
  • 7. Page7 © Hortonworks Inc. 2011 – 2014. All Rights Reserved Map 2 Operator Map 2 Map Operator Tree: TableScan alias: a12 filterExpr: (quarter_id is not null and (month_id) IN (200607, 200606)) (type: boolean) Statistics: Num rows: 48 Data size: 46752 Basic stats: COMPLETE Column stats: NONE Filter Operator predicate: (quarter_id is not null and (month_id) IN (200607, 200606)) (type: boolean) Statistics: Num rows: 12 Data size: 11688 Basic stats: COMPLETE Column stats: NONE Map Join Operator condition map: Inner Join 0 to 1 keys: 0 quarter_id (type: int) 1 quarter_id (type: int) outputColumnNames: _col1 input vertices: 0 Map 1 Statistics: Num rows: 13 Data size: 12856 Basic stats: COMPLETE Column stats: NONE HybridGraceHashJoin: true Group By Operator keys: _col1 (type: string) mode: hash outputColumnNames: _col0 Statistics: Num rows: 13 Data size: 12856 Basic stats: COMPLETE Column stats: NONE Reduce Output Operator key expressions: _col0 (type: string) sort order: + Map-reduce partition columns: _col0 (type: string) Statistics: Num rows: 13 Data size: 12856 Basic stats: COMPLETE Column stats: NONE Execution mode: vectorized Data flows from top to bottom Each operator has 0 or 1 child
  • 8. Page8 © Hortonworks Inc. 2011 – 2014. All Rights Reserved Map 2 Operator Map 2 Map Operator Tree: TableScan alias: a12 filterExpr: (quarter_id is not null and (month_id) IN (200607, 200606)) (type: boolean) Statistics: Num rows: 48 Data size: 46752 Basic stats: COMPLETE Column stats: NONE Filter Operator predicate: (quarter_id is not null and (month_id) IN (200607, 200606)) (type: boolean) Statistics: Num rows: 12 Data size: 11688 Basic stats: COMPLETE Column stats: NONE Map Join Operator condition map: Inner Join 0 to 1 keys: 0 quarter_id (type: int) 1 quarter_id (type: int) outputColumnNames: _col1 input vertices: 0 Map 1 Statistics: Num rows: 13 Data size: 12856 Basic stats: COMPLETE Column stats: NONE HybridGraceHashJoin: true Group By Operator keys: _col1 (type: string) mode: hash outputColumnNames: _col0 Statistics: Num rows: 13 Data size: 12856 Basic stats: COMPLETE Column stats: NONE Reduce Output Operator key expressions: _col0 (type: string) sort order: + Map-reduce partition columns: _col0 (type: string) Statistics: Num rows: 13 Data size: 12856 Basic stats: COMPLETE Column stats: NONE Execution mode: vectorized Must scroll to another part of the plan to see what this is
  • 9. Page9 © Hortonworks Inc. 2011 – 2014. All Rights Reserved Map 1 Operator Map 1 Map Operator Tree: TableScan alias: a11 filterExpr: quarter_id is not null (type: boolean) Statistics: Num rows: 12 Data size: 1260 Basic stats: COMPLETE Column stats: NONE Filter Operator predicate: quarter_id is not null (type: boolean) Statistics: Num rows: 6 Data size: 630 Basic stats: COMPLETE Column stats: NONE Reduce Output Operator key expressions: quarter_id (type: int) sort order: + Map-reduce partition columns: quarter_id (type: int) Statistics: Num rows: 6 Data size: 630 Basic stats: COMPLETE Column stats: NONE value expressions: pbtname (type: string) Execution mode: vectorized Actual table name (PMT_INVENTORY) not mentioned anywhere, only the alias
  • 10. Page10 © Hortonworks Inc. 2011 – 2014. All Rights Reserved Map 1 Operator Map 1 Map Operator Tree: TableScan alias: a11 filterExpr: quarter_id is not null (type: boolean) Statistics: Num rows: 12 Data size: 1260 Basic stats: COMPLETE Column stats: NONE Filter Operator predicate: quarter_id is not null (type: boolean) Statistics: Num rows: 6 Data size: 630 Basic stats: COMPLETE Column stats: NONE Reduce Output Operator key expressions: quarter_id (type: int) sort order: + Map-reduce partition columns: quarter_id (type: int) Statistics: Num rows: 6 Data size: 630 Basic stats: COMPLETE Column stats: NONE value expressions: pbtname (type: string) Execution mode: vectorized How much of this information is really necessary to SQL users?
  • 11. Page11 © Hortonworks Inc. 2011 – 2014. All Rights Reserved Back to Postgres QUERY PLAN --------------------------------------------------------------------------------- Group (cost=3.83..3.84 rows=2 width=18) Output: a11.pbtname Group Key: a11.pbtname -> Sort (cost=3.83..3.84 rows=2 width=18) Output: a11.pbtname Sort Key: a11.pbtname -> Hash Join (cost=2.62..3.83 rows=2 width=18) Output: a11.pbtname Hash Cond: (a11.quarter_id = a12.quarter_id) -> Seq Scan on public.pmt_inventory a11 (cost=0.00..1.12 rows=12 width=22) Output: a11.quarter_id, a11.pbtname -> Hash (cost=2.60..2.60 rows=2 width=4) Output: a12.quarter_id -> Seq Scan on public.lu_month a12 (cost=0.00..2.60 rows=2 width=4) Output: a12.quarter_id Filter: (a12.month_id = ANY ('{200607,200606}'::integer[])) Data flows bottom to top
  • 12. Page12 © Hortonworks Inc. 2011 – 2014. All Rights Reserved Back to Postgres QUERY PLAN --------------------------------------------------------------------------------- Group (cost=3.83..3.84 rows=2 width=18) Output: a11.pbtname Group Key: a11.pbtname -> Sort (cost=3.83..3.84 rows=2 width=18) Output: a11.pbtname Sort Key: a11.pbtname -> Hash Join (cost=2.62..3.83 rows=2 width=18) Output: a11.pbtname Hash Cond: (a11.quarter_id = a12.quarter_id) -> Seq Scan on public.pmt_inventory a11 (cost=0.00..1.12 rows=12 width=22) Output: a11.quarter_id, a11.pbtname -> Hash (cost=2.60..2.60 rows=2 width=4) Output: a12.quarter_id -> Seq Scan on public.lu_month a12 (cost=0.00..2.60 rows=2 width=4) Output: a12.quarter_id Filter: (a12.month_id = ANY ('{200607,200606}'::integer[])) Operators have multiple children when it makes sense
  • 13. Page13 © Hortonworks Inc. 2011 – 2014. All Rights Reserved Back to Postgres QUERY PLAN --------------------------------------------------------------------------------- Group (cost=3.83..3.84 rows=2 width=18) Output: a11.pbtname Group Key: a11.pbtname -> Sort (cost=3.83..3.84 rows=2 width=18) Output: a11.pbtname Sort Key: a11.pbtname -> Hash Join (cost=2.62..3.83 rows=2 width=18) Output: a11.pbtname Hash Cond: (a11.quarter_id = a12.quarter_id) -> Seq Scan on public.pmt_inventory a11 (cost=0.00..1.12 rows=12 width=22) Output: a11.quarter_id, a11.pbtname -> Hash (cost=2.60..2.60 rows=2 width=4) Output: a12.quarter_id -> Seq Scan on public.lu_month a12 (cost=0.00..2.60 rows=2 width=4) Output: a12.quarter_id Filter: (a12.month_id = ANY ('{200607,200606}'::integer[])) Join is done using a scan of pmt_inventory and a hash following a scan of lu_month. All this info is available without referring to a stage plan. IOW you don’t have to jump around in the plan.
  • 14. Page14 © Hortonworks Inc. 2011 – 2014. All Rights Reserved Back to Postgres QUERY PLAN --------------------------------------------------------------------------------- Group (cost=3.83..3.84 rows=2 width=18) Output: a11.pbtname Group Key: a11.pbtname -> Sort (cost=3.83..3.84 rows=2 width=18) Output: a11.pbtname Sort Key: a11.pbtname -> Hash Join (cost=2.62..3.83 rows=2 width=18) Output: a11.pbtname Hash Cond: (a11.quarter_id = a12.quarter_id) -> Seq Scan on public.pmt_inventory a11 (cost=0.00..1.12 rows=12 width=22) Output: a11.quarter_id, a11.pbtname -> Hash (cost=2.60..2.60 rows=2 width=4) Output: a12.quarter_id -> Seq Scan on public.lu_month a12 (cost=0.00..2.60 rows=2 width=4) Output: a12.quarter_id Filter: (a12.month_id = ANY ('{200607,200606}'::integer[])) Actual schema / table names visible
  • 15. Page15 © Hortonworks Inc. 2011 – 2014. All Rights Reserved Back to Postgres QUERY PLAN --------------------------------------------------------------------------------- Group (cost=3.83..3.84 rows=2 width=18) Output: a11.pbtname Group Key: a11.pbtname -> Sort (cost=3.83..3.84 rows=2 width=18) Output: a11.pbtname Sort Key: a11.pbtname -> Hash Join (cost=2.62..3.83 rows=2 width=18) Output: a11.pbtname Hash Cond: (a11.quarter_id = a12.quarter_id) -> Seq Scan on public.pmt_inventory a11 (cost=0.00..1.12 rows=12 width=22) Output: a11.quarter_id, a11.pbtname -> Hash (cost=2.60..2.60 rows=2 width=4) Output: a12.quarter_id -> Seq Scan on public.lu_month a12 (cost=0.00..2.60 rows=2 width=4) Output: a12.quarter_id Filter: (a12.month_id = ANY ('{200607,200606}'::integer[])) Cost mentioned once per operator Cost monotonically increases as you go up
  • 16. Page16 © Hortonworks Inc. 2011 – 2014. All Rights Reserved WHAT is new style Hive explain plan (HIVE-9780) • Set hive.explain.user=true; (by default). Use Tez, LLAP, etc Stage-1 Reducer 3 File Output Operator [FS_14] Group By Operator [GBY_12] (rows=8 width=101) Output:["_col0"],keys:KEY._col0 <-Map 2 [SIMPLE_EDGE] SHUFFLE [RS_11] PartitionCols:_col0 Group By Operator [GBY_10] (rows=8 width=101) Output:["_col0"],keys:_col1 Map Join Operator [MAPJOIN_19] (rows=33 width=101) Conds:RS_6._col0=SEL_5._col1(Inner),HybridGraceHashJoin:true,Output:["_col1"] <-Map 1 [BROADCAST_EDGE] BROADCAST [RS_6] PartitionCols:_col0 Select Operator [SEL_2] (rows=12 width=105) Output:["_col0","_col1"] Filter Operator [FIL_17] (rows=12 width=105) predicate:quarter_id is not null TableScan [TS_0] (rows=12 width=105) m****@pmt_inventory,a11,Tbl:COMPLETE,Col:COMPLETE,Output:["quarter_id","pbtname"] <-Select Operator [SEL_5] (rows=48 width=8) Output:["_col1"] Filter Operator [FIL_18] (rows=48 width=8) predicate:((month_id) IN (200607, 200606) and quarter_id is not null) TableScan [TS_3] (rows=48 width=8) m****@lu_month,a12,Tbl:COMPLETE,Col:COMPLETE,Output:["month_id","quarter_id"] Immediate Notes: 1. Much smaller 2. Can be read in order
  • 17. Page17 © Hortonworks Inc. 2011 – 2014. All Rights Reserved WHAT is new style Hive explain plan (HIVE-9780) Stage-1 Reducer 3 File Output Operator [FS_14] Group By Operator [GBY_12] (rows=8 width=101) Output:["_col0"],keys:KEY._col0 <-Map 2 [SIMPLE_EDGE] SHUFFLE [RS_11] PartitionCols:_col0 Group By Operator [GBY_10] (rows=8 width=101) Output:["_col0"],keys:_col1 Map Join Operator [MAPJOIN_19] (rows=33 width=101) Conds:RS_6._col0=SEL_5._col1(Inner),HybridGraceHashJoin:true,Output:["_col1"] <-Map 1 [BROADCAST_EDGE] BROADCAST [RS_6] PartitionCols:_col0 Select Operator [SEL_2] (rows=12 width=105) Output:["_col0","_col1"] Filter Operator [FIL_17] (rows=12 width=105) predicate:quarter_id is not null TableScan [TS_0] (rows=12 width=105) m****@pmt_inventory,a11,Tbl:COMPLETE,Col:COMPLETE,Output:["quarter_id","pbtname"] <-Select Operator [SEL_5] (rows=48 width=8) Output:["_col1"] Filter Operator [FIL_18] (rows=48 width=8) predicate:((month_id) IN (200607, 200606) and quarter_id is not null) TableScan [TS_3] (rows=48 width=8) m****@lu_month,a12,Tbl:COMPLETE,Col:COMPLETE,Output:["month_id","quarter_id"] Data flows bottom to top
  • 18. Page18 © Hortonworks Inc. 2011 – 2014. All Rights Reserved WHAT is new style Hive explain plan (HIVE-9780) Stage-1 Reducer 3 File Output Operator [FS_14] Group By Operator [GBY_12] (rows=8 width=101) Output:["_col0"],keys:KEY._col0 <-Map 2 [SIMPLE_EDGE] SHUFFLE [RS_11] PartitionCols:_col0 Group By Operator [GBY_10] (rows=8 width=101) Output:["_col0"],keys:_col1 Map Join Operator [MAPJOIN_19] (rows=33 width=101) Conds:RS_6._col0=SEL_5._col1(Inner),HybridGraceHashJoin:true,Output:["_col1"] <-Map 1 [BROADCAST_EDGE] BROADCAST [RS_6] PartitionCols:_col0 Select Operator [SEL_2] (rows=12 width=105) Output:["_col0","_col1"] Filter Operator [FIL_17] (rows=12 width=105) predicate:quarter_id is not null TableScan [TS_0] (rows=12 width=105) m****@pmt_inventory,a11,Tbl:COMPLETE,Col:COMPLETE,Output:["quarter_id","pbtname"] <-Select Operator [SEL_5] (rows=48 width=8) Output:["_col1"] Filter Operator [FIL_18] (rows=48 width=8) predicate:((month_id) IN (200607, 200606) and quarter_id is not null) TableScan [TS_3] (rows=48 width=8) m****@lu_month,a12,Tbl:COMPLETE,Col:COMPLETE,Output:["month_id","quarter_id"] Operators have multiple children when it makes sense
  • 19. Page19 © Hortonworks Inc. 2011 – 2014. All Rights Reserved WHAT is new style Hive explain plan (HIVE-9780) Stage-1 Reducer 3 File Output Operator [FS_14] Group By Operator [GBY_12] (rows=8 width=101) Output:["_col0"],keys:KEY._col0 <-Map 2 [SIMPLE_EDGE] SHUFFLE [RS_11] PartitionCols:_col0 Group By Operator [GBY_10] (rows=8 width=101) Output:["_col0"],keys:_col1 Map Join Operator [MAPJOIN_19] (rows=33 width=101) Conds:RS_6._col0=SEL_5._col1(Inner),HybridGraceHashJoin:true,Output:["_col1"] <-Map 1 [BROADCAST_EDGE] BROADCAST [RS_6] PartitionCols:_col0 Select Operator [SEL_2] (rows=12 width=105) Output:["_col0","_col1"] Filter Operator [FIL_17] (rows=12 width=105) predicate:quarter_id is not null TableScan [TS_0] (rows=12 width=105) m****@pmt_inventory,a11,Tbl:COMPLETE,Col:COMPLETE,Output:["quarter_id","pbtname"] <-Select Operator [SEL_5] (rows=48 width=8) Output:["_col1"] Filter Operator [FIL_18] (rows=48 width=8) predicate:((month_id) IN (200607, 200606) and quarter_id is not null) TableScan [TS_3] (rows=48 width=8) m****@lu_month,a12,Tbl:COMPLETE,Col:COMPLETE,Output:["month_id","quarter_id"] Join’s information is clear pmt_inventory is broadcasted to lu_month and a MapJoin is done
  • 20. Page20 © Hortonworks Inc. 2011 – 2014. All Rights Reserved WHAT is new style Hive explain plan (HIVE-9780) Stage-1 Reducer 3 File Output Operator [FS_14] Group By Operator [GBY_12] (rows=8 width=101) Output:["_col0"],keys:KEY._col0 <-Map 2 [SIMPLE_EDGE] SHUFFLE [RS_11] PartitionCols:_col0 Group By Operator [GBY_10] (rows=8 width=101) Output:["_col0"],keys:_col1 Map Join Operator [MAPJOIN_19] (rows=33 width=101) Conds:RS_6._col0=SEL_5._col1(Inner),HybridGraceHashJoin:true,Output:["_col1"] <-Map 1 [BROADCAST_EDGE] BROADCAST [RS_6] PartitionCols:_col0 Select Operator [SEL_2] (rows=12 width=105) Output:["_col0","_col1"] Filter Operator [FIL_17] (rows=12 width=105) predicate:quarter_id is not null TableScan [TS_0] (rows=12 width=105) m****@pmt_inventory,a11,Tbl:COMPLETE,Col:COMPLETE,Output:["quarter_id","pbtname"] <-Select Operator [SEL_5] (rows=48 width=8) Output:["_col1"] Filter Operator [FIL_18] (rows=48 width=8) predicate:((month_id) IN (200607, 200606) and quarter_id is not null) TableScan [TS_3] (rows=48 width=8) m****@lu_month,a12,Tbl:COMPLETE,Col:COMPLETE,Output:["month_id","quarter_id"] Cost mentioned once per operator
  • 21. Page21 © Hortonworks Inc. 2011 – 2014. All Rights Reserved HOW to performance debug a real query (TPC-DS Q3, 1TB) select dt.d_year, item.i_brand_id brand_id, item.i_brand brand, sum(ss_ext_sales_price) sum_agg from date_dim dt, store_sales, item where dt.d_date_sk = store_sales.ss_sold_date_sk and store_sales.ss_item_sk = item.i_item_sk and item.i_manufact_id = 436 and dt.d_moy = 12 group by dt.d_year , item.i_brand , item.i_brand_id order by dt.d_year , sum_agg desc , brand_id limit 10 partitioned by ss_sold_date_sk
  • 22. Page22 © Hortonworks Inc. 2011 – 2014. All Rights Reserved <-Reducer 3 [SIMPLE_EDGE] SHUFFLE [RS_15] PartitionCols:_col0, _col1, _col2 Group By Operator [GBY_14] (rows=1 width=116) Output:["_col0","_col1","_col2","_col3"],aggregations:["sum(_col45)"],keys:_col6, _col65, _col64 Select Operator [SEL_13] (rows=76515 width=128) Output:["_col6","_col65","_col64","_col45"] Filter Operator [FIL_23] (rows=76515 width=128) predicate:((_col0 = _col53) and (_col32 = _col57)) Merge Join Operator [MERGEJOIN_28] (rows=306061 width=128) Conds:RS_8._col32=RS_34.i_item_sk(Inner),Output:["_col0","_col6","_col32","_col45","_col53","_col57","_col64","_col65"] <-Map 7 [SIMPLE_EDGE] vectorized SHUFFLE [RS_34] PartitionCols:i_item_sk Filter Operator [FIL_33] (rows=434 width=111) predicate:(i_item_sk is not null and (i_manufact_id = 436)) TableScan [TS_2] (rows=300000 width=111) tpcds_bin_partitioned_orc_1000@item,item,Tbl:COMPLETE,Col:COMPLETE,Output:["i_item_sk","i_brand_id","i_brand","i_manufact_id"] <-Reducer 2 [SIMPLE_EDGE] SHUFFLE [RS_8] PartitionCols:_col32 Merge Join Operator [MERGEJOIN_27] (rows=211562452 width=20) Conds:RS_30.d_date_sk=RS_32.ss_sold_date_sk(Inner),Output:["_col0","_col6","_col32","_col45","_col53"] <-Map 1 [SIMPLE_EDGE] vectorized SHUFFLE [RS_30] PartitionCols:d_date_sk Filter Operator [FIL_29] (rows=5619 width=12) predicate:(d_date_sk is not null and (d_moy = 12)) TableScan [TS_0] (rows=73049 width=12) tpcds_bin_partitioned_orc_1000@date_dim,dt,Tbl:COMPLETE,Col:COMPLETE,Output:["d_date_sk","d_year","d_moy"] <-Map 6 [SIMPLE_EDGE] vectorized SHUFFLE [RS_32] PartitionCols:ss_sold_date_sk Filter Operator [FIL_31] (rows=2750387156 width=11) predicate:ss_item_sk is not null TableScan [TS_1] (rows=2750387156 width=11) tpcds_bin_partitioned_orc_1000@store_sales,store_sales,Tbl:COMPLETE,Col:COMPLETE,Output:["ss_item_sk","ss_ext_sales_price"] Original plan runs 163.33s. Sounds like column pruning and predicate push down are working fine. However, the join sequence store_sales✖date_dim✖item is not good enough. A better one is store_sales✖item✖date_dim Table Cardinality Cardinality after filter Selectivity date_dim 73K 5619 7.6% item 300K 434 0.14%
  • 23. Page23 © Hortonworks Inc. 2011 – 2014. All Rights Reserved <-Reducer 3 [SIMPLE_EDGE] SHUFFLE [RS_17] PartitionCols:_col0, _col1, _col2 Group By Operator [GBY_16] (rows=9 width=116) Output:["_col0","_col1","_col2","_col3"],aggregations:["sum(_col1)"],keys:_col8, _col4, _col5 Select Operator [SEL_15] (rows=306061 width=112) Output:["_col8","_col4","_col5","_col1"] Merge Join Operator [MERGEJOIN_29] (rows=306061 width=112) Conds:RS_12._col2=RS_38._col0(Inner),Output:["_col1","_col4","_col5","_col8"] <-Map 7 [SIMPLE_EDGE] vectorized SHUFFLE [RS_38] PartitionCols:_col0 Select Operator [SEL_37] (rows=5619 width=12) Output:["_col0","_col1"] Filter Operator [FIL_36] (rows=5619 width=12) predicate:((d_moy = 12) and d_date_sk is not null) TableScan [TS_6] (rows=73049 width=12) tpcds_bin_partitioned_orc_1000@date_dim,dt,Tbl:COMPLETE,Col:COMPLETE,Output:["d_date_sk","d_year","d_moy"] <-Reducer 2 [SIMPLE_EDGE] SHUFFLE [RS_12] PartitionCols:_col2 Merge Join Operator [MERGEJOIN_28] (rows=3978894 width=112) Conds:RS_32._col0=RS_35._col0(Inner),Output:["_col1","_col2","_col4","_col5"] <-Map 1 [SIMPLE_EDGE] vectorized SHUFFLE [RS_32] PartitionCols:_col0 Select Operator [SEL_31] (rows=2750387156 width=11) Output:["_col0","_col1","_col2"] Filter Operator [FIL_30] (rows=2750387156 width=11) predicate:ss_item_sk is not null TableScan [TS_0] (rows=2750387156 width=11) tpcds_bin_partitioned_orc_1000@store_sales,store_sales,Tbl:COMPLETE,Col:COMPLETE,Output:["ss_item_sk","ss_ext_sales_price"] <-Map 6 [SIMPLE_EDGE] vectorized SHUFFLE [RS_35] PartitionCols:_col0 Select Operator [SEL_34] (rows=434 width=111) Output:["_col0","_col1","_col2"] Filter Operator [FIL_33] (rows=434 width=111) predicate:((i_manufact_id = 436) and i_item_sk is not null) TableScan [TS_3] (rows=300000 width=111) tpcds_bin_partitioned_orc_1000@item,item,Tbl:COMPLETE,Col:COMPLETE,Output:["i_item_sk","i_brand_id","i_brand","i_manufact_id"] CBO on, new plan runs 143.97s with new join sequence store_sales✖item✖date_dim. The input data size of one branch of join is pretty small, should use map join, rather than merge join.
  • 24. Page24 © Hortonworks Inc. 2011 – 2014. All Rights Reserved SHUFFLE [RS_44] PartitionCols:_col0, _col1, _col2 Group By Operator [GBY_43] (rows=9 width=116) Output:["_col0","_col1","_col2","_col3"],aggregations:["sum(_col1)"],keys:_col8, _col4, _col5 Select Operator [SEL_42] (rows=306061 width=112) Output:["_col8","_col4","_col5","_col1"] Map Join Operator [MAPJOIN_41] (rows=306061 width=112) Conds:MAPJOIN_40._col2=RS_37._col0(Inner),HybridGraceHashJoin:true,Output:["_col1","_col4","_col5","_col8"] <-Map 5 [BROADCAST_EDGE] vectorized BROADCAST [RS_37] PartitionCols:_col0 Select Operator [SEL_36] (rows=5619 width=12) Output:["_col0","_col1"] Filter Operator [FIL_35] (rows=5619 width=12) predicate:((d_moy = 12) and d_date_sk is not null) TableScan [TS_6] (rows=73049 width=12) tpcds_bin_partitioned_orc_1000@date_dim,dt,Tbl:COMPLETE,Col:COMPLETE,Output:["d_date_sk","d_year","d_moy"] <-Map Join Operator [MAPJOIN_40] (rows=3978894 width=112) Conds:SEL_39._col0=RS_34._col0(Inner),HybridGraceHashJoin:true,Output:["_col1","_col2","_col4","_col5"] <-Map 4 [BROADCAST_EDGE] vectorized BROADCAST [RS_34] PartitionCols:_col0 Select Operator [SEL_33] (rows=434 width=111) Output:["_col0","_col1","_col2"] Filter Operator [FIL_32] (rows=434 width=111) predicate:((i_manufact_id = 436) and i_item_sk is not null) TableScan [TS_3] (rows=300000 width=111) tpcds_bin_partitioned_orc_1000@item,item,Tbl:COMPLETE,Col:COMPLETE,Output:["i_item_sk","i_brand_id","i_brand","i_manufact_id"] <-Select Operator [SEL_39] (rows=2750387156 width=11) Output:["_col0","_col1","_col2"] Filter Operator [FIL_38] (rows=2750387156 width=11) predicate:ss_item_sk is not null TableScan [TS_0] (rows=2750387156 width=11) tpcds_bin_partitioned_orc_1000@store_sales,store_sales,Tbl:COMPLETE,Col:COMPLETE,Output:["ss_item_sk","ss_ext_sales_price"] Increase hive.auto.convert.join.noconditionaltask.size= 1,359,688,499, we can see it is now using map join operators. New plan runs 45.84s. store_sales is a partitioned table on the join key ss_sold_date_sk with date_dim table.
  • 25. Page25 © Hortonworks Inc. 2011 – 2014. All Rights Reserved SHUFFLE [RS_55] PartitionCols:_col0, _col1, _col2 Group By Operator [GBY_54] (rows=9 width=116) Output:["_col0","_col1","_col2","_col3"],aggregations:["sum(_col1)"],keys:_col8, _col4, _col5 Select Operator [SEL_53] (rows=306061 width=112) Output:["_col8","_col4","_col5","_col1"] Map Join Operator [MAPJOIN_52] (rows=306061 width=112) Conds:MAPJOIN_51._col2=RS_45._col0(Inner),HybridGraceHashJoin:true,Output:["_col1","_col4","_col5","_col8"] <-Map 5 [BROADCAST_EDGE] vectorized BROADCAST [RS_45] PartitionCols:_col0 Select Operator [SEL_44] (rows=5619 width=12) Output:["_col0","_col1"] Filter Operator [FIL_43] (rows=5619 width=12) predicate:((d_moy = 12) and d_date_sk is not null) TableScan [TS_6] (rows=73049 width=12) tpcds_bin_partitioned_orc_1000@date_dim,dt,Tbl:COMPLETE,Col:COMPLETE,Output:["d_date_sk","d_year","d_moy"] Dynamic Partitioning Event Operator [EVENT_48] (rows=2809 width=12) Group By Operator [GBY_47] (rows=2809 width=12) Output:["_col0"],keys:_col0 Select Operator [SEL_46] (rows=5619 width=12) Output:["_col0"] Please refer to the previous Select Operator [SEL_44] <-Map Join Operator [MAPJOIN_51] (rows=3978894 width=112) Conds:SEL_50._col0=RS_42._col0(Inner),HybridGraceHashJoin:true,Output:["_col1","_col2","_col4","_col5"] <-Map 4 [BROADCAST_EDGE] vectorized BROADCAST [RS_42] PartitionCols:_col0 Select Operator [SEL_41] (rows=434 width=111) Output:["_col0","_col1","_col2"] Filter Operator [FIL_40] (rows=434 width=111) predicate:((i_manufact_id = 436) and i_item_sk is not null) TableScan [TS_3] (rows=300000 width=111) tpcds_bin_partitioned_orc_1000@item,item,Tbl:COMPLETE,Col:COMPLETE,Output:["i_item_sk","i_brand_id","i_brand","i_manufact_id"] <-Select Operator [SEL_50] (rows=2750387156 width=11) Output:["_col0","_col1","_col2"] Filter Operator [FIL_49] (rows=2750387156 width=11) predicate:ss_item_sk is not null TableScan [TS_0] (rows=2750387156 width=11) tpcds_bin_partitioned_orc_1000@store_sales,store_sales,Tbl:COMPLETE,Col:COMPLETE,Output:["ss_item_sk","ss_ext_sales_price"] By setting hive.tez.dynamic.partition.pruning=true, we can see dynamic partitioning event operators. See more about this in HIVE-7826. New plan runs 31.35s. In the run time, dynamic partition event operator will send values needed to prune to the application master - where splits are generated and tasks are submitted. Using these values we can strip out any unneeded partitions dynamically, while the query is running.
  • 26. Page26 © Hortonworks Inc. 2011 – 2014. All Rights Reserved Performance debugging summary (TPC-DS Q3, 1TB) 0 20 40 60 80 100 120 140 160 180 Original Join re-order Join selection Dynamic partition pruning Queryexecutiontime(s) Query execution time Continuous improvement
  • 27. Page27 © Hortonworks Inc. 2011 – 2014. All Rights Reserved Integration with Apache Ambari 1. Type the query here 2. Click “explain” 3. explain plan will be shown
  • 28. Page28 © Hortonworks Inc. 2011 – 2014. All Rights Reserved  “set hive.tez.exec.print.summary=true;”  Get more insights on query performance Can be used along with Tez vertex runtime stats Runtime stats
  • 29. Page29 © Hortonworks Inc. 2011 – 2014. All Rights Reserved Summary • Show old style Hive explain plan is hard to read. • Verbose with too much redundant information, hard to follow how data flows, cost of operator is unclear • Compare with Postgres over a body of 500+ realistic SQL queries and identify the candidate improving points • Introduce new style Hive explain plan • Use a concrete example to help understand the explain: execution cost, join sequence and orchestration of the operator tree • Use the new Hive explain plan to performance debug TPC-DS Q3 • Show the improvement after join re-ordering, join selection, and dynamic partition pruning • Integration/interaction with other system/tools
  • 30. Page30 © Hortonworks Inc. 2011 – 2014. All Rights Reserved Future work -- Some gaps remain after HIVE-9780 • Put the real schema, table and column names in the explain plan, e.g., no more _col0 etc. • This will help users to understand the plan better • HIVE-8681: CBO: Column names are missing from join expression in Map join with CBO enabled • Get an equivalent of “EXPLAIN ANALYZE” – such as operator level runtime stats and warnings. • This will help users to find out the gap between estimated cost and real cost • HIVE-14362: Support explain analyze in Hive
  • 31. Page31 © Hortonworks Inc. 2011 – 2014. All Rights Reserved SHUFFLE [RS_55] PartitionCols:d_year, i_brand, i_brand_id Group By Operator [GBY_54] (rows=9/10 width=116) Output:["d_year","sum_agg","i_brand","i_brand_id"],aggregations:["sum(ss_ext_sales_price)"],keys:d_year, i_brand, i_brand_id Select Operator [SEL_53] (rows=306061/324651 width=112) Output:["d_year","i_brand","i_brand_id","ss_ext_sales_price"] Map Join Operator [MAPJOIN_52] (rows=306061/324651 width=112) Conds:MAPJOIN_51. ss_sold_date_sk=RS_45. d_date_sk(Inner),HybridGraceHashJoin:true,Output:["d_year","i_brand","i_brand_id","ss_ext_sales_price"] <-Map 5 [BROADCAST_EDGE] vectorized BROADCAST [RS_45] PartitionCols:d_date_sk Select Operator [SEL_44] (rows=5619/6034 width=12) Output:["d_date_sk","d_year"] Filter Operator [FIL_43] (rows=5619/6034 width=12) predicate:((d_moy = 12) and d_date_sk is not null) TableScan [TS_6] (rows=73049/73049 width=12) tpcds_bin_partitioned_orc_1000@date_dim,dt,Tbl:COMPLETE,Col:COMPLETE,Output:["d_date_sk","d_year","d_moy"] Dynamic Partitioning Event Operator [EVENT_48] (rows=2809 width=12) Group By Operator [GBY_47] (rows=2809/2324 width=12) Output:["d_date_sk"],keys:d_date_sk Select Operator [SEL_46] (rows=5619/6034 width=12) Output:["d_date_sk"] Please refer to the previous Select Operator [SEL_44] <-Map Join Operator [MAPJOIN_51] (rows=3978894/4202377 width=112) Conds:SEL_50.ss_item_sk=RS_42. i_item_sk(Inner),HybridGraceHashJoin:true,Output:["ss_ext_sales_price",” ss_sold_date_sk","i_brand","i_brand_id"] <-Map 4 [BROADCAST_EDGE] vectorized BROADCAST [RS_42] PartitionCols:i_item_sk Select Operator [SEL_41] (rows=434/453 width=111) Output:[” i_item_sk","i_brand_id","i_brand"] Filter Operator [FIL_40] (rows=434/453 width=111) predicate:((i_manufact_id = 436) and i_item_sk is not null) TableScan [TS_3] (rows=300000/300000 width=111) tpcds_bin_partitioned_orc_1000@item,item,Tbl:COMPLETE,Col:COMPLETE,Output:["i_item_sk","i_brand_id","i_brand","i_manufact_id"] <-Select Operator [SEL_50] (rows=2750387156/2750387156 width=11) Output:["ss_item_sk","ss_ext_sales_price","ss_sold_date_sk"] Filter Operator [FIL_49] (rows=2750387156/2750387156 width=11) predicate:ss_item_sk is not null TableScan [TS_0] (rows=2750387156/2750387156 width=11) tpcds_bin_partitioned_orc_1000@store_sales,store_sales,Tbl:COMPLETE,Col:COMPLETE,Output:["ss_item_sk","ss_ext_sales_price"]
  • 32. Page32 © Hortonworks Inc. 2011 – 2014. All Rights Reserved Acknowledgement • We thank all the anonymous reviewers’ votes to give us this opportunity to share our work. • Part of the slides are borrowed from or modified based on Carter Shanklin and Rajesh Balamohan’s slides. • We thank Gunther Hagleitner for all the support and inputs. • We thank Sapin Amin for setting up the testing cluster.
  • 33. Page33 © Hortonworks Inc. 2011 – 2014. All Rights Reserved Thank you! Questions?

Editor's Notes

  1. Hive contributors have striven to improve the capability of Hive in terms of both performance and functionality. We assert that understanding and analyzing Apache Hive query execution plan is crucial for performance debugging. In this talk, we study why Apache Hive’s query execution plan today are so difficult to analyze. We identify a set of pain points from our Apache Hive performance engineers, development engineers as well as real users/customers. We propose and show a new presentation data model that can well address the pain points. The three most critical parts of the presentation are (1) the estimated query execution cost, which is the planner's guess at how long it will take to run the query (measured in #rows); (2) the orchestration of the operator tree across consecutive M/R (Tez) jobs; and (3) integration and extension support with other presentation tools, e.g., Apache Ambari.
  2. We can see that Hive old style explain is quite verbose, is it necessary?
  3. We pick a real query out of the 500 queries.
  4. 16 lines
  5. 80 lines
  6. Although Reduce Output Operator is crucial as it defines the boundary between a map task and a reduce task, we are thinking how much information of the reduce output operator is.... SQL users care about more about relational operators rather than MR operators.
  7. We can also see more details about this join.
  8. It seems that Postgres sets up a good example for hive to improve on.
  9. So this is the basic introduction of the new style hive explain plan
  10. This query involves 3 tables.
  11. splits
  12. For Query 87
  13. The next version of Hive explain plan will look like this. We highlight the changes that we want to make in red color.
  14. We thank the audience for your attention.
  15. I am ready to take any questions.