Major advancements in Apache Hive towards full support of SQL compliance
© Hortonworks Inc. 2011 – 2015. All Rights Reserved
Pengcheng Xiong and Ashutosh Chauhan
Hortonworks Inc., Apache Hive Community
{pxiong,ashutosh}@hortonworks.com
Motivation
• Standards provide safety and reliability, interoperability, business benefits, and consumer choice.
• SQL was one of the first commercial languages for the relational model.
• The SQL:2011 standard.
• Different DB vendors have different SQL implementations.
• Hive provides the necessary SQL abstraction to integrate SQL-like queries (HiveQL) with the underlying Java API.
What we already have before Hive 1.2.0
• SQL operators (a combined example follows this list)
• Projection, Selection, Aggregation, Join, Union all, Order by, Windowing, UDF, UDAF, UDTF
• Views
• CREATE VIEW v AS SELECT * FROM src;
• SQL Standard-based Authorization
• GRANT SELECT ON TABLE src_autho_test TO USER hive_test_user;
• Most subqueries
• SELECT state, net_payments FROM transfer_payments WHERE transfer_payments.year IN (SELECT year FROM us_census);
• Common Table Expressions
• WITH q1 AS (SELECT key from src where key = '5') SELECT * from q1;
• Insert/Update/Delete
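As a quick illustration of the operator support listed above, a minimal sketch that combines aggregation and windowing (the table sales and its columns are made up for the example):

  -- Aggregate per item, then rank the items by their totals with a window function
  SELECT item,
         SUM(amount) AS total,
         RANK() OVER (ORDER BY SUM(amount) DESC) AS rnk
  FROM sales
  GROUP BY item;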
Agenda
• SQL2011 keywords and reserved keywords
• Primary key and foreign key
• Set operations
• Summary
Motivation to reserve keywords
• Reduce ambiguity
• English: "Foreigners are hunting dogs" – it is unclear whether dogs are being hunted or the foreigners are being spoken of as dogs.
• Before 1.2.0, HiveQL did not reserve enough keywords, which caused many parser ambiguity problems:

  select key from a right join b on ...;

  Is "right" an identifier used as a table alias, or does it start a right outer join? There are 314 outstanding issues!
Reserved keywords and non-reserved ones
• Following SQL:2011, we made changes to the grammar/parser and reserved certain keywords. There is NO ambiguity now.
Some of the reserved keywords:
ALL, ALTER, ARRAY, AS, AUTHORIZATION, BETWEEN, BIGINT, BINARY, BOOLEAN, BOTH, BY, CONSTRAINT, CREATE, CUBE,
CURRENT_DATE, CURRENT_TIMESTAMP, CURSOR, DATE, DECIMAL, DELETE, DESCRIBE, DOUBLE, DROP, EXISTS, EXTERNAL,
FALSE, FETCH, FLOAT, FOR, FOREIGN, FULL, GRANT, GROUP, GROUPING, IMPORT, IN, INNER, INSERT, INT, INTERSECT,
INTO, IS, LATERAL, LEFT, LIKE, LOCAL, NONE, NULL, OF, ORDER, OUT, OUTER, PARTITION, PERCENT, PRECISION,
PRIMARY, PROCEDURE, RANGE, READS, REFERENCES, REGEXP, REVOKE, RIGHT, RLIKE, ROLLUP, ROW, ROWS, SET, SMALLINT,
TABLE, TIMESTAMP, TO, TRIGGER, TRUE, TRUNCATE, UNION, UPDATE, USER, USING, VALUES, WITH
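With these words now reserved, queries that used them as plain identifiers need backquotes; a minimal sketch (the table logins and its column names are made up for the example):

  -- `user` and `date` are reserved keywords after 1.2.0, so quote them when they are identifiers
  SELECT `user`, `date` FROM logins;
  -- The pre-1.2.0 behavior can be restored with a config flag, at the cost of the ambiguity described above
  SET hive.support.sql11.reserved.keywords=false;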
Agenda
• SQL2011 keywords and reserved keywords
• Primary key and foreign key
• Set operations
• Summary
Primary Key and Foreign Key
[Diagram: two tables, each with its own primary key; a foreign key in one table references the primary key of the other.]
Motivation to have Primary Key (PK) / Foreign Key (FK)
• General purpose in a database
  • Data cleanliness
  • Query optimization
• Current stats-based approach in Hive (see the ANALYZE sketch after this list)
  • We infer a PK from column and table info – necessary but not sufficient conditions:
    – Number of distinct values (NDV) >= #rows
    – Range (i.e., max – min + 1) = #rows
  • We infer an FK by range:
    – isWithin(fkRange, pkRange)
• Improvement
  • Directly define and retrieve the info.
  • Mainly use it for query optimization: cardinality estimation of join => join order => query execution time
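The stats-based inference above relies on up-to-date column statistics (NDV, min, max); a minimal sketch of gathering them, using the product table from a later slide as the example:

  -- Basic table statistics (row counts, sizes)
  ANALYZE TABLE product COMPUTE STATISTICS;
  -- Per-column statistics (NDV, min, max, null counts) that feed the PK/FK inference
  ANALYZE TABLE product COMPUTE STATISTICS FOR COLUMNS;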
Cardinality estimation of join A, B (general case)
• join cardinality = cardA * cardB * join selectivity
• join selectivity = 1 / max(ndvA_join_col, ndvB_join_col)
• The underlying assumption for this formula is the "principle of inclusion": each value of the smaller domain has a match in the larger domain.
[Diagram: Join of two filtered TableScans. Before the filters, both the PK column of A and the FK column of B range over 0–9. After the filters, the PK side keeps {0, 5, 8} and the FK side keeps {0, 1, 2, 2, 4, 4, 6, 8}. Estimated join cardinality: 3*8/6 = 4 rows; actual result: 2 rows, the matches (0,0) and (8,8).]
Cardinality estimation of join A, B (with PK-FK info.)
• join cardinality = cardB * join selectivity
• join selectivity = PKsel * FKscale
[Diagram: the same join, estimated with the PK-FK information. Estimated join cardinality: 8*0.3 = 2.4 rows; actual result: 2 rows.]
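A worked reading of the two estimates in the diagrams above (my arithmetic; PKsel is taken as 3/10, the fraction of the ten primary-key values that survive the filter, and FKscale as 1 here):

  General case:    join cardinality = cardA * cardB / max(ndvA, ndvB) = 3 * 8 / 6 = 4 estimated rows (actual: 2)
  With PK-FK info: join cardinality = cardB * PKsel * FKscale = 8 * 0.3 * 1 = 2.4 estimated rows (actual: 2)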
Notes about Primary Key and Foreign Key Usage
CREATE TABLE vendor (
  vendor_id INT,
  PRIMARY KEY (vendor_id) DISABLE NOVALIDATE);

CREATE TABLE product (
  product_id INT,
  product_vendor_id INT,
  PRIMARY KEY (product_id) DISABLE NOVALIDATE RELY,
  CONSTRAINT product_fk_1 FOREIGN KEY (product_vendor_id)
    REFERENCES vendor(vendor_id) DISABLE NOVALIDATE RELY);
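The same constraints can also be declared after table creation; a minimal sketch based on the ALTER TABLE ... ADD/DROP CONSTRAINT forms in the Hive DDL documentation (the constraint name pk_vendor is made up, and the tables are assumed to have been created without constraints):

  -- DISABLE NOVALIDATE: the constraint is neither enforced on writes nor validated against existing data
  ALTER TABLE vendor ADD CONSTRAINT pk_vendor PRIMARY KEY (vendor_id) DISABLE NOVALIDATE;
  -- RELY tells the optimizer it may trust the constraint for cardinality estimation
  ALTER TABLE product ADD CONSTRAINT product_fk_1 FOREIGN KEY (product_vendor_id)
    REFERENCES vendor (vendor_id) DISABLE NOVALIDATE RELY;
  -- Constraints can be dropped by name
  ALTER TABLE product DROP CONSTRAINT product_fk_1;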
Agenda
• SQL2011 keywords and reserved keywords
• Primary key and foreign key
• Set operations
• Summary
Motivation for the INTERSECT / EXCEPT set operations
• They are two important set operations and part of SQL compliance; lots of analytic queries use them.
  • For example, they are heavily used in the TPC-DS benchmark.
• They are complicated to implement.
  • A new operator would involve a huge change; a rewrite is better.
  • More difficult than the Union (distinct) rewrite:
    – Need to consider performance (optimization rules should still apply)
    – Need to consider scalability (an intersect has multiple branches)
    – Easy to make mistakes (e.g., dealing with NULLs)
Set Operations – UNION (ALL)
In SQL the UNION clause combines the results of two SQL queries into a single table of all matching rows. Any duplicate records are automatically removed unless UNION ALL is used.

ta = {1,2,2,3,3,NULL,NULL}
tb = {1,1,2,4,NULL,NULL,NULL}

ta UNION ALL tb = {1,2,2,3,3,NULL,NULL,1,1,2,4,NULL,NULL,NULL}
ta UNION (DISTINCT) tb = {1,2,3,NULL,4}
Set Operations – UNION (DISTINCT) - Design
• Rewrite: Union All - GB (on all attributes)
Example: R1 Union Distinct R2
  R3 = R1 Union All R2
  return GB(R3.x1, R3.x2, …, R3.xn)
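Spelled out in HiveQL, the rewrite amounts to the following (a sketch, assuming R1 and R2 each expose columns x1 and x2; any number of columns works the same way):

  -- R1 UNION [DISTINCT] R2 becomes a UNION ALL followed by a GROUP BY on all output columns
  SELECT x1, x2
  FROM (SELECT x1, x2 FROM R1
        UNION ALL
        SELECT x1, x2 FROM R2) R3
  GROUP BY x1, x2;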
Set Operations – UNION (DISTINCT) - Explain
explain
select * from ta
union
select * from tb;
Reducer 3
File Output Operator [FS_11]
Group By Operator [GBY_9] (rows=7 width=1)
Output:["_col0"],keys:KEY._col0
<-Union 2 [SIMPLE_EDGE]
<-Map 1 [CONTAINS]
Reduce Output Operator [RS_8]
PartitionCols:_col0
Group By Operator [GBY_7] (rows=14 width=1)
Output:["_col0"],keys:_col0
Select Operator [SEL_1] (rows=7 width=1)
Output:["_col0"]
TableScan [TS_0] (rows=7 width=1)
Output:["col"]
<-Map 4 [CONTAINS]
Reduce Output Operator [RS_8]
PartitionCols:_col0
Group By Operator [GBY_7] (rows=14 width=1)
Output:["_col0"],keys:_col0
Select Operator [SEL_3] (rows=7 width=1)
Output:["_col0"]
TableScan [TS_2] (rows=7 width=1)
Output:["col"]
Set Operations – INTERSECT (ALL)
The SQL INTERSECT operator takes the results of two queries and returns only the rows that appear in both result sets. INTERSECT (DISTINCT) removes duplicate rows from the final result set, while INTERSECT ALL keeps them.

[Venn diagram: ta = {1,2,2,3,3,NULL,NULL}, tb = {1,1,2,4,NULL,NULL,NULL}; only in ta: {2,3,3}, only in tb: {1,4,NULL}, in both: {1,2,NULL,NULL}]

ta INTERSECT ALL tb = {1,2,NULL,NULL}
ta INTERSECT (DISTINCT) tb = {1,2,NULL}
Set Operations – UDTF (replicate_rows)
• UDTF: user defined table function
• Difference among UDF, UDAF and UDTF?
• UDF (1 to 1), e.g., abs()
• UDAF (many to 1), e.g., sum()
• UDTF (1 to many), e.g., replicate_rows()
• The UDTF replicate_rows duplicates each row based on the value of the 1st column (a row with a non-positive count produces no output rows).

Input:
  2, “first row”, 3.14
  -1, “2nd row”, 5.34
  3, “final row”, 19112.0

After replicate_rows:
  2, “first row”, 3.14
  2, “first row”, 3.14
  3, “final row”, 19112.0
  3, “final row”, 19112.0
  3, “final row”, 19112.0
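Although the compiler inserts it automatically in the rewrites that follow, replicate_rows behaves like any other UDTF; a minimal sketch, assuming the function is exposed under that name and a table t(n BIGINT, msg STRING, price DOUBLE) holds the three rows above:

  -- Emits each row n times; rows with a non-positive n (like the -1 row) produce no output
  SELECT replicate_rows(n, msg, price) AS (n, msg, price)
  FROM t;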
Set Operations – INTERSECT - Design
• Rewrite: (GB-Union All-GB)-GB-UDTF (on all attributes)
Example: R1 Intersect All R2
  R3 = GB(R1 on all attributes + count() as c) union all GB(R2 on all attributes + count() as c)
  R4 = GB(R3 on all attributes + count(c) as cnt + min(c) as m)
  R5 = Fil(cnt == #branches)
  if INTERSECT ALL:
    R6 = UDTF(R5), which explodes the tuples based on min(c)
  else:
    R6 = Proj(R5 on all attributes)
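In HiveQL terms, the rewrite for a two-branch INTERSECT looks roughly like this (a sketch over the single-column tables ta and tb from the running example; the compiler builds the equivalent plan shown on the EXPLAIN slides):

  SELECT col
  FROM (
    SELECT replicate_rows(m, col) AS (m, col)     -- INTERSECT ALL: emit min(c) copies per value
    FROM (
      SELECT col, min(c) AS m
      FROM (
        SELECT col, count(1) AS c FROM ta GROUP BY col
        UNION ALL
        SELECT col, count(1) AS c FROM tb GROUP BY col
      ) u
      GROUP BY col
      HAVING count(c) = 2                         -- the value must appear in both branches
    ) g
  ) r;
  -- For INTERSECT DISTINCT, the UDTF is unnecessary: select col from g directly.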
Set Operations – INTERSECT (ALL) - DataFlow
Input:
  R1.col: 1, 2, 2, 3, 3, NULL, NULL
  R2.col: 1, 1, 2, 4, NULL, NULL, NULL

GroupBy on each branch (col, count() as c):
  R1: (1, 1), (2, 2), (3, 2), (NULL, 2)
  R2: (1, 2), (2, 1), (4, 1), (NULL, 3)

Union all (col, c):
  (1, 1), (2, 2), (3, 2), (NULL, 2), (1, 2), (2, 1), (4, 1), (NULL, 3)

GroupBy (col, count(c), min(c)):
  (1, 2, 1), (2, 2, 1), (3, 1, 2), (NULL, 2, 2), (4, 1, 1)

Filter count(c) = #branches (= 2):
  (1, 2, 1), (2, 2, 1), (NULL, 2, 2)

UDTF replicate_rows based on min(c):
  (1, 2, 1), (2, 2, 1), (NULL, 2, 2), (NULL, 2, 2)

Project col:
  1, 2, NULL, NULL
Set Operations – INTERSECT (DISTINCT) - DataFlow
Input:
  R1.col: 1, 2, 2, 3, 3, NULL, NULL
  R2.col: 1, 1, 2, 4, NULL, NULL, NULL

GroupBy on each branch (col, count() as c):
  R1: (1, 1), (2, 2), (3, 2), (NULL, 2)
  R2: (1, 2), (2, 1), (4, 1), (NULL, 3)

Union all (col, c):
  (1, 1), (2, 2), (3, 2), (NULL, 2), (1, 2), (2, 1), (4, 1), (NULL, 3)

GroupBy (col, count(c)):
  (1, 2), (2, 2), (3, 1), (NULL, 2), (4, 1)

Filter count(c) = #branches (= 2), then Project col:
  1, 2, NULL
Set Operations – INTERSECT ALL - Explain
explain
select * from ta
intersect all
select * from tb;
Reducer 4
File Output Operator [FS_25]
Select Operator [SEL_24] (rows=1 width=1)
Output:["_col0"]
UDTF Operator [UDTF_23] (rows=1 width=1)
function name:UDTFReplicateRows
Select Operator [SEL_21] (rows=1 width=1)
Output:["_col0","_col1"]
Filter Operator [FIL_20] (rows=1 width=1)
predicate:(_col2 = 2)
Group By Operator [GBY_19] (rows=3 width=1)
Output:["_col0","_col1","_col2"],aggregations:["min(VALUE._col0)","count(VALUE._col1)"],keys:KEY._col0
<-Union 3 [SIMPLE_EDGE]
<-Reducer 2 [CONTAINS]
Reduce Output Operator [RS_18]
PartitionCols:_col0
Group By Operator [GBY_17] (rows=6 width=1)
Output:["_col0","_col1","_col2"],aggregations:["min(_col1)","count(_col1)"],keys:_col0
Group By Operator [GBY_5] (rows=3 width=1)
Output:["_col0","_col1"],aggregations:["count(VALUE._col0)"],keys:KEY._col0
<-Map 1 [SIMPLE_EDGE]
SHUFFLE [RS_4]
PartitionCols:_col0
Group By Operator [GBY_3] (rows=7 width=1)
Output:["_col0","_col1"],aggregations:["count(1)"],keys:_col0
Select Operator [SEL_1] (rows=7 width=1)
Output:["_col0"]
TableScan [TS_0] (rows=7 width=1)
default@ta,ta,Tbl:COMPLETE,Col:NONE,Output:["col"]
<-Reducer 6 [CONTAINS]
Reduce Output Operator [RS_18]
PartitionCols:_col0
Group By Operator [GBY_17] (rows=6 width=1)
Output:["_col0","_col1","_col2"],aggregations:["min(_col1)","count(_col1)"],keys:_col0
Group By Operator [GBY_12] (rows=3 width=1)
Output:["_col0","_col1"],aggregations:["count(VALUE._col0)"],keys:KEY._col0
<-Map 5 [SIMPLE_EDGE]
SHUFFLE [RS_11]
PartitionCols:_col0
Group By Operator [GBY_10] (rows=7 width=1)
Output:["_col0","_col1"],aggregations:["count(1)"],keys:_col0
Select Operator [SEL_8] (rows=7 width=1)
Output:["_col0"]
TableScan [TS_7] (rows=7 width=1)
default@tb,tb,Tbl:COMPLETE,Col:NONE,Output:["col"]
Set Operations – INTERSECT DISTINCT - Explain
explain
select * from ta
intersect
select * from tb;
Reducer 4
File Output Operator [FS_22]
Select Operator [SEL_21] (rows=1 width=1)
Output:["_col0"]
Filter Operator [FIL_20] (rows=1 width=1)
predicate:(_col1 = 2)
Group By Operator [GBY_19] (rows=3 width=1)
Output:["_col0","_col1"],aggregations:["count(VALUE._col0)"],keys:KEY._col0
<-Union 3 [SIMPLE_EDGE]
<-Reducer 2 [CONTAINS]
Reduce Output Operator [RS_18]
PartitionCols:_col0
Group By Operator [GBY_17] (rows=6 width=1)
Output:["_col0","_col1"],aggregations:["count(_col1)"],keys:_col0
Group By Operator [GBY_5] (rows=3 width=1)
Output:["_col0","_col1"],aggregations:["count(VALUE._col0)"],keys:KEY._col0
<-Map 1 [SIMPLE_EDGE]
SHUFFLE [RS_4]
PartitionCols:_col0
Group By Operator [GBY_3] (rows=7 width=1)
Output:["_col0","_col1"],aggregations:["count(1)"],keys:_col0
Select Operator [SEL_1] (rows=7 width=1)
Output:["_col0"]
TableScan [TS_0] (rows=7 width=1)
default@ta,ta,Tbl:COMPLETE,Col:NONE,Output:["col"]
<-Reducer 6 [CONTAINS]
Reduce Output Operator [RS_18]
PartitionCols:_col0
Group By Operator [GBY_17] (rows=6 width=1)
Output:["_col0","_col1"],aggregations:["count(_col1)"],keys:_col0
Group By Operator [GBY_12] (rows=3 width=1)
Output:["_col0","_col1"],aggregations:["count(VALUE._col0)"],keys:KEY._col0
<-Map 5 [SIMPLE_EDGE]
SHUFFLE [RS_11]
PartitionCols:_col0
Group By Operator [GBY_10] (rows=7 width=1)
Output:["_col0","_col1"],aggregations:["count(1)"],keys:_col0
Select Operator [SEL_8] (rows=7 width=1)
Output:["_col0"]
TableScan [TS_7] (rows=7 width=1)
default@tb,tb,Tbl:COMPLETE,Col:NONE,Output:["col"]
Set Operations – EXCEPT (ALL)
The SQL EXCEPT operator takes the distinct rows of one query and returns the rows that do not appear in a second result set. The EXCEPT ALL operator does not remove duplicates.

[Venn diagram: ta = {1,2,2,3,3,NULL,NULL}, tb = {1,1,2,4,NULL,NULL,NULL}; only in ta: {2,3,3}, only in tb: {1,4,NULL}, in both: {1,2,NULL,NULL}]

ta EXCEPT ALL tb = {3,3,2}
ta EXCEPT (DISTINCT) tb = {3}
Set Operations – EXCEPT (ALL) - Design
• Rewrite: (GB-Union All-GB)-GB-UDTF (on all attributes)
Example: R1 Except All R2
  R1 introduces a virtual column VCol = 2, R2 introduces VCol = 1
  R3 = GB(R1 on all attributes + VCol + count(VCol) as c) union all GB(R2 on all attributes + VCol + count(VCol) as c)
  R4 = GB(R3 on all attributes + sum(c) as a + sum(VCol*c) as b)
  For each group, m + n = a and 2m + n = b, where m is the #rows in R1 and n is the #rows in R2; therefore m = b - a, n = 2a - b, and m - n = 2b - 3a.
  if EXCEPT (DISTINCT):
    R5 = Fil(b - a > 0 && 2a - b == 0); R6 = select only the attributes from R5
  else:
    R5 = UDTF(R4), which explodes the tuples based on 2b - 3a
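The analogous HiveQL sketch for the two-branch EXCEPT ALL case (again over the single-column tables ta and tb; the compiler builds the equivalent plan shown on the EXPLAIN slide):

  SELECT col
  FROM (
    SELECT replicate_rows(2 * b - 3 * a, col) AS (times, col)  -- m - n = 2b - 3a copies; non-positive emits nothing
    FROM (
      SELECT col, sum(c) AS a, sum(v * c) AS b
      FROM (
        SELECT col, 2 AS v, c FROM (SELECT col, count(1) AS c FROM ta GROUP BY col) r1  -- branch R1, VCol = 2
        UNION ALL
        SELECT col, 1 AS v, c FROM (SELECT col, count(1) AS c FROM tb GROUP BY col) r2  -- branch R2, VCol = 1
      ) u
      GROUP BY col
    ) g
  ) r;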
Set Operations – EXCEPT (ALL) - DataFlow
Input (each branch tagged with a virtual column VCol):
  R1 (col, VCol): (1, 2), (2, 2), (2, 2), (3, 2), (3, 2), (NULL, 2), (NULL, 2)
  R2 (col, VCol): (1, 1), (1, 1), (2, 1), (4, 1), (NULL, 1), (NULL, 1), (NULL, 1)

GroupBy on each branch (col, VCol, count(VCol) as c):
  R1: (1, 2, 1), (2, 2, 2), (3, 2, 2), (NULL, 2, 2)
  R2: (1, 1, 2), (2, 1, 1), (4, 1, 1), (NULL, 1, 3)

Union all (col, VCol, c):
  (1, 2, 1), (2, 2, 2), (3, 2, 2), (NULL, 2, 2), (1, 1, 2), (2, 1, 1), (4, 1, 1), (NULL, 1, 3)

GroupBy (col, sum(c) as a, sum(VCol*c) as b):
  (1, 3, 4), (2, 3, 5), (3, 2, 4), (NULL, 5, 7), (4, 1, 1)

Project (col, 2b-3a):
  (1, -1), (2, 1), (3, 2), (NULL, -1), (4, -1)

UDTF replicate_rows based on 2b-3a (non-positive counts emit no rows):
  (2, 1), (3, 2), (3, 2)  =>  result col: 2, 3, 3
Set Operations – EXCEPT (ALL) - Explain
explain
select * from ta
except all
select * from tb;
Reducer 4
File Output Operator [FS_24]
Select Operator [SEL_23] (rows=3 width=1)
Output:["_col0"]
UDTF Operator [UDTF_22] (rows=3 width=1)
function name:UDTFReplicateRows
Select Operator [SEL_20] (rows=3 width=1)
Output:["_col0","_col1"]
Group By Operator [GBY_19] (rows=3 width=1)
Output:["_col0","_col1","_col2"],aggregations:["sum(VALUE._col0)","sum(VALUE._col1)"],keys:KEY._col0
<-Union 3 [SIMPLE_EDGE]
<-Reducer 2 [CONTAINS]
Reduce Output Operator [RS_18]
PartitionCols:_col0
Group By Operator [GBY_17] (rows=6 width=1)
Output:["_col0","_col1","_col2"],aggregations:["sum(_col3)","sum(_col2)"],keys:_col0
Select Operator [SEL_15] (rows=6 width=1)
Output:["_col0","_col3","_col2"]
Select Operator [SEL_6] (rows=3 width=1)
Output:["_col0","_col1","_col2"]
Group By Operator [GBY_5] (rows=3 width=1)
Output:["_col0","_col1"],aggregations:["count(VALUE._col0)"],keys:KEY._col0
<-Map 1 [SIMPLE_EDGE]
SHUFFLE [RS_4]
PartitionCols:_col0
Group By Operator [GBY_3] (rows=7 width=1)
Output:["_col0","_col1"],aggregations:["count(2)"],keys:_col0
Select Operator [SEL_1] (rows=7 width=1)
Output:["_col0"]
TableScan [TS_0] (rows=7 width=1)
default@ta,ta,Tbl:COMPLETE,Col:NONE,Output:["col"]
<-Reducer 6 [CONTAINS]
Reduce Output Operator [RS_18]
PartitionCols:_col0
Group By Operator [GBY_17] (rows=6 width=1)
Output:["_col0","_col1","_col2"],aggregations:["sum(_col3)","sum(_col2)"],keys:_col0
Select Operator [SEL_15] (rows=6 width=1)
Output:["_col0","_col3","_col2"]
Select Operator [SEL_13] (rows=3 width=1)
Output:["_col0","_col1","_col2"]
Group By Operator [GBY_12] (rows=3 width=1)
Output:["_col0","_col1"],aggregations:["count(VALUE._col0)"],keys:KEY._col0
<-Map 5 [SIMPLE_EDGE]
SHUFFLE [RS_11]
PartitionCols:_col0
Group By Operator [GBY_10] (rows=7 width=1)
Output:["_col0","_col1"],aggregations:["count(1)"],keys:_col0
Select Operator [SEL_8] (rows=7 width=1)
Output:["_col0"]
TableScan [TS_7] (rows=7 width=1)
default@tb,tb,Tbl:COMPLETE,Col:NONE,Output:["col"]
Notes about Set Operations
• Support Union (distinct/all), Intersect (distinct/all), Except (distinct/all).
• Also support precedence using parentheses:
  • select * from ta union (select * from tb intersect all select * from tc)
• The design is purely based on query rewriting and does not introduce new operators:
  • changes only in the query parser and compiler; no change to the optimizer or the executors (MR, Tez, LLAP, Spark, etc.)
• The implementation only uses GB, UNION and UDTF:
  • easy to maintain
  • better performance and scalability than join-based rewrites
Agenda
• SQL2011 keywords and reserved keywords
• Primary key and foreign key
• Set operations
• Summary
Summary for Hive towards full SQL compliance
• 1.0.0 (Feb 4, 2015): SQL operators, Views, SQL Standard-based Authorization, Subqueries, Common Table Expressions, Insert/Update/Delete
• 1.2.0 (May 18, 2015): More reserved keywords, Union Distinct
• 2.1.0 (Jun 20, 2016): Primary/Foreign key
• 2.2.0 (TBD): Set Operations – Intersect (Distinct/All), Except (Distinct/All), Minus (Distinct/All)
Thank you! Questions?


Editor's Notes

  1. Major advancements in Apache Hive towards full support of SQL compliance Although Hive has become the de facto standard for SQL queries in Hadoop, many potential users hesitate to adopt it due to its lack of full support of SQL compliance. In this talk, we will give an overview of the major advancements in Apache Hive towards full support of SQL compliance that we have already achieved in recent years. More specifically, we will talk about the keywords and quoted identifiers, transactions and insert/update/delete, common table expressions and subqueries, and newly added set operators like union, intersect, except and minus. We will not only show concrete examples of how to use them, but also share the experience and lessons that we have gained during the development process. Finally, we will show the list of advancements that we are planning to make in the near future. Speaker(s): Pengcheng Xiong, Hortonworks, Inc. Pengcheng Xiong has extensive research and development experience in centralized/distributed RDBMS internals, Hive, etc. He has dozens of publications in database conferences, e.g., SIGMOD, VLDB and ICDE. He is now working on the next-generation Hive optimizer and he serves and contributes as an Apache Hive PMC member. He holds a PhD from Georgia Tech. Ashutosh Chauhan, Hortonworks, Inc. Ashutosh has been working in the Hadoop ecosystem for more than seven years, spending most of his time on higher-level languages on Hadoop. He is the VP of Apache Hive and a PMC member of Apache Pig.
  2. If we would like to enjoy the benefits of the standard, we have to conform to the standard as much as we can
  3. That is the work that we have already done towards SQL compliance.
  4. 3 items in the sequence, from the easiest one to the most complicated one.
  5. Difficulty ordered from easiest to hardest, to adjust to different audiences.
  6. In a computer language, a reserved word (also known as a reserved identifier) is a word that cannot be used as an identifier. Actually, almost every language has ambiguity if it is not well defined; HiveQL has a similar problem. Chinese has similar ambiguous sentences: “They are two high schools' students” (it can mean students from two different high schools, or two students from the same high school); “I want stir-fried pork shreds” (“stir-fried pork shreds” can be read as a verb-object phrase, i.e., to stir-fry the pork shreds, or as the name of a dish, i.e., a noun); “bit the hunter's dog to death” (this phrase can be read as a verb-object structure, “bit [the hunter's dog] to death”, or as a modifier structure, “the dog that bit the hunter to death”).
  7. Difficulty ordered from easiest to hardest, to adjust to different audiences.
  8. A primary key is a special relational database table column (or combination of columns) designated to uniquely identify all table records. A primary key's main features are: It must contain a unique value for each row of data. It cannot contain null values.
  9. Enforcement In order to use a constraint for enforcement, the constraint must be in the ENABLE state. An enabled constraint ensures that all data modifications upon a given table (or tables) satisfy the conditions of the constraints. Data modification operations which produce data that violates the constraint fail with a constraint violation error. Validation To use a constraint for validation, the constraint must be in the VALIDATE state. If the constraint is validated, then all data that currently resides in the table satisfies the constraint. Note that validation is independent of enforcement. Although the typical constraint in an operational system is both enabled and validated, any constraint could be validated but not enabled or vice versa (enabled but not validated). These latter two cases are useful for data warehouses. Belief In some cases, you will know that the conditions for a given constraint are true, so you do not need to validate or enforce the constraint. However, you may wish for the constraint to be present anyway to improve query optimization and performance. When you use a constraint in this way, it is called a belief or RELY constraint, and the constraint must be in the RELY state. The RELY state provides you with a mechanism for telling Oracle that a given constraint is believed to be true. Note that the RELY state only affects constraints that have not been validated.
  10. if (fkRangeDelta > 0 && pkRangeDelta > 0 && fkRangeDelta < pkRangeDelta) { scaledSelectivity = (float) pkRangeDelta / (float) fkRangeDelta; }
  11. Difficulty ordered from easiest to hardest, to adjust to different audiences.
  12. Intersect Distinct Rewrite 1: Inner Join (Null Safe) - GB - Proj Example: R1 Intersect Distinct R2 (R1 Inner Join R2 on R1.x1<=>R2.x1 and R1.x2<=>R2.x2.. ) - Proj(R1.x1, R1.x2… R1.xn)-GB (R1.x1, R1.x2… R1.xn) Rewrite 2: GB-Semi Join (NULL Safe) - Proj Example: R1 Intersect Distinct R2 (GB(R1 on R1.x1,R1.x2..) Semi Join R2 on R1.x1<=>R2.x1 and R1.x2=R2.x2.. ) - Proj(R1.x1, R1.x2… R1.xn)
  13. So the problem is how to get these results?
  14. We thank the audience for your attention.
  15. Difficulty ordered from easiest to hardest, to adjust to different audiences.
  16. I am ready to take any questions.