Major advancements in Apache Hive towards full support of SQL compliance
© Hortonworks Inc. 2011 – 2015. All Rights Reserved
Pengcheng Xiong and Ashutosh Chauhan
Hortonworks Inc., Apache Hive Community
{pxiong,ashutosh}@hortonworks.com
Motivation
• Standards provide safety and reliability, interoperability, business benefits, and consumer choice.
• SQL was one of the first commercial languages for the relational model.
• The SQL:2011 standard.
• Different DB vendors have different SQL implementations.
• Hive provides the necessary SQL abstraction to integrate SQL-like queries (HiveQL) with the underlying Java API.
What we already have before Hive 1.2.0
• SQL operators (a combined example follows this list)
• Projection, Selection, Aggregation, Join, Union all, Order by, Windowing, UDF, UDAF, UDTF
• Views
• CREATE VIEW v AS SELECT * FROM src;
• SQL Standard-based Authorization
• GRANT SELECT ON TABLE src_autho_test TO USER hive_test_user;
• Most subqueries
• SELECT state, net_payments FROM transfer_payments WHERE transfer_payments.year IN (SELECT year FROM us_census);
• Common Table Expressions
• WITH q1 AS (SELECT key from src where key = '5') SELECT * from q1;
• Insert/Update/Delete
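As a quick illustration of the operator support listed above, a minimal sketch that combines aggregation and windowing (the table sales and its columns are made up for the example):

  -- Aggregate per item, then rank the items by their totals with a window function
  SELECT item,
         SUM(amount) AS total,
         RANK() OVER (ORDER BY SUM(amount) DESC) AS rnk
  FROM sales
  GROUP BY item;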
Agenda
• SQL2011 keywords and reserved keywords
• Primary key and foreign key
• Set operations
• Summary
Motivation to reserve keywords
• Reduce ambiguity
• English: "Foreigners are hunting dogs" – it is unclear whether dogs are being hunted or the foreigners are being spoken of as dogs.
• Before 1.2.0, HiveQL did not reserve enough keywords, which caused many parser ambiguity problems:

  select key from a right join b on ...;

  Is "right" an identifier used as a table alias, or does it start a right outer join? There are 314 outstanding issues!
Reserved keywords and non-reserved ones
• Following SQL:2011, we made changes to the grammar/parser and reserved certain keywords. There is NO ambiguity now.
Some of the reserved keywords:
ALL, ALTER, ARRAY, AS, AUTHORIZATION, BETWEEN, BIGINT, BINARY, BOOLEAN, BOTH, BY, CONSTRAINT, CREATE, CUBE,
CURRENT_DATE, CURRENT_TIMESTAMP, CURSOR, DATE, DECIMAL, DELETE, DESCRIBE, DOUBLE, DROP, EXISTS, EXTERNAL,
FALSE, FETCH, FLOAT, FOR, FOREIGN, FULL, GRANT, GROUP, GROUPING, IMPORT, IN, INNER, INSERT, INT, INTERSECT,
INTO, IS, LATERAL, LEFT, LIKE, LOCAL, NONE, NULL, OF, ORDER, OUT, OUTER, PARTITION, PERCENT, PRECISION,
PRIMARY, PROCEDURE, RANGE, READS, REFERENCES, REGEXP, REVOKE, RIGHT, RLIKE, ROLLUP, ROW, ROWS, SET, SMALLINT,
TABLE, TIMESTAMP, TO, TRIGGER, TRUE, TRUNCATE, UNION, UPDATE, USER, USING, VALUES, WITH
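With these words now reserved, queries that used them as plain identifiers need backquotes; a minimal sketch (the table logins and its column names are made up for the example):

  -- `user` and `date` are reserved keywords after 1.2.0, so quote them when they are identifiers
  SELECT `user`, `date` FROM logins;
  -- The pre-1.2.0 behavior can be restored with a config flag, at the cost of the ambiguity described above
  SET hive.support.sql11.reserved.keywords=false;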
Agenda
• SQL2011 keywords and reserved keywords
• Primary key and foreign key
• Set operations
• Summary
Primary Key and Foreign Key
[Diagram: two tables, each with its own primary key; a foreign key in one table references the primary key of the other.]
Motivation to have Primary Key (PK) / Foreign Key (FK)
• General purpose in a database
  • Data cleanliness
  • Query optimization
• Current stats-based approach in Hive (see the ANALYZE sketch after this list)
  • We infer a PK from column and table info – necessary but not sufficient conditions:
    – Number of distinct values (NDV) >= #rows
    – Range (i.e., max – min + 1) = #rows
  • We infer an FK by range:
    – isWithin(fkRange, pkRange)
• Improvement
  • Directly define and retrieve the info.
  • Mainly use it for query optimization: cardinality estimation of join => join order => query execution time
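The stats-based inference above relies on up-to-date column statistics (NDV, min, max); a minimal sketch of gathering them, using the product table from a later slide as the example:

  -- Basic table statistics (row counts, sizes)
  ANALYZE TABLE product COMPUTE STATISTICS;
  -- Per-column statistics (NDV, min, max, null counts) that feed the PK/FK inference
  ANALYZE TABLE product COMPUTE STATISTICS FOR COLUMNS;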
Cardinality estimation of join A, B (general case)
• join cardinality = cardA * cardB * join selectivity
• join selectivity = 1 / max(ndvA_join_col, ndvB_join_col)
• The underlying assumption for this formula is the "principle of inclusion": each value of the smaller domain has a match in the larger domain.
[Diagram: Join of two filtered TableScans. Before the filters, both the PK column of A and the FK column of B range over 0–9. After the filters, the PK side keeps {0, 5, 8} and the FK side keeps {0, 1, 2, 2, 4, 4, 6, 8}. Estimated join cardinality: 3*8/6 = 4 rows; actual result: 2 rows, the matches (0,0) and (8,8).]
Cardinality estimation of join A, B (with PK-FK info.)
• join cardinality = cardB * join selectivity
• join selectivity = PKsel * FKscale
[Diagram: the same join, estimated with the PK-FK information. Estimated join cardinality: 8*0.3 = 2.4 rows; actual result: 2 rows.]
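A worked reading of the two estimates in the diagrams above (my arithmetic; PKsel is taken as 3/10, the fraction of the ten primary-key values that survive the filter, and FKscale as 1 here):

  General case:    join cardinality = cardA * cardB / max(ndvA, ndvB) = 3 * 8 / 6 = 4 estimated rows (actual: 2)
  With PK-FK info: join cardinality = cardB * PKsel * FKscale = 8 * 0.3 * 1 = 2.4 estimated rows (actual: 2)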
Notes about Primary Key and Foreign Key Usage
CREATE TABLE vendor (
  vendor_id INT,
  PRIMARY KEY (vendor_id) DISABLE NOVALIDATE);

CREATE TABLE product (
  product_id INT,
  product_vendor_id INT,
  PRIMARY KEY (product_id) DISABLE NOVALIDATE RELY,
  CONSTRAINT product_fk_1 FOREIGN KEY (product_vendor_id)
    REFERENCES vendor(vendor_id) DISABLE NOVALIDATE RELY);
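The same constraints can also be declared after table creation; a minimal sketch based on the ALTER TABLE ... ADD/DROP CONSTRAINT forms in the Hive DDL documentation (the constraint name pk_vendor is made up, and the tables are assumed to have been created without constraints):

  -- DISABLE NOVALIDATE: the constraint is neither enforced on writes nor validated against existing data
  ALTER TABLE vendor ADD CONSTRAINT pk_vendor PRIMARY KEY (vendor_id) DISABLE NOVALIDATE;
  -- RELY tells the optimizer it may trust the constraint for cardinality estimation
  ALTER TABLE product ADD CONSTRAINT product_fk_1 FOREIGN KEY (product_vendor_id)
    REFERENCES vendor (vendor_id) DISABLE NOVALIDATE RELY;
  -- Constraints can be dropped by name
  ALTER TABLE product DROP CONSTRAINT product_fk_1;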
Agenda
• SQL2011 keywords and reserved keywords
• Primary key and foreign key
• Set operations
• Summary
Motivation for the INTERSECT / EXCEPT set operations
• They are two important set operations and part of SQL compliance; lots of analytic queries use them.
  • For example, they are heavily used in the TPC-DS benchmark.
• They are complicated to implement.
  • A new operator would involve a huge change; a rewrite is better.
  • More difficult than the Union (distinct) rewrite:
    – Need to consider performance (optimization rules should still apply)
    – Need to consider scalability (an intersect has multiple branches)
    – Easy to make mistakes (e.g., dealing with NULLs)
Set Operations – UNION (ALL)
In SQL the UNION clause combines the results of two SQL queries into a single table of all matching rows. Any duplicate records are automatically removed unless UNION ALL is used.

ta = {1,2,2,3,3,NULL,NULL}
tb = {1,1,2,4,NULL,NULL,NULL}

ta UNION ALL tb = {1,2,2,3,3,NULL,NULL,1,1,2,4,NULL,NULL,NULL}
ta UNION (DISTINCT) tb = {1,2,3,NULL,4}
Set Operations – UNION (DISTINCT) - Design
• Rewrite: Union All - GB (on all attributes)
Example: R1 Union Distinct R2
  R3 = R1 Union All R2
  return GB(R3.x1, R3.x2, …, R3.xn)
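Spelled out in HiveQL, the rewrite amounts to the following (a sketch, assuming R1 and R2 each expose columns x1 and x2; any number of columns works the same way):

  -- R1 UNION [DISTINCT] R2 becomes a UNION ALL followed by a GROUP BY on all output columns
  SELECT x1, x2
  FROM (SELECT x1, x2 FROM R1
        UNION ALL
        SELECT x1, x2 FROM R2) R3
  GROUP BY x1, x2;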
Set Operations – UNION (DISTINCT) - Explain
explain
select * from ta
union
select * from tb;
Reducer 3
File Output Operator [FS_11]
Group By Operator [GBY_9] (rows=7 width=1)
Output:["_col0"],keys:KEY._col0
<-Union 2 [SIMPLE_EDGE]
<-Map 1 [CONTAINS]
Reduce Output Operator [RS_8]
PartitionCols:_col0
Group By Operator [GBY_7] (rows=14 width=1)
Output:["_col0"],keys:_col0
Select Operator [SEL_1] (rows=7 width=1)
Output:["_col0"]
TableScan [TS_0] (rows=7 width=1)
Output:["col"]
<-Map 4 [CONTAINS]
Reduce Output Operator [RS_8]
PartitionCols:_col0
Group By Operator [GBY_7] (rows=14 width=1)
Output:["_col0"],keys:_col0
Select Operator [SEL_3] (rows=7 width=1)
Output:["_col0"]
TableScan [TS_2] (rows=7 width=1)
Output:["col"]
Set Operations – INTERSECT (ALL)
The SQL INTERSECT operator takes the results of two queries and returns only the rows that appear in both result sets. INTERSECT (DISTINCT) removes duplicate rows from the final result set, while INTERSECT ALL keeps them.

[Venn diagram: ta = {1,2,2,3,3,NULL,NULL}, tb = {1,1,2,4,NULL,NULL,NULL}; only in ta: {2,3,3}, only in tb: {1,4,NULL}, in both: {1,2,NULL,NULL}]

ta INTERSECT ALL tb = {1,2,NULL,NULL}
ta INTERSECT (DISTINCT) tb = {1,2,NULL}
Set Operations – UDTF (replicate_rows)
• UDTF: user defined table function
• Difference among UDF, UDAF and UDTF?
• UDF (1 to 1), e.g., abs()
• UDAF (many to 1), e.g., sum()
• UDTF (1 to many), e.g., replicate_rows()
• The UDTF replicate_rows duplicates each row based on the value of the 1st column (a row with a non-positive count produces no output rows).

Input:
  2, “first row”, 3.14
  -1, “2nd row”, 5.34
  3, “final row”, 19112.0

After replicate_rows:
  2, “first row”, 3.14
  2, “first row”, 3.14
  3, “final row”, 19112.0
  3, “final row”, 19112.0
  3, “final row”, 19112.0
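Although the compiler inserts it automatically in the rewrites that follow, replicate_rows behaves like any other UDTF; a minimal sketch, assuming the function is exposed under that name and a table t(n BIGINT, msg STRING, price DOUBLE) holds the three rows above:

  -- Emits each row n times; rows with a non-positive n (like the -1 row) produce no output
  SELECT replicate_rows(n, msg, price) AS (n, msg, price)
  FROM t;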
Set Operations – INTERSECT - Design
• Rewrite: (GB-Union All-GB)-GB-UDTF (on all attributes)
Example: R1 Intersect All R2
  R3 = GB(R1 on all attributes + count() as c) union all GB(R2 on all attributes + count() as c)
  R4 = GB(R3 on all attributes + count(c) as cnt + min(c) as m)
  R5 = Fil(cnt == #branches)
  if INTERSECT ALL:
    R6 = UDTF(R5), which explodes the tuples based on min(c)
  else:
    R6 = Proj(R5 on all attributes)
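In HiveQL terms, the rewrite for a two-branch INTERSECT looks roughly like this (a sketch over the single-column tables ta and tb from the running example; the compiler builds the equivalent plan shown on the EXPLAIN slides):

  SELECT col
  FROM (
    SELECT replicate_rows(m, col) AS (m, col)     -- INTERSECT ALL: emit min(c) copies per value
    FROM (
      SELECT col, min(c) AS m
      FROM (
        SELECT col, count(1) AS c FROM ta GROUP BY col
        UNION ALL
        SELECT col, count(1) AS c FROM tb GROUP BY col
      ) u
      GROUP BY col
      HAVING count(c) = 2                         -- the value must appear in both branches
    ) g
  ) r;
  -- For INTERSECT DISTINCT, the UDTF is unnecessary: select col from g directly.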
Set Operations – INTERSECT (ALL) - DataFlow
Input:
  R1.col: 1, 2, 2, 3, 3, NULL, NULL
  R2.col: 1, 1, 2, 4, NULL, NULL, NULL

GroupBy on each branch (col, count() as c):
  R1: (1, 1), (2, 2), (3, 2), (NULL, 2)
  R2: (1, 2), (2, 1), (4, 1), (NULL, 3)

Union all (col, c):
  (1, 1), (2, 2), (3, 2), (NULL, 2), (1, 2), (2, 1), (4, 1), (NULL, 3)

GroupBy (col, count(c), min(c)):
  (1, 2, 1), (2, 2, 1), (3, 1, 2), (NULL, 2, 2), (4, 1, 1)

Filter count(c) = #branches (= 2):
  (1, 2, 1), (2, 2, 1), (NULL, 2, 2)

UDTF replicate_rows based on min(c):
  (1, 2, 1), (2, 2, 1), (NULL, 2, 2), (NULL, 2, 2)

Project col:
  1, 2, NULL, NULL
Set Operations – INTERSECT (DISTINCT) - DataFlow
Input:
  R1.col: 1, 2, 2, 3, 3, NULL, NULL
  R2.col: 1, 1, 2, 4, NULL, NULL, NULL

GroupBy on each branch (col, count() as c):
  R1: (1, 1), (2, 2), (3, 2), (NULL, 2)
  R2: (1, 2), (2, 1), (4, 1), (NULL, 3)

Union all (col, c):
  (1, 1), (2, 2), (3, 2), (NULL, 2), (1, 2), (2, 1), (4, 1), (NULL, 3)

GroupBy (col, count(c)):
  (1, 2), (2, 2), (3, 1), (NULL, 2), (4, 1)

Filter count(c) = #branches (= 2), then Project col:
  1, 2, NULL
Set Operations – INTERSECT ALL - Explain
explain
select * from ta
intersect all
select * from tb;
Reducer 4
File Output Operator [FS_25]
Select Operator [SEL_24] (rows=1 width=1)
Output:["_col0"]
UDTF Operator [UDTF_23] (rows=1 width=1)
function name:UDTFReplicateRows
Select Operator [SEL_21] (rows=1 width=1)
Output:["_col0","_col1"]
Filter Operator [FIL_20] (rows=1 width=1)
predicate:(_col2 = 2)
Group By Operator [GBY_19] (rows=3 width=1)
Output:["_col0","_col1","_col2"],aggregations:["min(VALUE._col0)","count(VALUE._col1)"],keys:KEY._col0
<-Union 3 [SIMPLE_EDGE]
<-Reducer 2 [CONTAINS]
Reduce Output Operator [RS_18]
PartitionCols:_col0
Group By Operator [GBY_17] (rows=6 width=1)
Output:["_col0","_col1","_col2"],aggregations:["min(_col1)","count(_col1)"],keys:_col0
Group By Operator [GBY_5] (rows=3 width=1)
Output:["_col0","_col1"],aggregations:["count(VALUE._col0)"],keys:KEY._col0
<-Map 1 [SIMPLE_EDGE]
SHUFFLE [RS_4]
PartitionCols:_col0
Group By Operator [GBY_3] (rows=7 width=1)
Output:["_col0","_col1"],aggregations:["count(1)"],keys:_col0
Select Operator [SEL_1] (rows=7 width=1)
Output:["_col0"]
TableScan [TS_0] (rows=7 width=1)
default@ta,ta,Tbl:COMPLETE,Col:NONE,Output:["col"]
<-Reducer 6 [CONTAINS]
Reduce Output Operator [RS_18]
PartitionCols:_col0
Group By Operator [GBY_17] (rows=6 width=1)
Output:["_col0","_col1","_col2"],aggregations:["min(_col1)","count(_col1)"],keys:_col0
Group By Operator [GBY_12] (rows=3 width=1)
Output:["_col0","_col1"],aggregations:["count(VALUE._col0)"],keys:KEY._col0
<-Map 5 [SIMPLE_EDGE]
SHUFFLE [RS_11]
PartitionCols:_col0
Group By Operator [GBY_10] (rows=7 width=1)
Output:["_col0","_col1"],aggregations:["count(1)"],keys:_col0
Select Operator [SEL_8] (rows=7 width=1)
Output:["_col0"]
TableScan [TS_7] (rows=7 width=1)
default@tb,tb,Tbl:COMPLETE,Col:NONE,Output:["col"]
Set Operations – INTERSECT DISTINCT - Explain
explain
select * from ta
intersect
select * from tb;
Reducer 4
File Output Operator [FS_22]
Select Operator [SEL_21] (rows=1 width=1)
Output:["_col0"]
Filter Operator [FIL_20] (rows=1 width=1)
predicate:(_col1 = 2)
Group By Operator [GBY_19] (rows=3 width=1)
Output:["_col0","_col1"],aggregations:["count(VALUE._col0)"],keys:KEY._col0
<-Union 3 [SIMPLE_EDGE]
<-Reducer 2 [CONTAINS]
Reduce Output Operator [RS_18]
PartitionCols:_col0
Group By Operator [GBY_17] (rows=6 width=1)
Output:["_col0","_col1"],aggregations:["count(_col1)"],keys:_col0
Group By Operator [GBY_5] (rows=3 width=1)
Output:["_col0","_col1"],aggregations:["count(VALUE._col0)"],keys:KEY._col0
<-Map 1 [SIMPLE_EDGE]
SHUFFLE [RS_4]
PartitionCols:_col0
Group By Operator [GBY_3] (rows=7 width=1)
Output:["_col0","_col1"],aggregations:["count(1)"],keys:_col0
Select Operator [SEL_1] (rows=7 width=1)
Output:["_col0"]
TableScan [TS_0] (rows=7 width=1)
default@ta,ta,Tbl:COMPLETE,Col:NONE,Output:["col"]
<-Reducer 6 [CONTAINS]
Reduce Output Operator [RS_18]
PartitionCols:_col0
Group By Operator [GBY_17] (rows=6 width=1)
Output:["_col0","_col1"],aggregations:["count(_col1)"],keys:_col0
Group By Operator [GBY_12] (rows=3 width=1)
Output:["_col0","_col1"],aggregations:["count(VALUE._col0)"],keys:KEY._col0
<-Map 5 [SIMPLE_EDGE]
SHUFFLE [RS_11]
PartitionCols:_col0
Group By Operator [GBY_10] (rows=7 width=1)
Output:["_col0","_col1"],aggregations:["count(1)"],keys:_col0
Select Operator [SEL_8] (rows=7 width=1)
Output:["_col0"]
TableScan [TS_7] (rows=7 width=1)
default@tb,tb,Tbl:COMPLETE,Col:NONE,Output:["col"]
Set Operations – EXCEPT (ALL)
The SQL EXCEPT operator takes the distinct rows of one query and returns the rows that do not appear in a second result set. The EXCEPT ALL operator does not remove duplicates.

[Venn diagram: ta = {1,2,2,3,3,NULL,NULL}, tb = {1,1,2,4,NULL,NULL,NULL}; only in ta: {2,3,3}, only in tb: {1,4,NULL}, in both: {1,2,NULL,NULL}]

ta EXCEPT ALL tb = {3,3,2}
ta EXCEPT (DISTINCT) tb = {3}
Set Operations – EXCEPT (ALL) - Design
• Rewrite: (GB-Union All-GB)-GB-UDTF (on all attributes)
Example: R1 Except All R2
  R1 introduces a virtual column VCol = 2, R2 introduces VCol = 1
  R3 = GB(R1 on all attributes + VCol + count(VCol) as c) union all GB(R2 on all attributes + VCol + count(VCol) as c)
  R4 = GB(R3 on all attributes + sum(c) as a + sum(VCol*c) as b)
  For each group, m + n = a and 2m + n = b, where m is the #rows in R1 and n is the #rows in R2; therefore m = b - a, n = 2a - b, and m - n = 2b - 3a.
  if EXCEPT (DISTINCT):
    R5 = Fil(b - a > 0 && 2a - b == 0); R6 = select only the attributes from R5
  else:
    R5 = UDTF(R4), which explodes the tuples based on 2b - 3a
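The analogous HiveQL sketch for the two-branch EXCEPT ALL case (again over the single-column tables ta and tb; the compiler builds the equivalent plan shown on the EXPLAIN slide):

  SELECT col
  FROM (
    SELECT replicate_rows(2 * b - 3 * a, col) AS (times, col)  -- m - n = 2b - 3a copies; non-positive emits nothing
    FROM (
      SELECT col, sum(c) AS a, sum(v * c) AS b
      FROM (
        SELECT col, 2 AS v, c FROM (SELECT col, count(1) AS c FROM ta GROUP BY col) r1  -- branch R1, VCol = 2
        UNION ALL
        SELECT col, 1 AS v, c FROM (SELECT col, count(1) AS c FROM tb GROUP BY col) r2  -- branch R2, VCol = 1
      ) u
      GROUP BY col
    ) g
  ) r;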
Set Operations – EXCEPT (ALL) - DataFlow
Input (each branch tagged with a virtual column VCol):
  R1 (col, VCol): (1, 2), (2, 2), (2, 2), (3, 2), (3, 2), (NULL, 2), (NULL, 2)
  R2 (col, VCol): (1, 1), (1, 1), (2, 1), (4, 1), (NULL, 1), (NULL, 1), (NULL, 1)

GroupBy on each branch (col, VCol, count(VCol) as c):
  R1: (1, 2, 1), (2, 2, 2), (3, 2, 2), (NULL, 2, 2)
  R2: (1, 1, 2), (2, 1, 1), (4, 1, 1), (NULL, 1, 3)

Union all (col, VCol, c):
  (1, 2, 1), (2, 2, 2), (3, 2, 2), (NULL, 2, 2), (1, 1, 2), (2, 1, 1), (4, 1, 1), (NULL, 1, 3)

GroupBy (col, sum(c) as a, sum(VCol*c) as b):
  (1, 3, 4), (2, 3, 5), (3, 2, 4), (NULL, 5, 7), (4, 1, 1)

Project (col, 2b-3a):
  (1, -1), (2, 1), (3, 2), (NULL, -1), (4, -1)

UDTF replicate_rows based on 2b-3a (non-positive counts emit no rows):
  (2, 1), (3, 2), (3, 2)  =>  result col: 2, 3, 3
Set Operations – EXCEPT (ALL) - Explain
explain
select * from ta
except all
select * from tb;
Reducer 4
File Output Operator [FS_24]
Select Operator [SEL_23] (rows=3 width=1)
Output:["_col0"]
UDTF Operator [UDTF_22] (rows=3 width=1)
function name:UDTFReplicateRows
Select Operator [SEL_20] (rows=3 width=1)
Output:["_col0","_col1"]
Group By Operator [GBY_19] (rows=3 width=1)
Output:["_col0","_col1","_col2"],aggregations:["sum(VALUE._col0)","sum(VALUE._col1)"],keys:KEY._col0
<-Union 3 [SIMPLE_EDGE]
<-Reducer 2 [CONTAINS]
Reduce Output Operator [RS_18]
PartitionCols:_col0
Group By Operator [GBY_17] (rows=6 width=1)
Output:["_col0","_col1","_col2"],aggregations:["sum(_col3)","sum(_col2)"],keys:_col0
Select Operator [SEL_15] (rows=6 width=1)
Output:["_col0","_col3","_col2"]
Select Operator [SEL_6] (rows=3 width=1)
Output:["_col0","_col1","_col2"]
Group By Operator [GBY_5] (rows=3 width=1)
Output:["_col0","_col1"],aggregations:["count(VALUE._col0)"],keys:KEY._col0
<-Map 1 [SIMPLE_EDGE]
SHUFFLE [RS_4]
PartitionCols:_col0
Group By Operator [GBY_3] (rows=7 width=1)
Output:["_col0","_col1"],aggregations:["count(2)"],keys:_col0
Select Operator [SEL_1] (rows=7 width=1)
Output:["_col0"]
TableScan [TS_0] (rows=7 width=1)
default@ta,ta,Tbl:COMPLETE,Col:NONE,Output:["col"]
<-Reducer 6 [CONTAINS]
Reduce Output Operator [RS_18]
PartitionCols:_col0
Group By Operator [GBY_17] (rows=6 width=1)
Output:["_col0","_col1","_col2"],aggregations:["sum(_col3)","sum(_col2)"],keys:_col0
Select Operator [SEL_15] (rows=6 width=1)
Output:["_col0","_col3","_col2"]
Select Operator [SEL_13] (rows=3 width=1)
Output:["_col0","_col1","_col2"]
Group By Operator [GBY_12] (rows=3 width=1)
Output:["_col0","_col1"],aggregations:["count(VALUE._col0)"],keys:KEY._col0
<-Map 5 [SIMPLE_EDGE]
SHUFFLE [RS_11]
PartitionCols:_col0
Group By Operator [GBY_10] (rows=7 width=1)
Output:["_col0","_col1"],aggregations:["count(1)"],keys:_col0
Select Operator [SEL_8] (rows=7 width=1)
Output:["_col0"]
TableScan [TS_7] (rows=7 width=1)
default@tb,tb,Tbl:COMPLETE,Col:NONE,Output:["col"]
Notes about Set Operations
• Support Union (distinct/all), Intersect (distinct/all), Except (distinct/all).
• Also support precedence using parentheses:
  • select * from ta union (select * from tb intersect all select * from tc)
• The design is purely based on query rewriting and does not introduce new operators:
  • changes only in the query parser and compiler; no change to the optimizer or the executors (MR, Tez, LLAP, Spark, etc.)
• The implementation only uses GB, UNION and UDTF:
  • easy to maintain
  • better performance and scalability than join-based rewrites
Agenda
• SQL2011 keywords and reserved keywords
• Primary key and foreign key
• Set operations
• Summary
Summary for Hive towards full SQL compliance
• 1.0.0 (Feb 4, 2015): SQL operators, Views, SQL Standard-based Authorization, Subqueries, Common Table Expressions, Insert/Update/Delete
• 1.2.0 (May 18, 2015): More reserved keywords, Union Distinct
• 2.1.0 (Jun 20, 2016): Primary/Foreign key
• 2.2.0 (TBD): Set Operations – Intersect (Distinct/All), Except (Distinct/All), Minus (Distinct/All)
Thank you! Questions?


Editor's Notes

  1. Major advancements in Apache Hive towards full support of SQL compliance Although Hive has become the de facto standard for SQL queries in Hadoop, many potential users hesitate to adopt it due to its lack of full support of SQL compliance. In this talk, we will give an overview of the major advancements in Apache Hive towards full support of SQL compliance that we have already achieved in recent years. More specifically, we will talk about the keywords and quoted identifiers, transactions and insert/update/delete, common table expressions and subqueries, and newly added set operators like union, intersect, except and minus. We will not only show concrete examples of how to use them, but also share the experience and lessons that we have gained during the development process. Finally, we will show the list of advancements that we are planning to make in the near future. Speaker(s): Pengcheng Xiong, Hortonworks, Inc. Pengcheng Xiong has extensive research and development experience in centralized/distributed RDBMS internals, Hive, etc. He has dozens of publications in database conferences, e.g., SIGMOD, VLDB and ICDE. He is now working on the next-generation Hive optimizer and he serves and contributes as an Apache Hive PMC member. He holds a PhD from Georgia Tech. Ashutosh Chauhan, Hortonworks, Inc. Ashutosh has been working in the Hadoop ecosystem for more than seven years, spending most of his time on higher-level languages on Hadoop. He is the VP of Apache Hive and a PMC member of Apache Pig.
  2. If we would like to enjoy the benefits of the standard, we have to conform to the standard as much as we can
  3. That is the work that we have already done towards SQL compliance.
  4. 3 items in the sequence, from the easiest one to the most complicated one.
  5. Difficulty ordered from easiest to hardest, to adjust to different audiences.
  6. In a computer language, a reserved word (also known as a reserved identifier) is a word that cannot be used as an identifier. Actually, almost every language has ambiguity if it is not well defined; HiveQL has a similar problem. Chinese has similar ambiguous sentences: “They are two high schools' students” (it can mean students from two different high schools, or two students from the same high school); “I want stir-fried pork shreds” (“stir-fried pork shreds” can be read as a verb-object phrase, i.e., to stir-fry the pork shreds, or as the name of a dish, i.e., a noun); “bit the hunter's dog to death” (this phrase can be read as a verb-object structure, “bit [the hunter's dog] to death”, or as a modifier structure, “the dog that bit the hunter to death”).
  7. Difficulty ordered from easiest to hardest, to adjust to different audiences.
  8. A primary key is a special relational database table column (or combination of columns) designated to uniquely identify all table records. A primary key's main features are: It must contain a unique value for each row of data. It cannot contain null values.
  9. Enforcement In order to use a constraint for enforcement, the constraint must be in the ENABLE state. An enabled constraint ensures that all data modifications upon a given table (or tables) satisfy the conditions of the constraints. Data modification operations which produce data that violates the constraint fail with a constraint violation error. Validation To use a constraint for validation, the constraint must be in the VALIDATE state. If the constraint is validated, then all data that currently resides in the table satisfies the constraint. Note that validation is independent of enforcement. Although the typical constraint in an operational system is both enabled and validated, any constraint could be validated but not enabled or vice versa (enabled but not validated). These latter two cases are useful for data warehouses. Belief In some cases, you will know that the conditions for a given constraint are true, so you do not need to validate or enforce the constraint. However, you may wish for the constraint to be present anyway to improve query optimization and performance. When you use a constraint in this way, it is called a belief or RELY constraint, and the constraint must be in the RELY state. The RELY state provides you with a mechanism for telling Oracle that a given constraint is believed to be true. Note that the RELY state only affects constraints that have not been validated.
  10. if (fkRangeDelta > 0 && pkRangeDelta > 0 && fkRangeDelta < pkRangeDelta) { scaledSelectivity = (float) pkRangeDelta / (float) fkRangeDelta; }
  11. Difficulty ordered from easiest to hardest, to adjust to different audiences.
  12. Intersect Distinct Rewrite 1: Inner Join (Null Safe) - GB - Proj Example: R1 Intersect Distinct R2 (R1 Inner Join R2 on R1.x1<=>R2.x1 and R1.x2<=>R2.x2.. ) - Proj(R1.x1, R1.x2… R1.xn)-GB (R1.x1, R1.x2… R1.xn) Rewrite 2: GB-Semi Join (NULL Safe) - Proj Example: R1 Intersect Distinct R2 (GB(R1 on R1.x1,R1.x2..) Semi Join R2 on R1.x1<=>R2.x1 and R1.x2=R2.x2.. ) - Proj(R1.x1, R1.x2… R1.xn)
  13. So the problem is how to get these results?
  14. We thank the audience for your attention.
  15. Difficulty ordered from easiest to hardest, to adjust to different audiences.
  16. I am ready to take any questions.