Hive Data Modeling and Query Optimization

© Hortonworks Inc. 2011 – 2015. All Rights Reserved
Hive Data Modeling & Query Optimization
Eyad Garelnabi

Agenda
•  File Formats
•  Hive Table Types
•  Hive Data Layout
•  What About Data Modeling
•  Hive Join Strategies
•  Optimizing Queries

File Formats: Text, Parquet, ORC, etc…

Text
•  Requires SerDes
–  CSV: comma delimited
–  Additional SerDes online
•  Does not compress well
•  Row based separation
•  Slow to read and write
•  Usually used for initial data load
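
As a rough illustration of the CSV case, the sketch below defines a staging table over raw CSV files using OpenCSVSerde, which ships with recent Hive releases. The table name, columns and location are hypothetical. Note that this SerDe reads every column as a string, so type conversion usually happens when the data is copied into a columnar table.

```sql
-- Hypothetical staging table over raw CSV files (names and path are illustrative).
CREATE EXTERNAL TABLE staging_employees_csv
(
  id         STRING,
  name       STRING,
  department STRING,
  country    STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
  "separatorChar" = ",",
  "quoteChar"     = "\""
)
STORED AS TEXTFILE
LOCATION '/tmp/staging/employees_csv';
```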

Parquet
•  Faster access to data
•  Efficient compression
•  Effective for select queries

ORCFile
High Performance: Split-able, columnar storage file
Efficient Reads: Break into large “stripes” of data for
efficient read
Fast Filtering: Built in index, min/max, metadata for
fast filtering blocks - bloom filters if desired
Efficient Compression: Decompose complex row
types into primitives: massive compression and efficient
comparisons for filtering
Precomputation: Built in aggregates per block (min,
max, count, sum, etc.)
Proven at 300 PB scale: Facebook uses ORC for their
300 PB Hive Warehouse

etc…
•  Avro
–  Row-based binary format; the schema is defined in JSON
–  Good for select * queries
–  Slow to read for other queries
•  Sequence
–  Optimized for Java MapReduce jobs
–  Inefficient for Hive
–  Rarely used

High Compression with ORCFile

HIVE Tables: External, Managed, Views

External Tables
•  Hive manages schema/metadata
•  When dropped, only schema is deleted
CREATE EXTERNAL TABLE my_external_table
(
`id` int,
`name` string,
`department` string,
`country` string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS orc;

Internal/Managed Tables
•  Hive manages schema and data
•  Data is saved by default in /user/hive/warehouse/my_managed_table
•  When dropped, both schema and data are deleted
CREATE TABLE my_managed_table
(
`id` int,
`name` string,
`country` string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS parquet
LOCATION '/usr/Scotiabank/demo';

Views
•  Virtual table
•  No data is stored to HDFS
•  When dropped, only schema is deleted
CREATE VIEW my_view (id, name, country)
AS {select_statement};

HIVE Data Layout: Partitioning, Bucketing and Skews

Data Abstractions in Hive
Partitions, buckets and skews facilitate faster, more direct data access.
[Diagram: Database → Table → Partition → Bucket. Optional per table: Skewed Keys vs. Unskewed Keys]

Partitioning
•  Breaks up data horizontally by column value sets
•  When partitioning, you use one or more “virtual” columns to break up the data
•  Virtual columns cause directories to be created in HDFS.
–  Files for that partition are stored within that subdirectory.
•  Partitioning speeds up queries because Hive reads only the directories that match the filter.
–  Partitioning works particularly well when querying with the “virtual column”
–  If queries filter on many different columns, it may be hard to decide which columns to partition by
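
To make the directory layout concrete, here is a minimal sketch using a hypothetical table, employees_by_country (the table and column names are illustrative, not from the deck). It shows how a filter on the partition column maps to a single subdirectory.

```sql
-- Hypothetical table partitioned by country.
CREATE TABLE employees_by_country
(
  id         INT,
  name       STRING,
  department STRING
)
PARTITIONED BY (country STRING)
STORED AS ORC;

-- Each partition is a subdirectory such as .../employees_by_country/country=canada/
SHOW PARTITIONS employees_by_country;

-- Filtering on the partition column lets Hive read only the matching directory.
SELECT id, name
FROM employees_by_country
WHERE country = 'canada';
```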

Partitioning
•  Static Partitioning
–  The partition value is specified explicitly in each INSERT statement
CREATE TABLE static_partitioned_table
(
`id` int,
`name` string,
`department` string
)
PARTITIONED BY (`country` string)
STORED AS ORC;
INSERT OVERWRITE TABLE static_partitioned_table
PARTITION (country='canada')
SELECT id, name, department
FROM my_external_table
WHERE country='canada';

Partitioning
•  Dynamic Partitioning
–  Partition values are derived automatically from the data; the partition column must come last in the SELECT list
CREATE TABLE dynamic_partitioned_table
(
`id` int,
`name` string,
`department` string
)
PARTITIONED BY (`country` string)
STORED AS ORC;
INSERT OVERWRITE TABLE dynamic_partitioned_table
PARTITION (country)
SELECT id, name, department, country
FROM my_external_table;

Partitioning
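•  IMPORTANT: dynamic partitioning will not work with the default settings
–  Before inserting data, make sure:
–  set hive.exec.dynamic.partition=true
–  set hive.exec.dynamic.partition.mode=nonstrict (the default strict mode requires at least one static partition column)
•  Also, set a maximum number of partitions to avoid going overboard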
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.max.dynamic.partitions=1000;
set hive.exec.max.dynamic.partitions.pernode=1000;

Partitioning
•  Multi-layer partitioning is possible but often not efficient
–  The number of partitions can grow too large and overwhelm the Metastore
•  Limit the number of partitions. Fewer may be better
–  1000 partitions will often perform better than 10000
•  Hadoop likes big files
–  Avoid creating partitions with mostly small files
•  Only use when
–  Data is very large and there are lots of table scans
–  Data is queried against a particular column frequently
–  Column data must have low cardinality

Partitioning
•  Often better to partition by Date, not Year/Month
–  By date you will have roughly 365 partitions per year
–  Partitioning by date lets you easily write queries that use 'BETWEEN' and 'IN'.
( https://community.hortonworks.com/questions/29031/best-pratices-for-hive-partitioning-especially-by.html )
SELECT * FROM TableA WHERE DateStamp IN ('2015-01-01', '2015-02-03', '2016-01-01')
VS
SELECT * FROM TableB WHERE (YEAR=2015 AND MONTH=01 AND DAY=01) OR (YEAR=2015 AND MONTH=02 AND DAY=03) OR (YEAR=2016 AND MONTH=01 AND DAY=01)
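
A range query shows the same advantage. The sketch below assumes a hypothetical table, events_by_date, partitioned by a single ISO-formatted date string (all names are illustrative). Because 'YYYY-MM-DD' strings sort in date order, a plain BETWEEN selects a contiguous set of partitions.

```sql
-- Hypothetical table with one partition per calendar day.
CREATE TABLE events_by_date
(
  id      INT,
  payload STRING
)
PARTITIONED BY (datestamp STRING)
STORED AS ORC;

-- Reads only the partitions for January 2015.
SELECT COUNT(*)
FROM events_by_date
WHERE datestamp BETWEEN '2015-01-01' AND '2015-01-31';
```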

Bucketing
•  Splits rows into a fixed number of files (buckets) by hashing a key column
•  When bucketing, you specify the number of buckets
•  Works particularly well when a lot of queries contain joins
CREATE TABLE bucketed_table
(
`id` int,
`name` string,
`country` string
)
CLUSTERED BY (id) INTO 12 BUCKETS
STORED AS ORC;

Bucketing
•  IMPORTANT: the bucketing specified at table creation is NOT enforced when
the table is written to…
•  So when writing data, make sure:
–  hive.enforce.bucketing = true (needed on Hive 1.x; Hive 2.0 and later always enforce bucketing)
SET hive.enforce.bucketing = true;
SET hive.exec.dynamic.partition.mode=nonstrict;
INSERT INTO TABLE sale PARTITION (xdate, state)
SELECT * FROM staging_table;

Bucketing
•  Works well when there is very large data volume and most queries are joins
•  Partitioning and bucketing may be combined, of course
–  Be careful not to wind up with very many small files that can overwhelm the
NameNode
–  Ideal file size is 200 to 500 MB
•  Partition and Bucket frequently joined tables in a similar way to improve join
efficiency
CREATE TABLE sale (
id int, amount decimal, ...
) PARTITIONED BY (xdate string, state string)
CLUSTERED BY (id) SORTED BY (id) INTO 256 buckets;

Skewed Tables and List Bucketing
•  Use when a table is skewed, with one or more column values taking up most of the space
•  By specifying the values that appear most often in the keys (in this example
‘key1’ and ‘key2’), HIVE will split those into separate files automatically and
take this into account during queries so that it can skip the whole file if
possible
•  “STORED AS DIRECTORIES” is called “list bucketing”
–  Table is skewed, but also store each part as separate directory
–  1 directory for each skewed key value, 1 directory for all other keys
CREATE TABLE mytable (
key STRING, value STRING, …
) SKEWED BY (key) ON ('key1', 'key2') STORED AS DIRECTORIES;

Best Practice: When to use Partitioning/Bucketing/Skews
•  Partitioning is useful for chronological columns that don’t have a very high
number of possible values
–  You don’t want to end up with millions of partitions
•  Bucketing is most useful for tables that are “most often” joined together on the
same key
–  For example: joins by a patient-ID or customer-ID
–  Make sure the bucket count matches on both tables involved in the join
•  Skews are useful when one or two column values dominate the table
–  Hive can skip whole files when querying

What About Data Modeling?

Data Modeling in Hadoop
•  No data modeling à la DW/RDBMS
•  Decisions on data layout happen at the file/folder level
–  This is where partitioning, bucketing and skewing come in
•  How far should we denormalize?
–  As far as it makes sense
–  Usually denormalize frequently joined tables
–  Be mindful of the memory implications of very wide tables (thousands of columns)

Data Modeling in Hadoop
•  Can we alter an existing table so that its existing data becomes partitioned or bucketed?
–  No
–  Create a new partitioned/bucketed table and copy the data over
•  Are there limits on number of columns possible in Hive?
–  No “hard” limit from Hive
–  File format memory requirements may limit us though
–  ORC tested with up to 20,000 columns before getting out-of-memory
–  Be mindful of memory implications when designing wide tables
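
The "create a new table and copy data over" step might look like the following sketch. It reuses my_external_table from the earlier slides as the source; the target name employees_partitioned is hypothetical.

```sql
-- Allow fully dynamic partition inserts for this session.
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- New table with the desired layout (hypothetical name).
CREATE TABLE employees_partitioned
(
  id         INT,
  name       STRING,
  department STRING
)
PARTITIONED BY (country STRING)
STORED AS ORC;

-- Copy the data; the partition column goes last in the SELECT list.
INSERT OVERWRITE TABLE employees_partitioned PARTITION (country)
SELECT id, name, department, country
FROM my_external_table;
```

Once the copy has been verified, the old table can be dropped or renamed.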

HIVE Join Strategies: Choose the right JOIN

Shuffle Joins – the default
customer                        order
first    last      id           cid     price   quantity
Nick     Toner     11911        4150    10.50   3
Jessie   Simonds   11912        11914   12.25   27
Kasi     Lamers    11913        3491    5.99    5
Rodger   Clayton   11914        2934    39.99   22
Verona   Hollen    11915        11914   40.50   10

SELECT * FROM customer JOIN `order` ON customer.id = `order`.cid;

M (mapper):  { id: 11911, { first: Nick, last: Toner }}  { id: 11914, { first: Rodger, last: Clayton }}  …
M (mapper):  { cid: 4150, { price: 10.50, quantity: 3 }}  { cid: 11914, { price: 12.25, quantity: 27 }}  …
R (reducer): { id: 11914, { first: Rodger, last: Clayton }}  { cid: 11914, { price: 12.25, quantity: 27 }}
R (reducer): { id: 11911, { first: Nick, last: Toner }}  { cid: 4150, { price: 10.50, quantity: 3 }}  …

Identical keys are shuffled to the same reducer. The join is done reduce-side.
Expensive from a network utilization standpoint.

Broadcast Join (aka Map-side Join)
•  Star schemas (e.g. dimension tables)
•  Good when table is small enough to fit in RAM

Using Broadcast Join
•  Set hive.auto.convert.join = true
•  HIVE then automatically uses broadcast join, if possible
–  Small tables held in memory by all nodes
•  Used for star-schema type joins common in Data warehousing use-cases
•  hive.auto.convert.join.noconditionaltask.size determines data size for
automatic conversion to broadcast join:
–  Default 10MB is too low (check your default)
–  Recommended: 256MB for 4GB container
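
As a sketch of the settings above, the session below enables automatic conversion and raises the size threshold to 256 MB. The threshold is given in bytes. The extra property hive.auto.convert.join.noconditionaltask and the table names in the query are assumptions added for illustration; check the defaults of your own distribution before changing anything.

```sql
-- Let Hive convert eligible joins to broadcast (map-side) joins.
SET hive.auto.convert.join=true;
SET hive.auto.convert.join.noconditionaltask=true;
-- 256 MB expressed in bytes (256 * 1024 * 1024).
SET hive.auto.convert.join.noconditionaltask.size=268435456;

-- Hypothetical star-schema query: a large fact table joined to a small dimension.
SELECT d.region, SUM(f.amount) AS total_amount
FROM sales_fact f
JOIN store_dim d ON f.store_id = d.store_id
GROUP BY d.region;
```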

Sort-Merge-Bucket join:
When both are too large for memory
customer                        order
first    last      id           cid     price   quantity
Nick     Toner     11911        4150    10.50   3
Jessie   Simonds   11912        11914   12.25   27
Kasi     Lamers    11913        11914   40.50   10
Rodger   Clayton   11914        12337   39.99   22
Verona   Hollen    11915        15912   40.50   10

SELECT * FROM customer JOIN `order` ON customer.id = `order`.cid;

CREATE TABLE customer (id int, first string, last string)
CLUSTERED BY(id) SORTED BY(id) INTO 32 BUCKETS;

CREATE TABLE `order` (cid int, price float, quantity int)
CLUSTERED BY(cid) SORTED BY(cid) INTO 32 BUCKETS;
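
Bucketed and sorted tables alone may not be enough for Hive to choose this join. The following is a hedged sketch of session settings commonly used to let the optimizer pick a sort-merge-bucket join; these properties are not on the slide, and their defaults and behaviour vary between Hive versions.

```sql
-- Assumed session settings; verify them against your Hive version.
SET hive.auto.convert.sortmerge.join=true;
SET hive.optimize.bucketmapjoin=true;
SET hive.optimize.bucketmapjoin.sortedmerge=true;

-- Both tables are bucketed and sorted on the join key with the same bucket count.
SELECT c.id, o.price, o.quantity
FROM customer c
JOIN `order` o ON c.id = o.cid;
```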

Hive Join Strategies
•  Shuffle Join
–  Approach: Join keys are shuffled using map/reduce and joins are performed reduce side.
–  Pros: Works regardless of data size or layout.
–  Cons: Most resource-intensive and slowest join type.
•  Broadcast Join
–  Approach: Small tables are loaded into memory in all nodes, mapper scans through the large table and joins.
–  Pros: Very fast, single scan through largest table.
–  Cons: All but one table must be small enough to fit in RAM.
•  Sort-Merge-Bucket Join
–  Approach: Mappers take advantage of co-location of keys to do efficient joins.
–  Pros: Very fast for tables of any size.
–  Cons: Data must be sorted and bucketed ahead of time.

All join types are now more efficient with Tez

More Join Strategies
•  Take a look at this blog posting for an explanation of joins:
http://henning.kropponline.de/2016/10/09/hive-join-strategies/
•  A search on Google will return more join strategies than what has
been covered here
•  Keep in mind that most benchmarks were done using Map Reduce
processing rather than Tez. Your performance should be better due to
the in-memory processing nature of Tez.

Writing fast queries: Techniques to optimize your queries

Optimizing HIVE queries
1.  Use Tez
2.  Use ORCFile
3.  Use Vectorization
4.  Use Cost Based Optimization (CBO)
5.  Write good SQL
6.  Use Hive Explain
7.  Consider Hive LLAP

Technique #1: TEZ vs MR

Understanding Tez vs MapReduce
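
For reference, the execution engine is selected per session with a single property. This is a minimal sketch and assumes Tez is installed on the cluster.

```sql
-- Run subsequent queries on Tez; the value 'mr' selects classic MapReduce.
SET hive.execution.engine=tez;
```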

Technique #2: use ORCFile

ORCFile – Efficient Columnar Format
High Performance: Split-able, columnar storage file
Efficient Reads: Break into large “stripes” of data for
efficient read
Fast Filtering: Built in index, min/max, metadata for
fast filtering blocks - bloom filters if desired
Efficient Compression: Decompose complex row
types into primitives: massive compression and efficient
comparisons for filtering
Precomputation: Built in aggregates per block (min,
max, count, sum, etc.)
Proven at 300 PB scale: Facebook uses ORC for their
300 PB Hive Warehouse

Technique #3: Use Vectorization

Using Vectorization
•  Vectorized query execution is a Hive feature that greatly reduces the CPU
usage for typical query operations like scans, filters, aggregates, and joins
•  Vectorized query execution streamlines operations by processing a block of
1024 rows at a time (instead of 1 row at a time)
•  ONLY works with ORCFiles
SET hive.vectorized.execution.enabled = true;
SET hive.vectorized.execution.reduce.enabled=true;

Technique #4: Use Cost-based Optimization

Hive Cost-Based Optimization (CBO)
•  Cost-Based Optimization (CBO) engine uses statistics within Hive tables to produce optimal query plans
•  Two types of stats used for optimization:
o  Table stats
o  Column stats
•  Uses an open-source framework called Calcite (formerly Optiq)

Step 1: ensure HIVE has table statistics
•  Stats are collected at the table level automatically when:
set hive.stats.autogather=true;
•  If you have an existing table without stats collected:
ANALYZE TABLE table-name COMPUTE STATISTICS;
•  For column-level statistics:
–  HDP 2.1
ANALYZE TABLE table-name COMPUTE STATISTICS for COLUMNS col1, col2;
–  HDP 2.2
ANALYZE TABLE table-name COMPUTE STATISTICS for COLUMNS;

CBO with Partitioned Tables
•  When the table is partitioned, you need to specify the partition when collecting statistics:
ANALYZE TABLE table-name partition (col1='x') COMPUTE STATISTICS;
ANALYZE TABLE table-name partition (col1='x') COMPUTE STATISTICS for COLUMNS;

Step 2: set HIVE properties to enable CBO
SET hive.cbo.enable=true;
SET hive.compute.query.using.stats = true;
And now every query you run will use CBO…
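
Putting the two steps together, a session might look like the sketch below. The table name customer_dim is hypothetical, and the two hive.stats.fetch.* properties are an addition not shown on the slides; they are commonly enabled alongside CBO so that the planner reads column and partition statistics from the Metastore.

```sql
-- Enable the cost-based optimizer and let Hive answer simple queries from stats.
SET hive.cbo.enable=true;
SET hive.compute.query.using.stats=true;
-- Assumed additions: let the planner fetch column and partition statistics.
SET hive.stats.fetch.column.stats=true;
SET hive.stats.fetch.partition.stats=true;

-- Gather table-level and column-level statistics for a hypothetical table.
ANALYZE TABLE customer_dim COMPUTE STATISTICS;
ANALYZE TABLE customer_dim COMPUTE STATISTICS FOR COLUMNS;
```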

Technique #5: Write Smart SQL

Query design matters
•  This is Big Data we’re talking about
•  So consider performance in every query you write
•  There are many ways to write SQL with the same functional results,
but often varying performance characteristics
•  Avoid Joins when possible and choose the right Join when not

Technique #6: Use Hive Explain

HIVE EXPLAIN – understanding your query plan
•  It is an advanced tool to debug what HIVE is doing.
•  Look at the sequence of operations and make sure it looks reasonable
•  Validate join type (e.g. we’ve asked for a map-side join, did it get executed that way?)
At the end of the day, if the plan is bad, everything else (ORC, Vectorization, etc) may not
matter.
Take a look at the link below on how to understand and analyze your query plan:
https://www.slideshare.net/HadoopSummit/how-to-understand-and-analyze-apache-hive-query-execution-plan-for-performance-debugging
EXPLAIN {Hive Query}
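
As a concrete instance of the template above, the sketch below asks Hive for the plan of the customer/order join used earlier in the deck. In the output, check which join operator appears in the plan (for example, a map join versus a reduce-side join); the exact operator names differ between Hive versions and execution engines.

```sql
-- Show the plan without running the query.
EXPLAIN
SELECT c.id, o.price, o.quantity
FROM customer c
JOIN `order` o ON c.id = o.cid;
```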

Technique #7: Consider Hive LLAP

LLAP Key Benefits
•  Uses persistent query servers to avoid long startup times and deliver fast SQL.
•  Enables queries as fast as sub-second in Hive by keeping all data and servers running
and in-memory all the time.
•  Shares its in-memory cache among all SQL users, maximizing the use of this scarce
resource.
•  Has fine-grained resource management and preemption, making it great for concurrent
access across many users.
•  Great for cloud because it caches data in memory and keeps it compressed,
overcoming long cloud storage access times and stretching the amount of data you can
fit in RAM.

Thank You
