Page 1 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Hive Data Modeling & Query Optimization
Eyad Garelnabi
Page 2 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Agenda
•  File Formats
•  Hive Table Types
•  Hive Data Layout
•  What About Data Modeling
•  Hive Join Strategies
•  Optimizing Queries
Page 3 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
File Formats:
Text, Parquet, ORC, etc…
Page 4 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Text
•  Requires SerDes
–  CSV: comma delimited
–  Additional SerDes online
•  Does not compress well
•  Row based separation
•  Slow to read and write
•  Usually used for initial data load
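As a hedged illustration of the "Additional SerDes online" point, here is a minimal text-format staging table using Hive's built-in OpenCSVSerde (table name, columns and path are hypothetical; note this SerDe reads every column as a string):

CREATE EXTERNAL TABLE raw_customers (
  id string, name string, country string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
STORED AS TEXTFILE
LOCATION '/data/raw/customers';   -- hypothetical landing directory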
Page 5 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Parquet
•  Faster access to data
•  Efficient compression
•  Effective for select queries
Page 6 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
ORCFile
High Performance: Split-able, columnar storage file
Efficient Reads: Break into large "stripes" of data for efficient read
Fast Filtering: Built-in index, min/max, metadata for fast filtering of blocks - bloom filters if desired
Efficient Compression: Decompose complex row types into primitives: massive compression and efficient comparisons for filtering
Precomputation: Built-in aggregates per block (min, max, count, sum, etc.)
Proven at 300 PB scale: Facebook uses ORC for their 300 PB Hive Warehouse
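A hedged sketch of how the compression, index and bloom-filter features above are switched on through table properties (table and column names are hypothetical; the property values are illustrative):

CREATE TABLE orders_orc (
  id int, customer_id int, amount decimal(10,2)
)
STORED AS ORC
TBLPROPERTIES (
  'orc.compress' = 'ZLIB',                      -- stripe compression codec
  'orc.create.index' = 'true',                  -- min/max row-group index
  'orc.bloom.filter.columns' = 'customer_id'    -- bloom filters for fast filtering
);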
Page 7 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
etc…
•  Avro
–  Schema is defined in JSON; data is stored in a compact row-oriented binary format
–  Good for select * queries
–  Slow to read for other queries
•  Sequence
–  Optimized for Java MapReduce jobs
–  Inefficient for Hive
–  Rarely used
Page 8 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
High Compression with ORCFile
Page 9 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
HIVE Tables:
External, Managed, Views
Page 10 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
External Tables
•  Hive manages schema/metadata
•  When dropped, only schema is deleted
CREATE EXTERNAL TABLE my_external_table
(
  id int,
  name string,
  department string,
  country string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS orc;
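A hedged sketch of the typical external-table pattern, pointing Hive at data that already exists in HDFS (table name and path are hypothetical); since Hive only owns the metadata, dropping the table leaves the files in place:

CREATE EXTERNAL TABLE weblogs_raw (
  ts string, url string, user_id int
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION '/data/landing/weblogs';

DROP TABLE weblogs_raw;   -- removes only the schema; files under /data/landing/weblogs remain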
Page 11 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Internal/Managed Tables
•  Hive manages schema and data
•  Data is saved by default in the Hive warehouse directory, e.g. /user/hive/warehouse/my_managed_table
•  When dropped, both schema and data are deleted
CREATE TABLE my_managed_table
(
  id int,
  name string,
  department string,
  country string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS parquet
LOCATION '/usr/Scotiabank/demo';
Page 12 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Views
•  Virtual table
•  No data is stored to HDFS
•  When dropped, only schema is deleted
CREATE VIEW my_view
(id, name, department, country)
AS {select_statement};
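For a concrete (hypothetical) example, a view over the managed table from the previous slide; the SELECT is re-run whenever the view is queried, and no data is materialized:

CREATE VIEW canadian_employees AS
SELECT id, name, department
FROM my_managed_table
WHERE country = 'canada';

-- querying the view simply expands to the underlying SELECT
SELECT department, count(*) FROM canadian_employees GROUP BY department;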
Page 13 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
HIVE Data Layout:
Partitioning, Bucketing and Skews
Page 14 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Data Abstractions in Hive
Partitions, buckets and skews facilitate faster, more direct data access.

[Diagram: Database → Table → Partition → Bucket, with optional per-table Skewed Keys / Unskewed Keys]
Page 15 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Partitioning
•  Breaks up data horizontally by column value sets
•  When partitioning, you use one or more "virtual" columns to break up the data
•  Virtual columns cause directories to be created in HDFS.
–  Files for that partition are stored within that subdirectory.
•  Partitioning makes queries go fast.
–  Partitioning works particularly well when querying with the "virtual column"
–  If queries use various columns, it may be hard to decide which columns to partition by
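To make the "directories in HDFS" point concrete, a hedged sketch (warehouse path and table name are hypothetical) of the layout and of the pruning it enables:

-- A table partitioned by country is laid out roughly as:
--   /user/hive/warehouse/employees/country=canada/000000_0
--   /user/hive/warehouse/employees/country=usa/000000_0
-- so a query that filters on the partition column only reads the matching directory:
SELECT * FROM employees WHERE country = 'canada';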
Page 16 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Partitioning
•  Static Partitioning
–  Partitioning is done on selected column fields
CREATE TABLE static_partitioned_table
(
  id int,
  name string,
  department string
)
PARTITIONED BY (country string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS ORC;

INSERT OVERWRITE TABLE static_partitioned_table
PARTITION (country='canada')
SELECT id, name, department
FROM my_external_table
WHERE country='canada';
Page 17 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Partitioning
•  Dynamic Partitioning
–  Partition values are determined at runtime from the data, so all matching partitions are created automatically

CREATE TABLE dynamic_partitioned_table
(
  id int,
  name string,
  department string
)
PARTITIONED BY (country string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS ORC;

INSERT OVERWRITE TABLE dynamic_partitioned_table
PARTITION (country)
SELECT id, name, department, country
FROM my_external_table;
Page 18 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Partitioning
•  IMPORTANT: dynamic partitioning will not work by default
–  Before inserting, make sure:
–  set hive.exec.dynamic.partition=true
•  Also, set the maximum number of partitions to avoid going overboard
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.max.dynamic.partitions=1000;
set hive.exec.max.dynamic.partitions.pernode=1000;
Page 19 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Partitioning
•  Multi-layer partitioning is possible but often not efficient
–  The number of partitions becomes too large and will overwhelm the Metastore
•  Limit the number of partitions; fewer may be better
–  1,000 partitions will often perform better than 10,000
•  Hadoop likes big files
–  Avoid creating partitions made up of mostly small files
•  Only use partitioning when
–  Data is very large and there are lots of table scans
–  Data is queried against a particular column frequently
–  The partition column has low cardinality
Page 20 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Partitioning
•  Often better to partition by Date, not Year/Month
–  By date you will only have 365 partitions per year at most
–  Partitioning by date allows you to easily write queries that use BETWEEN and IN.
( https://community.hortonworks.com/questions/29031/best-pratices-for-hive-partitioning-especially-by.html )

SELECT * FROM TableA WHERE DateStamp IN ('2015-01-01', '2015-02-03', '2016-01-01')
VS
SELECT * FROM TableB WHERE (YEAR=2015 AND MONTH=01 AND DAY=01) OR (YEAR=2015 AND MONTH=02 AND DAY=03) OR (YEAR=2016 AND MONTH=01 AND DAY=01)
Page 21 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Bucketing
•  Breaks up data into a fixed number of files (buckets) by hashing a key column
•  When bucketing, you specify the number of buckets
•  Works particularly well when a lot of queries contain joins

CREATE TABLE bucketed_table
(
  id int,
  name string,
  department string,
  country string
)
CLUSTERED BY (id) INTO 12 BUCKETS
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS ORC;
Page 22 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Bucketing
•  IMPORTANT: the bucketing specified at table creation is NOT enforced when the table is written to…
•  So when writing data, make sure:
–  hive.enforce.bucketing = true

SET hive.enforce.bucketing = true;
SET hive.exec.dynamic.partition.mode=nonstrict;
INSERT INTO TABLE sale PARTITION (xdate, state)
SELECT * FROM staging_table;
Page 23 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Bucketing
•  Works well when there is very large data volume and most queries are joins
•  Partitioning and bucketing may be combined, of course
–  Be careful not to wind up with very many small files that can overwhelm the NameNode
–  Ideal file size is 200–500 MB
•  Partition and bucket frequently joined tables in a similar way to improve join efficiency

CREATE TABLE sale (
  id int, amount decimal, ...
) PARTITIONED BY (xdate string, state string)
CLUSTERED BY (id) SORTED BY (id) INTO 256 BUCKETS;
Page 24 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Skewed Tables and List Bucketing
•  Use when the table is skewed, with one or more column values taking up most of the space
•  By specifying the values that appear most often in the keys (in this example 'key1' and 'key2'), HIVE will split those into separate files automatically and take this into account during queries so that it can skip whole files if possible
•  "STORED AS DIRECTORIES" is called "list bucketing"
–  Table is skewed, but each part is also stored as a separate directory
–  1 directory for each skewed key value, 1 directory for all other keys

CREATE TABLE mytable (
  key STRING, value STRING, ...
) SKEWED BY (key) ON ('key1', 'key2') STORED AS DIRECTORIES;
Page 25 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Data Abstractions in Hive
Partitions, buckets and skews facilitate faster, more direct data access.

[Diagram: Database → Table → Partition → Bucket, with optional per-table Skewed Keys / Unskewed Keys]
Page 26 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Best Practice: When to use Partitioning/Bucketing/Skews
•  Partitioning is useful for chronological columns that don’t have a very high
number of possible values
–  You don’t want to end up with millions of partitions
•  Bucketing is most useful for tables that are “most often” joined together on the
same key
–  For example: joins by a patient-ID or customer-ID
–  Make sure the bucket count matches on both tables involved in the join
•  Skews are useful when one or two column values dominate the table
–  Hive can skip whole files when querying
Page 27 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
What About Data Modeling?
  
Page 28 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Data Modeling in Hadoop
•  No data modeling a-la DW/RDBMS
•  Decisions on data layout happen at the file/folder level
–  This is where partitioning, bucketing and skewing come in
•  How far should we denormalize?
–  As far as it makes sense
–  Usually denormalize frequently joined tables
–  Be mindful of the memory implications of very wide tables (thousands of columns)
Page 29 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Data Modeling in Hadoop
•  Can we ALTER an existing table to add partitioning or bucketing?
–  No
–  Create a new partitioned/bucketed table and copy the data over (see the sketch below)
•  Are there limits on the number of columns possible in Hive?
–  No "hard" limit from Hive
–  File format memory requirements may limit us, though
–  ORC has been tested with up to 20,000 columns before running out of memory
–  Be mindful of memory implications when designing wide tables
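A minimal sketch of that copy-over pattern (source table, columns and bucket count are hypothetical):

-- 1. Create the new table with the desired layout
CREATE TABLE events_bucketed (
  id int, payload string
)
PARTITIONED BY (event_date string)
CLUSTERED BY (id) INTO 32 BUCKETS
STORED AS ORC;

-- 2. Copy the data across with dynamic partitioning and bucket enforcement enabled
SET hive.exec.dynamic.partition.mode=nonstrict;
SET hive.enforce.bucketing=true;
INSERT OVERWRITE TABLE events_bucketed PARTITION (event_date)
SELECT id, payload, event_date FROM events_flat;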
Page 30 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
HIVE Join strategies:
Choose the right JOIN
Page 31 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Shuffle Joins – the default
SELECT * FROM customer JOIN order ON customer.id = order.cid;

customer                        order
first    last      id           cid      price    quantity
Nick     Toner     11911        4150     10.50    3
Jessie   Simonds   11912        11914    12.25    27
Kasi     Lamers    11913        3491     5.99     5
Rodger   Clayton   11914        2934     39.99    22
Verona   Hollen    11915        11914    40.50    10

Map side emits records keyed by the join key:
  { id: 11911, { first: Nick, last: Toner }}        { cid: 4150,  { price: 10.50, quantity: 3 }}
  { id: 11914, { first: Rodger, last: Clayton }}    { cid: 11914, { price: 12.25, quantity: 27 }}
  …

Reduce side receives matching keys together and joins them:
  { id: 11914, { first: Rodger, last: Clayton }}  with  { cid: 11914, { price: 12.25, quantity: 27 }}
  { id: 11911, { first: Nick, last: Toner }}      with  { cid: 4150,  { price: 10.50, quantity: 3 }}
  …

Identical keys are shuffled to the same reducer; the join is done reduce-side.
Expensive from a network utilization standpoint.
Page 32 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Broadcast Join (aka Map-side Join)
•  Star schemas (e.g. dimension tables)
•  Good when table is small enough to fit in RAM
Page 33 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Using Broadcast Join
•  Set hive.auto.convert.join = true
•  HIVE then automatically uses broadcast join, if possible
–  Small tables held in memory by all nodes
•  Used for star-schema type joins common in Data warehousing use-cases
•  hive.auto.convert.join.noconditionaltask.size determines data size for
automatic conversion to broadcast join:
–  Default 10MB is too low (check your default)
–  Recommended: 256MB for 4GB container
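A hedged configuration sketch of the settings described above; the size threshold is given in bytes (roughly 256 MB here) and the right value depends on your container size:

SET hive.auto.convert.join = true;
SET hive.auto.convert.join.noconditionaltask = true;
SET hive.auto.convert.join.noconditionaltask.size = 268435456;  -- ~256 MB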
Page 34 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Sort-Merge-Bucket join:
When both tables are too large for memory
SELECT * FROM customer JOIN order ON customer.id = order.cid;

customer                        order
first    last      id           cid      price    quantity
Nick     Toner     11911        4150     10.50    3
Jessie   Simonds   11912        11914    12.25    27
Kasi     Lamers    11913        11914    40.50    10
Rodger   Clayton   11914        12337    39.99    22
Verona   Hollen    11915        15912    40.50    10

CREATE TABLE customer (id int, first string, last string)
CLUSTERED BY (id) SORTED BY (id) INTO 32 BUCKETS;

CREATE TABLE order (cid int, price float, quantity int)
CLUSTERED BY (cid) SORTED BY (cid) INTO 32 BUCKETS;
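As a hedged sketch, the session settings commonly used so Hive can choose a sort-merge-bucket join on tables bucketed and sorted like the ones above (both tables must use the same number of buckets on the join key):

SET hive.auto.convert.sortmerge.join = true;
SET hive.optimize.bucketmapjoin = true;
SET hive.optimize.bucketmapjoin.sortedmerge = true;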
  
Page 35 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Hive Join Strategies
Shuffle Join
  Approach: Join keys are shuffled using map/reduce and joins are performed reduce-side.
  Pros: Works regardless of data size or layout.
  Cons: Most resource-intensive and slowest join type.

Broadcast Join
  Approach: Small tables are loaded into memory on all nodes; the mapper scans through the large table and joins.
  Pros: Very fast, single scan through the largest table.
  Cons: All but one table must be small enough to fit in RAM.

Sort-Merge-Bucket Join
  Approach: Mappers take advantage of co-location of keys to do efficient joins.
  Pros: Very fast for tables of any size.
  Cons: Data must be sorted and bucketed ahead of time.
All join types are now more efficient with Tez
Page 36 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
More Join Strategies
•  Take a look at this blog post for an explanation of joins: http://henning.kropponline.de/2016/10/09/hive-join-strategies/
•  A search on Google will return more join strategies than what has been covered here
•  Keep in mind that most benchmarks were done using MapReduce processing rather than Tez. Your performance should be better due to the in-memory processing nature of Tez.
Page 37 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Writing fast queries:
Techniques to optimize your queries
Page 38 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Optimizing HIVE queries
1.  Use Tez
2.  Use ORCFile
3.  Use Vectorization
4.  Use Cost Based Optimization (CBO)
5.  Write good SQL
6.  Use Hive Explain
7.  Consider Hive LLAP
Page 39 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Technique #1: TEZ vs MR
Page 40 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Understanding Tez vs MapReduce
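A hedged sketch: on HDP-era Hive the execution engine is a session setting, so switching a query from MapReduce to Tez is one line (your cluster may already default to Tez):

SET hive.execution.engine=tez;   -- use 'mr' to fall back to classic MapReduce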
Page 41 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Technique #2: Use ORCFile
Page 42 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
ORCFile – Efficient Columnar Format
High Performance: Split-able, columnar storage file
Efficient Reads: Break into large "stripes" of data for efficient read
Fast Filtering: Built-in index, min/max, metadata for fast filtering of blocks - bloom filters if desired
Efficient Compression: Decompose complex row types into primitives: massive compression and efficient comparisons for filtering
Precomputation: Built-in aggregates per block (min, max, count, sum, etc.)
Proven at 300 PB scale: Facebook uses ORC for their 300 PB Hive Warehouse
Page 43 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Technique #3: Use Vectorization
Page 44 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Using Vectorization
•  Vectorized query execution is a Hive feature that greatly reduces the CPU
usage for typical query operations like scans, filters, aggregates, and joins
•  Vectorized query execution streamlines operations by processing a block of
1024 rows at a time (instead of 1 row at a time)
•  ONLY works with ORCFiles
SET hive.vectorized.execution.enabled = true;
SET hive.vectorized.execution.reduce.enabled=true;
Page 45 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Technique #4: Use Cost-based Optimization
Page 46 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Hive Cost-Based Optimization (CBO)
•  The Cost-Based Optimization (CBO) engine uses statistics within Hive tables to produce optimal query plans
•  Two types of stats are used for optimization:
   o  Table stats
   o  Column stats
•  Uses an open-source framework called Calcite (formerly Optiq)
Page 47 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Step 1: ensure HIVE has table statistics
SET hive.stats.autogather=true;

•  Table-level stats are collected automatically during inserts when hive.stats.autogather is set to true (as above)
•  If you have an existing table without stats collected:
   ANALYZE TABLE table-name COMPUTE STATISTICS;
•  For column-level statistics:
–  HDP 2.1 (columns must be listed explicitly):
   ANALYZE TABLE table-name COMPUTE STATISTICS FOR COLUMNS col1, col2;
–  HDP 2.2 (all columns):
   ANALYZE TABLE table-name COMPUTE STATISTICS FOR COLUMNS;
Page 48 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
CBO with Partitioned Tables
•  When a table is partitioned, you need to specify the partition when collecting statistics:

ANALYZE TABLE table-name PARTITION (col1='x') COMPUTE STATISTICS;
ANALYZE TABLE table-name PARTITION (col1='x') COMPUTE STATISTICS FOR COLUMNS;
Page 49 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Step 2: set HIVE properties to enable CBO
SET hive.cbo.enable=true;
SET hive.compute.query.using.stats = true;
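On many HDP releases two companion properties are also set so the optimizer actually fetches the column and partition statistics it needs; treat these as a hedged sketch and check your cluster defaults:

SET hive.stats.fetch.column.stats = true;
SET hive.stats.fetch.partition.stats = true;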
And now every query you run will use CBO…
Page 50 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Technique #5: Write Smart SQL
Page 51 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Query design matters
•  This is Big Data we’re talking about
•  So consider performance in every query you write
•  There are many ways to write SQL with the same functional results,
but often varying performance characteristics
•  Avoid Joins when possible and choose the right Join when not
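One common illustration, using the sale table partitioned by xdate from the bucketing slides: keep predicates directly on the partition column so partition pruning can apply; wrapping the column in a function may force a scan of every partition on many Hive versions:

-- Prunes partitions: predicate on the raw partition column
SELECT id, amount FROM sale WHERE xdate BETWEEN '2015-01-01' AND '2015-01-31';

-- May defeat pruning: the function hides the partition column from the pruner
SELECT id, amount FROM sale WHERE substr(xdate, 1, 7) = '2015-01';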
Page 52 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Technique #6: Use Hive Explain
Page 53 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
HIVE EXPLAIN – understanding your query plan
•  It is an advanced tool to debug what HIVE is doing.
•  Look at the sequence of operations and make sure it looks reasonable
•  Validate the join type (e.g. we've asked for a map-side join, did it get executed that way?)
At the end of the day, if the plan is bad, everything else (ORC, Vectorization, etc.) may not matter.
Take a look at the link below on how to understand and analyze your query plan:
https://www.slideshare.net/HadoopSummit/how-to-understand-and-analyze-apache-hive-query-execution-plan-for-performance-debugging

EXPLAIN {Hive Query}
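A small hedged example using the customer/order tables from the join slides: run EXPLAIN on the join and look for a "Map Join Operator" in the output to confirm that a broadcast (map-side) join was actually chosen:

EXPLAIN
SELECT c.first, c.last, o.price
FROM customer c JOIN order o ON c.id = o.cid;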
Page 54 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Technique #7: Consider Hive LLAP
Page 55 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
LLAP Key Benefits
Ã  Uses persistent query servers to avoid long startup times and deliver fast SQL.
Ã  Enables as fast as sub-second query in Hive by keeping all data and servers running
and in-memory all the time.
Ã  Shares its in-memory cache among all SQL users, maximizing the use of this scarce
resource.
Ã  Has fine-grained resource management and preemption, making it great for concurrent
access across many users.
Ã  Great for cloud because it caches data in memory and keeps it compressed,
overcoming long cloud storage access times and stretching the amount of data you can
fit in RAM.
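As a hedged sketch of how a session opts into LLAP on HDP-era Hive (the LLAP daemons themselves are provisioned by the cluster administrator, and defaults vary by release):

SET hive.execution.engine=tez;       -- LLAP query fragments are scheduled through Tez
SET hive.execution.mode=llap;        -- run fragments inside the persistent LLAP daemons
SET hive.llap.execution.mode=all;    -- allow all work, not just map tasks, to run in LLAP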
Page 56 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Thank You
