Page1
Hive: Loading Data
June 2015
Version 2.0
Ben Leonhardi
Page2
Agenda
• Introduction
• ORC files
• Partitioning vs. Predicate Pushdown
• Loading data
• Dynamic Partitioning
• Bucketing
• Optimize Sort Dynamic Partitioning
• Manual Distribution
• Miscellaneous
• Sorting and Predicate pushdown
• Debugging
• Bloom Filters
Page3
Introduction
• Effectively storing data in Hive
• Reducing IO
• Partitioning
• ORC files with predicate pushdown
• Partitioned tables
• Static partition loading
– One partition is loaded at a time
– Good for continuous operation
– Not suitable for initial loads
• Dynamic partition loading
– Data is distributed between partitions dynamically
• Data Sorting for better predicate pushdown
Page4
ORCFile – Columnar Storage for Hive
Columnar format enables high compression and high performance.
• ORC is an optimized, compressed, columnar storage format
• Only needed columns are read
• Blocks of data can be skipped using indexes and predicate pushdown
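A minimal sketch of how such a table might be declared; the table name, columns, and the explicit ZLIB property are illustrative assumptions, not taken from the deck:
-- ZLIB is the usual default ORC codec; shown explicitly for illustration
CREATE TABLE ORC_EXAMPLE
( ID INT, DT DATE, PROFIT DOUBLE )
STORED AS ORC
TBLPROPERTIES ("orc.compress"="ZLIB");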
Page5
Partitioning Hive
• Hive tables can be value partitioned
– Each partition is associated with a folder in HDFS
– All partitions have an entry in the Hive Catalog
– The Hive optimizer will parse the query for filter conditions and skip unneeded partitions
• Usage consideration
– Too many partitions can lead to bad performance in the Hive Catalog and Optimizer
– No range partitioning / no continuous values
– Normally date partitioned by data load
• /apps/hive/warehouse
• cust.db
• customers
• sales
• day=20150801
• day=20150802
• day=20150803
• …
Warehouse folder in HDFS. Hive databases have folders ending in .db. Unpartitioned tables have a single folder. Partitioned tables have a subfolder for each partition.
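A minimal sketch of a table that would produce the day=... subfolders shown above; the database and column list are assumptions for illustration:
CREATE TABLE cust.sales
( CLIENTID INT, PROFIT DOUBLE )
PARTITIONED BY ( DAY STRING )
STORED AS ORC;
-- every distinct value loaded into DAY creates its own subfolder, e.g. .../sales/day=20150801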
Page6
Predicate Pushdown
• ORC ( and other storage formats ) support predicate pushdown
– Query filters are pushed down into the storage handler
– Blocks of data can be skipped without reading them from HDFS based on ORC index
SELECT SUM (PROFIT) FROM SALES WHERE DAY = 03
DAY CUST PROFIT
01 Klaus 35
01 Max 30
01 John 20
02 John 34
03 Max 10
04 Klaus 20
04 Max 45
05 Mark 20
DAY_MIN DAY_MAX PROFIT_MIN PROFIT_MAX
01 01 20 35
02 04 10 34
04 05 20 45
Only Block 2 can contain rows with DAY = 03.
Blocks 1 and 3 can be skipped
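A minimal sketch of running the query above with ORC predicate pushdown switched on; hive.optimize.index.filter is the same switch used on the debugging slide later in this deck:
set hive.optimize.index.filter=true;
SELECT SUM(PROFIT) FROM SALES WHERE DAY = 03;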
Page7
Partitioning vs. Predicate Pushdown
• Both reduce the data that needs to be read
• Partitioning works at split generation, no need to start containers
• Predicate pushdown is applied during file reads
• Partitioning is applied in the split generation/optimizer
• Impact on Optimizer and HCatalog for large number of partitions
• Thousands of partitions will result in performance problems
• Predicate Pushdown needs to read the file footers
• Containers are allocated even though they may finish very quickly
• No overhead in Optimizer/Catalog
• The newest Hive build (1.2) can apply PPD at split generation time
• hive.exec.orc.split.strategy=BI, means never read footers (& fire jobs fast)
• hive.exec.orc.split.strategy=ETL, always read footers and split as fine as you want
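A minimal sketch of setting the split strategy per session (assuming Hive 1.2+):
-- never read footers, generate splits and launch the job quickly
set hive.exec.orc.split.strategy=BI;
-- or: always read footers and split as finely as needed
set hive.exec.orc.split.strategy=ETL;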
Page8
Partitioning and Predicate Pushdown
SELECT * FROM TABLE WHERE COUNTRY = 'EN' AND DATE = 2015
(Diagram: the partition folder for COUNTRY = 'EN' contains three ORC blocks holding DATE values between 2008 and 2015; only the blocks whose min/max range includes 2015 are read by the map tasks. The partition folder for 'DE', with its own ORC blocks, is never read at all.)
Table partitioned on COUNTRY: only the folder for 'EN' is read.
ORC files keep index information on their content, so blocks can be skipped based on the index.
Page9
Agenda
• Introduction
• ORC files
• Partitioning vs. Predicate Pushdown
• Loading data
• Dynamic Partitioning
• Bucketing
• Optimize Sort Dynamic Partitioning
• Manual Distribution
• Miscellaneous
• Sorting and Predicate pushdown
• Debugging
• Bloom Filters
Page10
Loading Data with Dynamic Partitioning
CREATE TABLE ORC_SALES
( CLIENTID INT, DT DATE, REV DOUBLE, PROFIT DOUBLE, COMMENT STRING )
PARTITIONED BY ( COUNTRY STRING )
STORED AS ORC;
INSERT INTO TABLE ORC_SALES PARTITION (COUNTRY) SELECT * FROM DEL_SALES;
• Dynamic partitioning could create millions of partitions for bad partition keys
• Parameters exist that restrict the creation of dynamic partitions
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode = nonstrict;
set hive.exec.max.dynamic.partitions.pernode=100000;
set hive.exec.max.dynamic.partitions=100000;
set hive.exec.max.created.files=100000;
Most of these settings are already enabled with
good values in HDP 2.2+
Dynamic partition columns need to be the last columns
in your dataset
Change order in SELECT list if necessary
Page11
Dynamic Partition Loading
• One file per Reducer/Mapper
• Standard Load will use Map tasks to write data. One map task per input block/split
(Diagram: five input blocks are read by five map tasks Map1-Map5; every map task writes its own small file into each of the four partition folders DE, EN, FR, and SP, so 5 maps x 4 partitions = 20 small files.)
Page12
Small files
• Large number of writers with large number of partitions results in small files
• Files with 1-10 blocks of data are more efficient for HDFS
• ORC compression is not very efficient on small files
• The ORC writer keeps one Writer object open for each partition it encounters
• RAM is needed for one stripe in every open file / column
• Too many writers result in small stripes ( down to 5000 rows )
• If you run into memory problems you can increase the task RAM or the ORC memory pool percentage
set hive.tez.java.opts="-Xmx3400m";
set hive.tez.container.size = 4096;
set hive.exec.orc.memory.pool = 1.0;
Page13
Loading Data Using Distribution
• For large number of partitions, load data through reducers.
• One or more reducers associated with a partition through data distribution
• Beware of Hash conflicts ( two partitions being mapped to the same reducer by the hash function )
(Diagram: each map task hashes the partition key of every row, e.g. HASH(EN) -> 1 and HASH(DE) -> 0, so reducer Red0 receives all DE rows and reducer Red1 receives all EN rows; each reducer then writes a single file into its partition folder.)
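A minimal sketch of such a reducer-based load, reusing the ORC_SALES / DEL_SALES tables from the earlier slide:
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
-- route all rows of one COUNTRY to the same reducer, which writes one file per partition
INSERT INTO TABLE ORC_SALES PARTITION (COUNTRY)
SELECT * FROM DEL_SALES DISTRIBUTE BY COUNTRY;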
Page14
Bucketing
• Hive tables can be bucketed using the CLUSTERED BY keyword
– One file/reducer per bucket
– Buckets can be sorted
– Additional advantages like bucket joins and sampling
• By default there is one reducer for each bucket across all partitions
– Performance problems for large loads with dynamic partitioning
– ORC Writer memory issues
• Enforce Bucketing and Sorting in Hive
set hive.enforce.sorting=true;
set hive.enforce.bucketing=true;
Page15
Bucketing Example
CREATE TABLE ORC_SALES
( CLIENTID INT, DT DATE, REV DOUBLE, PROFIT DOUBLE, COMMENT STRING )
PARTITIONED BY ( COUNTRY STRING )
CLUSTERED BY ( DT ) SORTED BY ( DT ) INTO 31 BUCKETS
STORED AS ORC;
INSERT INTO TABLE ORC_SALES PARTITION (COUNTRY) SELECT * FROM DEL_SALES;
(Diagram: one reducer per DT bucket, Red DT1, Red DT2, Red DT3, ..., and each reducer writes its bucket file into every partition folder DE, EN, and FR.)
Page16
Optimized Dynamic Sorted Partitioning
• Enable optimized sorted partitioning to fix small file creation
– Creates one reducer for each partition AND bucket
– If you have 5 partitions with 4 buckets you will have 20 reducers
• Hash conflicts mean that you can still have reducers handling more than one file
– Data is sorted by partition/bucket key
– ORCWriter closes files after encountering new keys
- only one open file at a time
- reduced memory needs
• Can be enabled with
set hive.optimize.sort.dynamic.partition=true;
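A minimal sketch of a dynamically partitioned load with the optimization enabled (property name as above; the tables reuse the earlier example):
set hive.optimize.sort.dynamic.partition=true;
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
-- rows are sorted on the partition (and bucket) key before reaching the reducers,
-- so each reducer keeps only one ORC writer open at a time
INSERT INTO TABLE ORC_SALES PARTITION (COUNTRY) SELECT * FROM DEL_SALES;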
Page17
Optimized Dynamic Sorted Partitioning
• Optimized sorted partitioning creates one reducer per partition * bucket
(Diagram: five input blocks are read by map tasks Map1-Map5; the rows are shuffled to one reducer per partition, Red1-Red4, and each reducer writes a single output file Out1-Out4 into its partition folder DE, EN, FR, or SP.)
Hash conflicts can happen even though there is one reducer for each partition.
• This is the reason the data is sorted
• The reducer can close the ORC writer after each key
Page18
Miscellaneous
• A small number of partitions can lead to slow loads
• The solution is bucketing, which increases the number of reducers
• This can also help with predicate pushdown
• Partition by country and bucket by client id, for example
• On a big system you may have to increase the max. number of reducers
set hive.exec.reducers.max=1000;
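A minimal sketch of the layout suggested above; the column names and the bucket count are illustrative assumptions:
CREATE TABLE ORC_SALES_BUCKETED
( CLIENTID INT, DT DATE, REV DOUBLE, PROFIT DOUBLE )
PARTITIONED BY ( COUNTRY STRING )
CLUSTERED BY ( CLIENTID ) SORTED BY ( CLIENTID ) INTO 32 BUCKETS
STORED AS ORC;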
Page19
Manual Distribution
• Fine grained control over distribution may be needed
• DISTRIBUTE BY keyword allows control over the distribution algorithm
• For example, DISTRIBUTE BY GENDER will split the data stream into two sub-streams
• Does not define the number of reducers
– Specify a fitting number with
set mapred.reduce.tasks=2
• For dynamic partitioning include the partition key in the distribution
• Any additional subkeys result in multiple files per partition folder ( not unlike bucketing )
• For fast load try to maximize number of reducers in cluster
Page20
Distribute By
set mapred.reduce.tasks = 8;
INSERT INTO ORC_SALES PARTITION ( COUNTRY ) SELECT * FROM DEL_SALES
DISTRIBUTE BY COUNTRY, GENDER;
(Diagram: four input blocks are read by map tasks Map1-Map4; the rows are distributed by ( COUNTRY, GENDER ) to eight reducers Red1-Red8, one per combination DE/M, DE/F, EN/M, EN/F, FR/M, FR/F, SP/M, SP/F, unless a hash conflict maps two combinations to the same reducer.)
The number of reducers and the number of distribution keys do not have to be identical, but matching them is good practice.
If you run into hash conflicts, changing the distribution key may help ( for example M/F -> 0/1 ).
Page21
Agenda
• Introduction
• ORC files
• Partitioning vs. Predicate Pushdown
• Loading data
• Dynamic Partitioning
• Bucketing
• Optimize Sort Dynamic Partitioning
• Manual Distribution
• Miscellaneous
• Sorting and Predicate pushdown
• Debugging
• Bloom Filters
Page22
SORT BY for Predicate Pushdown ( PPD )
• ORC can skip stripes ( and 10k sub-blocks ) of data based on ORC footers
• Data can be skipped based on min/max values and bloom filters
• In warehouse environments data is normally sorted by date
• For initial loads or other predicates data can be sorted during load
• Two ways to sort data: ORDER BY ( global sort, slow ) and SORT BY ( sort by reducer )
– You want SORT BY for PPD: it is faster, and cross-file sorting does not help PPD
• Can be combined with Distribution, Partitioning, Bucketing to optimize effect
CREATE TABLE ORC_SALES
( CLIENTID INT, DT DATE, REV DOUBLE, PROFIT DOUBLE, COMMENT STRING )
STORED AS ORC;
INSERT INTO TABLE ORC_SALES SELECT * FROM DEL_SALES SORT BY DT;
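A minimal sketch combining distribution with a per-reducer sort; the DISTRIBUTE BY column is an assumption, any key that spreads the data evenly works:
set mapred.reduce.tasks=8;
-- each reducer receives the rows of some CLIENTIDs and writes one file sorted on DT
INSERT INTO TABLE ORC_SALES
SELECT * FROM DEL_SALES DISTRIBUTE BY CLIENTID SORT BY DT;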
Page23
Sorting when Inserting into Table
(Diagram: partitions DE and EN each hold two ORC files whose rows are sorted on dt; for the query below only the blocks whose min/max range covers 2015-02 have to be read.)
SELECT * FROM DATA_ORC WHERE dt = 2015-02
Files are divided into stripes of x MB and blocks of 10,000 rows.
Only the blocks whose min/max range matches the predicate have to be read.
This requires sorting.
Page24
Checking Results
• Use hive --orcfiledump to check results in ORC files
hive --orcfiledump /apps/hive/warehouse/table/dt=3/00001_0
… Compression: ZLIB …
Stripe Statistics:
Stripe 1:
Column 0: count: 145000
Column 1: min: 1 max: 145000
…
Stripe 2:
Column 0: count: 144000
Column 1: min: 145001 max: 289000
…
Check the number of stripes and the number of rows
- small stripes ( 5000 rows ) indicate a memory problem during the load
Data should be sorted on your predicate columns
Page25
Bloom Filters
• New feature in Hive 1.2
• A hash index bitmap of values in a column
• If the bit for hash(value) is 0, no row in the stripe can be your value
• If the bit for hash(value) is 1, it is possible that the stripe contains your value
• Hive can skip stripes without need to sort data
• Hard to sort by multiple columns
CREATE TABLE ORC_SALES ( ID INT, Client INT, DT INT… )
STORED AS ORC TBLPROPERTIES
("orc.bloom.filter.columns"="Client,DT");
The parameter needs a case-sensitive, comma-separated list of columns
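A minimal sketch of a query that can benefit from the bloom filter index; the client id value is made up:
set hive.optimize.index.filter=true;
-- stripes whose bloom filter cannot contain the value are skipped, without sorting the data
SELECT * FROM ORC_SALES WHERE Client = 12345;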
Page26
Bloom Filters
• Bloom Filters are good
• If you have multiple predicate columns
• If your predicate columns are not suitable for sorting ( URLs, hash values, … )
• If you cannot sort the data ( daily ingestion, filter by clientid )
• Bloom Filters are bad
• If every stripe contains your value
– low cardinality fields like country
– Events that happen regularly ( client buys something daily )
• Check if you successfully created a bloom filter index with orcfiledump
hive --orcfiledump --rowindex 3,4,5 /apps/hive/…
You only see bloom filter indexes if you specify the columns you want to see
Page27
Verify ORC indexes
• Switch on additional information like row counts going in/out of Tasks
set hive.tez.exec.print.summary=true;
• Run query with/without Predicate Pushdown to compare row counts:
set hive.optimize.index.filter=false;
-- run query
set hive.optimize.index.filter=true;
-- run query
-- compare results
Page28
Summary
• Partitioning and Predicate Pushdown can greatly enhance query performance
• Predicate Pushdown enhances Partitioning, it does not replace it
• Too many partitions lead to performance problems
• Dynamic Partition loading can lead to problems
• Normally Optimized Dynamic Sorted Partitioning solves these problems
• Sometimes manual distribution can be beneficial
• Carefully design your table layout and data loading
• Sorting is critical for effective predicate pushdown
• If sorting is not an option, bloom filters can be a solution
• Verify data layout with orcfiledump and debug information