Hive Data Modeling and Query Optimization
Eyad Garelnabi
Improve your Hive query performance through effective modeling and query optimization.
1. Hive Data Modeling & Query Optimization
Eyad Garelnabi
Page 1 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
2. Agenda
• File Formats
• Hive Table Types
• Hive Data Layout
• What About Data Modeling
• Hive Join Strategies
• Optimizing Queries
3. File Formats: Text, Parquet, ORC, etc.
4. Text
• Requires SerDes
– CSV: comma delimited
– Additional SerDes online
• Does not compress well
• Row based separation
• Slow to read and write
• Usually used for initial data load
5. Parquet
• Faster access to data
• Efficient compression
• Effective for select queries
6. ORCFile
• High Performance: Splittable, columnar storage file
• Efficient Reads: Data is broken into large "stripes" for efficient reads
• Fast Filtering: Built-in index, min/max and metadata for fast filtering of blocks; bloom filters if desired
• Efficient Compression: Complex row types are decomposed into primitives, giving massive compression and efficient comparisons for filtering
• Precomputation: Built-in aggregates per block (min, max, count, sum, etc.)
• Proven at 300 PB scale: Facebook uses ORC for their 300 PB Hive warehouse
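A minimal sketch of how these ORC features can be switched on, assuming a text staging table that is rewritten into ORC. The table names (staging_employees_txt, employees_orc) and the property values are hypothetical and chosen only for illustration. orc.compress and orc.bloom.filter.columns are standard ORC table properties, but bloom filters need a reasonably recent Hive, so check what your version supports.

```sql
-- Hypothetical staging table over delimited text files (the usual landing format).
CREATE TABLE staging_employees_txt (
  id int,
  name string,
  department string,
  country string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

-- ORC copy of the same data; property values are illustrative, not tuned.
CREATE TABLE employees_orc (
  id int,
  name string,
  department string,
  country string
)
STORED AS ORC
TBLPROPERTIES (
  'orc.compress' = 'ZLIB',           -- ZLIB is the usual ORC default; SNAPPY is a common alternative
  'orc.bloom.filter.columns' = 'id'  -- optional bloom filter for lookups on id
);

-- Rewrite the text rows into ORC.
INSERT OVERWRITE TABLE employees_orc
SELECT id, name, department, country
FROM staging_employees_txt;
```

Keeping the text table as a landing area and querying only the ORC copy matches the deck's advice that text is mainly useful for the initial load.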
7. etc.
• Avro
– Schema defined in JSON
– Good for select * queries
– Slow to read for other queries
• SequenceFile
– Optimized for Java MapReduce jobs
– Inefficient for Hive
– Rarely used
8. High Compression with ORCFile
9. HIVE Tables: External, Managed, Views
10. External Tables
• Hive manages schema/metadata
• When dropped, only schema is deleted
CREATE EXTERNAL TABLE my_external_table (
  id int,
  name string,
  department string,
  country string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS ORC;
11. Internal/Managed Tables
• Hive manages schema and data
• Data is saved by default under the warehouse directory, e.g. /user/hive/warehouse/my_managed_table (controlled by hive.metastore.warehouse.dir)
• When dropped, both schema and data are deleted
CREATE TABLE my_managed_table (
  id int,
  name string,
  department string,
  country string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS PARQUET
LOCATION '/usr/Scotiabank/demo';
12. Views
• Virtual table
• No data is stored to HDFS
• When dropped, only schema is deleted
-- A view's column list takes names only; types come from the SELECT.
CREATE VIEW my_view (id, name, department, country)
AS {select_statement};
13. HIVE Data Layout: Partitioning, Bucketing and Skews
14. Data Abstractions in Hive
Partitions, buckets and skews facilitate faster, more direct data access.
[Diagram: Database → Table → Partition → Bucket, with skewed keys and unskewed keys; each level is optional per table]
15. Partitioning
• Breaks up data horizontally by column value sets
• When partitioning, you use one or more "virtual" columns to break up the data
• Virtual columns cause directories to be created in HDFS
– Files for that partition are stored within that subdirectory
• Partitioning makes queries go fast
– Partitioning works particularly well when querying with the "virtual column"
– If queries use various columns, it may be hard to decide which columns to partition by
16. Partitioning
• Static Partitioning
– The partition value is named explicitly in the INSERT statement
CREATE TABLE static_partioned_table (
  id int,
  name string,
  department string
)
PARTITIONED BY (country string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS ORC;
INSERT OVERWRITE TABLE static_partioned_table PARTITION (country='canada')
SELECT id, name, department FROM my_external_table WHERE country='canada';
17. Partitioning
• Dynamic Partitioning
– Partitions are created automatically from the values of the partition column, which must come last in the SELECT list
CREATE TABLE dynamic_partioned_table (
  id int,
  name string,
  department string
)
PARTITIONED BY (country string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS ORC;
INSERT OVERWRITE TABLE dynamic_partioned_table PARTITION (country)
SELECT id, name, department, country FROM my_external_table;
18. Partitioning
• IMPORTANT: dynamic partitioning will not work by default
– Before loading data, make sure:
– set hive.exec.dynamic.partition=true
• Also, set the maximum number of partitions to avoid going overboard
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.max.dynamic.partitions=1000;
set hive.exec.max.dynamic.partitions.pernode=1000;
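Put together, a dynamic-partition load might look like the sketch below. It assumes the dynamic_partioned_table and my_external_table definitions from the earlier slides; the limits are the slide's example values, not recommendations.

```sql
-- Session settings from the slide; adjust the limits to your data.
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
SET hive.exec.max.dynamic.partitions=1000;
SET hive.exec.max.dynamic.partitions.pernode=1000;

-- The partition column must come last in the SELECT list.
INSERT OVERWRITE TABLE dynamic_partioned_table PARTITION (country)
SELECT id, name, department, country
FROM my_external_table;

-- List the partitions Hive created (one HDFS subdirectory per distinct country).
SHOW PARTITIONS dynamic_partioned_table;
```

If the load fails with a "too many dynamic partitions" style error, that usually means the partition column has more distinct values than the configured limits allow, which is a hint to reconsider the partition key rather than just raise the limits.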
19. Partitioning
• Multi-layer partitioning is possible but often not efficient
– The number of partitions becomes too large and will overwhelm the Metastore
• Limit the number of partitions. Fewer may be better
– 1,000 partitions will often perform better than 10,000
• Hadoop likes big files: avoid creating partitions with mostly small files
• Only use when
– Data is very large and there are lots of table scans
– Data is queried against a particular column frequently
– Column data has low cardinality
20. Partitioning
• Often better to partition by Date, not Year/Month
– By date you will have only about 365 partitions per year
– Partitioning by date lets you easily write queries that need 'BETWEEN' and 'IN' (https://community.hortonworks.com/questions/29031/best-pratices-for-hive-partitioning-especially-by.html)
SELECT * FROM TableA WHERE DateStamp IN ('2015-01-01', '2015-02-03', '2016-01-01')
VS
SELECT * FROM TableB WHERE (YEAR=2015 AND MONTH=01 AND DAY=01) OR (YEAR=2015 AND MONTH=02 AND DAY=03) OR (YEAR=2016 AND MONTH=01 AND DAY=01)
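The 'BETWEEN' case can be sketched the same way. The table events_by_date and its columns are hypothetical; the sketch assumes the partition value is stored as a yyyy-MM-dd string, so that string comparison orders dates correctly.

```sql
-- Hypothetical table partitioned by a single date string (yyyy-MM-dd).
CREATE TABLE events_by_date (
  id int,
  payload string
)
PARTITIONED BY (datestamp string)
STORED AS ORC;

-- A range predicate on the partition column lets Hive read only the
-- partitions for January 2015 instead of scanning the whole table.
SELECT id, payload
FROM events_by_date
WHERE datestamp BETWEEN '2015-01-01' AND '2015-01-31';
```

With separate year, month and day partition columns, the same range would need a compound predicate across three columns, which is the awkwardness the slide's comparison points at.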
21. Bucketing
• Breaks up rows into a fixed number of files (buckets) by the hash of a key column
• When bucketing, you specify the number of buckets
• Works particularly well when a lot of queries contain joins
CREATE TABLE bucketed_table (
  id int,
  name string,
  department string,
  country string
)
CLUSTERED BY (id) INTO 12 BUCKETS
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS ORC;
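One way to see that buckets are individually addressable is bucket sampling. The sketch below assumes bucketed_table from the slide above and that it was loaded with bucketing enforced; the choice of bucket 3 is arbitrary.

```sql
-- Sampling on the clustering column with a matching bucket count lets Hive
-- read a single bucket file rather than the whole table.
SELECT id, name
FROM bucketed_table TABLESAMPLE (BUCKET 3 OUT OF 12 ON id) s;
```

The same property, that rows with equal keys always land in the same bucket number, is what join strategies on bucketed tables rely on.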
22. Bucketing
• IMPORTANT: the bucketing specified at table creation is NOT enforced when the table is written to
• So when writing data, make sure:
– hive.enforce.bucketing = true (needed on Hive 1.x and earlier; Hive 2.0 and later always enforce bucketing)
SET hive.enforce.bucketing = true;
SET hive.exec.dynamic.partition.mode=nonstrict;
INSERT INTO TABLE sale PARTITION (xdate, state)
SELECT * FROM staging_table;
23. Bucketing
• Works well when there is very large data volume and most queries are joins
• Partitioning and bucketing may be combined, of course
– Be careful not to wind up with very many small files that can overwhelm the NameNode
– Ideal file size is 200-500 MB
• Partition and bucket frequently joined tables in a similar way to improve join efficiency
CREATE TABLE sale (
  id int,
  amount decimal,
  ...
)
PARTITIONED BY (xdate string, state string)
CLUSTERED BY (id) SORTED BY (id) INTO 256 BUCKETS;
24. Skewed Tables and List Bucketing
• Use when a table is skewed, with one or more column values taking up most of the space
• By specifying the values that appear most often in the keys (in this example 'key1' and 'key2'), Hive will split those into separate files automatically and take this into account during queries so that it can skip whole files if possible
• "STORED AS DIRECTORIES" is called "list bucketing"
– The table is skewed, and each part is also stored as a separate directory
– 1 directory for each skewed key value, 1 directory for all other keys
CREATE TABLE mytable (
  key STRING,
  value STRING,
  …
)
SKEWED BY (key) ON ('key1', 'key2')
STORED AS DIRECTORIES;
25. Data Abstractions in Hive
Partitions, buckets and skews facilitate faster, more direct data access.
[Diagram: Database → Table → Partition → Bucket, with skewed keys and unskewed keys; each level is optional per table]
26. Best Practice: When to use Partitioning/Bucketing/Skews
• Partitioning is useful for chronological columns that don't have a very high number of possible values
– You don't want to end up with millions of partitions
• Bucketing is most useful for tables that are "most often" joined together on the same key
– For example: joins by a patient-ID or customer-ID
– Make sure the bucket count matches on both tables involved in the join
• Skews are useful when one or two column values dominate the table
– Hive can avoid whole files when querying
27. What About Data Modeling?
28. Data Modeling in Hadoop
• No data modeling à la DW/RDBMS
• Decisions on data layout happen at the file/folder level
– This is where partitioning, bucketing and skewing come in
• How far should we denormalize?
– As far as it makes sense
– Usually denormalize frequently joined tables
– Be mindful of the memory implications of very wide tables (thousands of columns)
29. Data Modeling in Hadoop
• Can we ALTER an existing table to add partitioning or bucketing?
– No
– Create a new partitioned/bucketed table and copy the data over
• Are there limits on the number of columns possible in Hive?
– No "hard" limit from Hive
– File format memory requirements may limit us though
– ORC was tested with up to 20,000 columns before running out of memory
– Be mindful of memory implications when designing wide tables
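The "create a new table and copy the data over" step could look roughly like this. The table names employees_flat (the existing unpartitioned table) and employees_partitioned are hypothetical, and the final swap is left commented out because it is destructive.

```sql
-- New table with the desired layout.
CREATE TABLE employees_partitioned (
  id int,
  name string,
  department string
)
PARTITIONED BY (country string)
STORED AS ORC;

SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- Copy the data; the partition column goes last in the SELECT list.
INSERT OVERWRITE TABLE employees_partitioned PARTITION (country)
SELECT id, name, department, country
FROM employees_flat;

-- Once the copy has been verified, the tables can be swapped:
-- DROP TABLE employees_flat;
-- ALTER TABLE employees_partitioned RENAME TO employees_flat;
```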
30. HIVE Join strategies: Choose the right JOIN
31. Shuffle Joins – the default

customer
first   last     id
Nick    Toner    11911
Jessie  Simonds  11912
Kasi    Lamers   11913
Rodger  Clayton  11914
Verona  Hollen   11915

order
cid    price  quantity
4150   10.50  3
11914  12.25  27
3491   5.99   5
2934   39.99  22
11914  40.50  10

-- order is a reserved word in HiveQL, so it is quoted with backticks
SELECT * FROM customer JOIN `order` ON customer.id = `order`.cid;

Mapper (customer): { id: 11911, { first: Nick, last: Toner }} { id: 11914, { first: Rodger, last: Clayton }} …
Mapper (order): { cid: 4150, { price: 10.50, quantity: 3 }} { cid: 11914, { price: 12.25, quantity: 27 }} …
Reducer: { id: 11914, { first: Rodger, last: Clayton }} { cid: 11914, { price: 12.25, quantity: 27 }}
Reducer: { id: 11911, { first: Nick, last: Toner }} { cid: 4150, { price: 10.50, quantity: 3 }} …

Identical keys are shuffled to the same reducer. The join is done reduce-side. Expensive from a network utilization standpoint.
32. Broadcast Join (aka Map-side Join)
• Star schemas (e.g. dimension tables)
• Good when table is small enough to fit in RAM
33. Using Broadcast Join
• Set hive.auto.convert.join = true
• Hive then automatically uses broadcast join, if possible
– Small tables are held in memory by all nodes
• Used for star-schema type joins common in data warehousing use cases
• hive.auto.convert.join.noconditionaltask.size determines the data size for automatic conversion to broadcast join:
– Default 10MB is too low (check your default)
– Recommended: 256MB for 4GB container
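As a sketch, the settings above could be applied per session as follows. The size property is given in bytes. The fact and dimension tables (sales_fact, dim_country) and their columns are hypothetical, and the 256 MB figure is the slide's suggestion for a 4 GB container, not a universal value.

```sql
SET hive.auto.convert.join=true;
SET hive.auto.convert.join.noconditionaltask=true;
-- 256 MB expressed in bytes: 256 * 1024 * 1024 = 268435456
SET hive.auto.convert.join.noconditionaltask.size=268435456;

-- Hypothetical star-schema query: sales_fact is large, dim_country is small
-- enough to be broadcast to every task.
SELECT d.country_name, SUM(f.amount) AS total_amount
FROM sales_fact f
JOIN dim_country d ON f.country_id = d.country_id
GROUP BY d.country_name;
```

Whether Hive actually picks a broadcast join depends on the size it estimates for the small table, so it is worth confirming with EXPLAIN.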
34. Sort-Merge-Bucket join: When both are too large for memory

customer
first   last     id
Nick    Toner    11911
Jessie  Simonds  11912
Kasi    Lamers   11913
Rodger  Clayton  11914
Verona  Hollen   11915

order
cid    price  quantity
4150   10.50  3
11914  12.25  27
11914  40.50  10
12337  39.99  22
15912  40.50  10

-- order is a reserved word in HiveQL, so it is quoted with backticks
SELECT * FROM customer JOIN `order` ON customer.id = `order`.cid;

CREATE TABLE customer (id int, first string, last string)
CLUSTERED BY (id) SORTED BY (id) INTO 32 BUCKETS;
CREATE TABLE `order` (cid int, price float, quantity int)
CLUSTERED BY (cid) SORTED BY (cid) INTO 32 BUCKETS;
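The slide shows only the table layout. A sketch of the session settings commonly used to let Hive choose a sort-merge-bucket join on such tables is given below; these properties are not mentioned in the deck, so treat them as an assumption and verify them against your Hive version.

```sql
-- Not from the slides: settings typically enabled for SMB joins.
SET hive.auto.convert.sortmerge.join=true;
SET hive.optimize.bucketmapjoin=true;
SET hive.optimize.bucketmapjoin.sortedmerge=true;

-- Both tables are bucketed and sorted on the join key with the same bucket count,
-- so matching buckets can be merged without a full shuffle.
SELECT c.id, c.first, c.last, o.price, o.quantity
FROM customer c
JOIN `order` o ON c.id = o.cid;
```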
35. Hive Join Strategies
• Shuffle Join
– Approach: Join keys are shuffled using map/reduce and joins are performed reduce-side.
– Pros: Works regardless of data size or layout.
– Cons: Most resource-intensive and slowest join type.
• Broadcast Join
– Approach: Small tables are loaded into memory in all nodes; the mapper scans through the large table and joins.
– Pros: Very fast, single scan through the largest table.
– Cons: All but one table must be small enough to fit in RAM.
• Sort-Merge-Bucket Join
– Approach: Mappers take advantage of co-location of keys to do efficient joins.
– Pros: Very fast for tables of any size.
– Cons: Data must be sorted and bucketed ahead of time.
All join types are now more efficient with Tez.
36. More Join Strategies
• Take a look at this blog posting for an explanation of joins: http://henning.kropponline.de/2016/10/09/hive-join-strategies/
• A search on Google will return more join strategies than what has been covered here
• Keep in mind that most benchmarks were done using MapReduce processing rather than Tez. Your performance should be better due to the in-memory processing nature of Tez.
37. Writing fast queries: Techniques to optimize your queries
38. Optimizing HIVE queries
  1. Use Tez
  2. Use ORCFile
  3. Use Vectorization
  4. Use Cost Based Optimization (CBO)
  5. Write good SQL
  6. Use Hive Explain
  7. Consider Hive LLAP
39. Technique #1: TEZ vs MR
40. Understanding Tez vs MapReduce
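For reference, switching a session between the two engines is a one-line setting. The sketch assumes Tez is installed on the cluster.

```sql
-- Run subsequent queries in this session on Tez.
SET hive.execution.engine=tez;

-- To compare against classic MapReduce, switch back with:
-- SET hive.execution.engine=mr;
```

Running the same query under both settings is a simple way to see the difference on your own data rather than relying on published benchmarks.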
41. Technique #2: use ORCFile
42. ORCFile – Efficient Columnar Format
• High Performance: Splittable, columnar storage file
• Efficient Reads: Data is broken into large "stripes" for efficient reads
• Fast Filtering: Built-in index, min/max and metadata for fast filtering of blocks; bloom filters if desired
• Efficient Compression: Complex row types are decomposed into primitives, giving massive compression and efficient comparisons for filtering
• Precomputation: Built-in aggregates per block (min, max, count, sum, etc.)
• Proven at 300 PB scale: Facebook uses ORC for their 300 PB Hive warehouse
43. Technique #3: Use Vectorization
44. Using Vectorization
• Vectorized query execution is a Hive feature that greatly reduces the CPU usage for typical query operations like scans, filters, aggregates, and joins
• Vectorized query execution streamlines operations by processing a block of 1024 rows at a time (instead of 1 row at a time)
• ONLY works with ORCFiles
SET hive.vectorized.execution.enabled = true;
SET hive.vectorized.execution.reduce.enabled=true;
45. Technique #4: Use Cost-based Optimization
46. Hive Cost-Based Optimization (CBO)
• The Cost-Based Optimization (CBO) engine uses statistics within Hive tables to produce optimal query plans
• Two types of stats are used for optimization:
– Table stats
– Column stats
• Uses an open-source framework called Calcite (formerly Optiq)
47. Step 1: ensure HIVE has table statistics
• Stats are collected at the table level automatically when:
SET hive.stats.autogather=true;
• If you have an existing table without stats collected:
ANALYZE TABLE table_name COMPUTE STATISTICS;
• For column-level statistics:
– HDP 2.1
ANALYZE TABLE table_name COMPUTE STATISTICS FOR COLUMNS col1, col2;
– HDP 2.2
ANALYZE TABLE table_name COMPUTE STATISTICS FOR COLUMNS;
48. CBO with Partitioned Tables
• When a table is partitioned, you need to specify the partition when collecting statistics:
ANALYZE TABLE table_name PARTITION (col1='x') COMPUTE STATISTICS;
ANALYZE TABLE table_name PARTITION (col1='x') COMPUTE STATISTICS FOR COLUMNS;
49. Step 2: set HIVE properties to enable CBO
SET hive.cbo.enable=true;
SET hive.compute.query.using.stats = true;
And now every query you run will use CBO…
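Steps 1 and 2 can be combined into a short session like the sketch below. It uses my_managed_table from the earlier slide as a stand-in for your own table. The DESCRIBE FORMATTED checks and the hive.stats.fetch.column.stats setting are additions that do not appear on the slides.

```sql
-- Step 1: gather table-level and column-level statistics.
ANALYZE TABLE my_managed_table COMPUTE STATISTICS;
ANALYZE TABLE my_managed_table COMPUTE STATISTICS FOR COLUMNS;

-- Check that statistics were recorded: table parameters such as numRows
-- appear in the first output, per-column statistics in the second.
DESCRIBE FORMATTED my_managed_table;
DESCRIBE FORMATTED my_managed_table id;

-- Step 2: enable CBO for the session.
SET hive.cbo.enable=true;
SET hive.compute.query.using.stats=true;
-- Assumption, not on the slides: lets the optimizer read column statistics.
SET hive.stats.fetch.column.stats=true;
```

Statistics go stale as data is loaded, so the ANALYZE statements usually belong in the load pipeline rather than being run once.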
50. Technique #5: Write Smart SQL
51. Query design matters
• This is Big Data we're talking about
• So consider performance in every query you write
• There are many ways to write SQL with the same functional results, but often varying performance characteristics
• Avoid joins when possible, and choose the right join when not
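To make "avoid joins when possible" concrete, here is a sketch of one question (the latest order per customer) written two ways. The orders table and its columns are hypothetical. The two variants agree only when each customer has a single latest order: on ties, the first returns every tied row and the second returns one of them.

```sql
-- Variant 1: join against an aggregate. The orders table is read twice
-- and the join adds a shuffle.
SELECT o.customer_id, o.order_id, o.amount
FROM orders o
JOIN (
  SELECT customer_id, MAX(order_ts) AS max_ts
  FROM orders
  GROUP BY customer_id
) latest
  ON o.customer_id = latest.customer_id
 AND o.order_ts = latest.max_ts;

-- Variant 2: window function, no join. The table is read once.
SELECT customer_id, order_id, amount
FROM (
  SELECT customer_id, order_id, amount,
         ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY order_ts DESC) AS rn
  FROM orders
) ranked
WHERE rn = 1;
```

Which variant is faster depends on data volume and layout, so compare the plans with EXPLAIN and time both on representative data before settling on one.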
52. Technique #6: Use Hive Explain
53. HIVE EXPLAIN – understanding your query plan
• It is an advanced tool to debug what Hive is doing
• Look at the sequence of operations and make sure it looks reasonable
• Validate the join type (e.g. we've asked for a map-side join; did it get executed that way?)
At the end of the day, if the plan is bad, everything else (ORC, Vectorization, etc.) may not matter.
Take a look at the link below on how to understand and analyze your query plan: https://www.slideshare.net/HadoopSummit/how-to-understand-and-analyze-apache-hive-query-execution-plan-for-performance-debugging
EXPLAIN {Hive Query}
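A concrete instance, assuming the customer and order tables from the join slides:

```sql
-- Prints the plan without running the query.
EXPLAIN
SELECT c.id, c.last, o.price
FROM customer c
JOIN `order` o ON c.id = o.cid;
```

In the output, the operator used for the join is the thing to check. A map-side join typically shows up as a map join operator, while a shuffle join shows a join operator in the reduce stage; the exact labels vary by Hive version and execution engine.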
54. Technique #7: Consider Hive LLAP
55. LLAP Key Benefits
• Uses persistent query servers to avoid long startup times and deliver fast SQL.
• Enables queries as fast as sub-second in Hive by keeping all data and servers running and in-memory all the time.
• Shares its in-memory cache among all SQL users, maximizing the use of this scarce resource.
• Has fine-grained resource management and preemption, making it great for concurrent access across many users.
• Great for cloud because it caches data in memory and keeps it compressed, overcoming long cloud storage access times and stretching the amount of data you can fit in RAM.
56. Thank You