Data organization: hive meetup

The document discusses techniques for optimizing data organization and performance in Hive, including:
- Choosing partition columns carefully: the slides use VIN and customer_id as examples of partition keys to question, and suggest an average partition of at least 1 GB with fewer than 1000 partitions per query.
- Using the right number and size of buckets to avoid performance issues from too many small files or skewed data distribution.
- Weighing denormalization against JOIN queries, which techniques like broadcast joins make cheaper.
- Storing data in its natural types, such as numbers instead of strings, to enable predicate push-down and better performance.
- Using temporary tables, optionally stored in memory, to optimize queries involving data reorganization or distinct slices.

Page1 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Hive: Data Organization for Performance
Gopal Vijayaraghavan

In this episode
All BigData problems are primarily lookup problems
All Lookup problems are really Storage problems
All Storage problems turn into ETL problems
ETL problems are all about the Data
Data navigation?
Data organization?
Data ingestion?
It’s Big?

Good idea: Do things that scale!
There are many problems like this, but this one is mine

Partitions
If you have a database on cars and you partition on VIN#
If you have a database on sales and you partition on customer_id
Rule of thumb: average partition is >= 1 GB and total number of partitions per query is < 1000
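
The rule of thumb is simple arithmetic, so it can be checked before a partition column is chosen. A minimal sketch in Python, assuming a hypothetical 10 TB table and a helper name (`check_partitioning`) that does not come from the talk:

```python
GB = 1024 ** 3

def check_partitioning(total_bytes, partitions_in_table, partitions_per_query):
    """Apply the slide's rule of thumb: average partition >= 1 GB and
    fewer than 1000 partitions touched per query."""
    avg_partition_bytes = total_bytes / partitions_in_table
    return {
        "avg_partition_gb": avg_partition_bytes / GB,
        "size_ok": avg_partition_bytes >= GB,
        "count_ok": partitions_per_query < 1000,
    }

# Hypothetical 10 TB sales table (all figures are illustrative assumptions).
table_bytes = 10 * 1024 * GB
by_day = check_partitioning(table_bytes, 3 * 365, 30)             # three years of daily partitions
by_customer = check_partitioning(table_bytes, 5_000_000, 50_000)  # one partition per customer
```

Under these assumed numbers the daily layout passes both checks, while one partition per customer fails both: the average partition is far below 1 GB and a query touches far more than 1000 partitions.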

Buckets
If you have more files than rows, you’ve definitely got bucketing wrong
“clustered by” != “cluster by”
Bucketing on a skewed column slows down ETL a *lot* (for no win)
If you have partitions, sort-merge bucket-mapjoin can be slower than a shuffle!!

Buckets - II
Histograms!
select explode(histogram_numeric(hash(<col>) % <n-bucket>, <n-bucket>)) as h from <table>;
The Curse of 31 & the last byte
If you have buckets & partitions, always remember to ETL with
set hive.optimize.sort.dynamic.partition=true;
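
The slide does not spell out the "Curse of 31". One plausible reading, sketched below in Python, assumes string keys are bucketed with a Java-style polynomial hash that multiplies by 31 per byte. The function names are hypothetical and this is not Hive's code. Under that assumption, with 31 buckets every term except the last byte is a multiple of 31, so short keys are bucketed by their last byte alone.

```python
from collections import Counter

def string_hash_31(value: str) -> int:
    """Java-style polynomial hash, h = h * 31 + byte, wrapped to 32 bits.
    Assumption: ASCII keys only (Java bytes are signed, Python's are not)."""
    h = 0
    for b in value.encode("utf-8"):
        h = (h * 31 + b) & 0xFFFFFFFF
    return h

def bucket_for(value: str, n_buckets: int) -> int:
    """Assumed bucket assignment: clear the sign bit, then take the remainder."""
    return (string_hash_31(value) & 0x7FFFFFFF) % n_buckets

def bucket_histogram(values, n_buckets):
    """Rows per bucket, the same question the histogram query above asks."""
    return Counter(bucket_for(v, n_buckets) for v in values)

# Short keys that all end in "1" (ASCII 49, and 49 % 31 == 18).
keys = ["a1", "b1", "ab1", "zz1", "q1"]
skewed = bucket_histogram(keys, 31)   # every key lands in bucket 18
spread = bucket_histogram(keys, 32)   # a different bucket count breaks the pattern
```

For longer keys the 32-bit wrap-around adds some variation, so the collapse is not always this total. The sketch only shows why the bucket count and the shape of the key can interact badly.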

Denormalization
Denormalization can turn a compute problem into an IO/lookup problem. But if you then optimize that with compression, you get a compute problem again.
If you think JOINs are bad, you probably haven’t moved out of MapReduce.
Broadcast joins are good & dynamically partitioned broadcast joins can scale that ~1000x
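
The idea behind a broadcast join can be sketched in a few lines: the small table is turned into an in-memory hash table that every task receives, and the large table is streamed past it without being shuffled. The Python below is a stand-in for that idea under hypothetical table and column names, not Hive's implementation.

```python
def broadcast_hash_join(big_rows, small_rows, key):
    """Inner join: build a hash table from the small side, stream the big side."""
    lookup = {}
    for row in small_rows:
        lookup.setdefault(row[key], []).append(row)
    for row in big_rows:                      # big side is scanned once, never shuffled
        for match in lookup.get(row[key], []):
            yield {**match, **row}

# Hypothetical fact and dimension tables.
sales = [
    {"store_id": 1, "amount": 10},
    {"store_id": 2, "amount": 7},
    {"store_id": 1, "amount": 3},
]
stores = [{"store_id": 1, "city": "Oslo"}, {"store_id": 2, "city": "Pune"}]
joined = list(broadcast_hash_join(sales, stores, "store_id"))
```

The sketch only works while the small side fits in memory. Per the slide, dynamically partitioned broadcast joins are what scale past that limit.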

Indexes?
Indexes in Hive barely help in a columnar world – incremental rebuild isn’t really there
ORC maintains internal bloom filter indexes (PARQUET-41 too)
You can store your indexes as ORC files, if you want, so that you can have an index in your index, to speed up indexes
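
To show why a bloom filter inside the file can stand in for a separate index, here is a minimal Python sketch. It assumes a plain bit array and SHA-256-derived hash positions, which is not how ORC implements its filters, and the block names are hypothetical. A filter can prove a value is absent from a block, so the reader skips that block without decoding it.

```python
import hashlib

class BloomFilter:
    """Tiny Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, n_bits=1024, n_hashes=3):
        self.n_bits = n_bits
        self.n_hashes = n_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, value):
        for seed in range(self.n_hashes):
            digest = hashlib.sha256(f"{seed}:{value}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.n_bits

    def add(self, value):
        for pos in self._positions(value):
            self.bits |= 1 << pos

    def might_contain(self, value):
        return all((self.bits >> pos) & 1 for pos in self._positions(value))

def blocks_to_read(blocks, wanted):
    """Keep only the blocks whose filter cannot rule the value out."""
    return [name for name, bloom, _rows in blocks if bloom.might_contain(wanted)]

# Two hypothetical blocks of customer ids, each with its own filter.
blocks = []
for name, rows in [("block-0", ["c1", "c2", "c3"]), ("block-1", ["c7", "c8", "c9"])]:
    bloom = BloomFilter()
    for value in rows:
        bloom.add(value)
    blocks.append((name, bloom, rows))

candidates = blocks_to_read(blocks, "c8")  # always includes "block-1"
```

A lookup for `"c8"` must read `block-1`. It will usually skip `block-0`, but a false positive can still force an unnecessary read, which costs time and never correctness.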

Schema & Predicate Push Down
Never store a number as a string, because guess what: “11” < “9” and “11.0” != “11” – transform, then load
Predicate push-down cannot fight the type system (♫ … breaking blocks in the hot sun ♫)
UDFs applied on the data column are always a bad idea for fast filtering.
If you need case-insensitive lookups, always store as UPPER/lower.
If you need LIKE “%.twimg.com”, store like DNS does “.com.twimg…”
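
Two of these points are easy to verify directly. The Python sketch below shows the string-versus-number comparisons from the slide, then the reversed-domain trick that turns a suffix match into a prefix match. The helper name `reverse_domain` and the host names are illustrative assumptions.

```python
def reverse_domain(host: str) -> str:
    """Store "pbs.twimg.com" as ".com.twimg.pbs" so a suffix match becomes a prefix match."""
    return "." + ".".join(reversed(host.split(".")))

# Strings compare character by character, numbers by value.
assert "11" < "9"                      # lexicographic: "1" sorts before "9"
assert 11 > 9                          # numeric
assert "11.0" != "11"                  # different strings ...
assert float("11.0") == float("11")    # ... but the same number once parsed

hosts = ["pbs.twimg.com", "abs.twimg.com", "example.org"]
stored = sorted(reverse_domain(h) for h in hosts)

# LIKE '%.twimg.com' on the original column becomes a prefix test on the stored form,
# and in sorted data all matches sit next to each other.
twimg = [s for s in stored if s.startswith(".com.twimg.")]
```

Because the matching rows share a prefix, a sorted or min/max-indexed layout can narrow the scan to one contiguous range, which a leading-wildcard LIKE cannot do.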

Temporary tables
In-memory temp-tables
set hive.exec.temporary.table.storage=memory;
Easiest way to reorganize data temporarily or to produce a “distinct slice”
“create temporary table if not exists stored as orc as select …”
Can be used for pagination queries to good effect, for display tools

Complex types & Nesting
There’s pretty much no advantage to using structs – they’re nearly columns, without any of the good stuff
Maps – not so bad, but handle with care
Maps are way better than 4000 columns, most of them null
Arrays – ignore mostly (JDBC!!)

Schema Evolution
Add columns, never remove them
Schemas are per-partition
Remember, the partitions don’t change their schema after they’re created
All new inserts have new schema
After schema update, inserting data into old partitions is a recipe for disaster
Type changes for a column also complicate things (except for simple stuff like Int -> BigInt or Float -> Double)
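
The two "simple" type changes are both widenings, which is why old data survives them. A small Python sketch of the reasoning, using the standard 32-bit and 64-bit ranges and the `struct` module to emulate Float and Double (the helper names are hypothetical):

```python
import struct

INT_MAX, BIGINT_MAX = 2**31 - 1, 2**63 - 1

def fits_int(value: int) -> bool:
    """True if the value is representable as a 32-bit signed Int."""
    return -2**31 <= value <= INT_MAX

def as_float32(value: float) -> float:
    """Round a Python float (a 64-bit double) to 32-bit precision and back."""
    return struct.unpack("<f", struct.pack("<f", value))[0]

# Widening: every Int is a valid BigInt, and every Float is exactly representable as a Double.
assert INT_MAX <= BIGINT_MAX
f = as_float32(0.1)
assert struct.unpack("<d", struct.pack("<d", f))[0] == f

# Narrowing is where old data breaks: values written as BigInt or Double may not survive.
assert not fits_int(2**31)
assert as_float32(0.1) != 0.1
```

Any change that goes the other way, or crosses type families, needs the old partitions to be rewritten or read with care.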

Questions?
?
