ORC File & Vectorization - Improving Hive Data Storage and Query Performance

Hive’s RCFile has been the standard format for storing Hive data for the last 3 years. However, RCFile has limitations because it treats each column as a binary blob without semantics. The upcoming Hive 0.11 will add a new file format named Optimized Row Columnar (ORC) file that uses and retains the type information from the table definition. ORC uses type-specific readers and writers that provide lightweight compression techniques such as dictionary encoding, bit packing, delta encoding, and run-length encoding, resulting in dramatically smaller files. Additionally, ORC can apply generic compression using zlib, LZO, or Snappy on top of the lightweight compression for even smaller files.

However, storage savings are only part of the gain. ORC supports projection, which selects a subset of the columns for reading, so that a query reading only one column reads only the bytes for that column. Furthermore, ORC files include lightweight indexes that record the minimum and maximum values of each column for each set of 10,000 rows and for the entire file. Using pushdown filters from Hive, the file reader can skip entire sets of rows whose minimum and maximum values rule out any match for the query’s filter.

Columnar storage formats like ORC reduce I/O and storage use, but it’s just as important to reduce CPU usage. A technique called vectorized query execution works well with column store formats to do this. Vectorized query execution has been shown to give dramatic speedups, on the order of 10X to 100X, for structured data processing. We describe how we’re adding vectorized query execution to Hive and coupling it to ORC through a vectorized iterator.
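The index-based skipping described in the abstract can be illustrated with a minimal sketch. It assumes a single long column and a filter of the form `value >= threshold`; the class and method names (`RowGroupSkipSketch`, `buildIndex`, `groupsToRead`) are hypothetical and are not part of ORC's reader API. Only the 10,000-row granularity comes from the text.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of min/max index skipping; not ORC's reader API.
class RowGroupSkipSketch {
    static final int ROWS_PER_GROUP = 10_000; // index granularity quoted in the abstract

    // Computes one {min, max} pair for each group of ROWS_PER_GROUP values.
    static long[][] buildIndex(long[] column) {
        int groups = (column.length + ROWS_PER_GROUP - 1) / ROWS_PER_GROUP;
        long[][] index = new long[groups][];
        for (int g = 0; g < groups; g++) {
            int start = g * ROWS_PER_GROUP;
            int end = Math.min(start + ROWS_PER_GROUP, column.length);
            long min = column[start];
            long max = column[start];
            for (int i = start + 1; i < end; i++) {
                min = Math.min(min, column[i]);
                max = Math.max(max, column[i]);
            }
            index[g] = new long[] {min, max};
        }
        return index;
    }

    // Returns the groups that may contain a row with value >= threshold.
    static List<Integer> groupsToRead(long[][] index, long threshold) {
        List<Integer> result = new ArrayList<>();
        for (int g = 0; g < index.length; g++) {
            // If the group's max is below the threshold, no row in it can match.
            if (index[g][1] >= threshold) {
                result.add(g);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        long[] column = new long[30_000];
        for (int i = 0; i < column.length; i++) {
            column[i] = i; // sorted data, so each group covers a distinct range
        }
        long[][] index = buildIndex(column);
        System.out.println(groupsToRead(index, 25_000)); // prints [2]
    }
}
```

With sorted input, two of the three row groups are skipped without reading any row data. On unsorted data the ranges overlap more, so fewer groups can be ruled out.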

Copyright 2013 by Hortonworks and Microsoft
ORC File & Vectorization
Improving Hive Data Storage and Query Performance
June 2013
Page 1
Owen O’Malley
owen@hortonworks.com
@owen_omalley
Jitendra Pandey
jitendra@hortonworks.com
Eric Hanson
ehans@microsoft.com

File Layout
Page 8
[Figure: ORC file layout. Three 256MB stripes, each containing Index Data, Row Data, and a Stripe Footer, followed by the File Footer and Postscript. Index Data and Row Data are each broken down into Column 1 through Column 8, and Column 2 is broken down into Stream 2.1 through Stream 2.4.]

Hive Compound Types
Page 12
[Figure: type tree with one column ID per node: 0 Struct, 1 Int, 2 Map, 3 String, 4 Struct, 5 String, 6 Double, 7 Time.]
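The column IDs on this slide are consistent with numbering the nodes of a type tree in pre-order (parent first, then children left to right). The sketch below shows that numbering on a hypothetical tree. The nesting used here (a root struct holding an int, a map from string to a struct of string and double, and a time) is an assumption chosen to match the ID and type pairs on the slide; the extracted text does not preserve the actual tree shape. `ColumnIdSketch` and `TypeNode` are illustrative names only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical type tree; the nesting is an assumption, not taken from the slide.
class ColumnIdSketch {
    static final class TypeNode {
        final String kind;
        final List<TypeNode> children;

        TypeNode(String kind, TypeNode... children) {
            this.kind = kind;
            this.children = Arrays.asList(children);
        }
    }

    // Visits the tree in pre-order and records "id kind" for each node.
    static void assignIds(TypeNode node, List<String> out) {
        out.add(out.size() + " " + node.kind); // next free id is the count so far
        for (TypeNode child : node.children) {
            assignIds(child, out);
        }
    }

    static List<String> columnIds() {
        TypeNode root = new TypeNode("Struct",
            new TypeNode("Int"),
            new TypeNode("Map",
                new TypeNode("String"),
                new TypeNode("Struct",
                    new TypeNode("String"),
                    new TypeNode("Double"))),
            new TypeNode("Time"));
        List<String> out = new ArrayList<>();
        assignIds(root, out);
        return out;
    }

    public static void main(String[] args) {
        // Prints one "id kind" pair per line, from "0 Struct" to "7 Time".
        columnIds().forEach(System.out::println);
    }
}
```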

Comparison
Page 23
                               RC File   Trevni   Parquet     ORC
Hive Integration               Y         N        N           Y
Active Development             N         N        Y           Y
Hive Type Model                N         N        N           Y
Shred complex columns          N         Y        Y           Y
Splits found quickly           N         Y        Y           Y
Files per bucket               1         many     1 or many   1
Versioned metadata             N         Y        Y           Y
Run length data encoding       N         N        Y           Y
Store strings in dictionary    N         N        Y           Y
Store min, max, sum, count     N         N        N           Y
Store internal indexes         N         N        N           Y
No overhead for non-null       N         N        N           Y (≥ 0.12)
Predicate Pushdown             N         N        N           Y (≥ 0.12)

Why row-at-a-time execution is slow
Page 26
• Hive uses Object Inspectors to work on a row
  • Enables a level of abstraction
  • Costs major performance
  • Exacerbated by using lazy serdes
• Inner loop has many method calls, new() calls, and if-then-else branches
  • Lots of CPU instructions
  • Pipeline stalls, poor instructions/cycle
  • Poor cache locality

How the code works (simplified)
Page 27
class LongColumnAddLongScalarExpression {
  int inputColumn;
  int outputColumn;
  long scalar;

  void evaluate(VectorizedRowBatch batch) {
    long[] inVector =
        ((LongColumnVector) batch.columns[inputColumn]).vector;
    long[] outVector =
        ((LongColumnVector) batch.columns[outputColumn]).vector;
    if (batch.selectedInUse) {
      // Only the rows listed in batch.selected are live.
      for (int j = 0; j < batch.size; j++) {
        int i = batch.selected[j];
        outVector[i] = inVector[i] + scalar;
      }
    } else {
      // All rows in the batch are live.
      for (int i = 0; i < batch.size; i++) {
        outVector[i] = inVector[i] + scalar;
      }
    }
  }
}
No method calls
Low instruction count
Cache locality to 1024 values
No pipeline stalls
SIMD in Java 8
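To see the slide's loop run end to end, here is a minimal, self-contained sketch. The nested `LongColumnVector` and `VectorizedRowBatch` classes are simplified stand-ins written for this example, carrying only the fields the slide uses; they are not Hive's real classes. The batch capacity of 1024 is taken from the slide's "Cache locality to 1024 values" note, and the `addScalar` helper name is hypothetical.

```java
import java.util.Arrays;

// Stand-in types for illustration only; Hive's real classes have more fields.
class VectorizedAddSketch {
    static final int BATCH_SIZE = 1024; // batch capacity quoted on the slide

    static final class LongColumnVector {
        final long[] vector = new long[BATCH_SIZE];
    }

    static final class VectorizedRowBatch {
        final LongColumnVector[] columns;
        int size;                 // number of live rows in the batch
        boolean selectedInUse;    // true when only rows listed in selected[] are live
        final int[] selected = new int[BATCH_SIZE];

        VectorizedRowBatch(int numColumns) {
            columns = new LongColumnVector[numColumns];
            for (int c = 0; c < numColumns; c++) {
                columns[c] = new LongColumnVector();
            }
        }
    }

    // Same loop structure as the slide: out[i] = in[i] + scalar for each live row.
    static void addScalar(VectorizedRowBatch batch, int inputColumn,
                          int outputColumn, long scalar) {
        long[] in = batch.columns[inputColumn].vector;
        long[] out = batch.columns[outputColumn].vector;
        if (batch.selectedInUse) {
            for (int j = 0; j < batch.size; j++) {
                int i = batch.selected[j];
                out[i] = in[i] + scalar;
            }
        } else {
            for (int i = 0; i < batch.size; i++) {
                out[i] = in[i] + scalar;
            }
        }
    }

    public static void main(String[] args) {
        VectorizedRowBatch batch = new VectorizedRowBatch(2);
        for (int i = 0; i < 4; i++) {
            batch.columns[0].vector[i] = i * 10L; // input column holds 0, 10, 20, 30
        }
        // Pretend an earlier filter kept only rows 1 and 3.
        batch.selectedInUse = true;
        batch.selected[0] = 1;
        batch.selected[1] = 3;
        batch.size = 2;
        addScalar(batch, 0, 1, 5L);
        long[] firstFour = Arrays.copyOf(batch.columns[1].vector, 4);
        System.out.println(Arrays.toString(firstFour)); // prints [0, 15, 0, 35]
    }
}
```

Rows that the filter dropped are never touched, so their output slots keep whatever value was there before. Downstream operators are expected to consult `selected` in the same way.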

Preliminary performance results
Page 29
• NOT a benchmark
• 218 million row fact table of real data, 25 columns
• 18GB raw data
• 6 core, 12 thread workstation, 1 disk, 16GB RAM
• select a, b, count(*) from t where c >= const group by a, b -- 53 row result

                   RC non-vectorized           ORC non-vectorized      ORC vectorized
warm start times   (default, not compressed)   (default, compressed)   (default, compressed)
Runtime (sec)      261                         58                      43
Total CPU (sec)    381                         159                     42
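The ratios implied by these figures can be worked out directly. The sketch below only divides the numbers reported on the slide; `SpeedupSketch` and `speedup` are illustrative names. Since the slide states that this is not a benchmark, the ratios describe this one query on this one machine and nothing more.

```java
import java.util.Locale;

// Arithmetic on the slide's reported figures; no new measurements are involved.
class SpeedupSketch {
    // Ratio of a baseline measurement to an improved one.
    static double speedup(double baselineSeconds, double improvedSeconds) {
        return baselineSeconds / improvedSeconds;
    }

    static String line(String label, double baseline, double improved) {
        return String.format(Locale.ROOT, "%s: %.1fx", label, speedup(baseline, improved));
    }

    public static void main(String[] args) {
        // Figures copied from the results table (seconds).
        double rcRuntime = 261, orcRuntime = 58, orcVecRuntime = 43;
        double rcCpu = 381, orcCpu = 159, orcVecCpu = 42;

        System.out.println(line("Runtime, RC vs ORC", rcRuntime, orcRuntime));
        System.out.println(line("Runtime, RC vs ORC vectorized", rcRuntime, orcVecRuntime));
        System.out.println(line("Runtime, ORC vs ORC vectorized", orcRuntime, orcVecRuntime));
        System.out.println(line("CPU, RC vs ORC", rcCpu, orcCpu));
        System.out.println(line("CPU, RC vs ORC vectorized", rcCpu, orcVecCpu));
        System.out.println(line("CPU, ORC vs ORC vectorized", orcCpu, orcVecCpu));
    }
}
```

The numbers suggest that vectorization mainly reduces CPU time (159 to 42 seconds, roughly 3.8x), while wall-clock time improves less (58 to 43 seconds, roughly 1.3x), which would be consistent with a single-disk machine where I/O limits the runtime.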

Thanks to contributors!
Page 30
• Microsoft Big Data:
  • Eric Hanson, Remus Rusanu, Sarvesh Sakalanaga, Tony Murphy, Ashit Gosalia
• Hortonworks:
  • Jitendra Pandey, Owen O’Malley, Gopal V
• Others:
  • Teddy Choi, Tim Chen
Jitendra/Eric are joint leads


ORC File & Vectorization - Improving Hive Data Storage and Query Performance

1. ORC File & Vectorization - Improving Hive Data Storage and Query Performance
2. ORC – Optimized RC File
3. History
4. Remaining Challenges
5. Requirements
6. File Structure
7. Stripe Structure
8. File Layout
9. Compression
10. Integer Column Serialization
11. String Column Serialization
12. Hive Compound Types
13. Compound Type Serialization
14. Generic Compression
15. Column Projection
16. How Do You Use ORC
17. Managing Memory
18. TPC-DS File Sizes
19. ORC Predicate Pushdown
20. Additional Details
21. Current work for Hive 0.12
22. Future Work
23. Comparison
24. Vectorization
25. Vectorization
26. Why row-at-a-time execution is slow
27. How the code works (simplified)
28. Vectorization project
29. Preliminary performance results
30. Thanks to contributors!
