ORC Files

Hive’s RCFile has been the standard format for storing Hive data for the last three years. However, RCFile has limitations because it treats each column as a binary blob without semantics. The upcoming Hive 0.11 will add a new file format named Optimized Row Columnar (ORC) file that uses and retains the type information from the table definition. ORC uses type-specific readers and writers that provide lightweight compression techniques such as dictionary encoding, bit packing, delta encoding, and run-length encoding, resulting in dramatically smaller files. Additionally, ORC can apply generic compression using zlib, LZO, or Snappy on top of the lightweight compression for even smaller files. However, storage savings are only part of the gain. ORC supports projection, which selects subsets of the columns for reading, so that queries reading only one column read only the required bytes. Furthermore, ORC files include lightweight indexes that record the minimum and maximum values of each column for each set of 10,000 rows and for the entire file. Using pushdown filters from Hive, the file reader can skip entire sets of rows that cannot match the query. Finally, ORC works together with the upcoming query vectorization work, providing a high-bandwidth reader/writer interface.
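The row-skipping idea described above can be sketched in a few lines. This is only an illustration under stated assumptions, not ORC's reader: the `RowGroupStats` class, `build_index`, and `groups_to_read` are hypothetical names, and the predicate is limited to a simple integer range. Only the 10,000-row granularity comes from the text.

```python
from dataclasses import dataclass
from typing import List

ROWS_PER_GROUP = 10_000  # index granularity stated in the abstract


@dataclass
class RowGroupStats:
    """Min/max entry for one column over one set of rows (hypothetical layout)."""
    first_row: int
    minimum: int
    maximum: int


def build_index(values: List[int], rows_per_group: int = ROWS_PER_GROUP) -> List[RowGroupStats]:
    """Record the minimum and maximum of each consecutive group of rows."""
    index = []
    for start in range(0, len(values), rows_per_group):
        group = values[start:start + rows_per_group]
        index.append(RowGroupStats(start, min(group), max(group)))
    return index


def groups_to_read(index: List[RowGroupStats], low: int, high: int) -> List[int]:
    """Return first-row offsets of the groups that may hold a value in [low, high].

    A group is skipped when its [minimum, maximum] range cannot overlap the
    predicate range, so its bytes never need to be read or decompressed.
    """
    return [g.first_row for g in index if g.maximum >= low and g.minimum <= high]


# A sorted column of 30,000 values yields three row groups. The predicate
# 12_000 <= x <= 12_500 can only be satisfied inside the second group.
column = list(range(30_000))
index = build_index(column)
selected = groups_to_read(index, 12_000, 12_500)
```

The benefit depends on how the data is laid out: on sorted or clustered columns most groups are skipped, while on randomly ordered values nearly every group's range overlaps the predicate.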

© Hortonworks Inc. 2012
ORC Files
June 2013
Page 1
Owen O’Malley
owen@hortonworks.com
@owen_omalley

Who Am I?
Page 2

Remaining Challenges
Page 4

Requirements
Page 5

File Structure
Page 6

Stripe Structure
Page 7

File Layout
Page 8
[Figure: ORC file layout. The file is a sequence of 256MB stripes, each made up of Index Data, Row Data, and a Stripe Footer. The stripes are followed by the File Footer and the Postscript. Index Data and Row Data are each divided by column (Column 1 through Column 8), and a column is stored as one or more streams (Stream 2.1 through Stream 2.4 shown for Column 2).]

Compression
Page 9

Integer Column Serialization
Page 10
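
Run-length encoding is one of the lightweight integer techniques named in the abstract. A minimal sketch of the general idea follows; it is a simplified illustration and not ORC's actual on-disk integer format, and the function names `rle_encode` and `rle_decode` are hypothetical.

```python
from typing import List, Tuple


def rle_encode(values: List[int]) -> List[Tuple[int, int]]:
    """Collapse consecutive equal values into (run_length, value) pairs."""
    runs: List[Tuple[int, int]] = []
    for v in values:
        if runs and runs[-1][1] == v:
            # Same value as the current run: extend it by one.
            runs[-1] = (runs[-1][0] + 1, v)
        else:
            # Value changed: start a new run of length one.
            runs.append((1, v))
    return runs


def rle_decode(runs: List[Tuple[int, int]]) -> List[int]:
    """Expand (run_length, value) pairs back into the original sequence."""
    out: List[int] = []
    for length, value in runs:
        out.extend([value] * length)
    return out


column = [7, 7, 7, 7, 3, 3, 9]
encoded = rle_encode(column)
decoded = rle_decode(encoded)
```

Seven values collapse into three runs here; the saving grows with the length of the runs, which is why the technique suits sorted or low-cardinality columns.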

String Column Serialization
Page 11
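
Dictionary encoding, which the abstract lists among the lightweight techniques, replaces each string with a small integer that points into a table of distinct values. The sketch below illustrates the general technique only; keeping the dictionary sorted is an assumption of this example, and `dictionary_encode` and `dictionary_decode` are hypothetical names rather than ORC functions.

```python
from typing import List, Tuple


def dictionary_encode(values: List[str]) -> Tuple[List[str], List[int]]:
    """Return (sorted distinct strings, per-row index into that list)."""
    dictionary = sorted(set(values))
    position = {s: i for i, s in enumerate(dictionary)}
    return dictionary, [position[s] for s in values]


def dictionary_decode(dictionary: List[str], ids: List[int]) -> List[str]:
    """Look each row's index up in the dictionary to recover the strings."""
    return [dictionary[i] for i in ids]


column = ["CA", "NY", "CA", "TX", "NY", "CA"]
dictionary, ids = dictionary_encode(column)
restored = dictionary_decode(dictionary, ids)
```

Each distinct string is stored once, and the per-row integers can then be compressed further with integer techniques such as run-length encoding.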

Hive Compound Types
Page 12
[Figure: type tree with numbered columns. 0 Struct, 1 Int, 2 Map, 3 String, 4 Struct, 5 String, 6 Double, 7 Time.]
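
The column numbers in the figure are consistent with a pre-order walk of the type tree, where a parent is numbered before its children. The sketch below assumes the tree is a struct containing an int, a map from string to a struct of (string, double), and a time. That nesting is inferred from the numbering and is not stated on the slide, and the tuple-based schema representation and `assign_column_ids` are hypothetical.

```python
from typing import List, Tuple

# A type is (name, children). This nested-tuple shape is a hypothetical
# stand-in for a real schema object.
TypeNode = Tuple[str, list]

schema: TypeNode = (
    "Struct", [
        ("Int", []),
        ("Map", [
            ("String", []),                                   # map key
            ("Struct", [("String", []), ("Double", [])]),     # map value
        ]),
        ("Time", []),
    ],
)


def assign_column_ids(root: TypeNode) -> List[Tuple[int, str]]:
    """Number every node in pre-order: a parent first, then its children left to right."""
    numbered: List[Tuple[int, str]] = []

    def visit(node: TypeNode) -> None:
        name, children = node
        numbered.append((len(numbered), name))  # next free id goes to this node
        for child in children:
            visit(child)

    visit(root)
    return numbered


ids = assign_column_ids(schema)
```

With this numbering, every subtree occupies a contiguous range of column ids, so selecting a compound column amounts to selecting one interval of ids.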

Compound Type Serialization
Page 13

Generic Compression
Page 14

Column Projection
Page 15

How Do You Use ORC
Page 16

Managing Memory
Page 17

Pavan’s Trick
Page 18

Looking at ORC File Structures
Page 19

Looking at ORC File Structures
Page 20

TPC-DS File Sizes
Page 21

TPC-DS Query Performance
Page 22

Additional Details
Page 23

Current work
Page 24

Vectorization
Page 25

Vectorization Preliminary Results
Page 26

Future Work
Page 27

Comparison
Page 29
Feature                      RC File  Trevni  Parquet  ORC File
Hive Type Model              N        N       N        Y
Separate complex columns     N        Y       Y        Y
Splits found quickly         N        Y       Y        Y
Default column group size    4MB      64MB*   64MB*    256MB
Files per bucket             1        > 1     1*       1
Store min, max, sum, count   N        N       N        Y
Versioned metadata           N        Y       Y        Y
Run-length data encoding     N        N       Y        Y
Store strings in dictionary  N        N       N        Y
Store row count              N        Y       N        Y
Skip compressed blocks       N        N       N        Y
Store internal indexes       N        N       N        Y
