1 © Hortonworks Inc. 2011–2018. All rights reserved
ORC Improvement in Apache Spark 2.3
Dongjoon Hyun
Principal Software Engineer @ Hortonworks Data Science Team
April 2018
Dongjoon Hyun
• Hortonworks
• Principal Software Engineer @ Data Science Team
• Apache Project
• Apache REEF Project Management Committee(PMC) Member & Committer
• Apache Spark Project Contributor
• GitHub
• https://github.com/dongjoon-hyun
Agenda
• What’s New in Apache Spark 2.3
• Previous ORC issues in Apache Spark
• Current Approach & Demo
• Performance & Limitation
• Future roadmap
What’s New in Apache Spark 2.3
Major Features
• Vectorized ORC Reader
• Structured Streaming with ORC
• Schema evolution with ORC
• PySpark Performance Enhancements with Apache Arrow and ORC
• Structured stream-stream joins
• Spark History Server V2
Experimental Features
• Spark on Kubernetes
• Data source API V2
• Streaming API V2
• Continuous Structured Streaming Processing
Spark’s file-based data sources
• TEXT The simplest one with one string column schema
• CSV Popular for data science workloads
• JSON The most flexible one for schema changes
• PARQUET The only one with vectorized reader
• ORC Popular for shared Hive tables
Motivation
• TEXT The simplest one with one string column schema
• CSV Popular for data science workloads
• JSON The most flexible one for schema changes
• PARQUET The only one with vectorized reader
• ORC Popular for shared Hive tables
Fast
Flexible
Hive Table Access
Previous ORC Issues in Spark
Background – The story of Spark, ORC, and Hive
• Before Apache ORC
• Hive 1.2.1 (2015 JUN) -> SPARK-2883 (Spark 1.4)
• After Apache ORC
• v1.0.0 (2016 JAN)
...
• v1.3.3 (2017 FEB)
• v1.4.0 (2017 MAY)
Background – The story of Spark, ORC, and Hive – Cont.
• Before Apache ORC
• Hive 1.2.1 (2015 JUN) -> SPARK-2883 (Hive 1.2.1, Spark 1.4)
• After Apache ORC
• v1.0.0 (2016 JAN)
...
• v1.3.3 (2017 FEB) -> HIVE-15841 (Hive 2.3)
• v1.4.0 (2017 MAY) -> SPARK-21422 (Spark 2.3)
• v1.4.1 (2017 OCT) -> SPARK-22300 (Spark 2.3)
• v1.4.3 (2018 FEB) -> SPARK-23340, HIVE-18674 (Hive 3.0, Spark 2.4)
Six Issue Categories
• ORC Writer Versions
• Performance
• Structured streaming
• Column names
• Hive tables and schema evolution
• Robustness
Category 1 – ORC Writer Versions
• ORIGINAL
• HIVE_8732 (2014) ORC string statistics are not merged correctly
• HIVE_4243 (2015) Fix column names in FileSinkOperator
• HIVE_12055 (2015) Create row-by-row shims for the write path
• HIVE_13083 (2016) Writing HiveDecimal can wrongly suppress present stream
• ORC_101 (2016) Correct the use of the default charset in bloomfilter
• ORC_135 (2018) PPD for timestamp is wrong when reader/writer timezones are different
Category 2 – Performance
• Vectorized ORC Reader (SPARK-16060)
• Fast reading partition-columns (SPARK-22712)
• Pushing down filters for DateType (SPARK-21787)
Category 3 – Structured streaming
• Write (SPARK-15474): `FileNotFoundException` at writing empty partitions as ORC
• Read (SPARK-22781): Create structured stream with ORC files
spark.readStream.orc(path)
Category 4 – Column names
• Unicode column names (SPARK-23072)
• Column names with dot (SPARK-21791)
• Should not create invalid column names (SPARK-21912)
Category 5 – Hive tables and schema evolution
• Support `ALTER TABLE ADD COLUMNS` (SPARK-21929)
• Introduced at Spark 2.2, but throws AnalysisException for ORC
• Support column positional mismatch (SPARK-22267)
• Returns wrong results if the column order in the ORC file schema differs from the Hive MetaStore schema
Category 6 – Robustness
• ORC metadata exceeds ProtoBuf message size limit (SPARK-19109)
• NullPointerException on zero-size ORC file (SPARK-19809)
• Support `ignoreCorruptFiles` (SPARK-23049)
• Support `ignoreMissingFiles` (SPARK-23305)
Current Approach
Supports two ORC file formats
• Adding a new OrcFileFormat (SPARK-20682)
FileFormat implementations (from the class diagram):
• o.a.s.sql.execution.datasources: TextBasedFileFormat (CSVFileFormat, JsonFileFormat, TextFileFormat), ParquetFileFormat, `native` OrcFileFormat with ORC 1.4.1
• o.a.s.sql.hive.orc: `hive` OrcFileFormat from Hive 1.2.1
• o.a.s.ml.source.libsvm: LibSVMFileFormat
• HiveFileFormat
In Reality – Four cases for ORC Reader/Writer
• `native` Writer + `native` Reader: New Data, New Apps. Best performance (Vectorized Reader)
• `native` Writer + `hive` Reader: New Data, Old Apps. Improved performance (Non-vectorized Reader)
• `hive` Writer + `native` Reader: Old Data, New Apps. Improved performance (Vectorized Reader)
• `hive` Writer + `hive` Reader: Old Data, Old Apps. As-Is performance (Non-vectorized Reader)
Performance – Single column scan from wide tables
[Chart: Time (ms) vs. number of columns (100, 200, 300); 1M rows with all BIGINT columns. Series: native writer / native reader, hive writer / native reader, native writer / hive reader, hive writer / hive reader. Annotation: 4x]
https://github.com/apache/spark/blob/branch-2.3/sql/hive/src/test/scala/org/apache/spark/sql/hive/orc/OrcReadBenchmark.scala
How to specify `native` OrcFileFormat directly
Create ORC Table
CREATE TABLE people (name string, age int)
USING org.apache.spark.sql.execution.datasources.orc
Write Dataset
df.write
.format("org.apache.spark.sql.execution.datasources.orc")
.save(path)
Read Dataset
spark.read
.format("org.apache.spark.sql.execution.datasources.orc")
.load(path)
Switch ORC implementation (SPARK-20728)
• spark.sql.orc.impl=native (default: `hive`)
Create ORC Table
CREATE TABLE people (name string, age int)
USING ORC OPTIONS (orc.compress 'ZLIB')
Read/Write Dataset
spark.read.orc(path)
df.write.orc(path)
Read/Write Dataset
spark.read.format("orc").load(path)
df.write.format("orc").save(path)
Switch ORC implementation (SPARK-20728) – Cont.
• spark.sql.orc.impl=native (default: `hive`)
Read/Write Structured Stream
spark.readStream.orc(path)
spark.readStream.format("orc").load(path)
df.writeStream
.option("checkpointLocation", path1)
.format("orc")
.option("path", path2)
.start()
ORC Readers with `spark.sql.` configurations
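• orc.impl=`hive`: `hive` ORC Reader
• orc.impl=`native`, orc.enableVectorizedReader=false: `native` ORC Record Reader
• orc.impl=`native`, orc.enableVectorizedReader=true, all atomic types, # of cols <= codegen.maxFields: `native` ORC Columnar Batch Reader
• orc.impl=`native`, orc.enableVectorizedReader=true, but a non-atomic type or # of cols > codegen.maxFields: `native` ORC Record Reader
• spark.sql.codegen.maxFields=100 (default)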
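The reader selection on this slide can be sketched as a small decision function. This is a plain-Python model rather than Spark code: `OrcReadConf` and `choose_orc_reader` are hypothetical names introduced for illustration, and the default for `enable_vectorized_reader` is an assumption.

```python
from dataclasses import dataclass


@dataclass
class OrcReadConf:
    """Hypothetical container for the `spark.sql.` settings named on the slide."""
    orc_impl: str = "hive"                 # spark.sql.orc.impl (default: hive)
    enable_vectorized_reader: bool = True  # spark.sql.orc.enableVectorizedReader (assumed default)
    codegen_max_fields: int = 100          # spark.sql.codegen.maxFields (default: 100)


def choose_orc_reader(conf: OrcReadConf, num_cols: int, all_atomic: bool) -> str:
    """Return the name of the reader the slide's decision diagram would pick."""
    if conf.orc_impl == "hive":
        return "hive ORC Reader"
    # The columnar batch reader needs every condition to hold at once.
    vectorized = (
        conf.enable_vectorized_reader
        and all_atomic
        and num_cols <= conf.codegen_max_fields
    )
    if vectorized:
        return "native ORC Columnar Batch Reader"
    return "native ORC Record Reader"


native = OrcReadConf(orc_impl="native")
print(choose_orc_reader(native, num_cols=10, all_atomic=True))
print(choose_orc_reader(native, num_cols=300, all_atomic=True))
```

With the default `codegen_max_fields` of 100, the 300-column case falls back to the record reader, which is why very wide schemas may not see the vectorized speedup unless the limit is raised.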
ORC Readers with `spark.sql.` configurations – Cont.
`native` ORC Columnar Batch Reader (orc.enableVectorizedReader=true)
• orc.copyBatchToSpark=false: Wrapping ORC ColumnVector -> Spark OrcColumnVector
• orc.copyBatchToSpark=true, columnVector.offheap.enabled=true: Copying ORC ColumnVector -> Spark OffHeapColumnVector
• orc.copyBatchToSpark=true, columnVector.offheap.enabled=false: Copying ORC ColumnVector -> Spark OnHeapColumnVector
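The wrap-or-copy choice inside the columnar batch reader can be modeled the same way. The function below is a hypothetical plain-Python sketch of the diagram, not Spark's implementation; only the class names in the returned strings come from the slide.

```python
def choose_column_vector(copy_batch_to_spark: bool, offheap_enabled: bool) -> str:
    """Model of the slide's diagram for the `native` ORC Columnar Batch Reader.

    copy_batch_to_spark -> spark.sql.orc.copyBatchToSpark
    offheap_enabled     -> spark.sql.columnVector.offheap.enabled
    """
    if not copy_batch_to_spark:
        # No copy: the ORC ColumnVector is wrapped by a Spark-side adapter.
        return "wrap: OrcColumnVector"
    if offheap_enabled:
        return "copy: OffHeapColumnVector"
    return "copy: OnHeapColumnVector"


for copy in (False, True):
    for offheap in (False, True):
        print(copy, offheap, choose_column_vector(copy, offheap))
```

Note that the off-heap setting only matters when copying is enabled; in the wrapping case it is ignored.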
Support vectorized read on Hive ORC Tables
• spark.sql.hive.convertMetastoreOrc=true (default: false)
• `spark.sql.orc.impl=native` is required, too.
CREATE TABLE people (name string, age int)
STORED AS ORC
CREATE TABLE people (name string, age int)
USING HIVE OPTIONS (fileFormat 'ORC', orc.compress 'gzip')
SPARK-23355
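Because both settings are needed together, a small helper can set them as a pair. This is a sketch: `enable_vectorized_hive_orc` is a hypothetical name, and it assumes `spark` is a PySpark `SparkSession` (or any object exposing `conf.set(key, value)`).

```python
def enable_vectorized_hive_orc(spark):
    """Set the two options the slide lists for vectorized reads on Hive ORC tables.

    `spark` is assumed to be a SparkSession; only `spark.conf.set` is used.
    """
    # Use the new ORC implementation instead of the Hive 1.2.1 based one.
    spark.conf.set("spark.sql.orc.impl", "native")
    # Read Hive metastore ORC tables through Spark's own ORC data source.
    spark.conf.set("spark.sql.hive.convertMetastoreOrc", "true")
    return spark
```

In a PySpark session this would be called once, for example `enable_vectorized_hive_orc(spark)`, before running `spark.table("people")` on a table created with `STORED AS ORC`.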
Schema evolution at reading file-based data sources
• Frequently, new files can have wider column types or new columns
• Before SPARK-21929, users had to drop and recreate the ORC table with an updated schema.
• User-defined schema reduces schema inference cost and handles upcasting
• boolean -> byte -> short -> int -> long
• float -> double
Old Data: spark.read.schema("col1 int").orc(path)
New Data: spark.read.schema("col1 long, col2 long").orc(path)
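The two upcasting chains listed on this slide can be checked with a few lines of plain Python. This is an illustrative model only: `UPCAST_CHAINS` and `is_safe_upcast` are hypothetical names, and the rule encoded is just the chains as written on the slide, not Spark's full casting logic.

```python
# The two chains from the slide, ordered from narrowest to widest.
UPCAST_CHAINS = [
    ["boolean", "byte", "short", "int", "long"],
    ["float", "double"],
]


def is_safe_upcast(file_type: str, read_type: str) -> bool:
    """True if data written as `file_type` can be read as `read_type`
    by moving rightwards (or staying put) within one chain."""
    for chain in UPCAST_CHAINS:
        if file_type in chain and read_type in chain:
            return chain.index(file_type) <= chain.index(read_type)
    # Types outside the chains, or in different chains, must match exactly.
    return file_type == read_type


print(is_safe_upcast("int", "long"))   # old files with int, read with a long schema
print(is_safe_upcast("long", "int"))   # narrowing is not a safe change
```

This matches the slide's example: files written with `col1 int` remain readable under the user-defined schema `col1 long`, while the reverse would not be a safe change.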
Schema evolution at reading file-based data sources – Cont.
File Format: TEXT | CSV | JSON | ORC `hive` | ORC `native` (1) | PARQUET
Add Column At The End ✔️ ✔️ ✔️ ✔️ ✔️
Hide Trailing Column ✔️ ✔️ ✔️ ✔️ ✔️
Hide Column ✔️ ✔️ ✔️
Change Type (2) ✔️ ✔️ (3) ✔️
Change Position ✔️ ✔️ ✔️
Notes:
1. Native Vectorized ORC Reader
2. Only safe change via upcasting
3. JSON is the most flexible for changing types
Demo 1
ORC configuration
Demo 2
PySpark with ORC
Performance
Micro Benchmark
• Target
• Apache Spark 2.3.0
• Apache ORC 1.4.1
• Machine
• MacBook Pro (2015 Mid)
• Intel® Core™ i7-4770HQ CPU @ 2.20GHz
• Mac OS X 10.13.4
• JDK 1.8.0_161
Performance – Single column scan from wide tables
[Chart: Time (ms) vs. number of columns (100, 200, 300); 1M rows with all BIGINT columns. Series: native writer / native reader, hive writer / hive reader. Annotation: 4x]
https://github.com/apache/spark/blob/branch-2.3/sql/hive/src/test/scala/org/apache/spark/sql/hive/orc/OrcReadBenchmark.scala
Performance – Vectorized Read
[Chart: Time (ms) by column type (TINYINT, SMALLINT, INT, BIGINT, FLOAT, DOUBLE), native vs. hive; 15M rows in a single-column table. Annotations: 10x, 5x, 11x]
https://github.com/apache/spark/blob/branch-2.3/sql/hive/src/test/scala/org/apache/spark/sql/hive/orc/OrcReadBenchmark.scala
Performance – Partitioned table read
[Chart: Time (ms) for reading the data column, the partition column, and both columns, native vs. hive; 15M rows in a partitioned table. Annotations: 21x, 7x]
https://github.com/apache/spark/blob/branch-2.3/sql/hive/src/test/scala/org/apache/spark/sql/hive/orc/OrcReadBenchmark.scala
Predicate Pushdown
[Chart: Time (ms), parquet vs. native, for four queries: Select 10% rows (id < value), Select 50% rows (id < value), Select 90% rows (id < value), Select all rows (id IS NOT NULL); 15M rows with 5 data columns and 1 sequential id column]
https://github.com/apache/spark/blob/branch-2.3/sql/core/src/test/scala/org/apache/spark/sql/FilterPushdownBenchmark.scala
Limitation
Future Roadmap
Limitation
• Spark vectorization supports atomic types only
• Limited simple schema evolution. JSON provides more
• boolean -> byte -> short -> int -> long
• float -> double
• `convertMetastore` ignores `STORED AS` table properties (SPARK-23355)
• Both ORC/Parquet
Future Roadmap – Apache Spark 2.4 (2018 Fall)
• Feature Parity for ORC with Parquet (SPARK-20901)
• Use `native` ORC implementation by default (SPARK-23456)
• Use ORC predicate pushdown by default (SPARK-21783)
• Use `convertMetastoreOrc` by default (SPARK-22279)
• Test ORC as default data source format (SPARK-23553)
• Test and support Bloom Filters (SPARK-12417)
Future Roadmap – On-going work
• Support VectorUDT/MatrixUDT (SPARK-22320)
• Support CHAR/VARCHAR Types
• Vectorized Writer with DataSource V2
• ALTER TABLE … CHANGE column type (SPARK-18727)
Summary
• Apache Spark 2.3 starts to take advantage of Apache ORC
• Native vectorized ORC reader
• boosts Spark ORC performance
• provides better schema evolution ability
• Structured streaming starts to work with ORC (both reader/writer)
• Spark is going to become faster and faster with ORC
Reference
• https://youtu.be/EL-NHiwqCSY, ORC configuration in Apache Spark 2.3
• https://youtu.be/zJZ1gtzu-rs, Apache Spark 2.3 ORC with Apache Arrow
• https://community.hortonworks.com/articles/148917/orc-improvements-for-apache-spark-22.html
• https://www.slideshare.net/Hadoop_Summit/performance-update-when-apache-orc-met-apache-spark-81023199, Dataworks Summit 2017 Sydney
• https://www.slideshare.net/Hadoop_Summit/orc-file-optimizing-your-big-data, Dataworks Summit 2017 San Jose
Questions?
Thank you

 
Simplifying Microservices & Apps - The art of effortless development - Meetup...
Simplifying Microservices & Apps - The art of effortless development - Meetup...Simplifying Microservices & Apps - The art of effortless development - Meetup...
Simplifying Microservices & Apps - The art of effortless development - Meetup...
 
SensoDat: Simulation-based Sensor Dataset of Self-driving Cars
SensoDat: Simulation-based Sensor Dataset of Self-driving CarsSensoDat: Simulation-based Sensor Dataset of Self-driving Cars
SensoDat: Simulation-based Sensor Dataset of Self-driving Cars
 
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
 
Implementing Zero Trust strategy with Azure
Implementing Zero Trust strategy with AzureImplementing Zero Trust strategy with Azure
Implementing Zero Trust strategy with Azure
 
Advantages of Odoo ERP 17 for Your Business
Advantages of Odoo ERP 17 for Your BusinessAdvantages of Odoo ERP 17 for Your Business
Advantages of Odoo ERP 17 for Your Business
 
Large Language Models for Test Case Evolution and Repair
Large Language Models for Test Case Evolution and RepairLarge Language Models for Test Case Evolution and Repair
Large Language Models for Test Case Evolution and Repair
 
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
 
Real-time Tracking and Monitoring with Cargo Cloud Solutions.pptx
Real-time Tracking and Monitoring with Cargo Cloud Solutions.pptxReal-time Tracking and Monitoring with Cargo Cloud Solutions.pptx
Real-time Tracking and Monitoring with Cargo Cloud Solutions.pptx
 
Sending Calendar Invites on SES and Calendarsnack.pdf
Sending Calendar Invites on SES and Calendarsnack.pdfSending Calendar Invites on SES and Calendarsnack.pdf
Sending Calendar Invites on SES and Calendarsnack.pdf
 
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
 
Xen Safety Embedded OSS Summit April 2024 v4.pdf
Xen Safety Embedded OSS Summit April 2024 v4.pdfXen Safety Embedded OSS Summit April 2024 v4.pdf
Xen Safety Embedded OSS Summit April 2024 v4.pdf
 
2.pdf Ejercicios de programación competitiva
2.pdf Ejercicios de programación competitiva2.pdf Ejercicios de programación competitiva
2.pdf Ejercicios de programación competitiva
 
英国UN学位证,北安普顿大学毕业证书1:1制作
英国UN学位证,北安普顿大学毕业证书1:1制作英国UN学位证,北安普顿大学毕业证书1:1制作
英国UN学位证,北安普顿大学毕业证书1:1制作
 
Folding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a seriesFolding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a series
 

ORC improvement in Apache Spark 2.3

  • 1. 1 © Hortonworks Inc. 2011–2018. All rights reserved ORC Improvement in Apache Spark 2.3 Dongjoon Hyun Principal Software Engineer @ Hortonworks Data Science Team April 2018
  • 2. Dongjoon Hyun • Hortonworks • Principal Software Engineer @ Data Science Team • Apache Project • Apache REEF Project Management Committee (PMC) Member & Committer • Apache Spark Project Contributor • GitHub • https://github.com/dongjoon-hyun
  • 3. Agenda • What’s New in Apache Spark 2.3 • Previous ORC issues in Apache Spark • Current Approach & Demo • Performance & Limitation • Future roadmap
  • 4. What’s New in Apache Spark 2.3 (columns: Major Features, Experimental Features) • Vectorized ORC Reader • Structured Streaming with ORC • Schema evolution with ORC • PySpark Performance Enhancements with Apache Arrow and ORC • Structured stream-stream joins • Spark History Server V2 • Spark on Kubernetes • Data source API V2 • Streaming API V2 • Continuous Structured Streaming Processing
  • 5. What’s New in Apache Spark 2.3 (columns: Major Features, Experimental Features) • Vectorized ORC Reader • Structured Streaming with ORC • Schema evolution with ORC • PySpark Performance Enhancements with Apache Arrow and ORC • Structured stream-stream joins • Spark History Server V2 • Spark on Kubernetes • Data source API V2 • Streaming API V2 • Continuous Structured Streaming Processing
  • 6. Spark’s file-based data sources • TEXT: The simplest one, with a single string column schema • CSV: Popular for data science workloads • JSON: The most flexible one for schema changes • PARQUET: The only one with a vectorized reader • ORC: Popular for shared Hive tables
  • 7. Motivation • TEXT: The simplest one, with a single string column schema • CSV: Popular for data science workloads • JSON: The most flexible one for schema changes • PARQUET: The only one with a vectorized reader • ORC: Popular for shared Hive tables • Fast Flexible Hive Table Access
  • 8. Previous ORC Issues in Spark
  • 9. Background – The story of Spark, ORC, and Hive • Before Apache ORC • Hive 1.2.1 (2015 JUN) → SPARK-2883 (Spark 1.4) • After Apache ORC • v1.0.0 (2016 JAN) ... • v1.3.3 (2017 FEB) • v1.4.0 (2017 MAY)
  • 10. Background – The story of Spark, ORC, and Hive – Cont. • Before Apache ORC • Hive 1.2.1 (2015 JUN) → SPARK-2883 (Hive 1.2.1, Spark 1.4) • After Apache ORC • v1.0.0 (2016 JAN) ... • v1.3.3 (2017 FEB) → HIVE-15841 (Hive 2.3) • v1.4.0 (2017 MAY) → SPARK-21422 (Spark 2.3) • v1.4.1 (2017 OCT) → SPARK-22300 (Spark 2.3) • v1.4.3 (2018 FEB) → SPARK-23340, HIVE-18674 (Hive 3.0, Spark 2.4)
  • 11. Six Issue Categories • ORC Writer Versions • Performance • Structured streaming • Column names • Hive tables and schema evolution • Robustness
  • 12. Category 1 – ORC Writer Versions • ORIGINAL • HIVE_8732 (2014): ORC string statistics are not merged correctly • HIVE_4243 (2015): Fix column names in FileSinkOperator • HIVE_12055 (2015): Create row-by-row shims for the write path • HIVE_13083 (2016): Writing HiveDecimal can wrongly suppress present stream • ORC_101 (2016): Correct the use of the default charset in bloomfilter • ORC_135 (2018): PPD for timestamp is wrong when reader/writer timezones are different
  • 13. Category 2 – Performance • Vectorized ORC Reader (SPARK-16060) • Fast reading partition-columns (SPARK-22712) • Pushing down filters for DateType (SPARK-21787)
  • 14. Category 3 – Structured streaming • Write (SPARK-15474): `FileNotFoundException` when writing empty partitions as ORC • Read (SPARK-22781): Create a structured stream with ORC files: `spark.readStream.orc(path)`
  • 15. Category 4 – Column names • Unicode column names (SPARK-23072) • Column names with dot (SPARK-21791) • Should not create invalid column names (SPARK-21912)
  • 16. Category 5 – Hive tables and schema evolution • Support `ALTER TABLE ADD COLUMNS` (SPARK-21929) • Introduced in Spark 2.2, but throws AnalysisException for ORC • Support column positional mismatch (SPARK-22267) • Returns wrong results if the ORC file schema order differs from the Hive MetaStore schema order
  • 17. Category 6 – Robustness • ORC metadata exceeds ProtoBuf message size limit (SPARK-19109) • NullPointerException on zero-size ORC file (SPARK-19809) • Support `ignoreCorruptFiles` (SPARK-23049) • Support `ignoreMissingFiles` (SPARK-23305)
  • 18. Current Approach
  • 19. Supports two ORC file formats • Adding a new OrcFileFormat (SPARK-20682) • [Class diagram: FileFormat, TextBasedFileFormat, ParquetFileFormat, OrcFileFormat, HiveFileFormat, JsonFileFormat, LibSVMFileFormat, CSVFileFormat, TextFileFormat, in packages o.a.s.sql.execution.datasources, o.a.s.ml.source.libsvm, and o.a.s.sql.hive.orc (OrcFileFormat)] • `hive`: OrcFileFormat from Hive 1.2.1 • `native`: OrcFileFormat with ORC 1.4.1
  • 20. In Reality – Four cases for ORC Reader/Writer • 1. `native` Writer, `native` Reader: New Data, New Apps, Best performance (Vectorized Reader) • 2. `native` Writer, `hive` Reader: New Data, Old Apps, Improved performance (Non-vectorized Reader) • 3. `hive` Writer, `native` Reader: Old Data, New Apps, Improved performance (Vectorized Reader) • 4. `hive` Writer, `hive` Reader: Old Data, Old Apps, As-Is performance (Non-vectorized Reader)
  • 21. Performance – Single column scan from wide tables • [Chart: Time (ms) vs. Number of columns (100, 200, 300), 1M rows with all BIGINT columns; series (cases 1–4): native writer / native reader, hive writer / native reader, native writer / hive reader, hive writer / hive reader; annotated 4x] • https://github.com/apache/spark/blob/branch-2.3/sql/hive/src/test/scala/org/apache/spark/sql/hive/orc/OrcReadBenchmark.scala
  • 22. How to specify `native` OrcFileFormat directly • Create ORC Table: `CREATE TABLE people (name string, age int) USING org.apache.spark.sql.execution.datasources.orc` • Write Dataset: `df.write.format("org.apache.spark.sql.execution.datasources.orc").save(path)` • Read Dataset: `spark.read.format("org.apache.spark.sql.execution.datasources.orc").load(path)`
  • 23. Switch ORC implementation (SPARK-20728) • spark.sql.orc.impl=native (default: `hive`) • Create ORC Table: `CREATE TABLE people (name string, age int) USING ORC OPTIONS (orc.compress 'ZLIB')` • Read/Write Dataset: `spark.read.orc(path)`, `df.write.orc(path)` • Read/Write Dataset: `spark.read.format("orc").load(path)`, `df.write.format("orc").save(path)`
  • 24. Switch ORC implementation (SPARK-20728) – Cont. • spark.sql.orc.impl=native (default: `hive`) • Read/Write Structured Stream: `spark.readStream.orc(path)`, `spark.readStream.format("orc").load(path)`, `df.writeStream.option("checkpointLocation", path1).format("orc").option("path", path2).start`
  • 25. ORC Readers with `spark.sql.` configurations • orc.impl = `hive` → `hive` ORC Reader • orc.impl = `native` and orc.enableVectorizedReader = false → `native` ORC Record Reader • orc.impl = `native`, orc.enableVectorizedReader = true, # of cols <= codegen.maxFields (spark.sql.codegen.maxFields=100 by default), and all atomic types → `native` ORC Columnar Batch Reader • If either of the last two checks is false → `native` ORC Record Reader
  • 26. ORC Readers with `spark.sql.` configurations – Cont. • `native` ORC Columnar Batch Reader (orc.enableVectorizedReader = true) • orc.copyBatchToSpark = false → Wrapping ORC ColumnVector → Spark OrcColumnVector • orc.copyBatchToSpark = true and columnVector.offheap.enabled = true → Copying ORC ColumnVector → Spark OffHeapColumnVector • orc.copyBatchToSpark = true and columnVector.offheap.enabled = false → Copying ORC ColumnVector → Spark OnHeapColumnVector
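
The reader selection on slides 25 and 26 can be sketched as plain decision logic. This is only an illustration of the rules as the slides state them, not Spark's actual implementation: `OrcConf`, `choose_reader`, and `choose_column_vector` are hypothetical names, and the defaults for the vectorized-reader, copy-batch, and off-heap flags are assumptions (the slides give defaults only for `orc.impl` and `codegen.maxFields`).

```python
from dataclasses import dataclass


@dataclass
class OrcConf:
    """Hypothetical holder for the spark.sql.* settings named on the slides."""
    orc_impl: str = "hive"                 # spark.sql.orc.impl (default per slides)
    enable_vectorized_reader: bool = True  # spark.sql.orc.enableVectorizedReader (assumed default)
    codegen_max_fields: int = 100          # spark.sql.codegen.maxFields (default per slides)
    copy_batch_to_spark: bool = False      # spark.sql.orc.copyBatchToSpark (assumed default)
    offheap_enabled: bool = False          # spark.sql.columnVector.offheap.enabled (assumed default)


def choose_reader(conf: OrcConf, num_cols: int, all_atomic: bool) -> str:
    """Return the reader the slides say is used for a given schema."""
    if conf.orc_impl == "hive":
        return "hive ORC Reader"
    # The columnar batch reader needs all three conditions to hold.
    if (conf.enable_vectorized_reader
            and num_cols <= conf.codegen_max_fields
            and all_atomic):
        return "native ORC Columnar Batch Reader"
    return "native ORC Record Reader"


def choose_column_vector(conf: OrcConf) -> str:
    """Return how the columnar batch reader exposes ORC column vectors."""
    if not conf.copy_batch_to_spark:
        return "wrap as OrcColumnVector"
    if conf.offheap_enabled:
        return "copy to OffHeapColumnVector"
    return "copy to OnHeapColumnVector"


native = OrcConf(orc_impl="native")
print(choose_reader(native, num_cols=50, all_atomic=True))
print(choose_reader(native, num_cols=300, all_atomic=True))
print(choose_column_vector(native))
```

With these settings, a 50-column table of atomic types takes the columnar batch path, while a 300-column table falls back to the record reader because it exceeds `codegen.maxFields`.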
  • 27. Support vectorized read on Hive ORC Tables • spark.sql.hive.convertMetastoreOrc=true (default: false) • `spark.sql.orc.impl=native` is required, too. • `CREATE TABLE people (name string, age int) STORED AS ORC` • `CREATE TABLE people (name string, age int) USING HIVE OPTIONS (fileFormat 'ORC', orc.compress 'gzip')` • SPARK-23355
  • 28. Schema evolution at reading file-based data sources • Frequently, new files can have wider column types or new columns • Before SPARK-21929, users had to drop and recreate an ORC table with an updated schema. • A user-defined schema reduces schema inference cost and handles upcasting • boolean -> byte -> short -> int -> long • float -> double • Old Data: `spark.read.schema("col1 int").orc(path)` • New Data: `spark.read.schema("col1 long, col2 long").orc(path)`
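
The two upcasting chains on slide 28 can be read as a simple ordering check. The sketch below only encodes the chains exactly as the slide lists them; `UPCAST_CHAINS` and `can_upcast` are hypothetical names, and Spark itself may apply different or additional rules.

```python
# Upcasting chains as listed on the slide; a type may be read as any
# type that appears later in the same chain.
UPCAST_CHAINS = [
    ["boolean", "byte", "short", "int", "long"],
    ["float", "double"],
]


def can_upcast(file_type: str, read_type: str) -> bool:
    """Return True if data written as file_type can be read as read_type."""
    for chain in UPCAST_CHAINS:
        if file_type in chain and read_type in chain:
            return chain.index(file_type) <= chain.index(read_type)
    # Types outside the chains are only compatible with themselves.
    return file_type == read_type


print(can_upcast("int", "long"))    # old files with `col1 int`, read as `col1 long`
print(can_upcast("long", "int"))    # narrowing is not a safe change
print(can_upcast("float", "double"))
```

This mirrors the Old Data / New Data example: files written with `col1 int` remain readable under a user-defined schema of `col1 long`, while the reverse direction is rejected.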
  • 29. Schema evolution at reading file-based data sources – Cont. • File Format: TEXT, CSV, JSON, ORC `hive`, ORC `native` (1), PARQUET • Add Column At The End ✔️ ✔️ ✔️ ✔️ ✔️ • Hide Trailing Column ✔️ ✔️ ✔️ ✔️ ✔️ • Hide Column ✔️ ✔️ ✔️ • Change Type (2) ✔️ ✔️ (3) ✔️ • Change Position ✔️ ✔️ ✔️ • Notes: (1) Native Vectorized ORC Reader (2) Only safe change via upcasting (3) JSON is the most flexible for changing types
  • 30. Demo 1 ORC configuration
  • 31. Demo 2 PySpark with ORC
  • 32. Performance
  • 33. Micro Benchmark • Target • Apache Spark 2.3.0 • Apache ORC 1.4.1 • Machine • MacBook Pro (2015 Mid) • Intel® Core™ i7-4770JQ CPU @ 2.20GHz • Mac OS X 10.13.4 • JDK 1.8.0_161
  • 34. Performance – Single column scan from wide tables • [Chart: Time (ms) vs. Number of columns (100, 200, 300), 1M rows with all BIGINT columns; series: native writer / native reader, hive writer / hive reader; annotated 4x] • https://github.com/apache/spark/blob/branch-2.3/sql/hive/src/test/scala/org/apache/spark/sql/hive/orc/OrcReadBenchmark.scala
  • 35. Performance – Vectorized Read • [Chart: Time (ms) by column type (TINYINT, SMALLINT, INT, BIGINT, FLOAT, DOUBLE), native vs. hive, 15M rows in a single-column table; annotated 10x, 5x, 11x] • https://github.com/apache/spark/blob/branch-2.3/sql/hive/src/test/scala/org/apache/spark/sql/hive/orc/OrcReadBenchmark.scala
  • 36. Performance – Partitioned table read • [Chart: Time (ms) for Data column, Partition column, and Both columns, native vs. hive, 15M rows in a partitioned table; annotated 21x, 7x] • https://github.com/apache/spark/blob/branch-2.3/sql/hive/src/test/scala/org/apache/spark/sql/hive/orc/OrcReadBenchmark.scala
  • 37. Predicate Pushdown • [Chart: Time (ms) for Select 10% rows (id < value), Select 50% rows (id < value), Select 90% rows (id < value), and Select all rows (id IS NOT NULL), parquet vs. native, 15M rows with 5 data columns and 1 sequential id column] • https://github.com/apache/spark/blob/branch-2.3/sql/core/src/test/scala/org/apache/spark/sql/FilterPushdownBenchmark.scala
  • 38. Limitation / Future Roadmap
  • 39. Limitation • Spark vectorization supports atomic types only • Schema evolution is limited to simple cases; JSON provides more • boolean -> byte -> short -> int -> long • float -> double • `convertMetastore` ignores `STORED AS` table properties (SPARK-23355) • Both ORC/Parquet
  • 40. Future Roadmap – Apache Spark 2.4 (2018 Fall) • Feature Parity for ORC with Parquet (SPARK-20901) • Use `native` ORC implementation by default (SPARK-23456) • Use ORC predicate pushdown by default (SPARK-21783) • Use `convertMetastoreOrc` by default (SPARK-22279) • Test ORC as default data source format (SPARK-23553) • Test and support Bloom Filters (SPARK-12417)
  • 41. Future Roadmap – On-going work • Support VectorUDT/MatrixUDT (SPARK-22320) • Support CHAR/VARCHAR Types • Vectorized Writer with DataSource V2 • ALTER TABLE … CHANGE column type (SPARK-18727)
  • 42. Summary • Apache Spark 2.3 starts to take advantage of Apache ORC • Native vectorized ORC reader • boosts Spark ORC performance • provides better schema evolution ability • Structured streaming starts to work with ORC (both reader/writer) • Spark is going to become faster and faster with ORC
  • 43. Reference • https://youtu.be/EL-NHiwqCSY, ORC configuration in Apache Spark 2.3 • https://youtu.be/zJZ1gtzu-rs, Apache Spark 2.3 ORC with Apache Arrow • https://community.hortonworks.com/articles/148917/orc-improvements-for-apache-spark-22.html • https://www.slideshare.net/Hadoop_Summit/performance-update-when-apache-orc-met-apache-spark-81023199, Dataworks Summit 2017 Sydney • https://www.slideshare.net/Hadoop_Summit/orc-file-optimizing-your-big-data, Dataworks Summit 2017 San Jose
  • 44. Questions?
  • 45. Thank you