Submit Search
Upload
ORC improvement in Apache Spark 2.3
•
Download as PPTX, PDF
•
9 likes
•
2,530 views
Dongjoon Hyun
Follow
Dataworks Summit 2018 Berlin - ORC improvement in Apache Spark 2.3
Read less
Read more
Software
Report
Share
Report
Share
1 of 45
Download now
Recommended
ORC Improvement & Roadmap in Apache Spark 2.3 and 2.4
ORC Improvement & Roadmap in Apache Spark 2.3 and 2.4
Dongjoon Hyun
ORC improvement in Apache Spark 2.3
ORC improvement in Apache Spark 2.3
DataWorks Summit
Performance Update: When Apache ORC Met Apache Spark
Performance Update: When Apache ORC Met Apache Spark
DataWorks Summit
ORC File - Optimizing Your Big Data
ORC File - Optimizing Your Big Data
DataWorks Summit
File Format Benchmark - Avro, JSON, ORC & Parquet
File Format Benchmark - Avro, JSON, ORC & Parquet
DataWorks Summit/Hadoop Summit
File Format Benchmark - Avro, JSON, ORC and Parquet
File Format Benchmark - Avro, JSON, ORC and Parquet
DataWorks Summit/Hadoop Summit
File Format Benchmarks - Avro, JSON, ORC, & Parquet
File Format Benchmarks - Avro, JSON, ORC, & Parquet
Owen O'Malley
Fast Spark Access To Your Complex Data - Avro, JSON, ORC, and Parquet
Fast Spark Access To Your Complex Data - Avro, JSON, ORC, and Parquet
Owen O'Malley
Recommended
ORC Improvement & Roadmap in Apache Spark 2.3 and 2.4
ORC Improvement & Roadmap in Apache Spark 2.3 and 2.4
Dongjoon Hyun
ORC improvement in Apache Spark 2.3
ORC improvement in Apache Spark 2.3
DataWorks Summit
Performance Update: When Apache ORC Met Apache Spark
Performance Update: When Apache ORC Met Apache Spark
DataWorks Summit
ORC File - Optimizing Your Big Data
ORC File - Optimizing Your Big Data
DataWorks Summit
File Format Benchmark - Avro, JSON, ORC & Parquet
File Format Benchmark - Avro, JSON, ORC & Parquet
DataWorks Summit/Hadoop Summit
File Format Benchmark - Avro, JSON, ORC and Parquet
File Format Benchmark - Avro, JSON, ORC and Parquet
DataWorks Summit/Hadoop Summit
File Format Benchmarks - Avro, JSON, ORC, & Parquet
File Format Benchmarks - Avro, JSON, ORC, & Parquet
Owen O'Malley
Fast Spark Access To Your Complex Data - Avro, JSON, ORC, and Parquet
Fast Spark Access To Your Complex Data - Avro, JSON, ORC, and Parquet
Owen O'Malley
Big Data Storage - Comparing Speed and Features for Avro, JSON, ORC, and Parquet
Big Data Storage - Comparing Speed and Features for Avro, JSON, ORC, and Parquet
DataWorks Summit
LLAP: Building Cloud First BI
LLAP: Building Cloud First BI
DataWorks Summit
HadoopFileFormats_2016
HadoopFileFormats_2016
Jakub Wszolek, PhD
Apache Zeppelin + LIvy: Bringing Multi Tenancy to Interactive Data Analysis
Apache Zeppelin + LIvy: Bringing Multi Tenancy to Interactive Data Analysis
DataWorks Summit/Hadoop Summit
ORC File and Vectorization - Hadoop Summit 2013
ORC File and Vectorization - Hadoop Summit 2013
Owen O'Malley
Major advancements in Apache Hive towards full support of SQL compliance
Major advancements in Apache Hive towards full support of SQL compliance
DataWorks Summit/Hadoop Summit
Fast Access to Your Data - Avro, JSON, ORC, and Parquet
Fast Access to Your Data - Avro, JSON, ORC, and Parquet
Owen O'Malley
LLAP Nov Meetup
LLAP Nov Meetup
t3rmin4t0r
Apache CarbonData:New high performance data format for faster data analysis
Apache CarbonData:New high performance data format for faster data analysis
liang chen
ORC 2015
ORC 2015
t3rmin4t0r
Next Generation Execution for Apache Storm
Next Generation Execution for Apache Storm
DataWorks Summit
Hive acid and_2.x new_features
Hive acid and_2.x new_features
Alberto Romero
Enabling Apache Zeppelin and Spark for Data Science in the Enterprise
Enabling Apache Zeppelin and Spark for Data Science in the Enterprise
DataWorks Summit/Hadoop Summit
You Can't Search Without Data
You Can't Search Without Data
Bryan Bende
What's new in Apache Spark 2.4
What's new in Apache Spark 2.4
boxu42
Ozone- Object store for Apache Hadoop
Ozone- Object store for Apache Hadoop
Hortonworks
Choosing an HDFS data storage format- Avro vs. Parquet and more - StampedeCon...
Choosing an HDFS data storage format- Avro vs. Parquet and more - StampedeCon...
StampedeCon
Streaming SQL with Apache Calcite
Streaming SQL with Apache Calcite
Julian Hyde
Running Services on YARN
Running Services on YARN
DataWorks Summit/Hadoop Summit
ORC Deep Dive 2020
ORC Deep Dive 2020
Owen O'Malley
What s new in spark 2.3 and spark 2.4
What s new in spark 2.3 and spark 2.4
DataWorks Summit
What’s new in Apache Spark 2.3 and Spark 2.4
What’s new in Apache Spark 2.3 and Spark 2.4
DataWorks Summit
More Related Content
What's hot
Big Data Storage - Comparing Speed and Features for Avro, JSON, ORC, and Parquet
Big Data Storage - Comparing Speed and Features for Avro, JSON, ORC, and Parquet
DataWorks Summit
LLAP: Building Cloud First BI
LLAP: Building Cloud First BI
DataWorks Summit
HadoopFileFormats_2016
HadoopFileFormats_2016
Jakub Wszolek, PhD
Apache Zeppelin + LIvy: Bringing Multi Tenancy to Interactive Data Analysis
Apache Zeppelin + LIvy: Bringing Multi Tenancy to Interactive Data Analysis
DataWorks Summit/Hadoop Summit
ORC File and Vectorization - Hadoop Summit 2013
ORC File and Vectorization - Hadoop Summit 2013
Owen O'Malley
Major advancements in Apache Hive towards full support of SQL compliance
Major advancements in Apache Hive towards full support of SQL compliance
DataWorks Summit/Hadoop Summit
Fast Access to Your Data - Avro, JSON, ORC, and Parquet
Fast Access to Your Data - Avro, JSON, ORC, and Parquet
Owen O'Malley
LLAP Nov Meetup
LLAP Nov Meetup
t3rmin4t0r
Apache CarbonData:New high performance data format for faster data analysis
Apache CarbonData:New high performance data format for faster data analysis
liang chen
ORC 2015
ORC 2015
t3rmin4t0r
Next Generation Execution for Apache Storm
Next Generation Execution for Apache Storm
DataWorks Summit
Hive acid and_2.x new_features
Hive acid and_2.x new_features
Alberto Romero
Enabling Apache Zeppelin and Spark for Data Science in the Enterprise
Enabling Apache Zeppelin and Spark for Data Science in the Enterprise
DataWorks Summit/Hadoop Summit
You Can't Search Without Data
You Can't Search Without Data
Bryan Bende
What's new in Apache Spark 2.4
What's new in Apache Spark 2.4
boxu42
Ozone- Object store for Apache Hadoop
Ozone- Object store for Apache Hadoop
Hortonworks
Choosing an HDFS data storage format- Avro vs. Parquet and more - StampedeCon...
Choosing an HDFS data storage format- Avro vs. Parquet and more - StampedeCon...
StampedeCon
Streaming SQL with Apache Calcite
Streaming SQL with Apache Calcite
Julian Hyde
Running Services on YARN
Running Services on YARN
DataWorks Summit/Hadoop Summit
ORC Deep Dive 2020
ORC Deep Dive 2020
Owen O'Malley
What's hot
(20)
Big Data Storage - Comparing Speed and Features for Avro, JSON, ORC, and Parquet
Big Data Storage - Comparing Speed and Features for Avro, JSON, ORC, and Parquet
LLAP: Building Cloud First BI
LLAP: Building Cloud First BI
HadoopFileFormats_2016
HadoopFileFormats_2016
Apache Zeppelin + LIvy: Bringing Multi Tenancy to Interactive Data Analysis
Apache Zeppelin + LIvy: Bringing Multi Tenancy to Interactive Data Analysis
ORC File and Vectorization - Hadoop Summit 2013
ORC File and Vectorization - Hadoop Summit 2013
Major advancements in Apache Hive towards full support of SQL compliance
Major advancements in Apache Hive towards full support of SQL compliance
Fast Access to Your Data - Avro, JSON, ORC, and Parquet
Fast Access to Your Data - Avro, JSON, ORC, and Parquet
LLAP Nov Meetup
LLAP Nov Meetup
Apache CarbonData:New high performance data format for faster data analysis
Apache CarbonData:New high performance data format for faster data analysis
ORC 2015
ORC 2015
Next Generation Execution for Apache Storm
Next Generation Execution for Apache Storm
Hive acid and_2.x new_features
Hive acid and_2.x new_features
Enabling Apache Zeppelin and Spark for Data Science in the Enterprise
Enabling Apache Zeppelin and Spark for Data Science in the Enterprise
You Can't Search Without Data
You Can't Search Without Data
What's new in Apache Spark 2.4
What's new in Apache Spark 2.4
Ozone- Object store for Apache Hadoop
Ozone- Object store for Apache Hadoop
Choosing an HDFS data storage format- Avro vs. Parquet and more - StampedeCon...
Choosing an HDFS data storage format- Avro vs. Parquet and more - StampedeCon...
Streaming SQL with Apache Calcite
Streaming SQL with Apache Calcite
Running Services on YARN
Running Services on YARN
ORC Deep Dive 2020
ORC Deep Dive 2020
Similar to ORC improvement in Apache Spark 2.3
What s new in spark 2.3 and spark 2.4
What s new in spark 2.3 and spark 2.4
DataWorks Summit
What’s new in Apache Spark 2.3 and Spark 2.4
What’s new in Apache Spark 2.3 and Spark 2.4
DataWorks Summit
File Format Benchmark - Avro, JSON, ORC & Parquet
File Format Benchmark - Avro, JSON, ORC & Parquet
DataWorks Summit/Hadoop Summit
Intro to Big Data Analytics using Apache Spark and Apache Zeppelin
Intro to Big Data Analytics using Apache Spark and Apache Zeppelin
Alex Zeltov
Intro to Spark with Zeppelin
Intro to Spark with Zeppelin
Hortonworks
Hadoop & Cloud Storage: Object Store Integration in Production
Hadoop & Cloud Storage: Object Store Integration in Production
DataWorks Summit/Hadoop Summit
Hadoop & cloud storage object store integration in production (final)
Hadoop & cloud storage object store integration in production (final)
Chris Nauroth
Hadoop & Cloud Storage: Object Store Integration in Production
Hadoop & Cloud Storage: Object Store Integration in Production
DataWorks Summit/Hadoop Summit
What's new in apache hive
What's new in apache hive
DataWorks Summit
Apache Hadoop 3 updates with migration story
Apache Hadoop 3 updates with migration story
Sunil Govindan
Mission to NARs with Apache NiFi
Mission to NARs with Apache NiFi
Hortonworks
Hortonworks Data in Motion Webinar Series Part 7 Apache Kafka Nifi Better Tog...
Hortonworks Data in Motion Webinar Series Part 7 Apache Kafka Nifi Better Tog...
Hortonworks
Hadoop 3 in a Nutshell
Hadoop 3 in a Nutshell
DataWorks Summit/Hadoop Summit
COMMitMDE'18: Eclipse Hawk: model repository querying as a service
COMMitMDE'18: Eclipse Hawk: model repository querying as a service
Antonio García-Domínguez
Apache Hadoop YARN: state of the union - Tokyo
Apache Hadoop YARN: state of the union - Tokyo
DataWorks Summit
Apache Hadoop YARN: state of the union
Apache Hadoop YARN: state of the union
DataWorks Summit
ORC 2015: Faster, Better, Smaller
ORC 2015: Faster, Better, Smaller
The Apache Software Foundation
Row/Column- Level Security in SQL for Apache Spark
Row/Column- Level Security in SQL for Apache Spark
DataWorks Summit/Hadoop Summit
High throughput data replication over RAFT
High throughput data replication over RAFT
DataWorks Summit
LLAP: long-lived execution in Hive
LLAP: long-lived execution in Hive
DataWorks Summit
Similar to ORC improvement in Apache Spark 2.3
(20)
What s new in spark 2.3 and spark 2.4
What s new in spark 2.3 and spark 2.4
What’s new in Apache Spark 2.3 and Spark 2.4
What’s new in Apache Spark 2.3 and Spark 2.4
File Format Benchmark - Avro, JSON, ORC & Parquet
File Format Benchmark - Avro, JSON, ORC & Parquet
Intro to Big Data Analytics using Apache Spark and Apache Zeppelin
Intro to Big Data Analytics using Apache Spark and Apache Zeppelin
Intro to Spark with Zeppelin
Intro to Spark with Zeppelin
Hadoop & Cloud Storage: Object Store Integration in Production
Hadoop & Cloud Storage: Object Store Integration in Production
Hadoop & cloud storage object store integration in production (final)
Hadoop & cloud storage object store integration in production (final)
Hadoop & Cloud Storage: Object Store Integration in Production
Hadoop & Cloud Storage: Object Store Integration in Production
What's new in apache hive
What's new in apache hive
Apache Hadoop 3 updates with migration story
Apache Hadoop 3 updates with migration story
Mission to NARs with Apache NiFi
Mission to NARs with Apache NiFi
Hortonworks Data in Motion Webinar Series Part 7 Apache Kafka Nifi Better Tog...
Hortonworks Data in Motion Webinar Series Part 7 Apache Kafka Nifi Better Tog...
Hadoop 3 in a Nutshell
Hadoop 3 in a Nutshell
COMMitMDE'18: Eclipse Hawk: model repository querying as a service
COMMitMDE'18: Eclipse Hawk: model repository querying as a service
Apache Hadoop YARN: state of the union - Tokyo
Apache Hadoop YARN: state of the union - Tokyo
Apache Hadoop YARN: state of the union
Apache Hadoop YARN: state of the union
ORC 2015: Faster, Better, Smaller
ORC 2015: Faster, Better, Smaller
Row/Column- Level Security in SQL for Apache Spark
Row/Column- Level Security in SQL for Apache Spark
High throughput data replication over RAFT
High throughput data replication over RAFT
LLAP: long-lived execution in Hive
LLAP: long-lived execution in Hive
Recently uploaded
Powering Real-Time Decisions with Continuous Data Streams
Powering Real-Time Decisions with Continuous Data Streams
Safe Software
VK Business Profile - provides IT solutions and Web Development
VK Business Profile - provides IT solutions and Web Development
vyaparkranti
MYjobs Presentation Django-based project
MYjobs Presentation Django-based project
AnoyGreter
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
Angel Borroy López
Unveiling the Future: Sylius 2.0 New Features
Unveiling the Future: Sylius 2.0 New Features
Łukasz Chruściel
Machine Learning Software Engineering Patterns and Their Engineering
Machine Learning Software Engineering Patterns and Their Engineering
Hironori Washizaki
Simplifying Microservices & Apps - The art of effortless development - Meetup...
Simplifying Microservices & Apps - The art of effortless development - Meetup...
Rob Geurden
SensoDat: Simulation-based Sensor Dataset of Self-driving Cars
SensoDat: Simulation-based Sensor Dataset of Self-driving Cars
Christian Birchler
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
Cizo Technology Services
Implementing Zero Trust strategy with Azure
Implementing Zero Trust strategy with Azure
Dinusha Kumarasiri
Advantages of Odoo ERP 17 for Your Business
Advantages of Odoo ERP 17 for Your Business
Envertis Software Solutions
Large Language Models for Test Case Evolution and Repair
Large Language Models for Test Case Evolution and Repair
Lionel Briand
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
OnePlan Solutions
Real-time Tracking and Monitoring with Cargo Cloud Solutions.pptx
Real-time Tracking and Monitoring with Cargo Cloud Solutions.pptx
RTS corp
Sending Calendar Invites on SES and Calendarsnack.pdf
Sending Calendar Invites on SES and Calendarsnack.pdf
31events.com
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
StefanoLambiase
Xen Safety Embedded OSS Summit April 2024 v4.pdf
Xen Safety Embedded OSS Summit April 2024 v4.pdf
Stefano Stabellini
2.pdf Ejercicios de programación competitiva
2.pdf Ejercicios de programación competitiva
Diego Iván Oliveros Acosta
英国UN学位证,北安普顿大学毕业证书1:1制作
英国UN学位证,北安普顿大学毕业证书1:1制作
qr0udbr0
Folding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a series
Philip Schwarz
Recently uploaded
(20)
Powering Real-Time Decisions with Continuous Data Streams
Powering Real-Time Decisions with Continuous Data Streams
VK Business Profile - provides IT solutions and Web Development
VK Business Profile - provides IT solutions and Web Development
MYjobs Presentation Django-based project
MYjobs Presentation Django-based project
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
Unveiling the Future: Sylius 2.0 New Features
Unveiling the Future: Sylius 2.0 New Features
Machine Learning Software Engineering Patterns and Their Engineering
Machine Learning Software Engineering Patterns and Their Engineering
Simplifying Microservices & Apps - The art of effortless development - Meetup...
Simplifying Microservices & Apps - The art of effortless development - Meetup...
SensoDat: Simulation-based Sensor Dataset of Self-driving Cars
SensoDat: Simulation-based Sensor Dataset of Self-driving Cars
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
Implementing Zero Trust strategy with Azure
Implementing Zero Trust strategy with Azure
Advantages of Odoo ERP 17 for Your Business
Advantages of Odoo ERP 17 for Your Business
Large Language Models for Test Case Evolution and Repair
Large Language Models for Test Case Evolution and Repair
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
Real-time Tracking and Monitoring with Cargo Cloud Solutions.pptx
Real-time Tracking and Monitoring with Cargo Cloud Solutions.pptx
Sending Calendar Invites on SES and Calendarsnack.pdf
Sending Calendar Invites on SES and Calendarsnack.pdf
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
Xen Safety Embedded OSS Summit April 2024 v4.pdf
Xen Safety Embedded OSS Summit April 2024 v4.pdf
2.pdf Ejercicios de programación competitiva
2.pdf Ejercicios de programación competitiva
英国UN学位证,北安普顿大学毕业证书1:1制作
英国UN学位证,北安普顿大学毕业证书1:1制作
Folding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a series
ORC improvement in Apache Spark 2.3
1.
1 © Hortonworks
Inc. 2011–2018. All rights reserved ORC Improvement in Apache Spark 2.3 Dongjoon Hyun Principal Software Engineer @ Hortonworks Data Science Team April 2018
2.
2 © Hortonworks
Inc. 2011–2018. All rights reserved Dongjoon Hyun • Hortonworks • Principal Software Engineer @ Data Science Team • Apache Project • Apache REEF Project Management Committee(PMC) Member & Committer • Apache Spark Project Contributor • GitHub • https://github.com/dongjoon-hyun
3.
3 © Hortonworks
Inc. 2011–2018. All rights reserved Agenda • What’s New in Apache Spark 2.3 • Previous ORC issues in Apache Spark • Current Approach & Demo • Performance & Limitation • Future roadmap
4.
4 © Hortonworks
Inc. 2011–2018. All rights reserved • Vectorized ORC Reader • Structured Streaming with ORC • Schema evolution with ORC • PySpark Performance Enhancements with Apache Arrow and ORC • Structured stream-stream joins • Spark History Server V2 • Spark on Kubernetes • Data source API V2 • Streaming API V2 • Continuous Structured Streaming Processing Major Features Experimental Features What’s New in Apache Spark 2.3
5.
5 © Hortonworks
Inc. 2011–2018. All rights reserved • Vectorized ORC Reader • Structured Streaming with ORC • Schema evolution with ORC • PySpark Performance Enhancements with Apache Arrow and ORC • Structured stream-stream joins • Spark History Server V2 • Spark on Kubernetes • Data source API V2 • Streaming API V2 • Continuous Structured Streaming Processing Major Features Experimental Features What’s New in Apache Spark 2.3
6.
6 © Hortonworks
Inc. 2011–2018. All rights reserved Spark’s file-based data sources • TEXT The simplest one with one string column schema • CSV Popular for data science workloads • JSON The most flexible one for schema changes • PARQUET The only one with vectorized reader • ORC Popular for shared Hive tables
7.
7 © Hortonworks
Inc. 2011–2018. All rights reserved Motivation • TEXT The simplest one with one string column schema • CSV Popular for data science workloads • JSON The most flexible one for schema changes • PARQUET The only one with vectorized reader • ORC Popular for shared Hive tables Fast Flexible Hive Table Access
8.
8 © Hortonworks
Inc. 2011–2018. All rights reserved Previous ORC Issues in Spark
9.
9 © Hortonworks
Inc. 2011–2018. All rights reserved Background – The story of Spark, ORC, and Hive • Before Apache ORC • Hive 1.2.1 (2015 JUN) SPARK-2883 (Spark 1.4) • After Apache ORC • v1.0.0 (2016 JAN) ... • v1.3.3 (2017 FEB) • v1.4.0 (2017 MAY)
10.
10 © Hortonworks
Inc. 2011–2018. All rights reserved Background – The story of Spark, ORC, and Hive – Cont. • Before Apache ORC • Hive 1.2.1 (2015 JUN) SPARK-2883 Hive 1.2.1 Spark 1.4 • After Apache ORC • v1.0.0 (2016 JAN) ... • v1.3.3 (2017 FEB) HIVE-15841 Hive 2.3 • v1.4.0 (2017 MAY) SPARK-21422 Spark 2.3 • v1.4.1 (2017 OCT) SPARK-22300 Spark 2.3 • v1.4.3 (2018 FEB) SPARK-23340, HIVE-18674 Hive 3.0 Spark 2.4
11.
11 © Hortonworks
Inc. 2011–2018. All rights reserved Six Issue Categories • ORC Writer Versions • Performance • Structured streaming • Column names • Hive tables and schema evolution • Robustness
12.
12 © Hortonworks
Inc. 2011–2018. All rights reserved Category 1 – ORC Writer Versions • ORIGINAL • HIVE_8732 (2014) ORC string statistics are not merged correctly • HIVE_4243 (2015) Fix column names in FileSinkOperator • HIVE_12055(2015) Create row-by-row shims for the write path • HIVE_13083(2016) Writing HiveDecimal can wrongly suppress present stream • ORC_101 (2016) Correct the use of the default charset in bloomfilter • ORC_135 (2018) PPD for timestamp is wrong when reader/writer timezones are different
13.
13 © Hortonworks
Inc. 2011–2018. All rights reserved Category 2 – Performance • Vectorized ORC Reader (SPARK-16060) • Fast reading partition-columns (SPARK-22712) • Pushing down filters for DateType (SPARK-21787)
14.
14 © Hortonworks
Inc. 2011–2018. All rights reserved • `FileNotFoundException` at writing empty partitions as ORC • Create structured steam with ORC files Write (SPARK-15474) Read (SPARK-22781) Category 3 – Structured streaming spark.readStream.orc(path)
15.
15 © Hortonworks
Inc. 2011–2018. All rights reserved Category 4 – Column names • Unicode column names (SPARK-23072) • Column names with dot (SPARK-21791) • Should not create invalid column names (SPARK-21912)
16.
16 © Hortonworks
Inc. 2011–2018. All rights reserved Category 5 – Hive tables and schema evolution • Support `ALTER TABLE ADD COLUMNS` (SPARK-21929) • Introduced at Spark 2.2, but throws AnalysisException for ORC • Support column positional mismatch (SPARK-22267) • Return wrong result if ORC file schema is different from Hive MetaStore schema order
17.
17 © Hortonworks
Inc. 2011–2018. All rights reserved Category 6 – Robustness • ORC metadata exceed ProtoBuf message size limit (SPARK-19109) • NullPointerException on zero-size ORC file (SPARK-19809) • Support `ignoreCorruptFiles` (SPARK-23049) • Support `ignoreMissingFiles` (SPARK-23305)
18.
18 © Hortonworks
Inc. 2011–2018. All rights reserved Current Approach
19.
19 © Hortonworks
Inc. 2011–2018. All rights reserved Supports two ORC file formats • Adding a new OrcFileFormat (SPARK-20682) FileFormat TextBasedFileFormat ParquetFileFormat OrcFileFormat HiveFileFormat JsonFileFormat LibSVMFileFormat CSVFileFormat TextFileFormat o.a.s.sql.execution.datasources o.a.s.ml.source.libsvmo.a.s.sql.hive.orc OrcFileFormat `hive` OrcFileFormat from Hive 1.2.1 `native` OrcFileFormat with ORC 1.4.1
20.
20 © Hortonworks
Inc. 2011–2018. All rights reserved In Reality – Four cases for ORC Reader/Writer `hive` Reader`native` Reader `hive` Writer `native` Writer • New Data • New Apps • Best performance (Vectorized Reader) • New Data • Old Apps • Improved performance (Non-vectorized Reader) • Old Data • New Apps • Improved performance (Vectorized Reader) • Old Data • Old Apps • As-Is performance (Non-vectorized Reader) 1 2 3 4
21.
21 © Hortonworks
Inc. 2011–2018. All rights reserved Performance – Single column scan from wide tables Number of columns Time (ms) 1M rows with all BIGINT columns 0 200 400 600 800 1000 1200 100 200 300 native writer / native reader hive writer / native reader native writer / hive reader hive writer / hive reader 4x 1 2 3 4 https://github.com/apache/spark/blob/branch-2.3/sql/hive/src/test/scala/org/apache/spark/sql/hive/orc/OrcReadBenchmark.scala
22.
22 © Hortonworks
Inc. 2011–2018. All rights reserved How to specify `native` OrcFileFormat directly CREATE TABLE people (name string, age int) USING org.apache.spark.sql.execution.datasources.orc df.write .format("org.apache.spark.sql.execution.datasources.orc") .save(path) spark.read .format("org.apache.spark.sql.execution.datasources.orc") .load(path) Read Dataset Write Dataset Create ORC Table
23.
23 © Hortonworks
Inc. 2011–2018. All rights reserved Switch ORC implementation (SPARK-20728) • spark.sql.orc.impl=native (default: `hive`) CREATE TABLE people (name string, age int) USING ORC OPTIONS (orc.compress 'ZLIB') spark.read.orc(path) df.write.orc(path) spark.read.format("orc").load (path) df.write.format("orc").save(path) Read/Write Dataset Read/Write Dataset Create ORC Table
24.
24 © Hortonworks
Inc. 2011–2018. All rights reserved Switch ORC implementation (SPARK-20728) – Cont. • spark.sql.orc.impl=native (default: `hive`) spark.readStream.orc(path) spark.readStream.format("orc").load(path) df.writeStream .option("checkpointLocation", path1) .format("orc") .option("path", path2) .start Read/Write Structured Stream
25.
25 © Hortonworks
Inc. 2011–2018. All rights reserved ORC Readers with `spark.sql.` configurations orc.impl # of cols <= codegen.maxFields `native` `hive` ORC Reader `hive` true spark.sql.codegen.maxFields=100 (default) false `native` ORC Columnar Batch Reader all atomic types true false `native` ORC Record Reader orc.enableVectorizedReader false true
26.
26 © Hortonworks
Inc. 2011–2018. All rights reserved ORC Readers with `spark.sql.` configurations – Cont. orc.enableVectorizedReader Wrapping ORC ColumnVector Spark OrcColumnVector orc.copyBatchToSpark true false Copying ORC ColumnVector Spark OffHeapColumnVector true columnVector.offheap.enabled true Copying ORC ColumnVector Spark OnHeapColumnVector false `native` ORC Columnar Batch Reader
27.
27 © Hortonworks
Inc. 2011–2018. All rights reserved Support vectorized read on Hive ORC Tables • spark.sql.hive.convertMetastoreOrc=true (default: false) • `spark.sql.orc.impl=native` is required, too. CREATE TABLE people (name string, age int) STORED AS ORC CREATE TABLE people (name string, age int) USING HIVE OPTIONS (fileFormat 'ORC', orc.compress 'gzip') SPARK-23355
28.
28 © Hortonworks
Inc. 2011–2018. All rights reserved Schema evolution at reading file-based data sources • Frequently, new files can have wider column types or new columns • Before SPARK-21929, users drop and recreate ORC table with an updated schema. • User-defined schema reduces schema inference cost and handles upcasting • boolean -> byte -> short -> int -> long • float -> double spark.read.schema("col1 int").orc(path) spark.read.schema("col1 long, col2 long").orc(path) Old Data New Data
29.
29 © Hortonworks
Inc. 2011–2018. All rights reserved Schema evolution at reading file-based data sources – Cont. 1. Native Vectorized ORC Reader 2. Only safe change via upcasting 3. JSON is the most flexible for changing types File Format TEXT CSV JSON ORC `hive` ORC `native`1 PARQUET Add Column At The End ✔️ ✔️ ✔️ ✔️ ✔️ Hide Trailing Column ✔️ ✔️ ✔️ ✔️ ✔️ Hide Column ✔️ ✔️ ✔️ Change Type2 ✔️ ✔️3 ✔️ Change Position ✔️ ✔️ ✔️
30.
30 © Hortonworks
Inc. 2011–2018. All rights reserved Demo 1 ORC configuration
31.
31 © Hortonworks
Inc. 2011–2018. All rights reserved Demo 2 PySpark with ORC
32.
32 © Hortonworks
Inc. 2011–2018. All rights reserved Performance
33.
33 © Hortonworks
Inc. 2011–2018. All rights reserved Micro Benchmark • Target • Apache Spark 2.3.0 • Apache ORC 1.4.1 • Machine • MacBook Pro (2015 Mid) • Intel® Core™ i7-4770JQ CPI @ 2.20GHz • Mac OS X 10.13.4 • JDK 1.8.0_161
34.
34 © Hortonworks
Inc. 2011–2018. All rights reserved Performance – Single column scan from wide tables Number of columns Time (ms) 1M rows with all BIGINT columns 0 200 400 600 800 1000 1200 100 200 300 native writer / native reader hive writer / hive reader 4x https://github.com/apache/spark/blob/branch-2.3/sql/hive/src/test/scala/org/apache/spark/sql/hive/orc/OrcReadBenchmark.scala
35.
35 © Hortonworks
Inc. 2011–2018. All rights reserved Performance – Vectorized Read 0 500 1000 1500 2000 2500 TINYINT SMALLINT INT BIGINT FLOAT DOULBE native hive 15M rows in a single-column table Time (ms) 10x 5x https://github.com/apache/spark/blob/branch-2.3/sql/hive/src/test/scala/org/apache/spark/sql/hive/orc/OrcReadBenchmark.scala 11x
36.
36 © Hortonworks
Inc. 2011–2018. All rights reserved Performance – Partitioned table read 0 500 1000 1500 2000 2500 Data column Partition column Both columns native hive Time (ms) 21x7x https://github.com/apache/spark/blob/branch-2.3/sql/hive/src/test/scala/org/apache/spark/sql/hive/orc/OrcReadBenchmark.scala 15M rows in a partitioned table
37.
37 © Hortonworks
Inc. 2011–2018. All rights reserved Predicate Pushdown 0 2000 4000 6000 8000 10000 12000 14000 16000 18000 Select 10% rows (id < value) Select 50% rows (id < value) Select 90% rows (id < value) Select all rows (id IS NOT NULL) parquet native Time (ms) https://github.com/apache/spark/blob/branch-2.3/sql/core/src/test/scala/org/apache/spark/sql/FilterPushdownBenchmark.scala 15M rows with 5 data columns and 1 sequential id column
38.
38 © Hortonworks
Inc. 2011–2018. All rights reserved Limitation Future Roadmap
39.
39 © Hortonworks
Inc. 2011–2018. All rights reserved Limitation • Spark vectorization supports atomic types only • Limited simple schema evolution. JSON provides more • boolean -> byte -> short -> int -> long • float -> double • `convertMetastore` ignores `STORED AS` table properties (SPARK-23355) • Both ORC/Parquet
40.
40 © Hortonworks
Inc. 2011–2018. All rights reserved Future Roadmap – Apache Spark 2.4 (2018 Fall) • Feature Parity for ORC with Parquet (SPARK-20901) • Use `native` ORC implementation by default (SPARK-23456) • Use ORC predicate pushdown by default (SPARK-21783) • Use `convertMetastoreOrc` by default (SPARK-22279) • Test ORC as default data source format (SPARK-23553) • Test and support Bloom Filters (SPARK-12417)
41.
41 © Hortonworks
Inc. 2011–2018. All rights reserved Future Roadmap – On-going work • Support VectorUDT/MatrixUDT (SPARK-22320) • Support CHAR/VARCHAR Types • Vectorized Writer with DataSource V2 • ALTER TABLE … CHANGE column type (SPARK-18727)
42.
42 © Hortonworks
Inc. 2011–2018. All rights reserved Summary • Apache Spark 2.3 starts to take advantage of Apache ORC • Native vectorized ORC reader • boosts Spark ORC performance • provides better schema evolution ability • Structured streaming starts to work with ORC (both reader/writer) • Spark is going to become faster and faster with ORC
43.
43 © Hortonworks
Inc. 2011–2018. All rights reserved Reference • https://youtu.be/EL-NHiwqCSY, ORC configuration in Apache Spark 2.3 • https://youtu.be/zJZ1gtzu-rs, Apache Spark 2.3 ORC with Apache Arrow • https://community.hortonworks.com/articles/148917/orc-improvements-for-apache- spark-22.html • https://www.slideshare.net/Hadoop_Summit/performance-update-when-apache-orc- met-apache-spark-81023199, Dataworks Summit 2017 Sydney • https://www.slideshare.net/Hadoop_Summit/orc-file-optimizing-your-big-data, Dataworks Summit 2017 San Jose
44.
44 © Hortonworks
Inc. 2011–2018. All rights reserved Questions?
45.
45 © Hortonworks
Inc. 2011–2018. All rights reserved Thank you
Download now