SlideShare a Scribd company logo
1 of 26
Download to read offline
PERFORMANCE UPDATE: WHEN
APACHE ORC MET APACHE SPARK
Dongjoon Hyun - dhyun@hortonworks.com
Owen O'Malley - owen@hortonworks.com
@owen_omalley
2 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Agenda
Performance Update
Spark – Apache ORC 1.4 Integration
Benchmark Overview
Results
Roadmap
3 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
ORC History
 Originally released as part of Hive
– Released in Hive 0.11.0 (2013-05-16)
– Included in each release before Hive 2.3.0 (2017-07-17)
 Factored out of Hive
– Improve integration with other tools
– Shrink the size of the dependencies
– Releases faster than Hive
– Added C++ reader and new C++ writer
4 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Integration History
 Before Apache ORC
– Hive 1.2.1 (2015-06-27)  SPARK-2883
Since Spark 1.4, Hive ORC is used.
 After Apache ORC
– v1.0.0 (2016-01-25)
– v1.1.0 (2016-06-10)
– v1.2.0 (2016-08-25)
– v1.3.0 (2017-01-23)
– v1.4.0 (2017-05-08)  SPARK-21422
For Spark 2.3, Apache ORC dependency is added and used.
5 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Integration Design (in HDP 2.6.3 with Spark 2.2)
 Switch between old and new formats by configuration
FileFormat
TextBasedFileFormat
ParquetFileFormat
OrcFileFormat
HiveFileFormat
JsonFileFormat
LibSVMFileFormat
CSVFileFormat
TextFileFormat
o.a.s.sql.execution.datasources
o.a.s.ml.source.libsvmo.a.s.sql.hive.orc
OrcFileFormat
Old OrcFileFormat
from Hive 1.2.1
(SPARK-2883)
New OrcFileFormat
with ORC 1.4.0
(SPARK-21422,
SPARK-20682,
SPARK-20728)
6 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Benefit in Apache Spark
 Speed
– Use both Spark ColumnarBatch and ORC RowBatch together more seamlessly
 Stability
– Apache ORC 1.4.0 has many fixes and we can depend on ORC community more.
 Usability
– User can use ORC data sources without hive module, i.e, -Phive.
 Maintainability
– Reduce the Hive dependency and can remove old legacy code later.
7 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Faster and More Scalable
Number of columns
Time
(ms)
Single Column Scan from Wide Tables
(1M rows with all BIGINT columns)
0 500 1000 1500 2000
TINYINT
SMALLINT
INT
BIGINT
FLOAT
DOULBE
OLD NEW
Vectorized Read
(15M rows in a single-column table)
Time
(ms)
0
200
400
600
800
100 200 300
NEW OLD
8 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Support Matrix
 HDP 2.6.3 is going to add a new faster and stable ORC file format for ORC tables
with the subset of limitations of Apache Spark 2.2.
 The followings are not supported yet
– Zero-byte ORC File
– Schema Evolution
• Adding columns at the end
• Changing types and deleting columns
 Please see the full JIRA issue list in next slides.
9 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Done Tickets
 SPARK-20901 Feature parity for ORC with Parquet
– The parent ticket of all ORC issues
 Done
– SPARK-20566 ColumnVector should support `appendFloats` for array
– SPARK-21422 Depend on Apache ORC 1.4.0
– SPARK-21831 Remove `spark.sql.hive.convertMetastoreOrc` config in HiveCompatibilitySuite
– SPARK-21839 Support SQL config for ORC compression
– SPARK-21884 Fix StackOverflowError on MetadataOnlyQuery
– SPARK-21912 ORC/Parquet table should not create invalid column names
10 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
On-Going Tickets
 SPARK-20682 Support a new faster ORC data source based on Apache ORC
 SPARK-20728 Make ORCFileFormat configurable between sql/hive and sql/core
 SPARK-16060 Vectorized Orc reader
 SPARK-21791 ORC should support column names with dot
 SPARK-21787 Support for pushing down filters for DATE types in ORC
 SPARK-19809 Zero byte ORC file support
 SPARK-14387 Enable Hive-1.x ORC compatibility with
spark.sql.hive.convertMetastoreOrc
11 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
To-do Tickets
 Configuration
– SPARK-21783 Turn on ORC filter push-down by default
 Read
– SPARK-11412 Support merge schema for ORC
– SPARK-16628 OrcConversions should not convert an ORC table
 Alter
– SPARK-21929 Support `ALTER TABLE ADD COLUMNS(..)` for ORC data source
– SPARK-18355 Spark SQL fails to read from a ORC table with new column
 Write
– SPARK-12417 Orc bloom filter options are not propagated during file write
12 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Agenda
Performance Update
Spark – Apache ORC 1.4 Integration
Benchmark Overview
Results
Roadmap
13 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Benchmark Overview
• Benchmark Objective
• Compare performance of New ORC vs Old ORC in Spark
• With TPC-DS at different scales 1 TB, 10 TB
14 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Enabling New ORC in Spark
 spark.sql.hive.convertMetastoreOrc=true
 spark.sql.orc.enabled=true
15 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Software/Hardware/Cluster Details
 Software
– Internal branch of Spark 2.2 (in HDP)
 Node Configuration
– 32 CPU, Intel E5-2640, 2.00GHz
– 256 GB RAM
– 6 SATA Disks (~4 TB, 7200 rpm)
– 10 Gbps network card
 Cluster
– 10 nodes used for 1 TB
– 15 nodes used for 10 TB
– Double type used in data
– TPC-DS (1.4) queries in Spark used for benchmarking
16 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Spark Tuning
Spark Parameter Name Default Value Tuned Value
spark.sql.hive.convertMetastoreOrc false true
spark.sql.orc.enabled false true
spark.sql.orc.filterPushdown false true
spark.sql.statistics.fallBackToHdfs false true
spark.sql.autoBroadcastJoinThreshold 10L * 1024 * 1024 (10MB) 26214400
spark.shuffle.io.numConnectionsPerPeer 1 10
spark.io.compression.lz4.blockSize 32k 128kb
spark.sql.shuffle.partitions 200 300
spark.network.timeout 120s 600s
spark.locality.wait 3s 0s
*Hive Parameter Name (hive-site.xml) Default Value Tuned Value
hive.exec.max.dynamic.partitions 1000 10000
hive.exec.dynamic.partition.mode strict nonstrict
*mainly for data ingestion
*all tables analyzed with `noscan` option for getting basic stats
17 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Agenda
Performance Update
Spark – Apache ORC 1.4 Integration
Benchmark Overview
Results
Roadmap
18 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Results : 10 TB Scale (15 executors, 170 GB, 25 cores)
* Total runtime of 74 queries in TPC-DS
- New ORC is > 2x better than old ORC
19 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Results : 10 TB Scale (15 executors, 170 GB, 25 cores)
New ORC vs Old ORC
New ORC consistently outperforms old ORC in all queries significantly
20 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Results : 1 TB Scale (10 executors, 170 GB, 25 cores)
* Total runtime of 97 queries in TPC-DS @ 1 TB scale
• New ORC is ~2x faster than old ORC
21 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
*Profiler shows high CPU usage with libzip.so during ORC reads. ORC-175 can be used for improving this further
*ORC-175 (Intel ISAL: intelligent storage acceleration libraries for inflate) can be useful here
22 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Comparison of New ORC vs Old ORC vs Parquet (10 TB Scale)
* Total runtime of 74 queries in TPC-DS
* 10 TB Scale (15 executors, 170 GB, 25 cores)
23 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Agenda
Performance Update
Spark – Apache ORC 1.4 Integration
Benchmark Overview
Results
Roadmap
24 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
RoadMap
 Tune default ambari configs for spark based on these benchmark results to get better
OOTB experience for end users
 Additional enhancements like parallel reading of footers in ORC
 Include ORC-175 (Intel ISAL) when complete
 Contribute back the fixes back to community
25 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Thanks!
 Questions
– For ORC – dev@@orc.apache.org
– For Spark – dev@spark.apache.org
– Benchmarks – dhyun@hortonworks.com
26 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Results : 10 TB Scale (15 executors, 170 GB, 25 cores)
Comparison of New ORC vs Parquet

More Related Content

What's hot

Top 5 Mistakes When Writing Spark Applications
Top 5 Mistakes When Writing Spark ApplicationsTop 5 Mistakes When Writing Spark Applications
Top 5 Mistakes When Writing Spark ApplicationsSpark Summit
 
Hive 3 - a new horizon
Hive 3 - a new horizonHive 3 - a new horizon
Hive 3 - a new horizonThejas Nair
 
Building a SIMD Supported Vectorized Native Engine for Spark SQL
Building a SIMD Supported Vectorized Native Engine for Spark SQLBuilding a SIMD Supported Vectorized Native Engine for Spark SQL
Building a SIMD Supported Vectorized Native Engine for Spark SQLDatabricks
 
Scylla Summit 2022: Scylla 5.0 New Features, Part 2
Scylla Summit 2022: Scylla 5.0 New Features, Part 2Scylla Summit 2022: Scylla 5.0 New Features, Part 2
Scylla Summit 2022: Scylla 5.0 New Features, Part 2ScyllaDB
 
High-Performance Advanced Analytics with Spark-Alchemy
High-Performance Advanced Analytics with Spark-AlchemyHigh-Performance Advanced Analytics with Spark-Alchemy
High-Performance Advanced Analytics with Spark-AlchemyDatabricks
 
Data Engineer's Lunch #83: Strategies for Migration to Apache Iceberg
Data Engineer's Lunch #83: Strategies for Migration to Apache IcebergData Engineer's Lunch #83: Strategies for Migration to Apache Iceberg
Data Engineer's Lunch #83: Strategies for Migration to Apache IcebergAnant Corporation
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeDatabricks
 
Apache Spark Core—Deep Dive—Proper Optimization
Apache Spark Core—Deep Dive—Proper OptimizationApache Spark Core—Deep Dive—Proper Optimization
Apache Spark Core—Deep Dive—Proper OptimizationDatabricks
 
Performance Optimizations in Apache Impala
Performance Optimizations in Apache ImpalaPerformance Optimizations in Apache Impala
Performance Optimizations in Apache ImpalaCloudera, Inc.
 
Apache Tez - A New Chapter in Hadoop Data Processing
Apache Tez - A New Chapter in Hadoop Data ProcessingApache Tez - A New Chapter in Hadoop Data Processing
Apache Tez - A New Chapter in Hadoop Data ProcessingDataWorks Summit
 
Apache Iceberg: An Architectural Look Under the Covers
Apache Iceberg: An Architectural Look Under the CoversApache Iceberg: An Architectural Look Under the Covers
Apache Iceberg: An Architectural Look Under the CoversScyllaDB
 
Simplify CDC Pipeline with Spark Streaming SQL and Delta Lake
Simplify CDC Pipeline with Spark Streaming SQL and Delta LakeSimplify CDC Pipeline with Spark Streaming SQL and Delta Lake
Simplify CDC Pipeline with Spark Streaming SQL and Delta LakeDatabricks
 
A Deep Dive into Query Execution Engine of Spark SQL
A Deep Dive into Query Execution Engine of Spark SQLA Deep Dive into Query Execution Engine of Spark SQL
A Deep Dive into Query Execution Engine of Spark SQLDatabricks
 
Achieving Lakehouse Models with Spark 3.0
Achieving Lakehouse Models with Spark 3.0Achieving Lakehouse Models with Spark 3.0
Achieving Lakehouse Models with Spark 3.0Databricks
 
Adaptive Query Execution: Speeding Up Spark SQL at Runtime
Adaptive Query Execution: Speeding Up Spark SQL at RuntimeAdaptive Query Execution: Speeding Up Spark SQL at Runtime
Adaptive Query Execution: Speeding Up Spark SQL at RuntimeDatabricks
 
Using Delta Lake to Transform a Legacy Apache Spark to Support Complex Update...
Using Delta Lake to Transform a Legacy Apache Spark to Support Complex Update...Using Delta Lake to Transform a Legacy Apache Spark to Support Complex Update...
Using Delta Lake to Transform a Legacy Apache Spark to Support Complex Update...Databricks
 
The Top Five Mistakes Made When Writing Streaming Applications with Mark Grov...
The Top Five Mistakes Made When Writing Streaming Applications with Mark Grov...The Top Five Mistakes Made When Writing Streaming Applications with Mark Grov...
The Top Five Mistakes Made When Writing Streaming Applications with Mark Grov...Databricks
 

What's hot (20)

Top 5 Mistakes When Writing Spark Applications
Top 5 Mistakes When Writing Spark ApplicationsTop 5 Mistakes When Writing Spark Applications
Top 5 Mistakes When Writing Spark Applications
 
Hive 3 - a new horizon
Hive 3 - a new horizonHive 3 - a new horizon
Hive 3 - a new horizon
 
Building a SIMD Supported Vectorized Native Engine for Spark SQL
Building a SIMD Supported Vectorized Native Engine for Spark SQLBuilding a SIMD Supported Vectorized Native Engine for Spark SQL
Building a SIMD Supported Vectorized Native Engine for Spark SQL
 
Scylla Summit 2022: Scylla 5.0 New Features, Part 2
Scylla Summit 2022: Scylla 5.0 New Features, Part 2Scylla Summit 2022: Scylla 5.0 New Features, Part 2
Scylla Summit 2022: Scylla 5.0 New Features, Part 2
 
Hive tuning
Hive tuningHive tuning
Hive tuning
 
High-Performance Advanced Analytics with Spark-Alchemy
High-Performance Advanced Analytics with Spark-AlchemyHigh-Performance Advanced Analytics with Spark-Alchemy
High-Performance Advanced Analytics with Spark-Alchemy
 
Data Engineer's Lunch #83: Strategies for Migration to Apache Iceberg
Data Engineer's Lunch #83: Strategies for Migration to Apache IcebergData Engineer's Lunch #83: Strategies for Migration to Apache Iceberg
Data Engineer's Lunch #83: Strategies for Migration to Apache Iceberg
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
 
Apache Spark Core—Deep Dive—Proper Optimization
Apache Spark Core—Deep Dive—Proper OptimizationApache Spark Core—Deep Dive—Proper Optimization
Apache Spark Core—Deep Dive—Proper Optimization
 
Performance Optimizations in Apache Impala
Performance Optimizations in Apache ImpalaPerformance Optimizations in Apache Impala
Performance Optimizations in Apache Impala
 
Apache Tez - A New Chapter in Hadoop Data Processing
Apache Tez - A New Chapter in Hadoop Data ProcessingApache Tez - A New Chapter in Hadoop Data Processing
Apache Tez - A New Chapter in Hadoop Data Processing
 
Apache Iceberg: An Architectural Look Under the Covers
Apache Iceberg: An Architectural Look Under the CoversApache Iceberg: An Architectural Look Under the Covers
Apache Iceberg: An Architectural Look Under the Covers
 
Simplify CDC Pipeline with Spark Streaming SQL and Delta Lake
Simplify CDC Pipeline with Spark Streaming SQL and Delta LakeSimplify CDC Pipeline with Spark Streaming SQL and Delta Lake
Simplify CDC Pipeline with Spark Streaming SQL and Delta Lake
 
A Deep Dive into Query Execution Engine of Spark SQL
A Deep Dive into Query Execution Engine of Spark SQLA Deep Dive into Query Execution Engine of Spark SQL
A Deep Dive into Query Execution Engine of Spark SQL
 
Spark SQL
Spark SQLSpark SQL
Spark SQL
 
Achieving Lakehouse Models with Spark 3.0
Achieving Lakehouse Models with Spark 3.0Achieving Lakehouse Models with Spark 3.0
Achieving Lakehouse Models with Spark 3.0
 
Optimizing Hive Queries
Optimizing Hive QueriesOptimizing Hive Queries
Optimizing Hive Queries
 
Adaptive Query Execution: Speeding Up Spark SQL at Runtime
Adaptive Query Execution: Speeding Up Spark SQL at RuntimeAdaptive Query Execution: Speeding Up Spark SQL at Runtime
Adaptive Query Execution: Speeding Up Spark SQL at Runtime
 
Using Delta Lake to Transform a Legacy Apache Spark to Support Complex Update...
Using Delta Lake to Transform a Legacy Apache Spark to Support Complex Update...Using Delta Lake to Transform a Legacy Apache Spark to Support Complex Update...
Using Delta Lake to Transform a Legacy Apache Spark to Support Complex Update...
 
The Top Five Mistakes Made When Writing Streaming Applications with Mark Grov...
The Top Five Mistakes Made When Writing Streaming Applications with Mark Grov...The Top Five Mistakes Made When Writing Streaming Applications with Mark Grov...
The Top Five Mistakes Made When Writing Streaming Applications with Mark Grov...
 

Viewers also liked

An Apache Hive Based Data Warehouse
An Apache Hive Based Data WarehouseAn Apache Hive Based Data Warehouse
An Apache Hive Based Data WarehouseDataWorks Summit
 
MLeap: Deploy Spark ML Pipelines to Production API Servers
MLeap: Deploy Spark ML Pipelines to Production API ServersMLeap: Deploy Spark ML Pipelines to Production API Servers
MLeap: Deploy Spark ML Pipelines to Production API ServersDataWorks Summit
 
Accelerate Your Big Data Analytics Efforts with SAS and Hadoop
Accelerate Your Big Data Analytics Efforts with SAS and HadoopAccelerate Your Big Data Analytics Efforts with SAS and Hadoop
Accelerate Your Big Data Analytics Efforts with SAS and HadoopDataWorks Summit
 
Breaching the 100TB Mark with SQL Over Hadoop
Breaching the 100TB Mark with SQL Over HadoopBreaching the 100TB Mark with SQL Over Hadoop
Breaching the 100TB Mark with SQL Over HadoopDataWorks Summit
 
Introduction to Apache Amaterasu (Incubating): CD Framework For Your Big Data...
Introduction to Apache Amaterasu (Incubating): CD Framework For Your Big Data...Introduction to Apache Amaterasu (Incubating): CD Framework For Your Big Data...
Introduction to Apache Amaterasu (Incubating): CD Framework For Your Big Data...DataWorks Summit
 
Improving Python and Spark Performance and Interoperability: Spark Summit Eas...
Improving Python and Spark Performance and Interoperability: Spark Summit Eas...Improving Python and Spark Performance and Interoperability: Spark Summit Eas...
Improving Python and Spark Performance and Interoperability: Spark Summit Eas...Spark Summit
 
Debugging PySpark: Spark Summit East talk by Holden Karau
Debugging PySpark: Spark Summit East talk by Holden KarauDebugging PySpark: Spark Summit East talk by Holden Karau
Debugging PySpark: Spark Summit East talk by Holden KarauSpark Summit
 
Real-Time Ingesting and Transforming Sensor Data and Social Data with NiFi an...
Real-Time Ingesting and Transforming Sensor Data and Social Data with NiFi an...Real-Time Ingesting and Transforming Sensor Data and Social Data with NiFi an...
Real-Time Ingesting and Transforming Sensor Data and Social Data with NiFi an...DataWorks Summit
 
Cloud Operations with Streaming Analytics using Apache NiFi and Apache Flink
Cloud Operations with Streaming Analytics using Apache NiFi and Apache FlinkCloud Operations with Streaming Analytics using Apache NiFi and Apache Flink
Cloud Operations with Streaming Analytics using Apache NiFi and Apache FlinkDataWorks Summit
 
Dancing Elephants - Efficiently Working with Object Stores from Apache Spark ...
Dancing Elephants - Efficiently Working with Object Stores from Apache Spark ...Dancing Elephants - Efficiently Working with Object Stores from Apache Spark ...
Dancing Elephants - Efficiently Working with Object Stores from Apache Spark ...DataWorks Summit
 
Hortonworks Open Connected Data Platforms for IoT and Predictive Big Data Ana...
Hortonworks Open Connected Data Platforms for IoT and Predictive Big Data Ana...Hortonworks Open Connected Data Platforms for IoT and Predictive Big Data Ana...
Hortonworks Open Connected Data Platforms for IoT and Predictive Big Data Ana...DataWorks Summit
 
Running Zeppelin in Enterprise
Running Zeppelin in EnterpriseRunning Zeppelin in Enterprise
Running Zeppelin in EnterpriseDataWorks Summit
 
Simplify and Secure your Hadoop Environment with Hortonworks and Centrify
Simplify and Secure your Hadoop Environment with Hortonworks and CentrifySimplify and Secure your Hadoop Environment with Hortonworks and Centrify
Simplify and Secure your Hadoop Environment with Hortonworks and CentrifyHortonworks
 
Hadoop and Data Access Security
Hadoop and Data Access SecurityHadoop and Data Access Security
Hadoop and Data Access SecurityCloudera, Inc.
 

Viewers also liked (15)

An Apache Hive Based Data Warehouse
An Apache Hive Based Data WarehouseAn Apache Hive Based Data Warehouse
An Apache Hive Based Data Warehouse
 
MLeap: Deploy Spark ML Pipelines to Production API Servers
MLeap: Deploy Spark ML Pipelines to Production API ServersMLeap: Deploy Spark ML Pipelines to Production API Servers
MLeap: Deploy Spark ML Pipelines to Production API Servers
 
Accelerate Your Big Data Analytics Efforts with SAS and Hadoop
Accelerate Your Big Data Analytics Efforts with SAS and HadoopAccelerate Your Big Data Analytics Efforts with SAS and Hadoop
Accelerate Your Big Data Analytics Efforts with SAS and Hadoop
 
Breaching the 100TB Mark with SQL Over Hadoop
Breaching the 100TB Mark with SQL Over HadoopBreaching the 100TB Mark with SQL Over Hadoop
Breaching the 100TB Mark with SQL Over Hadoop
 
Introduction to Apache Amaterasu (Incubating): CD Framework For Your Big Data...
Introduction to Apache Amaterasu (Incubating): CD Framework For Your Big Data...Introduction to Apache Amaterasu (Incubating): CD Framework For Your Big Data...
Introduction to Apache Amaterasu (Incubating): CD Framework For Your Big Data...
 
Data Science Crash Course
Data Science Crash CourseData Science Crash Course
Data Science Crash Course
 
Improving Python and Spark Performance and Interoperability: Spark Summit Eas...
Improving Python and Spark Performance and Interoperability: Spark Summit Eas...Improving Python and Spark Performance and Interoperability: Spark Summit Eas...
Improving Python and Spark Performance and Interoperability: Spark Summit Eas...
 
Debugging PySpark: Spark Summit East talk by Holden Karau
Debugging PySpark: Spark Summit East talk by Holden KarauDebugging PySpark: Spark Summit East talk by Holden Karau
Debugging PySpark: Spark Summit East talk by Holden Karau
 
Real-Time Ingesting and Transforming Sensor Data and Social Data with NiFi an...
Real-Time Ingesting and Transforming Sensor Data and Social Data with NiFi an...Real-Time Ingesting and Transforming Sensor Data and Social Data with NiFi an...
Real-Time Ingesting and Transforming Sensor Data and Social Data with NiFi an...
 
Cloud Operations with Streaming Analytics using Apache NiFi and Apache Flink
Cloud Operations with Streaming Analytics using Apache NiFi and Apache FlinkCloud Operations with Streaming Analytics using Apache NiFi and Apache Flink
Cloud Operations with Streaming Analytics using Apache NiFi and Apache Flink
 
Dancing Elephants - Efficiently Working with Object Stores from Apache Spark ...
Dancing Elephants - Efficiently Working with Object Stores from Apache Spark ...Dancing Elephants - Efficiently Working with Object Stores from Apache Spark ...
Dancing Elephants - Efficiently Working with Object Stores from Apache Spark ...
 
Hortonworks Open Connected Data Platforms for IoT and Predictive Big Data Ana...
Hortonworks Open Connected Data Platforms for IoT and Predictive Big Data Ana...Hortonworks Open Connected Data Platforms for IoT and Predictive Big Data Ana...
Hortonworks Open Connected Data Platforms for IoT and Predictive Big Data Ana...
 
Running Zeppelin in Enterprise
Running Zeppelin in EnterpriseRunning Zeppelin in Enterprise
Running Zeppelin in Enterprise
 
Simplify and Secure your Hadoop Environment with Hortonworks and Centrify
Simplify and Secure your Hadoop Environment with Hortonworks and CentrifySimplify and Secure your Hadoop Environment with Hortonworks and Centrify
Simplify and Secure your Hadoop Environment with Hortonworks and Centrify
 
Hadoop and Data Access Security
Hadoop and Data Access SecurityHadoop and Data Access Security
Hadoop and Data Access Security
 

Similar to Performance Update: When Apache ORC Met Apache Spark

ORC improvement in Apache Spark 2.3
ORC improvement in Apache Spark 2.3ORC improvement in Apache Spark 2.3
ORC improvement in Apache Spark 2.3Dongjoon Hyun
 
ORC Improvement & Roadmap in Apache Spark 2.3 and 2.4
ORC Improvement & Roadmap in Apache Spark 2.3 and 2.4ORC Improvement & Roadmap in Apache Spark 2.3 and 2.4
ORC Improvement & Roadmap in Apache Spark 2.3 and 2.4Dongjoon Hyun
 
What s new in spark 2.3 and spark 2.4
What s new in spark 2.3 and spark 2.4What s new in spark 2.3 and spark 2.4
What s new in spark 2.3 and spark 2.4DataWorks Summit
 
What’s new in Apache Spark 2.3 and Spark 2.4
What’s new in Apache Spark 2.3 and Spark 2.4What’s new in Apache Spark 2.3 and Spark 2.4
What’s new in Apache Spark 2.3 and Spark 2.4DataWorks Summit
 
Fast Spark Access To Your Complex Data - Avro, JSON, ORC, and Parquet
Fast Spark Access To Your Complex Data - Avro, JSON, ORC, and ParquetFast Spark Access To Your Complex Data - Avro, JSON, ORC, and Parquet
Fast Spark Access To Your Complex Data - Avro, JSON, ORC, and ParquetOwen O'Malley
 
Hive on spark is blazing fast or is it final
Hive on spark is blazing fast or is it finalHive on spark is blazing fast or is it final
Hive on spark is blazing fast or is it finalHortonworks
 
Apache Hive 2.0 SQL, Speed, Scale by Alan Gates
Apache Hive 2.0 SQL, Speed, Scale by Alan GatesApache Hive 2.0 SQL, Speed, Scale by Alan Gates
Apache Hive 2.0 SQL, Speed, Scale by Alan GatesBig Data Spain
 
Spark and Object Stores —What You Need to Know: Spark Summit East talk by Ste...
Spark and Object Stores —What You Need to Know: Spark Summit East talk by Ste...Spark and Object Stores —What You Need to Know: Spark Summit East talk by Ste...
Spark and Object Stores —What You Need to Know: Spark Summit East talk by Ste...Spark Summit
 
Apache Hive 2.0; SQL, Speed, Scale
Apache Hive 2.0; SQL, Speed, ScaleApache Hive 2.0; SQL, Speed, Scale
Apache Hive 2.0; SQL, Speed, ScaleHortonworks
 
Intro to Spark with Zeppelin
Intro to Spark with ZeppelinIntro to Spark with Zeppelin
Intro to Spark with ZeppelinHortonworks
 
Apache Accumulo 1.8.0 Overview
Apache Accumulo 1.8.0 OverviewApache Accumulo 1.8.0 Overview
Apache Accumulo 1.8.0 OverviewJosh Elser
 
Hive2.0 big dataspain-nov-2016
Hive2.0 big dataspain-nov-2016Hive2.0 big dataspain-nov-2016
Hive2.0 big dataspain-nov-2016alanfgates
 
Hive2.0 sql speed-scale--hadoop-summit-dublin-apr-2016
Hive2.0 sql speed-scale--hadoop-summit-dublin-apr-2016Hive2.0 sql speed-scale--hadoop-summit-dublin-apr-2016
Hive2.0 sql speed-scale--hadoop-summit-dublin-apr-2016alanfgates
 
Intro to Big Data Analytics using Apache Spark and Apache Zeppelin
Intro to Big Data Analytics using Apache Spark and Apache ZeppelinIntro to Big Data Analytics using Apache Spark and Apache Zeppelin
Intro to Big Data Analytics using Apache Spark and Apache ZeppelinAlex Zeltov
 

Similar to Performance Update: When Apache ORC Met Apache Spark (20)

ORC improvement in Apache Spark 2.3
ORC improvement in Apache Spark 2.3ORC improvement in Apache Spark 2.3
ORC improvement in Apache Spark 2.3
 
ORC Improvement & Roadmap in Apache Spark 2.3 and 2.4
ORC Improvement & Roadmap in Apache Spark 2.3 and 2.4ORC Improvement & Roadmap in Apache Spark 2.3 and 2.4
ORC Improvement & Roadmap in Apache Spark 2.3 and 2.4
 
What s new in spark 2.3 and spark 2.4
What s new in spark 2.3 and spark 2.4What s new in spark 2.3 and spark 2.4
What s new in spark 2.3 and spark 2.4
 
What’s new in Apache Spark 2.3 and Spark 2.4
What’s new in Apache Spark 2.3 and Spark 2.4What’s new in Apache Spark 2.3 and Spark 2.4
What’s new in Apache Spark 2.3 and Spark 2.4
 
Fast Spark Access To Your Complex Data - Avro, JSON, ORC, and Parquet
Fast Spark Access To Your Complex Data - Avro, JSON, ORC, and ParquetFast Spark Access To Your Complex Data - Avro, JSON, ORC, and Parquet
Fast Spark Access To Your Complex Data - Avro, JSON, ORC, and Parquet
 
Hive on spark is blazing fast or is it final
Hive on spark is blazing fast or is it finalHive on spark is blazing fast or is it final
Hive on spark is blazing fast or is it final
 
Apache Hive 2.0 SQL, Speed, Scale by Alan Gates
Apache Hive 2.0 SQL, Speed, Scale by Alan GatesApache Hive 2.0 SQL, Speed, Scale by Alan Gates
Apache Hive 2.0 SQL, Speed, Scale by Alan Gates
 
Spark and Object Stores —What You Need to Know: Spark Summit East talk by Ste...
Spark and Object Stores —What You Need to Know: Spark Summit East talk by Ste...Spark and Object Stores —What You Need to Know: Spark Summit East talk by Ste...
Spark and Object Stores —What You Need to Know: Spark Summit East talk by Ste...
 
Sparc solaris servers
Sparc solaris serversSparc solaris servers
Sparc solaris servers
 
Apache Hive 2.0; SQL, Speed, Scale
Apache Hive 2.0; SQL, Speed, ScaleApache Hive 2.0; SQL, Speed, Scale
Apache Hive 2.0; SQL, Speed, Scale
 
Intro to Spark with Zeppelin
Intro to Spark with ZeppelinIntro to Spark with Zeppelin
Intro to Spark with Zeppelin
 
File Format Benchmark - Avro, JSON, ORC & Parquet
File Format Benchmark - Avro, JSON, ORC & ParquetFile Format Benchmark - Avro, JSON, ORC & Parquet
File Format Benchmark - Avro, JSON, ORC & Parquet
 
Row/Column- Level Security in SQL for Apache Spark
Row/Column- Level Security in SQL for Apache SparkRow/Column- Level Security in SQL for Apache Spark
Row/Column- Level Security in SQL for Apache Spark
 
File Format Benchmark - Avro, JSON, ORC & Parquet
File Format Benchmark - Avro, JSON, ORC & ParquetFile Format Benchmark - Avro, JSON, ORC & Parquet
File Format Benchmark - Avro, JSON, ORC & Parquet
 
Apache Accumulo 1.8.0 Overview
Apache Accumulo 1.8.0 OverviewApache Accumulo 1.8.0 Overview
Apache Accumulo 1.8.0 Overview
 
Hive2.0 big dataspain-nov-2016
Hive2.0 big dataspain-nov-2016Hive2.0 big dataspain-nov-2016
Hive2.0 big dataspain-nov-2016
 
Apache Hive 2.0: SQL, Speed, Scale
Apache Hive 2.0: SQL, Speed, ScaleApache Hive 2.0: SQL, Speed, Scale
Apache Hive 2.0: SQL, Speed, Scale
 
Hive2.0 sql speed-scale--hadoop-summit-dublin-apr-2016
Hive2.0 sql speed-scale--hadoop-summit-dublin-apr-2016Hive2.0 sql speed-scale--hadoop-summit-dublin-apr-2016
Hive2.0 sql speed-scale--hadoop-summit-dublin-apr-2016
 
Running Spark in Production
Running Spark in ProductionRunning Spark in Production
Running Spark in Production
 
Intro to Big Data Analytics using Apache Spark and Apache Zeppelin
Intro to Big Data Analytics using Apache Spark and Apache ZeppelinIntro to Big Data Analytics using Apache Spark and Apache Zeppelin
Intro to Big Data Analytics using Apache Spark and Apache Zeppelin
 

More from DataWorks Summit

Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisDataWorks Summit
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiDataWorks Summit
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...DataWorks Summit
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...DataWorks Summit
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal SystemDataWorks Summit
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExampleDataWorks Summit
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberDataWorks Summit
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixDataWorks Summit
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiDataWorks Summit
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsDataWorks Summit
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureDataWorks Summit
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EngineDataWorks Summit
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...DataWorks Summit
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudDataWorks Summit
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiDataWorks Summit
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerDataWorks Summit
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...DataWorks Summit
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouDataWorks Summit
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkDataWorks Summit
 

More from DataWorks Summit (20)

Data Science Crash Course
Data Science Crash CourseData Science Crash Course
Data Science Crash Course
 
Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache Ratis
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal System
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist Example
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at Uber
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability Improvements
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant Architecture
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything Engine
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google Cloud
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near You
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
 

Recently uploaded

Anypoint Code Builder , Google Pub sub connector and MuleSoft RPA
Anypoint Code Builder , Google Pub sub connector and MuleSoft RPAAnypoint Code Builder , Google Pub sub connector and MuleSoft RPA
Anypoint Code Builder , Google Pub sub connector and MuleSoft RPAshyamraj55
 
Computer 10: Lesson 10 - Online Crimes and Hazards
Computer 10: Lesson 10 - Online Crimes and HazardsComputer 10: Lesson 10 - Online Crimes and Hazards
Computer 10: Lesson 10 - Online Crimes and HazardsSeth Reyes
 
IaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdf
IaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdfIaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdf
IaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdfDaniel Santiago Silva Capera
 
Empowering Africa's Next Generation: The AI Leadership Blueprint
Empowering Africa's Next Generation: The AI Leadership BlueprintEmpowering Africa's Next Generation: The AI Leadership Blueprint
Empowering Africa's Next Generation: The AI Leadership BlueprintMahmoud Rabie
 
UiPath Studio Web workshop series - Day 7
UiPath Studio Web workshop series - Day 7UiPath Studio Web workshop series - Day 7
UiPath Studio Web workshop series - Day 7DianaGray10
 
Crea il tuo assistente AI con lo Stregatto (open source python framework)
Crea il tuo assistente AI con lo Stregatto (open source python framework)Crea il tuo assistente AI con lo Stregatto (open source python framework)
Crea il tuo assistente AI con lo Stregatto (open source python framework)Commit University
 
VoIP Service and Marketing using Odoo and Asterisk PBX
VoIP Service and Marketing using Odoo and Asterisk PBXVoIP Service and Marketing using Odoo and Asterisk PBX
VoIP Service and Marketing using Odoo and Asterisk PBXTarek Kalaji
 
Comparing Sidecar-less Service Mesh from Cilium and Istio
Comparing Sidecar-less Service Mesh from Cilium and IstioComparing Sidecar-less Service Mesh from Cilium and Istio
Comparing Sidecar-less Service Mesh from Cilium and IstioChristian Posta
 
Igniting Next Level Productivity with AI-Infused Data Integration Workflows
Igniting Next Level Productivity with AI-Infused Data Integration WorkflowsIgniting Next Level Productivity with AI-Infused Data Integration Workflows
Igniting Next Level Productivity with AI-Infused Data Integration WorkflowsSafe Software
 
Designing A Time bound resource download URL
Designing A Time bound resource download URLDesigning A Time bound resource download URL
Designing A Time bound resource download URLRuncy Oommen
 
AI Fame Rush Review – Virtual Influencer Creation In Just Minutes
AI Fame Rush Review – Virtual Influencer Creation In Just MinutesAI Fame Rush Review – Virtual Influencer Creation In Just Minutes
AI Fame Rush Review – Virtual Influencer Creation In Just MinutesMd Hossain Ali
 
COMPUTER 10 Lesson 8 - Building a Website
COMPUTER 10 Lesson 8 - Building a WebsiteCOMPUTER 10 Lesson 8 - Building a Website
COMPUTER 10 Lesson 8 - Building a Websitedgelyza
 
NIST Cybersecurity Framework (CSF) 2.0 Workshop
NIST Cybersecurity Framework (CSF) 2.0 WorkshopNIST Cybersecurity Framework (CSF) 2.0 Workshop
NIST Cybersecurity Framework (CSF) 2.0 WorkshopBachir Benyammi
 
20230202 - Introduction to tis-py
20230202 - Introduction to tis-py20230202 - Introduction to tis-py
20230202 - Introduction to tis-pyJamie (Taka) Wang
 
Videogame localization & technology_ how to enhance the power of translation.pdf
Videogame localization & technology_ how to enhance the power of translation.pdfVideogame localization & technology_ how to enhance the power of translation.pdf
Videogame localization & technology_ how to enhance the power of translation.pdfinfogdgmi
 
UiPath Platform: The Backend Engine Powering Your Automation - Session 1
UiPath Platform: The Backend Engine Powering Your Automation - Session 1UiPath Platform: The Backend Engine Powering Your Automation - Session 1
UiPath Platform: The Backend Engine Powering Your Automation - Session 1DianaGray10
 
Using IESVE for Loads, Sizing and Heat Pump Modeling to Achieve Decarbonization
Using IESVE for Loads, Sizing and Heat Pump Modeling to Achieve DecarbonizationUsing IESVE for Loads, Sizing and Heat Pump Modeling to Achieve Decarbonization
Using IESVE for Loads, Sizing and Heat Pump Modeling to Achieve DecarbonizationIES VE
 
9 Steps For Building Winning Founding Team
9 Steps For Building Winning Founding Team9 Steps For Building Winning Founding Team
9 Steps For Building Winning Founding TeamAdam Moalla
 
COMPUTER 10: Lesson 7 - File Storage and Online Collaboration
COMPUTER 10: Lesson 7 - File Storage and Online CollaborationCOMPUTER 10: Lesson 7 - File Storage and Online Collaboration
COMPUTER 10: Lesson 7 - File Storage and Online Collaborationbruanjhuli
 

Recently uploaded (20)

Anypoint Code Builder , Google Pub sub connector and MuleSoft RPA
Anypoint Code Builder , Google Pub sub connector and MuleSoft RPAAnypoint Code Builder , Google Pub sub connector and MuleSoft RPA
Anypoint Code Builder , Google Pub sub connector and MuleSoft RPA
 
Computer 10: Lesson 10 - Online Crimes and Hazards
Computer 10: Lesson 10 - Online Crimes and HazardsComputer 10: Lesson 10 - Online Crimes and Hazards
Computer 10: Lesson 10 - Online Crimes and Hazards
 
IaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdf
IaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdfIaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdf
IaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdf
 
Empowering Africa's Next Generation: The AI Leadership Blueprint
Empowering Africa's Next Generation: The AI Leadership BlueprintEmpowering Africa's Next Generation: The AI Leadership Blueprint
Empowering Africa's Next Generation: The AI Leadership Blueprint
 
UiPath Studio Web workshop series - Day 7
UiPath Studio Web workshop series - Day 7UiPath Studio Web workshop series - Day 7
UiPath Studio Web workshop series - Day 7
 
Crea il tuo assistente AI con lo Stregatto (open source python framework)
Crea il tuo assistente AI con lo Stregatto (open source python framework)Crea il tuo assistente AI con lo Stregatto (open source python framework)
Crea il tuo assistente AI con lo Stregatto (open source python framework)
 
VoIP Service and Marketing using Odoo and Asterisk PBX
VoIP Service and Marketing using Odoo and Asterisk PBXVoIP Service and Marketing using Odoo and Asterisk PBX
VoIP Service and Marketing using Odoo and Asterisk PBX
 
Comparing Sidecar-less Service Mesh from Cilium and Istio
Comparing Sidecar-less Service Mesh from Cilium and IstioComparing Sidecar-less Service Mesh from Cilium and Istio
Comparing Sidecar-less Service Mesh from Cilium and Istio
 
Igniting Next Level Productivity with AI-Infused Data Integration Workflows
Igniting Next Level Productivity with AI-Infused Data Integration WorkflowsIgniting Next Level Productivity with AI-Infused Data Integration Workflows
Igniting Next Level Productivity with AI-Infused Data Integration Workflows
 
Designing A Time bound resource download URL
Designing A Time bound resource download URLDesigning A Time bound resource download URL
Designing A Time bound resource download URL
 
AI Fame Rush Review – Virtual Influencer Creation In Just Minutes
AI Fame Rush Review – Virtual Influencer Creation In Just MinutesAI Fame Rush Review – Virtual Influencer Creation In Just Minutes
AI Fame Rush Review – Virtual Influencer Creation In Just Minutes
 
COMPUTER 10 Lesson 8 - Building a Website
COMPUTER 10 Lesson 8 - Building a WebsiteCOMPUTER 10 Lesson 8 - Building a Website
COMPUTER 10 Lesson 8 - Building a Website
 
NIST Cybersecurity Framework (CSF) 2.0 Workshop
NIST Cybersecurity Framework (CSF) 2.0 WorkshopNIST Cybersecurity Framework (CSF) 2.0 Workshop
NIST Cybersecurity Framework (CSF) 2.0 Workshop
 
20230202 - Introduction to tis-py
20230202 - Introduction to tis-py20230202 - Introduction to tis-py
20230202 - Introduction to tis-py
 
Videogame localization & technology_ how to enhance the power of translation.pdf
Videogame localization & technology_ how to enhance the power of translation.pdfVideogame localization & technology_ how to enhance the power of translation.pdf
Videogame localization & technology_ how to enhance the power of translation.pdf
 
UiPath Platform: The Backend Engine Powering Your Automation - Session 1
UiPath Platform: The Backend Engine Powering Your Automation - Session 1UiPath Platform: The Backend Engine Powering Your Automation - Session 1
UiPath Platform: The Backend Engine Powering Your Automation - Session 1
 
Using IESVE for Loads, Sizing and Heat Pump Modeling to Achieve Decarbonization
Using IESVE for Loads, Sizing and Heat Pump Modeling to Achieve DecarbonizationUsing IESVE for Loads, Sizing and Heat Pump Modeling to Achieve Decarbonization
Using IESVE for Loads, Sizing and Heat Pump Modeling to Achieve Decarbonization
 
201610817 - edge part1
201610817 - edge part1201610817 - edge part1
201610817 - edge part1
 
9 Steps For Building Winning Founding Team
9 Steps For Building Winning Founding Team9 Steps For Building Winning Founding Team
9 Steps For Building Winning Founding Team
 
COMPUTER 10: Lesson 7 - File Storage and Online Collaboration
COMPUTER 10: Lesson 7 - File Storage and Online CollaborationCOMPUTER 10: Lesson 7 - File Storage and Online Collaboration
COMPUTER 10: Lesson 7 - File Storage and Online Collaboration
 

Performance Update: When Apache ORC Met Apache Spark

  • 1. PERFORMANCE UPDATE: WHEN APACHE ORC MET APACHE SPARK Dongjoon Hyun - dhyun@hortonworks.com Owen O'Malley - owen@hortonworks.com @owen_omalley
  • 2. 2 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Agenda Performance Update Spark – Apache ORC 1.4 Integration Benchmark Overview Results Roadmap
  • 3. 3 © Hortonworks Inc. 2011 – 2016. All Rights Reserved ORC History  Originally released as part of Hive – Released in Hive 0.11.0 (2013-05-16) – Included in each release before Hive 2.3.0 (2017-07-17)  Factored out of Hive – Improve integration with other tools – Shrink the size of the dependencies – Releases faster than Hive – Added C++ reader and new C++ writer
  • 4. 4 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Integration History  Before Apache ORC – Hive 1.2.1 (2015-06-27)  SPARK-2883 Since Spark 1.4, Hive ORC is used.  After Apache ORC – v1.0.0 (2016-01-25) – v1.1.0 (2016-06-10) – v1.2.0 (2016-08-25) – v1.3.0 (2017-01-23) – v1.4.0 (2017-05-08)  SPARK-21422 For Spark 2.3, Apache ORC dependency is added and used.
  • 5. 5 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Integration Design (in HDP 2.6.3 with Spark 2.2)  Switch between old and new formats by configuration FileFormat TextBasedFileFormat ParquetFileFormat OrcFileFormat HiveFileFormat JsonFileFormat LibSVMFileFormat CSVFileFormat TextFileFormat o.a.s.sql.execution.datasources o.a.s.ml.source.libsvmo.a.s.sql.hive.orc OrcFileFormat Old OrcFileFormat from Hive 1.2.1 (SPARK-2883) New OrcFileFormat with ORC 1.4.0 (SPARK-21422, SPARK-20682, SPARK-20728)
  • 6. 6 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Benefit in Apache Spark  Speed – Use both Spark ColumnarBatch and ORC RowBatch together more seamlessly  Stability – Apache ORC 1.4.0 has many fixes and we can depend on ORC community more.  Usability – User can use ORC data sources without hive module, i.e, -Phive.  Maintainability – Reduce the Hive dependency and can remove old legacy code later.
  • 7. 7 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Faster and More Scalable Number of columns Time (ms) Single Column Scan from Wide Tables (1M rows with all BIGINT columns) 0 500 1000 1500 2000 TINYINT SMALLINT INT BIGINT FLOAT DOULBE OLD NEW Vectorized Read (15M rows in a single-column table) Time (ms) 0 200 400 600 800 100 200 300 NEW OLD
  • 8. 8 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Support Matrix  HDP 2.6.3 is going to add a new faster and stable ORC file format for ORC tables with the subset of limitations of Apache Spark 2.2.  The followings are not supported yet – Zero-byte ORC File – Schema Evolution • Adding columns at the end • Changing types and deleting columns  Please see the full JIRA issue list in next slides.
  • 9. 9 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Done Tickets  SPARK-20901 Feature parity for ORC with Parquet – The parent ticket of all ORC issues  Done – SPARK-20566 ColumnVector should support `appendFloats` for array – SPARK-21422 Depend on Apache ORC 1.4.0 – SPARK-21831 Remove `spark.sql.hive.convertMetastoreOrc` config in HiveCompatibilitySuite – SPARK-21839 Support SQL config for ORC compression – SPARK-21884 Fix StackOverflowError on MetadataOnlyQuery – SPARK-21912 ORC/Parquet table should not create invalid column names
  • 10. 10 © Hortonworks Inc. 2011 – 2016. All Rights Reserved On-Going Tickets  SPARK-20682 Support a new faster ORC data source based on Apache ORC  SPARK-20728 Make ORCFileFormat configurable between sql/hive and sql/core  SPARK-16060 Vectorized Orc reader  SPARK-21791 ORC should support column names with dot  SPARK-21787 Support for pushing down filters for DATE types in ORC  SPARK-19809 Zero byte ORC file support  SPARK-14387 Enable Hive-1.x ORC compatibility with spark.sql.hive.convertMetastoreOrc
  • 11. 11 © Hortonworks Inc. 2011 – 2016. All Rights Reserved To-do Tickets  Configuration – SPARK-21783 Turn on ORC filter push-down by default  Read – SPARK-11412 Support merge schema for ORC – SPARK-16628 OrcConversions should not convert an ORC table  Alter – SPARK-21929 Support `ALTER TABLE ADD COLUMNS(..)` for ORC data source – SPARK-18355 Spark SQL fails to read from a ORC table with new column  Write – SPARK-12417 Orc bloom filter options are not propagated during file write
  • 12. 12 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Agenda Performance Update Spark – Apache ORC 1.4 Integration Benchmark Overview Results Roadmap
  • 13. 13 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Benchmark Overview • Benchmark Objective • Compare performance of New ORC vs Old ORC in Spark • With TPC-DS at different scales 1 TB, 10 TB
  • 14. 14 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Enabling New ORC in Spark  spark.sql.hive.convertMetastoreOrc=true  spark.sql.orc.enabled=true
  • 15. 15 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Software/Hardware/Cluster Details  Software – Internal branch of Spark 2.2 (in HDP)  Node Configuration – 32 CPU, Intel E5-2640, 2.00GHz – 256 GB RAM – 6 SATA Disks (~4 TB, 7200 rpm) – 10 Gbps network card  Cluster – 10 nodes used for 1 TB – 15 nodes used for 10 TB – Double type used in data – TPC-DS (1.4) queries in Spark used for benchmarking
  • 16. 16 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Spark Tuning Spark Parameter Name Default Value Tuned Value spark.sql.hive.convertMetastoreOrc false true spark.sql.orc.enabled false true spark.sql.orc.filterPushdown false true spark.sql.statistics.fallBackToHdfs false true spark.sql.autoBroadcastJoinThreshold 10L * 1024 * 1024 (10MB) 26214400 spark.shuffle.io.numConnectionsPerPeer 1 10 spark.io.compression.lz4.blockSize 32k 128kb spark.sql.shuffle.partitions 200 300 spark.network.timeout 120s 600s spark.locality.wait 3s 0s *Hive Parameter Name (hive-site.xml) Default Value Tuned Value hive.exec.max.dynamic.partitions 1000 10000 hive.exec.dynamic.partition.mode strict nonstrict *mainly for data ingestion *all tables analyzed with `noscan` option for getting basic stats
  • 17. 17 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Agenda Performance Update Spark – Apache ORC 1.4 Integration Benchmark Overview Results Roadmap
  • 18. 18 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Results : 10 TB Scale (15 executors, 170 GB, 25 cores) * Total runtime of 74 queries in TPC-DS - New ORC is > 2x better than old ORC
  • 19. 19 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Results : 10 TB Scale (15 executors, 170 GB, 25 cores) New ORC vs Old ORC New ORC consistently outperforms old ORC in all queries significantly
  • 20. 20 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Results : 1 TB Scale (10 executors, 170 GB, 25 cores) * Total runtime of 97 queries in TPC-DS @ 1 TB scale • New ORC is ~2x faster than old ORC
  • 21. 21 © Hortonworks Inc. 2011 – 2016. All Rights Reserved *Profiler shows high CPU usage with libzip.so during ORC reads. ORC-175 can be used for improving this further *ORC-175 (Intel ISAL: intelligent storage acceleration libraries for inflate) can be useful here
  • 22. 22 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Comparison of New ORC vs Old ORC vs Parquet (10 TB Scale) * Total runtime of 74 queries in TPC-DS * 10 TB Scale (15 executors, 170 GB, 25 cores)
  • 23. 23 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Agenda Performance Update Spark – Apache ORC 1.4 Integration Benchmark Overview Results Roadmap
  • 24. 24 © Hortonworks Inc. 2011 – 2016. All Rights Reserved RoadMap  Tune default ambari configs for spark based on these benchmark results to get better OOTB experience for end users  Additional enhancements like parallel reading of footers in ORC  Include ORC-175 (Intel ISAL) when complete  Contribute back the fixes back to community
  • 25. 25 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Thanks!  Questions – For ORC – dev@@orc.apache.org – For Spark – dev@spark.apache.org – Benchmarks – dhyun@hortonworks.com
  • 26. 26 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Results : 10 TB Scale (15 executors, 170 GB, 25 cores) Comparison of New ORC vs Parquet