SlideShare a Scribd company logo
1 of 32
Download to read offline
Optimizing S3 Write-heavy Spark workloads
Apache Spark meetup, Qubole office, Bangalore
3rd March 2018
bharatb@qubole.com
Senior Engineering Manager, Spark team, Qubole
Context
● Cloud, object storage, ephemeral clusters, large writes
● df.write().save()
● df.write().saveAsTable()
● spark.sql(“INSERT INTO ….”)
● spark.sql(“INSERT OVERWRITE …”)
● spark.sql(“ALTER TABLE RECOVER PARTITIONS”)
● Changes in spark/hive itself rather than in user programs
Agenda
● Problems with S3 writes
● Spark writes
● Faster hive writes, iteration 1
● Faster hive writes, iteration 2
● Fault tolerant DFOC
● Faster recover partitions
Part 1: Problems with S3 writes
Problems when writing to S3: EC
- Eventual consistency problems
- HEAD (404) -> PUT -> GET
- PUT -> PUT -> GET
- PUT -> DELETE -> LIST-PARENT
Problems when writing to S3: Rename
Operation:
Rename
s3://bucket/x to
s3://bucket/y
Copy x to y Delete x
- Copy is slow and depends on file size
- Two calls needed
Problems when writing to S3: Failures
- Transient failures of S3 rest calls
- Throttling
Part 2: Spark writes
Two kinds of tables
Write
Hive table
Datasource
table
Distributed
write to hive
staging dir
Hive.loadTable /
Hive.loadPartition
called to move data
to warehouse
Distributed
write to final
dest
Part 3: Faster hive table writes, iteration 1
Problem: loadPartition
Write
Hive table
Datasource
table
Distributed
write to hive
staging dir
Hive.loadTable /
Hive.loadPartition
called to move data
to warehouse
Distributed
write to final
dest
Problem: loadPartition is slow
- Hive.replaceFiles / Hive.copyFiles primitive is used to move
data from hive staging dir to warehouse dir
- Rename done in the hive operations is slow and serialized
- No retries to account for transient failures
Problem: loadPartition has EC issues
- EC issues during the copy/move
- Few files written to the hive staging directory may not appear in
the listing done on the driver during Hive.replaceFiles
- Few files deleted may appear in the listing (especially in FOC v1
case)
Solution
- Parallelize Hive.copyFiles/Hive.replaceFiles
- Changes in hive codebase
- Algorithm:
- listFiles(src); renameFiles(src, dest)
- loop-until-no-change
- listFiles(src)
- renameFiles(src, dest)
Solution: Robustness
- Listing related
- diff(oldListing, newListing)
- if new files appear, rename them in this iteration
- if existing files disappear, dont try to rename them
- Rename related
- if rename failed, try to rename them in next iteration
Solution: Performance
- Rename in parallel in a threadpool of 128 threads
- For INSERT INTO, find the N to use for file_copy_N, for all files
in the dest dir, in one shot
- Rename the biggest files to be first so that they don’t become
the long pole
- Rename the recently modified files last (FIFO on time) so that
they get time to vanish
Solution: Performance numbers
- INSERT OVERWRITE TABLE user PARTITION(date="2011")
SELECT userId, firstName, email FROM people
- For example, 100GB data spread over 10000 files
- Before optimization: 110 mins
- After optimization: 12 mins (not sensitive to file count)
Part 4: Faster hive table writes, iteration 2
Problem
- Can we write directly to the hive warehouse folder?
- Avoid hive staging-dir ?
Solution
Write
Hive table
Datasource
table
Distributed
write to hive
staging dir
Hive.loadTable /
Hive.loadPartition
called to move data
to warehouse
Distributed
write to final
dest
Solution: Algorithm
- InsertIntoHiveTable.run()
- if (useDirectWrites)
- InsertIntoHadoopFsRelationCommand
- FileFormatWriter.write(fileFormat)
- else // existing code
- FileFormatWriter.write(HiveFileFormat)
- Hive.loadTable / Hive.loadPartition
Solution: Write directly to the warehouse
- Use spark’s default write flow for hive tables also
- Avoid using staging_dir
- Uses whatever OutputCommitter which is active
- Changes in spark code base
- Cases: INSERT INTO/OVERWRITE + Static/dynamic partitions
- Except INSERT OVERWRITE involving dynamic partitions
- Con: Affects warehouse directory immediately on job start
Solution: Write directly to the warehouse
- Very good performance gains
- Hive.loadTable / Hive.loadPartition not needed
- Error recovery needs be done carefully
- On failure, delete all files s3://bucket/path/*/*/*<jobId>*
Solution: Performance
- Data: 142 GB (Records - 149994000, Partitions - 9000)
- Each partition had one file
- Direct writes disabled: 7 hr, 30 min
- Direct writes enabled: 24.5 mins
- Spark distributed write was fast in both cases. In the first case
extra move was needed.
Part 5: Fault tolerant DFOC
DirectFileOutputCommitter (DFOC)
- Directly write to output location
- Pros: No EC, high performance
- Cons: Speculation and task retries will fail
- Cons: Output is visible before job finish
Problem
- If you use DFOC, any task failure will cause job failure
- Empty S3 file is created even on task failure
- Retry will always fail with FileAlreadyExistsException
- 7/08/16 00:33:55 task-result-getter-1 WARN TaskSetManager: Lost task 0.1 in stage 42.0 (TID 5782, 10.23.7.190, executor 10):
org.apache.hadoop.fs.FileAlreadyExistsException:
s3n://bucket/path/2017/08/15/23/part-00000-017681ee-5206-4163-b4a9-a29cf8a67ab4.json.gz already exists
Solution: Overwrite if file already exists
- fs.create(path, false) -> fs.create(path, true)
- Spark changes - different across versions
- Hive changes - orc
- Parquet changes
Part 6: Faster recover partitions
Problem:
- alter table recover partitions is slow
- Algorithm
- Generate list of all partitions and their statistics
- Add partitions to metastore
- Example: Two partition keys, 100 values each, 10k partitions in
total - takes close to (10+20) mins to recover partitions (spark
2.1.0)
Solution
- Use faster variant of S3 listing, prefix based
- 10 mins for gathering partitions and stats reduced to 10 secs
- Now total time is (10 secs + 20 mins), 33% improvement
- Spark only changes
Thank you - Q&A
- rohitk@qubole.com
- prakharj@qubole.com
- bharatb@qubole.com

More Related Content

What's hot

Optimizing Apache Spark SQL Joins
Optimizing Apache Spark SQL JoinsOptimizing Apache Spark SQL Joins
Optimizing Apache Spark SQL JoinsDatabricks
 
Designing Data-Intensive Applications_ The Big Ideas Behind Reliable, Scalabl...
Designing Data-Intensive Applications_ The Big Ideas Behind Reliable, Scalabl...Designing Data-Intensive Applications_ The Big Ideas Behind Reliable, Scalabl...
Designing Data-Intensive Applications_ The Big Ideas Behind Reliable, Scalabl...SindhuVasireddy1
 
Parallelizing with Apache Spark in Unexpected Ways
Parallelizing with Apache Spark in Unexpected WaysParallelizing with Apache Spark in Unexpected Ways
Parallelizing with Apache Spark in Unexpected WaysDatabricks
 
Hive Bucketing in Apache Spark with Tejas Patil
Hive Bucketing in Apache Spark with Tejas PatilHive Bucketing in Apache Spark with Tejas Patil
Hive Bucketing in Apache Spark with Tejas PatilDatabricks
 
Making Apache Spark Better with Delta Lake
Making Apache Spark Better with Delta LakeMaking Apache Spark Better with Delta Lake
Making Apache Spark Better with Delta LakeDatabricks
 
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Databricks
 
Large Scale Lakehouse Implementation Using Structured Streaming
Large Scale Lakehouse Implementation Using Structured StreamingLarge Scale Lakehouse Implementation Using Structured Streaming
Large Scale Lakehouse Implementation Using Structured StreamingDatabricks
 
Delta lake and the delta architecture
Delta lake and the delta architectureDelta lake and the delta architecture
Delta lake and the delta architectureAdam Doyle
 
Hive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep DiveHive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep DiveDataWorks Summit
 
Apache Kudu: Technical Deep Dive


Apache Kudu: Technical Deep Dive

Apache Kudu: Technical Deep Dive


Apache Kudu: Technical Deep Dive

Cloudera, Inc.
 
Apache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & InternalsApache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & InternalsAnton Kirillov
 
Spark with Delta Lake
Spark with Delta LakeSpark with Delta Lake
Spark with Delta LakeKnoldus Inc.
 
Photon Technical Deep Dive: How to Think Vectorized
Photon Technical Deep Dive: How to Think VectorizedPhoton Technical Deep Dive: How to Think Vectorized
Photon Technical Deep Dive: How to Think VectorizedDatabricks
 
PySpark Programming | PySpark Concepts with Hands-On | PySpark Training | Edu...
PySpark Programming | PySpark Concepts with Hands-On | PySpark Training | Edu...PySpark Programming | PySpark Concepts with Hands-On | PySpark Training | Edu...
PySpark Programming | PySpark Concepts with Hands-On | PySpark Training | Edu...Edureka!
 
The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Ana...
The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Ana...The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Ana...
The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Ana...DataWorks Summit/Hadoop Summit
 
Radical Speed for SQL Queries on Databricks: Photon Under the Hood
Radical Speed for SQL Queries on Databricks: Photon Under the HoodRadical Speed for SQL Queries on Databricks: Photon Under the Hood
Radical Speed for SQL Queries on Databricks: Photon Under the HoodDatabricks
 
Fine Tuning and Enhancing Performance of Apache Spark Jobs
Fine Tuning and Enhancing Performance of Apache Spark JobsFine Tuning and Enhancing Performance of Apache Spark Jobs
Fine Tuning and Enhancing Performance of Apache Spark JobsDatabricks
 
Memory Management in Apache Spark
Memory Management in Apache SparkMemory Management in Apache Spark
Memory Management in Apache SparkDatabricks
 
Common Strategies for Improving Performance on Your Delta Lakehouse
Common Strategies for Improving Performance on Your Delta LakehouseCommon Strategies for Improving Performance on Your Delta Lakehouse
Common Strategies for Improving Performance on Your Delta LakehouseDatabricks
 

What's hot (20)

Optimizing Apache Spark SQL Joins
Optimizing Apache Spark SQL JoinsOptimizing Apache Spark SQL Joins
Optimizing Apache Spark SQL Joins
 
Designing Data-Intensive Applications_ The Big Ideas Behind Reliable, Scalabl...
Designing Data-Intensive Applications_ The Big Ideas Behind Reliable, Scalabl...Designing Data-Intensive Applications_ The Big Ideas Behind Reliable, Scalabl...
Designing Data-Intensive Applications_ The Big Ideas Behind Reliable, Scalabl...
 
Parallelizing with Apache Spark in Unexpected Ways
Parallelizing with Apache Spark in Unexpected WaysParallelizing with Apache Spark in Unexpected Ways
Parallelizing with Apache Spark in Unexpected Ways
 
Hive Bucketing in Apache Spark with Tejas Patil
Hive Bucketing in Apache Spark with Tejas PatilHive Bucketing in Apache Spark with Tejas Patil
Hive Bucketing in Apache Spark with Tejas Patil
 
Making Apache Spark Better with Delta Lake
Making Apache Spark Better with Delta LakeMaking Apache Spark Better with Delta Lake
Making Apache Spark Better with Delta Lake
 
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
 
Large Scale Lakehouse Implementation Using Structured Streaming
Large Scale Lakehouse Implementation Using Structured StreamingLarge Scale Lakehouse Implementation Using Structured Streaming
Large Scale Lakehouse Implementation Using Structured Streaming
 
Delta lake and the delta architecture
Delta lake and the delta architectureDelta lake and the delta architecture
Delta lake and the delta architecture
 
Hive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep DiveHive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep Dive
 
Apache Kudu: Technical Deep Dive


Apache Kudu: Technical Deep Dive

Apache Kudu: Technical Deep Dive


Apache Kudu: Technical Deep Dive


 
Apache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & InternalsApache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & Internals
 
Spark with Delta Lake
Spark with Delta LakeSpark with Delta Lake
Spark with Delta Lake
 
Photon Technical Deep Dive: How to Think Vectorized
Photon Technical Deep Dive: How to Think VectorizedPhoton Technical Deep Dive: How to Think Vectorized
Photon Technical Deep Dive: How to Think Vectorized
 
Apache Spark Architecture
Apache Spark ArchitectureApache Spark Architecture
Apache Spark Architecture
 
PySpark Programming | PySpark Concepts with Hands-On | PySpark Training | Edu...
PySpark Programming | PySpark Concepts with Hands-On | PySpark Training | Edu...PySpark Programming | PySpark Concepts with Hands-On | PySpark Training | Edu...
PySpark Programming | PySpark Concepts with Hands-On | PySpark Training | Edu...
 
The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Ana...
The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Ana...The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Ana...
The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Ana...
 
Radical Speed for SQL Queries on Databricks: Photon Under the Hood
Radical Speed for SQL Queries on Databricks: Photon Under the HoodRadical Speed for SQL Queries on Databricks: Photon Under the Hood
Radical Speed for SQL Queries on Databricks: Photon Under the Hood
 
Fine Tuning and Enhancing Performance of Apache Spark Jobs
Fine Tuning and Enhancing Performance of Apache Spark JobsFine Tuning and Enhancing Performance of Apache Spark Jobs
Fine Tuning and Enhancing Performance of Apache Spark Jobs
 
Memory Management in Apache Spark
Memory Management in Apache SparkMemory Management in Apache Spark
Memory Management in Apache Spark
 
Common Strategies for Improving Performance on Your Delta Lakehouse
Common Strategies for Improving Performance on Your Delta LakehouseCommon Strategies for Improving Performance on Your Delta Lakehouse
Common Strategies for Improving Performance on Your Delta Lakehouse
 

Similar to Optimizing S3 Write-heavy Spark workloads

Managing ADLS gen2 using Apache Spark
Managing ADLS gen2 using Apache SparkManaging ADLS gen2 using Apache Spark
Managing ADLS gen2 using Apache SparkDatabricks
 
iceberg introduction.pptx
iceberg introduction.pptxiceberg introduction.pptx
iceberg introduction.pptxDori Waldman
 
Robust and Scalable ETL over Cloud Storage with Apache Spark
Robust and Scalable ETL over Cloud Storage with Apache SparkRobust and Scalable ETL over Cloud Storage with Apache Spark
Robust and Scalable ETL over Cloud Storage with Apache SparkDatabricks
 
Understanding Spark Tuning: Strata New York
Understanding Spark Tuning: Strata New YorkUnderstanding Spark Tuning: Strata New York
Understanding Spark Tuning: Strata New YorkRachel Warren
 
Spark Autotuning Talk - Strata New York
Spark Autotuning Talk - Strata New YorkSpark Autotuning Talk - Strata New York
Spark Autotuning Talk - Strata New YorkHolden Karau
 
Spark Autotuning - Strata EU 2018
Spark Autotuning - Strata EU 2018Spark Autotuning - Strata EU 2018
Spark Autotuning - Strata EU 2018Holden Karau
 
Spark autotuning talk final
Spark autotuning talk finalSpark autotuning talk final
Spark autotuning talk finalRachel Warren
 
Hoodie: Incremental processing on hadoop
Hoodie: Incremental processing on hadoopHoodie: Incremental processing on hadoop
Hoodie: Incremental processing on hadoopPrasanna Rajaperumal
 
Understanding and building big data Architectures - NoSQL
Understanding and building big data Architectures - NoSQLUnderstanding and building big data Architectures - NoSQL
Understanding and building big data Architectures - NoSQLHyderabad Scalability Meetup
 
FireWorks workflow software
FireWorks workflow softwareFireWorks workflow software
FireWorks workflow softwareAnubhav Jain
 
Speed it up and Spark it up at Intel
Speed it up and Spark it up at IntelSpeed it up and Spark it up at Intel
Speed it up and Spark it up at IntelDataWorks Summit
 
SF Big Analytics 20190612: Building highly efficient data lakes using Apache ...
SF Big Analytics 20190612: Building highly efficient data lakes using Apache ...SF Big Analytics 20190612: Building highly efficient data lakes using Apache ...
SF Big Analytics 20190612: Building highly efficient data lakes using Apache ...Chester Chen
 
Speed up large-scale ML/DL offline inference job with Alluxio
Speed up large-scale ML/DL offline inference job with AlluxioSpeed up large-scale ML/DL offline inference job with Alluxio
Speed up large-scale ML/DL offline inference job with AlluxioAlluxio, Inc.
 
Yaetos_Meetup_SparkBCN_v1.pdf
Yaetos_Meetup_SparkBCN_v1.pdfYaetos_Meetup_SparkBCN_v1.pdf
Yaetos_Meetup_SparkBCN_v1.pdfprevota
 
Spark Summit EU talk by Qifan Pu
Spark Summit EU talk by Qifan PuSpark Summit EU talk by Qifan Pu
Spark Summit EU talk by Qifan PuSpark Summit
 
Architecting a 35 PB distributed parallel file system for science
Architecting a 35 PB distributed parallel file system for scienceArchitecting a 35 PB distributed parallel file system for science
Architecting a 35 PB distributed parallel file system for scienceSpeck&Tech
 
Using apache spark for processing trillions of records each day at Datadog
Using apache spark for processing trillions of records each day at DatadogUsing apache spark for processing trillions of records each day at Datadog
Using apache spark for processing trillions of records each day at DatadogVadim Semenov
 

Similar to Optimizing S3 Write-heavy Spark workloads (20)

Managing ADLS gen2 using Apache Spark
Managing ADLS gen2 using Apache SparkManaging ADLS gen2 using Apache Spark
Managing ADLS gen2 using Apache Spark
 
iceberg introduction.pptx
iceberg introduction.pptxiceberg introduction.pptx
iceberg introduction.pptx
 
Robust and Scalable ETL over Cloud Storage with Apache Spark
Robust and Scalable ETL over Cloud Storage with Apache SparkRobust and Scalable ETL over Cloud Storage with Apache Spark
Robust and Scalable ETL over Cloud Storage with Apache Spark
 
Understanding Spark Tuning: Strata New York
Understanding Spark Tuning: Strata New YorkUnderstanding Spark Tuning: Strata New York
Understanding Spark Tuning: Strata New York
 
Spark Autotuning Talk - Strata New York
Spark Autotuning Talk - Strata New YorkSpark Autotuning Talk - Strata New York
Spark Autotuning Talk - Strata New York
 
Spark Autotuning - Strata EU 2018
Spark Autotuning - Strata EU 2018Spark Autotuning - Strata EU 2018
Spark Autotuning - Strata EU 2018
 
Spark autotuning talk final
Spark autotuning talk finalSpark autotuning talk final
Spark autotuning talk final
 
Hoodie: Incremental processing on hadoop
Hoodie: Incremental processing on hadoopHoodie: Incremental processing on hadoop
Hoodie: Incremental processing on hadoop
 
Understanding and building big data Architectures - NoSQL
Understanding and building big data Architectures - NoSQLUnderstanding and building big data Architectures - NoSQL
Understanding and building big data Architectures - NoSQL
 
FireWorks workflow software
FireWorks workflow softwareFireWorks workflow software
FireWorks workflow software
 
Speed it up and Spark it up at Intel
Speed it up and Spark it up at IntelSpeed it up and Spark it up at Intel
Speed it up and Spark it up at Intel
 
SF Big Analytics 20190612: Building highly efficient data lakes using Apache ...
SF Big Analytics 20190612: Building highly efficient data lakes using Apache ...SF Big Analytics 20190612: Building highly efficient data lakes using Apache ...
SF Big Analytics 20190612: Building highly efficient data lakes using Apache ...
 
Speed up large-scale ML/DL offline inference job with Alluxio
Speed up large-scale ML/DL offline inference job with AlluxioSpeed up large-scale ML/DL offline inference job with Alluxio
Speed up large-scale ML/DL offline inference job with Alluxio
 
Yaetos_Meetup_SparkBCN_v1.pdf
Yaetos_Meetup_SparkBCN_v1.pdfYaetos_Meetup_SparkBCN_v1.pdf
Yaetos_Meetup_SparkBCN_v1.pdf
 
Spark 101
Spark 101Spark 101
Spark 101
 
Spark Summit EU talk by Qifan Pu
Spark Summit EU talk by Qifan PuSpark Summit EU talk by Qifan Pu
Spark Summit EU talk by Qifan Pu
 
Oracle11g notes
Oracle11g notesOracle11g notes
Oracle11g notes
 
Architecting a 35 PB distributed parallel file system for science
Architecting a 35 PB distributed parallel file system for scienceArchitecting a 35 PB distributed parallel file system for science
Architecting a 35 PB distributed parallel file system for science
 
Handout3o
Handout3oHandout3o
Handout3o
 
Using apache spark for processing trillions of records each day at Datadog
Using apache spark for processing trillions of records each day at DatadogUsing apache spark for processing trillions of records each day at Datadog
Using apache spark for processing trillions of records each day at Datadog
 

More from datamantra

Multi Source Data Analysis using Spark and Tellius
Multi Source Data Analysis using Spark and TelliusMulti Source Data Analysis using Spark and Tellius
Multi Source Data Analysis using Spark and Telliusdatamantra
 
State management in Structured Streaming
State management in Structured StreamingState management in Structured Streaming
State management in Structured Streamingdatamantra
 
Spark on Kubernetes
Spark on KubernetesSpark on Kubernetes
Spark on Kubernetesdatamantra
 
Understanding transactional writes in datasource v2
Understanding transactional writes in  datasource v2Understanding transactional writes in  datasource v2
Understanding transactional writes in datasource v2datamantra
 
Introduction to Datasource V2 API
Introduction to Datasource V2 APIIntroduction to Datasource V2 API
Introduction to Datasource V2 APIdatamantra
 
Exploratory Data Analysis in Spark
Exploratory Data Analysis in SparkExploratory Data Analysis in Spark
Exploratory Data Analysis in Sparkdatamantra
 
Core Services behind Spark Job Execution
Core Services behind Spark Job ExecutionCore Services behind Spark Job Execution
Core Services behind Spark Job Executiondatamantra
 
Structured Streaming with Kafka
Structured Streaming with KafkaStructured Streaming with Kafka
Structured Streaming with Kafkadatamantra
 
Understanding time in structured streaming
Understanding time in structured streamingUnderstanding time in structured streaming
Understanding time in structured streamingdatamantra
 
Spark stack for Model life-cycle management
Spark stack for Model life-cycle managementSpark stack for Model life-cycle management
Spark stack for Model life-cycle managementdatamantra
 
Productionalizing Spark ML
Productionalizing Spark MLProductionalizing Spark ML
Productionalizing Spark MLdatamantra
 
Introduction to Structured streaming
Introduction to Structured streamingIntroduction to Structured streaming
Introduction to Structured streamingdatamantra
 
Building real time Data Pipeline using Spark Streaming
Building real time Data Pipeline using Spark StreamingBuilding real time Data Pipeline using Spark Streaming
Building real time Data Pipeline using Spark Streamingdatamantra
 
Testing Spark and Scala
Testing Spark and ScalaTesting Spark and Scala
Testing Spark and Scaladatamantra
 
Understanding Implicits in Scala
Understanding Implicits in ScalaUnderstanding Implicits in Scala
Understanding Implicits in Scaladatamantra
 
Migrating to Spark 2.0 - Part 2
Migrating to Spark 2.0 - Part 2Migrating to Spark 2.0 - Part 2
Migrating to Spark 2.0 - Part 2datamantra
 
Migrating to spark 2.0
Migrating to spark 2.0Migrating to spark 2.0
Migrating to spark 2.0datamantra
 
Scalable Spark deployment using Kubernetes
Scalable Spark deployment using KubernetesScalable Spark deployment using Kubernetes
Scalable Spark deployment using Kubernetesdatamantra
 
Introduction to concurrent programming with akka actors
Introduction to concurrent programming with akka actorsIntroduction to concurrent programming with akka actors
Introduction to concurrent programming with akka actorsdatamantra
 
Functional programming in Scala
Functional programming in ScalaFunctional programming in Scala
Functional programming in Scaladatamantra
 

More from datamantra (20)

Multi Source Data Analysis using Spark and Tellius
Multi Source Data Analysis using Spark and TelliusMulti Source Data Analysis using Spark and Tellius
Multi Source Data Analysis using Spark and Tellius
 
State management in Structured Streaming
State management in Structured StreamingState management in Structured Streaming
State management in Structured Streaming
 
Spark on Kubernetes
Spark on KubernetesSpark on Kubernetes
Spark on Kubernetes
 
Understanding transactional writes in datasource v2
Understanding transactional writes in  datasource v2Understanding transactional writes in  datasource v2
Understanding transactional writes in datasource v2
 
Introduction to Datasource V2 API
Introduction to Datasource V2 APIIntroduction to Datasource V2 API
Introduction to Datasource V2 API
 
Exploratory Data Analysis in Spark
Exploratory Data Analysis in SparkExploratory Data Analysis in Spark
Exploratory Data Analysis in Spark
 
Core Services behind Spark Job Execution
Core Services behind Spark Job ExecutionCore Services behind Spark Job Execution
Core Services behind Spark Job Execution
 
Structured Streaming with Kafka
Structured Streaming with KafkaStructured Streaming with Kafka
Structured Streaming with Kafka
 
Understanding time in structured streaming
Understanding time in structured streamingUnderstanding time in structured streaming
Understanding time in structured streaming
 
Spark stack for Model life-cycle management
Spark stack for Model life-cycle managementSpark stack for Model life-cycle management
Spark stack for Model life-cycle management
 
Productionalizing Spark ML
Productionalizing Spark MLProductionalizing Spark ML
Productionalizing Spark ML
 
Introduction to Structured streaming
Introduction to Structured streamingIntroduction to Structured streaming
Introduction to Structured streaming
 
Building real time Data Pipeline using Spark Streaming
Building real time Data Pipeline using Spark StreamingBuilding real time Data Pipeline using Spark Streaming
Building real time Data Pipeline using Spark Streaming
 
Testing Spark and Scala
Testing Spark and ScalaTesting Spark and Scala
Testing Spark and Scala
 
Understanding Implicits in Scala
Understanding Implicits in ScalaUnderstanding Implicits in Scala
Understanding Implicits in Scala
 
Migrating to Spark 2.0 - Part 2
Migrating to Spark 2.0 - Part 2Migrating to Spark 2.0 - Part 2
Migrating to Spark 2.0 - Part 2
 
Migrating to spark 2.0
Migrating to spark 2.0Migrating to spark 2.0
Migrating to spark 2.0
 
Scalable Spark deployment using Kubernetes
Scalable Spark deployment using KubernetesScalable Spark deployment using Kubernetes
Scalable Spark deployment using Kubernetes
 
Introduction to concurrent programming with akka actors
Introduction to concurrent programming with akka actorsIntroduction to concurrent programming with akka actors
Introduction to concurrent programming with akka actors
 
Functional programming in Scala
Functional programming in ScalaFunctional programming in Scala
Functional programming in Scala
 

Recently uploaded

科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理e4aez8ss
 
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degreeyuu sss
 
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...Boston Institute of Analytics
 
Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...Seán Kennedy
 
April 2024 - NLIT Cloudera Real-Time LLM Streaming 2024
April 2024 - NLIT Cloudera Real-Time LLM Streaming 2024April 2024 - NLIT Cloudera Real-Time LLM Streaming 2024
April 2024 - NLIT Cloudera Real-Time LLM Streaming 2024Timothy Spann
 
Semantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxSemantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxMike Bennett
 
LLMs, LMMs, their Improvement Suggestions and the Path towards AGI
LLMs, LMMs, their Improvement Suggestions and the Path towards AGILLMs, LMMs, their Improvement Suggestions and the Path towards AGI
LLMs, LMMs, their Improvement Suggestions and the Path towards AGIThomas Poetter
 
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degreeyuu sss
 
How we prevented account sharing with MFA
How we prevented account sharing with MFAHow we prevented account sharing with MFA
How we prevented account sharing with MFAAndrei Kaleshka
 
Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Seán Kennedy
 
Multiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdfMultiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdfchwongval
 
Predicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdfPredicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdfBoston Institute of Analytics
 
DBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfDBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfJohn Sterrett
 
Vision, Mission, Goals and Objectives ppt..pptx
Vision, Mission, Goals and Objectives ppt..pptxVision, Mission, Goals and Objectives ppt..pptx
Vision, Mission, Goals and Objectives ppt..pptxellehsormae
 
RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.natarajan8993
 
Learn How Data Science Changes Our World
Learn How Data Science Changes Our WorldLearn How Data Science Changes Our World
Learn How Data Science Changes Our WorldEduminds Learning
 
modul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptxmodul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptxaleedritatuxx
 
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)jennyeacort
 
Real-Time AI Streaming - AI Max Princeton
Real-Time AI  Streaming - AI Max PrincetonReal-Time AI  Streaming - AI Max Princeton
Real-Time AI Streaming - AI Max PrincetonTimothy Spann
 

Recently uploaded (20)

科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
 
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
 
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
 
Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...
 
April 2024 - NLIT Cloudera Real-Time LLM Streaming 2024
April 2024 - NLIT Cloudera Real-Time LLM Streaming 2024April 2024 - NLIT Cloudera Real-Time LLM Streaming 2024
April 2024 - NLIT Cloudera Real-Time LLM Streaming 2024
 
Semantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxSemantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptx
 
LLMs, LMMs, their Improvement Suggestions and the Path towards AGI
LLMs, LMMs, their Improvement Suggestions and the Path towards AGILLMs, LMMs, their Improvement Suggestions and the Path towards AGI
LLMs, LMMs, their Improvement Suggestions and the Path towards AGI
 
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
 
How we prevented account sharing with MFA
How we prevented account sharing with MFAHow we prevented account sharing with MFA
How we prevented account sharing with MFA
 
Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...
 
Multiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdfMultiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdf
 
Predicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdfPredicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdf
 
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
 
DBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfDBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdf
 
Vision, Mission, Goals and Objectives ppt..pptx
Vision, Mission, Goals and Objectives ppt..pptxVision, Mission, Goals and Objectives ppt..pptx
Vision, Mission, Goals and Objectives ppt..pptx
 
RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.
 
Learn How Data Science Changes Our World
Learn How Data Science Changes Our WorldLearn How Data Science Changes Our World
Learn How Data Science Changes Our World
 
modul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptxmodul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptx
 
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
 
Real-Time AI Streaming - AI Max Princeton
Real-Time AI  Streaming - AI Max PrincetonReal-Time AI  Streaming - AI Max Princeton
Real-Time AI Streaming - AI Max Princeton
 

Optimizing S3 Write-heavy Spark workloads

  • 1. Optimizing S3 Write-heavy Spark workloads Apache Spark meetup, Qubole office, Bangalore 3rd March 2018 bharatb@qubole.com Senior Engineering Manager, Spark team, Qubole
  • 2. Context ● Cloud, object storage, ephemeral clusters, large writes ● df.write().save() ● df.write().saveAsTable() ● spark.sql(“INSERT INTO ….”) ● spark.sql(“INSERT OVERWRITE …”) ● spark.sql(“ALTER TABLE RECOVER PARTITIONS”) ● Changes in spark/hive itself rather than in user programs
  • 3. Agenda ● Problems with S3 writes ● Spark writes ● Faster hive writes, iteration 1 ● Faster hive writes, iteration 2 ● Fault tolerant DFOC ● Faster recover partitions
  • 4. Part 1: Problems with S3 writes
  • 5. Problems when writing to S3: EC - Eventual consistency problems - HEAD (404) -> PUT -> GET - PUT -> PUT -> GET - PUT -> DELETE -> LIST-PARENT
  • 6. Problems when writing to S3: Rename Operation: Rename s3://bucket/x to s3://bucket/y Copy x to y Delete x - Copy is slow and depends on file size - Two calls needed
  • 7. Problems when writing to S3: Failures - Transient failures of S3 rest calls - Throttling
  • 8. Part 2: Spark writes
  • 9. Two kinds of tables Write Hive table Datasource table Distributed write to hive staging dir Hive.loadTable / Hive.loadPartition called to move data to warehouse Distributed write to final dest
  • 10. Part 3: Faster hive table writes, iteration 1
  • 11. Problem: loadPartition Write Hive table Datasource table Distributed write to hive staging dir Hive.loadTable / Hive.loadPartition called to move data to warehouse Distributed write to final dest
  • 12. Problem: loadPartition is slow - Hive.replaceFiles / Hive.copyFiles primitive is used to move data from hive staging dir to warehouse dir - Rename done in the hive operations is slow and serialized - No retries to account for transient failures
  • 13. Problem: loadPartition has EC issues - EC issues during the copy/move - Few files written to the hive staging directory may not appear in the listing done on the driver during Hive.replaceFiles - Few files deleted may appear in the listing (especially in FOC v1 case)
  • 14. Solution - Parallelize Hive.copyFiles/Hive.replaceFiles - Changes in hive codebase - Algorithm: - listFiles(src); renameFiles(src, dest) - loop-until-no-change - listFiles(src) - renameFiles(src, dest)
  • 15. Solution: Robustness - Listing related - diff(oldListing, newListing) - if new files appear, rename them in this iteration - if existing files disappear, dont try to rename them - Rename related - if rename failed, try to rename them in next iteration
  • 16. Solution: Performance - Rename in parallel in a threadpool of 128 threads - For INSERT INTO, find the N to use for file_copy_N, for all files in the dest dir, in one shot - Rename the biggest files to be first so that they don’t become the long pole - Rename the recently modified files last (FIFO on time) so that they get time to vanish
  • 17. Solution: Performance numbers - INSERT OVERWRITE TABLE user PARTITION(date="2011") SELECT userId, firstName, email FROM people - For example, 100GB data spread over 10000 files - Before optimization: 110 mins - After optimization: 12 mins (not sensitive to file count)
  • 18. Part 4: Faster hive table writes, iteration 2
  • 19. Problem - Can we write directly to the hive warehouse folder? - Avoid hive staging-dir ?
  • 20. Solution Write Hive table Datasource table Distributed write to hive staging dir Hive.loadTable / Hive.loadPartition called to move data to warehouse Distributed write to final dest
  • 21. Solution: Algorithm - InsertIntoHiveTable.run() - if (useDirectWrites) - InsertIntoHadoopFsRelationCommand - FileFormatWriter.write(fileFormat) - else // existing code - FileFormatWriter.write(HiveFileFormat) - Hive.loadTable / Hive.loadPartition
  • 22. Solution: Write directly to the warehouse - Use spark’s default write flow for hive tables also - Avoid using staging_dir - Uses whatever OutputCommitter which is active - Changes in spark code base - Cases: INSERT INTO/OVERWRITE + Static/dynamic partitions - Except INSERT OVERWRITE involving dynamic partitions - Con: Affects warehouse directory immediately on job start
  • 23. Solution: Write directly to the warehouse - Very good performance gains - Hive.loadTable / Hive.loadPartition not needed - Error recovery needs be done carefully - On failure, delete all files s3://bucket/path/*/*/*<jobId>*
  • 24. Solution: Performance - Data: 142 GB (Records - 149994000, Partitions - 9000) - Each partition had one file - Direct writes disabled: 7 hr, 30 min - Direct writes enabled: 24.5 mins - Spark distributed write was fast in both cases. In the first case extra move was needed.
  • 25. Part 5: Fault tolerant DFOC
  • 26. DirectFileOutputCommitter (DFOC) - Directly write to output location - Pros: No EC, high performance - Cons: Speculation and task retries will fail - Cons: Output is visible before job finish
  • 27. Problem - If you use DFOC, any task failure will cause job failure - Empty S3 file is created even on task failure - Retry will always fail with FileAlreadyExistsException - 7/08/16 00:33:55 task-result-getter-1 WARN TaskSetManager: Lost task 0.1 in stage 42.0 (TID 5782, 10.23.7.190, executor 10): org.apache.hadoop.fs.FileAlreadyExistsException: s3n://bucket/path/2017/08/15/23/part-00000-017681ee-5206-4163-b4a9-a29cf8a67ab4.json.gz already exists
  • 28. Solution: Overwrite if file already exists - fs.create(path, false) -> fs.create(path, true) - Spark changes - different across versions - Hive changes - orc - Parquet changes
  • 29. Part 6: Faster recover partitions
  • 30. Problem: - alter table recover partitions is slow - Algorithm - Generate list of all partitions and their statistics - Add partitions to metastore - Example: Two partition keys, 100 values each, 10k partitions in total - takes close to (10+20) mins to recover partitions (spark 2.1.0)
  • 31. Solution - Use faster variant of S3 listing, prefix based - 10 mins for gathering partitions and stats reduced to 10 secs - Now total time is (10 secs + 20 mins), 33% improvement - Spark only changes
  • 32. Thank you - Q&A - rohitk@qubole.com - prakharj@qubole.com - bharatb@qubole.com