Common strategies for
improving performance on your
Delta Lakehouse
Franco Patano
Sr. Solutions Architect, Databricks
Agenda
Table Properties and Z-Ordering
Principles of structuring Delta tables for optimal
performance at each stage
Configurations
Spark and Delta configs to tune in common
situations
Query Optimizations
Query Hints to optimize join strategies
Table Structure
▪ Stats are only collected on the first 32
ordinal fields, including fields in nested
structures
▪ You can change this limit with the table property dataSkippingNumIndexedCols
▪ Restructure data accordingly
▪ Move numerics, keys, and high-cardinality query predicates to the left, and
long strings that are not distinct enough for stats collection to the right,
past the dataSkippingNumIndexedCols boundary
▪ Long strings are kryptonite to stats collection; move them past the
32nd position, or past dataSkippingNumIndexedCols
Column layout: Numbers, Keys, High Cardinality fields | Long Strings
(boundary at 32 columns, or dataSkippingNumIndexedCols)
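As a sketch of the restructuring advice above (the column names and the set of long-string columns are hypothetical), the reordering is just a stable partition of the column list; in Spark you would apply the result with `df.select(*reordered)` before writing:

```python
# Sketch of the column-reordering advice, assuming a hypothetical events
# table. Spark does not reorder columns for you; apply the result with
# df.select(*reordered) before writing the Delta table.

def reorder_for_stats(columns, long_string_cols):
    """Keep stats-friendly fields first; push long strings to the end,
    past the dataSkippingNumIndexedCols boundary."""
    indexed = [c for c in columns if c not in long_string_cols]
    deferred = [c for c in columns if c in long_string_cols]
    return indexed + deferred

cols = ["event_id", "user_id", "amount", "raw_payload", "notes", "ts"]
reordered = reorder_for_stats(cols, long_string_cols={"raw_payload", "notes"})
# keys and numerics stay inside the stats window; long strings fall outside it
```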
Table Properties
Optimized Writes
Performs an adaptive shuffle before writing
files
Works for Insert, Merge, and Update
operations to speed up writes
Select queries also benefit from
better-ordered data in the files between
OPTIMIZE commands in streaming
use cases
ALTER TABLE [<table-name> | delta.`<path-to-table>`]
SET TBLPROPERTIES ('delta.autoOptimize.optimizeWrite' = 'true');
SET spark.databricks.delta.properties.defaults.autoOptimize.optimizeWrite = true;
Table Properties High Velocity
Do you have requirements for thousands of requests per second (read/write)?
▪ Randomize Prefixes on S3
▪ Avoids hotspots in S3 metadata
▪ Dedicate S3 bucket per Delta Table (root bucket)
▪ Turn on Table Property
▪ ALTER TABLE [table_name | delta.`<table-path>`] SET TBLPROPERTIES ('delta.randomizeFilePrefixes' = 'true')
▪ SET spark.databricks.delta.properties.defaults.randomizeFilePrefixes = true
Optimize and Z-Order
Optimize will bin pack our files for better read performance
Z-Order will organize our data for better data skipping
What fields should you Z-Order by?
Fields that are being joined on, or included in a predicate
▪ Primary and Foreign Keys on dim and fact tables
▪ ID fields that are joined to other tables
▪ High Cardinality fields used in query predicates
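The data-skipping benefit comes from how a Z-order (Morton) curve interleaves the bits of several keys, so rows that are close in every Z-Order dimension land in the same files. A minimal pure-Python illustration of the curve itself (a concept sketch, not Delta's implementation):

```python
# Minimal illustration of the Z-order (Morton) curve behind Z-Ordering:
# interleaving the bits of two keys yields one sort key under which points
# close in both dimensions sort near each other (and hence cluster into
# the same files). Concept sketch only, not Delta's implementation.

def interleave_bits(x, y, bits=16):
    """Morton value: bit i of x goes to position 2i, bit i of y to 2i+1."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)
        z |= ((y >> i) & 1) << (2 * i + 1)
    return z

points = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 2)]
ordered = sorted(points, key=lambda p: interleave_bits(*p))
# neighbouring points in (x, y) space stay adjacent in the sorted order
```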
Partitioning and Z-Order effectiveness
High Cardinality: very uncommon or unique data
● User or Device ID
● Email Address
● Phone Number
Regular Cardinality: common, repeatable data
● People or Object Names
● Street Addresses
● Categories
Low Cardinality: repeatable, limited distinct data
● Gender
● Status Flags
● Boolean Values
Measure cardinality with SELECT COUNT(DISTINCT x)
Partitioning effectiveness increases toward low cardinality;
Z-Order effectiveness increases toward high cardinality
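This spectrum can be captured as a rough heuristic. The thresholds below (1,000 distinct values, 0.1% distinct ratio) are illustrative assumptions, not Databricks guidance:

```python
# Heuristic version of the cardinality spectrum above. The thresholds
# (1,000 distinct values, 0.1% distinct ratio) are illustrative
# assumptions, not official guidance.

def suggest_layout(distinct_count, row_count):
    """Partition on low-cardinality fields, Z-Order on high-cardinality ones."""
    ratio = distinct_count / row_count
    if distinct_count <= 1_000 and ratio < 0.001:
        return "partition"
    return "z-order"

suggest_layout(3, 10_000_000)          # a status flag: good partition key
suggest_layout(8_000_000, 10_000_000)  # a user ID: Z-Order instead
```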
Spark and Adaptive Query Execution
Turn AQE on: spark.sql.adaptive.enabled true (default in DBR 7.3+, yay!)
▪ Need to turn on for all adaptive configs
Turn Coalesce Partitions on: spark.sql.adaptive.coalescePartitions.enabled true
▪ Let AQE manage SQL Partitions
Turn Skew Join on: spark.sql.adaptive.skewJoin.enabled true
▪ Let AQE manage skewed data in sort merge joins
Turn Local Shuffle Reader on: spark.sql.adaptive.localShuffleReader.enabled true
▪ Save time on network transport by reading shuffle files locally if we can
Broadcast Join Threshold: spark.sql.autoBroadcastJoinThreshold 100*1024*1024
▪ Increase the threshold to broadcast larger small-side tables
Not Prefer SortMergeJoin: spark.sql.join.preferSortMergeJoin false
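The settings above can be applied once per session. A small helper sketch (the SparkSession `spark` is assumed and not created here; values echo the slide):

```python
# The AQE settings above collected as a dict (values echo the slide).
# apply_confs assumes an active SparkSession `spark`; it is not invoked here.

AQE_CONFS = {
    "spark.sql.adaptive.enabled": "true",
    "spark.sql.adaptive.coalescePartitions.enabled": "true",
    "spark.sql.adaptive.skewJoin.enabled": "true",
    "spark.sql.adaptive.localShuffleReader.enabled": "true",
    "spark.sql.autoBroadcastJoinThreshold": str(100 * 1024 * 1024),
    "spark.sql.join.preferSortMergeJoin": "false",
}

def apply_confs(spark, confs=AQE_CONFS):
    """Set each conf on the session (spark.conf.set takes string key/value)."""
    for key, value in confs.items():
        spark.conf.set(key, value)
```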
Delta Configs
Delta Cache: spark.databricks.io.cache.enabled true
▪ Should be enabled by default on Delta Cache Enabled clusters
▪ Can be enabled for any cluster; the faster the local disk, the better the performance
Delta Cache Staleness: spark.databricks.delta.stalenessLimit 1h
▪ If your data is not refreshed often, turn up the staleness limit to decrease query processing time
▪ Use for BI or Analytics clusters
▪ Should NOT use for ETL clusters
Enhanced checkpoints for low-latency queries: delta.checkpoint.writeStatsAsJson
▪ Use DBR 7.3 LTS+ (enabled by default)
▪ Eliminates deserialization step for checkpoints, speeding up latency on short queries
Broadcast Hash Join / Nested Loop
SELECT /*+ BROADCAST(a) */ id FROM a JOIN b ON a.key = b.key
Requires one side to be small. No shuffle, no sort, very fast.
Shuffle Hash Join
SELECT /*+ SHUFFLE_HASH(a, b) */ id FROM a JOIN b ON a.key = b.key
Needs to shuffle data, but no sort. Can handle large tables, but will also OOM
if data is skewed. Picked when one side is smaller (3x or more) and a partition
of it can fit in memory (enable by spark.sql.join.preferSortMergeJoin = false).
Sort-Merge Join
SELECT /*+ MERGE(a, b) */ id FROM a JOIN b ON a.key = b.key
Robust: can handle any data size, but needs to shuffle and sort data, so it is
slower in most cases when the table size is small.
Shuffle Nested Loop Join (Cartesian)
SELECT /*+ SHUFFLE_REPLICATE_NL(a, b) */ id FROM a JOIN b
Does not require join keys, as it is a cartesian product of the tables. Avoid
doing this if you can.
Use these hints when AQE is not getting the hint...
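The decision table above can be mirrored as a heuristic chooser. The thresholds reuse the 100 MB broadcast limit and the "3x or more" ratio from the slides, but the function itself is an illustrative sketch, since Spark's planner (or your explicit hint) makes the real decision:

```python
# Illustrative chooser mirroring the hint table above. Thresholds reuse
# the 100 MB broadcast limit and the 3x size ratio from the slides; this
# is a heuristic sketch, not Spark's actual join planning.

def pick_join_hint(left_bytes, right_bytes,
                   broadcast_threshold=100 * 1024 * 1024,
                   size_ratio=3):
    """Suggest a join hint name based on estimated table sizes."""
    small, large = sorted((left_bytes, right_bytes))
    if small <= broadcast_threshold:
        return "BROADCAST"     # one side is small: no shuffle, no sort
    if large >= size_ratio * small:
        return "SHUFFLE_HASH"  # lopsided tables: shuffle but no sort
    return "MERGE"             # similar, large tables: robust default
```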
Tips for each layer
Bronze: Raw, Historical Ingestion
● Turn off stats collection
  ○ dataSkippingNumIndexedCols 0
● Optimize and Z-Order by merge keys
between Bronze and Silver
● Turn on Optimized Writes
Silver: Filtered, Cleaned, Augmented
● Restructure columns to account
for data skipping index columns
● Optimize and Z-Order by join keys
or common High Cardinality query
predicates
● Turn on Optimized Writes
Gold: Business-level Aggregates
● Enable Delta Cache (with fast-disk
cluster types)
● Turn up the Staleness Limit to align
with your orchestration
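The per-layer table properties can be scripted. A hypothetical helper (table names and the layer-to-property mapping are illustrative; the gold-layer tips, Delta Cache and the staleness limit, are cluster/session confs rather than table properties, so they are omitted):

```python
# Hypothetical helper turning the per-layer tips into TBLPROPERTIES DDL.
# The mapping is illustrative; gold-layer tips (Delta Cache, staleness
# limit) are cluster/session confs, not table properties, so they are
# omitted here.

LAYER_TBLPROPERTIES = {
    "bronze": {"delta.dataSkippingNumIndexedCols": "0"},
    "silver": {"delta.autoOptimize.optimizeWrite": "true"},
    "gold": {"delta.autoOptimize.optimizeWrite": "true"},
}

def tblproperties_ddl(table, layer):
    """Render an ALTER TABLE statement for one medallion layer."""
    props = ", ".join(
        f"'{k}' = '{v}'" for k, v in LAYER_TBLPROPERTIES[layer].items()
    )
    return f"ALTER TABLE {table} SET TBLPROPERTIES ({props})"
```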
Pro-tips
Use the latest Databricks Runtime
▪ We are constantly improving performance and adding features
The key to fast Update/Merge operations is to rewrite the fewest files
▪ Optimized Writes helps
▪ spark.databricks.delta.optimize.maxFileSize = 33554432 (32 MB; effective range 16 to 128 MB)
The key to fast Select queries
▪ Delta Cache
▪ Optimize and Z-Order
▪ Turn on AQE
Try using the Hilbert curve for OPTIMIZE
▪ spark.databricks.io.skipping.mdc.curve hilbert
It’s Demo Time!
Feedback
Your feedback is important to us.
Don’t forget to rate
and review the sessions.
