Common Strategies for Improving Performance on Your Delta Lakehouse




The Delta Architecture pattern has made the lives of data engineers much simpler, but what about improving query performance for data analysts? Where should you look first when tuning query performance? In this session we will cover some common techniques to apply to Delta tables so that they perform better for data analysts' queries. We will look at a few examples of how you can analyze a query and determine what to focus on to deliver better performance.


  1. Common strategies for improving performance on your Delta Lakehouse
     Franco Patano, Sr. Solutions Architect, Databricks
  2. Agenda
     ▪ Table Properties and Z-Ordering: principles of structuring Delta tables for optimal performance at each stage
     ▪ Configurations: Spark and Delta configs to tune in common situations
     ▪ Query Optimizations: query hints to optimize join strategies
  3. Table Structure
     ▪ Stats are only collected on the first 32 ordinal fields, including fields in nested structures
     ▪ You can change this with this property: dataSkippingNumIndexedCols
     ▪ Restructure data accordingly
     ▪ Move numericals, keys, and high cardinality query predicates to the left; move long strings that are not distinct enough for stats collection to the right, past dataSkippingNumIndexedCols
     ▪ Long strings are kryptonite to stats collection: move them past the 32nd position, or past dataSkippingNumIndexedCols
     (Diagram: numbers, keys, and high cardinality columns on the left; long strings on the right; the boundary is 32 columns or dataSkippingNumIndexedCols.)
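The column-ordering advice on this slide can be sketched in a few lines of plain Python. This is only an illustration: the `Column` record, the `LONG_STRING_THRESHOLD` cutoff, and the example column names are hypothetical, and the sketch assumes the full table property name is `delta.dataSkippingNumIndexedCols`.

```python
from dataclasses import dataclass

# Hypothetical cutoff: strings longer than this on average are treated as
# "long strings" that are not worth collecting stats on.
LONG_STRING_THRESHOLD = 64


@dataclass
class Column:
    name: str
    dtype: str                   # e.g. "bigint", "string", "timestamp"
    avg_length: int = 0          # average value length, only meaningful for strings
    in_predicates: bool = False  # used in joins or WHERE clauses


def is_long_string(col):
    return col.dtype == "string" and col.avg_length > LONG_STRING_THRESHOLD


def reorder_for_stats(columns):
    """Put predicate columns first and long strings last.

    Returns the new column order and the number of leading columns
    that are worth indexing for data skipping.
    """
    def rank(col):
        if is_long_string(col):
            return 2
        return 0 if col.in_predicates else 1

    ordered = sorted(columns, key=rank)  # sorted() is stable within each rank
    indexed = sum(1 for c in ordered if not is_long_string(c))
    return ordered, indexed


def set_indexed_cols_sql(table, n):
    # Assumption: the table property is spelled delta.dataSkippingNumIndexedCols.
    return (
        f"ALTER TABLE {table} SET TBLPROPERTIES "
        f"('delta.dataSkippingNumIndexedCols' = '{n}')"
    )


columns = [
    Column("notes", "string", avg_length=500),
    Column("user_id", "bigint", in_predicates=True),
    Column("amount", "double"),
    Column("event_ts", "timestamp", in_predicates=True),
]
ordered, indexed = reorder_for_stats(columns)
ordered_names = [c.name for c in ordered]
statement = set_indexed_cols_sql("events", indexed)
```

The generated statement would then be run against the table once the columns have actually been rewritten in the new order; the sketch only decides the order and the property value.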
  4. Table Properties: Optimized Writes
     ▪ Adaptive shuffle before writing files
     ▪ Works for Inserts, Merge, and Updates to speed up the writes
     ▪ In streaming use cases, Select queries benefit from ordered data in the files between Optimize commands
     ▪ ALTER TABLE [<table-name>|delta.`<path-to-table>`] SET TBLPROPERTIES (delta.autoOptimize.optimizeWrite = true)
     ▪ set [setting name missing] = true;
  5. Table Properties: High Velocity
     Do you have requirements for thousands of requests per second (read/write)?
     ▪ Randomize prefixes on S3
     ▪ Avoids hotspots in S3 metadata
     ▪ Dedicate an S3 bucket per Delta table (root bucket)
     ▪ Turn on the table property
     ▪ ALTER TABLE [table_name | delta.`<table-path>`] SET TBLPROPERTIES (delta.randomizeFilePrefixes = true)
     ▪ spark.sql("SET delta.randomizeFilePrefixes = true")
  6. Optimize and Z-Order
     Optimize will bin-pack our files for better read performance.
     Z-Order will organize our data for better data skipping.
     What fields should you Z-Order by? Fields that are being joined on, or included in a predicate:
     ▪ Primary keys and foreign keys on dim and fact tables
     ▪ ID fields that are joined to other tables
     ▪ High cardinality fields used in query predicates
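To make the data-skipping claim concrete, here is a toy sketch of a Z-order (Morton) curve in plain Python. It is not Delta's implementation, which the slides do not describe; it only shows why ordering rows by interleaved bits keeps per-file min/max ranges tight on more than one column. The 4x4 grid and the four-rows-per-file layout are made up for illustration.

```python
def z_value(x, y, bits=16):
    """Interleave the bits of x and y into one Morton code."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)      # x takes the even bit positions
        z |= ((y >> i) & 1) << (2 * i + 1)  # y takes the odd bit positions
    return z


def files_to_scan(rows, rows_per_file, column, value):
    """Count files whose [min, max] range on `column` could contain `value`."""
    count = 0
    for start in range(0, len(rows), rows_per_file):
        chunk = rows[start:start + rows_per_file]
        values = [row[column] for row in chunk]
        if min(values) <= value <= max(values):
            count += 1
    return count


# Toy table: every (x, y) pair on a 4x4 grid, written as 4 files of 4 rows.
points = [(x, y) for x in range(4) for y in range(4)]
linear = sorted(points)  # ordered by x, then y
z_ordered = sorted(points, key=lambda p: z_value(*p))

# Filter on x == 3 (column 0) and on y == 3 (column 1).
linear_scans = (files_to_scan(linear, 4, 0, 3), files_to_scan(linear, 4, 1, 3))
z_scans = (files_to_scan(z_ordered, 4, 0, 3), files_to_scan(z_ordered, 4, 1, 3))
```

With the linear sort, a filter on `x` touches one file but a filter on `y` touches all four; with the Z-ordered layout, either filter touches two files. That trade is the reason to Z-Order by the handful of fields that actually appear in joins and predicates.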
  7. Partitioning and Z-Order effectiveness
     ● High cardinality (very uncommon or unique datum): user or device ID, email address, phone number
     ● Regular cardinality (common, repeatable data): people or object names, street addresses, categories
     ● Low cardinality (repeatable, limited distinct data): gender, status flags, Boolean values
     Measure cardinality with SELECT COUNT(DISTINCT(x)).
     (Diagram: the cardinality scale is annotated with "Partitioning effectiveness" and "Z-Order effectiveness".)
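A rough way to turn the cardinality scale into a decision is sketched below. The transcript does not preserve the direction of the two effectiveness arrows, so this assumes the common guidance that partitioning suits low cardinality columns and Z-Ordering suits high cardinality ones. The thresholds `high_ratio` and `low_max`, the table name, and the column statistics are all hypothetical.

```python
def layout_suggestion(distinct_count, row_count, high_ratio=0.1, low_max=1000):
    """Classify a column from its COUNT(DISTINCT x) and the table row count.

    Thresholds are illustrative assumptions, not documented cutoffs.
    """
    if row_count <= 0:
        raise ValueError("row_count must be positive")
    if distinct_count / row_count >= high_ratio:
        return "zorder"      # high cardinality: near-unique values
    if distinct_count <= low_max:
        return "partition"   # low cardinality: few, repeated values
    return "regular"         # in between: judge case by case


def optimize_sql(table, zorder_columns):
    """Build an OPTIMIZE ... ZORDER BY statement for the chosen columns."""
    return f"OPTIMIZE {table} ZORDER BY ({', '.join(zorder_columns)})"


# Hypothetical results of SELECT COUNT(DISTINCT(x)) on a 1,000,000-row table.
row_count = 1_000_000
distinct_counts = {"user_id": 900_000, "city": 20_000, "status_flag": 3}

suggestions = {
    name: layout_suggestion(count, row_count)
    for name, count in distinct_counts.items()
}
zorder_columns = [name for name, s in suggestions.items() if s == "zorder"]
statement = optimize_sql("events", zorder_columns)
```

Cardinality alone is not enough to pick Z-Order columns; the previous slide's rule still applies, so a column should also appear in joins or predicates before it is added to the list.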
  8. Spark and Adaptive Query Execution
     Turn AQE on: spark.sql.adaptive.enabled true (default in DBR 7.3+, yay!)
     ▪ Needs to be on for all adaptive configs to take effect
     Turn Coalesce Partitions on: spark.sql.adaptive.coalescePartitions.enabled true
     ▪ Let AQE manage SQL partitions
     Turn Skew Join on: spark.sql.adaptive.skewJoin.enabled true
     ▪ Let AQE manage skewed data in sort-merge joins
     Turn Local Shuffle Reader on: spark.sql.adaptive.localShuffleReader.enabled true
     ▪ Save time on network transport by reading shuffle files locally when possible
     Broadcast Join Threshold: spark.sql.autoBroadcastJoinThreshold 100*1024*1024
     ▪ Raise the size threshold below which a table is broadcast
     Do not prefer SortMergeJoin: spark.sql.join.preferSortMergeJoin false
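The settings on this slide can be kept together and applied in one pass. The sketch below collects them in a dictionary and applies them through any two-argument setter; in a live PySpark session that setter would be `spark.conf.set`, which is assumed here rather than run. The names `AQE_SETTINGS` and `apply_settings` are made up for the example.

```python
# Values are strings, as Spark configuration values are set as text.
AQE_SETTINGS = {
    "spark.sql.adaptive.enabled": "true",
    "spark.sql.adaptive.coalescePartitions.enabled": "true",
    "spark.sql.adaptive.skewJoin.enabled": "true",
    "spark.sql.adaptive.localShuffleReader.enabled": "true",
    # 100 MB expressed in bytes, matching 100*1024*1024 on the slide.
    "spark.sql.autoBroadcastJoinThreshold": str(100 * 1024 * 1024),
    "spark.sql.join.preferSortMergeJoin": "false",
}


def apply_settings(set_conf, settings=AQE_SETTINGS):
    """Call set_conf(key, value) for every setting.

    With a live session this would be apply_settings(spark.conf.set).
    """
    for key, value in settings.items():
        set_conf(key, value)


# Stand-in for a Spark session: record the settings in a plain dict.
applied = {}
apply_settings(applied.__setitem__)
```

Keeping the values in one place makes it easier to review them per cluster type, for example using a smaller broadcast threshold on clusters with little driver memory.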
  9. Delta Configs
     Delta Cache: true
     ▪ Should be enabled by default on Delta Cache Enabled clusters
     ▪ Can be enabled for any cluster; the faster the local disk, the better the performance
     Delta Cache Staleness: 1h
     ▪ If your data is not refreshed often, turn up the staleness limit to decrease query processing
     ▪ Use for BI or Analytics clusters
     ▪ Should NOT be used for ETL clusters
     Enhanced checkpoints for low-latency queries: delta.checkpoint.writeStatsAsJson
     ▪ Use DBR 7.3 LTS+ (enabled by default)
     ▪ Eliminates the deserialization step for checkpoints, reducing latency on short queries
  10. When AQE is not getting the hint...
     ▪ Broadcast Hash Join / Nested Loop: SELECT /*+ BROADCAST(a) */ id FROM a JOIN b ON a.key = b.key
       Requires one side to be small. No shuffle, no sort, very fast.
     ▪ Shuffle Hash Join: SELECT /*+ SHUFFLE_HASH(a, b) */ id FROM a JOIN b ON a.key = b.key
       Needs to shuffle data but no sort. Can handle large tables, but can also run out of memory (OOM) if data is skewed. One side is smaller (3x or more) and a partition of it can fit in memory (enable by `spark.sql.join.preferSortMergeJoin = false`).
     ▪ Sort-Merge Join: SELECT /*+ MERGE(a, b) */ id FROM a JOIN b ON a.key = b.key
       Robust. Can handle any data size. Needs to shuffle and sort data; slower in most cases when the table size is small.
     ▪ Shuffle Nested Loop Join (Cartesian): SELECT /*+ SHUFFLE_REPLICATE_NL(a, b) */ id FROM a JOIN b
       Does not require join keys, as it is a cartesian product of the tables. Avoid doing this if you can.
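The trade-offs between the four hints can be written down as a small decision helper. This is a heuristic sketch of the slide's guidance, not how Spark's planner chooses a join: the function name, the default 100 MB broadcast threshold, and the 3x size ratio are assumptions taken from the figures on the slides, and the check that a partition of the smaller side fits in memory is not modelled.

```python
MB = 1024 * 1024
GB = 1024 * MB


def suggest_join_hint(left, right, has_join_keys=True,
                      broadcast_threshold=100 * MB, size_ratio=3):
    """Suggest a join hint for two tables given as (alias, size_in_bytes).

    Heuristic only; it does not check skew or available executor memory.
    """
    (small_alias, small), (_, large) = sorted([left, right], key=lambda t: t[1])
    both = f"{left[0]}, {right[0]}"

    if small <= broadcast_threshold:
        # One side is small: no shuffle, no sort.
        return f"BROADCAST({small_alias})"
    if not has_join_keys:
        # Cartesian product; avoid if at all possible.
        return f"SHUFFLE_REPLICATE_NL({both})"
    if large >= size_ratio * small:
        # One side is at least 3x smaller: shuffle but no sort.
        return f"SHUFFLE_HASH({both})"
    # Robust default: shuffle and sort, handles any data size.
    return f"MERGE({both})"


hint = suggest_join_hint(("a", 1 * GB), ("b", 5 * GB))
query = f"SELECT /*+ {hint} */ id FROM a JOIN b ON a.key = b.key"
```

The resulting string has the same shape as the hinted queries on the slide, so it can be pasted into a query once the suggestion has been checked against the actual plan.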
  11. Tips for each layer
     ● Bronze: raw, historical ingestion
     ● Silver: filtered, cleaned, augmented
     ● Gold: business-level aggregates
     Tips, in slide order:
     ● Turn off stats collection
       ○ dataSkippingNumIndexedCols 0
     ● Optimize and Z-Order by merge keys between Bronze and Silver
     ● Turn on Optimized Writes
     ● Restructure columns to account for data skipping index columns
     ● Optimize and Z-Order by join keys or common high cardinality query predicates
     ● Turn on Optimized Writes
     ● Enable Delta Cache (with fast disk cluster types)
     ● Turn up the Staleness Limit to align with your orchestration
  12. Pro-tips
     Use the latest Databricks Runtime
     ▪ We are constantly improving performance and adding features
     The key to fast Update/Merge operations is to rewrite the fewest files
     ▪ Optimized Writes helps
     ▪ = 32MB (16 to 128)
     The key to fast Select queries
     ▪ Delta Cache
     ▪ Optimize and Z-Order
     ▪ Turn on AQE
     Try using Hilbert curve for optimize
     ▪ hilbert
  13. It’s Demo Time!
  14. Feedback
     Your feedback is important to us. Don’t forget to rate and review the sessions.