It’s very easy to be distracted by the latest and greatest approaches in technology, but sometimes there’s a reason old approaches stand the test of time. Star schemas and Kimball modelling are among the approaches that aren’t going anywhere. As we move towards the “Data Lakehouse” paradigm, how appropriate is this modelling technique, and how can we harness the Delta Engine and Spark 3.0 to maximise its performance?
9. When you think Warehouse…
We automatically think of Star Schemas and Kimball warehousing approaches.
A large central fact table with smaller reference dimensions… some of which aren’t so small
13. SCD - Enabling the Familiar
Before the update:
PrimaryKey  Address                 Current  EffectiveDate  EndDate
11          A new customer address  TRUE     03/08/2020     null
58          Yet another address     TRUE     03/08/2020     null
41          A different address     TRUE     03/08/2020     null

After the address for PrimaryKey 11 changes:
PrimaryKey  Address                 Current  EffectiveDate  EndDate
11          A new customer address  FALSE    03/08/2020     22/10/2020
11          An updated address      TRUE     22/10/2020     null
58          Yet another address     TRUE     03/08/2020     null
41          A different address     TRUE     03/08/2020     null
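The before and after tables on this slide follow the usual Type 2 pattern: the old row is closed off and a new current row is added. A minimal sketch of that logic in plain Python, assuming the column names from the slide; the function `apply_scd2` and its arguments are hypothetical names used only for illustration, not part of Delta or Spark:

```python
from datetime import date


def apply_scd2(rows, updates, effective):
    """Return a new list of rows with Type 2 history applied.

    rows:      list of dicts using the slide's column names
    updates:   dict mapping PrimaryKey -> new Address (hypothetical input shape)
    effective: date on which the new addresses take effect
    """
    result = []
    for row in rows:
        new_address = updates.get(row["PrimaryKey"])
        changed = (
            row["Current"]
            and new_address is not None
            and new_address != row["Address"]
        )
        if changed:
            # Close the old version instead of overwriting it.
            result.append({**row, "Current": False, "EndDate": effective})
            # Add the new current version directly after it.
            result.append({
                "PrimaryKey": row["PrimaryKey"],
                "Address": new_address,
                "Current": True,
                "EffectiveDate": effective,
                "EndDate": None,
            })
        else:
            result.append(dict(row))

    # Keys that were never seen before become brand-new current rows.
    existing = {row["PrimaryKey"] for row in rows}
    for key, address in updates.items():
        if key not in existing:
            result.append({
                "PrimaryKey": key,
                "Address": address,
                "Current": True,
                "EffectiveDate": effective,
                "EndDate": None,
            })
    return result


start = date(2020, 8, 3)
dimension = [
    {"PrimaryKey": 11, "Address": "A new customer address", "Current": True,
     "EffectiveDate": start, "EndDate": None},
    {"PrimaryKey": 58, "Address": "Yet another address", "Current": True,
     "EffectiveDate": start, "EndDate": None},
    {"PrimaryKey": 41, "Address": "A different address", "Current": True,
     "EffectiveDate": start, "EndDate": None},
]

after = apply_scd2(dimension, {11: "An updated address"}, date(2020, 10, 22))
for row in after:
    print(row)
```

The input list is left untouched, so the sketch can be re-run against the same starting rows.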
14. SCD - Merge Commands
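MERGE INTO dataai.addresses AS original
USING updates
  ON original.primaryKey = updates.primaryKey
-- Key already exists: overwrite every column with the incoming values
WHEN MATCHED THEN
  UPDATE SET *
-- Key is new: insert the incoming row
WHEN NOT MATCHED THEN
  INSERT *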
Available in the SQL, Scala and Python APIs, the merge command has made many complex warehousing jobs accessible to the wider analytics community.
This is enabled by the Delta file format.
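To make the semantics of the two merge clauses concrete, here is a small plain-Python sketch of an upsert keyed on `primaryKey`. It only mimics the behaviour of `UPDATE SET *` and `INSERT *` on in-memory dicts; `merge_upsert` is a hypothetical helper, not a Delta API. Note that, as written, the slide's merge overwrites matched rows, so producing the history rows shown on the SCD slide would need additional logic.

```python
def merge_upsert(target, updates):
    """Mimic WHEN MATCHED THEN UPDATE SET * / WHEN NOT MATCHED THEN INSERT *.

    Both arguments map primaryKey -> row dict. A new dict is returned.
    """
    merged = {key: dict(row) for key, row in target.items()}
    for key, row in updates.items():
        # Matched key: every column is overwritten.
        # Unmatched key: the row is inserted.
        merged[key] = dict(row)
    return merged


addresses = {
    11: {"primaryKey": 11, "address": "A new customer address"},
    58: {"primaryKey": 58, "address": "Yet another address"},
}
incoming = {
    11: {"primaryKey": 11, "address": "An updated address"},  # matched
    77: {"primaryKey": 77, "address": "A first-time address"},  # not matched
}

merged = merge_upsert(addresses, incoming)
print(merged[11]["address"])
print(sorted(merged))
```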
16. Spark Partitioning
SQL Query: SELECT * FROM Sales WHERE Month = 3
Action: Filtering performed by selectively reading files
SALES partitions: Month=1, Month=2, Month=3, Month=4
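The idea of "selectively reading files" can be sketched in plain Python: when a table is laid out in one folder per partition value, a filter on the partition column decides which folders are opened at all. The file names below are made up for illustration, and `files_to_read` is a hypothetical helper rather than anything Spark exposes.

```python
def files_to_read(layout, predicate):
    """Return only the files whose partition value satisfies the filter.

    layout:    dict mapping partition value -> list of file paths
    predicate: function applied to the partition value
    """
    selected = []
    for value in sorted(layout):
        if predicate(value):
            selected.extend(layout[value])
    return selected


# Hypothetical on-disk layout of the SALES table, partitioned by Month.
sales_layout = {
    1: ["Sales/Month=1/part-0000.parquet"],
    2: ["Sales/Month=2/part-0000.parquet"],
    3: ["Sales/Month=3/part-0000.parquet", "Sales/Month=3/part-0001.parquet"],
    4: ["Sales/Month=4/part-0000.parquet"],
}

# Equivalent of: SELECT * FROM Sales WHERE Month = 3
march_files = files_to_read(sales_layout, lambda month: month == 3)
print(march_files)
```

Only the two files under `Month=3` are touched; the other three folders are never read.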
17. Cross-Filter Spark 2.4
SQL Query: SELECT * FROM Sales JOIN Date WHERE DateMonth = 3
Action: Partition Keys not hit when filtering on joined tables
Tables: SALES (partitions Month=1, Month=2, Month=3, Month=4), DimDATE
18. Cross-Filter Spark 3.0
SQL Query: SELECT * FROM Sales JOIN Date WHERE DateMonth = 3
Action: Dynamic Partition Pruning determines partition filters during runtime
Tables: SALES (partitions Month=1, Month=2, Month=3, Month=4), DimDATE
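The difference between the two cross-filter slides can be sketched in plain Python. The slides do not show the join condition, so this sketch assumes SALES is partitioned by `Month` and joins to a hypothetical `MonthKey` column on DimDATE; the file names and helper functions are likewise illustrative. In Spark 3.0 the feature is governed by the setting `spark.sql.optimizer.dynamicPartitionPruning.enabled`.

```python
def scan_without_pruning(layout):
    """Behaviour described for Spark 2.4: the filter sits on the dimension,
    so every partition of the fact table is read before the join."""
    return [path for value in sorted(layout) for path in layout[value]]


def scan_with_dynamic_pruning(layout, dim_rows, dim_filter, join_key):
    """Behaviour described for Spark 3.0: filter the dimension first, then
    use the surviving join keys as a partition filter on the fact table."""
    wanted = {row[join_key] for row in dim_rows if dim_filter(row)}
    return [path for value in sorted(layout) if value in wanted
            for path in layout[value]]


# Hypothetical layout of SALES, partitioned by Month.
sales_layout = {
    1: ["Sales/Month=1/part-0000.parquet"],
    2: ["Sales/Month=2/part-0000.parquet"],
    3: ["Sales/Month=3/part-0000.parquet"],
    4: ["Sales/Month=4/part-0000.parquet"],
}

# Hypothetical DimDATE rows; MonthKey is the assumed join column.
dim_date = [
    {"Date": "2020-01-15", "DateMonth": 1, "MonthKey": 1},
    {"Date": "2020-02-15", "DateMonth": 2, "MonthKey": 2},
    {"Date": "2020-03-15", "DateMonth": 3, "MonthKey": 3},
    {"Date": "2020-04-15", "DateMonth": 4, "MonthKey": 4},
]

full_scan = scan_without_pruning(sales_layout)
pruned_scan = scan_with_dynamic_pruning(
    sales_layout, dim_date, lambda row: row["DateMonth"] == 3, "MonthKey"
)
print(len(full_scan), "files without pruning")
print(pruned_scan)
```

The pruned scan works because the dimension is small enough to filter first; the resulting key set then acts like the `WHERE Month = 3` filter from the partitioning slide.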
20. AQE in Spark 3.0
Adaptive Query Execution (AQE) will speed up common queries in a number of ways:
▪ Coalescing Shuffle Partitions
▪ Switching Join Strategies
▪ Optimizing Skew Joins
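As an illustration of the first item, coalescing shuffle partitions, the sketch below greedily merges adjacent small partitions until each group approaches a target size. This is a simplified model for intuition only, not Spark's actual algorithm; the sizes and the `coalesce_partitions` helper are hypothetical. In Spark 3.0, AQE itself is switched on with `spark.sql.adaptive.enabled`.

```python
def coalesce_partitions(sizes, target):
    """Group adjacent partition indexes so that each group stays at or
    below `target` where possible (a single oversized partition is kept
    as its own group)."""
    groups = []
    current, current_size = [], 0
    for index, size in enumerate(sizes):
        # Start a new group when adding this partition would overshoot.
        if current and current_size + size > target:
            groups.append(current)
            current, current_size = [], 0
        current.append(index)
        current_size += size
    if current:
        groups.append(current)
    return groups


# Hypothetical shuffle partition sizes in MB, measured after a stage ran.
sizes_mb = [10, 5, 40, 3, 2, 60, 8]
groups = coalesce_partitions(sizes_mb, target=64)
print(groups)  # [[0, 1, 2, 3, 4], [5], [6]]
```

Seven small partitions become three tasks here. The key point is that the grouping uses sizes observed at runtime rather than an estimate made before the query started.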