An unconventional approach for ETL of historized data

Maintaining historized data is a very common but time-consuming task in a data warehouse environment. The common techniques involve outer joins and some kind of change detection. This change detection must handle NULL values correctly and is possibly the trickiest part. On the other hand, SQL offers standard functionality with exactly the desired behaviour: GROUP BY, or partitioning with analytic functions. Can it be used for this task?
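
The NULL problem described above can be reproduced in a few lines. The following is a minimal sketch, not taken from the slides: it uses Python's built-in sqlite3 module as a stand-in for an Oracle database, and the table names `src` and `tgt` and their rows are illustrative assumptions. It contrasts a plain inequality comparison, which silently misses a change from NULL to a value, with GROUP BY, which puts NULLs into the same group.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE src (bk INTEGER, c1 TEXT, c2 TEXT);  -- hypothetical source
    CREATE TABLE tgt (bk INTEGER, c1 TEXT, c2 TEXT);  -- hypothetical target
    INSERT INTO src VALUES (11, 'A', 'B'), (22, 'D', NULL), (77, 'M', NULL);
    INSERT INTO tgt VALUES (11, 'A', 'BB'), (22, 'D', NULL), (77, 'M', 'N');
""")

# Plain inequality: NULL <> 'N' evaluates to NULL, so key 77 is not reported.
naive_changed = [row[0] for row in conn.execute("""
    SELECT s.bk
    FROM src s JOIN tgt t ON t.bk = s.bk
    WHERE s.c1 <> t.c1 OR s.c2 <> t.c2
    ORDER BY s.bk
""")]

# GROUP BY over both tables: identical rows (NULLs included) collapse into one
# group with COUNT(*) = 2. A business key that still shows up in a group of
# size 1 differs between source and target.
grouped_changed = [row[0] for row in conn.execute("""
    SELECT DISTINCT bk
    FROM (
        SELECT bk, c1, c2, COUNT(*) AS cnt
        FROM (SELECT * FROM src UNION ALL SELECT * FROM tgt)
        GROUP BY bk, c1, c2
    )
    WHERE cnt = 1
    ORDER BY bk
""")]

print(naive_changed)    # keys found by the plain comparison
print(grouped_changed)  # keys found by the GROUP BY comparison
```

Key 22 has NULL in `c2` on both sides and is correctly treated as unchanged by the GROUP BY variant, without any NVL-style wrapping of the column.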

  1. SCD2 mal anders (SCD2 done differently). Andrej Pashchenko, Senior Consultant, Düsseldorf, @Andrej_SQL, doag2017
  2. Our company. Trivadis is a leader in IT consulting, system integration, solution engineering and the provision of IT services, with a focus on [...] and [...] technologies, in Switzerland, Germany, Austria and Denmark. Trivadis delivers its services from the following strategic business areas: [...] Trivadis Services takes over the corresponding operation of your IT systems (OPERATIONS). Slide footer: Trivadis DOAG17: SCD2 mal anders, 29.11.2018
  3. With more than 600 IT and subject-matter experts at your site.
     • Locations: Kopenhagen, München, Lausanne, Bern, Zürich, Brugg, Genf, Hamburg, Düsseldorf, Frankfurt, Stuttgart, Freiburg, Basel, Wien
     • 14 Trivadis branches with more than 600 employees
     • More than 200 service level agreements
     • More than 4,000 training participants
     • Research and development budget: CHF 5.0 million / EUR 4.0 million
     • Financially independent and sustainably profitable
     • Experience from more than 1,900 projects per year at more than 800 customers
  4. About me
     • Senior Consultant at Trivadis GmbH, Düsseldorf
     • Focus on Oracle: data warehousing, application development, application performance
     • Course instructor for „Oracle 12c New Features für Entwickler“ and „TechnoCircle Oracle 12c Release 2“
     • Blog: http://blog.sqlora.com
  5. Agenda
     1. Introduction and state of the art
     2. The "new" approach
     3. Use cases and performance
     4. Conclusion
  6. Introduction and state of the art
  7. Introduction
     • Historization?
     • It is part of the loading process in a data warehouse
     • We consider Slowly Changing Dimensions Type II
     • All changes are completely tracked: a change in at least one of the tracked columns triggers the creation of a new version record
     • The most challenging task is change detection

     DWH_KEY | VALID_FROM | VALID_TO   | CUR_VERSION | ETL_OP | BUS_KEY | FIRST_NAME | SECOND_NAMES | LAST_NAME | HIRE_DATE  | FIRE_DATE  | SALARY
     1       | 01.12.2016 | 02.12.2016 | N           | UPD    | 123     | Roger      |              | Federer   | 01.01.2010 |            | 900000
     11      | 03.12.2016 |            | Y           | INS    | 123     | Roger      |              | Federer   | 01.01.2010 |            | 920000
     6       | 02.12.2016 | 02.12.2016 | N           | UPD    | 345     | Venus      |              | Williams  | 01.11.2016 |            | 500000
     10      | 03.12.2016 |            | Y           | INS    | 345     | Venus      |              | Williams  | 01.11.2016 | 01.12.2016 | 500000
     2       | 01.12.2016 | 02.12.2016 | N           | UPD    | 456     | Rafael     |              | Nadal     | 01.05.2009 |            | 720000
     3       | 01.12.2016 | 01.12.2016 | N           | UPD    | 789     | Serena     |              | Williams  | 01.06.2008 |            | 650000
     5       | 02.12.2016 |            | Y           | INS    | 789     | Serena     | Jameka       | Williams  | 01.06.2008 |            | 650000
  8. State of the Art
     [Figure: typical OWB mapping]
  9. State of the Art
     [Diagram: Source and Target → Full Outer Join → Change Detection? → Split into Old Versions and New Versions → UNION ALL → MERGE into Target]

     Source:
     BK | C1 | C2
     11 | A  | B
     22 | D  | E
     44 | K  | L
     77 | M  |

     Target:
     BK | C1 | C2
     11 | A  | BB
     22 | D  | E
     33 | F  | G
     77 | M  | N

     Full outer join result:
     BK_S | C1_S | C2_S | BK_T | C1_T | C2_T
     11   | A    | B    | 11   | A    | BB
     22   | D    | E    | 22   | D    | E
     44   | K    | L    |      |      |
     77   | M    |      | 77   | M    | N
          |      |      | 33   | F    | G

     Change detection options:
     • NVL(C2_S,'(NULL)') != NVL(C2_T,'(NULL)')
     • LNNVL(C2_S = C2_T) AND NVL(C2_S, C2_T) IS NOT NULL
     • DECODE, STANDARD_HASH, SYS_OP_MAP_NONNULL …

     Note on the diagram: data to the left has to be accessed twice!
     More on delta detection: https://danischnider.wordpress.com/2016/10/08/delta-detection-in-oracle-sql/
  10. State of the Art
      • Change detection must be done with respect to NULL values
      • Comparing each and every column in a complex way
      • Or maintaining and comparing hash-diffs: common rules are needed, and re-hashing is sometimes needed after structural changes
      • A full outer join may be expensive when not working with "deltas"
      • Splitting the join result into two data sets causes this join to be executed twice
      • Another solution?
  11. The "new" approach
  12. The "new" approach
      • The "new" approach is not really new
      • It is often used for ad hoc queries
      • Are these two records different?

      BK | C1 | C2 | C3 | C4 | … | … | C467 | C468 | C469
      11 | A  | B  | C  | D  | … | … | AA   | BB   | CC
      11 | A  | B  | C  | D  | … | … | AB   | BB   | CC

      • Using GROUP BY:
        SELECT COUNT(*) FROM t GROUP BY BK, C1, C2, C3, C4, … C467, C468, C469
  13. The "new" approach
      • Or using an analytic function:
        SELECT COUNT(*) OVER (PARTITION BY BK, C1, C2, C3, … C468, C469) FROM t;
      • If the count equals 2, the records are the same
      • If the count equals 1, they are different
      • But what about NULLs?
      • For GROUP BY and PARTITION BY: NULL = NULL, VALUE != NULL
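  14. The "new" approach
      [Diagram: Source and Target rows are tagged with S_T = 'S' or 'T' → UNION ALL → GROUP BY over all columns with COUNT(*) and MIN(S_T) → MERGE into Target. DEMO!]

      Source:
      BK | C1 | C2
      11 | A  | B
      22 | D  | E
      44 | K  | L
      77 | M  |

      Target:
      BK | C1 | C2
      11 | A  | BB
      22 | D  | E
      33 | F  | G
      77 | M  | N

      After UNION ALL:
      S_T | BK | C1 | C2
      S   | 11 | A  | B
      S   | 22 | D  | E
      S   | 44 | K  | L
      S   | 77 | M  |
      T   | 11 | A  | BB
      T   | 22 | D  | E
      T   | 33 | F  | G
      T   | 77 | M  | N

      After GROUP BY:
      BK | C1 | C2 | CNT | MIN(S_T)
      11 | A  | B  | 1   | S
      22 | D  | E  | 2   | S
      44 | K  | L  | 1   | S
      77 | M  |    | 1   | S
      11 | A  | BB | 1   | T
      33 | F  | G  | 1   | T
      77 | M  | N  | 1   | T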
  15. The "new" approach
  16. Use Cases and Performance (slide footer: An unconventional approach for ETL of historized data, 19.03.2017)
  17. Use Cases and Performance: Full Data Load
      [Diagram comparing the legacy and the new data flow]
      • Legacy: Source (full data) is joined to Target (older versions and current versions). The JOIN may be slow. The filter may be slow. Partitioning?
      • New: full data UNION ALL current versions. The GROUP BY may be slow.
  18. Use Cases and Performance: Delta Load
      [Diagram comparing the legacy and the new data flow]
      • Legacy: Source (delta) is joined to Target (older versions and current versions). The filter may be slow. Partitioning?
      • New: delta UNION ALL current versions. The GROUP BY may be slow.
  19. Use Cases and Performance: Delta Load with pre-filter
      [Diagram comparing the legacy and the new data flow]
      • Legacy: Source (delta) is joined to Target (older versions and current versions). Filter: Business_key IN …
      • New: delta UNION ALL current versions (filtered). The GROUP BY is fast.
  20. Use Cases and Performance
      • Data warehouse with Siebel CRM as a source
      • Order table S_ORDER: "only" 120 columns
      • Comparing the legacy approach vs. GROUP BY vs. analytic functions
      • Full staging table as a source vs. delta (with or without pre-filtering)
      • Approx. 6 million rows in the target table
      • Approx. 3 million rows in the full load data set
      • Approx. 3,000 rows in the delta load data set
  21. Use Cases and Performance

      Method                       | Delta Load, min | Full Load, min
      Outer Join (legacy approach) | 0:09            | 0:41
      GROUP BY                     | 1:10            | 1:04
      GROUP BY with pre-filter     | 0:04            | N/A
      Analytic Function            | 2:12            | 4:52
      Analytic with pre-filter     | 0:12            | N/A
  22. Use Cases and Performance: Execution Plan

      ---------------------------------------------------------------------------------
      |  Id | Operation                | Name              | A-Rows | A-Time      |
      ---------------------------------------------------------------------------------
      |   0 | MERGE STATEMENT          |                   |      0 | 00:00:04.33 |
      |   1 | MERGE                    | CO_S_ORDER_TEST   |      0 | 00:00:04.33 |
      |   2 | VIEW                     |                   |   3799 | 00:00:04.29 |
      |   3 | SEQUENCE                 | SEQ_CO_S_ORDER    |   3799 | 00:00:04.29 |
      |   4 | PX COORDINATOR           |                   |   3799 | 00:00:04.28 |
      |   5 | PX SEND QC (RANDOM)      | :TQ10005          |      0 | 00:00:00.01 |
      |*  6 | HASH JOIN OUTER BUFFERED |                   |   3799 | 00:00:11.51 |
      |   7 | PX RECEIVE               |                   |   3799 | 00:00:00.01 |
      ...
      |  15 | PX RECEIVE               |                   |   4654 | 00:00:00.04 |
      |  16 | PX SEND HASH             | :TQ10001          |      0 | 00:00:00.01 |
      |  17 | HASH GROUP BY            |                   |   4654 | 00:00:03.41 |
      |  18 | VIEW                     |                   |   4801 | 00:00:00.77 |
      |  19 | UNION-ALL                |                   |   4801 | 00:00:00.77 |
      |  20 | PX BLOCK ITERATOR        |                   |   3120 | 00:00:00.01 |
      |* 21 | TABLE ACCESS FULL        | STG_S_ORDER_DELTA |   3120 | 00:00:00.01 |
      |* 22 | HASH JOIN RIGHT SEMI     |                   |   1681 | 00:00:04.41 |
      |  23 | PX RECEIVE               |                   |  12480 | 00:00:00.02 |
      |  24 | PX SEND BROADCAST        | :TQ10000          |      0 | 00:00:00.01 |
      |  25 | PX BLOCK ITERATOR        |                   |   3120 | 00:00:00.01 |
      |* 26 | TABLE ACCESS FULL        | STG_S_ORDER_DELTA |   3120 | 00:00:00.01 |
      |  27 | PX BLOCK ITERATOR        |                   |  3710K | 00:00:03.26 |
      |* 28 | TABLE ACCESS FULL        | CO_S_ORDER_TEST   |  3710K | 00:00:02.92 |
      |  29 | PX RECEIVE               |                   |  6107K | 00:00:11.11 |
      |  30 | PX SEND HASH             | :TQ10004          |      0 | 00:00:00.01 |
      |  31 | PX BLOCK ITERATOR        |                   |  6107K | 00:00:05.37 |
      |* 32 | TABLE ACCESS FULL        | CO_S_ORDER_TEST   |  6107K | 00:00:04.69 |
      ---------------------------------------------------------------------------------
  23. Use Cases and Performance: Loading Dimensions from Core
      [Diagram comparing the legacy and the new data flow]
      • Legacy: Source (current versions, Core) is joined to Target (older versions and current versions, Dim). The JOIN may be slow. The filter may be slow. Partitioning?
      • New: current versions (Core) UNION ALL current versions (Dim); older versions are not part of the union. The GROUP BY may be slow.
  24. Use Cases and Performance: Loading Dimensions from Core (the source is a view)
      [Diagram comparing the legacy and the new data flow]
      • Legacy: Source (a view) is joined to Target (older versions and current versions). The JOIN may be slow. The filter may be slow. Partitioning?
      • New: full data UNION ALL current versions. The GROUP BY may be slow.
  25. Use Cases and Performance
      • Loading of a dimension via a view
      • The view joins some "big" tables (50 GB, 40+ million rows)
      • And produces fewer than 500 dimension records per day
      • The loading time could be reduced by 45 percent (3 min 50 sec → 2 min)
  26. Conclusion
      • It is simpler and faster in certain cases
      • The source is queried only once, which can be significant if the source is a view
      • The code can easily be generated
      • It is simple to build even without generation (only a plain list of columns to copy and paste)
      • It is worth running an ad hoc test with your own data
      • Test it!
  27. Andrej Pashchenko, Senior Consultant. Tel. +49 211 58 666 470, andrej.pashchenko@trivadis.com, blog.sqlora.com
  28. Trivadis @ DOAG 2017, #opencompany
      • Booth: 3rd floor, right by the escalator
      • We share our know-how!
      • Just drop by: live presentations and document archive
      • T-shirts, a prize draw and more
      • We look forward to your visit
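
The UNION ALL / GROUP BY flow from slide 14 can be sketched end to end with the sample rows shown there. This is a minimal sketch under stated assumptions, not the author's demo code: it runs on Python's built-in sqlite3 module rather than Oracle, the table names `source_rows` and `target_rows` are hypothetical, and the final classification of the groups is one reading of the diagram, since the transcript does not show how the MERGE consumes them.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical table names; the rows are the ones shown on slide 14.
    CREATE TABLE source_rows (bk INTEGER, c1 TEXT, c2 TEXT);
    CREATE TABLE target_rows (bk INTEGER, c1 TEXT, c2 TEXT);
    INSERT INTO source_rows VALUES
        (11, 'A', 'B'), (22, 'D', 'E'), (44, 'K', 'L'), (77, 'M', NULL);
    INSERT INTO target_rows VALUES
        (11, 'A', 'BB'), (22, 'D', 'E'), (33, 'F', 'G'), (77, 'M', 'N');
""")

# Tag every row with its origin, stack both sets, and group by all columns.
groups = conn.execute("""
    SELECT bk, c1, c2, COUNT(*) AS cnt, MIN(s_t) AS origin
    FROM (
        SELECT 'S' AS s_t, bk, c1, c2 FROM source_rows
        UNION ALL
        SELECT 'T' AS s_t, bk, c1, c2 FROM target_rows
    )
    GROUP BY bk, c1, c2
    ORDER BY origin, bk
""").fetchall()

# Assumed interpretation of the groups:
#   cnt = 2                -> identical row on both sides, nothing to do
#   cnt = 1 and origin 'S' -> new or changed source row, needs a new version
#   cnt = 1 and origin 'T' -> target version without an identical source row
new_versions = [(bk, c1, c2) for bk, c1, c2, cnt, origin in groups
                if cnt == 1 and origin == "S"]
unmatched_target = [(bk, c1, c2) for bk, c1, c2, cnt, origin in groups
                    if cnt == 1 and origin == "T"]

for row in groups:
    print(row)
```

Note that no column is wrapped in NVL or a similar function: the row for key 77 with a NULL in `c2` lands in its own group simply because GROUP BY treats NULL as a regular grouping value.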

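The pre-filter from slide 19 ("Business_key IN …") can be illustrated in the same way. This is again a hedged sketch on Python's built-in sqlite3 module, with hypothetical table names `stg_delta` and `dim_current` and made-up rows. It only shows the idea of restricting the target's current versions to the business keys present in the delta before the UNION ALL, so that far fewer rows reach the GROUP BY. It does not reproduce the timings reported in the deck.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE stg_delta (bk INTEGER, c1 TEXT, c2 TEXT);    -- hypothetical delta
    CREATE TABLE dim_current (bk INTEGER, c1 TEXT, c2 TEXT);  -- hypothetical current versions
    INSERT INTO stg_delta VALUES (11, 'A', 'B'), (44, 'K', 'L');
    INSERT INTO dim_current VALUES
        (11, 'A', 'BB'), (22, 'D', 'E'), (33, 'F', 'G'), (77, 'M', 'N');
""")

# The optional filter is attached to the target branch of the UNION ALL only.
UNION_SQL = """
    SELECT 'S' AS s_t, bk, c1, c2 FROM stg_delta
    UNION ALL
    SELECT 'T' AS s_t, bk, c1, c2 FROM dim_current
    {target_filter}
"""
PREFILTER = "WHERE bk IN (SELECT bk FROM stg_delta)"


def rows_entering_group_by(target_filter: str) -> int:
    """Count the rows that the GROUP BY would have to process."""
    union = UNION_SQL.format(target_filter=target_filter)
    return conn.execute(f"SELECT COUNT(*) FROM ({union})").fetchone()[0]


def changed_source_rows(target_filter: str) -> list:
    """Return source rows with no identical row in the target."""
    union = UNION_SQL.format(target_filter=target_filter)
    sql = f"""
        SELECT bk, c1, c2
        FROM ({union})
        GROUP BY bk, c1, c2
        HAVING COUNT(*) = 1 AND MIN(s_t) = 'S'
        ORDER BY bk
    """
    return conn.execute(sql).fetchall()


unfiltered = rows_entering_group_by("")
prefiltered = rows_entering_group_by(PREFILTER)
print(unfiltered, prefiltered)
print(changed_source_rows(PREFILTER))
```

In this toy data set the detected source changes are the same with and without the pre-filter. Only the number of rows fed into the grouping differs, which is the effect the slide attributes the speed-up to.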