In this webcast, Jason Pohl, Solutions Engineer at Databricks, covers how to build a Just-in-Time Data Warehouse on Databricks, with a focus on performing Change Data Capture from a relational database and joining that data to a variety of data sources. Not only do Apache Spark and Databricks let you do this more easily with less code, the routine also automatically ingests changes to the source schema.
Just-in-Time Data Warehousing on Databricks: Change Data Capture and Schema On Read
1. Just-in-Time Data Warehousing on
Databricks: Change Data Capture
and Schema On Read
Jason Pohl, Data Solutions Engineer
Denny Lee, Technology Evangelist
2. About the speaker: Jason Pohl
Jason Pohl is a solutions engineer with Databricks,
focused on helping customers become successful
with their data initiatives. Jason has spent his
career building data-driven products and solutions.
3. About the moderator: Denny Lee
Denny Lee is a Technology Evangelist with
Databricks; he is a hands-on data sciences engineer
with more than 15 years of experience developing
internet-scale infrastructure, data platforms, and
distributed systems for both on-premises and cloud.
Prior to joining Databricks, Denny worked as a Senior
Director of Data Sciences Engineering at Concur and
was part of the incubation team that built Hadoop on
Windows and Azure (currently known as HDInsight).
4. We are Databricks, the company behind Apache Spark
• Founded by the creators of Apache Spark in 2013
• 75% share of Spark code contributed by Databricks in 2014
• Created Databricks on top of Spark to make big data simple
5. Apache Spark Engine
• Spark Core with standard libraries: Spark SQL, Spark Streaming, MLlib, GraphX
• Unified engine across diverse workloads & environments
• Scale out, fault tolerant
• Python, Java, Scala, and R APIs
6.
7. NOTABLE USERS THAT PRESENTED AT SPARK SUMMIT 2015 SAN FRANCISCO
Source: Slide 5 of Spark Community Update
8.
9.
10. Traditional Data Warehousing Pain Points
Inelasticity of compute and storage resources
• Burst workloads require capacity planning for maximum load
• A fixed-size DW forces compute and storage to scale linearly together
(these are orthogonal problems)
• An expensive conundrum:
• If your DW is successful, you cannot easily expand
• If there is overcapacity, resources sit idle
11. Traditional Data Warehousing Pain Points
Rigid architecture that’s difficult to change
• Traditional DWs are schema-on-write, requiring schemas, partitions, and indexes to be pre-built
• Rigidity means maintaining costly ETL pipelines
• Finite resources are spent continually augmenting pipelines to absorb new data
12. Traditional Data Warehousing Pain Points
Limited advanced analytics capabilities
• Teams want more than what business intelligence and data warehousing provide
• More than just counts, aggregates, and trends
• Forecasting with machine learning, segmentation, graph processing, etc.
13. Just-in-Time Data Warehousing
Scale resources on demand
• Scale resources based on query load
• Separate compute and storage so either can scale independently
• Easily set up multiple clusters against the same data sources
14. Just-in-Time Data Warehousing
Direct access to data sources
16. Change Data Capture
What is it?
• A system that captures changes in a source system (e.g., a transactional database) and automatically applies those changes to a target system (e.g., a data warehouse)
• Important for data warehouses because it lets them record (and ultimately report) any changes, e.g.:
• Customer A buys a pair of skis for $250 on 1/2/2015
• On 1/5/2015, we realize that the purchase was $350, not $250
17. Change Data Capture
Source to Target
Source
ID Date Product Price
101 1/1/2016 Skates $80.00
102 1/2/2016 Skis $250.00
Target
ID Date Product Price
101 1/1/2016 Skates $80.00
102 1/2/2016 Skis $250.00
18. Change Data Capture
Add new row
Source (new row 103 inserted)
ID Date Product Price
101 1/1/2016 Skates $80.00
102 1/2/2016 Skis $250.00
103 1/3/2016 Disc $15.00
Target (after CDC picks up the insert)
ID Date Product Price
101 1/1/2016 Skates $80.00
102 1/2/2016 Skis $250.00
103 1/3/2016 Disc $15.00
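The "add new row" case above can be sketched in plain Python (a hypothetical illustration, not the webinar's notebook code): collect the IDs already present in the target, then append any source rows whose ID is missing.

```python
# Hypothetical CDC sketch for the "add new row" case:
# append source rows whose primary key is absent from the target.
source = [
    {"id": 101, "date": "1/1/2016", "product": "Skates", "price": 80.00},
    {"id": 102, "date": "1/2/2016", "product": "Skis",   "price": 250.00},
    {"id": 103, "date": "1/3/2016", "product": "Disc",   "price": 15.00},
]
target = [
    {"id": 101, "date": "1/1/2016", "product": "Skates", "price": 80.00},
    {"id": 102, "date": "1/2/2016", "product": "Skis",   "price": 250.00},
]

# Keys already landed in the warehouse.
existing_ids = {row["id"] for row in target}

# Rows inserted at the source since the last load.
new_rows = [row for row in source if row["id"] not in existing_ids]
target.extend(new_rows)  # target now also contains row 103
```

In a real pipeline the same set-difference is expressed as an anti-join between the source extract and the target table.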
19. Change Data Capture
Update an existing row
Source (row 102 price corrected to $350.00)
ID Date Product Price
101 1/1/2016 Skates $80.00
102 1/2/2016 Skis $350.00
103 1/3/2016 Disc $15.00
Target (after CDC applies the update)
ID Date Product Price
101 1/1/2016 Skates $80.00
102 1/2/2016 Skis $350.00
103 1/3/2016 Disc $15.00
20. Change Data Capture
Update an existing row
Source (row 102 updated: price $350.00, LastUpdated 1/5/2016)
ID Date Product Price LastUpdated
101 1/1/2016 Skates $80.00 1/1/2016
102 1/2/2016 Skis $350.00 1/5/2016
103 1/3/2016 Disc $15.00 1/3/2016
Target (after CDC, keyed off LastUpdated)
ID Date Product Price LastUpdated
101 1/1/2016 Skates $80.00 1/1/2016
102 1/2/2016 Skis $350.00 1/5/2016
103 1/3/2016 Disc $15.00 1/3/2016
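The LastUpdated column makes updates incremental: treat the maximum LastUpdated already in the target as a high-water mark, pull only rows modified after it, and upsert them by ID. A minimal sketch (hypothetical, not the webinar's notebook code):

```python
# Hypothetical sketch: incremental CDC using LastUpdated as a watermark.
from datetime import date

# Highest LastUpdated value already loaded into the target.
watermark = date(2016, 1, 3)

source = {
    101: {"product": "Skates", "price": 80.00,  "last_updated": date(2016, 1, 1)},
    102: {"product": "Skis",   "price": 350.00, "last_updated": date(2016, 1, 5)},
    103: {"product": "Disc",   "price": 15.00,  "last_updated": date(2016, 1, 3)},
}
target = {
    101: {"product": "Skates", "price": 80.00,  "last_updated": date(2016, 1, 1)},
    102: {"product": "Skis",   "price": 250.00, "last_updated": date(2016, 1, 2)},
    103: {"product": "Disc",   "price": 15.00,  "last_updated": date(2016, 1, 3)},
}

# Only rows touched after the watermark need to move across.
changed = {k: v for k, v in source.items() if v["last_updated"] > watermark}
target.update(changed)  # upsert: row 102 now carries the $350.00 price
```

The watermark is what keeps each CDC run cheap: instead of rescanning the whole source table, the extract query filters on `LastUpdated > watermark`.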
23. Update Records in Employee Source Database
UPDATE employees
SET last_name = 'Spark'
WHERE emp_no = 16894
24. Job to Automate CDC
Source (new Tag column added)
ID Date Product Tag Price LastUpdated
101 1/1/2016 Skates ice $80.00 1/1/2016
102 1/2/2016 Skis snow $250.00 1/2/2016
103 1/3/2016 Disc field $15.00 1/3/2016
Jobs
Target (before: no Tag column)
ID Date Product Price LastUpdated
101 1/1/2016 Skates $80.00 1/1/2016
102 1/2/2016 Skis $250.00 1/2/2016
103 1/3/2016 Disc $15.00 1/3/2016
Target (after: Tag column ingested automatically)
ID Date Product Tag Price LastUpdated
101 1/1/2016 Skates ice $80.00 1/1/2016
102 1/2/2016 Skis snow $250.00 1/2/2016
103 1/3/2016 Disc field $15.00 1/3/2016
25. Add a column to the Departments table
ALTER TABLE departments
ADD COLUMN dept_desc VARCHAR(50)
UPDATE departments
SET dept_desc = dept_name
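The slide's schema change can be replayed end to end against an in-memory SQLite database (a stand-in for illustration only; the webinar works against a MySQL employees database):

```python
# Hypothetical sketch: the slide's ALTER + UPDATE run against SQLite
# (stand-in for the webinar's MySQL source database).
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE departments (dept_no TEXT, dept_name TEXT)")
conn.execute("INSERT INTO departments VALUES ('d001', 'Marketing')")

# Same DDL/DML as the slide: add a column, then back-fill it.
conn.execute("ALTER TABLE departments ADD COLUMN dept_desc VARCHAR(50)")
conn.execute("UPDATE departments SET dept_desc = dept_name")

row = conn.execute(
    "SELECT dept_no, dept_name, dept_desc FROM departments").fetchone()
# row == ('d001', 'Marketing', 'Marketing')
```

After this change, the next CDC run sees a wider source schema, which is exactly the situation the previous slide's job handles automatically.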
26. Job to Automate CDC
Source (dept_desc column added): dept_no, dept_name, dept_desc
Jobs
Target (before): dept_no, dept_name
Target (after): dept_no, dept_name, dept_desc
27. Notebooks
To access the notebooks, please reference the attachments in the Just-in-Time Data
Warehousing on Databricks: Change Data Capture and Schema On Read webinar.
• Stage Data From Employee Database:
• Notebook that starts the process
• Defines the ETL process
• Change Schema in Employee Source Database
• Update Records in Employee Source Database
• Validate Departments
28. Resources
• Just-in-Time Data Warehousing Solution Brief
• Building a Turbo-fast Data Warehousing Platform with
Databricks
• Spark DataFrames: Simple and Fast Analysis of Structured Data
• Transitioning from Traditional DW to Spark in OR Predictive
Modeling
• Advertising Technology Sample Notebook (Part 1)
29. More resources
• Databricks Guide
• Apache Spark User Guide
• Databricks Community Forum
• Training courses: public classes, MOOCs, & private training
• Databricks Community Edition: Free hosted Apache Spark.
Join the waitlist for the beta release!