Just-in-Time Data Warehousing on
Databricks: Change Data Capture
and Schema On Read
Jason Pohl, Data Solutions Engineer
Denny Lee, Technology Evangelist
About the speaker: Jason Pohl
Jason Pohl is a solutions engineer with Databricks,
focused on helping customers become successful
with their data initiatives. Jason has spent his
career building data-driven products and solutions.
About the moderator: Denny Lee
Denny Lee is a Technology Evangelist with
Databricks; he is a hands-on data sciences engineer
with more than 15 years of experience developing
internet-scale infrastructure, data platforms, and
distributed systems for both on-premises and cloud.
Prior to joining Databricks, Denny worked as a Senior
Director of Data Sciences Engineering at Concur and
was part of the incubation team that built Hadoop on
Windows and Azure (currently known as HDInsight).
We are Databricks, the company behind Apache Spark
Founded by the creators of Apache Spark in 2013
Share of Spark code contributed by Databricks in 2014: 75%
Data Value
Created Databricks on top of Spark to make big data simple.
…
Apache Spark Engine
Spark Core
Spark Streaming, Spark SQL, MLlib, GraphX
Unified engine across diverse workloads & environments
Scale out, fault tolerant
Python, Java, Scala, and R APIs
Standard libraries
NOTABLE USERS THAT PRESENTED AT SPARK SUMMIT
2015 SAN FRANCISCO
Source: Slide 5 of Spark Community Update
Traditional Data Warehousing Pain Points

Inelasticity of compute and storage resources
• Burst workloads require capacity planning for maximum load
• Fixed-size DW = compute and storage have to scale linearly together
(these are orthogonal problems)
• Expensive conundrum:
• If your DW is successful, you cannot easily expand
• If there is overcapacity, resources sit idle
Traditional Data Warehousing Pain Points

Rigid architecture that’s difficult to change

• Traditional DWs are schema-on-write, requiring schemas, partitions, and indexes to be
pre-built.
• Rigidity = maintaining costly ETL pipelines
• Expend finite resources to continually augment pipelines to absorb new data.
Traditional Data Warehousing Pain Points

Limited advanced analytics capabilities

• Want more than what business intelligence and data warehousing provide
• More than just counts, aggregates and trends
• Desire forecasting using ML, segmentation, graph processing, etc.
Just-in-Time Data Warehousing

Scale resources on demand
• Scale resources based on query load
• Separate compute and storage so that either can scale
independently
• Easily set up multiple clusters against the
same data sources
Just-in-Time Data Warehousing

Direct access to data sources
• Scale resources based on query load
• Separate compute and storage so that either can scale
independently
• Easily set up multiple clusters against the
same data sources
Change Data Capture

What is it?
• System to automatically capture changes in a source system (e.g.
transactional database) and automatically apply those changes
to a target system (e.g. data warehouse).
• Important for data warehouses because it lets the warehouse record (and
ultimately report on) any changes, e.g.:
• Customer A buys a pair of skis for $250 on 1/2/2015
• On 1/5/2015, we realize that the purchase was $350, not $250
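The idea can be illustrated with a minimal Python sketch. It is not taken from the webinar notebooks: it assumes both systems fit in memory as dictionaries keyed by a hypothetical row ID, and it detects changes by comparing snapshots, which is only one of several ways to capture changes. Deletes are not handled.

```python
from typing import Dict, List, Tuple

Row = Tuple[str, str, float]      # (date, product, price)
Change = Tuple[str, int, Row]     # (operation, key, new row)


def capture_changes(source: Dict[int, Row], target: Dict[int, Row]) -> List[Change]:
    """List the inserts and updates needed to bring target in line with source."""
    changes: List[Change] = []
    for key, row in source.items():
        if key not in target:
            changes.append(("insert", key, row))
        elif target[key] != row:
            changes.append(("update", key, row))
    return changes


def apply_changes(target: Dict[int, Row], changes: List[Change]) -> None:
    """Apply captured changes to the target in place."""
    for _operation, key, row in changes:
        target[key] = row


# The ski purchase from the bullets above; the key 1 is a made-up identifier.
source = {1: ("1/2/2015", "Skis", 350.0)}   # price corrected on 1/5/2015
target = {1: ("1/2/2015", "Skis", 250.0)}   # warehouse still holds the old price

changes = capture_changes(source, target)
apply_changes(target, changes)
```

After the run, `changes` holds a single update for the ski purchase and `target` matches `source`.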
Change Data Capture

Source to Target
Source
ID   Date      Product  Price
101  1/1/2016  Skates   $80.00
102  1/2/2016  Skis     $250.00
Target (initially empty)
ID   Date      Product  Price
Target (after load)
ID   Date      Product  Price
101  1/1/2016  Skates   $80.00
102  1/2/2016  Skis     $250.00
Change Data Capture

Add new row
Source
ID   Date      Product  Price
101  1/1/2016  Skates   $80.00
102  1/2/2016  Skis     $250.00
Target
ID   Date      Product  Price
101  1/1/2016  Skates   $80.00
102  1/2/2016  Skis     $250.00
Source (after new row)
ID   Date      Product  Price
101  1/1/2016  Skates   $80.00
102  1/2/2016  Skis     $250.00
103  1/3/2016  Disc     $15.00
Target (after new row is captured)
ID   Date      Product  Price
101  1/1/2016  Skates   $80.00
102  1/2/2016  Skis     $250.00
103  1/3/2016  Disc     $15.00
Change Data Capture

Update an existing row
Source
ID   Date      Product  Price
101  1/1/2016  Skates   $80.00
102  1/2/2016  Skis     $250.00
103  1/3/2016  Disc     $15.00
Target
ID   Date      Product  Price
101  1/1/2016  Skates   $80.00
102  1/2/2016  Skis     $250.00
103  1/3/2016  Disc     $15.00
Source (after update)
ID   Date      Product  Price
101  1/1/2016  Skates   $80.00
102  1/2/2016  Skis     $350.00
103  1/3/2016  Disc     $15.00
Change Data Capture

Update an existing row
Source
ID   Date      Product  Price    LastUpdated
101  1/1/2016  Skates   $80.00   1/1/2016
102  1/2/2016  Skis     $250.00  1/2/2016
103  1/3/2016  Disc     $15.00   1/3/2016
Source (after update)
ID   Date      Product  Price    LastUpdated
101  1/1/2016  Skates   $80.00   1/1/2016
102  1/2/2016  Skis     $350.00  1/5/2016
103  1/3/2016  Disc     $15.00   1/3/2016
Target
ID   Date      Product  Price    LastUpdated
101  1/1/2016  Skates   $80.00   1/1/2016
102  1/2/2016  Skis     $250.00  1/2/2016
103  1/3/2016  Disc     $15.00   1/3/2016
Target (after change is captured)
ID   Date      Product  Price    LastUpdated
101  1/1/2016  Skates   $80.00   1/1/2016
102  1/2/2016  Skis     $250.00  1/5/2016
103  1/3/2016  Disc     $15.00   1/3/2016
102  1/2/2016  Skis     $350.00  1/5/2016
Demo
High Watermark with LastUpdatedDate
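A high-watermark pull can be sketched in a few lines of Python. This is not the demo notebook: it assumes an in-memory SQLite database stands in for the source, a plain list stands in for the target, and the table name `sales`, the column names, and the ISO date format are all hypothetical choices made for the example. Changed rows are appended to the target rather than overwritten.

```python
import sqlite3

# Hypothetical source table modeled on the Source/Target tables in the slides.
source = sqlite3.connect(":memory:")
source.execute(
    "CREATE TABLE sales (id INTEGER PRIMARY KEY, sale_date TEXT, "
    "product TEXT, price REAL, last_updated TEXT)"
)
source.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?, ?)",
    [
        (101, "2016-01-01", "Skates", 80.0, "2016-01-01"),
        (102, "2016-01-02", "Skis", 250.0, "2016-01-02"),
        (103, "2016-01-03", "Disc", 15.0, "2016-01-03"),
    ],
)


def pull_changes(conn, watermark):
    """Return rows modified after the watermark, plus the new watermark."""
    rows = conn.execute(
        "SELECT id, sale_date, product, price, last_updated FROM sales "
        "WHERE last_updated > ? ORDER BY last_updated, id",
        (watermark,),
    ).fetchall()
    # ISO dates sort correctly as strings; keep the old mark if nothing changed.
    new_watermark = max((row[4] for row in rows), default=watermark)
    return rows, new_watermark


target = []  # stand-in for the warehouse table

# First run: an empty watermark sorts before every date, so all rows are staged.
first_batch, watermark = pull_changes(source, "")
target.extend(first_batch)

# The source row is corrected later.
source.execute(
    "UPDATE sales SET price = 350.0, last_updated = '2016-01-05' WHERE id = 102"
)

# Second run: only the row changed since the last watermark is pulled.
second_batch, watermark = pull_changes(source, watermark)
target.extend(second_batch)
```

One design note: with a strict `>` comparison and day-level timestamps, a row changed later on the same day as the stored watermark would be skipped, so a real job would likely need finer-grained timestamps.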
Stage Data from Employee Database
Update Records in Employee Source Database
UPDATE employees
SET last_name = 'Spark'
WHERE emp_no = 16894
Job to Automate CDC
Source, Jobs, Target
[The slide shows the same table in two versions, with and without the Tag column:]
ID   Date      Product  Tag    Price    LastUpdated
101  1/1/2016  Skates   ice    $80.00   1/1/2016
102  1/2/2016  Skis     snow   $250.00  1/2/2016
103  1/3/2016  Disc     field  $15.00   1/3/2016
ID   Date      Product  Price    LastUpdated
101  1/1/2016  Skates   $80.00   1/1/2016
102  1/2/2016  Skis     $250.00  1/2/2016
103  1/3/2016  Disc     $15.00   1/3/2016
Add a column to the Departments table
ALTER TABLE departments
ADD COLUMN dept_desc VARCHAR(50);
UPDATE departments
SET dept_desc = dept_name;
Job to Automate CDC
Source, Jobs, Target
[Column lists shown on the slide:]
dept_no, dept_name
dept_no, dept_name
dept_no, dept_name, dept_desc
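One way a job could absorb a new source column such as dept_desc is to read the source schema at run time instead of hard-coding it. The sketch below is an illustration only, not the webinar's job: it assumes two in-memory SQLite databases stand in for the source and target, and the helper names `column_types`, `sync_schema`, and `copy_rows` and the sample department rows are hypothetical.

```python
import sqlite3


def column_types(conn, table):
    """Map column name to declared type, in table order."""
    # PRAGMA table_info rows are (cid, name, type, notnull, dflt_value, pk).
    return {row[1]: row[2] for row in conn.execute(f"PRAGMA table_info({table})")}


def sync_schema(source, target, table):
    """Add to the target any column the source has gained; return their names."""
    source_cols = column_types(source, table)
    target_cols = column_types(target, table)
    added = [name for name in source_cols if name not in target_cols]
    for name in added:
        target.execute(f"ALTER TABLE {table} ADD COLUMN {name} {source_cols[name]}")
    return added


def copy_rows(source, target, table):
    """Reload the target table using whatever columns the source has now."""
    cols = list(column_types(source, table))
    col_list = ", ".join(cols)
    placeholders = ", ".join("?" for _ in cols)
    rows = source.execute(f"SELECT {col_list} FROM {table}").fetchall()
    target.execute(f"DELETE FROM {table}")
    target.executemany(
        f"INSERT INTO {table} ({col_list}) VALUES ({placeholders})", rows
    )


ddl = "CREATE TABLE departments (dept_no TEXT PRIMARY KEY, dept_name TEXT)"
source = sqlite3.connect(":memory:")
target = sqlite3.connect(":memory:")
source.execute(ddl)
target.execute(ddl)
source.executemany(
    "INSERT INTO departments VALUES (?, ?)",
    [("d001", "Marketing"), ("d002", "Finance")],  # sample rows, made up
)

# The schema change from the slide, applied to the source only.
source.execute("ALTER TABLE departments ADD COLUMN dept_desc VARCHAR(50)")
source.execute("UPDATE departments SET dept_desc = dept_name")

added = sync_schema(source, target, "departments")
copy_rows(source, target, "departments")
result = target.execute("SELECT * FROM departments ORDER BY dept_no").fetchall()
```

Because the column list is discovered on each run, the job itself does not need editing when the source gains a column; dropped or retyped columns would need separate handling.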
Notebooks
To access the notebooks, please reference the attachments in the Just-in-Time Data
Warehousing on Databricks: Change Data Capture and Schema On Read webinar.
• Stage Data From Employee Database:
• Notebook that starts the process
• Defines the ETL process
• Change Schema in Employee Source Database
• Update Records in Employee Source Database
• Validate Departments
Resources
• Just-in-Time Data Warehousing Solution Brief
• Building a Turbo-fast Data Warehousing Platform with
Databricks
• Spark DataFrames: Simple and Fast Analysis of Structured Data
• Transitioning from Traditional DW to Spark in OR Predictive
Modeling
• Advertising Technology Sample Notebook (Part 1)
More resources
• Databricks Guide
• Apache Spark User Guide
• Databricks Community Forum
• Training courses: public classes, MOOCs, & private training
• Databricks Community Edition: Free hosted Apache Spark.
Join the waitlist for the beta release!
Thanks!
