Design Principles for a Modern Data Warehouse
CASE STUDIES AT DE BIJENKORF AND TRAVELBIRD
Old Challenges, New Considerations
Data warehouses still must deliver:
◦ Data integration of multiple systems
◦ Accuracy, completeness, and auditability
◦ Reporting for assorted stakeholders and business needs
◦ Clean data
◦ A “single version of the truth”
But the problem space now contains:
◦ Unstructured/Semi-structured data
◦ Real time data
◦ Shorter time to access / self-service BI
◦ SO MUCH DATA (terabytes/hour to load)
◦ More systems to integrate (everything has an API)
New technologies are changing the landscape
What is best practice today?
A modern, best in class data warehouse:
◦ Is designed for scalability, ideally using cloud architecture
◦ Uses a bus-based, lambda architecture
◦ Has a federated data model for structured and unstructured data
◦ Leverages MPP databases
◦ Uses an agile data model like Data Vault
◦ Is built using code automation
◦ Processes data using ELT, not ETL
All the buzzwords! But what does it look like and why do these things help?
Architectural overview at de Bijenkorf
Tools
AWS
◦ S3
◦ Kinesis
◦ ElastiCache
◦ Elastic Beanstalk
◦ EC2
◦ DynamoDB
Open Source
◦ Snowplow Event Tracker
◦ Rundeck Scheduler
◦ Jenkins Continuous Integration
◦ Pentaho PDI
Other
◦ HP Vertica
◦ Tableau
◦ GitHub
◦ RStudio Server
DWH internal architecture, Travelbird and Bijenkorf
• Traditional three tier DWH
• ODS generated automatically from staging
• Allow regeneration of vault without replaying logs
• Ops mart reflects data in original source form
• Helps offload queries from source systems
• Business marts materialized exclusively from vault
Why use the cloud?
Cost Management
• Services billed by the hour, pay for what you use
• For small deployments (<50 machines), cloud hosting can be significantly cheaper
• Ex. a 3 node Vertica cluster in AWS with 25TB data: $2.2k/mo
Off the Shelf Services
• Minimize administration by using pre-built services like message buses (Kinesis), databases (RDS), Key/Value stores (ElastiCache), simplifying technology stack
• Increase speed of delivery of new functionality by eliminating most deployment tasks
• Full stack in a day? No problem!
Scalability
• Services can automatically be scaled up/down based on time, load, or other triggers
• Adding additional services can be done within minutes
• Services can scale (near) infinitely
Lambda architecture: Right Now and Right Later
Designed to solve both primary data needs:
◦ Damn close, right now
◦ Correct, tomorrow
Data is processed twice per stream
As implemented at Bijenkorf (BYK) and Travelbird (TB):
◦ Real-time flow from Kinesis to DWH
◦ Simultaneous process to S3
◦ Reprocessing as needed from S3 (batch)
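The two paths can be sketched in a few lines. This is only an illustration of the idea, not the Bijenkorf or Travelbird implementation: `archive` stands in for S3, `realtime_counts` stands in for a real-time DWH table, and the event fields and bot filter are hypothetical.

```python
from collections import defaultdict

# Hypothetical in-memory stand-ins: `archive` plays the role of S3,
# `realtime_counts` the role of a real-time table in the DWH.
archive = []
realtime_counts = defaultdict(int)

def ingest(event):
    """Process one event on both paths at the same time."""
    archive.append(event)                  # batch path: keep the raw event
    realtime_counts[event["page"]] += 1    # speed path: quick, approximate

def reprocess(is_valid):
    """Batch path: rebuild the counts from the raw archive with new rules."""
    counts = defaultdict(int)
    for event in archive:
        if is_valid(event):
            counts[event["page"]] += 1
    return dict(counts)

events = [
    {"page": "home", "user": "a"},
    {"page": "home", "user": "bot"},
    {"page": "cart", "user": "a"},
]
for e in events:
    ingest(e)

# Later, a bot filter is introduced and the archive is replayed.
corrected = reprocess(lambda e: e["user"] != "bot")
```

The real-time counts are "damn close, right now"; the replay from the archive is what makes them "correct, tomorrow".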
Hadoop in the DWH
What is Hadoop?
◦ A distributed, fault tolerant file system
◦ A set of tools for file/data stream processing
Where does it fit into the DWH stack?
◦ Data Lake: Save all raw data for cheap; don’t force schemas on unstructured data
◦ ETL: Distributed batch processing, aggregation, and loading
Hadoop at Bijenkorf
◦ We had it but threw it out; the use cases didn’t fit
◦ Very little data is unstructured and the DWH supports JSON
◦ Data volumes are limited and growing slowly
◦ How did we solve the use cases?
  ◦ Data lake: S3 file storage + semi-structured data in Vertica
  ◦ Data processing: Stream processing (stable event volumes + clean events)
Hadoop at Travelbird
◦ Dirty, fast growing event data, so…
◦ Hadoop in the typical role
◦ Raw data in AWS Elastic MapReduce (EMR) via S3
◦ Data cleaned and processed in Hadoop, then loaded into Redshift
The Role of Column Store Databases in the DWH
• C-Stores persist each column independently and allow column compression
• Queries retrieve data only from needed columns
Example: 7 billion rows, 25 columns, 10 bytes/column = 1.6 TB table
Query: Select A, sum( D ) from table where C >= X group by A;
Row Store: 1.6 TB of data scanned
Column Store (50% compression): <100 GB data scanned
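The scan sizes in the example can be checked with a little arithmetic. The sketch below assumes the query reads only columns A, C, and D, that every value takes 10 bytes, and that the TB/GB figures on the slide are binary units (TiB/GiB); none of these assumptions are stated in the original.

```python
ROWS = 7_000_000_000
COLUMNS = 25
BYTES_PER_VALUE = 10
COMPRESSION = 0.5     # assumed fraction of bytes left after compression
COLUMNS_READ = 3      # assumed: the query touches A, C, and D only

# A row store must read every column of every row.
row_store_bytes = ROWS * COLUMNS * BYTES_PER_VALUE

# A column store reads only the referenced columns, in compressed form.
column_store_bytes = ROWS * COLUMNS_READ * BYTES_PER_VALUE * COMPRESSION

TIB = 2 ** 40
GIB = 2 ** 30
print(f"row store scan:    {row_store_bytes / TIB:.2f} TiB")
print(f"column store scan: {column_store_bytes / GIB:.1f} GiB")
```

Under these assumptions the row store scans roughly 1.6 TiB while the column store scans a little under 100 GiB, about a 16x reduction before any other optimization.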
Performance Comparison
[Chart: Query Performance Results (seconds), C-store vs. Postgres, for four queries: Count Distinct, Count, Top 20 (One Month), Top 20. Values as extracted: 187, 230, 600, 600; 0.63, 2.1, 23, 62.]
Loads fast too! Facebook loads
35TB/hour into Vertica
But are there tradeoffs to a C-Store?
Weaknesses
◦ No PK/FK integrity enforced on write
◦ Slow on DELETE, UPDATE
◦ REALLY slow on single record INSERT and SELECT
◦ Optimized for limited concurrency but big queries; only a few users can use at a time
Solutions
◦ Design to use calculated keys (ex. hashes)
◦ Build ETLs around COPY, TRUNCATE
◦ Individual transactions should use OLTP or Key/Value systems
◦ Optimize data structures for common queries and leverage big, slow disks to create denormalized tables
Data Vault 1.618 at Bijenkorf
[Diagram: 3rd Normal Form vs. Data Vault]
So many tables! WHY?!?!?!
What we gained
◦ Speed of integration of new entities
◦ Fast primary keys without lookups by using hash keys
◦ Data matches business processes, not systems
◦ Easy parallelization of table loading (24 concurrent tables? OK!)
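Hash keys remove the lookup step because any loader can compute a key directly from the business key, with no round trip to a sequence or key table. A minimal sketch follows, assuming MD5 over a trimmed, upper-cased business key; the hash function, the normalization rules, and the `hash_key` helper are illustrative assumptions, not details taken from the slides.

```python
import hashlib

def hash_key(*business_key_parts, delimiter="|"):
    """Derive a deterministic key from one or more business key parts.

    The normalization (trim + upper-case) and the use of MD5 are
    assumptions made for this sketch.
    """
    normalized = delimiter.join(
        str(part).strip().upper() for part in business_key_parts
    )
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()

# Hub keys are computed independently, so hub loads can run in parallel.
customer_hk = hash_key("cust-001")
order_hk = hash_key("ord-9")

# A link key is derived from the same business keys, again without a lookup.
link_hk = hash_key("cust-001", "ord-9")
```

Because the same input always yields the same key, hubs, links, and satellites can be loaded concurrently from staging without waiting on each other for generated IDs.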
ELT, not ETL
Advantages of ELT
◦ Performance: Bijenkorf benchmark showed ELT was >50x faster than ETL
  ◦ Plus horizontal scalability is Web scale, big data, <insert buzzword here>
◦ Data Availability: You want an exact replica of your source data in the DWH anyway
◦ Simpler Architecture: Fewer systems, fewer interdependencies (decouple STG and DV), can build multiple transformations from STG simultaneously
Myths of ELT
◦ Source and Target DB must match: Intelligently coded ELT jobs leverage platform agnostic code (or a library for each source DB type) for loading to STG
  ◦ Bijenkorf runs MySQL and Oracle ELT into Vertica
  ◦ Travelbird runs MySQL and Postgres ELT into Redshift
◦ Limited tool availability: DV 2.0 lends itself to code generators / managers, which are best built internally anyway
  ◦ Talend is free (like speech and hugs) and offers ELT for many systems
◦ ELT takes longer to deploy: Because data is perfectly replicated from source, getting records in is faster; transformations can be iterated quicker since they are independent of source->stg loading
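The ELT pattern can be illustrated with an in-memory SQLite database standing in for the warehouse. This is a sketch only: the table names, columns, and sample rows are hypothetical, and a real deployment would use the bulk loader of the target MPP database rather than row inserts.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the warehouse

# Extract + Load: copy source rows into staging unchanged.
conn.execute(
    "CREATE TABLE stg_orders (order_id TEXT, customer TEXT, amount TEXT)"
)
source_rows = [
    ("1", " alice ", "10.50"),
    ("2", "BOB", "4.00"),
    ("3", " alice ", "2.25"),
]
conn.executemany("INSERT INTO stg_orders VALUES (?, ?, ?)", source_rows)

# Transform: cleaning and aggregation run inside the database as
# set-based SQL, independent of how staging was loaded.
conn.execute("""
    CREATE TABLE mart_customer_revenue AS
    SELECT lower(trim(customer)) AS customer,
           sum(CAST(amount AS REAL)) AS revenue
    FROM stg_orders
    GROUP BY lower(trim(customer))
""")

result = conn.execute(
    "SELECT customer, revenue FROM mart_customer_revenue ORDER BY customer"
).fetchall()
staged = conn.execute("SELECT count(*) FROM stg_orders").fetchone()[0]
```

Staging still holds an exact copy of the source rows, so the transformation can be rewritten and rerun without touching the source system again.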
Targeted benefits of DWH automation at Bijenkorf
Objective: Achievements at Bijenkorf
Speed of development
• Integration of new sources or data from existing sources takes 1-2 steps
• Adding a new vault dependency takes one step
Simplicity
• Five jobs handle all ETL processes across DWH
Traceability
• Every record/source file is traced in the database and every row automatically identified by source file in ODS
Code simplification
• Replaced most common key definitions with dynamic variable replacement
File management
• Every source file automatically archived to Amazon S3 in appropriate locations sorted by source, table, and date
• Entire source systems, periods, etc. can be replayed in minutes
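One way an archive sorted by source, table, and date makes replays cheap is that a whole source system or table can be selected by key prefix. The sketch below is an assumption about what such a layout could look like; the key format, the `archive_key` and `keys_to_replay` helpers, and the sample file names are hypothetical, not the Bijenkorf naming scheme.

```python
from datetime import date

def archive_key(source, table, load_date, filename):
    """Build a storage key ordered by source, table, then date."""
    return f"{source}/{table}/{load_date:%Y/%m/%d}/{filename}"

def keys_to_replay(all_keys, source, table=None):
    """Select archived files for one source system, optionally one table."""
    prefix = f"{source}/" if table is None else f"{source}/{table}/"
    return sorted(key for key in all_keys if key.startswith(prefix))

keys = [
    archive_key("webshop", "orders", date(2015, 3, 7), "orders_001.csv"),
    archive_key("webshop", "customers", date(2015, 3, 7), "customers_001.csv"),
    archive_key("pos", "sales", date(2015, 3, 6), "sales_001.csv"),
]

webshop_files = keys_to_replay(keys, "webshop")
pos_sales_files = keys_to_replay(keys, "pos", "sales")
```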
Data Vault loading automation at BYK
1. All Staging Tables Checked for Changes
• New sources automatically added
• Last change epoch based on load stamps, advanced each time all dependencies execute successfully
2. List of Dependent Vault Loads Identified
• Dependencies declared at time of job creation
• Load prioritization possible but not utilized
3. Loads Planned in Hub, Link, Sat Order
• Jobs parallelized across tables but serialized per job
• Dynamic job queueing ensures appropriate execution order
4. Loads Executed
• Variables automatically identified and replaced
• Each load records performance statistics and error messages
◦ Loader is fully metadata driven with focus on horizontal scalability and management simplicity
◦ To support speed of development and performance, variable-driven SQL templates used throughout
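Two of the ideas above, planning loads in hub, link, sat order and filling variable-driven SQL templates, can be sketched as follows. The template text, the job metadata shape, and all table and column names are hypothetical; the slides do not show the actual loader code.

```python
from string import Template

# Hubs must exist before links, and links before satellites.
LOAD_ORDER = {"hub": 0, "link": 1, "sat": 2}

# A hypothetical hub-load template; $-variables are replaced per job.
HUB_TEMPLATE = Template(
    "INSERT INTO $target ($hash_key, $business_key, load_dts, record_source) "
    "SELECT DISTINCT $hash_key, $business_key, load_dts, record_source "
    "FROM $staging s "
    "WHERE NOT EXISTS "
    "(SELECT 1 FROM $target t WHERE t.$hash_key = s.$hash_key)"
)

def plan_loads(jobs):
    """Order jobs so hubs load first, then links, then satellites."""
    return sorted(jobs, key=lambda job: (LOAD_ORDER[job["type"]], job["target"]))

jobs = [
    {"type": "sat", "target": "sat_customer"},
    {"type": "link", "target": "link_customer_order"},
    {"type": "hub", "target": "hub_order"},
    {"type": "hub", "target": "hub_customer"},
]
planned = [job["target"] for job in plan_loads(jobs)]

sql = HUB_TEMPLATE.substitute(
    target="hub_customer",
    staging="stg_customer",
    hash_key="customer_hk",
    business_key="customer_id",
)
```

Jobs of the same type have no ordering constraint between them in this sketch, which is what allows them to be run in parallel across tables.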
Bringing it back: Best practice, in practice
◦ Code Automation
◦ Cloud Based
◦ Bus Architecture
◦ MPP
◦ Data Vault
◦ Unstructured Data Stores
◦ ELT controlled by scheduler
Rob Winters
WintersRD@gmail.com

Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piece
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 

Design Principles for a Modern Data Warehouse

  • 1. Design Principles for a Modern Data Warehouse CASE STUDIES AT DE BIJENKORF AND TRAVELBIRD
  • 2. Old Challenges, New Considerations
    Data warehouses still must deliver:
    ◦ Data integration of multiple systems
    ◦ Accuracy, completeness, and auditability
    ◦ Reporting for assorted stakeholders and business needs
    ◦ Clean data
    ◦ A “single version of the truth”
    But the problem space now contains:
    ◦ Unstructured/Semi-structured data
    ◦ Real time data
    ◦ Shorter time to access / self-service BI
    ◦ SO MUCH DATA (terabytes/hour to load)
    ◦ More systems to integrate (everything has an API)
  • 3. New technologies are changing the landscape
  • 4. What is best practice today?
    A modern, best in class data warehouse:
    ◦ Is designed for scalability, ideally using cloud architecture
    ◦ Uses a bus-based, lambda architecture
    ◦ Has a federated data model for structured and unstructured data
    ◦ Leverages MPP databases
    ◦ Uses an agile data model like Data Vault
    ◦ Is built using code automation
    ◦ Processes data using ELT, not ETL
    All the buzzwords! But what does it look like and why do these things help?
  • 5. Architectural overview at de Bijenkorf
    Tools
    AWS
    ◦ S3
    ◦ Kinesis
    ◦ Elasticache
    ◦ Elastic Beanstalk
    ◦ EC2
    ◦ DynamoDB
    Open Source
    ◦ Snowplow Event Tracker
    ◦ Rundeck Scheduler
    ◦ Jenkins Continuous Integration
    ◦ Pentaho PDI
    Other
    ◦ HP Vertica
    ◦ Tableau
    ◦ Github
    ◦ RStudio Server
  • 6. DWH internal architecture, Travelbird and Bijenkorf
    • Traditional three tier DWH
    • ODS generated automatically from staging
      • Allow regeneration of vault without replaying logs
    • Ops mart reflects data in original source form
      • Helps offload queries from source systems
    • Business marts materialized exclusively from vault
  • 7. Why use the cloud?
    Cost Management
    • Services billed by the hour, pay for what you use
    • For small deployments (<50 machines), cloud hosting can be significantly cheaper
    • Ex. a 3 node Vertica cluster in AWS with 25TB data: $2.2k/mo
    Off the Shelf Services
    • Minimize administration by using pre-built services like message buses (Kinesis), databases (RDS), Key/Value stores (Elasticache), simplifying technology stack
    • Increase speed of delivery of new functionality by eliminating most deployment tasks
    • Full stack in a day? No problem!
    Scalability
    • Services can automatically be scaled up/down based on time, load, or other triggers
    • Adding additional services can be done within minutes
    • Services can scale (near) infinitely
  • 8. Lambda architecture: Right Now and Right Later
    Designed to solve both primary data needs:
    ◦ Damn close, right now
    ◦ Correct, tomorrow
    Data is processed twice per stream
    As implemented at BYK and TB:
    ◦ Real time flow from Kinesis to DWH
    ◦ Simultaneous process to S3
    ◦ Reprocessing as needed from S3 (batch)
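The "processed twice per stream" idea on slide 8 can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the Bijenkorf or Travelbird implementation: the class `LambdaPipeline`, the event fields `id`, `sku`, and `qty`, and the in-memory stand-ins for Kinesis, S3, and the DWH are all hypothetical.

```python
from collections import defaultdict


class LambdaPipeline:
    """Toy lambda architecture: every event takes a speed path and an archive path."""

    def __init__(self):
        self.archive = []                      # hypothetical stand-in for raw files on S3
        self.realtime_view = defaultdict(int)  # stand-in for the real-time DWH tables
        self.batch_view = {}                   # stand-in for the corrected batch tables

    def ingest(self, event):
        # Archive path: keep the raw event untouched so it can be replayed later.
        self.archive.append(event)
        # Speed path: update the "damn close, right now" view with no cleaning.
        self.realtime_view[event["sku"]] += event["qty"]

    def reprocess(self):
        # Batch path: rebuild from the archive, dropping duplicate event ids
        # that the speed path let through ("correct, tomorrow").
        seen_ids = set()
        totals = defaultdict(int)
        for event in self.archive:
            if event["id"] in seen_ids:
                continue
            seen_ids.add(event["id"])
            totals[event["sku"]] += event["qty"]
        self.batch_view = dict(totals)
        return self.batch_view


pipeline = LambdaPipeline()
pipeline.ingest({"id": 1, "sku": "A", "qty": 2})
pipeline.ingest({"id": 1, "sku": "A", "qty": 2})  # duplicate delivery
pipeline.ingest({"id": 2, "sku": "A", "qty": 1})
realtime_total = pipeline.realtime_view["A"]   # includes the duplicate
batch_total = pipeline.reprocess()["A"]        # duplicate removed
```

The real-time view is available immediately but overcounts the duplicated event; the batch rebuild from the archive corrects it without touching the source systems.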
  • 9. Hadoop in the DWH
    What is Hadoop?
    ◦ A distributed, fault tolerant file system
    ◦ A set of tools for file/data stream processing
    Where does it fit into the DWH stack?
    ◦ Data Lake: Save all raw data for cheap; don’t force schemas on unstructured data
    ◦ ETL: Distributed batch processing, aggregation, and loading
    Hadoop at Bijenkorf
    ◦ We had it but threw it out; the use cases didn’t fit
    ◦ Very little data is unstructured and the DWH supports JSON
    ◦ Data volumes are limited and growing slowly
    ◦ How did we solve the use cases?
      ◦ Data lake: S3 file storage + semi-structured data in Vertica
      ◦ Data processing: Stream processing (stable event volumes + clean events)
    Hadoop at Travelbird
    ◦ Dirty, fast growing event data, so…
      ◦ Hadoop in the typical role
      ◦ Raw data in AWS Elastic Map Reduce via S3
      ◦ Data cleaned and processed in Hadoop, then loaded into Redshift
  • 10. The Role of Column Store Databases in the DWH
    • C-Stores persist each column independently and allow column compression
    • Queries retrieve data only from needed columns
    Example: 7 billion rows, 25 columns, 10 bytes/column = 1.6 TB table
    Query: Select A, sum( D ) from table where C >= X;
    Row Store: 1.6 TB of data scanned
    Column Store (50% compression): <100 GB data scanned
    Performance Comparison
    [Chart: Query Performance Results (seconds), C-store vs. Postgres, for the queries Count, Distinct Count, Top 20 (One Month), and Top 20; plotted values as extracted: 187, 230, 600, 600 and 0.63, 2.1, 23, 62]
    Loads fast too! Facebook loads 35TB/hour into Vertica
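The scan-size example on slide 10 can be checked with a short calculation. The sketch below assumes the query touches exactly three columns (A, C, and D, as named in the slide's query) and that the slide's figures are in binary units (TiB and GiB); both are assumptions, as is the helper name `scanned_bytes`.

```python
ROWS = 7_000_000_000
COLUMNS = 25
BYTES_PER_VALUE = 10
COLUMNS_READ = 3       # assumption: the query reads only columns A, C and D
COMPRESSION = 0.5      # 50% compression, as stated on the slide


def scanned_bytes(rows, columns, bytes_per_value, compression=1.0):
    """Bytes read from disk when `columns` columns are scanned for every row."""
    return rows * columns * bytes_per_value * compression


# A row store must read every column of every row.
row_store = scanned_bytes(ROWS, COLUMNS, BYTES_PER_VALUE)
# A column store reads only the referenced columns, in compressed form.
column_store = scanned_bytes(ROWS, COLUMNS_READ, BYTES_PER_VALUE, COMPRESSION)

row_store_tib = row_store / 2**40
column_store_gib = column_store / 2**30
reduction = row_store / column_store

print(f"row store scans    {row_store_tib:.2f} TiB")
print(f"column store scans {column_store_gib:.1f} GiB")
print(f"reduction factor   {reduction:.1f}x")
```

Under these assumptions the row store scans about 1.59 TiB and the column store just under 98 GiB, which matches the slide's "1.6 TB" and "<100 GB" and amounts to roughly a 17x reduction in data read.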
  • 11. But are there tradeoffs to a C-Store?
    Weaknesses
    ◦ No PK/FK integrity enforced on write
    ◦ Slow on DELETE, UPDATE
    ◦ REALLY slow on single record INSERT and SELECT
    ◦ Optimized for limited concurrency but big queries; only a few users can use at a time
    Solutions
    ◦ Design to use calculated keys (ex. hashes)
    ◦ Build ETLs around COPY, TRUNCATE
    ◦ Individual transactions should use OLTP or Key/Value systems
    ◦ Optimize data structures for common queries and leverage big, slow disks to create denormalized tables
  • 12. Data Vault 1.618 at Bijenkorf
    [Diagram: 3rd Normal Form vs. Data Vault]
    So many tables! WHY?!?!?!
    What we gained
    ◦ Speed of integration of new entities
    ◦ Fast primary keys without lookups by using hash keys
    ◦ Data matches business processes, not systems
    ◦ Easy parallelization of table loading (24 concurrent tables? OK!)
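The "fast primary keys without lookups" point is easier to see in code. Because a hash key is computed from the business key alone, every table load can derive it independently, with no sequence or lookup against the hub. The sketch below is one common way to do this, not necessarily the one used at Bijenkorf: the choice of MD5, the normalization rules (trim, upper-case), the `|` delimiter, and the function name `hash_key` are all assumptions.

```python
import hashlib


def hash_key(*business_key_parts, delimiter="|"):
    """Derive a deterministic key from one or more business key parts.

    Assumed normalization: cast to string, trim whitespace, upper-case.
    """
    normalized = delimiter.join(str(part).strip().upper() for part in business_key_parts)
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()


# Hub key: computed from a single business key (hypothetical customer number).
customer_hk = hash_key("c-1001")

# The same key is produced by any loader, on any node, without a lookup.
same_customer_hk = hash_key("  C-1001 ")

# Link key: computed from the combined business keys of the related hubs.
order_customer_lk = hash_key("c-1001", "ord-42")
```

Since hub, link, and satellite loads can each compute the key themselves, they do not have to wait on one another for surrogate key assignment, which is what makes parallel table loading straightforward.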
  • 13. ELT, not ETL
    Advantages of ELT
    ◦ Performance: Bijenkorf benchmark showed ELT was >50x faster than ETL
      ◦ Plus horizontal scalability is Web scale, big data, <insert buzzword here>
    ◦ Data Availability: You want an exact replica of your source data in the DWH anyways
    ◦ Simpler Architecture: Fewer systems, fewer interdependencies (decouple STG and DV), can build multiple transformations from STG simultaneously
    Myths of ELT
    ◦ Source and Target DB must match: Intelligently coded ELT jobs leverage platform agnostic code (or a library for each source DB type) for loading to STG
      ◦ Bijenkorf runs MySQL and Oracle ELT into Vertica
      ◦ Travelbird runs MySQL and Postgres ELT into Redshift
    ◦ Limited tool availability: DV 2.0 lends itself to code generators / managers, which are best built internally anyways
      ◦ Talend is free (like speech and hugs) and offers ELT for many systems
    ◦ ELT takes longer to deploy: Because data is perfectly replicated from source, getting records in is faster; transformations can be iterated quicker since they are independent of source->stg loading
  • 14. Targeted benefits of DWH automation at Bijenkorf
    Objective: Achievements at Bijenkorf
    Speed of development
    • Integration of new sources or data from existing sources takes 1-2 steps
    • Adding a new vault dependency takes one step
    Simplicity
    • Five jobs handle all ETL processes across DWH
    Traceability
    • Every record/source file is traced in the database and every row automatically identified by source file in ODS
    Code simplification
    • Replaced most common key definitions with dynamic variable replacement
    File management
    • Every source file automatically archived to Amazon S3 in appropriate locations sorted by source, table, and date
    • Entire source systems, periods, etc can be replayed in minutes
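The file management objective (archive sorted by source, table, and date, with fast replay) mostly comes down to a predictable object key layout. The following is a minimal sketch of one such layout; the path scheme, the function names `archive_key` and `replay_prefix`, and the example source and table names are hypothetical and are not taken from the slides.

```python
from datetime import date


def archive_key(source, table, load_date, filename):
    """Build an object key sorted by source, then table, then date."""
    return f"{source}/{table}/{load_date:%Y/%m/%d}/{filename}"


def replay_prefix(source, table=None, year=None, month=None):
    """Build the key prefix that selects everything to replay.

    Stops at the first level that is not specified, so omitting `table`
    selects an entire source system and omitting `month` selects a year.
    """
    parts = [source]
    if table is not None:
        parts.append(table)
        if year is not None:
            parts.append(f"{year:04d}")
            if month is not None:
                parts.append(f"{month:02d}")
    return "/".join(parts) + "/"


key = archive_key("webshop", "orders", date(2015, 3, 1), "orders_001.csv.gz")
whole_source = replay_prefix("webshop")
one_month = replay_prefix("webshop", "orders", 2015, 3)
```

With a layout like this, replaying a source system or a period is a matter of listing every object under one prefix and feeding the files back through the normal staging load.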
  • 15. Data Vault loading automation at BYK
    All Staging Tables Checked for Changes
    • New sources automatically added
    • Last change epoch based on load stamps, advanced each time all dependencies execute successfully
    List of Dependent Vault Loads Identified
    • Dependencies declared at time of job creation
    • Load prioritization possible but not utilized
    Loads Planned in Hub, Link, Sat Order
    • Jobs parallelized across tables but serialized per job
    • Dynamic job queueing ensures appropriate execution order
    Loads Executed
    • Variables automatically identified and replaced
    • Each load records performance statistics and error messages
    o Loader is fully metadata driven with focus on horizontal scalability and management simplicity
    o To support speed of development and performance, variable-driven SQL templates used throughout
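The loading steps on slide 15 (find changed staging tables, identify dependent vault loads, plan them in hub, link, sat order, then fill in SQL templates) can be sketched as follows. This is an illustrative outline only: the metadata shape, the table and column names, and the SQL in `HUB_TEMPLATE` are assumptions, and the actual BYK loader is not shown in the slides.

```python
from string import Template

# Hubs must load before links, and links before satellites.
LOAD_ORDER = {"hub": 0, "link": 1, "sat": 2}

# Hypothetical variable-driven SQL template for a hub load.
HUB_TEMPLATE = Template(
    "INSERT INTO $target (hash_key, business_key, load_ts) "
    "SELECT DISTINCT s.$hash_col, s.$key_col, s.load_ts "
    "FROM $staging s LEFT JOIN $target t ON t.hash_key = s.$hash_col "
    "WHERE t.hash_key IS NULL AND s.load_ts > $last_epoch"
)


def plan_loads(changed_staging_tables, dependencies):
    """Select loads whose staging table changed, ordered hub -> link -> sat."""
    due = [d for d in dependencies if d["staging"] in changed_staging_tables]
    return sorted(due, key=lambda d: (LOAD_ORDER[d["type"]], d["target"]))


def render_hub_sql(load, last_epoch):
    """Replace the template variables with values from the load's metadata."""
    return HUB_TEMPLATE.substitute(
        target=load["target"],
        staging=load["staging"],
        hash_col=load["hash_col"],
        key_col=load["key_col"],
        last_epoch=last_epoch,
    )


# Dependencies declared once, at job creation (hypothetical metadata).
dependencies = [
    {"target": "sat_customer", "type": "sat", "staging": "stg_customer"},
    {"target": "link_order_customer", "type": "link", "staging": "stg_order"},
    {"target": "hub_customer", "type": "hub", "staging": "stg_customer",
     "hash_col": "customer_hk", "key_col": "customer_no"},
    {"target": "hub_product", "type": "hub", "staging": "stg_product"},
]

plan = plan_loads({"stg_customer", "stg_order"}, dependencies)
planned_targets = [load["target"] for load in plan]
hub_sql = render_hub_sql(plan[0], last_epoch=1425168000)
```

Loads of the same type have no ordering constraint between them, so a scheduler could run each tier of the plan in parallel across tables while keeping the hub, link, sat sequence.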
  • 16. Bringing it back: Best practice, in practice
    ◦ Code Automation
    ◦ Cloud Based
    ◦ Bus Architecture
    ◦ MPP
    ◦ Data Vault
    ◦ Unstructured Data Stores
    ◦ ELT controlled by scheduler

Editor's Notes

  1. http://lambda-architecture.net/
     http://www.semantikoz.com/blog/lambda-architecture-velocity-volume-big-data-hadoop-storm/
  2. https://redshiftuser.wordpress.com/2013/02/17/aws-redshift-query-comparison-times-against-hadoop-and-postgres/