This document discusses design principles for a modern data warehouse, based on case studies from de Bijenkorf and Travelbird. It advocates a scalable cloud-based architecture: a bus-based lambda architecture to process both real-time and batch data, a federated data model to handle structured and unstructured data, massively parallel processing databases, an agile data model like Data Vault, code automation, and ELT rather than ETL. Specific technologies used by de Bijenkorf include AWS services, Snowplow, Rundeck, Jenkins, Pentaho, Vertica, Tableau, and automated Data Vault loading. Travelbird additionally uses Hadoop for initial data processing before loading into Redshift.
1. Design Principles for a Modern Data Warehouse
CASE STUDIES AT DE BIJENKORF AND TRAVELBIRD
2. Old Challenges, New Considerations
Data warehouses still must deliver:
◦ Data integration of multiple systems
◦ Accuracy, completeness, and auditability
◦ Reporting for assorted stakeholders and business needs
◦ Clean data
◦ A “single version of the truth”
But the problem space now contains:
◦ Unstructured/Semi-structured data
◦ Real time data
◦ Shorter time to access / self-service BI
◦ SO MUCH DATA (terabytes/hour to load)
◦ More systems to integrate (everything has an API)
4. What is best practice today?
A modern, best-in-class data warehouse:
◦ Is designed for scalability, ideally using cloud architecture
◦ Uses a bus-based, lambda architecture
◦ Has a federated data model for structured and unstructured data
◦ Leverages MPP databases
◦ Uses an agile data model like Data Vault
◦ Is built using code automation
◦ Processes data using ELT, not ETL
All the buzzwords! But what does it look like and why do these things help?
5. Architectural overview at de Bijenkorf
Tools
AWS
◦ S3
◦ Kinesis
◦ Elasticache
◦ Elastic Beanstalk
◦ EC2
◦ DynamoDB
Open Source
◦ Snowplow Event Tracker
◦ Rundeck Scheduler
◦ Jenkins Continuous Integration
◦ Pentaho PDI
Other
◦ HP Vertica
◦ Tableau
◦ Github
◦ RStudio Server
6. DWH internal architecture, Travelbird and Bijenkorf
• Traditional three-tier DWH
• ODS generated automatically from staging
• Allows regeneration of the vault without replaying logs
• Ops mart reflects data in original source form
• Helps offload queries from source systems
• Business marts materialized exclusively from the vault
7. Why use the cloud?
Cost Management
•Services billed by the hour, pay for what you use
•For small deployments (<50 machines), cloud hosting can be significantly cheaper
•Ex. a 3 node Vertica cluster in AWS with 25TB data: $2.2k/mo
Off the Shelf Services
•Minimize administration by using pre-built services like message buses (Kinesis), databases (RDS), and key/value stores (Elasticache), simplifying the technology stack
•Increase speed of delivery of new functionality by eliminating most deployment tasks
•Full stack in a day? No problem!
Scalability
•Services can automatically be scaled up/down based on time, load, or other triggers
•Adding additional services can be done within minutes
•Services can scale (near) infinitely
8. Lambda architecture: Right Now and Right Later
Designed to solve both primary data needs:
◦ Damn close, right now
◦ Correct, tomorrow
Data is processed twice per stream.
As implemented at BYK and TB:
◦ Real-time flow from Kinesis to DWH
◦ Simultaneous write to S3
◦ Reprocessing as needed from S3 (batch)
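The dual-path flow above can be sketched in a few lines. This is an illustrative in-memory model only (the names `raw_store`, `speed_counts`, and the event shape are assumptions): one path counts events immediately, the other archives them so exact results can be recomputed later.

```python
from collections import Counter

raw_store = []            # plays the role of S3: every event archived, replayable
speed_counts = Counter()  # plays the role of the real-time flow into the DWH

def ingest(event):
    """Each event is processed twice per stream: archived for batch, counted now."""
    raw_store.append(event)           # batch path ("correct, tomorrow")
    speed_counts[event["type"]] += 1  # speed path ("damn close, right now")

def batch_recompute():
    """Batch layer: rebuild exact counts from the raw archive, replacing speed results."""
    return Counter(e["type"] for e in raw_store)

for e in [{"type": "view"}, {"type": "order"}, {"type": "view"}]:
    ingest(e)

# With a clean stream, both layers agree; the batch layer also lets us reprocess
# after fixing a bug in the speed path, which the speed layer alone cannot.
assert speed_counts == batch_recompute()
```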
9. Hadoop in the DWH
What is Hadoop?
◦ A distributed, fault tolerant file system
◦ A set of tools for file/data stream processing
Where does it fit into the DWH stack?
◦ Data Lake: Save all raw data for cheap; don’t force schemas on unstructured data
◦ ETL: Distributed batch processing, aggregation, and loading
Hadoop at Bijenkorf
◦ We had it but threw it out; the use cases didn’t fit
◦ Very little data is unstructured and the DWH supports JSON
◦ Data volumes are limited and growing slowly
◦ How did we solve the use cases?
◦ Data lake: S3 file storage + semi-structured data in Vertica
◦ Data processing: Stream processing (stable event volumes + clean events)
Hadoop at Travelbird
◦ Dirty, fast growing event data, so…
◦ Hadoop in the typical role
◦ Raw data in AWS Elastic Map Reduce via S3
◦ Data cleaned and processed in Hadoop, then loaded into Redshift
10. The Role of Column Store Databases in the DWH
• C-stores persist each column independently and allow column compression
• Queries retrieve data only from needed columns
Example: 7 billion rows, 25 columns, 10 bytes/column = 1.6 TB table
Query: SELECT A, sum(D) FROM table WHERE C >= X;
• Row store: 1.6 TB of data scanned
• Column store (50% compression): <100 GB of data scanned

Query performance comparison (seconds):
Query               C-store   Postgres
Count Distinct      0.63      187
Count               2.1       230
Top 20, One Month   23        600
Top 20              62        600

Loads fast too! Facebook loads 35 TB/hour into Vertica.
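The scan-size claim above is simple arithmetic, sketched below. The query touches only columns A, C, and D; the 50% compression figure is the slide's own assumption, applied uniformly for illustration.

```python
TIB = 1024**4
rows, cols, bytes_per_val = 7_000_000_000, 25, 10

table_bytes = rows * cols * bytes_per_val   # 1.75e12 bytes ≈ 1.6 TiB

# A row store must scan the entire table to evaluate the query:
row_store_scan = table_bytes

# A column store reads only the touched columns (A, C, D), compressed 50%:
touched_cols = 3
col_store_scan = rows * touched_cols * bytes_per_val * 0.5  # 1.05e11 bytes ≈ 105 GB

assert round(row_store_scan / TIB, 1) == 1.6
assert col_store_scan < 0.1 * row_store_scan  # >10x less I/O for this query
```

The real saving is usually larger still, since a predicate column like C often compresses far better than 50%.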
11. But are there tradeoffs to a C-Store?
Weaknesses
◦ No PK/FK integrity enforced on write
◦ Slow on DELETE, UPDATE
◦ REALLY slow on single-record INSERT and SELECT
◦ Optimized for big queries at limited concurrency; only a few users can use it at a time
Solutions
◦ Design to use calculated keys (ex. hashes)
◦ Build ETLs around COPY, TRUNCATE
◦ Individual transactions should use OLTP or Key/Value systems
◦ Optimize data structures for common queries and leverage big, slow disks to create denormalized tables
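The COPY/TRUNCATE pattern above can be sketched as follows. This uses SQLite stand-ins purely for illustration (DELETE plays TRUNCATE, `executemany` plays a bulk COPY; table and column names are invented), since the point is the shape of the load, one bulk operation per batch rather than many single-record INSERTs.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE stg_orders (order_id TEXT, amount REAL)")

def bulk_reload(rows):
    """Replace staging contents in one bulk pass per batch; single-record
    INSERTs are the slow path on a column store, so avoid them entirely."""
    conn.execute("DELETE FROM stg_orders")  # TRUNCATE stand-in
    conn.executemany("INSERT INTO stg_orders VALUES (?, ?)", rows)  # COPY stand-in
    conn.commit()

bulk_reload([("o1", 10.0), ("o2", 24.5)])
count = conn.execute("SELECT count(*) FROM stg_orders").fetchone()[0]
assert count == 2
```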
12. Data Vault 1.618 at Bijenkorf
3rd Normal Form vs. Data Vault
So many tables! WHY?!?!?!
What we gained
◦ Speed of integration of new entities
◦ Fast primary keys without lookups by using hash keys
◦ Data matches business processes, not systems
◦ Easy parallelization of table loading (24 concurrent tables? OK!)
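"Fast primary keys without lookups" works because a hash of the business key can be computed independently by every loader, with no sequence or key-lookup table in the way. A minimal sketch, assuming an MD5-based scheme as commonly used in Data Vault (the function name and normalization rules here are illustrative, not Bijenkorf's exact implementation):

```python
import hashlib

def hub_hash_key(*business_keys):
    """Derive a deterministic hub key from the business key(s): no lookup,
    so 24 concurrent table loads all agree on the same key."""
    normalized = "||".join(str(k).strip().upper() for k in business_keys)
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()

# The same business key yields the same hub key regardless of source casing
# or whitespace, and regardless of which parallel loader computes it first:
assert hub_hash_key("CUST-001") == hub_hash_key("cust-001 ")
```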
13. ELT, not ETL
Advantages of ELT
◦ Performance: Bijenkorf benchmark showed ELT was >50x faster than ETL
◦ Plus horizontal scalability is web scale, big data, <insert buzzword here>
◦ Data Availability: You want an exact replica of your source data in the DWH anyway
◦ Simpler Architecture: Fewer systems, fewer interdependencies (decouple STG and DV); can build multiple transformations from STG simultaneously
Myths of ELT
◦ Source and target DB must match: Intelligently coded ELT jobs leverage platform-agnostic code (or a library for each source DB type) for loading to STG
◦ Bijenkorf runs MySQL and Oracle ELT into Vertica
◦ Travelbird runs MySQL and Postgres ELT into Redshift
◦ Limited tool availability: DV 2.0 lends itself to code generators/managers, which are best built internally anyway
◦ Talend is free (like speech and hugs) and offers ELT for many systems
◦ ELT takes longer to deploy: Because data is perfectly replicated from source, getting records in is faster; transformations can be iterated quickly since they are independent of source->STG loading
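The ordering that makes ELT work can be shown in miniature. A hedged sketch using SQLite as the "DWH" (all table and column names invented): rows land in STG exactly as the source emitted them, and the transform is an in-database SQL step that can be rewritten and rerun without touching the load.

```python
import sqlite3

dwh = sqlite3.connect(":memory:")
dwh.execute("CREATE TABLE stg_orders (order_id TEXT, amount TEXT, status TEXT)")

# 1. Extract + Load: an exact replica of the source lands in staging,
#    untransformed (even amount stays text, as the source sent it).
source_rows = [("o1", "10.00", "shipped"), ("o2", "24.50", "cancelled")]
dwh.executemany("INSERT INTO stg_orders VALUES (?, ?, ?)", source_rows)

# 2. Transform: pure SQL inside the target, decoupled from the load,
#    so it can be iterated without re-extracting from the source.
dwh.execute("""
    CREATE TABLE mart_revenue AS
    SELECT sum(CAST(amount AS REAL)) AS revenue
    FROM stg_orders WHERE status = 'shipped'
""")
revenue = dwh.execute("SELECT revenue FROM mart_revenue").fetchone()[0]
assert revenue == 10.0
```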
14. Targeted benefits of DWH automation at Bijenkorf
Objective: Achievements at Bijenkorf
Speed of development
• Integration of new sources or data from existing sources takes 1-2 steps
• Adding a new vault dependency takes one step
Simplicity
• Five jobs handle all ETL processes across the DWH
Traceability
• Every record/source file is traced in the database, and every row is automatically identified by source file in the ODS
Code simplification
• Replaced most common key definitions with dynamic variable replacement
File management
• Every source file automatically archived to Amazon S3 in appropriate locations, sorted by source, table, and date
• Entire source systems, periods, etc. can be replayed in minutes
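An archive "sorted by source, table, and date" implies a deterministic key layout, which is what makes replaying a whole system or period a prefix listing rather than a search. A sketch of one such layout (the prefix and naming are hypothetical, not Bijenkorf's actual scheme):

```python
from datetime import date

def archive_key(source, table, load_date, filename):
    """Hypothetical S3 key layout: any source system, table, or period
    can be replayed by listing the matching key prefix."""
    return f"archive/{source}/{table}/{load_date:%Y/%m/%d}/{filename}"

key = archive_key("webshop", "orders", date(2015, 3, 14), "orders_001.csv.gz")
assert key == "archive/webshop/orders/2015/03/14/orders_001.csv.gz"
```

Replaying March 2015 for one table is then just a listing under `archive/webshop/orders/2015/03/`.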
15. Data Vault loading automation at BYK
All staging tables checked for changes
• New sources automatically added
• Last-change epoch based on load stamps, advanced each time all dependencies execute successfully
List of dependent vault loads identified
• Dependencies declared at time of job creation
• Load prioritization possible but not utilized
Loads planned in Hub, Link, Sat order
• Jobs parallelized across tables but serialized per job
• Dynamic job queueing ensures appropriate execution order
Loads executed
• Variables automatically identified and replaced
• Each load records performance statistics and error messages

o Loader is fully metadata driven with focus on horizontal scalability and management simplicity
o To support speed of development and performance, variable-driven SQL templates used throughout
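A variable-driven SQL template amounts to one generic statement per load pattern, with the table-specific values filled in from loader metadata. A minimal sketch (the template, column names, and metadata values are all illustrative, not the actual BYK loader):

```python
from string import Template

# One template serves every satellite load; metadata supplies the variables.
SAT_LOAD = Template("""
INSERT INTO $sat_table
SELECT $hash_key, load_dts, $payload_cols
FROM $stg_table
WHERE load_dts > $last_epoch
""")

sql = SAT_LOAD.substitute(
    sat_table="sat_customer",
    hash_key="customer_hkey",
    payload_cols="name, email",
    stg_table="stg_customer",
    last_epoch="1420070400",   # last-change epoch from the job metadata
)
assert "INSERT INTO sat_customer" in sql
assert "WHERE load_dts > 1420070400" in sql
```

Adding a new satellite then means adding one metadata row, not writing new SQL, which is how "adding a new vault dependency takes one step."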
16. Bringing it back: Best practice, in practice
◦ Code automation
◦ Cloud-based bus architecture
◦ MPP
◦ Data Vault
◦ Unstructured data stores
◦ ELT controlled by scheduler