How to Design a Modern Data Warehouse in BigQuery Using Partitioning, Clustering, and Nested/Repeated Fields

How to Design a Modern Data Warehouse in BigQuery
...or why I needed to forget everything I learned in data modeling school
Dan Sullivan
March 11, 2021

Datastore Options
➤ Relational
➢ Highly structured and transactional
➢ Difficult to scale
➤ NoSQL
➢ Semi-structured, eventual consistency, scalable
➤ Analytical
➢ Structured, scalable, not transactional

Data Warehouse (early 2000s)
➤ Few servers
➤ Tightly coupled storage and compute
➤ Scale vertically
➤ Built on same relational database management systems used for OLTP

BigQuery
➤ Serverless data warehouse
➤ Petabyte scale
➤ Uses SQL but is not a relational database
➤ Analytical database
➤ Other features
➢ BigQuery ML
➢ BigQuery BI Engine
➢ BigQuery GIS

So What’s Different about BigQuery?

Source: https://cloud.google.com/blog/products/data-analytics/cloud-data-warehouse-bigquery-4-9s-sla

Dremel
➤ Multi-tenant cluster
➤ SQL queries to execution trees
➢ Leaves are called slots; they read data and perform computation
➢ Inner nodes perform aggregation
➤ Dynamically allocate slots to queries
➤ Maintains fairness
➤ A single user could get thousands of slots

Source: https://cloud.google.com/blog/products/data-analytics/new-blog-series-bigquery-explained-overview
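
The execution tree described on this slide can be pictured with a small Python sketch. It is only a toy model, not Dremel itself: the data in CHUNKS, the function names, and the fan_in parameter are hypothetical, and threads stand in for slots. Leaves scan and filter their own chunk in parallel, and inner nodes merge the partial aggregates on the way up to the root.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical data: each inner list stands for one chunk of a column
# that a single leaf worker ("slot") would scan.
CHUNKS = [[3, 8, 1], [7, 7], [10, 2, 5, 4], [6]]


def leaf_scan(chunk, threshold):
    """Leaf: read one chunk, apply the filter, return a partial (count, sum)."""
    kept = [v for v in chunk if v >= threshold]
    return len(kept), sum(kept)


def inner_combine(partials):
    """Inner node: merge the partial aggregates produced by its children."""
    count = sum(p[0] for p in partials)
    total = sum(p[1] for p in partials)
    return count, total


def run_query(chunks, threshold, fan_in=2):
    """Roughly: SELECT COUNT(*), SUM(v) FROM t WHERE v >= threshold."""
    with ThreadPoolExecutor() as pool:
        level = list(pool.map(lambda c: leaf_scan(c, threshold), chunks))
    # Combine partial results level by level until one root result remains.
    while len(level) > 1:
        level = [inner_combine(level[i:i + fan_in])
                 for i in range(0, len(level), fan_in)]
    return level[0]


result = run_query(CHUNKS, threshold=5)
print(result)  # (6, 43)
```

The point of the sketch is that no leaf needs to see another leaf's data, which is why designing tables for parallel scans matters so much.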

Colossus
➤ Distributed storage system
➤ Handles replication and recovery
➤ No need to manage storage
Source: https://en.wikipedia.org/wiki/Google_File_System#/media/File:GoogleFileSystemGFS.svg

Jupiter & Borg
➤ Jupiter
➢ Google networking switch
➢ Petabit scale
➢ Storage to compute communication
➢ No need for rack awareness
➤ Borg
➢ Predecessor of Kubernetes
➢ Manages mixers and slots
Source: https://medium.com/@jerub/the-production-environment-at-google-8a1aaece3767
Source: https://conferences.sigcomm.org/sigcomm/2015/pdf/papers/p183.pdf

Capacitor
➤ Columnar storage format
➤ Supports semi-structured data
➢ Nested structures
➢ Repeated fields
➤ No need to read parent column to produce a nested structure attribute value
➤ Compression
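
To make the columnar idea concrete, here is a rough Python sketch of splitting nested and repeated records into one column per leaf path. It is not the Capacitor format: the RECORDS data and the shred function are hypothetical, and each value is simply tagged with its row number instead of using the repetition and definition levels a real columnar format relies on.

```python
# Hypothetical records with a nested struct (customer) and a repeated field (items).
RECORDS = [
    {"order_id": 1, "customer": {"name": "Ana", "city": "Lisbon"},
     "items": [{"sku": "A1", "qty": 2}, {"sku": "B7", "qty": 1}]},
    {"order_id": 2, "customer": {"name": "Raj", "city": "Pune"},
     "items": [{"sku": "A1", "qty": 5}]},
]


def shred(records):
    """Split records into one column per leaf path, e.g. 'customer.city'."""
    columns = {}

    def walk(value, path, row):
        if isinstance(value, dict):        # nested structure: descend by key
            for key, child in value.items():
                walk(child, path + (key,), row)
        elif isinstance(value, list):      # repeated field: same path, many values
            for element in value:
                walk(element, path, row)
        else:                              # leaf: store (row number, value)
            columns.setdefault(".".join(path), []).append((row, value))

    for row, record in enumerate(records):
        walk(record, (), row)
    return columns


COLUMNS = shred(RECORDS)

# Reading one nested attribute touches only its own column.
cities = [value for _, value in COLUMNS["customer.city"]]
total_qty = sum(value for _, value in COLUMNS["items.qty"])
print(cities, total_qty)  # ['Lisbon', 'Pune'] 8
```

A query that needs only customer.city never reads customer.name or the items columns, which is the property the slide highlights.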

What Does this Mean
for Data Modeling?

If you remember anything from this talk ...
➤ Design for scanning in parallel
➤ Partition to minimize amount of data scanned
➤ Cluster to further reduce the amount of data scanned
➤ Joins may require shuffling data across slots so ...
➤ Denormalize using nested and repeated fields
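
As an illustration of the last point, the following Python sketch folds a child table into a repeated field on its parent so that a per-order aggregate needs no join. The ORDERS and LINE_ITEMS data and both function names are hypothetical. In BigQuery the nested column would typically be declared as ARRAY<STRUCT<...>> and read with UNNEST.

```python
from collections import defaultdict

# Hypothetical normalized tables: one row per order, one row per line item.
ORDERS = [
    {"order_id": 1, "customer": "Ana"},
    {"order_id": 2, "customer": "Raj"},
]
LINE_ITEMS = [
    {"order_id": 1, "sku": "A1", "qty": 2},
    {"order_id": 1, "sku": "B7", "qty": 1},
    {"order_id": 2, "sku": "A1", "qty": 5},
]


def denormalize(orders, line_items):
    """Embed each order's line items as a repeated 'items' field."""
    by_order = defaultdict(list)
    for item in line_items:
        by_order[item["order_id"]].append({"sku": item["sku"], "qty": item["qty"]})
    return [{**order, "items": by_order[order["order_id"]]} for order in orders]


def total_qty_per_order(nested_orders):
    """Aggregate inside each row; no join or shuffle between tables is needed."""
    return {o["order_id"]: sum(i["qty"] for i in o["items"]) for o in nested_orders}


NESTED = denormalize(ORDERS, LINE_ITEMS)
print(total_qty_per_order(NESTED))  # {1: 3, 2: 5}
```

The join is paid once, at load time, instead of on every query.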

Partitioned Tables
➤ Table is divided into segments called partitions
➤ Improves query performance
➤ Lowers cost by reducing amount of data scanned

Partition by Ingestion Time
➤ Loads data into daily, date-based partitions
➤ Automatically creates new partitions
➤ Uses ingestion time to determine partition
➤ Creates pseudo-column _PARTITIONTIME
➢ Date-based timestamp
➢ Used in queries to limit the number of partitions scanned
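
A small Python model may help show why filtering on _PARTITIONTIME limits what is scanned. This is a sketch, not BigQuery's implementation: the class and function names are hypothetical, and it assumes partitions are keyed by the ingestion time truncated to midnight UTC.

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone


def partition_time(ingested_at):
    """Truncate an aware datetime to midnight UTC, mimicking _PARTITIONTIME."""
    utc = ingested_at.astimezone(timezone.utc)
    return utc.replace(hour=0, minute=0, second=0, microsecond=0)


class IngestionPartitionedTable:
    """Toy table that files rows under the day on which they were loaded."""

    def __init__(self):
        self.partitions = defaultdict(list)

    def load(self, rows, ingested_at):
        # A new daily partition appears automatically on first use.
        self.partitions[partition_time(ingested_at)].extend(rows)

    def scan(self, start, end):
        """Read only partitions with start <= _PARTITIONTIME < end."""
        hit = [p for p in sorted(self.partitions) if start <= p < end]
        rows = [row for p in hit for row in self.partitions[p]]
        return rows, len(hit)


table = IngestionPartitionedTable()
day1 = datetime(2021, 3, 9, 14, 30, tzinfo=timezone.utc)
table.load(["a", "b"], day1)
table.load(["c"], day1 + timedelta(days=1))
table.load(["d", "e"], day1 + timedelta(days=2))

start = datetime(2021, 3, 10, tzinfo=timezone.utc)
rows, partitions_read = table.scan(start, start + timedelta(days=1))
print(rows, partitions_read)  # ['c'] 1
```

Only one of the three daily partitions is read, which is the effect a WHERE clause on _PARTITIONTIME is meant to have.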

Date/Timestamp Partitioning
➤ Partition based on date or timestamp column
➤ Each partition holds one day of data
➤ No need for _PARTITIONTIME
➤ Special partitions
➢ __NULL__ when nulls in partition column
➢ __UNPARTITIONED__ when values in column outside allowed range
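
The routing rule for daily partitions can be sketched in Python as follows. The function name is hypothetical, and the MIN_DATE and MAX_DATE bounds are an assumption based on the range BigQuery has documented for date partitions, so check the current documentation before relying on them. The special partition names used are __NULL__ and __UNPARTITIONED__.

```python
from datetime import date

# Assumed allowed range for daily partitions; treat these bounds as illustrative.
MIN_DATE = date(1960, 1, 1)
MAX_DATE = date(2159, 12, 31)


def partition_id(value):
    """Return the partition a row lands in, given its partition column value."""
    if value is None:
        return "__NULL__"            # nulls in the partition column
    if not (MIN_DATE <= value <= MAX_DATE):
        return "__UNPARTITIONED__"   # values outside the allowed range
    return value.strftime("%Y%m%d")  # one partition per day


print(partition_id(date(2021, 3, 11)))  # 20210311
print(partition_id(None))               # __NULL__
print(partition_id(date(1950, 1, 1)))   # __UNPARTITIONED__
```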

Integer Range Partition
➤ Partition column must be an integer type
➤ Partition column cannot be repeated
➤ Cannot use legacy SQL to query integer-range partitioned tables
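
In BigQuery DDL an integer-range partition is declared with something like PARTITION BY RANGE_BUCKET(customer_id, GENERATE_ARRAY(0, 100, 10)). The Python sketch below mimics how a value would map to a bucket under that declaration. The function name and the customer_id column are hypothetical, and labeling each bucket by its lower bound is an assumption made for readability.

```python
def range_partition(value, start, end, interval):
    """Map an integer to its range bucket; the range is [start, end)."""
    if value is None:
        return "__NULL__"
    if value < start or value >= end:
        return "__UNPARTITIONED__"   # outside the declared range
    lower = start + ((value - start) // interval) * interval
    return str(lower)                # bucket labeled by its lower bound


# Buckets for start=0, end=100, interval=10: [0, 10), [10, 20), ..., [90, 100)
print(range_partition(37, 0, 100, 10))    # 30
print(range_partition(100, 0, 100, 10))   # __UNPARTITIONED__
print(range_partition(None, 0, 100, 10))  # __NULL__
```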

Sharding vs. Partitioning
➤ Sharding
➢ Use separate table for each day
➢ [TABLE_NAME_PREFIX]_YYYYMMDD
➢ Use UNION in queries to scan multiple tables
➤ Partitioning is preferred over sharding
➢ Less metadata to maintain
➢ Less permission checking overhead
➢ Better performance
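
The difference shows up clearly in the SQL each approach needs. The Python sketch below builds both queries as strings so they can be compared. The table name mydataset.events and the column event_date are hypothetical, and the sharded names assume the YYYYMMDD suffix that BigQuery documents for date-sharded tables.

```python
from datetime import date, timedelta


def days(start, end):
    """All dates from start to end, inclusive."""
    return [start + timedelta(days=i) for i in range((end - start).days + 1)]


def sharded_query(prefix, start, end):
    """One SELECT per daily table, glued together with UNION ALL."""
    selects = [f"SELECT * FROM `{prefix}_{d:%Y%m%d}`" for d in days(start, end)]
    return "\nUNION ALL\n".join(selects)


def partitioned_query(table, column, start, end):
    """One table; the filter on the partition column limits what is scanned."""
    return (f"SELECT * FROM `{table}`\n"
            f"WHERE {column} BETWEEN DATE '{start}' AND DATE '{end}'")


first, last = date(2021, 3, 1), date(2021, 3, 3)
SHARDED = sharded_query("mydataset.events", first, last)
PARTITIONED = partitioned_query("mydataset.events", "event_date", first, last)
print(SHARDED)
print(PARTITIONED)
```

The sharded query grows by one SELECT per day, and every extra table brings its own metadata and permission check. The partitioned query stays the same length for any date range.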

Requiring Partition Filter
➤ require_partition_filter option
➤ Specified at table level (formerly at partition level)
➤ Queries must include a WHERE clause that filters on the partition column
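
As a sketch, the option is set in the table's OPTIONS clause when the table is created. The Python below only builds that DDL as a string and adds a very naive check of whether a query mentions the partition column after WHERE. The table name, the columns, and both helper functions are hypothetical, and the check is far cruder than the validation BigQuery performs itself.

```python
import re


def create_table_ddl(table, partition_column, require_filter=True):
    """Build CREATE TABLE DDL for a date-partitioned table (hypothetical schema)."""
    return (
        f"CREATE TABLE {table} (\n"
        f"  {partition_column} DATE,\n"
        f"  user_id STRING\n"
        f")\n"
        f"PARTITION BY {partition_column}\n"
        f"OPTIONS (require_partition_filter = {str(require_filter).upper()})"
    )


def has_partition_filter(sql, partition_column):
    """Toy lint: does the text after WHERE mention the partition column?"""
    match = re.search(r"\bWHERE\b(.*)", sql, flags=re.IGNORECASE | re.DOTALL)
    if not match:
        return False
    pattern = rf"\b{re.escape(partition_column)}\b"
    return re.search(pattern, match.group(1)) is not None


DDL = create_table_ddl("mydataset.events", "event_date")
print(DDL)
print(has_partition_filter("SELECT * FROM t WHERE event_date = '2021-03-11'", "event_date"))
print(has_partition_filter("SELECT * FROM t", "event_date"))
```

With the option enabled, a query like the last one is rejected rather than silently scanning every partition.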

Clustered Tables
➤ Data sorted based on values in one or more columns
➤ Can improve performance of aggregate queries
➤ Can reduce scanning when cluster columns used in WHERE clause
➤ Can be used together with partitioned tables
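
One way to see why sorting reduces scanning is a toy model of storage blocks that each record the minimum and maximum value of the cluster column. The Python sketch below is an illustration under that assumption, not a description of BigQuery's internals. The ROWS data, the block size, and the function names are hypothetical.

```python
# Hypothetical rows: (country, amount). "country" plays the cluster column.
ROWS = [("DE", 5), ("US", 7), ("BR", 2), ("US", 1),
        ("DE", 9), ("BR", 4), ("US", 3), ("FR", 8)]


def make_blocks(rows, block_size):
    """Split rows into blocks and record min/max of the cluster column."""
    blocks = []
    for i in range(0, len(rows), block_size):
        chunk = rows[i:i + block_size]
        keys = [row[0] for row in chunk]
        blocks.append({"min": min(keys), "max": max(keys), "rows": chunk})
    return blocks


def scan(blocks, country):
    """Skip blocks whose min/max range cannot contain the requested value."""
    read = [b for b in blocks if b["min"] <= country <= b["max"]]
    matches = [row for b in read for row in b["rows"] if row[0] == country]
    return matches, len(read)


unclustered = make_blocks(ROWS, block_size=2)
clustered = make_blocks(sorted(ROWS, key=lambda row: row[0]), block_size=2)

print(scan(unclustered, "US")[1])  # 3 blocks read
print(scan(clustered, "US")[1])    # 2 blocks read
```

Both scans return the same three rows, but the sorted layout lets more blocks be skipped. The gap widens as tables grow and the filter becomes more selective.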

Automatic Reclustering
➤ As new data is added to a table, data may be stored out of order
➤ BigQuery automatically re-clusters in the background
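
The effect of appending and then re-clustering can be sketched in Python with a few integers. This is a toy model only: the helper names and the block size are hypothetical, and BigQuery's background process is certainly more incremental than re-sorting the whole table.

```python
def split(values, size):
    """Cut a list of cluster-key values into fixed-size blocks."""
    return [values[i:i + size] for i in range(0, len(values), size)]


def blocks_touched(blocks, key):
    """Count blocks whose min/max range could contain the key."""
    return sum(1 for block in blocks if min(block) <= key <= max(block))


def recluster(blocks, size):
    """Re-sort all values and rebuild blocks with tight, non-overlapping ranges."""
    return split(sorted(v for block in blocks for v in block), size)


table = split([10, 20, 30, 40, 50, 60], 3)   # two well-sorted blocks
table.append([25, 55, 5])                     # new data arrives out of order

before = blocks_touched(table, 25)
table = recluster(table, 3)
after = blocks_touched(table, 25)
print(before, after)  # 2 1
```

The appended block spans almost the whole key range, so a lookup has to read it until the data is sorted back into place.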

One more time … if you remember anything from this talk ...
➤ Design for scanning in parallel
➤ Partition to minimize amount of data scanned
➤ Cluster to further reduce the amount of data scanned
➤ Joins may require shuffling data across slots so ...
➤ Denormalize using nested and repeated fields to avoid needing joins
