This document discusses best practices for data modeling in BigQuery, Google's serverless data warehouse. It recommends designing tables for parallel scanning by partitioning and clustering data. Because joins may require shuffling data across compute slots, it suggests denormalizing with nested and repeated fields to avoid joins. BigQuery uses a multi-tenant query execution engine called Dremel that dynamically allocates slots, and a distributed storage system called Colossus that handles replication and recovery, so users do not need to manage storage. Data modeling for BigQuery differs from data modeling for traditional relational databases because of BigQuery's petabyte scale, serverless architecture, and use of nested data structures.
How to Design a Modern Data Warehouse in BigQuery Using Partitioning, Clustering, and Nested/Repeated Fields
1. How to Design a Modern Data Warehouse in BigQuery
...or why I needed to forget everything I learned in data modeling school
Dan Sullivan
March 11, 2021
3. Datastore Options
➤ Relational
➢ Highly structured and transactional
➢ Difficult to scale
➤ NoSQL
➢ Semi-structured, eventual consistency, scalable
➤ Analytical
➢ Structured, scalable, not transactional
4. Data Warehouse (early 2000s)
➤ Few servers
➤ Tightly coupled storage and compute
➤ Scale vertically
➤ Built on same relational database management systems used for OLTP
5. BigQuery
➤ Serverless data warehouse
➤ Petabyte scale
➤ Uses SQL but is not a relational database
➤ Analytical database
➤ Other features
➢ BigQuery ML
➢ BigQuery BI Engine
➢ BigQuery GIS
8. Dremel
➤ Multi-tenant cluster
➤ SQL queries to execution trees
➢ Leaves are called slots; read data and perform computation
➢ Inner nodes perform aggregation
➤ Dynamically allocate slots to queries
➤ Maintains fairness
➤ A single user could get 1,000s of slots
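The execution tree on this slide can be pictured with a toy model. The sketch below is plain Python, not Dremel: `slot_scan`, `mixer_combine`, and `average` are hypothetical names, a thread pool stands in for the slots, and the shards are tiny in-memory lists.

```python
from concurrent.futures import ThreadPoolExecutor

def slot_scan(shard):
    """Leaf work: read one shard and return a partial (sum, count)."""
    return sum(shard), len(shard)

def mixer_combine(partials):
    """Inner-node work: merge the partial aggregates produced by the leaves."""
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total, count

def average(shards, max_slots=4):
    """Scan all shards in parallel, then aggregate the partial results."""
    with ThreadPoolExecutor(max_workers=max_slots) as pool:
        partials = list(pool.map(slot_scan, shards))
    total, count = mixer_combine(partials)
    return total / count

shards = [[1, 2, 3], [4, 5], [6]]
print(average(shards))  # 3.5
```

The point of the shape is that each leaf only ever sees its own shard, so adding shards adds parallel work rather than lengthening a single scan.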
10. Colossus
➤ Distributed storage system
➤ Handles replication and recovery
➤ No need to manage storage
https://en.wikipedia.org/wiki/Google_File_System#/media/File:GoogleFileSystemGFS.svg
11. Jupiter & Borg
➤ Jupiter
➢ Google networking switch
➢ Petabit scale
➢ Storage to compute communication
➢ No need for rack awareness
➤ Borg
➢ Predecessor of Kubernetes
➢ Manages mixers and slots
https://medium.com/@jerub/the-production-environment-at-google-8a1aaece3767
https://conferences.sigcomm.org/sigcomm/2015/pdf/papers/p183.pdf
12. Capacitor
➤ Columnar storage format
➤ Supports semi-structured data
➢ Nested structures
➢ Repeated fields
➤ No need to read parent column to produce a nested structure attribute value
➤ Compression
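To make the columnar idea concrete, here is a minimal sketch that stores each leaf field of a nested, repeated record as its own column. It is an illustration only: Capacitor's actual on-disk encoding is not described in these slides, and the `shred` function, the row-id pairs, and the sample records are all assumptions made for the example.

```python
def shred(records):
    """Store each leaf field as its own column, keyed by its dotted path."""
    columns = {"name": [], "addresses.city": [], "addresses.zip": []}
    for row_id, record in enumerate(records):
        columns["name"].append((row_id, record["name"]))
        for address in record["addresses"]:      # repeated, nested field
            columns["addresses.city"].append((row_id, address["city"]))
            columns["addresses.zip"].append((row_id, address["zip"]))
    return columns

records = [
    {"name": "Ada", "addresses": [{"city": "London", "zip": "N1"},
                                  {"city": "Paris", "zip": "75001"}]},
    {"name": "Lin", "addresses": [{"city": "Oslo", "zip": "0150"}]},
]
columns = shred(records)

# Only the one nested column is read; "name" and "addresses.zip" are untouched.
cities = [city for _, city in columns["addresses.city"]]
print(cities)  # ['London', 'Paris', 'Oslo']
```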
14. If you remember anything from this talk ...
➤ Design for scanning in parallel
➤ Partition to minimize amount of data scanned
➤ Cluster to further reduce the amount of data scanned
➤ Joins may require shuffling data across slots so ...
➤ Denormalize using nested and repeated fields
16. Partitioned Tables
➤ Table is divided into segments called partitions
➤ Improves query performance
➤ Lowers cost by reducing amount of data scanned
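Why partitioning reduces the data scanned can be shown with a toy model. The sketch below keeps one Python list per day and skips every list outside the requested date range; `build_partitions`, `total_amount`, and the sample rows are hypothetical and only mimic the effect, not BigQuery's storage.

```python
from collections import defaultdict
from datetime import date

def build_partitions(rows):
    """Group rows into one partition per calendar day."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[row["day"]].append(row)
    return partitions

def total_amount(partitions, start, end):
    """Sum amounts for start <= day <= end, scanning only matching partitions."""
    total = 0
    rows_scanned = 0
    for day, rows in partitions.items():
        if start <= day <= end:          # partitions outside the range are skipped
            rows_scanned += len(rows)
            total += sum(row["amount"] for row in rows)
    return total, rows_scanned

rows = [
    {"day": date(2021, 3, 9), "amount": 10},
    {"day": date(2021, 3, 9), "amount": 20},
    {"day": date(2021, 3, 10), "amount": 5},
    {"day": date(2021, 3, 11), "amount": 7},
    {"day": date(2021, 3, 11), "amount": 8},
]
partitions = build_partitions(rows)
print(total_amount(partitions, date(2021, 3, 10), date(2021, 3, 11)))  # (20, 3)
```

Only three of the five rows are touched, which is the cost saving the slide refers to.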
17. Partition by Ingestion Time
➤ Loads data into daily, date-based partitions
➤ Automatically creates new partitions
➤ Uses ingestion time to determine partition
➤ Creates pseudo-column _PARTITIONTIME
➢ Date-based timestamp
➢ Used in queries to limit the number of partitions scanned
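A query against an ingestion-time partitioned table limits the partitions scanned by filtering on `_PARTITIONTIME`. The helper below is a small sketch that only builds the SQL text and does not contact BigQuery; the function name `partition_time_query` and the table `mydataset.events` are made up for illustration.

```python
from datetime import date

def partition_time_query(table, start, end):
    """Build a query that scans ingestion-time partitions in [start, end)."""
    return (
        "SELECT COUNT(*) AS row_count\n"
        f"FROM `{table}`\n"
        f"WHERE _PARTITIONTIME >= TIMESTAMP('{start.isoformat()}')\n"
        f"  AND _PARTITIONTIME < TIMESTAMP('{end.isoformat()}')"
    )

sql = partition_time_query("mydataset.events", date(2021, 3, 1), date(2021, 3, 8))
print(sql)
```

A half-open range (inclusive start, exclusive end) is used so that consecutive weekly queries never count the same partition twice.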
18. Date/Timestamp Partitioning
➤ Partition based on date or timestamp column
➤ Each partition holds one day of data by default (hourly, monthly, and yearly partitions are also supported)
➤ No need for _PARTITIONTIME
➤ Special partitions
➢ __NULL__ when nulls in partition column
➢ __UNPARTITIONED__ when values in column outside allowed range
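The routing of rows into daily and special partitions can be sketched as a small function. This is a model of the behavior, not BigQuery code: `daily_partition_id` is a hypothetical name, the special partition names use the double-underscore spelling from BigQuery's documentation, and the default date bounds (1960-01-01 to 2159-12-31) are an assumption about the allowed range that should be checked against the current documentation.

```python
from datetime import date

def daily_partition_id(value, earliest=date(1960, 1, 1), latest=date(2159, 12, 31)):
    """Return the partition a row would land in for a daily-partitioned column."""
    if value is None:
        return "__NULL__"                 # NULL in the partition column
    if value < earliest or value > latest:
        return "__UNPARTITIONED__"        # outside the assumed allowed range
    return value.strftime("%Y%m%d")       # one partition per day

print(daily_partition_id(date(2021, 3, 11)))  # 20210311
print(daily_partition_id(None))               # __NULL__
print(daily_partition_id(date(1900, 1, 1)))   # __UNPARTITIONED__
```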
19. Integer Range Partition
➤ Partition column must be an integer type
➤ Partition column cannot be repeated
➤ Cannot use Legacy SQL to query partitioned tables
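Integer range partitioning assigns each row to a fixed-width bucket, in the spirit of a DDL clause such as `PARTITION BY RANGE_BUCKET(customer_id, GENERATE_ARRAY(0, 100, 10))`. The sketch below models that assignment in plain Python; `range_partition_start` and the example bounds are assumptions chosen for illustration.

```python
def range_partition_start(value, start, end, interval):
    """Return the lower bound of the bucket holding value, or a special partition."""
    if value is None:
        return "__NULL__"
    if value < start or value >= end:     # start is inclusive, end is exclusive
        return "__UNPARTITIONED__"
    return start + ((value - start) // interval) * interval

print(range_partition_start(37, 0, 100, 10))   # 30
print(range_partition_start(100, 0, 100, 10))  # __UNPARTITIONED__
```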
20. Sharding vs. Partitioning
➤ Sharding
➢ Use separate table for each day
➢ [TABLE_NAME_PREFIX]_YYYYMMDD
➢ Use UNION ALL in queries to scan multiple tables
➤ Partitioning is preferred over sharding
➢ Less metadata to maintain
➢ Less permission checking overhead
➢ Better performance
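The difference between the two approaches shows up in the query text. The sketch below builds both forms as strings, using the YYYYMMDD date suffix that BigQuery's documentation describes for date-sharded tables; the helper names, the `events` prefix, and the `event_date` column are hypothetical, and nothing is sent to BigQuery.

```python
from datetime import date, timedelta

def shard_tables(prefix, start, end):
    """List one table name per day in [start, end], as PREFIX_YYYYMMDD."""
    names = []
    day = start
    while day <= end:
        names.append(f"{prefix}_{day:%Y%m%d}")
        day += timedelta(days=1)
    return names

def sharded_query(prefix, start, end):
    """One SELECT per shard, glued together with UNION ALL."""
    selects = [f"SELECT * FROM `{name}`" for name in shard_tables(prefix, start, end)]
    return "\nUNION ALL\n".join(selects)

def partitioned_query(table, column, start, end):
    """One table reference plus a filter on the partition column."""
    return (
        f"SELECT * FROM `{table}`\n"
        f"WHERE {column} BETWEEN DATE '{start.isoformat()}' AND DATE '{end.isoformat()}'"
    )

print(sharded_query("mydataset.events", date(2021, 3, 9), date(2021, 3, 11)))
print(partitioned_query("mydataset.events", "event_date", date(2021, 3, 9), date(2021, 3, 11)))
```

The sharded form references one table per day, each with its own metadata and permission check, while the partitioned form references a single table regardless of the date range.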
21. Requiring Partition Filter
➤ require_partition_filter option
➤ Specified at table level (formerly at partition level)
➤ Requires a WHERE clause with the partition column
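In BigQuery DDL the table-level setting is written as the `require_partition_filter` option. The helper below is a sketch that only assembles the statement text; `partitioned_table_ddl`, the table name, and the three example columns are assumptions, not part of the talk.

```python
def partitioned_table_ddl(table, partition_column, require_filter=True):
    """Build CREATE TABLE DDL for a table partitioned by day on a TIMESTAMP column."""
    flag = "TRUE" if require_filter else "FALSE"
    return (
        f"CREATE TABLE `{table}` (\n"
        f"  {partition_column} TIMESTAMP,\n"
        "  customer_id INT64,\n"
        "  amount NUMERIC\n"
        ")\n"
        f"PARTITION BY DATE({partition_column})\n"
        f"OPTIONS (require_partition_filter = {flag})"
    )

ddl = partitioned_table_ddl("mydataset.events", "event_ts")
print(ddl)
```

With the option set, a query on this table that lacks a filter on `event_ts` would be rejected rather than silently scanning every partition.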
23. Clustered Tables
➤ Data sorted based on values in one or more columns
➤ Can improve performance of aggregate queries
➤ Can reduce scanning when cluster columns used in WHERE clause
➤ Can be used together with partitioned tables
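How sorted storage reduces scanning can be modeled with blocks that remember their minimum and maximum value. The sketch below is a toy: the block size, the `build_blocks` and `count_matches` names, and the sample values are assumptions, and BigQuery's real block layout and metadata are internal.

```python
def build_blocks(values, block_size):
    """Sort values by the cluster column and split them into (min, max, rows) blocks."""
    ordered = sorted(values)
    chunks = [ordered[i:i + block_size] for i in range(0, len(ordered), block_size)]
    return [(chunk[0], chunk[-1], chunk) for chunk in chunks]

def count_matches(blocks, target):
    """Count rows equal to target, skipping blocks whose range excludes it."""
    rows_scanned = 0
    matches = 0
    for low, high, rows in blocks:
        if low <= target <= high:        # block pruning on the cluster column
            rows_scanned += len(rows)
            matches += rows.count(target)
    return matches, rows_scanned

countries = ["de", "us", "fr", "us", "de", "jp", "us", "fr"]
blocks = build_blocks(countries, block_size=2)
print(count_matches(blocks, "us"))  # (3, 4)
```

Half of the eight rows are never read for the `us` filter; without the sort, matching values would be spread across every block.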
24. Automatic Reclustering
➤ As new data is added to a table, data may be stored out of order
➤ BigQuery automatically re-clusters in the background
28. One more time … if you remember anything from this talk ...
➤ Design for scanning in parallel
➤ Partition to minimize amount of data scanned
➤ Cluster to further reduce the amount of data scanned
➤ Joins may require shuffling data across slots so ...
➤ Denormalize using nested and repeated fields to avoid needing joins
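The last point can be illustrated by comparing a normalized layout with a nested one. The sketch below uses plain Python dictionaries as stand-ins for rows; the `orders` and `line_items` data and all function names are hypothetical, and a nested list plays the role of a repeated field.

```python
# Normalized: two tables that must be joined on order_id.
orders = [{"order_id": 1, "customer": "Ada"}, {"order_id": 2, "customer": "Lin"}]
line_items = [
    {"order_id": 1, "sku": "A", "price": 5},
    {"order_id": 1, "sku": "B", "price": 7},
    {"order_id": 2, "sku": "C", "price": 3},
]

def totals_with_join(orders, line_items):
    """Match every line item to its order, then sum prices per customer."""
    customer_by_order = {o["order_id"]: o["customer"] for o in orders}
    totals = {}
    for item in line_items:
        customer = customer_by_order[item["order_id"]]
        totals[customer] = totals.get(customer, 0) + item["price"]
    return totals

def nest(orders, line_items):
    """Denormalize: embed each order's line items as a repeated field."""
    nested = []
    for order in orders:
        items = [{"sku": i["sku"], "price": i["price"]}
                 for i in line_items if i["order_id"] == order["order_id"]]
        nested.append({**order, "items": items})
    return nested

def totals_without_join(nested_orders):
    """Each row already carries its items, so one pass over one table is enough."""
    totals = {}
    for order in nested_orders:
        order_total = sum(item["price"] for item in order["items"])
        totals[order["customer"]] = totals.get(order["customer"], 0) + order_total
    return totals

nested_orders = nest(orders, line_items)
print(totals_with_join(orders, line_items))   # {'Ada': 12, 'Lin': 3}
print(totals_without_join(nested_orders))     # {'Ada': 12, 'Lin': 3}
```

In the nested form all data for an order sits in one row, so the aggregation needs no lookup across tables and, by analogy, no shuffle across slots.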