SlideShare a Scribd company logo
1 of 28
How to Design a Modern Data
Warehouse in BigQuery
...or why I needed to forget everything I learned in data
modeling school
Dan Sullivan
March 11, 2021
Architecture
Ahead
Datastore Options
➤ Relational
➢ Highly structured and transactional
➢ Difficult to scale
➤ NoSQL
➢ Semi-structured, eventual consistency, scalable
➤ Analytical
➢ Structured, scalable, not transactional
Data Warehouse (early 2000s)
➤ Few servers
➤ Tightly coupled storage and
compute
➤ Scale vertically
➤ Built on same relational database
management systems used for
OLTP
BigQuery
➤ Serverless data warehouse
➤ Petabyte scale
➤ Uses SQL but is not a relational database
➤ Analytical database
➤ Other features
➢ BigQuery ML
➢ BigQuery BI Engine
➢ BigQuery GIS
So What’s Different
about BigQuery?
Source: https://cloud.google.com/blog/products/data-analytics/cloud-data-warehouse-bigquery-4-9s-sla
Dremel
➤ Multi-tenant cluster
➤ SQL queries to execution trees
➢ Leaves are called slots; read data and perform computation
➢ Inner nodes perform aggregation
➤ Dynamically allocate slots to queries
➤ Maintains fairness
➤ Single user cloud get 1,000s of slots
Source: https://cloud.google.com/blog/products/data-analytics/new-blog-series-bigquery-explained-overview
Colossus
➤ Distributed storage system
➤ Handles replication and recovery
➤ No need to managed storage
https://en.wikipedia.org/wiki/Google_File_System#/media/File:GoogleFileSystemGFS.svg
Jupiter & Borg
➤ Jupiter
➢ Google networking switch
➢ Petibit scale
➢ Storage to compute communication
➢ No need for rack awareness
➤ Borg
➢ Predecessor of Kubernetes
➢ Manages mixers and slots
https://medium.com/@jerub/the-production-environment-at-google-
8a1aaece3767
https://conferences.sigcomm.org/sigcomm/2015/pdf/papers/p183.pdf
Capacitor
➤ Columnar storage format
➤ Supports semi-structured data
➢ Nested structures
➢ Repeated fields
➤ No need to read parent column to produce a
nested structure attribute value
➤ Compression
What Does this Mean
for Data Modeling?
If you remember anything
from this talk ...
➤ Design for scanning in parallel
➤ Partition to minimize amount of data scanned
➤ Cluster to further reduce the amount of data scanned
➤ Joins may require shuffling data across slots so ...
➤ Denormalize using nested and repeated fields
Partitioning
Partitioned Tables
➤ Table is divided into segments called partitions
➤ Improves query performance
➤ Lowers cost by reducing amount of data scanned
Partition by Ingestion Time
➤ Loads data into daily, date-based partitions
➤ Automatically creates new partitions
➤ Uses ingestion time to determine partition
➤ Create pseudo-column _PARTITIONTIME
➢ Date-based timestamp
➢ Used in queries to limit the number of partitions scanned
Date/Timestamp Partitioning
➤ Partition based on date or timestamp column
➤ Each partition holds one day of data
➤ No need for _PARTITIONTIME
➤ Special partitions
➢ _NULL_ when nulls in partition column
➢ _UNPARTITION_ when values in column outside allowed range
Integer Range Partition
➤ Partition column must be an integer type
➤ Partition column cannot be repeated
➤ Cannot use Legacy SQL to query partitioned tables
Sharding vs. Partitioning
➤ Sharding
➢ Use separate table for each day
➢ [TABLE_NAME_PREFIX]_YYMMDD
➢ Use UNION in queries to scan multiple tables
➤ Partitioning is preferred over sharding
➢ Less metadata to maintain
➢ Less permission checking overhead
➢ Better performance
Requiring Partition Filter
➤ Require_partitioning_filter parameter
➤ Specified at table level (formerly at partition level)
➤ Requires a WHERE clause with the partition column
Clustered Tables
Clustered Tables
➤ Data sorted based on values in one or more columns
➤ Can improve performance of aggregate queries
➤ Can reduce scanning when cluster columns used in WHERE clause
➤ Used with partitioned tables
Automatic Reclustering
➤ As new data is added to a table, data
may be stored out of order
➤ BigQuery automatically re-clusters in the
background
Nested and Repeated
Fields
Nested and Repeated Fields
Nested and Repeated Fields
One more time … if you remember
anything from this talk ...
➤ Design for scanning in parallel
➤ Partition to minimize amount of data scanned
➤ Cluster to further reduce the amount of data scanned
➤ Joins may require shuffling data across slots so ...
➤ Denormalize using nested and repeated fields to avoid needing joins

More Related Content

What's hot

Moving to Databricks & Delta
Moving to Databricks & DeltaMoving to Databricks & Delta
Moving to Databricks & DeltaDatabricks
 
Introduction to Impala
Introduction to ImpalaIntroduction to Impala
Introduction to Impalamarkgrover
 
Snowflake essentials
Snowflake essentialsSnowflake essentials
Snowflake essentialsqureshihamid
 
Data Lakehouse, Data Mesh, and Data Fabric (r1)
Data Lakehouse, Data Mesh, and Data Fabric (r1)Data Lakehouse, Data Mesh, and Data Fabric (r1)
Data Lakehouse, Data Mesh, and Data Fabric (r1)James Serra
 
Designing modern dw and data lake
Designing modern dw and data lakeDesigning modern dw and data lake
Designing modern dw and data lakepunedevscom
 
Delta from a Data Engineer's Perspective
Delta from a Data Engineer's PerspectiveDelta from a Data Engineer's Perspective
Delta from a Data Engineer's PerspectiveDatabricks
 
An overview of BigQuery
An overview of BigQuery An overview of BigQuery
An overview of BigQuery GirdhareeSaran
 
Future of Data Engineering
Future of Data EngineeringFuture of Data Engineering
Future of Data EngineeringC4Media
 
Best Practices for Becoming an Exceptional Postgres DBA
Best Practices for Becoming an Exceptional Postgres DBA Best Practices for Becoming an Exceptional Postgres DBA
Best Practices for Becoming an Exceptional Postgres DBA EDB
 
Making Apache Spark Better with Delta Lake
Making Apache Spark Better with Delta LakeMaking Apache Spark Better with Delta Lake
Making Apache Spark Better with Delta LakeDatabricks
 
Building Lakehouses on Delta Lake with SQL Analytics Primer
Building Lakehouses on Delta Lake with SQL Analytics PrimerBuilding Lakehouses on Delta Lake with SQL Analytics Primer
Building Lakehouses on Delta Lake with SQL Analytics PrimerDatabricks
 
Data platform modernization with Databricks.pptx
Data platform modernization with Databricks.pptxData platform modernization with Databricks.pptx
Data platform modernization with Databricks.pptxCalvinSim10
 
Actionable Insights with AI - Snowflake for Data Science
Actionable Insights with AI - Snowflake for Data ScienceActionable Insights with AI - Snowflake for Data Science
Actionable Insights with AI - Snowflake for Data ScienceHarald Erb
 
Redshift performance tuning
Redshift performance tuningRedshift performance tuning
Redshift performance tuningCarlos del Cacho
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDatabricks
 
MySQL Performance Tuning. Part 1: MySQL Configuration (includes MySQL 5.7)
MySQL Performance Tuning. Part 1: MySQL Configuration (includes MySQL 5.7)MySQL Performance Tuning. Part 1: MySQL Configuration (includes MySQL 5.7)
MySQL Performance Tuning. Part 1: MySQL Configuration (includes MySQL 5.7)Aurimas Mikalauskas
 
Hudi architecture, fundamentals and capabilities
Hudi architecture, fundamentals and capabilitiesHudi architecture, fundamentals and capabilities
Hudi architecture, fundamentals and capabilitiesNishith Agarwal
 

What's hot (20)

Moving to Databricks & Delta
Moving to Databricks & DeltaMoving to Databricks & Delta
Moving to Databricks & Delta
 
Introduction to Impala
Introduction to ImpalaIntroduction to Impala
Introduction to Impala
 
Snowflake essentials
Snowflake essentialsSnowflake essentials
Snowflake essentials
 
Snowflake Datawarehouse Architecturing
Snowflake Datawarehouse ArchitecturingSnowflake Datawarehouse Architecturing
Snowflake Datawarehouse Architecturing
 
Data Lakehouse, Data Mesh, and Data Fabric (r1)
Data Lakehouse, Data Mesh, and Data Fabric (r1)Data Lakehouse, Data Mesh, and Data Fabric (r1)
Data Lakehouse, Data Mesh, and Data Fabric (r1)
 
Designing modern dw and data lake
Designing modern dw and data lakeDesigning modern dw and data lake
Designing modern dw and data lake
 
Delta from a Data Engineer's Perspective
Delta from a Data Engineer's PerspectiveDelta from a Data Engineer's Perspective
Delta from a Data Engineer's Perspective
 
An overview of BigQuery
An overview of BigQuery An overview of BigQuery
An overview of BigQuery
 
Future of Data Engineering
Future of Data EngineeringFuture of Data Engineering
Future of Data Engineering
 
Google BigQuery
Google BigQueryGoogle BigQuery
Google BigQuery
 
Best Practices for Becoming an Exceptional Postgres DBA
Best Practices for Becoming an Exceptional Postgres DBA Best Practices for Becoming an Exceptional Postgres DBA
Best Practices for Becoming an Exceptional Postgres DBA
 
Making Apache Spark Better with Delta Lake
Making Apache Spark Better with Delta LakeMaking Apache Spark Better with Delta Lake
Making Apache Spark Better with Delta Lake
 
Building Lakehouses on Delta Lake with SQL Analytics Primer
Building Lakehouses on Delta Lake with SQL Analytics PrimerBuilding Lakehouses on Delta Lake with SQL Analytics Primer
Building Lakehouses on Delta Lake with SQL Analytics Primer
 
Data platform modernization with Databricks.pptx
Data platform modernization with Databricks.pptxData platform modernization with Databricks.pptx
Data platform modernization with Databricks.pptx
 
Actionable Insights with AI - Snowflake for Data Science
Actionable Insights with AI - Snowflake for Data ScienceActionable Insights with AI - Snowflake for Data Science
Actionable Insights with AI - Snowflake for Data Science
 
The Impala Cookbook
The Impala CookbookThe Impala Cookbook
The Impala Cookbook
 
Redshift performance tuning
Redshift performance tuningRedshift performance tuning
Redshift performance tuning
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
 
MySQL Performance Tuning. Part 1: MySQL Configuration (includes MySQL 5.7)
MySQL Performance Tuning. Part 1: MySQL Configuration (includes MySQL 5.7)MySQL Performance Tuning. Part 1: MySQL Configuration (includes MySQL 5.7)
MySQL Performance Tuning. Part 1: MySQL Configuration (includes MySQL 5.7)
 
Hudi architecture, fundamentals and capabilities
Hudi architecture, fundamentals and capabilitiesHudi architecture, fundamentals and capabilities
Hudi architecture, fundamentals and capabilities
 

Similar to How to Design a Modern Data Warehouse in BigQuery Using Partitioning, Clustering, and Nested/Repeated Fields

My Database Skills Killed the Server
My Database Skills Killed the ServerMy Database Skills Killed the Server
My Database Skills Killed the ServerColdFusionConference
 
Lunch & Learn BigQuery & Firebase from other Google Cloud customers
Lunch & Learn BigQuery & Firebase from other Google Cloud customersLunch & Learn BigQuery & Firebase from other Google Cloud customers
Lunch & Learn BigQuery & Firebase from other Google Cloud customersDaniel Zivkovic
 
NDC Sydney - Analyzing StackExchange with Azure Data Lake
NDC Sydney - Analyzing StackExchange with Azure Data LakeNDC Sydney - Analyzing StackExchange with Azure Data Lake
NDC Sydney - Analyzing StackExchange with Azure Data LakeTom Kerkhove
 
ORACLE 12C-New-Features
ORACLE 12C-New-FeaturesORACLE 12C-New-Features
ORACLE 12C-New-FeaturesNavneet Upneja
 
NDC Minnesota - Analyzing StackExchange data with Azure Data Lake
NDC Minnesota - Analyzing StackExchange data with Azure Data LakeNDC Minnesota - Analyzing StackExchange data with Azure Data Lake
NDC Minnesota - Analyzing StackExchange data with Azure Data LakeTom Kerkhove
 
Migration to ClickHouse. Practical guide, by Alexander Zaitsev
Migration to ClickHouse. Practical guide, by Alexander ZaitsevMigration to ClickHouse. Practical guide, by Alexander Zaitsev
Migration to ClickHouse. Practical guide, by Alexander ZaitsevAltinity Ltd
 
My SQL Skills Killed the Server
My SQL Skills Killed the ServerMy SQL Skills Killed the Server
My SQL Skills Killed the ServerdevObjective
 
Best storage engine for MySQL
Best storage engine for MySQLBest storage engine for MySQL
Best storage engine for MySQLtomflemingh2
 
Data Warehouse Best Practices
Data Warehouse Best PracticesData Warehouse Best Practices
Data Warehouse Best PracticesEduardo Castro
 
Google - Bigtable
Google - BigtableGoogle - Bigtable
Google - Bigtable영원 서
 
Best Practices for Building Robust Data Platform with Apache Spark and Delta
Best Practices for Building Robust Data Platform with Apache Spark and DeltaBest Practices for Building Robust Data Platform with Apache Spark and Delta
Best Practices for Building Robust Data Platform with Apache Spark and DeltaDatabricks
 
Kill mysql-performance
Kill mysql-performanceKill mysql-performance
Kill mysql-performancekriptonium
 
PostgreSQL Table Partitioning / Sharding
PostgreSQL Table Partitioning / ShardingPostgreSQL Table Partitioning / Sharding
PostgreSQL Table Partitioning / ShardingAmir Reza Hashemi
 
The Peoper Care and Feeding of a MySQL Server for Busy Linux Admin
The Peoper Care and Feeding of a MySQL Server for Busy Linux AdminThe Peoper Care and Feeding of a MySQL Server for Busy Linux Admin
The Peoper Care and Feeding of a MySQL Server for Busy Linux AdminDave Stokes
 
Tech-Spark: Scaling Databases
Tech-Spark: Scaling DatabasesTech-Spark: Scaling Databases
Tech-Spark: Scaling DatabasesRalph Attard
 
Data Enginering from Google Data Warehouse
Data Enginering from Google Data WarehouseData Enginering from Google Data Warehouse
Data Enginering from Google Data Warehousearungansi
 
Where do I put this data? #lessql
Where do I put this data? #lessqlWhere do I put this data? #lessql
Where do I put this data? #lessqlEzra Zygmuntowicz
 

Similar to How to Design a Modern Data Warehouse in BigQuery Using Partitioning, Clustering, and Nested/Repeated Fields (20)

My Database Skills Killed the Server
My Database Skills Killed the ServerMy Database Skills Killed the Server
My Database Skills Killed the Server
 
Lunch & Learn BigQuery & Firebase from other Google Cloud customers
Lunch & Learn BigQuery & Firebase from other Google Cloud customersLunch & Learn BigQuery & Firebase from other Google Cloud customers
Lunch & Learn BigQuery & Firebase from other Google Cloud customers
 
NDC Sydney - Analyzing StackExchange with Azure Data Lake
NDC Sydney - Analyzing StackExchange with Azure Data LakeNDC Sydney - Analyzing StackExchange with Azure Data Lake
NDC Sydney - Analyzing StackExchange with Azure Data Lake
 
ORACLE 12C-New-Features
ORACLE 12C-New-FeaturesORACLE 12C-New-Features
ORACLE 12C-New-Features
 
NDC Minnesota - Analyzing StackExchange data with Azure Data Lake
NDC Minnesota - Analyzing StackExchange data with Azure Data LakeNDC Minnesota - Analyzing StackExchange data with Azure Data Lake
NDC Minnesota - Analyzing StackExchange data with Azure Data Lake
 
Migration to ClickHouse. Practical guide, by Alexander Zaitsev
Migration to ClickHouse. Practical guide, by Alexander ZaitsevMigration to ClickHouse. Practical guide, by Alexander Zaitsev
Migration to ClickHouse. Practical guide, by Alexander Zaitsev
 
Mysql For Developers
Mysql For DevelopersMysql For Developers
Mysql For Developers
 
My SQL Skills Killed the Server
My SQL Skills Killed the ServerMy SQL Skills Killed the Server
My SQL Skills Killed the Server
 
Sql killedserver
Sql killedserverSql killedserver
Sql killedserver
 
Best storage engine for MySQL
Best storage engine for MySQLBest storage engine for MySQL
Best storage engine for MySQL
 
Data Warehouse Best Practices
Data Warehouse Best PracticesData Warehouse Best Practices
Data Warehouse Best Practices
 
Amazon Redshift Masterclass
Amazon Redshift MasterclassAmazon Redshift Masterclass
Amazon Redshift Masterclass
 
Google - Bigtable
Google - BigtableGoogle - Bigtable
Google - Bigtable
 
Best Practices for Building Robust Data Platform with Apache Spark and Delta
Best Practices for Building Robust Data Platform with Apache Spark and DeltaBest Practices for Building Robust Data Platform with Apache Spark and Delta
Best Practices for Building Robust Data Platform with Apache Spark and Delta
 
Kill mysql-performance
Kill mysql-performanceKill mysql-performance
Kill mysql-performance
 
PostgreSQL Table Partitioning / Sharding
PostgreSQL Table Partitioning / ShardingPostgreSQL Table Partitioning / Sharding
PostgreSQL Table Partitioning / Sharding
 
The Peoper Care and Feeding of a MySQL Server for Busy Linux Admin
The Peoper Care and Feeding of a MySQL Server for Busy Linux AdminThe Peoper Care and Feeding of a MySQL Server for Busy Linux Admin
The Peoper Care and Feeding of a MySQL Server for Busy Linux Admin
 
Tech-Spark: Scaling Databases
Tech-Spark: Scaling DatabasesTech-Spark: Scaling Databases
Tech-Spark: Scaling Databases
 
Data Enginering from Google Data Warehouse
Data Enginering from Google Data WarehouseData Enginering from Google Data Warehouse
Data Enginering from Google Data Warehouse
 
Where do I put this data? #lessql
Where do I put this data? #lessqlWhere do I put this data? #lessql
Where do I put this data? #lessql
 

More from Dan Sullivan, Ph.D.

With Automated ML, is Everyone an ML Engineer?
With Automated ML, is Everyone an ML Engineer?With Automated ML, is Everyone an ML Engineer?
With Automated ML, is Everyone an ML Engineer?Dan Sullivan, Ph.D.
 
Getting Started with BigQuery ML
Getting Started with BigQuery MLGetting Started with BigQuery ML
Getting Started with BigQuery MLDan Sullivan, Ph.D.
 
Google Cloud Certifications & Machine Learning
Google Cloud Certifications & Machine LearningGoogle Cloud Certifications & Machine Learning
Google Cloud Certifications & Machine LearningDan Sullivan, Ph.D.
 
Unstructured text to structured data
Unstructured text to structured dataUnstructured text to structured data
Unstructured text to structured dataDan Sullivan, Ph.D.
 
A first look at tf idf-pdx data science meetup
A first look at tf idf-pdx data science meetupA first look at tf idf-pdx data science meetup
A first look at tf idf-pdx data science meetupDan Sullivan, Ph.D.
 
ACID vs BASE in NoSQL: Another False Dichotomy
ACID vs BASE in NoSQL: Another False DichotomyACID vs BASE in NoSQL: Another False Dichotomy
ACID vs BASE in NoSQL: Another False DichotomyDan Sullivan, Ph.D.
 
Big data, bioscience and the cloud biocatalyst june 2015 sullivan
Big data, bioscience and the cloud   biocatalyst june 2015 sullivanBig data, bioscience and the cloud   biocatalyst june 2015 sullivan
Big data, bioscience and the cloud biocatalyst june 2015 sullivanDan Sullivan, Ph.D.
 
Tools and Techniques for Analyzing Texts: Tweets to Intellectual Property
Tools and Techniques for Analyzing Texts: Tweets to Intellectual PropertyTools and Techniques for Analyzing Texts: Tweets to Intellectual Property
Tools and Techniques for Analyzing Texts: Tweets to Intellectual PropertyDan Sullivan, Ph.D.
 
Modeling with Document Database: 5 Key Patterns
Modeling with Document Database: 5 Key PatternsModeling with Document Database: 5 Key Patterns
Modeling with Document Database: 5 Key PatternsDan Sullivan, Ph.D.
 
Sullivan GBCB Seminar Fall 2014 - Limits of RDMS for Bioinformatics v2
Sullivan GBCB Seminar Fall 2014 - Limits of RDMS for Bioinformatics v2Sullivan GBCB Seminar Fall 2014 - Limits of RDMS for Bioinformatics v2
Sullivan GBCB Seminar Fall 2014 - Limits of RDMS for Bioinformatics v2Dan Sullivan, Ph.D.
 
Text Mining for Biocuration of Bacterial Infectious Diseases
Text Mining for Biocuration of Bacterial Infectious DiseasesText Mining for Biocuration of Bacterial Infectious Diseases
Text Mining for Biocuration of Bacterial Infectious DiseasesDan Sullivan, Ph.D.
 
Limits of RDBMS and Need for NoSQL in Bioinformatics
Limits of RDBMS and Need for NoSQL in BioinformaticsLimits of RDBMS and Need for NoSQL in Bioinformatics
Limits of RDBMS and Need for NoSQL in BioinformaticsDan Sullivan, Ph.D.
 

More from Dan Sullivan, Ph.D. (13)

With Automated ML, is Everyone an ML Engineer?
With Automated ML, is Everyone an ML Engineer?With Automated ML, is Everyone an ML Engineer?
With Automated ML, is Everyone an ML Engineer?
 
Getting Started with BigQuery ML
Getting Started with BigQuery MLGetting Started with BigQuery ML
Getting Started with BigQuery ML
 
Google Cloud Certifications & Machine Learning
Google Cloud Certifications & Machine LearningGoogle Cloud Certifications & Machine Learning
Google Cloud Certifications & Machine Learning
 
Unstructured text to structured data
Unstructured text to structured dataUnstructured text to structured data
Unstructured text to structured data
 
A first look at tf idf-pdx data science meetup
A first look at tf idf-pdx data science meetupA first look at tf idf-pdx data science meetup
A first look at tf idf-pdx data science meetup
 
Text mining meets neural nets
Text mining meets neural netsText mining meets neural nets
Text mining meets neural nets
 
ACID vs BASE in NoSQL: Another False Dichotomy
ACID vs BASE in NoSQL: Another False DichotomyACID vs BASE in NoSQL: Another False Dichotomy
ACID vs BASE in NoSQL: Another False Dichotomy
 
Big data, bioscience and the cloud biocatalyst june 2015 sullivan
Big data, bioscience and the cloud   biocatalyst june 2015 sullivanBig data, bioscience and the cloud   biocatalyst june 2015 sullivan
Big data, bioscience and the cloud biocatalyst june 2015 sullivan
 
Tools and Techniques for Analyzing Texts: Tweets to Intellectual Property
Tools and Techniques for Analyzing Texts: Tweets to Intellectual PropertyTools and Techniques for Analyzing Texts: Tweets to Intellectual Property
Tools and Techniques for Analyzing Texts: Tweets to Intellectual Property
 
Modeling with Document Database: 5 Key Patterns
Modeling with Document Database: 5 Key PatternsModeling with Document Database: 5 Key Patterns
Modeling with Document Database: 5 Key Patterns
 
Sullivan GBCB Seminar Fall 2014 - Limits of RDMS for Bioinformatics v2
Sullivan GBCB Seminar Fall 2014 - Limits of RDMS for Bioinformatics v2Sullivan GBCB Seminar Fall 2014 - Limits of RDMS for Bioinformatics v2
Sullivan GBCB Seminar Fall 2014 - Limits of RDMS for Bioinformatics v2
 
Text Mining for Biocuration of Bacterial Infectious Diseases
Text Mining for Biocuration of Bacterial Infectious DiseasesText Mining for Biocuration of Bacterial Infectious Diseases
Text Mining for Biocuration of Bacterial Infectious Diseases
 
Limits of RDBMS and Need for NoSQL in Bioinformatics
Limits of RDBMS and Need for NoSQL in BioinformaticsLimits of RDBMS and Need for NoSQL in Bioinformatics
Limits of RDBMS and Need for NoSQL in Bioinformatics
 

Recently uploaded

VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130Suhani Kapoor
 
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Callshivangimorya083
 
定制英国白金汉大学毕业证(UCB毕业证书) 成绩单原版一比一
定制英国白金汉大学毕业证(UCB毕业证书)																			成绩单原版一比一定制英国白金汉大学毕业证(UCB毕业证书)																			成绩单原版一比一
定制英国白金汉大学毕业证(UCB毕业证书) 成绩单原版一比一ffjhghh
 
B2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docxB2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docxStephen266013
 
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改atducpo
 
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...Suhani Kapoor
 
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Callshivangimorya083
 
Brighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingBrighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingNeil Barnes
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfLars Albertsson
 
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptxBPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptxMohammedJunaid861692
 
Edukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxEdukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxolyaivanovalion
 
Halmar dropshipping via API with DroFx
Halmar  dropshipping  via API with DroFxHalmar  dropshipping  via API with DroFx
Halmar dropshipping via API with DroFxolyaivanovalion
 
Smarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxSmarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxolyaivanovalion
 
Midocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxMidocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxolyaivanovalion
 
April 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's AnalysisApril 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's Analysismanisha194592
 
BabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxBabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxolyaivanovalion
 
Ukraine War presentation: KNOW THE BASICS
Ukraine War presentation: KNOW THE BASICSUkraine War presentation: KNOW THE BASICS
Ukraine War presentation: KNOW THE BASICSAishani27
 
Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz1
 

Recently uploaded (20)

VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
 
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 
定制英国白金汉大学毕业证(UCB毕业证书) 成绩单原版一比一
定制英国白金汉大学毕业证(UCB毕业证书)																			成绩单原版一比一定制英国白金汉大学毕业证(UCB毕业证书)																			成绩单原版一比一
定制英国白金汉大学毕业证(UCB毕业证书) 成绩单原版一比一
 
B2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docxB2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docx
 
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
 
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
 
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 
Brighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingBrighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data Storytelling
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdf
 
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptxBPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
 
Edukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxEdukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFx
 
Halmar dropshipping via API with DroFx
Halmar  dropshipping  via API with DroFxHalmar  dropshipping  via API with DroFx
Halmar dropshipping via API with DroFx
 
Smarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxSmarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptx
 
Midocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxMidocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFx
 
April 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's AnalysisApril 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's Analysis
 
BabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxBabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptx
 
Ukraine War presentation: KNOW THE BASICS
Ukraine War presentation: KNOW THE BASICSUkraine War presentation: KNOW THE BASICS
Ukraine War presentation: KNOW THE BASICS
 
Sampling (random) method and Non random.ppt
Sampling (random) method and Non random.pptSampling (random) method and Non random.ppt
Sampling (random) method and Non random.ppt
 
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
 
Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signals
 

How to Design a Modern Data Warehouse in BigQuery Using Partitioning, Clustering, and Nested/Repeated Fields

  • 1. How to Design a Modern Data Warehouse in BigQuery ...or why I needed to forget everything I learned in data modeling school Dan Sullivan March 11, 2021
  • 3. Datastore Options ➤ Relational ➢ Highly structured and transactional ➢ Difficult to scale ➤ NoSQL ➢ Semi-structured, eventual consistency, scalable ➤ Analytical ➢ Structured, scalable, not transactional
  • 4. Data Warehouse (early 2000s) ➤ Few servers ➤ Tightly coupled storage and compute ➤ Scale vertically ➤ Built on same relational database management systems used for OLTP
  • 5. BigQuery ➤ Serverless data warehouse ➤ Petabyte scale ➤ Uses SQL but is not a relational database ➤ Analytical database ➤ Other features ➢ BigQuery ML ➢ BigQuery BI Engine ➢ BigQuery GIS
  • 8. Dremel ➤ Multi-tenant cluster ➤ SQL queries to execution trees ➢ Leaves are called slots; read data and perform computation ➢ Inner nodes perform aggregation ➤ Dynamically allocate slots to queries ➤ Maintains fairness ➤ Single user cloud get 1,000s of slots
  • 10. Colossus ➤ Distributed storage system ➤ Handles replication and recovery ➤ No need to managed storage https://en.wikipedia.org/wiki/Google_File_System#/media/File:GoogleFileSystemGFS.svg
  • 11. Jupiter & Borg ➤ Jupiter ➢ Google networking switch ➢ Petibit scale ➢ Storage to compute communication ➢ No need for rack awareness ➤ Borg ➢ Predecessor of Kubernetes ➢ Manages mixers and slots https://medium.com/@jerub/the-production-environment-at-google- 8a1aaece3767 https://conferences.sigcomm.org/sigcomm/2015/pdf/papers/p183.pdf
  • 12. Capacitor ➤ Columnar storage format ➤ Supports semi-structured data ➢ Nested structures ➢ Repeated fields ➤ No need to read parent column to produce a nested structure attribute value ➤ Compression
  • 13. What Does this Mean for Data Modeling?
  • 14. If you remember anything from this talk ... ➤ Design for scanning in parallel ➤ Partition to minimize amount of data scanned ➤ Cluster to further reduce the amount of data scanned ➤ Joins may require shuffling data across slots so ... ➤ Denormalize using nested and repeated fields
  • 16. Partitioned Tables ➤ Table is divided into segments called partitions ➤ Improves query performance ➤ Lowers cost by reducing amount of data scanned
  • 17. Partition by Ingestion Time ➤ Loads data into daily, date-based partitions ➤ Automatically creates new partitions ➤ Uses ingestion time to determine partition ➤ Create pseudo-column _PARTITIONTIME ➢ Date-based timestamp ➢ Used in queries to limit the number of partitions scanned
  • 18. Date/Timestamp Partitioning ➤ Partition based on date or timestamp column ➤ Each partition holds one day of data ➤ No need for _PARTITIONTIME ➤ Special partitions ➢ _NULL_ when nulls in partition column ➢ _UNPARTITION_ when values in column outside allowed range
  • 19. Integer Range Partition ➤ Partition column must be an integer type ➤ Partition column cannot be repeated ➤ Cannot use Legacy SQL to query partitioned tables
  • 20. Sharding vs. Partitioning ➤ Sharding ➢ Use separate table for each day ➢ [TABLE_NAME_PREFIX]_YYMMDD ➢ Use UNION in queries to scan multiple tables ➤ Partitioning is preferred over sharding ➢ Less metadata to maintain ➢ Less permission checking overhead ➢ Better performance
  • 21. Requiring Partition Filter ➤ Require_partitioning_filter parameter ➤ Specified at table level (formerly at partition level) ➤ Requires a WHERE clause with the partition column
  • 23. Clustered Tables ➤ Data sorted based on values in one or more columns ➤ Can improve performance of aggregate queries ➤ Can reduce scanning when cluster columns used in WHERE clause ➤ Used with partitioned tables
  • 24. Automatic Reclustering ➤ As new data is added to a table, data may be stored out of order ➤ BigQuery automatically re-clusters in the background
  • 28. One more time … if you remember anything from this talk ... ➤ Design for scanning in parallel ➤ Partition to minimize amount of data scanned ➤ Cluster to further reduce the amount of data scanned ➤ Joins may require shuffling data across slots so ... ➤ Denormalize using nested and repeated fields to avoid needing joins