Database Design
most common pitfalls
€ whoami
● Federico Razzoli
● Freelance consultant
● Working with databases since 2000
hello@federico-razzoli.com
federico-razzoli.com
● I worked as a consultant for Percona and Ibuildings
(mainly MySQL and MariaDB)
● I worked as a DBA for fast-growing companies like
Catawiki, HumanState, TransferWise
Agenda
We will talk about…
● The most common design bad practices
● Information that is not easy to represent
● Relational model: why?
● Keys and indexes
● Data types
● Abusing NULL
● Hierarchies (trees)
● Lists
● Inheritance & polymorphism
● Heterogeneous rows
● Misc
Criteria
● Queries should be fast
● Data structures should be reasonably simple
● Design must be reasonably extendable
Why Relational?
Specific Use Cases
● Some databases are designed for specific use cases
● In those cases, they may work much better than generic technologies
● Using them when not necessary may lead to adopting many technologies
● A technology should only be introduced if our company has:
○ Skills
○ Knowledge necessary for troubleshooting
○ Backups
○ High Availability
○ ...
Relational is flexible
With the relational model we:
● Are sure that data is written correctly (transactions)
● Can make sure that data is valid (schema, integrity constraints)
● Design tables with access patterns in mind
● Can usually run a query we didn’t initially consider just by adding
an index
Flexibility example
CREATE TABLE user (
id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
name VARCHAR(100) NOT NULL,
surname VARCHAR(100) NOT NULL,
email VARCHAR(100) NOT NULL UNIQUE
);
SELECT * FROM user WHERE id = 24;
SELECT name, surname FROM user
WHERE email = 'picard@starfleet.earth';
CREATE INDEX idx_surname_name ON user (surname, name);
SELECT name, surname FROM user
WHERE surname LIKE 'B%'
ORDER BY surname, name;
When Relational is not a good fit
● Heterogeneous data (product catalogue)
● Searchable text
● Graphs
● …
However, for simple use cases relational databases include non-relational
features, like:
● JSON type and functions
● Arrays (PostgreSQL)
● Fulltext indexes
● ...
Keys and Indexes
Primary Key
● Column or set of columns that identifies each row (unique, not null)
● Usually you want to create an artificial column for this:
○ id
○ or uuid
Poor Primary Keys
● No primary key!
○ In MySQL this causes many performance problems
○ CDC applications need a way to identify each row
● Wrong columns
○ email
■ An email can change over time
■ An email address can be assigned to another person
■ The primary key would be PII (personally identifiable information)!
○ name (eg: city name, product name…)
■ Quite long, especially if it must be UTF-8
■ Certain names can change over time
○ timestamp
■ Multiple rows could be created at the same timestamp!
■ Long
○ ...
UNIQUE
● An index whose values are distinct, or NULL
● Could theoretically be a primary key, but it’s not
Poor UNIQUE keys
● Columns whose values will always be distinct, no matter if there is an index or
not
○ Enforcing unicity implies extra reads, possibly on disk
● Columns that could have duplicates, but they’re unlikely
○ timestamp
○ (last_name, first_name)
Foreign Keys
● References to another table (user.city_id -> city.id)
● In most cases they are bad for performance
● They create problems for operations (ALTER TABLE)
● In MySQL they are not compatible with some other features
○ Cascading foreign key actions don’t activate triggers
○ Table partitioning
○ Tables not using InnoDB
○ Many bugs
Indexing Bad Practices
● Indexing all columns: it won’t work
● Multi-column indexes with columns in random order
● Indexing columns with few distinct values (eg, boolean)
○ Unless you know what you’re doing
● Indexes contained in other indexes:
idx1 (email), idx2 (email, last_name)
idx (email, id)
UNIQUE unq1 (email), INDEX idx1 (email)
● Non-descriptive index names (like the ones above)
Looking at an index name (EXPLAIN),
I should know which columns it contains
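A minimal sketch of the "index contained in another index" case, in MySQL syntax. The customer table and its index names are made up for illustration. The single-column index is a left prefix of the composite one, so it can usually be dropped, and the remaining name says which columns the index contains.

```sql
-- Hypothetical table, for illustration only (MySQL syntax).
CREATE TABLE customer (
    id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    email VARCHAR(100) NOT NULL,
    last_name VARCHAR(100) NOT NULL,
    -- Redundant: (email) is a left prefix of (email, last_name)
    INDEX idx_email (email),
    INDEX idx_email_last_name (email, last_name)
);

-- Queries that filter on email alone can use the composite index,
-- so the single-column index mostly adds write overhead.
ALTER TABLE customer DROP INDEX idx_email;
```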
Quick hints
● Learn how indexes work
○ Google: Federico Razzoli indexes bad practices
● Use pt-duplicate-key-checker, from Percona Toolkit
Data Types
Integer Types
● Don’t use bigger types than necessary
● ...but don’t overoptimise when you are not 100% sure. You’ll hardly see a
benefit using TINYINT instead of SMALLINT
● MySQL UNSIGNED is good: the column’s maximum value doubles
● I discourage the use of exotic MySQL syntax like:
○ MEDIUMINT: non-standard, and 3-byte variables don’t exist in nature
○ INT(length)
○ ZEROFILL
Real Numbers
● FLOAT and DOUBLE are fast when aggregating many values
● But they are subject to approximation. Don’t use them for prices, etc
● Instead you can use:
○ DECIMAL
○ INT - Multiply a number by 100, for example
○ DECIMAL is slower if heavy arithmetic is performed on many values
○ But storing a transformed value (price*100) can lead to
misunderstandings and bugs
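The approximation problem can be seen directly in MySQL, where a literal with an exponent is an approximate (DOUBLE) value and a literal without one is an exact (DECIMAL) value. The product_price table is a made-up example of the DECIMAL option.

```sql
-- MySQL: 0.1E0 is approximate (DOUBLE), 0.1 is exact (DECIMAL).
SELECT (0.1E0 + 0.2E0) = 0.3E0 AS float_equal,   -- 0: binary rounding error
       (0.1 + 0.2) = 0.3       AS decimal_equal; -- 1: exact arithmetic

-- Hypothetical table: prices stored exactly, with two decimal places.
CREATE TABLE product_price (
    product_id INT UNSIGNED PRIMARY KEY,
    price DECIMAL(10, 2) NOT NULL
);
```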
Text Values
● Be sure that VARCHAR columns have adequate size for your data
● In PostgreSQL there is no difference between VARCHAR and TEXT, except
that for VARCHAR you specify a max size
● In MySQL TEXT and BLOB columns are stored separately
○ Less data read if you often don’t read those columns
○ More read operations if you always use SELECT *
● CHAR is only good for small fixed-size data. The space saving is tiny.
Temporal Types
● TIMESTAMP and DATETIME are mostly interchangeable
● MySQL YEAR is weird. The meaning of 2-digit values changes over time. Use
SMALLINT instead.
● MySQL TIME is apparently weird and useless. But not if you consider it as an
interval. (range: -838:59:59 .. 838:59:59)
● PostgreSQL has a proper INTERVAL type, which is surely better
● PostgreSQL allows you to specify a time zone with each value (TIMESTAMP WITH
TIME ZONE)
○ Time zones depend on politics, economy and religion. Offsets may differ by
as little as 15 minutes. Time zones are created, abolished, and changed. In
one case a time zone was changed by skipping a whole calendar day.
○ Never deal with time zones yourself: no one in history has ever succeeded.
Store all dates as UTC and use an external library for conversion.
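A small sketch of the "store everything as UTC" advice, in MySQL syntax. The login_event table and its columns are made up for illustration. Conversion to a local time zone is left to the application's library.

```sql
-- Hypothetical table (MySQL syntax). DATETIME stores the value as given,
-- so by convention every value written here is UTC.
CREATE TABLE login_event (
    id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    user_id INT UNSIGNED NOT NULL,
    logged_in_at_utc DATETIME NOT NULL
);

-- UTC_TIMESTAMP() returns the current UTC date and time,
-- regardless of the session time zone.
INSERT INTO login_event (user_id, logged_in_at_utc)
VALUES (24, UTC_TIMESTAMP());
```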
ENUM, SET
● MySQL weird types that include a list of allowed string values
● With ENUM, exactly one value from the list is allowed
● With SET, any number of values from the list is allowed
● '' is always allowed, because.
● Specifying the value by its index is allowed, so with ENUM('0', '1') the number 1 means '0'
● Adding, dropping and changing values requires an ALTER TABLE
○ And possibly a locking table rebuild
Instead of ENUM
CREATE TABLE account (
state ENUM('active', 'suspended') NOT NULL,
...
)
Instead of ENUM
CREATE TABLE account (
state_id INT UNSIGNED NOT NULL,
...
);
CREATE TABLE state (
id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
state VARCHAR(100) NOT NULL UNIQUE
);
INSERT INTO state (state) VALUES ('active'), ('suspended');
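With the lookup table, the trade-off looks roughly like this: a new allowed value is a plain INSERT, but reading the state name needs a JOIN. The id 3 and the 'closed' state are assumptions made up for this sketch.

```sql
-- Adding a new allowed value is a plain INSERT, no ALTER TABLE needed.
-- The id 3 and the 'closed' state are made up for this example.
INSERT INTO state (id, state) VALUES (3, 'closed');

-- Reading the state name requires a JOIN.
SELECT a.state_id, s.state
FROM account AS a
INNER JOIN state AS s ON s.id = a.state_id
WHERE s.state = 'active';
```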
Abusing NULL
NULL anomalies
mysql> SELECT
NULL = 1 AS a,
NULL <> 1 AS b,
NULL IS NULL AS c,
1 IS NOT NULL AS d;
+------+------+---+---+
| a | b | c | d |
+------+------+---+---+
| NULL | NULL | 1 | 1 |
+------+------+---+---+
-- This returns TRUE in MySQL:
NULL <=> NULL AND 1 <=> 1
Problematic queries
These conditions will not match rows where year IS NULL or approved IS NULL
● WHERE year != 1994
● WHERE NOT (year = 1994)
● WHERE year > 2000
● WHERE NOT (year > 2000)
● WHERE approved != TRUE
● WHERE NOT approved
And this expression returns NULL when year is NULL:
SELECT CONCAT(year, ' years old') FROM user ...
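If rows with NULL should be returned, the condition has to say so explicitly. A sketch in MySQL syntax, assuming a user table with an id column and a nullable year column:

```sql
-- Plain comparison: rows where year IS NULL are silently excluded.
SELECT id FROM user WHERE year <> 1994;

-- Explicitly include them:
SELECT id FROM user WHERE year <> 1994 OR year IS NULL;

-- MySQL/MariaDB only: the NULL-safe equality operator.
SELECT id FROM user WHERE NOT (year <=> 1994);

-- CONCAT() returns NULL if any argument is NULL; COALESCE supplies a fallback.
SELECT COALESCE(CONCAT(year, ' years old'), 'unknown') FROM user;
```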
Bad Reasons for NULL
● Because columns are NULLable by default
● To indicate that a value doesn’t exist
○ Use a special value instead: '' or -1 or 0 or …
○ But this is not always a bad reason: UNIQUE allows multiple NULLs
● Using your tables as spreadsheets
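The UNIQUE exception can be sketched as follows, in MySQL syntax. The supplier table is made up for illustration: the VAT number is optional, but must be unique when present.

```sql
-- Hypothetical table (MySQL syntax): vat_number is optional but must be
-- unique when present.
CREATE TABLE supplier (
    id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    name VARCHAR(100) NOT NULL,
    vat_number VARCHAR(20) NULL,
    UNIQUE unq_vat_number (vat_number)
);

-- Both rows are accepted: NULLs are not considered duplicates of each other.
INSERT INTO supplier (name, vat_number) VALUES ('Acme', NULL), ('Globex', NULL);

-- With '' as the "no value" marker, the second row would violate the
-- UNIQUE constraint.
```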
Spreadsheet Example
CREATE TABLE user (
id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
first_name VARCHAR(100) NOT NULL,
last_name VARCHAR(100) NOT NULL,
email VARCHAR(100) NOT NULL,
-- if a user may have multiple URLs, let’s move them
-- to a separate table:
-- url { id, user_id, url }
url_1 VARCHAR(100),
url_2 VARCHAR(100),
url_3 VARCHAR(100),
url_4 VARCHAR(100),
url_5 VARCHAR(100)
);
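The separate table suggested in the comment could look like this. It is a sketch in MySQL syntax: the column types and the index name are assumptions.

```sql
-- One row per URL instead of one column per URL.
CREATE TABLE url (
    id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    user_id INT UNSIGNED NOT NULL,
    url VARCHAR(100) NOT NULL,
    INDEX idx_user_id (user_id)
);

-- All URLs of one user, however many there are (24 is a made-up id):
SELECT url FROM url WHERE user_id = 24;
```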
Spreadsheet Example
CREATE TABLE user (
id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
first_name VARCHAR(100) NOT NULL,
last_name VARCHAR(100) NOT NULL,
email VARCHAR(100) NOT NULL,
-- if we may or may not have a user’s bank data,
-- let’s move them to another table:
-- bank { user_id, account_no, account_holder, ... }
bank_account_no VARCHAR(50),
bank_account_holder VARCHAR(100),
bank_iban VARCHAR(100),
bank_swift_code VARCHAR(11)
);
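A possible shape for the bank table mentioned in the comment, in MySQL syntax. The column types and the NOT NULL choices are assumptions: the idea is that a row exists only when the bank data is known.

```sql
CREATE TABLE bank (
    user_id INT UNSIGNED PRIMARY KEY,   -- at most one row per user
    account_no VARCHAR(50) NOT NULL,
    account_holder VARCHAR(100) NOT NULL,
    iban VARCHAR(34) NOT NULL,          -- IBANs are at most 34 characters
    swift_code VARCHAR(11) NOT NULL     -- SWIFT/BIC codes are 8 or 11 characters
);

-- Users with their bank data, if any:
SELECT u.id, u.email, b.iban
FROM user AS u
LEFT JOIN bank AS b ON b.user_id = u.id;
```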
Hierarchies
Category Hierarchies
Antipattern: column-per-level
TABLE product (id, category_name, subcategory_name, name, price, ...)
-----
TABLE category (id, name)
TABLE product (id, category_id, subcategory_id, name, price, ...)
Possible problems:
● To add or delete a level, we need to add or drop a column
● A subcategory can be erroneously linked to multiple categories
● A category can be erroneously used as subcategory, and vice versa
Category Hierarchies
A better way:
TABLE category (id, parent_id, name)
TABLE product (id, category_id, name, price, ...)
Possible problems:
● Circular dependencies (must be prevented at application level)
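With a parent_id column, a whole subtree can be read with a recursive query. This is a sketch, assuming the category table above and a made-up starting id. If the data contains a cycle, the recursion does not end on its own (MySQL stops it with an error once cte_max_recursion_depth is exceeded).

```sql
-- Requires recursive CTE support (MySQL 8.0+, MariaDB 10.2+, PostgreSQL).
-- 24 is a made-up category id.
WITH RECURSIVE subtree AS (
    -- Anchor: the starting category
    SELECT id, parent_id, name
    FROM category
    WHERE id = 24
    UNION ALL
    -- Recursive step: children of the rows found so far
    SELECT c.id, c.parent_id, c.name
    FROM category AS c
    INNER JOIN subtree AS s ON c.parent_id = s.id
)
SELECT id, name FROM subtree;
```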
Category Networks
What if every category can have multiple parents?
Antipattern:
TABLE category (id, parent_id1, parent_id2, name)
Category Graphs
If every category can have multiple parents, correct pattern:
TABLE category (id, name)
TABLE category_relationship (parent_id, child_id)
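The relationship table could be declared like this, in MySQL syntax. The key and index choices are assumptions, not part of the original slide.

```sql
CREATE TABLE category_relationship (
    parent_id INT UNSIGNED NOT NULL,
    child_id INT UNSIGNED NOT NULL,
    PRIMARY KEY (parent_id, child_id),  -- each link is stored once
    INDEX idx_child_id (child_id)       -- to look up the parents of a child
);

-- All parents of category 24 (a made-up id):
SELECT p.id, p.name
FROM category_relationship AS r
INNER JOIN category AS p ON p.id = r.parent_id
WHERE r.child_id = 24;
```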
Antipattern: Parent List
Storing the whole path of ancestors in a single column:
TABLE category (id, name, parent_list)
INSERT INTO category (parent_list, name) VALUES
('sports/football/wear', 'football shoes');
● This antipattern is sometimes used because it simplifies certain aspects
● But it overcomplicates other aspects
● Also, until recently MySQL and MariaDB did not support recursive queries,
but now they do
Storing Lists
Tags Column
● Suppose you want to store user-typed tags for posts
● You may be tempted to:
CREATE TABLE post (
id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
tags VARCHAR(200)
);
INSERT INTO post (tags, ... ) VALUES ('sunday,venus', ... );
Tags Column
● But what about this query?
SELECT id FROM post WHERE tags LIKE '%sun%';
● Mmm, maybe this is better:
INSERT INTO post (tags, ... ) VALUES (',events,diversity,', ... );
SELECT id FROM post WHERE tags LIKE '%,sun,%';
However, this query cannot take advantage of indexes
Tag Table
CREATE TABLE post (
id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
...
);
CREATE TABLE tag (
post_id INT UNSIGNED,
tag VARCHAR(50),
PRIMARY KEY (post_id, tag),
INDEX (tag)
);
It works.
Queries will be able to use indexes.
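Some typical queries against the tag table, as a sketch. The tag names and the post id are made up.

```sql
-- Exact match on one tag: can use INDEX (tag)
SELECT post_id FROM tag WHERE tag = 'sun';

-- Posts that have both tags. COUNT(*) = 2 is safe because the primary key
-- (post_id, tag) prevents duplicate tags on a post.
SELECT post_id
FROM tag
WHERE tag IN ('sun', 'venus')
GROUP BY post_id
HAVING COUNT(*) = 2;

-- All tags of one post: can use the primary key (post_id, tag)
SELECT tag FROM tag WHERE post_id = 24;
```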
Tag Array
-- PostgreSQL
CREATE TABLE post (
id SERIAL PRIMARY KEY,
tags TEXT[]
);
CREATE INDEX idx_tags ON post USING GIN (tags);
-- MySQL 8.0.17+ (multi-valued index)
CREATE TABLE post (
id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
tags JSON DEFAULT (JSON_ARRAY()),
INDEX idx_tags ((CAST(tags->'$[*]' AS CHAR(50) ARRAY)))
);
-- MariaDB can store JSON arrays,
-- but since it cannot index them this solution is not viable
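Searching for a tag then looks roughly like this. It is a sketch that assumes the post tables above. Whether an index is actually used depends on how it was defined, so check with EXPLAIN.

```sql
-- PostgreSQL: the @> ("contains") operator can use a GIN index on tags.
SELECT id FROM post WHERE tags @> ARRAY['sun'];

-- MySQL 8.0.17+: MEMBER OF tests whether a value is an element of a JSON array.
-- A multi-valued index can help if it is defined on the same expression.
SELECT id FROM post WHERE 'sun' MEMBER OF (tags->'$[*]');
```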
Inheritance and Polymorphism
Not So Different Entities
● Your DB has users, landlords and tenants
● Separate entities with different info
● But sometimes you treat them as one thing
● What to do?
Inheritance
● In the simplest case, they are just subclasses
● For example, landlords and tenants could be types of users
● Common properties are in the parent class
-- relational way to represent it:
TABLE user (id, first_name, last_name, email)
TABLE landlord (id, user_id, vat_number)
TABLE tenant (id, user_id, landlord_id)
PostgreSQL allows you to do this in a more object-oriented way, with Table Inheritance
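Reading a "subclass" together with its inherited properties is a JOIN on user_id. A sketch in MySQL syntax, using the tables above (the last name is a made-up value):

```sql
-- All landlords, with the properties inherited from user:
SELECT u.id, u.first_name, u.last_name, u.email, l.vat_number
FROM landlord AS l
INNER JOIN user AS u ON u.id = l.user_id;

-- Queries that treat everyone as one thing only need the parent table:
SELECT id, email FROM user WHERE last_name = 'Picard';
```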
Different Entities
● But sometimes it’s better to consider them different entities
● Antipattern: Union View
CREATE VIEW everyone AS
(SELECT id, first_name, last_name FROM landlord)
UNION
(SELECT id, first_name, last_name FROM tenant)
;
This makes some queries less verbose, at the cost of making them
potentially very slow
Unicity Across Tables /1
● But maybe both landlords and tenants have emails,
and we want to make sure they are UNIQUE
● Question: is there a practical reason?
Unicity Across Tables /2
● If it is necessary, you’re thinking about the problem in the wrong way
● If emails need to be unique, they are an entity of their own, so you’ll
guarantee unicity on a single table
TABLE landlord (id, first_name, last_name, vat_number)
TABLE tenant (id, first_name, last_name, landlord_id)
TABLE email (id, email UNIQUE, landlord_id, tenant_id)
Bloody hell! The solution initially looks great, but linking emails to landlords or
tenants in that way is horrific!
Unicity Across Tables /2bis
Why?
● Cannot build foreign keys (I don’t recommend it, but…)
● If in the future we want to link emails to suppliers, employees, etc, we’ll need
to add columns to the table
Unicity Across Tables /3
Even if we keep the landlord and tenant tables separated,
we can create a superset called person.
We decided it’s not a parent class, so it can just have an id column.
Every landlord, tenant and email is linked to a person.
TABLE landlord (id, person_id, first_name, last_name, vat_number)
TABLE tenant (id, person_id, first_name, last_name, landlord_id)
TABLE person (id)
TABLE email (id, person_id, email UNIQUE)
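In MySQL syntax, the person and email tables might be declared as follows. Column types and index names are assumptions. The UNIQUE index on email is what guarantees unicity across landlords and tenants.

```sql
CREATE TABLE person (
    id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY
);

CREATE TABLE email (
    id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    person_id INT UNSIGNED NOT NULL,
    email VARCHAR(100) NOT NULL,
    UNIQUE unq_email (email),        -- one owner per address, whatever their role
    INDEX idx_person_id (person_id)
);

-- Emails of one landlord, resolved through person (24 is a made-up id):
SELECT e.email
FROM landlord AS l
INNER JOIN email AS e ON e.person_id = l.person_id
WHERE l.id = 24;
```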
Heterogeneous Rows
Catalog of Products
Imagine we have a catalogue of products where:
● Every product has certain common characteristics
● It’s important to be able to run queries on all products
○ SELECT id FROM p WHERE qty = 0;
○ SELECT MAX(price) FROM p GROUP BY vendor;
● Each product type also has a unique set of characteristics
Antipattern: Spreadsheet Table
● Keep all products in the same table
● Add a column for every characteristic that applies to at least one product
● Where a column doesn’t make sense, set to NULL
Problems:
● Too many columns and indexes
○ Generally bad for query performance, especially INSERTs
○ Generally bad for operations (repair, backup, restore, ALTER TABLE…)
● Adding/removing a product type means adding/removing a set of columns
○ But in practice columns will hardly be removed and will remain unused
● NULL means both “no value for this product” and “doesn’t apply to this type of
products”, leading to endless confusion
Antipattern: Table per Type
● Store products of different types in different tables
Problems:
● Data becomes metadata
○ How to get the list of product types?
● Some queries become overcomplicated
○ Get the ids of out-of-stock products
○ Most expensive product for each vendor
Hybrid
● A single table for characteristics common to all product types
● A separate table per product type, for non-common characteristics
Problems:
● Many JOINs
● Adding/removing product types means adding/removing tables
Semi-Structured Data
● A single table for all products
● A regular column for each characteristic common to all product types
● A semi-structured column for all type-specific characteristics
○ JSON, HStore…
○ Not arrays
○ Not CSV
● Proper indexes on semi-structured data (depending on your technology)
Problems:
● Still a big table
● Queries on semi-structured data may be complicated and not supported by
ORMs
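One way to sketch this in MySQL (5.7.13 or later) is a JSON column plus an indexed generated column for an attribute that is queried often. The table layout, the attributes column and the material attribute are assumptions made up for illustration.

```sql
-- Hypothetical table (MySQL syntax).
CREATE TABLE product (
    id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    name VARCHAR(100) NOT NULL,
    vendor VARCHAR(100) NOT NULL,
    price DECIMAL(10, 2) NOT NULL,
    qty INT UNSIGNED NOT NULL,
    -- Type-specific characteristics, e.g. {"material": "wood"}
    attributes JSON NOT NULL,
    -- Generated column extracting one attribute, so that it can be indexed
    material VARCHAR(50) GENERATED ALWAYS AS (attributes->>'$.material') VIRTUAL,
    INDEX idx_material (material)
);

-- Common characteristics are queried as usual:
SELECT id FROM product WHERE qty = 0;

-- Type-specific characteristic, which can use idx_material:
SELECT id, name FROM product WHERE material = 'wood';
```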
Antipattern: Entity,Attribute,Value
TABLE entity (id, name)
TABLE attribute (id, entity_id, name)
TABLE value (id, attribute_id, value)
● Each product type is an entity
● Each type’s characteristics are stored in attribute
● Each product is a set of values
Example:
Entity { id: 24, name: "Bed" }
Attribute [ { id: 123, entity_id: 24, name: "material" }, ... ]
Value [ { id: 999, attribute_id: 123, value: "wood" } ]
Antipattern: Entity,Attribute,Value
Problems:
● We JOIN 3 tables every time we want to get a single value!
● All values must be treated as texts
○ Unless we create multiple value tables: int_value, text_value...
○ Which means, even more JOINs
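To make the cost concrete, this is roughly what reading a single value looks like with the three tables above (MySQL syntax, using the "Bed" example):

```sql
-- Material of the "Bed" entity.
-- `value` is quoted because VALUE is a keyword in MySQL.
SELECT v.value
FROM entity AS e
INNER JOIN attribute AS a ON a.entity_id = e.id
INNER JOIN `value` AS v ON v.attribute_id = a.id
WHERE e.name = 'Bed'
  AND a.name = 'material';
```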
Misc Antipatterns
Names Beyond Comprehension
● I saw the following table names in production:
○ marco2015 # Marco was the table’s creator
○ jan2015 # jan was the month
○ tmp_tmp_tmp_fix
○ tmp_fix_fix_fix # Because symmetry is cool
I forgot many other examples because...
“Ultimate horror often paralyses memory in a merciful way.”
― H.P. Lovecraft
Data in Metadata
● Include data in table names
○ invoice_2020, invoice_2019, invoice_2018…
● Use a year column instead
● If the table is too big, there are other ways to contain the problem
(partitioning)
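A sketch of a single invoice table partitioned by year, in MySQL syntax. The column names and partition boundaries are made up. MySQL requires the partitioning column to be part of every unique key, hence the composite primary key.

```sql
CREATE TABLE invoice (
    id INT UNSIGNED AUTO_INCREMENT,
    invoice_year SMALLINT UNSIGNED NOT NULL,
    amount DECIMAL(10, 2) NOT NULL,
    PRIMARY KEY (id, invoice_year)
)
PARTITION BY RANGE (invoice_year) (
    PARTITION p2018 VALUES LESS THAN (2019),
    PARTITION p2019 VALUES LESS THAN (2020),
    PARTITION p2020 VALUES LESS THAN (2021),
    PARTITION pmax  VALUES LESS THAN MAXVALUE
);

-- The year is data again, so one query covers any range of years:
SELECT invoice_year, SUM(amount)
FROM invoice
WHERE invoice_year >= 2019
GROUP BY invoice_year;
```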
Bad Names in General
● A name should tell everyone what a table or column is
○ Even new hires!
○ Even you… 5 years from now!
● Otherwise people have to look at other documentation sources
○ ...which typically don’t exist
● Names should follow a standard across all company databases
○ singular/plural, long/short names, ...
● So people don’t have to check what a table or column is called exactly
Thank you for listening!
federico-razzoli.com/services
Telegram channel:
open_source_databases

More Related Content

Similar to Database Design most common pitfalls

NoSQL Solutions - a comparative study
NoSQL Solutions - a comparative studyNoSQL Solutions - a comparative study
NoSQL Solutions - a comparative study
Guillaume Lefranc
 

Similar to Database Design most common pitfalls (20)

How MySQL can boost (or kill) your application v2
How MySQL can boost (or kill) your application v2How MySQL can boost (or kill) your application v2
How MySQL can boost (or kill) your application v2
 
Database design best practices
Database design best practicesDatabase design best practices
Database design best practices
 
Introduction to Databases - query optimizations for MySQL
Introduction to Databases - query optimizations for MySQLIntroduction to Databases - query optimizations for MySQL
Introduction to Databases - query optimizations for MySQL
 
Advanced MariaDB features that developers love.pdf
Advanced MariaDB features that developers love.pdfAdvanced MariaDB features that developers love.pdf
Advanced MariaDB features that developers love.pdf
 
DBMS 4 | MySQL - DDL & DML Commands
DBMS 4 | MySQL - DDL & DML CommandsDBMS 4 | MySQL - DDL & DML Commands
DBMS 4 | MySQL - DDL & DML Commands
 
NoSQL Solutions - a comparative study
NoSQL Solutions - a comparative studyNoSQL Solutions - a comparative study
NoSQL Solutions - a comparative study
 
MySQL Performance Optimization
MySQL Performance OptimizationMySQL Performance Optimization
MySQL Performance Optimization
 
Ledingkart Meetup #2: Scaling Search @Lendingkart
Ledingkart Meetup #2: Scaling Search @LendingkartLedingkart Meetup #2: Scaling Search @Lendingkart
Ledingkart Meetup #2: Scaling Search @Lendingkart
 
No sql bigdata and postgresql
No sql bigdata and postgresqlNo sql bigdata and postgresql
No sql bigdata and postgresql
 
SQL.pptx
SQL.pptxSQL.pptx
SQL.pptx
 
CS121Lec04.pdf
CS121Lec04.pdfCS121Lec04.pdf
CS121Lec04.pdf
 
Database
Database Database
Database
 
Open Source 1010 and Quest InSync presentations March 30th, 2021 on MySQL Ind...
Open Source 1010 and Quest InSync presentations March 30th, 2021 on MySQL Ind...Open Source 1010 and Quest InSync presentations March 30th, 2021 on MySQL Ind...
Open Source 1010 and Quest InSync presentations March 30th, 2021 on MySQL Ind...
 
Mangala Deshpande MySQL0710.ppt
Mangala Deshpande MySQL0710.pptMangala Deshpande MySQL0710.ppt
Mangala Deshpande MySQL0710.ppt
 
Really Big Elephants: PostgreSQL DW
Really Big Elephants: PostgreSQL DWReally Big Elephants: PostgreSQL DW
Really Big Elephants: PostgreSQL DW
 
Scylla Summit 2022: Migrating SQL Schemas for ScyllaDB: Data Modeling Best Pr...
Scylla Summit 2022: Migrating SQL Schemas for ScyllaDB: Data Modeling Best Pr...Scylla Summit 2022: Migrating SQL Schemas for ScyllaDB: Data Modeling Best Pr...
Scylla Summit 2022: Migrating SQL Schemas for ScyllaDB: Data Modeling Best Pr...
 
lec02-data-models-sql-basics.pptx
lec02-data-models-sql-basics.pptxlec02-data-models-sql-basics.pptx
lec02-data-models-sql-basics.pptx
 
Ms sql-server
Ms sql-serverMs sql-server
Ms sql-server
 
Developers’ mDay 2019. - Bogdan Kecman, Oracle – MySQL 8.0 – why upgrade
Developers’ mDay 2019. - Bogdan Kecman, Oracle – MySQL 8.0 – why upgradeDevelopers’ mDay 2019. - Bogdan Kecman, Oracle – MySQL 8.0 – why upgrade
Developers’ mDay 2019. - Bogdan Kecman, Oracle – MySQL 8.0 – why upgrade
 
MySQL performance tuning
MySQL performance tuningMySQL performance tuning
MySQL performance tuning
 

More from Federico Razzoli

More from Federico Razzoli (18)

Webinar - Unleash AI power with MySQL and MindsDB
Webinar - Unleash AI power with MySQL and MindsDBWebinar - Unleash AI power with MySQL and MindsDB
Webinar - Unleash AI power with MySQL and MindsDB
 
MariaDB Security Best Practices
MariaDB Security Best PracticesMariaDB Security Best Practices
MariaDB Security Best Practices
 
A first look at MariaDB 11.x features and ideas on how to use them
A first look at MariaDB 11.x features and ideas on how to use themA first look at MariaDB 11.x features and ideas on how to use them
A first look at MariaDB 11.x features and ideas on how to use them
 
MariaDB stored procedures and why they should be improved
MariaDB stored procedures and why they should be improvedMariaDB stored procedures and why they should be improved
MariaDB stored procedures and why they should be improved
 
Webinar - MariaDB Temporal Tables: a demonstration
Webinar - MariaDB Temporal Tables: a demonstrationWebinar - MariaDB Temporal Tables: a demonstration
Webinar - MariaDB Temporal Tables: a demonstration
 
Webinar - Key Reasons to Upgrade to MySQL 8.0 or MariaDB 10.11
Webinar - Key Reasons to Upgrade to MySQL 8.0 or MariaDB 10.11Webinar - Key Reasons to Upgrade to MySQL 8.0 or MariaDB 10.11
Webinar - Key Reasons to Upgrade to MySQL 8.0 or MariaDB 10.11
 
MariaDB 10.11 key features overview for DBAs
MariaDB 10.11 key features overview for DBAsMariaDB 10.11 key features overview for DBAs
MariaDB 10.11 key features overview for DBAs
 
Recent MariaDB features to learn for a happy life
Recent MariaDB features to learn for a happy lifeRecent MariaDB features to learn for a happy life
Recent MariaDB features to learn for a happy life
 
Automate MariaDB Galera clusters deployments with Ansible
Automate MariaDB Galera clusters deployments with AnsibleAutomate MariaDB Galera clusters deployments with Ansible
Automate MariaDB Galera clusters deployments with Ansible
 
Creating Vagrant development machines with MariaDB
Creating Vagrant development machines with MariaDBCreating Vagrant development machines with MariaDB
Creating Vagrant development machines with MariaDB
 
MariaDB, MySQL and Ansible: automating database infrastructures
MariaDB, MySQL and Ansible: automating database infrastructuresMariaDB, MySQL and Ansible: automating database infrastructures
MariaDB, MySQL and Ansible: automating database infrastructures
 
Playing with the CONNECT storage engine
Playing with the CONNECT storage enginePlaying with the CONNECT storage engine
Playing with the CONNECT storage engine
 
MariaDB Temporal Tables
MariaDB Temporal TablesMariaDB Temporal Tables
MariaDB Temporal Tables
 
MySQL and MariaDB Backups
MySQL and MariaDB BackupsMySQL and MariaDB Backups
MySQL and MariaDB Backups
 
JSON in MySQL and MariaDB Databases
JSON in MySQL and MariaDB DatabasesJSON in MySQL and MariaDB Databases
JSON in MySQL and MariaDB Databases
 
MySQL Transaction Isolation Levels (lightning talk)
MySQL Transaction Isolation Levels (lightning talk)MySQL Transaction Isolation Levels (lightning talk)
MySQL Transaction Isolation Levels (lightning talk)
 
Cassandra sharding and consistency (lightning talk)
Cassandra sharding and consistency (lightning talk)Cassandra sharding and consistency (lightning talk)
Cassandra sharding and consistency (lightning talk)
 
MariaDB Temporal Tables
MariaDB Temporal TablesMariaDB Temporal Tables
MariaDB Temporal Tables
 

Recently uploaded

TECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providerTECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service provider
mohitmore19
 

Recently uploaded (20)

call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
 
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdfLearn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
 
A Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docxA Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docx
 
HR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.comHR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.com
 
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
 
Define the academic and professional writing..pdf
Define the academic and professional writing..pdfDefine the academic and professional writing..pdf
Define the academic and professional writing..pdf
 
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
 
Direct Style Effect Systems - The Print[A] Example - A Comprehension Aid
Direct Style Effect Systems -The Print[A] Example- A Comprehension AidDirect Style Effect Systems -The Print[A] Example- A Comprehension Aid
Direct Style Effect Systems - The Print[A] Example - A Comprehension Aid
 
Azure_Native_Qumulo_High_Performance_Compute_Benchmarks.pdf
Azure_Native_Qumulo_High_Performance_Compute_Benchmarks.pdfAzure_Native_Qumulo_High_Performance_Compute_Benchmarks.pdf
Azure_Native_Qumulo_High_Performance_Compute_Benchmarks.pdf
 
5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdf5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdf
 
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time ApplicationsUnveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
 
8257 interfacing 2 in microprocessor for btech students
8257 interfacing 2 in microprocessor for btech students8257 interfacing 2 in microprocessor for btech students
8257 interfacing 2 in microprocessor for btech students
 
Optimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTVOptimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTV
 
Exploring the Best Video Editing App.pdf
Exploring the Best Video Editing App.pdfExploring the Best Video Editing App.pdf
Exploring the Best Video Editing App.pdf
 
Microsoft AI Transformation Partner Playbook.pdf
Microsoft AI Transformation Partner Playbook.pdfMicrosoft AI Transformation Partner Playbook.pdf
Microsoft AI Transformation Partner Playbook.pdf
 
Unlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language ModelsUnlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language Models
 
TECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providerTECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service provider
 
Right Money Management App For Your Financial Goals
Right Money Management App For Your Financial GoalsRight Money Management App For Your Financial Goals
Right Money Management App For Your Financial Goals
 
Introducing Microsoft’s new Enterprise Work Management (EWM) Solution
Introducing Microsoft’s new Enterprise Work Management (EWM) SolutionIntroducing Microsoft’s new Enterprise Work Management (EWM) Solution
Introducing Microsoft’s new Enterprise Work Management (EWM) Solution
 
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
 

Database Design most common pitfalls

  • 2. € whoami ● Federico Razzoli ● Freelance consultant ● Working with databases since 2000 hello@federico-razzoli.com federico-razzoli.com ● I worked as a consultant for Percona and Ibuildings (mainly MySQL and MariaDB) ● I worked as a DBA for fast-growing companies like Catawiki, HumanState, TransferWise
  • 3. Agenda We will talk about… ● The most common design bad practices ● Information that is not easy to represent ● Relational model: why? ● Keys and indexes ● Data types ● Abusing NULL ● Hierarchies (trees) ● Lists ● Inheritance & polymorphism ● Heterogeneous rows ● Misc
  • 5. Criteria ● Queries should be fast ● Data structures should be reasonably simple ● Design must be reasonably extendable
  • 7. Specific Use Cases ● Some databases are designed for specific use cases ● In those cases, they may work much better than generic technologies ● Using them when not necessary may lead to use many technologies ● A technology should only be introduced if our company has: ○ Skills ○ Knowledge necessary for troubleshooting ○ Backups ○ High Availability ○ ...
  • 8. Relational is flexible With the relational model we: ● Are sure that data is written correctly (transactions) ● Can make sure that data is valid (schema, integrity constraints) ● Design tables with access patterns in mind ● To run a query we initially didn’t consider, most of the times we can just add an index
  • 9. Flexibility example CREATE TABLE user ( id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY, name VARCHAR(100) NOT NULL, surname VARCHAR(100) NOT NULL, email VARCHAR(100) NOT NULL UNIQUE ); SELECT * FROM user WHERE id = 24; SELECT name, surname FROM user WHERE email = 'picard@starfleet.earth'; CREATE INDEX idx_surname_name ON user (surname, name); SELECT name, surname FROM user WHERE surname LIKE 'B%' ORDER BY surname, name;
  • 10. When Relational is not a good fit ● Heterogeneous data (product catalogue) ● Searchable text ● Graphs ● … However, for simple use cases relational databases include non-relational features, like: ● JSON type and functions ● Arrays (PostgreSQL) ● Fulltext indexes ● ...
  • 12. Primary Key ● Column or set of columns that identifies each row (unique, not null) ● Usually you want to create an artificial column for this: ○ id ○ or uuid
  • 13. Poor Primary Keys ● No primary key! ○ In MySQL this causes many performance problems ○ CDC applications need a way to identify each row ● Wrong columns ○ email ■ An email can change over time ■ An email address can be assigned to another person ■ The primary key is a PII! ○ name (eg: city name, product name…) ■ Quite long, especially if it must be UTF-8 ■ Certain names can change over time ○ timestamp ■ Multiple rows could be created at the same timestamp! ■ Long ○ ...
  • 14. UNIQUE ● An index whose values are distinct, or NULL ● Could theoretically be a primary key, but it’s not
  • 15. Poor UNIQUE keys ● Columns whose values will always be distinct, no matter if there is an index or not ○ Enforcing unicity implies extra reads, possibly on disk ● Columns that could have duplicates, but they’re unlikely ○ timestamp ○ (last_name, first_name)
  • 16. Foreign Keys ● References to another table (user.city_id -> city.id) ● In most cases they are bad for performance ● They create problems for operations (ALTER TABLE) ● In MySQL they are not compatible with some other features ○ They don’t activate triggers ○ Table partitioning ○ Tables not using InnoDB ○ Many bugs
  • 17. Indexing Bad Practices ● Indexing all columns: it won’t work ● Multi-columns indexes in random order ● Indexing columns with few distinct values (eg, boolean) ○ Unless you know what you’re doing ● Indexes contained in other indexes: idx1 (email), idx2 (email, last_name) idx (email, id) UNIQUE unq1 (email), INDEX idx1 (email) ● Non-descriptive index names (like the ones above) Looking at an index name (EXPLAIN), I should know which columns it contains
  • 18. Quick hints ● Learn how indexes work ○ Google: Federico Razzoli indexes bad practices ● Use pt-duplicate-key-checker, from Percona Toolkit
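The "indexes contained in other indexes" case can be sketched in a few lines of Python. This is only an illustration of the idea, not how pt-duplicate-key-checker works internally; the function name and the input format (index name mapped to a tuple of columns plus a unique flag) are assumptions made for this example.

```python
def redundant_indexes(indexes):
    """Return sorted (redundant, covering) index name pairs.

    `indexes` maps an index name to (columns, is_unique), where columns
    is the ordered tuple of indexed columns. A non-unique index is
    reported when its columns are a leftmost prefix of another index.
    """
    pairs = []
    for name_a, (cols_a, unique_a) in indexes.items():
        if unique_a:
            continue  # a UNIQUE index also enforces a constraint, keep it
        for name_b, (cols_b, unique_b) in indexes.items():
            if name_a == name_b or cols_b[:len(cols_a)] != cols_a:
                continue
            same_columns = len(cols_a) == len(cols_b)
            # Two identical non-unique indexes: report only one of them.
            if same_columns and not unique_b and name_a < name_b:
                continue
            pairs.append((name_a, name_b))
    return sorted(pairs)


# The index names from the slide above, with hypothetical definitions.
indexes = {
    "idx1": (("email",), False),
    "idx2": (("email", "last_name"), False),
    "unq1": (("email",), True),
}
redundant = redundant_indexes(indexes)
print(redundant)
```

Here idx1 is reported twice: it is a prefix of idx2 and it duplicates the columns of unq1. The unique index is never reported, because dropping it would also drop the constraint.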
  • 20. Integer Types ● Don’t use bigger types than necessary ● ...but don’t overoptimise when you are not 100% sure. You’ll hardly see a benefit using TINYINT instead of SMALLINT ● MySQL UNSIGNED is good, column’s max is double ● I discourage the use of exotic MySQL syntax like: ○ MEDIUMINT: non-standard, and 3-bytes variables don’t exist in nature ○ INT(length) ○ ZEROFILL
  • 21. Real Numbers ● FLOAT and DOUBLE are fast when aggregating many values ● But they are subject to approximation. Don’t use them for prices, etc ● Instead you can use: ○ DECIMAL ○ INT - Multiply a number by 100, for example ○ DECIMAL is slower if heavy arithmetics is performed on many values ○ But storing a transformed value (price*100) can lead to misunderstandings and bugs
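The approximation problem is easy to reproduce outside the database. The sketch below uses Python's float as a stand-in for DOUBLE and decimal.Decimal as a stand-in for DECIMAL; the sample prices are made up.

```python
from decimal import Decimal

prices = ["0.10", "0.20"]

# DOUBLE-like arithmetic: binary floats cannot represent 0.1 exactly.
float_total = sum(float(p) for p in prices)

# DECIMAL-like arithmetic: exact for decimal fractions.
decimal_total = sum(Decimal(p) for p in prices)

# Integer approach: store the price multiplied by 100 (cents).
cents_total = sum(int(Decimal(p) * 100) for p in prices)

print(float_total == 0.3)
print(decimal_total == Decimal("0.30"))
print(cents_total)
```

The integer version is exact too, but every reader of the column has to know that the value is in cents, which is the source of the misunderstandings mentioned above.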
  • 22. Text Values ● Be sure that VARCHAR columns have adequate size for your data ● In PostgreSQL there is no difference between VARCHAR and TEXT, except that for VARCHAR you specify a max size ● In MySQL TEXT and BLOB columns are stored separately ○ Less data read if you often don’t read those columns ○ More read operations if you always use SELECT * ● CHAR is only good for small fixed-size data. The space saving is tiny.
  • 23. Temporal Types ● TIMESTAMP and DATETIME are mostly interchangeable ● MySQL YEAR is weird. The meaning of 2-digit values changes over time. Use SMALLINT instead. ● MySQL TIME is apparently weird and useless. But not if you consider it as an interval. (range: -838:59:59 .. 838:59:59) ● PostgreSQL has a proper INTERVAL type, which is surely better ● PostgreSQL allows you to specify a time zone for each value (TIMESTAMP WITH TIME ZONE) ○ Timezones depend on policy, economy and religion. They may vary by 15 mins. Timezones are created, dismissed, and changed. In one case a timezone was changed by skipping a whole calendar day. ○ Never deal with timezones yourself, no one ever succeeded in history. Store all dates as UTC, use an external library for conversion.
  • 24. ENUM, SET ● MySQL weird types that include a list of allowed string values ● With ENUM, exactly one value from the list is allowed ● With SET, any number of values from the list are allowed ● '' is always allowed, because. ● Specifying the value by index is allowed, and indexes start at 1, so the number 1 could match '0' ● Adding, dropping and changing values requires an ALTER TABLE ○ And possibly a locking table rebuild
  • 25. Instead of ENUM CREATE TABLE account ( state ENUM('active', 'suspended') NOT NULL, ... )
  • 26. Instead of ENUM CREATE TABLE account ( state_id INT UNSIGNED NOT NULL, ... ); CREATE TABLE state ( id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY, state VARCHAR(100) NOT NULL UNIQUE ); INSERT INTO state (state) VALUES ('active'), ('suspended');
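A minimal runnable sketch of the lookup-table approach, using Python's built-in sqlite3 as a stand-in for MySQL. The column types are SQLite equivalents, and the 'closed' state and the sample accounts are assumptions added for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE state (
        id    INTEGER PRIMARY KEY,
        state TEXT NOT NULL UNIQUE
    );
    CREATE TABLE account (
        id       INTEGER PRIMARY KEY,
        state_id INTEGER NOT NULL
    );
    INSERT INTO state (state) VALUES ('active'), ('suspended');
    INSERT INTO account (state_id) VALUES (1), (2), (1);
""")

# Adding an allowed value is a plain INSERT, not an ALTER TABLE.
conn.execute("INSERT INTO state (state) VALUES ('closed')")

accounts_per_state = conn.execute("""
    SELECT s.state, COUNT(*)
    FROM account a JOIN state s ON s.id = a.state_id
    GROUP BY s.state
    ORDER BY s.state
""").fetchall()
all_states = [row[0] for row in conn.execute("SELECT state FROM state ORDER BY id")]
print(accounts_per_state, all_states)
```

The price is one JOIN whenever the state name is needed, while the list of allowed values becomes ordinary data that can be queried and changed without touching the schema.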
  • 28. NULL anomalies mysql> SELECT NULL = 1 AS a, NULL <> 1 AS b, NULL IS NULL AS c, 1 IS NOT NULL AS d; +------+------+---+---+ | a | b | c | d | +------+------+---+---+ | NULL | NULL | 1 | 1 | +------+------+---+---+ -- This returns TRUE in MySQL: NULL <=> NULL AND 1 <=> 1
  • 29. Problematic queries These queries will not return rows where year or approved is NULL ● WHERE year != 1994 ● WHERE NOT (year = 1994) ● WHERE year > 2000 ● WHERE NOT (year > 2000) ● WHERE approved != TRUE ● WHERE NOT approved And this expression returns NULL when age is NULL: SELECT CONCAT(age, ' years old') FROM user ...
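The effect can be checked with a three-row table. This sketch uses Python's built-in sqlite3 rather than MySQL; the comparison semantics for NULL are the same, and the sample rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE user (id INTEGER PRIMARY KEY, year INTEGER);
    INSERT INTO user (id, year) VALUES (1, 1994), (2, 2001), (3, NULL);
""")


def ids(where):
    """Return the ids matching a WHERE clause (demo only, no user input)."""
    sql = f"SELECT id FROM user WHERE {where} ORDER BY id"
    return [row[0] for row in conn.execute(sql)]


is_1994 = ids("year = 1994")
not_1994 = ids("year != 1994")                     # row 3 is silently skipped
not_1994_or_unknown = ids("year != 1994 OR year IS NULL")
print(is_1994, not_1994, not_1994_or_unknown)
```

A condition and its negation together do not cover the whole table: row 3 appears in neither result until the NULL case is handled explicitly.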
  • 30. Bad Reasons for NULL ● Because columns are NULLable by default ● To indicate that a value doesn’t exist ○ Use a special value instead: '' or -1 or 0 or … ○ But this is not always a bad reason: UNIQUE allows multiple NULLs ● Using your tables as spreadsheets
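The UNIQUE exception mentioned above can be seen in a short sketch. It runs on Python's built-in sqlite3, which, like MySQL and PostgreSQL, accepts multiple NULLs in a UNIQUE column; the table is a hypothetical example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE user (id INTEGER PRIMARY KEY, email TEXT UNIQUE)")

# Two users without an email: both NULLs are accepted.
conn.execute("INSERT INTO user (email) VALUES (NULL), (NULL)")
null_rows = conn.execute(
    "SELECT COUNT(*) FROM user WHERE email IS NULL"
).fetchone()[0]

# The same with '' as the special value: the second row is a duplicate.
try:
    conn.execute("INSERT INTO user (email) VALUES (''), ('')")
    empty_string_rejected = False
except sqlite3.IntegrityError:
    empty_string_rejected = True

print(null_rows, empty_string_rejected)
```

So when a column must be unique but is optional, a special value like '' cannot replace NULL.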
  • 31. Spreadsheet Example CREATE TABLE user ( id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY, first_name VARCHAR(100) NOT NULL, last_name VARCHAR(100) NOT NULL, email VARCHAR(100) NOT NULL, -- if a user may have multiple URLs, let’s move them -- to a separate table: -- url { id, user_id, url } url_1 VARCHAR(100), url_2 VARCHAR(100), url_3 VARCHAR(100), url_4 VARCHAR(100), url_5 VARCHAR(100) );
  • 32. Spreadsheet Example CREATE TABLE user ( id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY, first_name VARCHAR(100) NOT NULL, last_name VARCHAR(100) NOT NULL, email VARCHAR(100) NOT NULL, -- if we may or may not have a user’s bank data, -- let’s move it to another table: -- bank { user_id, account_no, account_holder, ... } bank_account_no VARCHAR(50), bank_account_holder VARCHAR(100), bank_iban VARCHAR(100), bank_swift_code VARCHAR(11) );
  • 34. Category Hierarchies Antipattern: column-per-level TABLE product (id, category_name, subcategory_name, name, price, ..) ----- TABLE category (id, name) TABLE product (id, category_id, subcategory_id, name, price, ...) Possible problems: ● To add or delete a level, we need to add or drop a column ● A subcategory can be erroneously linked to multiple categories ● A category can be erroneously used as subcategory, and vice versa
  • 35. Category Hierarchies A better way: TABLE category (id, parent_id, name) TABLE product (id, category_id, name, price, ...) Possible problems: ● Circular dependencies (must be prevented at application level)
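With the parent_id design, a category's full path can be rebuilt with a recursive query. The sketch below runs on Python's built-in sqlite3 as a stand-in; the sample categories are assumptions, and root categories use a NULL parent_id here, which is one possible convention.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE category (
        id        INTEGER PRIMARY KEY,
        parent_id INTEGER,            -- NULL for a root category
        name      TEXT NOT NULL
    );
    INSERT INTO category (id, parent_id, name) VALUES
        (1, NULL, 'sports'),
        (2, 1,    'football'),
        (3, 2,    'wear'),
        (4, 1,    'tennis');
""")


def path(category_id):
    """Walk up from a category to its root and return the path as text."""
    rows = conn.execute("""
        WITH RECURSIVE ancestors (id, parent_id, name, depth) AS (
            SELECT id, parent_id, name, 0 FROM category WHERE id = ?
            UNION ALL
            SELECT c.id, c.parent_id, c.name, a.depth + 1
            FROM category c JOIN ancestors a ON c.id = a.parent_id
        )
        SELECT name FROM ancestors ORDER BY depth DESC
    """, (category_id,))
    return "/".join(row[0] for row in rows)


print(path(3))
```

If the data contained a circular dependency, this recursion would never reach a root, which is why cycles have to be prevented when rows are written.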
  • 36. Category Networks What if every category can have multiple parents? Antipattern: TABLE category (id, parent_id1, parent_id2, name)
  • 37. Category Graphs If every category can have multiple parents, correct pattern: TABLE category (id, name) TABLE category_relationship (parent_id, child_id)
  • 38. Antipattern: Parent List Storing the list of ancestors in a single column: TABLE category (id, name, parent_list) INSERT INTO category (parent_list, name) VALUES ('sports/football/wear', 'football shoes'); ● This antipattern is sometimes used because it simplifies certain aspects ● But it overcomplicates other aspects ● Also, until recently MySQL and MariaDB did not support recursive queries, but now they do
  • 40. Tags Column ● Suppose you want to store user-typed tags for posts ● You may be tempted to: CREATE TABLE post ( id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY, tags VARCHAR(200) ); INSERT INTO post (tags, ... ) VALUES ('sunday,venus', ... );
  • 41. Tags Column ● But what about this query? SELECT id FROM post WHERE tags LIKE '%sun%'; ● Mmm, maybe this is better: INSERT INTO post (tags, ... ) VALUES (',events,diversity,', ... ); SELECT id FROM post WHERE tags LIKE '%,sun,%'; However, this query cannot take advantage of indexes
  • 42. Tag Table CREATE TABLE post ( id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY, ... ); CREATE TABLE tag ( post_id INT UNSIGNED, tag VARCHAR(50), PRIMARY KEY (post_id, tag), INDEX (tag) ); It works. Queries will be able to use indexes.
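A runnable sketch of the tag table, using Python's built-in sqlite3 as a stand-in for MySQL. The sample posts and tags are made up; the point is that an exact match on tag does not confuse 'sun' with 'sunday', unlike LIKE '%sun%' on a comma-separated column.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE post (id INTEGER PRIMARY KEY, title TEXT NOT NULL);
    CREATE TABLE tag (
        post_id INTEGER NOT NULL,
        tag     TEXT NOT NULL,
        PRIMARY KEY (post_id, tag)
    );
    CREATE INDEX idx_tag ON tag (tag);
    INSERT INTO post (id, title) VALUES (1, 'Weekend'), (2, 'Planets');
    INSERT INTO tag (post_id, tag) VALUES
        (1, 'sunday'), (1, 'events'), (2, 'sun'), (2, 'venus');
""")


def posts_tagged(tag):
    """Return the ids of the posts carrying exactly this tag."""
    rows = conn.execute(
        "SELECT post_id FROM tag WHERE tag = ? ORDER BY post_id", (tag,)
    )
    return [row[0] for row in rows]


print(posts_tagged("sun"), posts_tagged("sunday"))
```

Because the condition is an equality on an indexed column, the database can look the tag up in idx_tag instead of scanning every post.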
  • 43. Tag Array -- PostgreSQL CREATE TABLE post ( id INT GENERATED ALWAYS AS IDENTITY PRIMARY KEY, tags TEXT[] ); CREATE INDEX idx_tags ON post USING GIN (tags); -- MySQL 8.0.17+ (multi-valued index) CREATE TABLE post ( id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY, tags JSON DEFAULT (JSON_ARRAY()), INDEX idx_tags ((CAST(tags AS CHAR(50) ARRAY))) ); -- MariaDB can store JSON arrays, -- but since it cannot index them this solution is not viable
  • 45. Not So Different Entities ● Your DB has users, landlords and tenants ● Separate entities with different info ● But sometimes you treat them as one thing ● What to do?
  • 46. Inheritance ● In the simplest case, they are just subclasses ● For example, landlords and tenants could be types of users ● Common properties are in the parent class -- relational way to represent it: TABLE user (id, first_name, last_name, email) TABLE landlord (id, user_id, vat_number) TABLE tenant (id, user_id, landlord_id) PostgreSQL allows to do this in a more object oriented way, with Table Inheritance
  • 47. Different Entities ● But sometimes it’s better to consider them different entities ● Antipattern: Union View CREATE VIEW everyone AS (SELECT id, first_name, last_name FROM landlord) UNION (SELECT id, first_name, last_name FROM tenant) ; This makes some queries less verbose, at the cost of making them potentially very slow
  • 48. Unicity Across Tables /1 ● But maybe both landlords and tenants have emails, and we want to make sure they are UNIQUE ● Question: is there a practical reason?
  • 49. Unicity Across Tables /2 ● If it is necessary, you’re thinking about the problem the wrong way ● If emails need to be unique, they are a whole entity, so you’ll guarantee unicity on a single table TABLE landlord (id, first_name, last_name, vat_number) TABLE tenant (id, first_name, last_name, landlord_id) TABLE email (id, email UNIQUE, landlord_id, tenant_id) Bloody hell! The solution initially looks great, but linking emails to landlords or tenants in that way is horrific!
  • 50. Unicity Across Tables /2bis Why? ● Cannot build foreign keys (I don’t recommend it, but…) ● If in the future we want to link emails to suppliers, employees, etc, we’ll need to add columns to the table
  • 51. Unicity Across Tables /3 Even if we keep the landlord and tenant tables separated, we can create a superset called person. We decided it’s not a parent class, so it can just have an id column. Every landlord, tenant and email is linked to a person. TABLE landlord (id, person_id, first_name, last_name, vat_number) TABLE tenant (id, person_id, first_name, last_name, landlord_id) TABLE person (id) TABLE email (id, person_id, email UNIQUE)
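The person superset can be tried out with a small script. It uses Python's built-in sqlite3 as a stand-in; the names, the VAT number and the email addresses are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE person (id INTEGER PRIMARY KEY);
    CREATE TABLE landlord (
        id INTEGER PRIMARY KEY, person_id INTEGER NOT NULL,
        first_name TEXT NOT NULL, last_name TEXT NOT NULL,
        vat_number TEXT NOT NULL
    );
    CREATE TABLE tenant (
        id INTEGER PRIMARY KEY, person_id INTEGER NOT NULL,
        first_name TEXT NOT NULL, last_name TEXT NOT NULL,
        landlord_id INTEGER NOT NULL
    );
    CREATE TABLE email (
        id INTEGER PRIMARY KEY, person_id INTEGER NOT NULL,
        email TEXT NOT NULL UNIQUE
    );
    INSERT INTO person (id) VALUES (1), (2);
    INSERT INTO landlord (person_id, first_name, last_name, vat_number)
        VALUES (1, 'Ada', 'Rossi', 'VAT-0001');
    INSERT INTO tenant (person_id, first_name, last_name, landlord_id)
        VALUES (2, 'Bruno', 'Verdi', 1);
    INSERT INTO email (person_id, email) VALUES (1, 'ada@example.com');
""")


def add_email(person_id, email):
    """Try to attach an email to a person; return False on a duplicate."""
    try:
        conn.execute(
            "INSERT INTO email (person_id, email) VALUES (?, ?)",
            (person_id, email),
        )
        return True
    except sqlite3.IntegrityError:
        return False


tenant_reuses_landlord_email = add_email(2, "ada@example.com")
tenant_gets_own_email = add_email(2, "bruno@example.com")
print(tenant_reuses_landlord_email, tenant_gets_own_email)
```

The email table never needs to know whether a person is a landlord or a tenant, so linking emails to suppliers or employees later would not require new columns.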
  • 53. Catalog of Products Imagine we have a catalogue of products where: ● Every product has certain common characteristics ● It’s important to be able to run queries on all products ○ SELECT id FROM p WHERE qty = 0; ○ SELECT MAX(price) FROM p GROUP BY vendor; ● Each product type also has a unique set of characteristics
  • 54. Antipattern: Spreadsheet Table ● Keep all products in the same table ● Add a column for every characteristic that applies to at least one product ● Where a column doesn’t make sense, set to NULL Problems: ● Too many columns and indexes ○ Generally bad for query performance, especially INSERTs ○ Generally bad for operations (repair, backup, restore, ALTER TABLE…) ● Adding/removing a product type means adding/removing a set of columns ○ But in practice columns will hardly be removed and will remain unused ● NULL means both “no value for this product” and “doesn’t apply to this type of products”, leading to endless confusion
  • 55. Antipattern: Table per Type ● Store products of different types in different tables Problems: ● Metadata become data ○ How to get the list of product types? ● Some queries become overcomplicated ○ Get the id’s of out of stock products ○ Most expensive product for each vendor
  • 56. Hybrid ● A single table for characteristics common to all product types ● A separate table per product type, for non-common characteristics Problems: ● Many JOINs ● Adding/removing product types means to add/remove tables
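A minimal sketch of the hybrid layout, run on Python's built-in sqlite3 as a stand-in. The bed table, its material column and the sample products are hypothetical, chosen only to show which queries need a JOIN.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Characteristics common to every product type
    CREATE TABLE product (
        id    INTEGER PRIMARY KEY,
        type  TEXT NOT NULL,
        name  TEXT NOT NULL,
        price INTEGER NOT NULL,       -- in cents
        qty   INTEGER NOT NULL
    );
    -- Characteristics that only make sense for beds
    CREATE TABLE bed (
        product_id INTEGER PRIMARY KEY,
        material   TEXT NOT NULL
    );
    INSERT INTO product (id, type, name, price, qty) VALUES
        (1, 'bed',  'Oak bed',   45000, 0),
        (2, 'lamp', 'Desk lamp',  2500, 7);
    INSERT INTO bed (product_id, material) VALUES (1, 'wood');
""")

# Queries on common characteristics touch a single table.
out_of_stock = [
    row[0] for row in conn.execute("SELECT id FROM product WHERE qty = 0")
]

# Type-specific characteristics require a JOIN with the type table.
bed_details = conn.execute("""
    SELECT p.name, b.material
    FROM product p JOIN bed b ON b.product_id = p.id
""").fetchall()
print(out_of_stock, bed_details)
```

Supporting lamps with their own characteristics would mean creating a lamp table, which is the operational cost listed above.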
  • 57. Semi-Structured Data ● A single table for all products ● A regular column for each characteristic common to all product types ● A semi-structured column for all type-specific characteristics ○ JSON, HStore… ○ Not arrays ○ Not CSV ● Proper indexes on semi-structured data (depending on your technology) Problems: ● Still a big table ● Queries on semi-structured data may be complicated and not supported by ORMs
  • 58. Antipattern: Entity,Attribute,Value TABLE entity (id, name) TABLE attribute (id, entity_id, name) TABLE value (id, attribute_id, value) ● Each product type is an entity ● Each type characteristics are stored in attribute ● Each product is a set of values Example: Entity { id: 24, name: "Bed" } Attribute [ { id: 123, entity_id: 24, name: "material" }, ... ] Value [ { id: 999, attribute_id: 123, value: "wood" } ]
  • 59. Antipattern: Entity,Attribute,Value Problems: ● We JOIN 3 tables every time we want to get a single value! ● All values must be treated as texts ○ Unless we create multiple value tables: int_value, text_value... ○ Which means, even more JOINs
  • 61. Names Beyond Comprehension ● I saw the following table names in production: ○ marco2015 # Marco was the table’s creator ○ jan2015 # jan was the month ○ tmp_tmp_tmp_fix ○ tmp_fix_fix_fix # Because symmetry is cool I forgot many other examples because... “Ultimate horror often paralyses memory in a merciful way.” ― H.P. Lovecraft
  • 62. Data in Metadata ● Include data in table names ○ invoice_2020, invoice_2019, invoice_2018… ● Use a year column instead ● If the table is too big, there are other ways to contain the problem (partitioning)
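With a year column, a question that spans several years is a single query rather than a UNION over invoice_2018, invoice_2019 and so on. A small sketch on Python's built-in sqlite3, with made-up amounts stored in cents:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE invoice (
        id           INTEGER PRIMARY KEY,
        year         INTEGER NOT NULL,
        amount_cents INTEGER NOT NULL
    );
    CREATE INDEX idx_year ON invoice (year);
    INSERT INTO invoice (year, amount_cents) VALUES
        (2018, 1000), (2019, 2500), (2019, 500), (2020, 4000);
""")

# One query covers every year; a new year needs no new table.
totals_per_year = conn.execute("""
    SELECT year, SUM(amount_cents)
    FROM invoice
    GROUP BY year
    ORDER BY year
""").fetchall()
print(totals_per_year)
```

The list of years is now data: it can be selected, filtered and joined like any other column.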
  • 63. Bad Names in General ● A name should tell everyone what a table or column is ○ Even to new hires! ○ Even to you… in 5 years from now! ● Otherwise people have to look at other documentation sources ○ ….which typically don’t exist ● Names should follow a standard across all company databases ○ singular/plural, long/short names, ... ● So people don’t have to check what a table / column is called exactly
  • 64. Thank you for listening! federico-razzoli.com/services Telegram channel: open_source_databases