It's easy to see antipatterns in production databases. Our schemas should be simple but extensible, and allow fast SQL queries. In this webinar I discuss what the most common antipatterns are, and how to correct them.
2. $ whoami
● Federico Razzoli
● Freelance consultant
● Working with databases since 2000
hello@federico-razzoli.com
federico-razzoli.com
● I worked as a consultant for Percona and Ibuildings
(mainly MySQL and MariaDB)
● I worked as a DBA for fast-growing companies like
Catawiki, HumanState, TransferWise
3. Agenda
We will talk about…
● The most common design bad practices
● Information that is not easy to represent
● Relational model: why?
● Keys and indexes
● Data types
● Abusing NULL
● Hierarchies (trees)
● Lists
● Inheritance & polymorphism
● Heterogeneous rows
● Misc
7. Specific Use Cases
● Some databases are designed for specific use cases
● In those cases, they may work much better than generic technologies
● Using them when not necessary may lead to using too many technologies
● A technology should only be introduced if our company has:
○ Skills
○ Knowledge necessary for troubleshooting
○ Backups
○ High Availability
○ ...
8. Relational is flexible
With the relational model we:
● Are sure that data is written correctly (transactions)
● Can make sure that data is valid (schema, integrity constraints)
● Design tables with access patterns in mind
● To run a query we initially didn’t consider, most of the time we can just add
an index
9. Flexibility example
CREATE TABLE user (
id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
name VARCHAR(100) NOT NULL,
surname VARCHAR(100) NOT NULL,
email VARCHAR(100) NOT NULL UNIQUE
);
SELECT * FROM user WHERE id = 24;
SELECT name, surname FROM user
WHERE email = 'picard@starfleet.earth';
CREATE INDEX idx_surname_name ON user (surname, name);
SELECT name, surname FROM user
WHERE surname LIKE 'B%'
ORDER BY surname, name;
10. When Relational is not a good fit
● Heterogeneous data (product catalogue)
● Searchable text
● Graphs
● …
However, for simple use cases relational databases include non-relational
features, like:
● JSON type and functions
● Arrays (PostgreSQL)
● Fulltext indexes
● ...
12. Primary Key
● Column or set of columns that identifies each row (unique, not null)
● Usually you want to create an artificial column for this:
○ id
○ or uuid
13. Poor Primary Keys
● No primary key!
○ In MySQL this causes many performance problems
○ CDC applications need a way to identify each row
● Wrong columns
○ email
■ An email can change over time
■ An email address can be assigned to another person
■ The primary key is PII!
○ name (eg: city name, product name…)
■ Quite long, especially if it must be UTF-8
■ Certain names can change over time
○ timestamp
■ Multiple rows could be created at the same timestamp!
■ Long
○ ...
14. UNIQUE
● An index whose values are distinct, or NULL
● Could theoretically be a primary key, but it’s not
15. Poor UNIQUE keys
● Columns whose values will always be distinct, no matter if there is an index or
not
○ Enforcing unicity implies extra reads, possibly on disk
● Columns that could have duplicates, but they’re unlikely
○ timestamp
○ (last_name, first_name)
16. Foreign Keys
● References to another table (user.city_id -> city.id)
● In most cases they are bad for performance
● They create problems for operations (ALTER TABLE)
● In MySQL they are not compatible with some other features
○ Triggers (cascading actions don’t activate them)
○ Table partitioning
○ Tables not using InnoDB
○ Many bugs
17. Indexing Bad Practices
● Indexing all columns: it won’t work
● Multi-column indexes with columns in random order
● Indexing columns with few distinct values (eg, boolean)
○ Unless you know what you’re doing
● Indexes contained in other indexes:
idx1 (email), idx2 (email, last_name)
idx (email, id)
UNIQUE unq1 (email), INDEX idx1 (email)
● Non-descriptive index names (like the ones above)
Looking at an index name (EXPLAIN),
I should know which columns it contains
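As a sketch of the points above, here is how a contained index could be dropped and the surviving one given a descriptive name. The customer table and all names are hypothetical, and RENAME INDEX assumes MySQL 5.7+ or MariaDB 10.5.2+.

```sql
-- Hypothetical table with the redundant pair of indexes from the slide
CREATE TABLE customer (
    id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    email VARCHAR(100) NOT NULL,
    last_name VARCHAR(100) NOT NULL,
    INDEX idx1 (email),
    INDEX idx2 (email, last_name)
);

-- idx1 is a left prefix of idx2, so a query filtering by email can use idx2.
-- Drop the contained index and name the other one after its columns.
ALTER TABLE customer
    DROP INDEX idx1,
    RENAME INDEX idx2 TO idx_email_last_name;
```

With this name, an EXPLAIN output that mentions idx_email_last_name tells you the columns without looking at the table definition.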
18. Quick hints
● Learn how indexes work
○ Google: Federico Razzoli indexes bad practices
● Use pt-duplicate-key-checker, from Percona Toolkit
20. Integer Types
● Don’t use bigger types than necessary
● ...but don’t overoptimise when you are not 100% sure. You’ll hardly see a
benefit using TINYINT instead of SMALLINT
● MySQL UNSIGNED is good: the column’s maximum value doubles
● I discourage the use of exotic MySQL syntax like:
○ MEDIUMINT: non-standard, and 3-bytes variables don’t exist in nature
○ INT(length)
○ ZEROFILL
21. Real Numbers
● FLOAT and DOUBLE are fast when aggregating many values
● But they are subject to approximation. Don’t use them for prices, etc
● Instead you can use:
○ DECIMAL
○ INT - Multiply a number by 100, for example
○ DECIMAL is slower if heavy arithmetic is performed on many values
○ But storing a transformed value (price*100) can lead to
misunderstandings and bugs
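A minimal MySQL sketch of the approximation problem. It relies on the fact that literals written with an exponent are approximate (DOUBLE), while plain decimal literals are exact (DECIMAL). The product_price table is a hypothetical example.

```sql
-- Approximate vs exact arithmetic on the same numbers
SELECT
    0.1E0 + 0.2E0 = 0.3E0 AS double_is_equal,  -- expected: 0
    0.1 + 0.2 = 0.3 AS decimal_is_equal;       -- expected: 1

-- Hypothetical table storing prices as exact values with 2 decimals
CREATE TABLE product_price (
    product_id INT UNSIGNED PRIMARY KEY,
    price DECIMAL(10, 2) NOT NULL
);
```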
22. Text Values
● Be sure that VARCHAR columns have adequate size for your data
● In PostgreSQL there is no difference between VARCHAR and TEXT, except
that for VARCHAR you specify a max size
● In MySQL TEXT and BLOB columns are stored separately
○ Less data read if you often don’t read those columns
○ More read operations if you always use SELECT *
● CHAR is only good for small fixed-size data. The space saving is tiny.
23. Temporal Types
● TIMESTAMP and DATETIME are mostly interchangeable
● MySQL YEAR is weird. The meaning of 2-digit values changes over time. Use
SMALLINT instead.
● MySQL TIME is apparently weird and useless. But not if you consider it as an
interval. (range: -838:59:59 .. 838:59:59)
● PostgreSQL has a proper INTERVAL type, which is surely better
● PostgreSQL allows you to specify a time zone with each value (TIMESTAMP WITH
TIME ZONE)
○ Timezones depend on politics, economy and religion. They may vary by 15
mins. Timezones are created, dismissed, and changed. In one case a
timezone was changed by skipping a whole calendar day.
○ Never deal with timezones yourself, no one ever succeeded in history.
Store all dates as UTC, use an external library for conversion.
24. ENUM, SET
● MySQL weird types that include a list of allowed string values
● With ENUM, exactly one value from the list is allowed
● With SET, any number of values from the list are allowed
● '' is always allowed, because.
● Specifying the value by its numeric index is allowed, so the number 1 could match the string '0'
● Adding, dropping and changing values requires an ALTER TABLE
○ And possibly a locking table rebuild
25. Instead of ENUM
CREATE TABLE account (
state ENUM('active', 'suspended') NOT NULL,
...
);
26. Instead of ENUM
CREATE TABLE account (
state_id INT UNSIGNED NOT NULL,
...
);
CREATE TABLE state (
id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
state VARCHAR(100) NOT NULL UNIQUE
);
INSERT INTO state (state) VALUES ('active'), ('suspended');
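The benefit of the lookup table can be sketched as follows. The 'closed' state and the id 3 are assumptions made up for this example: they suppose 'active' and 'suspended' got ids 1 and 2.

```sql
-- Adding a state is a plain INSERT, no ALTER TABLE is needed
-- (hypothetical state; id 3 assumes ids 1 and 2 are already taken)
INSERT INTO state (id, state) VALUES (3, 'closed');

-- The price to pay: reading the state name requires a JOIN
SELECT s.state
FROM account a
INNER JOIN state s ON s.id = a.state_id;
```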
28. NULL anomalies
mysql> SELECT
NULL = 1 AS a,
NULL <> 1 AS b,
NULL IS NULL AS c,
1 IS NOT NULL AS d;
+------+------+---+---+
| a | b | c | d |
+------+------+---+---+
| NULL | NULL | 1 | 1 |
+------+------+---+---+
-- This returns TRUE in MySQL:
NULL <=> NULL AND 1 <=> 1
29. Problematic queries
These conditions will not return rows where year or approved is NULL:
● WHERE year != 1994
● WHERE NOT (year = 1994)
● WHERE year > 2000
● WHERE NOT (year > 2000)
● WHERE approved != TRUE
● WHERE NOT approved
And this expression returns NULL when year is NULL:
SELECT CONCAT(year, ' years old') FROM user ...
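If the columns must stay NULLable, the queries have to handle NULL explicitly. A possible sketch in MySQL syntax, assuming a user table with id, year and approved columns:

```sql
-- Explicitly include the rows where year is unknown
SELECT id FROM user WHERE year <> 1994 OR year IS NULL;

-- MySQL and MariaDB: the NULL-safe operator <=> never returns NULL
-- (in PostgreSQL, IS DISTINCT FROM plays a similar role)
SELECT id FROM user WHERE NOT (year <=> 1994);

-- IS NOT TRUE matches both FALSE and NULL
SELECT id FROM user WHERE approved IS NOT TRUE;

-- Replace NULL before concatenating
SELECT CONCAT(COALESCE(year, 'unknown'), ' years old') FROM user;
```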
30. Bad Reasons for NULL
● Because columns are NULLable by default
● To indicate that a value doesn’t exist
○ Use a special value instead: '' or -1 or 0 or …
○ But this is not always a bad reason: UNIQUE allows multiple NULLs
● Using your tables as spreadsheets
31. Spreadsheet Example
CREATE TABLE user (
id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
first_name VARCHAR(100) NOT NULL,
last_name VARCHAR(100) NOT NULL,
email VARCHAR(100) NOT NULL,
-- if a user may have multiple URLs, let’s move them
-- to a separate table:
-- url { id, user_id, url }
url_1 VARCHAR(100),
url_2 VARCHAR(100),
url_3 VARCHAR(100),
url_4 VARCHAR(100),
url_5 VARCHAR(100)
);
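The separate table suggested in the comment could look like this. It's only a sketch: the index name and the NOT NULL constraints are assumptions.

```sql
-- One row per URL: no NULLs, and no limit of 5 URLs per user
CREATE TABLE url (
    id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    user_id INT UNSIGNED NOT NULL,
    url VARCHAR(100) NOT NULL,
    INDEX idx_user_id (user_id)
);

-- All URLs of user 24
SELECT url FROM url WHERE user_id = 24;
```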
32. Spreadsheet Example
CREATE TABLE user (
id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
first_name VARCHAR(100) NOT NULL,
last_name VARCHAR(100) NOT NULL,
email VARCHAR(100) NOT NULL,
-- if we may have users’ bank data or not,
-- let’s move them to another table:
-- bank { user_id, account_no, account_holder, ... }
bank_account_no VARCHAR(50),
bank_account_holder VARCHAR(100),
bank_iban VARCHAR(100),
bank_swift_code VARCHAR(11)
);
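A sketch of the bank table mentioned in the comment, assuming every user has at most one bank account, so that user_id can be the primary key. Column sizes are assumptions: an IBAN has up to 34 characters, a SWIFT code has 8 or 11.

```sql
-- A row exists only for users who provided bank data: no NULLs needed
CREATE TABLE bank (
    user_id INT UNSIGNED PRIMARY KEY,
    account_no VARCHAR(50) NOT NULL,
    account_holder VARCHAR(100) NOT NULL,
    iban VARCHAR(34) NOT NULL,
    swift_code VARCHAR(11) NOT NULL
);

-- Users without bank data
SELECT u.id
FROM user u
LEFT JOIN bank b ON b.user_id = u.id
WHERE b.user_id IS NULL;
```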
34. Category Hierarchies
Antipattern: column-per-level
TABLE product (id, category_name, subcategory_name, name, price, ..)
-----
TABLE category (id, name)
TABLE product (id, category_id, subcategory_id, name, price, ...)
Possible problems:
● To add or delete a level, we need to add or drop a column
● A subcategory can be erroneously linked to multiple categories
● A category can be erroneously used as subcategory, and vice versa
35. Category Hierarchies
A better way:
TABLE category (id, parent_id, name)
TABLE product (id, category_id, name, price, ...)
Possible problems:
● Circular dependencies (must be prevented at application level)
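With the parent_id design, a whole subtree can be read with a recursive query. This is a sketch that assumes MySQL 8.0+, MariaDB 10.2.2+ or PostgreSQL, and a root category with id 1.

```sql
-- Category 1 and all of its descendants
WITH RECURSIVE subtree AS (
    -- anchor: the starting category
    SELECT id, parent_id, name
    FROM category
    WHERE id = 1
    UNION ALL
    -- recursive step: children of the rows found so far
    SELECT c.id, c.parent_id, c.name
    FROM category c
    INNER JOIN subtree s ON c.parent_id = s.id
)
SELECT id, name FROM subtree;
```

If a circular dependency sneaks into the data, the recursion does not terminate by itself, which is another reason to prevent cycles.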
36. Category Networks
What if every category can have multiple parents?
Antipattern:
TABLE category (id, parent_id1, parent_id2, name)
37. Category Graphs
If every category can have multiple parents, correct pattern:
TABLE category (id, name)
TABLE category_relationship (parent_id, child_id)
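A possible definition of the relationship table. The index name and the id 42 are assumptions for the example.

```sql
CREATE TABLE category_relationship (
    parent_id INT UNSIGNED NOT NULL,
    child_id INT UNSIGNED NOT NULL,
    -- the same parent/child pair cannot be stored twice
    PRIMARY KEY (parent_id, child_id),
    -- to find the parents of a given child
    INDEX idx_child_id (child_id)
);

-- All direct parents of category 42
SELECT c.id, c.name
FROM category c
INNER JOIN category_relationship r ON r.parent_id = c.id
WHERE r.child_id = 42;
```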
38. Antipattern: Parent List
Antipattern: storing the list of ancestors in a column:
TABLE category (id, name, parent_list)
INSERT INTO category (parent_list, name) VALUES
('sports/football/wear', 'football shoes');
● This antipattern is sometimes used because it simplifies certain aspects
● But it overcomplicates other aspects
● Also, until recently MySQL and MariaDB did not support recursive queries,
but now they do
40. Tags Column
● Suppose you want to store user-typed tags for posts
● You may be tempted to:
CREATE TABLE post (
id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
tags VARCHAR(200)
);
INSERT INTO post (tags, ... ) VALUES ('sunday,venus', ... );
41. Tags Column
● But what about this query? It also matches 'sunday':
SELECT id FROM post WHERE tags LIKE '%sun%';
● Mmm, maybe this is better:
INSERT INTO post (tags, ... ) VALUES (',events,diversity,', ... );
SELECT id FROM post WHERE tags LIKE '%,sun,%';
However, this query cannot take advantage of indexes, because the pattern starts with a wildcard.
42. Tag Table
CREATE TABLE post (
id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
...
);
CREATE TABLE tag (
post_id INT UNSIGNED,
tag VARCHAR(50),
PRIMARY KEY (post_id, tag),
INDEX (tag)
);
It works.
Queries will be able to use indexes.
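For example, these queries on the tag table above are exact matches and don't need LIKE. The tag values are just examples.

```sql
-- Posts tagged 'sun': can use the index on tag
SELECT post_id FROM tag WHERE tag = 'sun';

-- Posts that have both tags. The primary key guarantees that
-- a (post_id, tag) pair is not stored twice, so counting is safe.
SELECT post_id
FROM tag
WHERE tag IN ('sun', 'venus')
GROUP BY post_id
HAVING COUNT(*) = 2;
```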
43. Tag Array
-- PostgreSQL
CREATE TABLE post (
id SERIAL PRIMARY KEY,
tags TEXT[]
);
CREATE INDEX idx_tags ON post USING GIN (tags);
-- MySQL
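CREATE TABLE post (
id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
-- expression defaults must be in parentheses (MySQL 8.0.13+)
tags JSON DEFAULT (JSON_ARRAY()),
-- JSON arrays need a multi-valued index (MySQL 8.0.17+)
INDEX idx_tags ((CAST(tags->'$' AS CHAR(50) ARRAY)))
);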
-- MariaDB can store JSON arrays,
-- but since it cannot index them this solution is not viable
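Searching a tag in the array columns could look like this. It's a sketch: the MySQL query assumes version 8.0.17+, and whether an index is used depends on how the index was defined.

```sql
-- PostgreSQL: the containment operator @> can use a GIN index
SELECT id FROM post WHERE tags @> ARRAY['sun'];

-- MySQL: MEMBER OF tests whether a value is an element of a JSON array.
-- A multi-valued index can be used only if it is defined on the
-- same expression (here: tags->'$').
SELECT id FROM post WHERE 'sun' MEMBER OF (tags->'$');
```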
45. Not So Different Entities
● Your DB has users, landlords and tenants
● Separate entities with different info
● But sometimes you treat them as one thing
● What to do?
46. Inheritance
● In the simplest case, they are just subclasses
● For example, landlords and tenants could be types of users
● Common properties are in the parent class
-- relational way to represent it:
TABLE user (id, first_name, last_name, email)
TABLE landlord (id, user_id, vat_number)
TABLE tenant (id, user_id, landlord_id)
PostgreSQL allows doing this in a more object-oriented way, with Table Inheritance
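In the relational representation, reading a landlord with its common properties is a simple JOIN. A sketch in MySQL syntax, using the tables listed above:

```sql
-- Common properties come from user, specific ones from landlord
SELECT u.first_name, u.last_name, u.email, l.vat_number
FROM landlord l
INNER JOIN user u ON u.id = l.user_id;
```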
47. Different Entities
● But sometimes it’s better to consider them different entities
● Antipattern: Union View
CREATE VIEW everyone AS
(SELECT id, first_name, last_name FROM landlord)
UNION
(SELECT id, first_name, last_name FROM tenant)
;
This makes some queries less verbose, at the cost of making them
potentially very slow
48. Unicity Across Tables /1
● But maybe both landlords and tenants have emails,
and we want to make sure they are UNIQUE
● Question: is there a practical reason?
49. Unicity Across Tables /2
● If it is necessary, you’re thinking about the problem in the wrong way
● If emails need to be unique, they are a whole entity, so you’ll guarantee unicity
on a single table
TABLE landlord (id, first_name, last_name, vat_number)
TABLE tenant (id, first_name, last_name, landlord_id)
TABLE email (id, email UNIQUE, landlord_id, tenant_id)
Bloody hell! The solution initially looks great, but linking emails to landlords or
tenants in that way is horrific!
50. Unicity Across Tables /2bis
Why?
● Cannot build foreign keys (I don’t recommend it, but…)
● If in the future we want to link emails to suppliers, employees, etc, we’ll need
to add columns to the table
51. Unicity Across Tables /3
Even if we keep the landlord and tenant tables separated,
we can create a superset called person.
We decided it’s not a parent class, so it can just have an id column.
Every landlord, tenant and email is linked to a person.
TABLE landlord (id, person_id, first_name, last_name, vat_number)
TABLE tenant (id, person_id, first_name, last_name, landlord_id)
TABLE person (id)
TABLE email (id, person_id, email UNIQUE)
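A possible definition of the two new tables. Column types and the index name are assumptions.

```sql
-- person is not a parent class: it only provides a shared id
CREATE TABLE person (
    id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY
);

CREATE TABLE email (
    id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    person_id INT UNSIGNED NOT NULL,
    -- unicity is guaranteed here, for landlords and tenants alike
    email VARCHAR(100) NOT NULL UNIQUE,
    INDEX idx_person_id (person_id)
);

-- Email addresses of all landlords
SELECT l.id, e.email
FROM landlord l
INNER JOIN email e ON e.person_id = l.person_id;
```

Linking emails to suppliers or employees in the future only requires a person_id column in those tables, while the email table remains unchanged.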
53. Catalog of Products
Imagine we have a catalogue of products where:
● Every product has certain common characteristics
● It’s important to be able to run queries on all products
○ SELECT id FROM p WHERE qty = 0;
○ SELECT vendor, MAX(price) FROM p GROUP BY vendor;
● Each product type also has a unique set of characteristics
54. Antipattern: Spreadsheet Table
● Keep all products in the same table
● Add a column for every characteristic that applies to at least one product
● Where a column doesn’t make sense, set to NULL
Problems:
● Too many columns and indexes
○ Generally bad for query performance, especially INSERTs
○ Generally bad for operations (repair, backup, restore, ALTER TABLE…)
● Adding/removing a product type means adding/removing a set of columns
○ But in practice columns will hardly be removed and will remain unused
● NULL means both “no value for this product” and “doesn’t apply to this type of
products”, leading to endless confusion
55. Antipattern: Table per Type
● Store products of different types in different tables
Problems:
● Metadata become data
○ How to get the list of product types?
● Some queries become overcomplicated
○ Get the id’s of out of stock products
○ Most expensive product for each vendor
56. Hybrid
● A single table for characteristics common to all product types
● A separate table per product type, for non-common characteristics
Problems:
● Many JOINs
● Adding/removing product types means adding/removing tables
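A sketch of the hybrid design, with a hypothetical "bed" product type. All table and column names are made up for this example.

```sql
-- Characteristics common to all product types
CREATE TABLE product (
    id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    name VARCHAR(100) NOT NULL,
    vendor VARCHAR(100) NOT NULL,
    price DECIMAL(10, 2) NOT NULL,
    qty INT UNSIGNED NOT NULL
);

-- Characteristics that only make sense for beds
CREATE TABLE product_bed (
    product_id INT UNSIGNED PRIMARY KEY,
    material VARCHAR(50) NOT NULL,
    width_cm SMALLINT UNSIGNED NOT NULL
);

-- Queries on all products don't need JOINs...
SELECT id FROM product WHERE qty = 0;

-- ...but type-specific queries do
SELECT p.name, b.material
FROM product p
INNER JOIN product_bed b ON b.product_id = p.id
WHERE p.qty = 0;
```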
57. Semi-Structured Data
● A single table for all products
● A regular column for each characteristic common to all product types
● A semi-structured column for all type-specific characteristics
○ JSON, HStore…
○ Not arrays
○ Not CSV
● Proper indexes on semi-structured data (depending on your technology)
Problems:
● Still a big table
● Queries on semi-structured data may be complicated and not supported by
ORMs
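In MySQL, a possible way to do this is a JSON column, plus an indexed generated column for each attribute that needs fast searches. This is a sketch with hypothetical column names and attributes.

```sql
CREATE TABLE product (
    id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    type VARCHAR(50) NOT NULL,
    name VARCHAR(100) NOT NULL,
    price DECIMAL(10, 2) NOT NULL,
    qty INT UNSIGNED NOT NULL,
    -- type-specific characteristics
    attributes JSON NOT NULL,
    -- generated column extracted from the JSON document, so it can be indexed
    material VARCHAR(50)
        AS (JSON_UNQUOTE(JSON_EXTRACT(attributes, '$.material'))) VIRTUAL,
    INDEX idx_material (material)
);

INSERT INTO product (type, name, price, qty, attributes) VALUES
    ('bed', 'Oak bed', 499.00, 3, '{"material": "wood", "width_cm": 160}');

-- Can use idx_material
SELECT id, name FROM product WHERE material = 'wood';
```

Products whose attributes don't include a material simply get NULL in the generated column.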
58. Antipattern: Entity,Attribute,Value
TABLE entity (id, name)
TABLE attribute (id, entity_id, name)
TABLE value (id, attribute_id, value)
● Each product type is an entity
● Each type’s characteristics are stored in attribute
● Each product is a set of values
Example:
Entity { id: 24, name: "Bed" }
Attribute [ { id: 123, entity_id: 24, name: "material" }, ... ]
Value [ { id: 999, attribute_id: 123, value: "wood" } ]
59. Antipattern: Entity,Attribute,Value
Problems:
● We JOIN 3 tables every time we want to get a single value!
● All values must be treated as texts
○ Unless we create multiple value tables: int_value, text_value...
○ Which means, even more JOINs
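For example, this is roughly what reading the material of beds looks like with the three tables above. Backticks around value are used out of caution, because VALUE is a keyword in MySQL.

```sql
-- Three tables joined to read a single characteristic
SELECT v.`value`
FROM entity e
INNER JOIN attribute a ON a.entity_id = e.id
INNER JOIN `value` v ON v.attribute_id = a.id
WHERE e.name = 'Bed'
    AND a.name = 'material';
```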
61. Names Beyond Comprehension
● I saw the following table names in production:
○ marco2015 # Marco was the table’s creator
○ jan2015 # jan was the month
○ tmp_tmp_tmp_fix
○ tmp_fix_fix_fix # Because symmetry is cool
I forgot many other examples because...
“Ultimate horror often paralyses memory in a merciful way.”
― H.P. Lovecraft
62. Data in Metadata
● Include data in table names
○ invoice_2020, invoice_2019, invoice_2018…
● Use a year column instead
● If the table is too big, there are other ways to contain the problem
(partitioning)
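A sketch of a single invoice table partitioned by year, in MySQL syntax. Column names are assumptions. Note that in MySQL the partitioning column must be part of the primary key.

```sql
CREATE TABLE invoice (
    id INT UNSIGNED AUTO_INCREMENT,
    invoice_year SMALLINT UNSIGNED NOT NULL,
    customer_id INT UNSIGNED NOT NULL,
    total DECIMAL(10, 2) NOT NULL,
    PRIMARY KEY (id, invoice_year)
)
PARTITION BY RANGE (invoice_year) (
    PARTITION p2018 VALUES LESS THAN (2019),
    PARTITION p2019 VALUES LESS THAN (2020),
    PARTITION p2020 VALUES LESS THAN (2021),
    PARTITION pmax VALUES LESS THAN MAXVALUE
);

-- Only the p2019 partition needs to be read
SELECT SUM(total) FROM invoice WHERE invoice_year = 2019;
```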
63. Bad Names in General
● A name should tell everyone what a table or column is
○ Even to new hires!
○ Even to you… 5 years from now!
● Otherwise people have to look at other documentation sources
○ ...which typically don’t exist
● Names should follow a standard across all company databases
○ singular/plural, long/short names, ...
● So people don’t have to check what a table or column is called exactly
64. Thank you for listening!
federico-razzoli.com/services
Telegram channel:
open_source_databases