This document discusses schema evolution patterns for databases and APIs. It introduces schema compatibility, which lets clients move transparently between differing schemas, and covers data migration techniques for incompatible changes in both single-schema and multi-schema stores. The key takeaways: design for compatibility, have a plan for handling incompatibility, and present clients with the illusion of instant schema change.
3. A Deceptively Hard Problem
•Classic three-tier web service
- Multiple servers for scalability
- Rolling updates for high availability
- API for extensibility
•How do we make changes to data?
- Let’s focus on one table, people
(Diagram: users and API clients → LB / API gateway → app servers → DB)
4. Deceptively Hard Problem #1
•Add administrative users
- Need to add is_admin to people table
- … but clients with the old schema will fail
to write if they don’t provide is_admin!
Won’t they?
5. Deceptively Hard Problem #2
•Splitting name into first_name, last_name
- Old clients will keep writing to name
- New clients will expect first_name and
last_name to be defined in old data
•How do we do this update safely?
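To make the failure concrete, here is a minimal sketch (rows as plain dicts standing in for DB records; the client functions are invented for illustration) of why a naive split breaks clients on both sides:

```python
# Why splitting `name` into first_name/last_name breaks naive clients.

old_row = {"name": "Alice Smith", "age": 29}                       # written by an old client
new_row = {"first_name": "Bob", "last_name": "Jones", "age": 42}   # written by a new client

def new_client_read(row):
    # A naive new client assumes first_name/last_name always exist.
    return f"{row['first_name']} {row['last_name']}"

def old_client_read(row):
    # A naive old client assumes `name` always exists.
    return row["name"]

print(new_client_read(new_row))        # works
try:
    new_client_read(old_row)           # KeyError: old data has only `name`
except KeyError as e:
    print("new client failed on old data:", e)
try:
    old_client_read(new_row)           # KeyError: new data has no `name`
except KeyError as e:
    print("old client failed on new data:", e)
```

Each direction of the mismatch fails independently, which is why the fix needs more than a one-shot `ALTER`.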
6. Schema Evolution
•When your data’s shape (its schema) changes
•Why is this hard?
- Schemas can’t change everywhere instantly
- Client code can be very difficult to update
- If client and data schemas don’t agree,
it can cause serious problems
7. How Do We Handle This?
•We need to give the illusion of instant
schema change to clients, with minimal
code change.
•In this talk, we’ll look at how.
8. Goal of This Talk
•Broadly-applicable concepts, techniques,
and patterns for schema evolution
- Schema compatibility for
transparent schema change
- Data migration when compatibility
isn’t possible or practical
•How this looks in practice
11. The Illusion of Instant Change
•Instant schema change everywhere isn’t possible,
but we want to give the illusion that it is
- Goal #1: Clients can still read and write safely,
even if their schemas are different
- Goal #2: Code change to clients is minimized
•Schema compatibility makes this easier to do
12. Schema Compatibility
•If two schemas are compatible, evolving
from one schema to another can be done
automatically on read
•Clients can be oblivious to schema change
•Two directions: backwards and forwards
16. Other Types of Changes
•Without defaults:
- Adding a field breaks backwards-compatibility
(in older data, field value is undefined)
- Removing a field breaks forwards-compatibility
(for older clients, field value is undefined)
•Renaming (e.g. ssn to social_security_number):
it depends
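The default-vs-no-default distinction can be sketched in a few lines (a generic resolver, not any specific library's API): resolving a record against the reader's schema fills in defaults for fields the writer didn't know about, which is exactly what makes a default-bearing addition backwards-compatible.

```python
# Resolve a record against a reader's schema, applying defaults.

READER_SCHEMA = {
    "name": {"required": True},
    "age": {"required": True},
    "is_admin": {"required": False, "default": False},  # added later, with a default
}

def read_with_schema(record, schema):
    out = {}
    for field, spec in schema.items():
        if field in record:
            out[field] = record[field]
        elif "default" in spec:
            out[field] = spec["default"]   # old data stays readable: backwards-compatible
        else:
            raise ValueError(f"missing required field {field!r}")  # no default: broken
    return out

old_record = {"name": "Alice", "age": 29}   # written before is_admin existed
print(read_with_schema(old_record, READER_SCHEMA))
# → {'name': 'Alice', 'age': 29, 'is_admin': False}
```

Remove the `"default"` entry and the same read raises instead, which is the breakage described above.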
17. In Practice: API Design
•So far, focused on DBs
•Compatibility is especially important for APIs
- Lots of clients you might not control
- API version bumps need to happen when
incompatible schema changes happen
18. In Practice - Protocol Buffers
message Person {
  required string name = 1;
  required int32 age = 2;
  optional bool is_admin = 3 [default = false];
}
•Field numbers make renames compatible
•In version 3, no required or optional -
required broke backwards-compatibility too often
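A toy model of why field *numbers* (not names) make renames compatible: on the wire, a Protocol Buffers message is keyed by field number, and the `.proto` file maps numbers back to names on decode. The dict-based "wire format" below is a deliberate simplification of the real binary encoding.

```python
# Simplified model: wire data keyed by field number, schemas map number → name.

wire_record = {1: "Alice Smith", 2: 29}   # field numbers → values

SCHEMA_V1 = {1: "name", 2: "age"}
SCHEMA_V2 = {1: "full_name", 2: "age"}    # field 1 renamed; number unchanged

def decode(wire, schema):
    # Unknown field numbers are simply skipped, old and new schemas both decode.
    return {schema[num]: value for num, value in wire.items() if num in schema}

print(decode(wire_record, SCHEMA_V1))   # {'name': 'Alice Smith', 'age': 29}
print(decode(wire_record, SCHEMA_V2))   # {'full_name': 'Alice Smith', 'age': 29}
```

The same bytes decode under either schema; only the local name changes, so the rename never touches stored data.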
19. In Practice: Stripe
•Goal: API responses readable by all old clients w/o code change.
•API server has latest schema, but clients keep schema forever
•Solution: Version change modules applied in reverse order from
server’s version to client’s version (they admit: this is hard)
(Diagram: server at schema v3 serves a v1 client by applying change modules in reverse: 3to2( ), then 2to1( ))
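The version-change-module idea can be sketched as a chain of downgrade functions (all names here are invented for illustration, not Stripe's actual code): the server renders at its latest schema, then applies per-version downgrades until it reaches the client's pinned version.

```python
# Chain of per-version downgrade modules, applied newest-to-oldest.

def downgrade_3_to_2(resp):
    # v3 split `name`; v2 clients expect a single `name` field.
    resp = dict(resp)
    resp["name"] = f"{resp.pop('first_name')} {resp.pop('last_name')}"
    return resp

def downgrade_2_to_1(resp):
    # v2 added `is_admin`; v1 clients never saw it.
    resp = dict(resp)
    resp.pop("is_admin", None)
    return resp

DOWNGRADES = {3: downgrade_3_to_2, 2: downgrade_2_to_1}   # from-version → module

def render_for(client_version, latest_version, resp):
    for v in range(latest_version, client_version, -1):
        resp = DOWNGRADES[v](resp)
    return resp

latest = {"first_name": "Alice", "last_name": "Smith", "is_admin": False}
print(render_for(1, 3, latest))   # {'name': 'Alice Smith'}
```

Each module only has to know about one version boundary, which is what keeps the approach tractable even as the chain grows.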
20. Recap
•Compatibility allows for transparent
movement between schemas
•Changes can be
backwards-compatible,
forwards-compatible, both, or neither
•Ease-of-compatibility drives the design of many
messaging formats
22. Crossing Compatibility Gaps
•Need a plan for when compatibility
isn’t an option
- Not all schema changes are compatible
- Not all incompatibilities are simple
- Not all compatible changes are practical
23. Complex Changes
name: string,
first_name: string,
last_name: string,
age: integer,
is_admin: boolean (default: false)
•Not obvious how to split; code changes required
•Two field additions without defaults: not backwards-compatible
•Field removal without default: not forwards-compatible
24. Impractical Changes
•e.g. Adding a column in MySQL (<v8)
requires locking/copying the table
- Days to weeks not unheard of for tables
with millions of rows
25. Crossing Compatibility Gaps
•Compatibility gaps are crossed with
data migrations - minimally disruptive
movement between schemas
•We’ll look at:
- Single-schema stores (e.g. RDBMS)
- Multi-schema stores (e.g. MongoDB, Kafka)
37. Recap
•In single-schema stores:
- Migrate clients gradually, maintaining the
illusion of the old schema to old clients
- Migrate data to new schema over time,
applying updates to old and new copies
- When migration complete, then cut over
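The three steps above can be sketched in miniature (in-memory dicts standing in for table rows, using the `name` → `first_name`/`last_name` split; real systems would do the dual write in a trigger or the data layer, and the backfill in batches):

```python
# Single-schema migration: dual writes, background backfill, then cutover.

rows = [{"id": 1, "name": "Alice Smith"}, {"id": 2, "name": "Bob Jones"}]

def split(name):
    first, _, last = name.partition(" ")
    return first, last

# Step 1: new writes populate both old and new columns, so old clients
# reading `name` keep working for the duration of the migration.
def write(row, name):
    row["name"] = name
    row["first_name"], row["last_name"] = split(name)

rows.append({"id": 3})
write(rows[-1], "Carol Danvers")

# Step 2: a background backfill migrates pre-existing rows over time.
for row in rows:
    if "first_name" not in row:
        row["first_name"], row["last_name"] = split(row["name"])

# Step 3: once every client reads the new columns, cut over and drop `name`.
for row in rows:
    del row["name"]

print(rows)
```

Until step 3, both representations coexist and stay in sync, which is what preserves the illusion of the old schema for old clients.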
39. Multi-Schema Stores
{
  "name": "Alice Smith",
  "age": 29,
  "organization": "Engineering"
}
{
  "name": "Bob Jones",
  "age": 42
}
{
  "name": "Carol Danvers",
  "age": 34,
  "organization": "Security"
}
•Data with different schemas
coexisting in the same store
•MongoDB: collections of
documents
•Kafka: topics of messages
•Want illusion of single schema
40. Multi-Schema Migration
•Move data from schema X to (backwards-incompatible) schema X+1 without blocking clients
(Diagram: clients C1–C3 attached to a store holding both schema-X and schema-X+1 data)
41. Step 1: Old clients write with new schema, continue reading with old schema (old clients are still compatible!)
(Diagram: clients C1–C3 now writing schema X+1; existing data still in schema X)
42. Step 2: Migrate old data to new schema
(Diagram: existing schema-X data rewritten as schema X+1)
43. Step 3: Old clients read and write with new schema
(Diagram: all clients and all data now on schema X+1)
44. In Practice: Kafka (Confluent)
•Schema-aware clients transparently
apply compatible changes
•Backwards-incompatible changes:
update writers first
•Forwards-incompatible changes:
update readers first
45. Recap
•In multi-schema stores:
- Make old clients generate compatible data
(by writing or reading with new schema)
- Migrate old data to new schema
- Old clients read and write with new schema
47. Summary
•Schemas can’t change everywhere instantly
•Schema compatibility can transparently
provide the illusion of instant change
•Data migrations fill in compatibility gaps,
carefully keeping clients working
48. Takeaways
•This applies to DB schema changes and
API versioning, but it also applies to
CSV/JSON/Excel, etc.
•If your data has structure, it probably
has a schema, & these concepts apply
49. Takeaways
•Reason about schema evolution up-front to guide
your architecture choices
- Prefer compatible changes
•Have a plan for dealing with incompatibility
- Present the illusion of instant schema change
•Remember: this is a hard problem for everyone!