This document discusses schema evolution patterns for databases and APIs. It introduces schema compatibility, which lets clients move transparently between differing schemas, and covers data migration techniques for incompatible changes in both single-schema and multi-schema stores. The key takeaways: design for compatibility, have a plan for handling incompatibility, and present clients with the illusion of instant schema change.
3. A Deceptively Hard Problem
•Classic three-tier web service
- Multiple servers for scalability
- Rolling updates for high availability
- API for extensibility
•How do we make changes to data?
- Let’s focus on one table, people
(Diagram: users and API clients → LB / API gateway → app servers → DB)
4. Deceptively Hard Problem #1
•Add administrative users
- Need to add is_admin to people table
- … but clients with the old schema will fail
to write if they don’t provide is_admin!
Won’t they?
5. Deceptively Hard Problem #2
•Splitting name into first_name, last_name
- Old clients will keep writing to name
- New clients will expect first_name and
last_name to be defined in old data
•How do we do this update safely?
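To make the failure concrete, here is a minimal sketch (rows as plain dicts standing in for DB records; the client functions are invented for illustration) of why a naive split breaks clients on both sides:

```python
# Why splitting `name` into first_name/last_name breaks naive clients.

old_row = {"name": "Alice Smith", "age": 29}                       # written by an old client
new_row = {"first_name": "Bob", "last_name": "Jones", "age": 42}   # written by a new client

def new_client_read(row):
    # A naive new client assumes first_name/last_name always exist.
    return f"{row['first_name']} {row['last_name']}"

def old_client_read(row):
    # A naive old client assumes `name` always exists.
    return row["name"]

print(new_client_read(new_row))        # works
try:
    new_client_read(old_row)           # KeyError: old data has only `name`
except KeyError as e:
    print("new client failed on old data:", e)
try:
    old_client_read(new_row)           # KeyError: new data has no `name`
except KeyError as e:
    print("old client failed on new data:", e)
```

Each direction of the mismatch fails independently, which is why the fix needs more than a one-shot `ALTER`.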
6. Schema Evolution
•When your data’s shape (its schema) changes
•Why is this hard?
- Schemas can’t change everywhere instantly
- Client code can be very difficult to update
- If client and data schemas don’t agree,
it can cause serious problems
7. How Do We Handle This?
•We need to give the illusion of instant
schema change to clients, with minimal
code change.
•In this talk, we’ll look at how.
8. Goal of This Talk
•Broadly-applicable concepts, techniques,
and patterns for schema evolution
- Schema compatibility for
transparent schema change
- Data migration when compatibility
isn’t possible or practical
•How this looks in practice
11. The Illusion of Instant Change
•Instant schema change everywhere isn’t possible,
but we want to give the illusion that it is
- Goal #1: Clients can still read and write safely,
even if their schemas are different
- Goal #2: Code change to clients is minimized
•Schema compatibility makes this easier to do
12. Schema Compatibility
•If two schemas are compatible, evolving
from one schema to another can be done
automatically on read
•Clients can be oblivious to schema change
•Two directions: backwards and forwards
16. Other Types of Changes
•Without defaults:
- Adding a field breaks backwards-compatibility
(in older data, field value is undefined)
- Removing a field breaks forwards-compatibility
(for older clients, field value is undefined)
•Renaming (e.g. ssn to social_security_number):
it depends
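The default-vs-no-default distinction can be sketched in a few lines (a generic resolver, not any specific library's API): resolving a record against the reader's schema fills in defaults for fields the writer didn't know about, which is exactly what makes a default-bearing addition backwards-compatible.

```python
# Resolve a record against a reader's schema, applying defaults.

READER_SCHEMA = {
    "name": {"required": True},
    "age": {"required": True},
    "is_admin": {"required": False, "default": False},  # added later, with a default
}

def read_with_schema(record, schema):
    out = {}
    for field, spec in schema.items():
        if field in record:
            out[field] = record[field]
        elif "default" in spec:
            out[field] = spec["default"]   # old data stays readable: backwards-compatible
        else:
            raise ValueError(f"missing required field {field!r}")  # no default: broken
    return out

old_record = {"name": "Alice", "age": 29}   # written before is_admin existed
print(read_with_schema(old_record, READER_SCHEMA))
# → {'name': 'Alice', 'age': 29, 'is_admin': False}
```

Remove the `"default"` entry and the same read raises instead, which is the breakage described above.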
17. In Practice: API Design
•So far, focused on DBs
•Compatibility is especially important for APIs
- Lots of clients you might not control
- API version bumps need to happen when
incompatible schema changes happen
18. In Practice - Protocol Buffers
message Person {
  required string name = 1;
  required int32 age = 2;
  optional bool is_admin = 3 [default = false];
}
•Field numbers make renames compatible
•In version 3, no required or optional -
required broke backwards-compatibility too often
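A toy model of why field *numbers* (not names) make renames compatible: on the wire, a Protocol Buffers message is keyed by field number, and the `.proto` file maps numbers back to names on decode. The dict-based "wire format" below is a deliberate simplification of the real binary encoding.

```python
# Simplified model: wire data keyed by field number, schemas map number → name.

wire_record = {1: "Alice Smith", 2: 29}   # field numbers → values

SCHEMA_V1 = {1: "name", 2: "age"}
SCHEMA_V2 = {1: "full_name", 2: "age"}    # field 1 renamed; number unchanged

def decode(wire, schema):
    # Unknown field numbers are simply skipped, old and new schemas both decode.
    return {schema[num]: value for num, value in wire.items() if num in schema}

print(decode(wire_record, SCHEMA_V1))   # {'name': 'Alice Smith', 'age': 29}
print(decode(wire_record, SCHEMA_V2))   # {'full_name': 'Alice Smith', 'age': 29}
```

The same bytes decode under either schema; only the local name changes, so the rename never touches stored data.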
19. In Practice: Stripe
•Goal: API responses readable by all old clients w/o code change.
•API server has latest schema, but clients keep schema forever
•Solution: Version change modules applied in reverse order from
server’s version to client’s version (they admit: this is hard)
(Diagram: server at schema v3 serves a v1 client by applying change modules in reverse: 3to2( ), then 2to1( ))
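The version-change-module idea can be sketched as a chain of downgrade functions (all names here are invented for illustration, not Stripe's actual code): the server renders at its latest schema, then applies per-version downgrades until it reaches the client's pinned version.

```python
# Chain of per-version downgrade modules, applied newest-to-oldest.

def downgrade_3_to_2(resp):
    # v3 split `name`; v2 clients expect a single `name` field.
    resp = dict(resp)
    resp["name"] = f"{resp.pop('first_name')} {resp.pop('last_name')}"
    return resp

def downgrade_2_to_1(resp):
    # v2 added `is_admin`; v1 clients never saw it.
    resp = dict(resp)
    resp.pop("is_admin", None)
    return resp

DOWNGRADES = {3: downgrade_3_to_2, 2: downgrade_2_to_1}   # from-version → module

def render_for(client_version, latest_version, resp):
    for v in range(latest_version, client_version, -1):
        resp = DOWNGRADES[v](resp)
    return resp

latest = {"first_name": "Alice", "last_name": "Smith", "is_admin": False}
print(render_for(1, 3, latest))   # {'name': 'Alice Smith'}
```

Each module only has to know about one version boundary, which is what keeps the approach tractable even as the chain grows.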
20. Recap
•Compatibility allows for transparent
movement between schemas
•Changes can be
backwards-compatible,
forwards-compatible, both, or neither
•Ease-of-compatibility drives the design of many
messaging formats
22. Crossing Compatibility Gaps
•Need a plan for when compatibility
isn’t an option
- Not all schema changes are compatible
- Not all incompatibilities are simple
- Not all compatible changes are practical
23. Complex Changes
name: string,
first_name: string,
last_name: string,
age: integer,
is_admin: boolean (default: false)
•Not obvious how to split; code changes required
•Two field additions without defaults: not backwards-compatible
•Field removal without default: not forwards-compatible
24. Impractical Changes
•e.g. Adding a column in MySQL (<v8)
requires locking/copying the table
- Days to weeks not unheard of for tables
with millions of rows
25. Crossing Compatibility Gaps
•Compatibility gaps are crossed with
data migrations - minimally disruptive
movement between schemas
•We’ll look at:
- Single-schema stores (e.g. RDBMS)
- Multi-schema stores (e.g. MongoDB, Kafka)
37. Recap
•In single-schema stores:
- Migrate clients gradually, maintaining the
illusion of the old schema to old clients
- Migrate data to new schema over time,
applying updates to old and new copies
- When migration complete, then cut over
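The three steps above can be sketched in miniature (in-memory dicts standing in for table rows, using the `name` → `first_name`/`last_name` split; real systems would do the dual write in a trigger or the data layer, and the backfill in batches):

```python
# Single-schema migration: dual writes, background backfill, then cutover.

rows = [{"id": 1, "name": "Alice Smith"}, {"id": 2, "name": "Bob Jones"}]

def split(name):
    first, _, last = name.partition(" ")
    return first, last

# Step 1: new writes populate both old and new columns, so old clients
# reading `name` keep working for the duration of the migration.
def write(row, name):
    row["name"] = name
    row["first_name"], row["last_name"] = split(name)

rows.append({"id": 3})
write(rows[-1], "Carol Danvers")

# Step 2: a background backfill migrates pre-existing rows over time.
for row in rows:
    if "first_name" not in row:
        row["first_name"], row["last_name"] = split(row["name"])

# Step 3: once every client reads the new columns, cut over and drop `name`.
for row in rows:
    del row["name"]

print(rows)
```

Until step 3, both representations coexist and stay in sync, which is what preserves the illusion of the old schema for old clients.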
39. Multi-Schema Stores
{
  "name": "Alice Smith",
  "age": 29,
  "organization": "Engineering"
}
{
  "name": "Bob Jones",
  "age": 42
}
{
  "name": "Carol Danvers",
  "age": 34,
  "organization": "Security"
}
•Data with different schemas
coexisting in the same store
•MongoDB: collections of
documents
•Kafka: topics of messages
•Want illusion of single schema
40. Multi-Schema Migration
•Move data from schema X to (backwards-incompatible) schema X+1 without blocking clients
(Diagram: clients C1–C3 attached to a store holding both schema-X and schema-X+1 data)
41. Step 1: Old clients write with new schema, continue reading with old schema (old clients are still compatible!)
(Diagram: clients C1–C3 now writing schema X+1; existing data still in schema X)
42. Step 2: Migrate old data to new schema
(Diagram: existing schema-X data rewritten as schema X+1)
43. Step 3: Old clients read and write with new schema
(Diagram: all clients and all data now on schema X+1)
44. In Practice: Kafka (Confluent)
•Schema-aware clients transparently
apply compatible changes
•Backwards-incompatible changes:
update writers first
•Forwards-incompatible changes:
update readers first
45. Recap
•In multi-schema stores:
- Make old clients generate compatible data
(by writing or reading with new schema)
- Migrate old data to new schema
- Old clients read and write with new schema
47. Summary
•Schemas can’t change everywhere instantly
•Schema compatibility can transparently
provide the illusion of instant change
•Data migrations fill in compatibility gaps,
carefully keeping clients working
48. Takeaways
•This applies to DB schema changes and
API versioning, but it also applies to
CSV/JSON/Excel, etc.
•If your data has structure, it probably
has a schema, & these concepts apply
49. Takeaways
•Reason about schema evolution up-front to guide
your architecture choices
- Prefer compatible changes
•Have a plan for dealing with incompatibility
- Present the illusion of instant schema change
•Remember: this is a hard problem for everyone!