Heap's analytics infrastructure is built around PostgreSQL. The most important choice to make when building a system this way is the schema you'll use to represent your data. This foundation will determine your write throughput, what sorts of read queries will be fast, what indexing strategies will be available to you, and what data inconsistencies will be possible. With the wrong choice, you won't be able to leverage PostgreSQL's most powerful features.
This talk walks through the different schemas we've used to power Heap over the last three years, their relative strengths and weaknesses, and the mistakes we've made.
23. Challenges
1. Capturing 10x to 100x as much data.
Will never care about 95% of it.
2. Funnels, retention, behavioral cohorts,
grouping, filtering... can't pre-aggregate.
3. Within a few minutes of real-time.
25. 1. Data is mostly write-once, never update.
2. Queries map nicely to relational model.
3. Events have a natural ordering (time)
which is mostly monotonic.
4. Analyses are always in terms of defined
events.
Possibly Useful Observations
28. CREATE TABLE user (
  customer_id BIGINT,
  user_id BIGINT,
  properties JSONB NOT NULL,
  PRIMARY KEY (customer_id, user_id)
);
29. CREATE TABLE session (
  customer_id BIGINT,
  user_id BIGINT,
  session_id BIGINT,
  time BIGINT NOT NULL,
  properties JSONB NOT NULL,
  PRIMARY KEY (customer_id, user_id, session_id),
  FOREIGN KEY (customer_id, user_id) REFERENCES user
);
30. CREATE TABLE pageview (
  customer_id BIGINT,
  user_id BIGINT,
  session_id BIGINT,
  pageview_id BIGINT,
  time BIGINT NOT NULL,
  properties JSONB NOT NULL,
  PRIMARY KEY (customer_id, user_id, session_id, pageview_id),
  FOREIGN KEY (customer_id, user_id, session_id) REFERENCES session
);
31. CREATE TABLE event (
  customer_id BIGINT,
  user_id BIGINT,
  session_id BIGINT,
  pageview_id BIGINT,
  event_id BIGINT,
  time BIGINT NOT NULL,
  properties JSONB NOT NULL,
  PRIMARY KEY (customer_id, user_id, session_id, pageview_id, event_id),
  FOREIGN KEY (customer_id, user_id, session_id, pageview_id) REFERENCES pageview
);
35. 1. Simple, easy to understand.
2. Can express basically all analysis in plain old SQL.
Plays nicely with ORMs. Just works.
3. Not much surface area for data inconsistencies.
You should basically always start here.
Pros Of Schema #1
36. Pro: got us to launch!
Con: too many joins, even for simple analyses.
Queries too slow for large customers.
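To make the join cost concrete, here is a sketch of a two-step funnel in schema #1. The event definitions (`'Sign Up'`, `'Confirm Order'`) and the self-join formulation are purely illustrative:

```sql
-- Count users who clicked "Sign Up" and later clicked "Confirm Order".
-- Each additional funnel step adds another self-join on event; pulling
-- in session or pageview properties adds joins to those tables too.
SELECT COUNT(DISTINCT e1.user_id)
FROM event e1
JOIN event e2
  ON e2.customer_id = e1.customer_id
 AND e2.user_id = e1.user_id
 AND e2.time > e1.time
WHERE e1.customer_id = 135
  AND (e1.properties ->> 'target_text') = 'Sign Up'
  AND (e2.properties ->> 'target_text') = 'Confirm Order';
```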
37. 1. Data is mostly write-once, never update.
2. Queries map nicely to relational model.
3. Events have a natural ordering (time) which is
mostly monotonic.
4. Analyses are always in terms of defined events.
5. Aggregations partition cleanly at the user level.
Possibly Useful Observations
39. CREATE TABLE user_events (
  customer_id BIGINT,
  user_id BIGINT,
  time_first_seen BIGINT NOT NULL,
  properties JSONB NOT NULL,
  events JSONB[] NOT NULL,
  PRIMARY KEY (customer_id, user_id)
);
40. funnel_events(events JSONB[], pattern_array TEXT[]) RETURNS int[]
-- Returns an array with 1s corresponding to steps completed
-- in the funnel, 0s in the other positions.

count_events(events JSONB[], pattern TEXT) RETURNS int
-- Returns the number of elements in `events` that match `pattern`.
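Heap implements these UDFs as native extensions. Purely to illustrate the semantics of `count_events`, here is a toy SQL sketch that assumes patterns are expressed as JSONB objects matched by containment, rather than the hstore-style pattern syntax shown above:

```sql
-- Toy sketch only: `pattern` here is a JSONB object matched by
-- containment (@>), not Heap's actual pattern language.
CREATE FUNCTION count_events(events JSONB[], pattern JSONB) RETURNS int AS $$
  SELECT COUNT(*)::int
  FROM unnest(events) AS e
  WHERE e @> pattern;
$$ LANGUAGE sql IMMUTABLE;

-- count_events(ARRAY['{"foo": "bar"}', '{"foo": "abc"}']::jsonb[],
--              '{"foo": "abc"}')
-- counts the elements containing {"foo": "abc"} — here, 1.
```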
42. SELECT funnel_events(
ARRAY[
'{"foo": "bar", "baz": 10}', -- first event
'{"foo": "abc", "baz": 30}', -- second event
'{"foo": "dog", "city": "san francisco"}' -- third event
],
ARRAY[
'"foo"=>"abc"', -- matches second event
'"city"=>like "%ancisco"' -- matches third event
]
);
--------> emits {1, 1}
43. SELECT funnel_events(
ARRAY[
'{"foo": "bar", "baz": 10}', -- first event
'{"foo": "abc", "baz": 30}', -- second event
'{"foo": "dog", "city": "san francisco"}' -- third event
],
ARRAY[
'"city"=>like "%ancisco"', -- matches third event
'"foo"=>"abc"' -- nothing to match after third event
]
);
--------> emits {1, 0}
45. 1. No joins, just aggregations.
2. Can run pretty sophisticated analysis via extensions
like funnel_events.
3. Easy to distribute.
4. Event arrays are TOASTed, which saves lots of disk
space and I/O.
Pros Of Schema #2
46. 1. Can't index for defined events, or even event fields.
2. Can't index for event times in any meaningful
sense.
3. Arrays keep growing and growing...
Limitations Of Schema #2
47. CREATE TABLE user_events (
  customer_id BIGINT,
  user_id BIGINT,
  properties JSONB NOT NULL,
  time_first_seen BIGINT NOT NULL,
  time_last_seen BIGINT NOT NULL,
  events JSONB[] NOT NULL,
  events_last_week JSONB[] NOT NULL,
  PRIMARY KEY (customer_id, user_id)
);
49. 1. Can't index for defined events, or even event fields.
2. Can't index for event times in any meaningful
sense.
3. Arrays keep growing and growing...
4. Write path is very painful.
Limitations Of Schema #2
51. 1. Adding one event to a user requires rewriting the
whole user. (Cost over time is quadratic in size of
user!)
2. Schema bloats like crazy, requires maxing out
autovacuum.
3. Simple maintenance is expensive.
Write Path Of Schema #2
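Concretely, the append that makes the write path quadratic looks something like this (the literal values are illustrative):

```sql
-- Appending one event rewrites the user's entire row: under MVCC,
-- PostgreSQL copies the old row version, and the full events array is
-- re-TOASTed on every write. Each append also leaves a dead row behind,
-- which is what drives the autovacuum pressure above.
UPDATE user_events
SET events = events || '{"foo": "abc", "baz": 30}'::jsonb
WHERE customer_id = 135 AND user_id = 42;
```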
53. 1. Data is mostly write-once, never update.
2. Queries map nicely to relational model.
3. Events have a natural ordering (time) which is mostly
monotonic.
4. Analyses are always in terms of defined events
which are very sparse.
5. Aggregations partition cleanly at the user level.
Possibly Useful Observations
55. CREATE TABLE user (
  customer_id BIGINT,
  user_id BIGINT,
  properties JSONB NOT NULL,
  PRIMARY KEY (customer_id, user_id)
);
56. CREATE TABLE event (
  customer_id BIGINT,
  user_id BIGINT,
  event_id BIGINT,
  time BIGINT,
  data JSONB NOT NULL,
  PRIMARY KEY (customer_id, user_id, event_id),
  FOREIGN KEY (customer_id, user_id) REFERENCES user
);
58. CREATE INDEX confirmed_checkout_idx ON event (time)
WHERE
  (data ->> 'path') = '/checkout' AND
  (data ->> 'action') = 'click' AND
  (data ->> 'css_hierarchy') LIKE '%div.checkout_modal%a.btn' AND
  (data ->> 'target_text') = 'Confirm Order';
...
SELECT
  COUNT(*) AS value,
  date_trunc('month', to_timestamp(time / 1000) AT TIME ZONE 'UTC') AS time_bucket
FROM event
WHERE
  customer_id = 135 AND
  time BETWEEN 1424437200000 AND 1429531200000 AND
  (data ->> 'path') = '/checkout' AND
  (data ->> 'action') = 'click' AND
  (data ->> 'css_hierarchy') LIKE '%div.checkout_modal%a.btn' AND
  (data ->> 'target_text') = 'Confirm Order'
GROUP BY time_bucket;
59. Partial Index Strategy
• Structure the event table such that every event
definition is a row-level predicate on it.
• Under the hood, Heap maintains one partial index for
each of those predicates.
• The variety of events that Heap captures is massive, so
any individual event definition is very selective.
• Fits perfectly into our "retroactive" analytics framework.
60. General Read-Path Strategy
• All analyses shard cleanly by (customer_id, user_id),
and every query is built from a sparse set of events.
• Simple meta-formula for most analysis queries:
1. Build up an array of relevant events for each user
2. Pass the array to a custom UDF
3. Join arbitrarily for more filtering, grouping, etc
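Putting the meta-formula together, a funnel over schema #3 can be sketched as follows. The two step predicates and patterns are hypothetical, reusing the pattern syntax from the earlier slides:

```sql
-- 1. Build up each user's array of relevant events (the partial indexes
--    make the sparse WHERE clause cheap despite the table's size).
-- 2. Pass the array to the funnel UDF.
SELECT
  user_id,
  funnel_events(
    array_agg(data ORDER BY time),
    ARRAY['"target_text"=>"Sign Up"', '"target_text"=>"Confirm Order"']
  ) AS steps_completed
FROM event
WHERE customer_id = 135
  AND ((data ->> 'target_text') = 'Sign Up'
    OR (data ->> 'target_text') = 'Confirm Order')
GROUP BY user_id;
```

The per-user result can then be joined or aggregated further for filtering and grouping (step 3).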
61. 1. Excellent read performance, with a few caveats.
2. Flexible event-level indexing and query tuning makes it
easier to make new analyses fast.
3. Much, much less write-time I/O cost.
4. PostgreSQL manages a lot of complexity for us.
Pros Of Schema #3
62. 1. Expensive to maintain all those indexes!
2. Lack of meaningful statistics for the query planner.
3. Bigger disk footprint by ~2.5x.
4. Some of the assumptions are a bit restrictive / don't
degrade gracefully.
Limitations Of Schema #3
63. 1. Data is mostly write-once, never update.
2. Queries map nicely to relational model.
3. Events have a natural ordering (time) which is mostly
monotonic.
4. Analyses are always in terms of defined events
which are very sparse and predictable to a degree.
5. Aggregations partition cleanly at the user level.
Possibly Useful Observations
65. CREATE TABLE event (
  customer_id BIGINT,
  user_id BIGINT,
  event_id BIGINT,
  time BIGINT,
  data JSONB NOT NULL,
  PRIMARY KEY (customer_id, user_id, event_id),
  FOREIGN KEY (customer_id, user_id) REFERENCES user
);
66. CREATE TABLE event (
  customer_id BIGINT,
  user_id BIGINT,
  event_id BIGINT,
  time BIGINT,
  type TEXT,
  hierarchy TEXT,
  target_text TEXT,
  ...
  data JSONB NOT NULL,
  PRIMARY KEY (customer_id, user_id, event_id),
  FOREIGN KEY (customer_id, user_id) REFERENCES user
);
68. CREATE TABLE event (
  customer_id BIGINT,
  user_id BIGINT,
  event_id BIGINT,
  time BIGINT,
  type TEXT, -- btree
  hierarchy TEXT, -- gin
  target_text TEXT,
  ... -- more btrees in here
  data JSONB NOT NULL,
  PRIMARY KEY (customer_id, user_id, event_id),
  FOREIGN KEY (customer_id, user_id) REFERENCES user
);
Can now combine indexes on these!
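The mixed indexing strategy might look like the following sketch. Index names are illustrative, and the trigram GIN index assumes the pg_trgm extension to serve the LIKE-style hierarchy matches:

```sql
CREATE EXTENSION IF NOT EXISTS pg_trgm;

-- One small set of shared indexes replaces many per-definition partials.
CREATE INDEX event_type_idx ON event (type);                -- btree
CREATE INDEX event_target_text_idx ON event (target_text);  -- btree
CREATE INDEX event_hierarchy_idx
  ON event USING gin (hierarchy gin_trgm_ops);              -- gin

-- The planner can now bitmap-AND these for an arbitrary event definition:
--   WHERE type = 'click'
--     AND hierarchy LIKE '%div.checkout_modal%a.btn'
--     AND target_text = 'Confirm Order'
```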
69. 1. Dataset is ~30% smaller on disk.
2. Query planner has much more information to work with,
can use it in more ambitious ways.
3. Can get rid of ~60% of partial indexes and replace them
with small set of simpler indexes.
Pros Of Schema #4
70. 1. Costs ~50% less CPU on write.
2. Costs ~50% more I/O on write.
3. Eliminates a lot of edge cases, degrades more gracefully.
Tradeoffs From Mixed Indexing Strategy
71. CREATE TABLE user (
  customer_id BIGINT,
  user_id BIGINT,
  properties JSONB NOT NULL,
  identity TEXT,
  PRIMARY KEY (customer_id, user_id)
);
How do you represent user moves?
72. Future Work
• Partitioning the events table, many options here.
• Supporting a much more heterogeneous dataset.
• New analysis paradigms.
• Many, many others. (Did I mention we're hiring?)
73. PostgreSQL Wishlist
• Ability to move table data with indexes.
• Partial indexes and composite types have lots of
gotchas if you want index-only scans.
• Better ability to keep the visibility map up to date,
without constant VACUUMing.
• Distributed systems features.