3. THE WIREDTIGER STORAGE ENGINE
Storage engine:
The “storage engine” is the MongoDB database server code responsible for single-node durability, and it is the primary enforcer of data consistency guarantees.
• MongoDB has a pluggable storage engine architecture
‒ Three well-known engines: MMAPv1, WiredTiger and RocksDB
• WiredTiger is the default storage engine
6. PROBLEM 1: LOTS OF COLLECTIONS
• MongoDB applications create collections to hold their documents
• Each collection has some set of indexes
‒ Documents are indexed in multiple ways
‒ A typical application has hundreds of collections
• But some applications create A LOT of collections:
‒ Avoiding concurrency bottlenecks in the MMAPv1 storage engine
‒ Multi-tenant applications
‒ Creative schema design: time-series data
8. PROBLEM 2: INVALID ASSUMPTIONS
• WiredTiger was designed for applications with known workloads
‒ Its design choices are based on that assumption
‒ But MongoDB is used for all kinds of things!
• Application writers make assumptions, too!
‒ MMAPv1 built on top of mmap: different performance characteristics
‒ Most MMAPv1 users migrated without problems
• Engineering is a process of continual improvement
9. WHAT DID WE DO ABOUT IT?
• Got better at measuring applications
‒ Full-time diagnostic data capture (FTDC)
‒ Identifying bottlenecks
• WiredTiger with lots of collections:
‒ Handle caches didn’t scale
‒ Page cache eviction inefficient with lots of trees
‒ Especially when access patterns are skewed
11. FIRST, FIND A TUNABLE WORKLOAD
• Runs standalone on a modern server
• 64 client threads doing 10K updates / second (total)
• Keep the data working set constant
• But with a small cache size so eviction is exercised
• Vary the number of collections
‒ And nothing else!
‒ Workload spread across an increasing number of collections
• Stop when average latency > 50ms per operation (a minimal driver sketch follows)
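A minimal driver sketch, in C with POSIX threads, of what such a workload looks like; the real measurements were of course taken against a MongoDB server, and do_update() here is only a stub standing in for one update against one collection:

    /* Tunable-workload driver sketch: 64 client threads, ~10K updates/second
     * in total, varying only the number of collections, stopping at the 50ms
     * average-latency tipping point. */
    #include <pthread.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define THREADS    64
    #define TOTAL_RATE 10000                  /* updates/second, all threads */

    static int ncollections;                  /* the only experimental variable */
    static double total_ms;                   /* protected by the stats lock */
    static long total_ops;                    /* protected by the stats lock */
    static pthread_mutex_t stats = PTHREAD_MUTEX_INITIALIZER;

    /* Stub standing in for one update against one collection. */
    static void do_update(int collection) { (void)collection; }

    static void *worker(void *arg)
    {
        /* Per-thread pacing interval keeps the aggregate rate constant. */
        struct timespec t0, t1,
            pause = {0, 1000000000L / (TOTAL_RATE / THREADS)};
        unsigned seed = (unsigned)(uintptr_t)arg;
        int done = 0;

        while (!done) {
            clock_gettime(CLOCK_MONOTONIC, &t0);
            do_update(rand_r(&seed) % ncollections);  /* spread the workload */
            clock_gettime(CLOCK_MONOTONIC, &t1);

            pthread_mutex_lock(&stats);
            total_ms += (t1.tv_sec - t0.tv_sec) * 1e3 +
                (t1.tv_nsec - t0.tv_nsec) / 1e6;
            ++total_ops;
            /* Bounded as well, since the stub never trips the threshold. */
            done = total_ops >= 1000000 ||
                (total_ops > 1000 && total_ms / total_ops > 50.0);
            pthread_mutex_unlock(&stats);
            nanosleep(&pause, NULL);
        }
        return NULL;
    }

    int main(int argc, char *argv[])
    {
        pthread_t tid[THREADS];
        long i;

        ncollections = argc > 1 ? atoi(argv[1]) : 1000;
        for (i = 0; i < THREADS; ++i)
            pthread_create(&tid[i], NULL, worker, (void *)i);
        for (i = 0; i < THREADS; ++i)
            pthread_join(tid[i], NULL);
        printf("%d collections: %.2fms average\n", ncollections,
            total_ms / total_ops);
        return 0;
    }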
13. SOLUTION: IMPROVED HANDLE CACHES
• Hash table lookups instead of scanning a list
‒ Assumption: short lists, handle lookups uncommon
‒ Reality: a lookup happens every time a cursor is opened
• Singly-linked lists mean slow deletes
‒ Assumption: deletes uncommon; singly-linked lists are smaller and simpler
‒ Reality: deletes are common, and removing from a singly-linked list requires a scan
14. SOLUTION: IMPROVED HANDLE CACHES
• A global lock means terrible concurrency
‒ Assumption: short lists, so use an exclusive lock
‒ Reality: many operations are read-only; shared read-write locks are better (see the sketch below)
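A sketch of the shape of these three fixes, assuming hypothetical names rather than WiredTiger's actual code: hash-bucket lookup instead of a list scan, doubly-linked chains so removal needs no scan, and a shared read-write lock for the common read-only path (insertion is omitted for brevity):

    #include <pthread.h>
    #include <stdint.h>
    #include <string.h>

    #define HANDLE_BUCKETS 512

    struct handle {
        const char *name;        /* table name, the hash key */
        struct handle *next;     /* doubly-linked: deletes need no scan */
        struct handle *prev;
    };

    static struct handle *buckets[HANDLE_BUCKETS];
    static pthread_rwlock_t handle_lock = PTHREAD_RWLOCK_INITIALIZER;

    static uint32_t hash_name(const char *name)
    {
        uint32_t h = 2166136261u;            /* FNV-1a */
        for (; *name != '\0'; ++name)
            h = (h ^ (uint8_t)*name) * 16777619u;
        return h % HANDLE_BUCKETS;
    }

    /* Lookup takes only a shared lock: the common, read-only case. */
    struct handle *handle_find(const char *name)
    {
        struct handle *h;

        pthread_rwlock_rdlock(&handle_lock);
        for (h = buckets[hash_name(name)]; h != NULL; h = h->next)
            if (strcmp(h->name, name) == 0)
                break;
        pthread_rwlock_unlock(&handle_lock);
        return h;
    }

    /* Removal takes the exclusive lock, but is O(1) once found. */
    void handle_remove(struct handle *h)
    {
        pthread_rwlock_wrlock(&handle_lock);
        if (h->prev != NULL)
            h->prev->next = h->next;
        else
            buckets[hash_name(h->name)] = h->next;
        if (h->next != NULL)
            h->next->prev = h->prev;
        pthread_rwlock_unlock(&handle_lock);
    }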
15. SOLUTION: SMARTER EVICTION
• WiredTiger evicts some pages from every tree
‒ Assumption: uniformity of data across collections
• Finding evictable pages is a significant problem
‒ The trees’ retrieval data structures are all you have
• Skewed data access
‒ Lots of trees are idle in common applications
‒ Multi-tenant and time-series data are prime examples
‒ Often 1-5 trees dominate a cache of 10K trees (a sketch of skipping idle trees follows)
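A hypothetical sketch of the eviction tweak; the structure and field names are illustrative only, not WiredTiger's:

    /* Eviction-walk sketch: with skewed access, most trees are idle, so
     * skip any tree with no pages in cache rather than visiting every
     * tree in a 10K-entry list on every pass. */
    struct tree {
        struct tree *next;          /* the list of all open trees */
        long pages_in_cache;        /* zero once an idle tree empties out */
    };

    extern void evict_some_pages(struct tree *);   /* choose victims by age */

    void eviction_pass(struct tree *trees)
    {
        struct tree *t;

        for (t = trees; t != NULL; t = t->next) {
            if (t->pages_in_cache == 0)
                continue;           /* idle tree: even looking at it is waste */
            evict_some_pages(t);
        }
    }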
16. RESULTS – HANDLE CACHE AND EVICTION
[Chart: average latency (ms, 0–50) vs. number of collections (1,000–100,000, log scale); series: 3.2.0, 3.4.0, eviction tweak.]
17. SOLUTION: IMPROVE CHECKPOINTS
• Assumptions:
‒ Checkpoints are rare events; high-end applications configure journaling
‒ An exclusive lock is held while finding handles that need to be checkpointed
‒ Drops are rare events, scheduled by the application
• Reality:
‒ Checkpoints are continuous, every 60 seconds for historical reasons
‒ With 100K trees, the exclusive lock is held far too long
‒ Drops happen randomly
18. SOLUTION: IMPROVE CHECKPOINTS
• Skewed access patterns
‒ Reviewing handles with no data is wasted effort
‒ Locks aren’t held as long if we skip clean handles
• Split checkpoints into two phases (sketched below)
‒ Phase 1: most of the I/O happens, multithreaded
‒ Phase 2: trees made consistent, metadata updated/flushed, single-threaded
• A “delayed drops” feature allows drops during checkpoints
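A hypothetical sketch of the two-phase shape; the function names are illustrative, not WiredTiger's:

    #include <pthread.h>

    /* Two-phase checkpoint sketch.  Phase 1, where most of the I/O
     * happens, runs multithreaded without the exclusive lock; Phase 2
     * briefly takes it to make trees consistent and flush metadata.
     * Clean handles are skipped in both phases. */
    struct tree;                                      /* opaque */
    extern int  tree_is_dirty(struct tree *);
    extern struct tree *next_tree(struct tree *);
    extern void write_dirty_pages(struct tree *);     /* slow, multithreaded */
    extern void make_consistent(struct tree *);       /* fast */
    extern void flush_metadata(void);

    void checkpoint(struct tree *trees, pthread_rwlock_t *schema_lock)
    {
        struct tree *t;

        /* Phase 1: bulk page writes, no exclusive lock held. */
        for (t = trees; t != NULL; t = next_tree(t))
            if (tree_is_dirty(t))
                write_dirty_pages(t);

        /* Phase 2: short single-threaded section under the lock. */
        pthread_rwlock_wrlock(schema_lock);
        for (t = trees; t != NULL; t = next_tree(t))
            if (tree_is_dirty(t))
                make_consistent(t);
        flush_metadata();
        pthread_rwlock_unlock(schema_lock);
    }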
19. SOLUTION: AGGRESSIVE SWEEPING
• Assumption: lazily sweep the handle list
• Reality: a list of 1M handles takes too long to walk
‒ Aggressively discard cached handles we don’t need
21. SOLUTION: GROUP COLLECTIONS
• Assumption: map each MongoDB collection/index to its own table
• Reality:
‒ Makes all the handle caches big
‒ Relies on fast caches and a fast filesystem
‒ 1M files in a directory is problematic for some filesystems
• Add a “--groupCollections” option to MongoDB
‒ 2 tables per database (collections, indexes)
‒ Adds a prefix to keys (see the sketch below)
‒ Transparent to applications, although it requires configuration
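A sketch of the key-prefix idea, assuming a fixed-width, big-endian 32-bit collection ID; the width and encoding are hypothetical, not MongoDB's actual format:

    #include <stdint.h>
    #include <string.h>

    /* Grouped-collections sketch: instead of one table per collection,
     * many collections share one table, with every key prefixed by a
     * fixed-width collection ID. */
    size_t make_grouped_key(uint32_t collection_id,
        const void *key, size_t key_len, uint8_t *out)
    {
        /* Big-endian prefix: one collection's keys sort contiguously. */
        out[0] = (uint8_t)(collection_id >> 24);
        out[1] = (uint8_t)(collection_id >> 16);
        out[2] = (uint8_t)(collection_id >> 8);
        out[3] = (uint8_t)collection_id;
        memcpy(out + 4, key, key_len);
        return 4 + key_len;
    }

With this layout, scanning one collection becomes a prefix scan of the shared table, and the per-table handle caches stop growing with the number of collections.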
26. A MILLION COLLECTIONS: SUMMARY
• Got better at measuring performance
• Examined and changed our assumptions
• Tuned data structures and algorithms
• New data representation: grouped collections
27. “It’s not what you don’t know that gets you into trouble -- it’s what you know that just isn’t true.”
Said nobody ever
28. A MILLION COLLECTIONS DELIVERABLES
• All tuning work included in the MongoDB 3.6 release.
• Grouped collections feature pushed out of the 3.6 release
‒ Improvements sufficient without requiring application API change?
‒ Increased focus on new transactional features
• More tuning is happening for the next MongoDB release
‒ Integrating the MongoDB and WiredTiger caching
31. TO ACCOMMODATE NEW APPLICATIONS
• MongoDB was designed for a NoSQL, schema-less world
‒ Transactional semantics were less of an application requirement
• MongoDB’s application domain is growing
‒ Supporting more traditional applications
‒ Often, applications surrounding the existing MongoDB space
• Also, simplifying existing applications
32. TRANSACTIONS: ACID
• Atomicity
‒ All or nothing.
• Consistency
‒ Database constraints aren’t violated (“constraints” being individually defined)
• Isolation
‒ Transaction integrity and visibility
• Durability
‒ Permanence in the face of bad stuff happening
34. MONGODB’S PRESENT
• ACID, of course
• Single-document transactions
‒ Atomically update multiple fields of a document (and its indexes)
‒ A transaction cannot span multiple documents or collections
‒ For multi-document changes, applications implement some version of two-phase commit
• Single-server consistency
‒ Eventual consistency on the secondaries
35. MONGODB’S FUTURE: MULTI-DOCUMENT TRANSACTIONS
• Application developers want them:
‒ Some workloads require them
‒ Developers struggle with error handling
‒ Increase application performance, decrease application complexity
• MongoDB developers want them:
‒ Chunk migration to balance content on shards
‒ Changing shard keys
36. NECESSARY RISK: INCREASING SHARD ENTANGLEMENT
• Increasing inter-shard entanglement
‒ The wrong answer is easy, the right answer takes more communication
• Chunk balance should not affect correctness
• Shards can’t simply abort transactions to get unstuck
• Additional migration complexity
• Shard entanglement impacts availability
37. OTHER RISKS AND KNOCK-ON EFFECTS
• Developers use transactions rather than appropriate schemas
‒ Long-running transactions are seductive
• Inevitably, the rate of concurrency collisions increases
• Significant technical complexity
‒ Multi-year project
‒ Every part of the server team: replication, sharding, query, storage
‒ Significantly increases pressure on the storage engines
38. FEATURES ALONG THE WAY
• Automatically avoid dirty secondary reads (3.6!)
• Retryable writes (3.6!)
‒ Applications don’t have to manage write collisions
• Global point-in-time reads
‒ Single system-wide clock ordering operations
• Multi-document transactions
42. TRANSACTION INFORMATION
• 8-byte transaction ID
• Isolation level and snapshot information
‒ Read-uncommitted: sees everything
‒ Read-committed: sees updates committed after start
‒ Snapshot: sees only updates committed before start
• Linked list of change records, called “updates”
‒ For logging on commit
‒ For discard on rollback (an illustrative struct sketch follows)
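A simplified sketch of the shape of this per-transaction state; the field names are illustrative, not WiredTiger's declarations:

    #include <stddef.h>
    #include <stdint.h>

    typedef enum {
        ISO_READ_UNCOMMITTED,       /* sees everything */
        ISO_READ_COMMITTED,         /* sees commits after start */
        ISO_SNAPSHOT                /* sees only commits before start */
    } isolation_t;

    struct update;                  /* defined with the key, below */

    struct txn {
        uint64_t id;                /* 8-byte transaction ID */
        isolation_t isolation;
        uint64_t snap_min, snap_max;   /* snapshot: the concurrent-txn window */
        struct update **mods;       /* change records: logged on commit, */
        size_t mod_count;           /*   discarded on rollback */
    };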
43. UPDATE INFORMATION
• Updates include:
‒ A transaction ID, which embodies state (committed or not)
‒ A data package
[Diagram: a key referencing an update, a transaction ID paired with its data.]
44. MULTI-VERSION CONCURRENCY CONTROL
• Each key references:
‒ A chain of updates, in most-recently-modified order
‒ The original value, the update visible to everybody (a visibility-walk sketch follows the diagram)
[Diagram: a key referencing a chain of updates (transaction ID + data), ending in the globally visible data.]
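A simplified sketch of the update chain and the visibility walk; txn_visible() is a hypothetical stand-in for WiredTiger's snapshot-visibility check:

    #include <stddef.h>
    #include <stdint.h>

    struct txn;                       /* the reader's transaction, as above */

    /* Each key heads a chain of updates, most recent first, ending at
     * the original, globally visible value. */
    struct update {
        uint64_t txnid;               /* embodies state: committed or not */
        struct update *next;          /* older update; NULL past the original */
        size_t size;                  /* the data package */
        uint8_t data[];
    };

    /* Is an update's transaction visible to this reader's snapshot?
     * A real engine consults isolation level and commit state. */
    extern int txn_visible(const struct txn *, uint64_t txnid);

    /* Return the newest update visible to the reader: walk from the
     * head of the chain toward the globally visible original value. */
    const struct update *
    read_key(const struct txn *txn, const struct update *chain)
    {
        const struct update *upd;

        for (upd = chain; upd != NULL; upd = upd->next)
            if (txn_visible(txn, upd->txnid))
                return upd;
        return NULL;                  /* unreachable if the tail is global */
    }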
45. WIREDTIGER NAMED SNAPSHOTS FEATURE
• Snapshot: a point in time
• Snapshots can be named
‒ Transactions can be started “as of” that snapshot
‒ Readers use this to access data as of a point in time (see the sketch below)
• But... snapshots keep data pinned in cache
‒ Newer data cannot be discarded
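A usage sketch based on the WiredTiger named-snapshot API of that era (the feature was later superseded by timestamps); error handling is reduced to a small helper:

    #include <stdio.h>
    #include <stdlib.h>
    #include <wiredtiger.h>

    static void check(int ret)
    {
        if (ret != 0) {
            fprintf(stderr, "%s\n", wiredtiger_strerror(ret));
            exit(EXIT_FAILURE);
        }
    }

    /* "session" is an open WT_SESSION. */
    static void snapshot_demo(WT_SESSION *session)
    {
        /* Name a point in time. */
        check(session->snapshot(session, "name=point1"));

        /* Start a transaction reading "as of" the named snapshot:
         * cursor reads see no data newer than point1. */
        check(session->begin_transaction(session, "snapshot=point1"));
        /* ... reads ... */
        check(session->rollback_transaction(session, NULL));

        /* Drop named snapshots: until then they pin data in cache. */
        check(session->snapshot(session, "drop=(all)"));
    }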
46. MONGODB ON TOP OF WIREDTIGER MODEL
• MongoDB maps document changes into this model
‒ For example, a single document change also involves its indexes
‒ A glue layer sits below the pluggable storage engine API
• Read concern majority
‒ Visible data has been written to a majority of the replica set; in other words, it won’t disappear
‒ Requires the --enableMajorityReadConcern configuration
‒ Built on WiredTiger’s named snapshots
47. INTRODUCING SYSTEM TIMESTAMPS
• Applications have their own notion of transactions and time
‒ It defines an expected commit order
‒ It defines durability for a set of systems
• WiredTiger takes a fixed-length byte-string transaction ID
‒ Simply increasing (but not necessarily monotonic)
‒ A “most significant byte first” hexadecimal string, so IDs compare with memcmp
‒ 8 bytes today, expected to grow to encompass system-wide ordering
‒ Mix-and-match with native WiredTiger transactions
48. MONGODB USES AN “AS OF” TIMESTAMP
• Updates now include a timestamp transaction ID
‒ The timestamp is tracked in WiredTiger’s update structure
‒ Smaller is better: a timestamp is significant overhead on small updates
• Commit “as of” a timestamp
‒ Set during the update or later, at transaction commit
• Read “as of” a timestamp
‒ Set at transaction begin
‒ Point-in-time reads: the largest timestamp less than or equal to the given value (see the sketch below)
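A usage sketch of the 3.6-era WiredTiger timestamp calls; it assumes a table with string keys and values, and the timestamp values (hexadecimal strings) are arbitrary examples:

    #include <wiredtiger.h>

    /* Error handling omitted for brevity. */
    static void timestamp_demo(WT_SESSION *session, WT_CURSOR *cursor)
    {
        /* Read "as of": set at transaction begin.  For each key, the
         * transaction sees the newest value with commit timestamp <= 1a. */
        session->begin_transaction(session, "read_timestamp=1a");
        /* ... cursor reads ... */
        session->rollback_transaction(session, NULL);

        /* Commit "as of": the timestamp can be set during the
         * transaction, as here, or supplied at commit. */
        session->begin_transaction(session, "isolation=snapshot");
        cursor->set_key(cursor, "key");
        cursor->set_value(cursor, "value");
        cursor->update(cursor);
        session->timestamp_transaction(session, "commit_timestamp=1b");
        session->commit_transaction(session, NULL);
    }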
49. MONGODB SETS THE “OLDEST” TIMESTAMP
• Limits future reads: no transaction may read from before it
• The point at which WiredTiger can discard history
• Cannot go backward; must be updated frequently
50. MONGODB SETS THE “STABLE” TIMESTAMP
• Limits future durability rollbacks
‒ Imagine an election where the new primary hasn’t seen a committed update
• WiredTiger writes checkpoints at the stable timestamp
‒ The storage engine can’t write what might be rolled back
• Cannot go backward; must be updated frequently (both timestamps are sketched below)
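A sketch of the maintenance calls MongoDB issues periodically; the timestamp values are hypothetical:

    #include <wiredtiger.h>

    /* Both timestamps may only move forward; error handling omitted. */
    static void advance_timestamps(WT_CONNECTION *conn)
    {
        /* oldest: no transaction may read before this point, so
         * WiredTiger can discard older history.
         * stable: checkpoints are written here, so nothing that a
         * replica-set election might roll back reaches a checkpoint. */
        conn->set_timestamp(conn,
            "oldest_timestamp=2a,stable_timestamp=2f");
    }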
51. READ CONCERN MAJORITY FEATURE
• In 3.4, implemented with WiredTiger named snapshots
‒ Every write created a named snapshot
‒ Heavyweight: interacts directly with the WiredTiger transaction subsystem
• In 3.6, implemented with read “as of”
‒ Lightweight and fast
‒ The configuration option is now a no-op: it’s “always on”
52. OPLOG IMPROVEMENTS
• MongoDB does replication by copying its “journal”, the oplog
‒ The oplog is bulk-loaded on secondaries
‒ The oplog is loaded out-of-order for performance
• The scanning cursor has strict visibility-order requirements
‒ No skipping records
‒ No updates visible after the oldest uncommitted update
53. OPLOG IMPROVEMENTS
• In 3.4, implemented using WiredTiger named snapshots
• JIRA ticket:
“Under heavy insert load on a 2-node replica set, WiredTiger eviction
appears to hang on the secondary.”
• In 3.6, implemented using timestamps
54. A NEW TRANSACTIONAL MODEL SUMMARY
• Significant storage engine changes
• Enhancing transactional consistency for new applications
• Features and improvements in MongoDB 3.6
‒ Retryable writes
‒ Safe secondary reads
‒ Significantly improved performance
SPEAKER NOTES
Member of the storage group; storage is part of the server development group.
Server is the core MongoDB database product.
Storage underlies the technology and features you’ll hear about today.
Storage defines durability and consistency (isolation and visibility); as a consequence, storage owns concurrency.
Pluggable architecture: per-workload storage engines.
Default engine: acceptable behavior for all workloads.
Engineering process discussion
WT began as a separate product; it was integrated as part of MongoDB 3.0 after the 2014 acquisition.
MMAPv1: lots of collections, fast in-place updates.
Parallel effort at MongoDB to measure performance
FTDC data is heavily compressed where measurements don’t change.
Lots of small collections make it hard to spot pages to discard, especially when few are hot.
We assumed uniformity across large objects; we found skewed access across tiny objects.
Single-node overview
Layered diagram shows caching at each layer
MongoDB session / cursor cache (next area of work)
WT cursor cache
WT data handle cache
WT file handle cache
10,000 connections * 10,000 tables, add indexes, and that’s a multiplier
1M files is problematic for some filesystems.
Design a workload for tuning: there are too many moving parts in MongoDB.
Varying anything else risks losing the problem.
In an ideal world, increasing the number of collections would make no difference.
We didn’t know we were making the problem worse.
3.4 degrades much more quickly than 3.2: logarithmic scale!
Data structures assumed we’d never have lots of collections or frequently change them
Modern eviction algorithms don’t have any kind of real queue; it’s too slow.
Pages reside elsewhere, and there’s information that lets you know their “age”.
Assumes uniformity of the data across collections.
Multi-tenant workloads are skewed.
Once idle trees empty out, even looking at them is a waste of time.
Still nowhere near 1M, but at least back to where we were in 3.2
Obvious data structures and tuning changes.
Checkpoints hold exclusive locks and slow everything down.
Note the x axis scale change, we can now see the 1M target
Someday there may be a middle ground; we’ll need to create subdirectories for data.
Security becomes more interesting: data is co-resident.
Architecture with Grouped Collections
Revisits main architecture diagram with changes for grouped collections
Assuming a single database here, the cursor cache size is now based only on the number of connections, and we have ways to limit how big it gets in practice.
Grouped collections gets us to 1M with < 10ms average latency.
We get to about 250K with < 10ms average latency without changing the API, and to 800K under 30ms.
Tipping point: greater than 50ms.
This graph shows how many collections we can support: more is better.
It’s what you “know” that just isn’t true...
It’s all about changing our assumptions to handle more workloads.
Because the tuning efforts were successful (800,000 collections), reaching 1M became less important.
Additionally, there are significant tuning, space, and application-API issues with grouped collections: for example, compaction, collection drop, security and so on. Solving the problem without a new feature API is better.
What if you change your mind later and want to split the two files up?
Define the terms and get everybody on the same page
Everybody offers a version of ACID, including MongoDB
Differences generally around relaxing consistency guarantees
MongoDB’s traditional applications have CAP tradeoffs
MongoDB’s original design chose partition tolerance and availability over consistency.
Extending to support more consistency rules.
MongoDB supports ACID, but it only applies to individual write operations.
“write operations” is a high-level concept, indexes are kept consistent.
In 3.4, linearizable reads: write the primary, then force a read from a secondary to block until it sees the write.
Application developers want to shift complexity into the database.
Application developer skill set not suited to building database applications.
Golden Rule: may not impact the performance of applications not using transactions.
safe secondary reads: automatically avoiding dirty reads (?)
global point-in-time reads: applications read as of a single point in the causal chain
retryable writes: retry automatically so applications don’t have to manage write collisions
multi-document transactions: modify multiple documents/collections atomically
storage engine semantics: a relatively standard single-node model.
two types of durability: checkpoint & journalling
standard write-ahead-logging
log records are redo only; entire change record must fit into memory
cursors iterate, remove, standard CRUD operations
key-value store: MongoDB maps to documents & indexes
Updates and inserts
The transaction ID is an identifier into a table of information.
Inserts are single entries, with lists of updates.
When a cursor encounters a key, compare the cursor’s and the key/update’s transaction IDs.
--enableMajorityReadConcern: visible data must have been written to a majority of the replica set
allow the distributed layer to define/order transactions
8-byte IDs are fast and lockless on 64-bit machines; they will grow to incorporate cluster-wide clock information.
Most significant byte first, so IDs compare with memcmp.
MongoDB and WiredTiger transactions co-exist:
Allows applications to mix-and-match where threads don’t care about timestamps
you can get into trouble: operations on an item must be in timestamp order
commit timestamps must be ahead of any read timestamp
setting a read timestamp forces snapshot isolation at that timestamp
Oldest possible reader, including replicated state
To avoid caching unbounded updates, the read “as of” timestamp must move forward.
Moving complexity into the storage engine, particularly around caching.
The distributed engine can roll back locally “durable” events.
Write-concern majority: one node might have seen a committed event, but if the primary never saw it, it is rolled back.
Generally, a well-behaved replica set is expected not to fall behind.
Local crash safety holds because checkpoints happen at the stable timestamp.
Benefits in 3.6:
complete a read query using the same specified snapshot for its entirety, on a replica set.
every write is a WT “snapshot”
when secondary receives a majority read request, finds a snapshot that’s majority confirmed to use
required all requests over a single socket
oplog is the source of truth
Snapshots pin memory: two nodes running on a 24-cpu box, 32GB RAM, pushing 16 threads with vectored writes of 100 tiny documents at a time.
The oplog was created for capped collections.
It took on a replication role since it looks a lot like a shared log.