3. THE WIREDTIGER STORAGE ENGINE
Storage engine:
The “storage engine” is the MongoDB database server code responsible for single-node durability, and it is the primary enforcer of data consistency guarantees.
• MongoDB has a pluggable storage engine architecture
‒ Three well-known engines: MMAPv1, WiredTiger and RocksDB
• WiredTiger is the default storage engine
6. PROBLEM 1: LOTS OF COLLECTIONS
• MongoDB applications create collections to hold their documents
• Each collection has some set of indexes
‒ Documents are indexed in multiple ways
‒ A typical application has hundreds of collections
• But some applications create A LOT of collections:
‒ Avoiding concurrency bottlenecks in the MMAPv1 storage engine
‒ Multi-tenant applications
‒ Creative schema design: time-series data
8. PROBLEM 2: INVALID ASSUMPTIONS
• WiredTiger was designed for applications with known workloads
‒ Its design choices are based on that assumption
‒ But MongoDB is used for all kinds of things!
• Application writers make assumptions, too!
‒ MMAPv1 built on top of mmap: different performance characteristics
‒ Most MMAPv1 users migrated without problems
• Engineering is a process of continual improvement
9. WHAT DID WE DO ABOUT IT?
• Got better at measuring applications
‒ Full-time diagnostic data capture (FTDC)
‒ Identifying bottlenecks
• WiredTiger with lots of collections:
‒ Handle caches didn’t scale
‒ Page cache eviction inefficient with lots of trees
‒ Especially when access patterns are skewed
11. FIRST, FIND A TUNABLE WORKLOAD
• Runs standalone on a modern server
• 64 client threads doing 10K updates / second (total)
• Keep the data working set constant
• But with a small cache size so eviction is exercised
• Vary the number of collections
‒ And nothing else!
‒ Workload spread across an increasing number of collections
• Stop when average latency > 50ms per operation (a minimal driver sketch follows)
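A minimal driver sketch, in C with POSIX threads, of what such a workload looks like; the real measurements were of course taken against a MongoDB server, and do_update() here is only a stub standing in for one update against one collection:

    /* Tunable-workload driver sketch: 64 client threads, ~10K updates/second
     * in total, varying only the number of collections, stopping at the 50ms
     * average-latency tipping point. */
    #include <pthread.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define THREADS    64
    #define TOTAL_RATE 10000                  /* updates/second, all threads */

    static int ncollections;                  /* the only experimental variable */
    static double total_ms;                   /* protected by the stats lock */
    static long total_ops;                    /* protected by the stats lock */
    static pthread_mutex_t stats = PTHREAD_MUTEX_INITIALIZER;

    /* Stub standing in for one update against one collection. */
    static void do_update(int collection) { (void)collection; }

    static void *worker(void *arg)
    {
        /* Per-thread pacing interval keeps the aggregate rate constant. */
        struct timespec t0, t1,
            pause = {0, 1000000000L / (TOTAL_RATE / THREADS)};
        unsigned seed = (unsigned)(uintptr_t)arg;
        int done = 0;

        while (!done) {
            clock_gettime(CLOCK_MONOTONIC, &t0);
            do_update(rand_r(&seed) % ncollections);  /* spread the workload */
            clock_gettime(CLOCK_MONOTONIC, &t1);

            pthread_mutex_lock(&stats);
            total_ms += (t1.tv_sec - t0.tv_sec) * 1e3 +
                (t1.tv_nsec - t0.tv_nsec) / 1e6;
            ++total_ops;
            /* Bounded as well, since the stub never trips the threshold. */
            done = total_ops >= 1000000 ||
                (total_ops > 1000 && total_ms / total_ops > 50.0);
            pthread_mutex_unlock(&stats);
            nanosleep(&pause, NULL);
        }
        return NULL;
    }

    int main(int argc, char *argv[])
    {
        pthread_t tid[THREADS];
        long i;

        ncollections = argc > 1 ? atoi(argv[1]) : 1000;
        for (i = 0; i < THREADS; ++i)
            pthread_create(&tid[i], NULL, worker, (void *)i);
        for (i = 0; i < THREADS; ++i)
            pthread_join(tid[i], NULL);
        printf("%d collections: %.2fms average\n", ncollections,
            total_ms / total_ops);
        return 0;
    }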
13. SOLUTION: IMPROVED HANDLE CACHES
• Hash table lookups instead of scanning a list
‒ Assumption: short lists, handle lookups uncommon
‒ Reality: a lookup happens every time a cursor is opened
• Singly-linked lists mean slow deletes
‒ Assumption: deletes uncommon; singly-linked lists are smaller and simpler
‒ Reality: deletes are common, and removing from a singly-linked list requires a scan
14. SOLUTION: IMPROVED HANDLE CACHES
• A global lock means terrible concurrency
‒ Assumption: short lists, so use an exclusive lock
‒ Reality: many operations are read-only; shared read-write locks are better (see the sketch below)
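A sketch of the shape of these three fixes, assuming hypothetical names rather than WiredTiger's actual code: hash-bucket lookup instead of a list scan, doubly-linked chains so removal needs no scan, and a shared read-write lock for the common read-only path (insertion is omitted for brevity):

    #include <pthread.h>
    #include <stdint.h>
    #include <string.h>

    #define HANDLE_BUCKETS 512

    struct handle {
        const char *name;        /* table name, the hash key */
        struct handle *next;     /* doubly-linked: deletes need no scan */
        struct handle *prev;
    };

    static struct handle *buckets[HANDLE_BUCKETS];
    static pthread_rwlock_t handle_lock = PTHREAD_RWLOCK_INITIALIZER;

    static uint32_t hash_name(const char *name)
    {
        uint32_t h = 2166136261u;            /* FNV-1a */
        for (; *name != '\0'; ++name)
            h = (h ^ (uint8_t)*name) * 16777619u;
        return h % HANDLE_BUCKETS;
    }

    /* Lookup takes only a shared lock: the common, read-only case. */
    struct handle *handle_find(const char *name)
    {
        struct handle *h;

        pthread_rwlock_rdlock(&handle_lock);
        for (h = buckets[hash_name(name)]; h != NULL; h = h->next)
            if (strcmp(h->name, name) == 0)
                break;
        pthread_rwlock_unlock(&handle_lock);
        return h;
    }

    /* Removal takes the exclusive lock, but is O(1) once found. */
    void handle_remove(struct handle *h)
    {
        pthread_rwlock_wrlock(&handle_lock);
        if (h->prev != NULL)
            h->prev->next = h->next;
        else
            buckets[hash_name(h->name)] = h->next;
        if (h->next != NULL)
            h->next->prev = h->prev;
        pthread_rwlock_unlock(&handle_lock);
    }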
15. SOLUTION: SMARTER EVICTION
• WiredTiger evicts some pages from every tree
‒ Assumption: uniformity of data across collections
• Finding evictable pages is a significant problem
‒ The trees’ retrieval data structures are all you have
• Skewed data access
‒ Lots of trees are idle in common applications
‒ Multi-tenant and time-series data are prime examples
‒ Often 1-5 trees dominate a cache of 10K trees (a sketch of skipping idle trees follows)
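A hypothetical sketch of the eviction tweak; the structure and field names are illustrative only, not WiredTiger's:

    /* Eviction-walk sketch: with skewed access, most trees are idle, so
     * skip any tree with no pages in cache rather than visiting every
     * tree in a 10K-entry list on every pass. */
    struct tree {
        struct tree *next;          /* the list of all open trees */
        long pages_in_cache;        /* zero once an idle tree empties out */
    };

    extern void evict_some_pages(struct tree *);   /* choose victims by age */

    void eviction_pass(struct tree *trees)
    {
        struct tree *t;

        for (t = trees; t != NULL; t = t->next) {
            if (t->pages_in_cache == 0)
                continue;           /* idle tree: even looking at it is waste */
            evict_some_pages(t);
        }
    }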
16. RESULTS – HANDLE CACHE AND EVICTION
[Chart: average latency (ms, 0–50) vs. number of collections (1,000–100,000, log scale); series: 3.2.0, 3.4.0, eviction tweak.]
17. SOLUTION: IMPROVE CHECKPOINTS
• Assumptions:
‒ Checkpoints are rare events; high-end applications configure journaling
‒ An exclusive lock is held while finding handles that need to be checkpointed
‒ Drops are rare events, scheduled by the application
• Reality:
‒ Checkpoints are continuous, every 60 seconds for historical reasons
‒ With 100K trees, the exclusive lock is held far too long
‒ Drops happen randomly
18. SOLUTION: IMPROVE CHECKPOINTS
• Skewed access patterns
‒ Reviewing handles with no data is wasted effort
‒ Locks aren’t held as long if we skip clean handles
• Split checkpoints into two phases (sketched below)
‒ Phase 1: most of the I/O happens, multithreaded
‒ Phase 2: trees made consistent, metadata updated/flushed, single-threaded
• A “delayed drops” feature allows drops during checkpoints
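A hypothetical sketch of the two-phase shape; the function names are illustrative, not WiredTiger's:

    #include <pthread.h>

    /* Two-phase checkpoint sketch.  Phase 1, where most of the I/O
     * happens, runs multithreaded without the exclusive lock; Phase 2
     * briefly takes it to make trees consistent and flush metadata.
     * Clean handles are skipped in both phases. */
    struct tree;                                      /* opaque */
    extern int  tree_is_dirty(struct tree *);
    extern struct tree *next_tree(struct tree *);
    extern void write_dirty_pages(struct tree *);     /* slow, multithreaded */
    extern void make_consistent(struct tree *);       /* fast */
    extern void flush_metadata(void);

    void checkpoint(struct tree *trees, pthread_rwlock_t *schema_lock)
    {
        struct tree *t;

        /* Phase 1: bulk page writes, no exclusive lock held. */
        for (t = trees; t != NULL; t = next_tree(t))
            if (tree_is_dirty(t))
                write_dirty_pages(t);

        /* Phase 2: short single-threaded section under the lock. */
        pthread_rwlock_wrlock(schema_lock);
        for (t = trees; t != NULL; t = next_tree(t))
            if (tree_is_dirty(t))
                make_consistent(t);
        flush_metadata();
        pthread_rwlock_unlock(schema_lock);
    }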
19. SOLUTION: AGGRESSIVE SWEEPING
• Assumption: lazily sweep the handle list
• Reality: a list of 1M handles takes too long to walk
‒ Aggressively discard cached handles we don’t need
21. SOLUTION: GROUP COLLECTIONS
• Assumption: map each MongoDB collection/index to its own table
• Reality:
‒ Makes all the handle caches big
‒ Relies on fast caches and a fast filesystem
‒ 1M files in a directory is problematic for some filesystems
• Add a “--groupCollections” option to MongoDB
‒ 2 tables per database (collections, indexes)
‒ Adds a prefix to keys (see the sketch below)
‒ Transparent to applications, although it requires configuration
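A sketch of the key-prefix idea, assuming a fixed-width, big-endian 32-bit collection ID; the width and encoding are hypothetical, not MongoDB's actual format:

    #include <stdint.h>
    #include <string.h>

    /* Grouped-collections sketch: instead of one table per collection,
     * many collections share one table, with every key prefixed by a
     * fixed-width collection ID. */
    size_t make_grouped_key(uint32_t collection_id,
        const void *key, size_t key_len, uint8_t *out)
    {
        /* Big-endian prefix: one collection's keys sort contiguously. */
        out[0] = (uint8_t)(collection_id >> 24);
        out[1] = (uint8_t)(collection_id >> 16);
        out[2] = (uint8_t)(collection_id >> 8);
        out[3] = (uint8_t)collection_id;
        memcpy(out + 4, key, key_len);
        return 4 + key_len;
    }

With this layout, scanning one collection becomes a prefix scan of the shared table, and the per-table handle caches stop growing with the number of collections.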
26. A MILLION COLLECTIONS: SUMMARY
• Got better at measuring performance
• Examined and changed our assumptions
• Tuned data structures and algorithms
• New data representation: grouped collections
27. “It’s not what you don’t know that gets you into trouble -- it’s what you know that just isn’t true.”
Said nobody ever
28. A MILLION COLLECTIONS DELIVERABLES
• All tuning work included in the MongoDB 3.6 release.
• Grouped collections feature pushed out of the 3.6 release
‒ Improvements sufficient without requiring application API change?
‒ Increased focus on new transactional features
• More tuning is happening for the next MongoDB release
‒ Integrating the MongoDB and WiredTiger caching
31. TO ACCOMMODATE NEW APPLICATIONS
• MongoDB was designed for a NoSQL, schema-less world
‒ Transactional semantics were less of an application requirement
• MongoDB’s application domain is growing
‒ Supporting more traditional applications
‒ Often, applications surrounding the existing MongoDB space
• Also, simplifying existing applications
32. TRANSACTIONS: ACID
• Atomicity
‒ All or nothing.
• Consistency
‒ Database constraints aren’t violated (“constraints” being individually defined)
• Isolation
‒ Transaction integrity and visibility
• Durability
‒ Permanence in the face of bad stuff happening
34. MONGODB’S PRESENT
• ACID, of course
• Single-document transactions
‒ Atomically update multiple fields of a document (and its indexes)
‒ A transaction cannot span multiple documents or collections
‒ For multi-document changes, applications implement some version of two-phase commit
• Single-server consistency
‒ Eventual consistency on the secondaries
35. MONGODB’S FUTURE: MULTI-DOCUMENT TRANSACTIONS
• Application developers want them:
‒ Some workloads require them
‒ Developers struggle with error handling
‒ Increase application performance, decrease application complexity
• MongoDB developers want them:
‒ Chunk migration to balance content on shards
‒ Changing shard keys
36. NECESSARY RISK: INCREASING SHARD ENTANGLEMENT
• Increasing inter-shard entanglement
‒ The wrong answer is easy, the right answer takes more communication
• Chunk balance should not affect correctness
• Shards can’t simply abort transactions to get unstuck
• Additional migration complexity
• Shard entanglement impacts availability
37. OTHER RISKS AND KNOCK-ON EFFECTS
• Developers use transactions rather than appropriate schemas
‒ Long-running transactions are seductive
• Inevitably, the rate of concurrency collisions increases
• Significant technical complexity
‒ Multi-year project
‒ Every part of the server team: replication, sharding, query, storage
‒ Significantly increases pressure on the storage engines
38. FEATURES ALONG THE WAY
• Automatically avoid dirty secondary reads (3.6!)
• Retryable writes (3.6!)
‒ Applications don’t have to manage write collisions
• Global point-in-time reads
‒ Single system-wide clock ordering operations
• Multi-document transactions
42. TRANSACTION INFORMATION
• 8-byte transaction ID
• Isolation level and snapshot information
‒ Read-uncommitted: sees everything
‒ Read-committed: sees updates committed after start
‒ Snapshot: sees only updates committed before start
• Linked list of change records, called “updates”
‒ For logging on commit
‒ For discard on rollback (an illustrative struct sketch follows)
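A simplified sketch of the shape of this per-transaction state; the field names are illustrative, not WiredTiger's declarations:

    #include <stddef.h>
    #include <stdint.h>

    typedef enum {
        ISO_READ_UNCOMMITTED,       /* sees everything */
        ISO_READ_COMMITTED,         /* sees commits after start */
        ISO_SNAPSHOT                /* sees only commits before start */
    } isolation_t;

    struct update;                  /* defined with the key, below */

    struct txn {
        uint64_t id;                /* 8-byte transaction ID */
        isolation_t isolation;
        uint64_t snap_min, snap_max;   /* snapshot: the concurrent-txn window */
        struct update **mods;       /* change records: logged on commit, */
        size_t mod_count;           /*   discarded on rollback */
    };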
43. UPDATE INFORMATION
• Updates include:
‒ A transaction ID, which embodies state (committed or not)
‒ A data package
[Diagram: a key referencing an update, a transaction ID paired with its data.]
44. MULTI-VERSION CONCURRENCY CONTROL
• Each key references:
‒ A chain of updates, in most-recently-modified order
‒ The original value, the update visible to everybody (a visibility-walk sketch follows the diagram)
[Diagram: a key referencing a chain of updates (transaction ID + data), ending in the globally visible data.]
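A simplified sketch of the update chain and the visibility walk; txn_visible() is a hypothetical stand-in for WiredTiger's snapshot-visibility check:

    #include <stddef.h>
    #include <stdint.h>

    struct txn;                       /* the reader's transaction, as above */

    /* Each key heads a chain of updates, most recent first, ending at
     * the original, globally visible value. */
    struct update {
        uint64_t txnid;               /* embodies state: committed or not */
        struct update *next;          /* older update; NULL past the original */
        size_t size;                  /* the data package */
        uint8_t data[];
    };

    /* Is an update's transaction visible to this reader's snapshot?
     * A real engine consults isolation level and commit state. */
    extern int txn_visible(const struct txn *, uint64_t txnid);

    /* Return the newest update visible to the reader: walk from the
     * head of the chain toward the globally visible original value. */
    const struct update *
    read_key(const struct txn *txn, const struct update *chain)
    {
        const struct update *upd;

        for (upd = chain; upd != NULL; upd = upd->next)
            if (txn_visible(txn, upd->txnid))
                return upd;
        return NULL;                  /* unreachable if the tail is global */
    }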
45. WIREDTIGER NAMED SNAPSHOTS FEATURE
• Snapshot: a point in time
• Snapshots can be named
‒ Transactions can be started “as of” that snapshot
‒ Readers use this to access data as of a point in time (see the sketch below)
• But... snapshots keep data pinned in cache
‒ Newer data cannot be discarded
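A usage sketch based on the WiredTiger named-snapshot API of that era (the feature was later superseded by timestamps); error handling is reduced to a small helper:

    #include <stdio.h>
    #include <stdlib.h>
    #include <wiredtiger.h>

    static void check(int ret)
    {
        if (ret != 0) {
            fprintf(stderr, "%s\n", wiredtiger_strerror(ret));
            exit(EXIT_FAILURE);
        }
    }

    /* "session" is an open WT_SESSION. */
    static void snapshot_demo(WT_SESSION *session)
    {
        /* Name a point in time. */
        check(session->snapshot(session, "name=point1"));

        /* Start a transaction reading "as of" the named snapshot:
         * cursor reads see no data newer than point1. */
        check(session->begin_transaction(session, "snapshot=point1"));
        /* ... reads ... */
        check(session->rollback_transaction(session, NULL));

        /* Drop named snapshots: until then they pin data in cache. */
        check(session->snapshot(session, "drop=(all)"));
    }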
46. MONGODB ON TOP OF WIREDTIGER MODEL
• MongoDB maps document changes into this model
‒ For example, a single document change also involves its indexes
‒ A glue layer sits below the pluggable storage engine API
• Read concern majority
‒ Visible data has been written to a majority of the replica set; in other words, it won’t disappear
‒ Requires the --enableMajorityReadConcern configuration
‒ Built on WiredTiger’s named snapshots
47. INTRODUCING SYSTEM TIMESTAMPS
• Applications have their own notion of transactions and time
‒ It defines an expected commit order
‒ It defines durability for a set of systems
• WiredTiger takes a fixed-length byte-string transaction ID
‒ Simply increasing (but not necessarily monotonic)
‒ A “most significant byte first” hexadecimal string, so IDs compare with memcmp
‒ 8 bytes today, expected to grow to encompass system-wide ordering
‒ Mix-and-match with native WiredTiger transactions
48. MONGODB USES AN “AS OF” TIMESTAMP
• Updates now include a timestamp transaction ID
‒ The timestamp is tracked in WiredTiger’s update structure
‒ Smaller is better: a timestamp is significant overhead on small updates
• Commit “as of” a timestamp
‒ Set during the update or later, at transaction commit
• Read “as of” a timestamp
‒ Set at transaction begin
‒ Point-in-time reads: the largest timestamp less than or equal to the given value (see the sketch below)
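A usage sketch of the 3.6-era WiredTiger timestamp calls; it assumes a table with string keys and values, and the timestamp values (hexadecimal strings) are arbitrary examples:

    #include <wiredtiger.h>

    /* Error handling omitted for brevity. */
    static void timestamp_demo(WT_SESSION *session, WT_CURSOR *cursor)
    {
        /* Read "as of": set at transaction begin.  For each key, the
         * transaction sees the newest value with commit timestamp <= 1a. */
        session->begin_transaction(session, "read_timestamp=1a");
        /* ... cursor reads ... */
        session->rollback_transaction(session, NULL);

        /* Commit "as of": the timestamp can be set during the
         * transaction, as here, or supplied at commit. */
        session->begin_transaction(session, "isolation=snapshot");
        cursor->set_key(cursor, "key");
        cursor->set_value(cursor, "value");
        cursor->update(cursor);
        session->timestamp_transaction(session, "commit_timestamp=1b");
        session->commit_transaction(session, NULL);
    }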
49. MONGODB SETS THE “OLDEST” TIMESTAMP
• Limits future reads: no transaction may read from before it
• The point at which WiredTiger can discard history
• Cannot go backward; must be updated frequently
50. MONGODB SETS THE “STABLE” TIMESTAMP
• Limits future durability rollbacks
‒ Imagine an election where the new primary hasn’t seen a committed update
• WiredTiger writes checkpoints at the stable timestamp
‒ The storage engine can’t write what might be rolled back
• Cannot go backward; must be updated frequently (both timestamps are sketched below)
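A sketch of the maintenance calls MongoDB issues periodically; the timestamp values are hypothetical:

    #include <wiredtiger.h>

    /* Both timestamps may only move forward; error handling omitted. */
    static void advance_timestamps(WT_CONNECTION *conn)
    {
        /* oldest: no transaction may read before this point, so
         * WiredTiger can discard older history.
         * stable: checkpoints are written here, so nothing that a
         * replica-set election might roll back reaches a checkpoint. */
        conn->set_timestamp(conn,
            "oldest_timestamp=2a,stable_timestamp=2f");
    }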
51. READ CONCERN MAJORITY FEATURE
• In 3.4, implemented with WiredTiger named snapshots
‒ Every write created a named snapshot
‒ Heavyweight: interacts directly with the WiredTiger transaction subsystem
• In 3.6, implemented with read “as of”
‒ Lightweight and fast
‒ The configuration option is now a no-op: it’s “always on”
52. OPLOG IMPROVEMENTS
• MongoDB does replication by copying its “journal”, the oplog
‒ The oplog is bulk-loaded on secondaries
‒ The oplog is loaded out-of-order for performance
• The scanning cursor has strict visibility-order requirements
‒ No skipping records
‒ No updates visible after the oldest uncommitted update
53. OPLOG IMPROVEMENTS
• In 3.4, implemented using WiredTiger named snapshots
• JIRA ticket:
“Under heavy insert load on a 2-node replica set, WiredTiger eviction
appears to hang on the secondary.”
• In 3.6, implemented using timestamps
54. A NEW TRANSACTIONAL MODEL SUMMARY
• Significant storage engine changes
• Enhancing transactional consistency for new applications
• Features and improvements in MongoDB 3.6
‒ Retryable writes
‒ Safe secondary reads
‒ Significantly improved performance
SPEAKER NOTES
Member of the storage group; storage is part of the server development group.
Server is the core MongoDB database product.
Storage underlies the technology and features you’ll hear about today.
Storage defines durability and consistency (isolation and visibility); as a consequence, storage owns concurrency.
Pluggable architecture: per-workload storage engines.
Default engine: acceptable behavior for all workloads.
Engineering process discussion
WT began as a separate product; it was integrated as part of MongoDB 3.0 after the 2014 acquisition.
MMAPv1: lots of collections, fast in-place updates.
Parallel effort at MongoDB to measure performance
FTDC data is heavily compressed where measurements don’t change.
Lots of small collections make it hard to spot pages to discard, especially when few are hot.
We assumed uniformity across large objects; we found skewed access across tiny objects.
Single-node overview
Layered diagram shows caching at each layer
MongoDB session / cursor cache (next area of work)
WT cursor cache
WT data handle cache
WT file handle cache
10,000 connections * 10,000 tables, add indexes, and that’s a multiplier
1M files is problematic for some filesystems.
Design a workload for tuning: there are too many moving parts in MongoDB.
Varying anything else risks losing the problem.
In an ideal world, increasing the number of collections would make no difference.
We didn’t know we were making the problem worse.
3.4 degrades much more quickly than 3.2: logarithmic scale!
Data structures assumed we’d never have lots of collections or frequently change them
Modern eviction algorithms don’t have any kind of real queue; it’s too slow.
Pages reside elsewhere, and there’s information that lets you know their “age”.
Assumes uniformity of the data across collections.
Multi-tenant workloads are skewed.
Once idle trees empty out, even looking at them is a waste of time.
Still nowhere near 1M, but at least back to where we were in 3.2
Obvious data structures and tuning changes.
Checkpoints hold exclusive locks and slow everything down.
Note the x axis scale change, we can now see the 1M target
Someday there may be a middle ground; we’ll need to create subdirectories for data.
Security becomes more interesting: data is co-resident.
Architecture with Grouped Collections
Revisits main architecture diagram with changes for grouped collections
Assuming a single database here, the cursor cache size is now based only on the number of connections, and we have ways to limit how big it gets in practice.
Grouped collections gets us to 1M with < 10ms average latency.
We get to about 250K with < 10ms average latency without changing the API, and to 800K under 30ms.
Tipping point: greater than 50ms.
This graph shows how many collections we can support: more is better.
It’s what you “know” that just isn’t true...
It’s all about changing our assumptions to handle more workloads.
Because the tuning efforts were successful (800,000 collections), reaching 1M became less important.
Additionally, there are significant tuning, space, and application-API issues with grouped collections: for example, compaction, collection drop, security and so on. Solving the problem without a new feature API is better.
What if you change your mind later and want to split the two files up?
Define the terms and get everybody on the same page
Everybody offers a version of ACID, including MongoDB
Differences generally around relaxing consistency guarantees
MongoDB’s traditional applications have CAP tradeoffs
MongoDB’s original design chose partition tolerance and availability over consistency.
Extending to support more consistency rules.
MongoDB supports ACID, but it only applies to individual write operations.
“write operations” is a high-level concept, indexes are kept consistent.
In 3.4, linearizable reads: write the primary, then force a read from a secondary to block until it sees the write.
Application developers want to shift complexity into the database.
Application developer skill set not suited to building database applications.
Golden Rule: may not impact the performance of applications not using transactions.
safe secondary reads: automatically avoiding dirty reads (?)
global point-in-time reads: applications read as of a single point in the causal chain
retryable writes: retry automatically so applications don’t have to manage write collisions
multi-document transactions: modify multiple documents/collections atomically
storage engine semantics: a relatively standard single-node model.
two types of durability: checkpoint & journalling
standard write-ahead-logging
log records are redo only; entire change record must fit into memory
cursors iterate, remove, standard CRUD operations
key-value store: MongoDB maps to documents & indexes
Updates and inserts
The transaction ID is an identifier into a table of information.
Inserts are single entries, with lists of updates.
When a cursor encounters a key, compare the cursor’s and the key/update’s transaction IDs.
--enableMajorityReadConcern: visible data must have been written to a majority of the replica set
allow the distributed layer to define/order transactions
8-byte IDs are fast and lockless on 64-bit machines; they will grow to incorporate cluster-wide clock information.
Most significant byte first, so IDs compare with memcmp.
MongoDB and WiredTiger transactions co-exist:
Allows applications to mix-and-match where threads don’t care about timestamps
you can get into trouble: operations on an item must be in timestamp order
commit timestamps must be ahead of any read timestamp
setting a read timestamp forces snapshot isolation at that timestamp
Oldest possible reader, including replicated state
To avoid caching unbounded updates, the read “as of” timestamp must move forward.
Moving complexity into the storage engine, particularly around caching.
The distributed engine can roll back locally “durable” events.
Write-concern majority: one node might have seen a committed event, but if the primary never saw it, it is rolled back.
Generally, a well-behaved replica set is expected not to fall behind.
Local crash safety holds because checkpoints happen at the stable timestamp.
Benefits in 3.6:
complete a read query using the same specified snapshot for its entirety, on a replica set.
every write is a WT “snapshot”
when secondary receives a majority read request, finds a snapshot that’s majority confirmed to use
required all requests over a single socket
oplog is the source of truth
Snapshots pin memory: two nodes running on a 24-cpu box, 32GB RAM, pushing 16 threads with vectored writes of 100 tiny documents at a time.
The oplog was created for capped collections.
It took on a replication role since it looks a lot like a shared log.