#MDBE17
O2 Intercontinental
SCALING AND TRANSACTION FUTURES
#MDBE17
KEITH BOSTIC
Senior Staff Engineer, MongoDB Inc.
keith.bostic@mongodb.com
#MDBE17
THE WIREDTIGER STORAGE ENGINE
Storage engine:
The “storage engine” is the MongoDB database server code responsible for
single-node durability and the primary enforcer of data consistency guarantees.
• MongoDB has a pluggable storage engine architecture
‒ Three well-known engines: MMAPv1, WiredTiger and RocksDB
• WiredTiger is the default storage engine
#MDBE17
SCALING MONGODB
A NEW TRANSACTIONAL MODEL
#MDBE17
SCALING MONGODB
... TO A MILLION COLLECTIONS
#MDBE17
PROBLEM 1: LOTS OF COLLECTIONS
• MongoDB applications create collections to hold their documents
• Each collection has some set of indexes
‒ Documents indexed in multiple ways
‒ Hundreds of collections
• But some applications create A LOT of collections:
‒ Avoiding concurrency bottlenecks in the MMAPv1 storage engine
‒ Multi-tenant applications
‒ Creative schema design: time-series data
“640K [of memory]
ought to be enough for anybody.”
Said nobody ever
#MDBE17
PROBLEM 2: INVALID ASSUMPTIONS
• WiredTiger designed for applications with known workloads
‒ WiredTiger design based on this assumption
‒ But MongoDB is used for all kinds of things!
• Application writers make assumptions, too!
‒ MMAPv1 built on top of mmap: different performance characteristics
‒ Most MMAPv1 users migrated without problems
• Engineering is a process of continual improvement
#MDBE17
WHAT DID WE DO ABOUT IT?
• Got better at measuring applications
‒ Full-time data capture (FTDC)
‒ Identifying bottlenecks
• WiredTiger with lots of collections:
‒ Handle caches didn’t scale
‒ Page cache eviction inefficient with lots of trees
‒ Especially when access patterns are skewed
#MDBE17
[Architecture diagram: inside mongod, the WiredTiger storage engine layer sits on the WiredTiger core. Collection and index files back collection and index tables; each client connection keeps a session cache and cursor cache. With C client connections and T collection tables plus T index tables, the cached cursors grow as roughly T * C.]
#MDBE17
FIRST, FIND A TUNABLE WORKLOAD
• Runs standalone on a modern server
• 64 client threads doing 10K updates / second (total)
• Keep the data working set constant
• But with a small cache size so eviction is exercised
• Vary the number of collections
‒ And nothing else!
‒ Workload spread across an increasing number of collections
• Stop when average latency > 50ms per operation.
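To make the setup concrete, here is a rough sketch of such a workload driver using pymongo. It is illustrative only: the database, collection, and field names are invented, a fixed per-thread work count stands in for precise rate limiting, and a real run would also pre-load data and pin the WiredTiger cache size.

```python
# Sketch of the tunable workload: spread a constant amount of update work
# across an increasing number of collections and report average latency.
import random
import time
from concurrent.futures import ThreadPoolExecutor
from pymongo import MongoClient

THREADS = 64
UPDATES_PER_THREAD = 1_000      # keep total work constant per run
DOCS_PER_COLLECTION = 100       # keep the working set roughly constant

def run(num_collections):
    client = MongoClient("mongodb://localhost:27017")
    db = client["latency_test"]

    def worker(_):
        latencies = []
        for _ in range(UPDATES_PER_THREAD):
            coll = db[f"coll_{random.randrange(num_collections)}"]
            start = time.perf_counter()
            coll.update_one({"_id": random.randrange(DOCS_PER_COLLECTION)},
                            {"$inc": {"counter": 1}}, upsert=True)
            latencies.append(time.perf_counter() - start)
        return latencies

    with ThreadPoolExecutor(max_workers=THREADS) as pool:
        samples = [l for ls in pool.map(worker, range(THREADS)) for l in ls]
    return 1000 * sum(samples) / len(samples)   # average latency in ms

for n in (1_000, 10_000, 100_000):
    avg = run(n)
    print(f"{n} collections: {avg:.1f} ms average")
    if avg > 50:                # the deck's stopping criterion
        break
```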
#MDBE17
RESULTS – BASELINES
[Chart: average latency (ms, 0-50) vs. number of collections (1,000 to 100,000), comparing MongoDB 3.2.0 and 3.4.0.]
#MDBE17
SOLUTION: IMPROVED HANDLE CACHES
• Hash table lookups instead of scanning a list
‒ Assumption: short lists, handle lookup uncommon
‒ Reality: every time a cursor is opened
• Singly-linked lists mean slow deletes
‒ Assumption: deletes uncommon, singly-linked lists smaller, simpler
‒ Reality: deletes common, removing from a singly-linked list requires a scan
#MDBE17
SOLUTION: IMPROVED HANDLE CACHES
• Global lock means terrible concurrency
‒ Assumption: short lists, use an exclusive lock
‒ Reality: many operations read-only, shared read-write locks better
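A toy sketch of the shape of these fixes, assuming nothing about WiredTiger's actual internals: handles live in a hash map rather than a singly-linked list, and lookups take a shared lock so concurrent cursor opens don't serialize. (Python has no readers-writer lock in the standard library, so a minimal one is included.)

```python
import threading

class RWLock:
    """Minimal readers-writer lock: many concurrent readers or one writer."""
    def __init__(self):
        self._mutex = threading.Lock()   # protects the reader count
        self._writer = threading.Lock()  # held while writing, or while any reader exists
        self._readers = 0

    def acquire_read(self):
        with self._mutex:
            self._readers += 1
            if self._readers == 1:
                self._writer.acquire()

    def release_read(self):
        with self._mutex:
            self._readers -= 1
            if self._readers == 0:
                self._writer.release()

    def acquire_write(self):
        self._writer.acquire()

    def release_write(self):
        self._writer.release()

class HandleCache:
    """uri -> handle map; reads take the shared lock, mutations the exclusive one."""
    def __init__(self):
        self._handles = {}          # hash lookup instead of scanning a linked list
        self._lock = RWLock()

    def get(self, uri):
        self._lock.acquire_read()   # cursor opens are read-only and can run concurrently
        try:
            return self._handles.get(uri)
        finally:
            self._lock.release_read()

    def put(self, uri, handle):
        self._lock.acquire_write()  # exclusive access only when the cache changes
        try:
            self._handles[uri] = handle
        finally:
            self._lock.release_write()

    def remove(self, uri):
        self._lock.acquire_write()  # O(1) delete, no list scan required
        try:
            self._handles.pop(uri, None)
        finally:
            self._lock.release_write()
```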
#MDBE17
SOLUTION: SMARTER EVICTION
• WiredTiger evicts some pages from every tree
‒ Assumption: uniformity of data across collections
• Finding the data is a significant problem
‒ Retrieval data structures are all you have
• Skewed data access
‒ Lots of trees are idle in common applications
‒ Multi-tenant or time-series data are prime examples
‒ Often 1-5 trees dominate a cache of 10K trees
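The idea can be sketched as follows. The Tree objects and their attributes (bytes_in_cache, coldest_pages) are hypothetical stand-ins for whatever bookkeeping the cache already maintains; this is not WiredTiger's eviction server.

```python
def choose_eviction_candidates(trees, target_pages):
    # Idle trees cost nothing: skip any tree with no data in cache.
    active = [t for t in trees if t.bytes_in_cache > 0]
    total = sum(t.bytes_in_cache for t in active) or 1

    candidates = []
    # Spend effort proportional to how much of the cache each tree occupies,
    # so the 1-5 hot trees in a 10K-tree cache get most of the attention.
    for tree in sorted(active, key=lambda t: t.bytes_in_cache, reverse=True):
        if len(candidates) >= target_pages:
            break
        want = max(1, target_pages * tree.bytes_in_cache // total)
        candidates.extend(tree.coldest_pages(want))
    return candidates[:target_pages]
```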
#MDBE17
RESULTS – HANDLE CACHE AND EVICTION
[Chart: average latency (ms, 0-50) vs. number of collections (1,000 to 100,000), comparing 3.2.0, 3.4.0, and the eviction tweak.]
#MDBE17
SOLUTION: IMPROVE CHECKPOINTS
• Assumptions:
‒ Checkpoints are rare events, high-end applications configure journaling
‒ Exclusive lock while finding handles that need to be checkpointed
‒ Drops are rare events, and scheduled by the application
• Reality:
‒ Checkpoints continuous, every 60 seconds for historic reasons
‒ With 100K trees, exclusive lock held for far too long
‒ Drops happen randomly
#MDBE17
SOLUTION: IMPROVE CHECKPOINTS
• Skewed access patterns
‒ Reviewing handles with no data is wasted effort
‒ Won’t hold locks as long if we skip clean handles
• Split checkpoints into two phases
‒ Phase 1: most I/O happens, multithreaded
‒ Phase 2: trees made consistent, metadata updated/flushed, single-threaded
• “Delayed drops” feature to allow drops during checkpoints
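A schematic of that two-phase split, with hypothetical tree and metadata objects standing in for the real checkpoint code:

```python
from concurrent.futures import ThreadPoolExecutor

def checkpoint(trees, metadata, writers=8):
    dirty = [t for t in trees if t.is_dirty()]       # skip clean handles entirely

    # Phase 1: the bulk of the I/O, multithreaded, no exclusive lock held.
    with ThreadPoolExecutor(max_workers=writers) as pool:
        list(pool.map(lambda t: t.write_dirty_pages(), dirty))

    # Phase 2: short and single-threaded; make each tree consistent,
    # then update and flush the metadata.
    for tree in dirty:
        metadata.record_checkpoint(tree.name, tree.make_consistent())
    metadata.flush()
```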
#MDBE17
SOLUTION: AGGRESSIVE SWEEPING
• Assumption: lazily sweep the handle list
• Reality: 1M handles takes too long to walk
‒ Aggressively discard cached handles we don’t need
#MDBE17
RESULTS – IMPROVE CHECKPOINTS
[Chart: average latency (ms, 0-50) vs. number of collections (1,000 to 1,000,000), comparing 3.2.0, 3.4.0, eviction tweak, and eviction + sweep.]
#MDBE17
SOLUTION: GROUP COLLECTIONS
• Assumption: map each MongoDB collection/index to a table
• Reality:
‒ Makes all handle caches big
‒ Relies on fast caches and a fast filesystem
‒ 1M files in a directory problematic for some filesystems
• Add a "--groupCollections" option to MongoDB
‒ 2 tables per database (collections, indexes)
‒ Adds a prefix to keys
‒ Transparent to applications, although requires configuration
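The key-prefixing idea behind --groupCollections can be sketched like this (the 8-byte collection-id prefix and helper names are illustrative, not the actual on-disk format):

```python
import struct

def group_key(collection_id: int, doc_key: bytes) -> bytes:
    # A big-endian prefix keeps each collection's keys contiguous in key order
    # within the single shared table.
    return struct.pack(">Q", collection_id) + doc_key

def split_key(table_key: bytes):
    collection_id, = struct.unpack(">Q", table_key[:8])
    return collection_id, table_key[8:]

# Two collections can interleave their inserts but stay sorted per collection.
assert group_key(1, b"apple") < group_key(1, b"zebra") < group_key(2, b"apple")
```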
#MDBE17
[Architecture diagram, revisited for grouped collections: the same layering inside mongod, but collection and index files map onto just 2 tables per database, so with C client connections the cached cursors grow as roughly 2 * C rather than T * C.]
#MDBE17
RESULTS – GROUP COLLECTIONS
[Chart: average latency (ms, 0-50) vs. number of collections (1,000 to 1,000,000), comparing 3.2.0, 3.4.0, eviction tweak, eviction + sweep, and grouped collections.]
#MDBE17
RESULTS – SUMMARY
Maximum collections before average latency exceeds 50ms:
• MongoDB 3.2.0: 50,000
• MongoDB 3.4.0: 10,000
• MongoDB 3.4 tuned: 800,000
• Grouped collections: 1,000,000+
#MDBE17
MILLION COLLECTIONS PROGRESS
• 2014: MongoDB 3.0, WiredTiger integration
• 2015: MongoDB 3.2, handle cache and checkpoints
• 2016: MongoDB 3.4, concurrency and smarter eviction
• 2017+: grouped collections
#MDBE17
A MILLION COLLECTIONS: SUMMARY
• Got better at measuring performance
• Examined and changed our assumptions
• Tuned data structures and algorithms
• New data representation: grouped collections
“It’s not what you don’t know that gets
you into trouble -- it’s what you know
that just isn’t true.”
Said nobody ever
#MDBE17
A MILLION COLLECTIONS DELIVERABLES
• All tuning work included in the MongoDB 3.6 release.
• Grouped collections feature pushed out of the 3.6 release
‒ Improvements sufficient without requiring application API change?
‒ Increased focus on new transactional features
• More tuning is happening for the next MongoDB release
‒ Integrating the MongoDB and WiredTiger caching
#MDBE17
(We’re not done, the second part starts in 5 minutes!)
QUESTIONS?
#MDBE17
THE MONGODB JOURNEY TO
A NEW TRANSACTIONAL MODEL
#MDBE17
TO ACCOMMODATE NEW APPLICATIONS
• MongoDB designed for a NoSQL, schema-less world
‒ Transactional semantics less of an application requirement
• MongoDB application domain growing
‒ Supporting more traditional applications
‒ Often, applications surrounding the existing MongoDB space
• Also, simplifying existing applications
#MDBE17
TRANSACTIONS: ACID
• Atomicity
‒ All or nothing.
• Consistency
‒ Database constraints aren’t violated (what counts as a “constraint” is application-defined)
• Isolation
‒ Transaction integrity and visibility
• Durability
‒ Permanence in the face of bad stuff happening
#MDBE17
CAP THEOREM
[Diagram: the CAP theorem triangle of Consistency, Availability, and Partition tolerance.]
#MDBE17
MONGODB’S PRESENT
• ACID, of course
• Single-document transactions
‒ Atomically update multiple fields of a document (and indices)
‒ Transaction cannot span multiple documents or collections
‒ Applications implement some version of two-phase commit
• Single server consistency
‒ Eventual consistency on the secondaries
#MDBE17
MONGODB’S FUTURE:
MULTI-DOCUMENT TRANSACTIONS
• Application developers want them:
‒ Some workloads require them
‒ Developers struggle with error handling
‒ Increase application performance, decrease application complexity
• MongoDB developers want them:
‒ Chunk migration to balance content on shards
‒ Changing shard keys
#MDBE17
NECESSARY RISK:
INCREASING SHARD ENTANGLEMENT
• Increasing inter-shard entanglement
‒ The wrong answer is easy, the right answer takes more communication
• Chunk balance should not affect correctness
• Shards can’t simply abort transactions to get unstuck
• Additional migration complexity
• Shard entanglement impacts availability
#MDBE17
OTHER RISKS AND KNOCK-ON EFFECTS
• Developers use transactions rather than appropriate schemas
‒ Long-running transactions are seductive
• Inevitably, the rate of concurrency collisions increases
• Significant technical complexity
‒ Multi-year project
‒ Every part of the server team: replication, sharding, query, storage
‒ Significantly increases pressure on the storage engines
#MDBE17
FEATURES ALONG THE WAY
• Automatically avoid dirty secondary reads (3.6!)
• Retryable writes (3.6!)
‒ Applications don’t have to manage write collisions
• Global point-in-time reads
‒ Single system-wide clock ordering operations
• Multi-document transactions
#MDBE17
WIREDTIGER TRANSACTIONS
#MDBE17
WIREDTIGER: SINGLE-NODE TRANSACTION
• Per-thread “session” structure embodies a transaction
• Session structure references data-sources: cursors
• Transactions are implicit or explicit
‒ session.begin_transaction()
‒ session.commit_transaction()
‒ session.rollback_transaction()
• Transactions can already span objects and data-sources!
#MDBE17
WIREDTIGER SINGLE-NODE TRANSACTION
cursor = session.open_cursor()
session.begin_transaction()
cursor.set_key("fruit"); cursor.set_value("apple"); cursor.insert()
cursor.set_key("fruit"); cursor.set_value("orange"); cursor.update()
session.commit_transaction()
cursor.close()
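For context, here is the same sequence fleshed out with WiredTiger's Python bindings (the home directory, table name, and key/value formats are illustrative; the bindings are built from the WiredTiger source tree):

```python
import os
from wiredtiger import wiredtiger_open

os.makedirs("WT_HOME", exist_ok=True)        # wiredtiger_open needs an existing directory
conn = wiredtiger_open("WT_HOME", "create")  # open (or create) the database
session = conn.open_session()
session.create("table:fruit", "key_format=S,value_format=S")

cursor = session.open_cursor("table:fruit")
session.begin_transaction()
cursor.set_key("fruit"); cursor.set_value("apple"); cursor.insert()
cursor.set_key("fruit"); cursor.set_value("orange"); cursor.update()
session.commit_transaction()                 # or session.rollback_transaction()
cursor.close()
conn.close()
```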
#MDBE17
TRANSACTION INFORMATION
• 8-byte transaction ID
• Isolation level and snapshot information
‒ Read-uncommitted: everything
‒ Read-committed: committed updates after start
‒ Snapshot: committed updates before start
• Linked list of change records, called “updates”
‒ For logging on commit
‒ For discard on rollback
#MDBE17
UPDATE INFORMATION
• Updates include
‒ Transaction ID which embodies “state” (committed or not)
‒ Data package
[Diagram: a key points to an update record holding a transaction ID plus the data.]
MULTI-VERSION CONCURRENCY CONTROL
• Key references
‒ Chain of updates in most recently modified order
‒ Original value, the update visible to everybody
[Diagram: a key points to a chain of update records (each a transaction ID plus data), ending in the globally visible data.]
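A simplified model of this visibility check and chain walk (invented field names, ignoring WiredTiger's actual snapshot representation): a reader returns the newest update its snapshot is allowed to see, falling back to the oldest, globally visible value at the end of the chain.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Update:
    txn_id: int
    data: bytes
    next: Optional["Update"] = None     # older update, or None at the end of the chain

@dataclass
class Snapshot:
    max_txn_id: int          # transactions at or above this started after the snapshot
    concurrent: frozenset    # transactions still running when the snapshot was taken

    def can_see(self, txn_id: int) -> bool:
        return txn_id < self.max_txn_id and txn_id not in self.concurrent

def read(chain: Update, snapshot: Snapshot) -> bytes:
    upd = chain                          # chain is ordered newest-first
    while upd is not None:
        if snapshot.can_see(upd.txn_id): # newest visible update wins
            return upd.data
        upd = upd.next
    raise KeyError("no version of this key is visible to the snapshot")
```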
#MDBE17
WIREDTIGER NAMED SNAPSHOTS FEATURE
• Snapshot: a point-in-time
• Snapshots can be named
‒ Transactions can be started “as of” that snapshot
‒ Readers use this to access data as of a point in time.
• But... snapshots keep data pinned in cache
‒ Newer data cannot be discarded
#MDBE17
MONGODB ON TOP OF WIREDTIGER MODEL
• MongoDB maps document changes into this model
‒ For example, a single document change involves indexes
‒ Glue layer below the pluggable storage engine API
• Read concern majority
‒ In other words, majority-committed data won’t disappear (be rolled back)
‒ Requires --enableMajorityReadConcern configuration
‒ Built on WiredTiger’s named snapshots
#MDBE17
INTRODUCING SYSTEM TIMESTAMPS
• Applications have their own notion of transactions and time
‒ Defines an expected commit order
‒ Defines durability for a set of systems
• WiredTiger takes a fixed-length byte-string transaction ID
‒ Simply increasing (but not necessarily monotonic)
‒ A “most significant bit first” hexadecimal string
‒ 8 bytes today, but expected to grow to encompass system-wide ordering
‒ Mix-and-match with native WiredTiger transactions
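A small illustration of why that encoding works: a fixed-width, most-significant-digit-first hexadecimal string compares correctly with a plain byte comparison (memcmp-style), so ordering timestamps never requires parsing them. The width and helper name here are arbitrary.

```python
def encode_ts(ts: int, width: int = 16) -> bytes:
    # 16 hex digits covers an 8-byte timestamp; zero-padding keeps the width fixed.
    return format(ts, "x").zfill(width).encode("ascii")

# Byte order matches numeric order, so sorting the encodings sorts the timestamps.
assert encode_ts(9) < encode_ts(10) < encode_ts(0x1_0000_0000)
```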
#MDBE17
MONGODB USES AN “AS OF” TIMESTAMP
• Updates now include a timestamp transaction ID
‒ Timestamp tracked in WiredTiger’s update
‒ Smaller is better, since the timestamp is significant overhead on small updates
• Commit “as of” a timestamp
‒ Set during the update or later, at transaction commit
• Read “as of” a timestamp
‒ Set at transaction begin
‒ Point-in-time reads see the update with the largest timestamp less than or equal to the read timestamp
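That read-as-of rule can be illustrated in a few lines (the version-list layout is invented; it is just the "largest commit timestamp less than or equal to the read timestamp" selection):

```python
import bisect

def read_as_of(versions, read_ts):
    """versions: list of (commit_ts, value) pairs sorted by commit_ts ascending."""
    timestamps = [ts for ts, _ in versions]
    i = bisect.bisect_right(timestamps, read_ts)   # first entry newer than read_ts
    if i == 0:
        raise KeyError("key did not exist at this timestamp")
    return versions[i - 1][1]

history = [(10, "apple"), (20, "orange"), (35, "pear")]
assert read_as_of(history, 25) == "orange"   # largest commit ts <= 25 is 20
assert read_as_of(history, 35) == "pear"
```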
#MDBE17
MONGODB SETS THE “OLDEST” TIMESTAMP
• Limits how old future read timestamps can be
• The point at which WiredTiger can discard history
• Cannot go backward, must be updated frequently
#MDBE17
MONGODB SETS THE “STABLE” TIMESTAMP
• Limits future durability rollbacks
‒ Imagine an election where the primary hasn’t seen a committed update
• WiredTiger writes checkpoints at the stable timestamp
‒ The storage engine can’t write what might be rolled back
• Cannot go backward, must be updated frequently
#MDBE17
READ CONCERN MAJORITY FEATURE
• In 3.4 implemented with WiredTiger named snapshots
‒ Every write a named snapshot
‒ Heavy-weight, interacts directly with WiredTiger transaction subsystem
• In 3.6 implemented with read “as of”
‒ Light-weight and fast
‒ Configuration is now a no-op, “always on”
#MDBE17
OPLOG IMPROVEMENTS
• MongoDB does replication by copying its “journal”
‒ Oplog is bulk-loaded on secondaries
‒ Oplog is loaded out-of-order for performance
• Scanning cursor has strict visibility order requirements
‒ No skipping records
‒ No updates visible after the oldest uncommitted update
#MDBE17
OPLOG IMPROVEMENTS
• In 3.4, implemented using WiredTiger named snapshots
• JIRA ticket:
“Under heavy insert load on a 2-node replica set, WiredTiger eviction
appears to hang on the secondary.”
• In 3.6, implemented using timestamps
#MDBE17
A NEW TRANSACTIONAL MODEL SUMMARY
• Significant storage engine changes
• Enhancing transactional consistency for new applications
• Features and improvements in MongoDB 3.6
‒ Retryable writes
‒ Safe secondary reads
‒ Significantly improved performance
#MDBE17
keith.bostic@mongodb.com
QUESTIONS?
Editor's Notes
  1. Member of the storage group: storage is part of the server development group. Server is the core MongoDB database product.
  2. Storage underlies the technology and features you’ll hear about today. It defines durability and consistency (isolation and visibility); as a consequence, storage owns concurrency. Pluggable architecture: per-workload storage engines. Default engine: acceptable behavior for all workloads.
  3. Engineering process discussion WT began as a separate product, was integrated in 2014 as part of MongoDB 3.0 MMAPv1: lots of collections, fast in place updates
  4. Parallel effort at MongoDB to measure performance: FTDC data is heavily compressed where measurements don’t change. Lots of small collections make it hard to spot pages to discard, especially when few are hot. We assumed uniform access across large objects, and found skewed access across tiny objects.
  5. Single-node overview. The layered diagram shows caching at each layer: the MongoDB session/cursor cache (the next area of work), the WT cursor cache, the WT data handle cache, and the WT file handle cache. 10,000 connections * 10,000 tables, plus indexes, is a multiplier. 1M files is problematic for some filesystems.
  6. Design a workload for tuning: there are too many moving parts in MongoDB. Risks losing the problem. In an ideal world, increasing the number of collections would make no difference.
  7. We didn’t know we were making the problem worse. 3.4 degrades much more quickly than 3.2: logarithmic scale!
  8. Data structures assumed we’d never have lots of collections or frequently change them
  9. Modern eviction algorithms don’t have any kind of real queue; it’s too slow. Pages reside elsewhere, and there’s information that lets you know their “age”. The design assumed uniformity of the data across collections, but multi-tenant workloads are skewed. Once idle trees empty out, even looking at them is a waste of time.
  10. Still nowhere near 1M, but at least back to where we were in 3.2
  11. Obvious data structures and tuning changes. Checkpoints hold exclusive locks and slow everything down.
  12. Note the x axis scale change, we can now see the 1M target
  13. Someday a middle ground: we’ll need to create subdirectories for data. Security becomes more interesting when data is co-resident.
  14. Architecture with Grouped Collections Revisits main architecture diagram with changes for grouped collections Assuming a single database here, and now the cursor cache size is only based on the number of connections, plus we have ways to limit how big it gets in practice.
  15. Grouped collections get us to 1M with < 10ms average latency. Without changing the API we get to about 250K with < 10ms average latency, and 800K with < 30ms.
  16. Tipping point (greater than 50ms). This graph shows how many collections we can support: more is better.
  17. It’s what you “know” that just isn’t true... It’s all about changing our assumptions to handle more workloads.
  18. Because the tuning efforts were successful (800,000 collections), reaching 1M less important. Additionally, there are significant tuning, space and application-API issues with respect to grouped collections: for example, compaction, collection drop, security and so on. Solving without a new feature API is better. If you change your mind later, and want to split the two files up?
  19. Define the terms and get everybody on the same page Everybody offers a version of ACID, including MongoDB Differences generally around relaxing consistency guarantees
  20. MongoDB’s traditional applications have CAP tradeoffs MongoDB’s original design chose partition tolerance and availability over consistency (JD ???) Extending to support more consistency rules.
  21. MongoDB supports ACID, but it only applies to individual write operations. “write operations” is a high-level concept, indexes are kept consistent. In 3.4: linearizable reads: write the primary, force read from a secondary to block until it sees the write.
  22. Application developers want to shift complexity into the database. Application developer skill set not suited to building database applications.
  23. Golden Rule: may not impact the performance of applications not using transactions.
  24. Safe secondary reads: automatically avoiding dirty reads (?). Global point-in-time reads: applications read as of a single point in the causal chain. Retryable writes: retry automatically so applications don’t have to manage write collisions. Multi-document transactions: modify multiple documents/collections atomically.
  25. Storage engine semantics: a relatively standard single-node model. Two types of durability: checkpoint and journalling (standard write-ahead logging). Log records are redo-only; the entire change record must fit into memory.
  26. cursors iterate, remove, standard CRUD operations key-value store: MongoDB maps to documents & indexes
  27. Updates and inserts Transaction ID is an identifier into table of information.
  28. Inserts are single entries, with lists of updates When a cursor encounters a key, compare cursor and key/update transaction IDs
  29. --enableMajorityReadConcern: visible data must have been written to a majority of the replica set
  30. Allow the distributed layer to define/order transactions. 8B is fast, and lockless on 64-bit machines; it will grow to incorporate cluster-wide clock information. MSB-first so memcmp works. MongoDB and WiredTiger transactions co-exist, allowing applications to mix-and-match where threads don’t care about timestamps. You can get into trouble: operations on an item must be in timestamp order.
  31. commit timestamps must be ahead of any read timestamp setting a read timestamp forces snapshot isolation at that timestamp
  32. Oldest possible reader, including replicated state To avoid caching infinite updates, must move forward the read “as of” timestamp Moving complexity into the storage engine, particularly around caching.
  33. The distributed engine can roll back locally “durable” events. Write-concern majority: one node might have seen a committed event, but if the master never saw it, it’s rolled back. Generally, a well-behaved replica set is expected not to fall behind. Local crash safety holds because the checkpoint happens at the stable event.
  34. Benefits in 3.6: complete a read query using the same specified snapshot for its entirety, on a replica set. Every write is a WT “snapshot”; when a secondary receives a majority read request, it finds a majority-confirmed snapshot to use. Required all requests over a single socket.
  35. oplog is the source of truth
  36. Snapshots pin memory: two nodes running on a 24-CPU box with 32GB RAM, pushing 16 threads with vectored writes of 100 tiny documents at a time. The oplog was created for capped collections and took on a replication role since it looks a lot like a shared log.