MongoDB introduced transactions (ACID properties) in MongoDB 4.0. This talk focuses on the internals of how MongoDB implemented the ACID properties with the WiredTiger storage engine. WiredTiger offers more future possibilities for MongoDB. This tech talk was presented at the Mydbops Database Meetup on 27-04-2019 by Manosh Malai, Senior Devops/NoSQL Consultant with Mydbops, and Ranjith, Database Administrator with Mydbops.
MongoDB WiredTiger Internals: Journey To Transactions
1. MongoDB WiredTiger Internals: Journey To Transactions
Presented by
Manosh Malai
Senior Devops and DB/Cloud Consultant @ Mydbops
Ranjith A
DBA @ Mydbops
www.mydbops.com info@mydbops.com
2. About Me
Manosh Malai
Senior Devops and DB/Cloud Consultant @ Mydbops
mmalai@mydbops.com
@ManoshMalai
Ranjith A
DBA @ Mydbops
ranjith@mydbops.com
3. Mydbops at a Glance
● Founded in 2015, headquartered in Bangalore, India, with 25 employees.
● Mydbops focuses on database consulting, with core specialization in MySQL and MongoDB administration and support.
● Mydbops was created with the motto of developing a Devops model for database administration.
● We help organisations scale MySQL/MongoDB and implement advanced technologies in MySQL/MongoDB.
9. MongoDB Prominent Features
FULL TEXT SEARCH
AGGREGATION FRAMEWORK and MAP REDUCE
TRANSACTION SUPPORT
BSON STORAGE FORMAT
BUILT-IN REPLICATION/SHARDING SUPPORT
GRIDFS
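11. MongoDB Architecture
[Diagram: application workloads (IoT Sensor Data, Content Repo, Ad Service, Real-Time Analytics, Mobile) on top of the MongoDB Query Language and MongoDB Data Model, backed by the storage engines WiredTiger, MMAPv1, In-Memory and Encrypted, with Security and Management alongside.]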
12. Horizontally Scalable
Sharding
● Sharding types: Range, Tag-Aware, Hashed
● Increase or decrease capacity as you go
● Automatic balancing
[Diagram: Shard 1, Shard 2, Shard 3, Shard 4, ... Shard N]
13. Vertically Scalable (Replica Set)
[Diagram: two replica sets, each with one Primary and two Secondaries]
● Factors:
○ RAM
○ Disk
○ CPU
○ Network
● Redundancy
● High Availability
15. What’s Special About WiredTiger?
● Transactions use optimistic concurrency control algorithms
● Document Level Locking
● Snapshot and Checkpoint
● Write-ahead transaction log for the journal
● Compression
● Online compaction
● LSM and B-tree Indexing
17. WiredTiger Architecture
[Diagram: WiredTiger architecture. Components shown: Python API, C API, Java API; Schema & Cursor; Row Storage; Column Storage; Cache; Block Management; Transactions; Snapshot; Page read/write; WAL; Database Files; Journal.]
18. How MongoDB Is Glued to WiredTiger
● MongoDB 3.0 (March 2015) introduced an internal storage API, allowing new storage engines to be added to MongoDB.
● From MongoDB 3.0, the WiredTiger storage engine is available as an option.
● From MongoDB 3.2 (Dec 2015), WiredTiger is the default storage engine.
● WiredTiger supports a C API, a Java API and a Python API.
● The MongoDB storage engine layer uses the C API to communicate with WiredTiger.
Further details:
https://github.com/mongodb/mongo/blob/master/src/mongo/db/storage/README.md
http://source.wiredtiger.com/3.1.0/struct_w_t___c_u_r_s_o_r.html#details
22. Cursor == WT_CURSOR
● WT_CONNECTION, WT_SESSION and WT_CURSOR are the classes used to access and manage data.
● WT_CURSOR handles all CRUD (create, read, update, delete) operations internally.
● WT_CURSOR has member functions for the CRUD operations, for example:
○ WT_CURSOR::reset
○ WT_CURSOR::search
○ WT_CURSOR::search_near
○ WT_CURSOR::remove
● Each WT_SESSION is opened on a specific WT_CONNECTION; each application thread typically opens its own session on the shared connection.
● WT_SESSION and WT_CURSOR are not thread safe.
● WT_CONNECTION methods are thread safe.
23. Thread Safe
/*
 * home, NUM_THREADS, scan_thread, example_setup and error_check are
 * defined elsewhere in the surrounding WiredTiger example program.
 */
int main(int argc, char *argv[])
{
    WT_CONNECTION *conn;
    WT_SESSION *session;
    WT_CURSOR *cursor;
    wt_thread_t threads[NUM_THREADS];
    int i;

    home = example_setup(argc, argv);

    /* Open one connection; it is shared by every thread. */
    error_check(wiredtiger_open(home, NULL, "create", &conn));

    /* Use a short-lived session and cursor to create and seed the table. */
    error_check(conn->open_session(conn, NULL, NULL, &session));
    error_check(session->create(session, "table:access",
        "key_format=S,value_format=S"));
    error_check(session->open_cursor(
        session, "table:access", NULL, "overwrite", &cursor));
    cursor->set_key(cursor, "key1");
    cursor->set_value(cursor, "value1");
    error_check(cursor->insert(cursor));
    error_check(session->close(session, NULL));

    /* Each thread receives the shared connection. */
    for (i = 0; i < NUM_THREADS; i++)
        error_check(
            __wt_thread_create(NULL, &threads[i], scan_thread, conn));

    for (i = 0; i < NUM_THREADS; i++)
        error_check(__wt_thread_join(NULL, threads[i]));

    error_check(conn->close(conn, NULL));
    return (EXIT_SUCCESS);
}
25. WiredTiger Transaction
[Diagram: the WiredTiger architecture diagram from slide 17, repeated.]
26. Three Pillars That Help Implement Transactions
Journal
MVCC
Snapshot
27. Transaction
● Transactions are per session:
○ open
○ close
○ commit
○ rollback
● Up to MongoDB 3.6, applications could only emulate transactions by implementing a two-phase commit.
● WiredTiger supports three isolation levels:
○ read-uncommitted
○ read-committed
○ snapshot
● Committed updates become durable only once they are part of a checkpoint; the journal covers changes made between checkpoints.
28. Cont..
● MongoDB 3.6 added the building blocks for transactions:
○ Logical sessions
○ Global clock
○ Retryable writes
● read-committed is the default isolation level in WiredTiger 3.1.0, but MongoDB uses snapshot by default.
● Commands introduced in MongoDB 4.0:
○ To start a multi-statement transaction: Session.startTransaction()
○ To commit a transaction: Session.commitTransaction()
○ To roll back a transaction: Session.abortTransaction()
29. Code
ret = session->open_cursor(session, "table:mytable", NULL, NULL, &cursor);
ret = session->begin_transaction(session, NULL);

cursor->set_key(cursor, "key");
cursor->set_value(cursor, "value");
switch (ret = cursor->update(cursor)) {
case 0: /* Update success */
    ret = session->commit_transaction(session, NULL);
    /*
     * If commit_transaction succeeds, cursors remain positioned; if
     * commit_transaction fails, the transaction was rolled back and
     * all cursors are reset.
     */
    break;
case WT_ROLLBACK: /* Update conflict */
default: /* Other error */
    ret = session->rollback_transaction(session, NULL);
    /* The rollback_transaction call resets all cursors. */
    break;
}
30. MVCC Principles in WT Transaction
● MVCC in WiredTiger is based on a linked list.
● Each unit in the linked list stores:
○ The ID of the transaction that made the modification
○ A timestamp (from 3.6)
○ The modified value
● Every time the value is modified, the new unit is added at the head of the list.
● Every session sees its own consistent snapshot of the database.
● Changes made by one session are not seen by any other session until the transaction is committed.
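The linked-list idea above can be sketched in plain C. This is a minimal illustration, not WiredTiger's actual code: the names mvcc_update, mvcc_prepend, mvcc_read and mvcc_free are hypothetical, and visibility is simplified to "the writer's ID is in a caller-supplied list of committed transactions".

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical update unit: one version of a value in the MVCC list. */
typedef struct mvcc_update {
    uint64_t txn_id;          /* ID of the transaction that wrote this version */
    uint64_t timestamp;       /* timestamp of the change (0 when unused) */
    int value;                /* the modified value */
    struct mvcc_update *next; /* next older version */
} mvcc_update;

/*
 * Prepend a new version at the list head. Returns the new head, or NULL
 * (leaving the old list untouched) if allocation fails.
 */
mvcc_update *
mvcc_prepend(mvcc_update *head, uint64_t txn_id, uint64_t timestamp, int value)
{
    mvcc_update *upd = malloc(sizeof(*upd));
    if (upd == NULL)
        return NULL;
    upd->txn_id = txn_id;
    upd->timestamp = timestamp;
    upd->value = value;
    upd->next = head;
    return upd;
}

/*
 * Walk from the newest version to the oldest and return the first one
 * written by a transaction listed in committed[] (simplified visibility).
 */
const mvcc_update *
mvcc_read(const mvcc_update *head, const uint64_t *committed, size_t ncommitted)
{
    assert(committed != NULL || ncommitted == 0);
    for (; head != NULL; head = head->next)
        for (size_t i = 0; i < ncommitted; i++)
            if (head->txn_id == committed[i])
                return head;
    return NULL;
}

/* Release every version in the list. */
void
mvcc_free(mvcc_update *head)
{
    while (head != NULL) {
        mvcc_update *next = head->next;
        free(head);
        head = next;
    }
}
```

With an initial value of 10 and later writes of 11, 12 and 14 by transactions 1, 2 and 3, a reader that treats only the initial writer and transaction 1 as committed walks past 14 and 12 and stops at 11, without blocking the writers.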
31. MVCC Workflow
[Diagram: the MVCC list for one value under concurrent transactions, from list tail (oldest) to list head (newest)]
Transactions: T0 read, T1 committed write, T2 rolled-back write, T3 uncommitted write, T4 read
Initial value: 10 (list tail)
Changed by T1 to: 11
Changed by T2 to: 12 (marked as obsolete)
Changed by T3 to: 14 (list head)
32. WiredTiger Transaction Snapshot
[Diagram: transaction timeline at the T4 moment]
snap_min = T1, snap_max = T4
Committed transaction interval: (0, T1)
Transaction interval being executed: [T1, T4]
Transaction interval to be executed: T5 ...
State at the T4 moment: T1 committed, T2 rolled back, T3 uncommitted
33. Transaction Execution Process
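[Diagram: a new update (OP) passes through five numbered steps involving the Wt_transaction object, its operation array and journal buffer, and the MVCC list.]
Wt_transaction fields: transaction id, snapshot_object, operation array, redo log buffer, state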
34. Wt_transaction data Structure
Wt_transaction{
Transaction_id: The globally unique ID of this transaction, used as the version number of the data the transaction modifies.
Snapshot_object: The set of transactions that were executing and not yet committed when the current transaction began or when the operation ran; used for transaction isolation.
Operation_array: The list of operations performed in this transaction; used for transaction rollback.
Redo_log_buf: The operation log buffer; used for persistence after the transaction commits.
State: The current state of the transaction.
}
35. Wt_mvcc and snapshot_object data Structure
Wt_mvcc{
Transaction_id: ID of the transaction that made this modification
Value: the modified value
}
Snapshot_object {
Snap_min: <Min Transaction number>,
Snap_max: <Max Transaction number>,
Snap_array: Any modification to a transaction that appears in snap_array,
}
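A snapshot visibility check can be sketched from these three fields. This is a simplified model under stated assumptions, not WiredTiger's implementation: the struct layout, the snap_count field and the function name snapshot_visible are hypothetical, snap_max is treated as exclusive, and snap_array is assumed to hold the IDs of transactions that were still running when the snapshot was taken. Updates from rolled-back transactions are assumed to be marked obsolete elsewhere and skipped before this check.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical snapshot object mirroring Snap_min, Snap_max and Snap_array. */
typedef struct snapshot {
    uint64_t snap_min;          /* smallest transaction ID in snap_array */
    uint64_t snap_max;          /* IDs >= snap_max started after the snapshot */
    const uint64_t *snap_array; /* IDs still running at snapshot time */
    size_t snap_count;          /* number of entries in snap_array */
} snapshot;

/* Return true if an update written by txn_id is visible to this snapshot. */
bool
snapshot_visible(const snapshot *snap, uint64_t txn_id)
{
    assert(snap != NULL);

    /* Started after the snapshot was taken: never visible. */
    if (txn_id >= snap->snap_max)
        return false;

    /* Finished before any transaction that was running: always visible. */
    if (txn_id < snap->snap_min)
        return true;

    /* In between: visible unless it was still running at snapshot time. */
    for (size_t i = 0; i < snap->snap_count; i++)
        if (snap->snap_array[i] == txn_id)
            return false;
    return true;
}
```

The two range checks let most lookups finish without scanning snap_array; only IDs between snap_min and snap_max need the linear search.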
36. Transaction Flow
1. Create a value unit object (update) for the MVCC list.
2. Check the transaction object's transaction ID and state to determine whether an ID has already been assigned to this transaction. If not, assign a transaction ID and set the transaction state to HAS_TXN_ID.
3. Set this transaction's ID on the update unit as its MVCC version number.
4. Create an operation object, point the object's value pointer to the update, and add the operation to the operation_array of the transaction object.
5. Add the update unit at the head of the MVCC linked list.
6. Write a redo log record to the redo_log_buf of this transaction object.
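The six steps above can be sketched as one function. This is an illustrative model only: the types update, log_record and txn, the fixed MAX_OPS capacity, the global next_txn_id counter and the function txn_update are all hypothetical, and the redo log is reduced to an in-memory array of records.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define MAX_OPS 16 /* hypothetical per-transaction capacity */

/* Update unit in the MVCC list. */
typedef struct update {
    uint64_t txn_id; /* MVCC version number */
    int value;
    struct update *next;
} update;

/* One redo log record. */
typedef struct log_record {
    uint64_t txn_id;
    int value;
} log_record;

/* Simplified transaction object. */
typedef struct txn {
    uint64_t id;
    bool has_txn_id;                  /* stands in for the HAS_TXN_ID state */
    update *operation_array[MAX_OPS]; /* used for rollback */
    log_record redo_log_buf[MAX_OPS]; /* persisted after commit */
    size_t op_count;
} txn;

static uint64_t next_txn_id = 1; /* hypothetical global ID allocator */

/* Apply one modification inside transaction t. Returns 0 on success, -1 on failure. */
int
txn_update(txn *t, update **list_head, int value)
{
    assert(t != NULL && list_head != NULL);
    if (t->op_count == MAX_OPS)
        return -1;

    /* Step 1: create the update unit. */
    update *upd = malloc(sizeof(*upd));
    if (upd == NULL)
        return -1;
    upd->value = value;

    /* Step 2: assign a transaction ID on first use. */
    if (!t->has_txn_id) {
        t->id = next_txn_id++;
        t->has_txn_id = true;
    }

    /* Step 3: stamp the update with the transaction ID. */
    upd->txn_id = t->id;

    /* Step 4: remember the operation so it can be rolled back. */
    t->operation_array[t->op_count] = upd;

    /* Step 5: link the update at the head of the MVCC list. */
    upd->next = *list_head;
    *list_head = upd;

    /* Step 6: append a redo record to the transaction's log buffer. */
    t->redo_log_buf[t->op_count].txn_id = t->id;
    t->redo_log_buf[t->op_count].value = value;
    t->op_count++;
    return 0;
}
```

Keeping the operation array and the redo buffer inside the transaction object means a rollback only has to walk this transaction's own operations, and a commit only has to hand its own buffer to the journal.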
37. WiredTiger Transaction Data Flush Time
[Diagram: the WiredTiger architecture diagram from slide 17, annotated with flush timings: PDFLUSH 60 s, log_flush 100 ms, Sync, LifeTime 60 s.]
38. Snapshot for END USER
● To minimize cache pressure, set the server parameter transactionLifetimeLimitSeconds to a suitable value.
● The default value is 60.
● Before a transaction updates a document, it tries to acquire a write lock. If the document is already locked, the transaction fails.
● Before a non-transactional operation updates a document, it tries to acquire a write lock. If the document is already locked, the operation backs off and retries until maxTimeMS is reached.
39. Snapshot for END USER
● Pass session information to all statements inside your transaction.
● Implement retry logic. MongoDB returns error codes that tell you if a transaction has failed and if it failed
with a retryable error or not.
● To reduce WiredTiger cache pressure, keep transactions short and don’t leave them open, even read-only transactions.
● Take into account that long running DDL operations (e.g. createIndex() ) block transactions and vice versa.
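The retry advice above can be sketched as a small loop. This is a generic illustration, not a MongoDB driver API: the txn_result codes, the txn_body_fn callback type, run_with_retry and the flaky_body example are all hypothetical stand-ins for "run the transaction, then inspect whether the error is retryable".

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical outcome of one transaction attempt. */
typedef enum {
    TXN_OK = 0,
    TXN_TRANSIENT_ERROR, /* retryable, e.g. a write conflict */
    TXN_FATAL_ERROR      /* not retryable */
} txn_result;

/* Caller-supplied function that runs the whole transaction once. */
typedef txn_result (*txn_body_fn)(void *ctx);

/*
 * Run body until it succeeds, fails with a non-retryable error, or
 * max_attempts is reached. Stores the number of attempts made in
 * *attempts_used when that pointer is not NULL.
 */
txn_result
run_with_retry(txn_body_fn body, void *ctx, int max_attempts, int *attempts_used)
{
    assert(body != NULL && max_attempts > 0);

    txn_result res = TXN_FATAL_ERROR;
    int attempt;
    for (attempt = 1; attempt <= max_attempts; attempt++) {
        res = body(ctx);
        if (res != TXN_TRANSIENT_ERROR)
            break; /* success or a fatal error: stop retrying */
    }
    if (attempts_used != NULL)
        *attempts_used = attempt > max_attempts ? max_attempts : attempt;
    return res;
}

/* Example body: fails transiently until the counter in ctx reaches zero. */
txn_result
flaky_body(void *ctx)
{
    int *failures_left = ctx;
    if (*failures_left > 0) {
        (*failures_left)--;
        return TXN_TRANSIENT_ERROR;
    }
    return TXN_OK;
}
```

Bounding the number of attempts keeps a persistently conflicting transaction from holding resources indefinitely, which fits the advice to keep transactions short.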
40. Journal and Checkpoint
● With journaling, data is written first to the journal and then to the core data files.
● MongoDB's MMAPv1 engine uses memory-mapped files to write data to disk; WiredTiger does not.
● To improve performance, writes first go into the in-memory buffer of the journal log.
● Journal files have a size limit of 100 MB.
● WiredTiger creates a new journal file approximately every 100 MB of data.
● WiredTiger uses snappy compression for the journal data.
○ storage.wiredTiger.engineConfig.journalCompressor defaults to snappy.
○ The minimum journal record size for WiredTiger is 128 bytes.
41. Cont..
● When the buffered data reaches 100 MB, or every 100 milliseconds, the data in the journal buffer is flushed to the journal file on disk.
○ storage.journal.commitIntervalMs: default 100 or 30
○ Write concern j:true causes an immediate sync of the journal.
○ If MongoDB exits abnormally, we may lose up to 100 MB of data or the last 100 ms of data.
● When the journal data reaches 2 GB, or every 60 seconds, changes are flushed to the data files.
○ storage.syncPeriodSecs: default 60
○ This is the amount of time that can pass before MongoDB flushes data to the data files via an fsync operation.
○ storage.syncPeriodSecs has no effect on the journal files.
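The two flush triggers can be written down as simple predicates. This is only a sketch of the thresholds quoted on these slides, not MongoDB's internal scheduling: the journal_state struct, the constant names and the functions journal_should_flush and checkpoint_due are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Thresholds as quoted on the slides. */
#define JOURNAL_FLUSH_BYTES (100ULL * 1024 * 1024)            /* 100 MB */
#define JOURNAL_COMMIT_INTERVAL_MS 100ULL                     /* storage.journal.commitIntervalMs */
#define CHECKPOINT_JOURNAL_BYTES (2ULL * 1024 * 1024 * 1024)  /* 2 GB */
#define CHECKPOINT_INTERVAL_SECS 60ULL                        /* storage.syncPeriodSecs */

/* Hypothetical view of the journal buffer at one moment. */
typedef struct journal_state {
    uint64_t buffered_bytes; /* data waiting in the journal buffer */
    uint64_t ms_since_flush; /* time since the last journal flush */
    bool j_true;             /* a write with write concern j:true is waiting */
} journal_state;

/* Should the journal buffer be flushed to the journal file now? */
bool
journal_should_flush(const journal_state *st)
{
    assert(st != NULL);
    return st->j_true ||
        st->buffered_bytes >= JOURNAL_FLUSH_BYTES ||
        st->ms_since_flush >= JOURNAL_COMMIT_INTERVAL_MS;
}

/* Should changes be flushed to the data files (a checkpoint) now? */
bool
checkpoint_due(uint64_t journal_bytes, uint64_t secs_since_checkpoint)
{
    return journal_bytes >= CHECKPOINT_JOURNAL_BYTES ||
        secs_since_checkpoint >= CHECKPOINT_INTERVAL_SECS;
}
```

Read together, the predicates show why an abnormal exit loses at most the unflushed journal buffer: everything older has either reached the journal file or been folded into a checkpoint.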
42. What I Didn’t Cover
1. Block Manager
2. Cache
3. BTree/LSM
4. Compression, etc.