The document discusses distributed consensus algorithms in MongoDB. It explains that MongoDB uses a leader-based replicated state machine approach, where servers elect a primary node and replicate the primary's log of state transitions. Elections are triggered if a node does not receive heartbeats from the primary within a timeout period. The upcoming MongoDB 3.2 release aims to improve consensus by taking inspiration from the Raft algorithm, including using term IDs to prevent double voting, monitoring liveness via data replication rather than separate heartbeats, and varying election timeouts randomly to reduce tied votes and speed up failover.
2. Agenda
• Introduction to consensus
• Leader-based replicated state machine
• Elections and data replication in MongoDB
• Improvements coming in MongoDB 3.2
4. What is Consensus?
• Getting multiple processes/servers to agree on something
• Must handle a wide range of failure modes
• Disk failure
• Power outage
• Clock skews
• Network partitions
• Machine freezes
6. Leader Based Consensus
State
Machine
X 3
Y 2
Z 7
Replicated
Log
X ⬅️ 1
Y ⬅️ 2
X ⬅️ 3
State
Machine
X 3
Y 2
Z 7
Replicated
Log
X ⬅️ 1
Y ⬅️ 2
X ⬅️ 3
State
Machine
X 3
Y 2
Z 7
Replicated
Log
X ⬅️ 1
Y ⬅️ 2
X ⬅️ 3
7. Agenda
• Introduction to consensus
• Leader-based replicated state machine
• Elections and data replication in MongoDB
• Improvements coming in MongoDB 3.2
8. Elections
X 3
Y 2
Z 7
X ⬅️ 1
Y ⬅️ 2
X ⬅️ 3
X 3
Y 2
Z 7
X ⬅️ 1
Y ⬅️ 2
X ⬅️ 3
X 3
Y 2
Z 7
X ⬅️ 1
Y ⬅️ 2
X ⬅️ 3
X 3
Y 2
Z 7
X ⬅️ 1
Y ⬅️ 2
X ⬅️ 3
X 3
Y 2
Z 7
X ⬅️ 1
Y ⬅️ 2
X ⬅️ 3
9. Elections
X 3
Y 2
Z 7
X ⬅️ 1
Y ⬅️ 2
X ⬅️ 3
X 3
Y 2
Z 7
X ⬅️ 1
Y ⬅️ 2
X ⬅️ 3
X 3
Y 2
Z 7
X ⬅️ 1
Y ⬅️ 2
X ⬅️ 3
X 3
Y 2
Z 7
X ⬅️ 1
Y ⬅️ 2
X ⬅️ 3
X 3
Y 2
Z 7
X ⬅️ 1
Y ⬅️ 2
X ⬅️ 3
10. Elections
X 3
Y 2
Z 7
X ⬅️ 1
Y ⬅️ 2
X ⬅️ 3
X 3
Y 2
Z 7
X ⬅️ 1
Y ⬅️ 2
X ⬅️ 3
X 3
Y 2
Z 7
X ⬅️ 1
Y ⬅️ 2
X ⬅️ 3
X 3
Y 2
Z 7
X ⬅️ 1
Y ⬅️ 2
X ⬅️ 3
X 3
Y 2
Z 7
X ⬅️ 1
Y ⬅️ 2
X ⬅️ 3
11. Elections
X 3
Y 2
Z 7
X ⬅️ 1
Y ⬅️ 2
X ⬅️ 3
X 3
Y 2
Z 7
X ⬅️ 1
Y ⬅️ 2
X ⬅️ 3
X 3
Y 2
Z 7
X ⬅️ 1
Y ⬅️ 2
X ⬅️ 3
X 3
Y 2
Z 7
X ⬅️ 1
Y ⬅️ 2
X ⬅️ 3
X 3
Y 2
Z 7
X ⬅️ 1
Y ⬅️ 2
X ⬅️ 3
12. Elections
X 3
Y 2
Z 7
X ⬅️ 1
Y ⬅️ 2
X ⬅️ 3
X 3
Y 2
Z 7
X ⬅️ 1
Y ⬅️ 2
X ⬅️ 3
X 3
Y 2
Z 7
X ⬅️ 1
Y ⬅️ 2
X ⬅️ 3
X 3
Y 2
Z 7
X ⬅️ 1
Y ⬅️ 2
X ⬅️ 3
X 3
Y 2
Z 7
X ⬅️ 1
Y ⬅️ 2
X ⬅️ 3
13. Data Replication
X 3
Y 2
Z 7
X ⬅️ 1
Y ⬅️ 2
X ⬅️ 3
X 3
Y 2
Z 7
X ⬅️ 1
Y ⬅️ 2
X ⬅️ 3
X 3
Y 2
Z 7
X ⬅️ 1
Y ⬅️ 2
X ⬅️ 3
X 3
Y 2
Z 7
X ⬅️ 1
Y ⬅️ 2
X ⬅️ 3
X 3
Y 2
Z 7
X ⬅️ 1
Y ⬅️ 2
X ⬅️ 3
14. Agenda
• Introduction to consensus
• Leader-based replicated state machine
• Elections and data replication in MongoDB
• Improvements coming in MongoDB 3.2
15. Agenda
• Introduction to consensus
• Leader-based replicated state machine
• Elections and data replication in MongoDB
• Improvements coming in MongoDB 3.2
• Goals and inspiration from Raft Consensus Algorithm
• Preventing double voting
• Liveness detection
• Calling for elections
16. Goals for MongoDB 3.2
• Decrease failover time
• Speed up detection and resolution of stale primary situations
17. Finding Inspiration in Raft
• “In Search of an Understandable Consensus Algorithm” by Diego
Ongaro: https://ramcloud.stanford.edu/raft.pdf
• Designed to address the shortcomings of Paxos
• Easier to understand
• Easier to implement in real applications
• Provably correct
• Remarkably similar to what we’re doing already
18. Raft Concepts
• Term (election) IDs
• Liveness monitoring using existing data replication channel
• Asymmetric election timeouts
19. Preventing Double Voting
• Can’t vote for 2 nodes in the same election
• Pre-3.2: 30 second vote timeout
• Post-3.2: Term IDs
• Term:
• Monotonically increasing ID
• Incremented on every election *attempt*
• Lets voters distinguish elections so they can vote twice quickly in
different elections.
20. Liveness Detection
• Pre-3.2: Heartbeats
• Sent every two seconds from every node to every other node
• Volume increases quadratically as nodes are added to the replica set
• Post-3.2: Extra metadata sent via existing data replication channel
• Utilizes chained replication
• Faster heartbeats = faster elections and detection of stale primaries
21. Data Replication
X 3
Y 2
Z 7
X ⬅️ 1
Y ⬅️ 2
X ⬅️ 3
X 3
Y 2
Z 7
X ⬅️ 1
Y ⬅️ 2
X ⬅️ 3
X 3
Y 2
Z 7
X ⬅️ 1
Y ⬅️ 2
X ⬅️ 3
X 3
Y 2
Z 7
X ⬅️ 1
Y ⬅️ 2
X ⬅️ 3
X 3
Y 2
Z 7
X ⬅️ 1
Y ⬅️ 2
X ⬅️ 3
22. Determining When To Call For An Election
• Tradeoff between failover time and spurious failovers
• Node calls for an election when it hasn’t heard from the primary within
the election timeout
• Election timeout is now configurable – starting in 3.2
23. Determining When To Call For An Election
• Prior to 3.2, election timeout was the same for all nodes
• Starting in 3.2:
• Election timeout is varied randomly for each node
• Varying timeouts help reduce tied votes
• Fewer tied votes = faster failover
24. Conclusion
• MongoDB 3.2 will have
• Faster failovers
• Faster error detection
• More control to prevent spurious failovers
• This means your systems are
• More stable
• More resilient to failure
• Easier to maintain
27. Tell me how I did today on Guidebook and enter for a chance
to win one of these
How to do it:
Download the Guidebook App
Search for MongoDB Silicon Valley
Submit session feedback