Reaching reliable
agreement in an unreliable
world
Heidi Howard
heidi.howard@cl.cam.ac.uk
@heidiann
Research Students Lecture Series
13th October 2015
1
Introducing Alice
Alice is a new graduate from
Cambridge off to the world of
work.
She joins a cool new startup,
where she is responsible for a
key-value store.
2
Single Server System
Server
Client 2
A 7
B 2
Client 1 Client 3
3
Single Server System
Server
Client 2
A 7
B 2
Client 1 Client 3
A?
7
4
Single Server System
Server
Client 2
A 7
B 2
Client 1 Client 3
B=3 OK
3
5
Single Server System
Server
Client 2
A 7
B 2
Client 1 Client 3
B?
3
3
6
Single Server System
Pros
• linearizable semantics
• durability with write-ahead
logging
• easy to deploy
• low latency (1 RTT in
common case)
• partition tolerance with
retransmission & command
cache
Cons
• system unavailable if server
fails
• throughput limited to one
server
7
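A minimal sketch of the single-server design, assuming a JSON-lines file as the write-ahead log and a (client, sequence number) pair as the command identifier. The class name `SingleServerKV` and its methods are hypothetical and only illustrate how logging before applying gives durability, and how a command cache makes retransmitted writes safe.

```python
import json
import os


class SingleServerKV:
    """Hypothetical single-server key-value store (illustrative only)."""

    def __init__(self, wal_path):
        self.wal_path = wal_path
        self.data = {}
        self.replies = {}  # command cache: (client, seq) -> reply
        # Recover state by replaying the write-ahead log, if one exists.
        if os.path.exists(wal_path):
            with open(wal_path) as wal:
                for line in wal:
                    self._apply(json.loads(line))

    def _apply(self, record):
        self.data[record["key"]] = record["value"]
        self.replies[(record["client"], record["seq"])] = "OK"

    def put(self, client, seq, key, value):
        # A retransmitted command is answered from the cache, not re-applied.
        if (client, seq) in self.replies:
            return self.replies[(client, seq)]
        record = {"client": client, "seq": seq, "key": key, "value": value}
        # Log and force to disk before applying, so the write survives a crash.
        with open(self.wal_path, "a") as wal:
            wal.write(json.dumps(record) + "\n")
            wal.flush()
            os.fsync(wal.fileno())
        self._apply(record)
        return "OK"

    def get(self, key):
        return self.data.get(key)
```

The server is still a single point of failure: the log protects the data, not availability.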
Backups
aka Primary backup replication
Primary
Client 2
Client 1
Backup 1
Backup 1
Backup 1
A 7
B 2
A?
7
A 7
B 2
A 7
B 2
A 7
B 2
8
Backups
aka Primary backup replication
Primary
Client 2
Client 1
Backup 1
Backup 1
Backup 1
A 7
B 1
B=1
A 7
B 2
A 7
B 2
A 7
B 2
A 7
B 1
9
Backups
aka Primary backup replication
Primary
Client 2
Client 1
Backup 1
Backup 1
Backup 1
A 7
B 1
OK
A 7
B 1
A 7
B 1
A 7
B 1
OK
OK
OK
10
Big Gotcha
We are assuming totally ordered broadcast
11
Totally Ordered Broadcast
(aka atomic broadcast) the guarantee that messages
are received reliably and in the same order by all
nodes.
12
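A small sketch of why primary-backup replication leans on this guarantee. The helper `apply_all` and the two writes are made up for illustration: replicas that apply the same writes in the same order agree, while a replica that sees them reordered ends up with a different value.

```python
def apply_all(commands, initial):
    """Apply (key, value) writes in the order given and return the final state."""
    state = dict(initial)
    for key, value in commands:
        state[key] = value
    return state


initial = {"A": 7, "B": 2}
writes = [("B", 1), ("B", 3)]  # two writes to the same key

primary = apply_all(writes, initial)
backup_same_order = apply_all(writes, initial)
backup_reordered = apply_all(list(reversed(writes)), initial)

print(primary == backup_same_order)   # same order, same state
print(primary == backup_reordered)    # different order, diverged state
```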
Requirements
• Scalability - High-throughput processing of
operations.
• Latency - Low-latency commit of operations as
perceived by the client.
• Fault-tolerance - Availability in the face of machine
and network failures.
• Linearizable semantics - Operate as if it were a single
server system.
13
Doing the Impossible
14
CAP Theorem
Pick 2 of 3:
• Consistency
• Availability
• Partition tolerance
Proposed by Brewer in 1998, still debated and
regarded as misleading. [Brewer’12]
[Kleppmann’15]
15
Eric Brewer
FLP Impossibility
It is impossible to guarantee consensus when
messages may be delayed, even if only one node may fail.
[JACM’85]
16
Consensus is impossible
[PODC’89]
Nancy Lynch
17
Aside from Simon PJ
Don’t drag your reader or listener
through your blood-stained
path.
Simon Peyton Jones
18
Paxos
Paxos is at the foundation of (almost)
all distributed consensus protocols.
It is a general approach of using two
phases and majority quorums.
It takes much more to construct a
complete fault-tolerant distributed
system.
Leslie Lamport
19
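The reason majority quorums are useful can be checked by brute force: any two majorities of the same cluster share at least one node, so a later round always meets a node that saw the earlier one. This is only a sketch of that property, not an implementation of Paxos; the function names are hypothetical.

```python
from itertools import combinations


def majority(n):
    """Smallest number of nodes that forms a majority of n."""
    return n // 2 + 1


def all_quorums_intersect(n, quorum_size):
    """Check every pair of quorums of the given size for a common node."""
    quorums = [set(c) for c in combinations(range(n), quorum_size)]
    return all(a & b for a, b in combinations(quorums, 2))


for n in range(1, 8):
    print(n, majority(n), all_quorums_intersect(n, majority(n)))

# Quorums of exactly half the nodes are not enough: {0, 1} and {2, 3} are disjoint.
print(all_quorums_intersect(4, 2))
```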
Beyond Paxos
• Replication techniques - mapping a consensus algorithm to a fault-tolerant
system
• Classes of operations - handling of reads vs writes, operating on
stale data, soft vs hard state
• Leadership - fixed, variable or leaderless, Multi-Paxos, how to elect
a leader, how to discover a leader
• Failure detectors - heartbeats & timers
• Dynamic membership, log compaction & GC, sharding, batching,
etc.
20
Consensus is hard
21
A raft in the sea of
confusion
22
Case Study: Raft
Raft is the understandable replication algorithm.
Provides linearizable client semantics, with 1 RTT best-case
latency for clients.
A complete(ish) architecture for making our
application fault-tolerant.
Uses SMR and Paxos.
23
State Machine Replication
Server
Client
Server Server A 7
B 2
A 7
B 2
A 7
B 2
B=3
24
State Machine Replication
Server
Client
Server Server A 7
B 2
A 7
B 2
A 7
B 2
B=3
3
3 3
26
Leadership
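Follower, Candidate, Leader
Startup/Restart -> Follower
Follower -> Candidate: Timeout
Candidate -> Candidate: Timeout
Candidate -> Leader: Win
Candidate -> Follower: Step down
Leader -> Follower: Step down
27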
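The Leadership diagram can be written as a lookup table. This sketch assumes the arrows are those of the Raft paper's role diagram (a follower times out into a candidate, a candidate either wins, times out again, or steps down, and a leader steps down); the event names and `next_role` are hypothetical.

```python
# Transition table for the Follower / Candidate / Leader diagram.
TRANSITIONS = {
    ("follower", "timeout"): "candidate",
    ("candidate", "timeout"): "candidate",   # start a fresh election
    ("candidate", "win"): "leader",
    ("candidate", "step_down"): "follower",
    ("leader", "step_down"): "follower",
}


def next_role(role, event):
    """Return the role after an event; events with no arrow leave it unchanged."""
    return TRANSITIONS.get((role, event), role)


def run(events, role="follower"):
    """Nodes start (and restart) as followers, then follow the events in order."""
    for event in events:
        role = next_role(role, event)
    return role


print(run(["timeout", "win"]))
print(run(["timeout", "timeout", "step_down"]))
```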
Ordering
Each node stores its own perspective on a value
known as the term.
Each message includes the sender’s term and this is
checked by the recipient.
The term orders periods of leadership to aid in
avoiding conflict.
Each node has one vote per term, thus there is at most one
leader per term.
28
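A minimal sketch of the term and voting rules above, under simplifying assumptions: it omits the log comparison a full Raft vote also performs, and the names `Node`, `handle_vote_request` and `run_election` are hypothetical. It shows a node adopting a higher term, rejecting a stale one, and granting at most one vote per term.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    node_id: int
    term: int = 0
    voted_for: Optional[int] = None

    def observe_term(self, term):
        # A higher term starts a new period: adopt it and clear the old vote.
        if term > self.term:
            self.term = term
            self.voted_for = None

    def handle_vote_request(self, candidate_id, candidate_term):
        self.observe_term(candidate_term)
        if candidate_term < self.term:
            return False  # request comes from a stale term
        if self.voted_for in (None, candidate_id):
            self.voted_for = candidate_id  # the one vote for this term
            return True
        return False


def run_election(candidate, nodes):
    """The candidate bumps its term, votes for itself and asks everyone else."""
    candidate.term += 1
    candidate.voted_for = candidate.node_id
    votes = 1
    for node in nodes:
        if node is not candidate:
            votes += node.handle_vote_request(candidate.node_id, candidate.term)
    return votes > len(nodes) // 2


nodes = {i: Node(i) for i in range(1, 6)}
print(run_election(nodes[4], list(nodes.values())))  # node 4 asks for term 1
```

Because each node grants a single vote per term and a win needs a majority, two candidates cannot both win the same term.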
Ordering
29
ID: 1
ID: 2
ID: 3
1
1
1
2
2 2
2
ID: 1, Term: 0, Vote: n
ID: 2, Term: 0, Vote: n
ID: 5, Term: 0, Vote: n
ID: 4, Term: 0, Vote: n
ID: 3, Term: 0, Vote: n
30
ID: 1, Term: 0, Vote: n
ID: 2, Term: 0, Vote: n
ID: 5, Term: 0, Vote: n
ID: 4, Term: 1, Vote: me
ID: 3, Term: 0, Vote: n
Vote for me in term 1!
32
ID: 1, Term: 1, Vote: 4
ID: 2, Term: 1, Vote: 4
ID: 5, Term: 1, Vote: 4
ID: 4, Term: 1, Vote: me
ID: 3, Term: 1, Vote: 4
Ok!
33
ID: 1, Term: 1, Vote: 4
ID: 2, Term: 1, Vote: 4
ID: 5, Term: 1, Vote: 4
ID: 4, Term: 1, Vote: me
ID: 3, Term: 1, Vote: 4
Leader fails
34
ID: 1, Term: 2, Vote: 5
ID: 2, Term: 1, Vote: 4
ID: 5, Term: 2, Vote: me
ID: 4, Term: 1, Vote: me
ID: 3, Term: 1, Vote: 4
Vote for me in term 2!
35
ID: 1, Term: 2, Vote: 5
ID: 2, Term: 2, Vote: me
ID: 5, Term: 2, Vote: me
ID: 4, Term: 1, Vote: me
ID: 3, Term: 2, Vote: 2
Vote for me in term 2!
36
ID: 1, Term: 2, Vote: 5
ID: 2, Term: 2, Vote: me
ID: 5, Term: 2, Vote: me
ID: 4, Term: 1, Vote: me
ID: 3, Term: 2, Vote: 2
37
ID: 1, Term: 3, Vote: 2
ID: 2, Term: 3, Vote: me
ID: 5, Term: 3, Vote: 2
ID: 4, Term: 3, Vote: 2
ID: 3, Term: 3, Vote: 2
Vote for me in term 3!
39
ID: 1, Term: 3, Vote: 2
ID: 2, Term: 3, Vote: me
ID: 5, Term: 3, Vote: 2
ID: 4, Term: 3, Vote: 2
ID: 3, Term: 3, Vote: 2
40
Replication
Each node has a log of client commands and an index
into this log representing which commands have been
committed.
41
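A sketch of the per-node data structure described above, assuming commands are simple (key, value) writes. The class `ReplicaLog` and its method names are hypothetical. Appending a command does not change the key-value state; only advancing the commit index does.

```python
class ReplicaLog:
    """Hypothetical per-node log with a commit index (illustrative only)."""

    def __init__(self, initial=None):
        self.log = []           # client commands, in log order
        self.commit_index = 0   # number of log entries known to be committed
        self.last_applied = 0   # number of log entries applied to the state
        self.state = dict(initial or {})

    def append(self, key, value):
        """Add a command to the log and return its 1-based position."""
        self.log.append((key, value))
        return len(self.log)

    def commit_up_to(self, index):
        """Advance the commit index and apply newly committed commands in order."""
        self.commit_index = max(self.commit_index, min(index, len(self.log)))
        while self.last_applied < self.commit_index:
            key, value = self.log[self.last_applied]
            self.state[key] = value
            self.last_applied += 1


replica = ReplicaLog({"B": 2})
replica.commit_up_to(replica.append("A", 4))
position = replica.append("B", 7)
print(replica.state)            # B=7 is logged but not yet committed
replica.commit_up_to(position)
print(replica.state)
```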
ID: 1
ID: 2
ID: 3
A=4
A=4
A=4
A 4
B 2
A 4
B 2
A 4
B 2
Simple Replication
42
ID: 1
ID: 2
ID: 3
A=4
A=4
A=4
A 4
B 2
A 4
B 2
A 4
B 2
B=7
Simple Replication
43
ID: 1
ID: 2
ID: 3
A=4
A=4
A=4
A 4
B 2
A 4
B 2
A 4
B 2
B=7
B=7
B=7
Simple Replication
44
ID: 1
ID: 2
ID: 3
A=4
A=4
A=4
A 4
B 2
A 4
B 7
A 4
B 2
B=7
B=7
B=7
Simple Replication
45
ID: 1
ID: 2
ID: 3
A=4
A=4
A=4
A 4
B 7
A 4
B 7
A 4
B 7
B=7
B=7
B=7
Simple Replication
46
ID: 1
ID: 2
ID: 3
A=4
A=4
A=4
A 4
B 7
A 4
B 7
A 4
B 7
B=7
B=7
B=7
B?
Simple Replication
47
ID: 1
ID: 2
ID: 3
A=4
A=4
A=4
A 4
B 7
A 4
B 7
A 4
B 7
B=7
B=7
B=7
B?
B?
B?
Simple Replication
48
ID: 1
ID: 2
ID: 3
A=4
A=4
A=4
A 4
B 2
A 4
B 7
A 4
B 7
B=7
B=7
B?
B?
Catchup
A=6
50
ID: 1
ID: 2
ID: 3
A=4
A=4
A=4
A 4
B 2
A 4
B 7
A 4
B 7
B=7
B=7
B?
B?
Catchup
A=6
A=6
:(
51
ID: 1
ID: 2
ID: 3
A=4
A=4
A=4
A 4
B 2
A 4
B 7
A 4
B 7
B=7
B=7
B?
B?
Catchup
A=6
A=6
:)
53
ID: 1
ID: 2
ID: 3
A=4
A=4
A=4
A 4
B 2
A 4
B 7
A 4
B 7
B=7
B=7
B?
B?
Catchup
A=6
A=6
B=7 B? A=6
54
Evaluation
• The leader is a serious bottleneck -> limited
scalability
• Can only handle the failure of a minority of nodes
• Some rare network partitions can leave the protocol in
livelock
55
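The "minority of nodes" limit is simple arithmetic, sketched here with hypothetical helper names: a cluster of n nodes keeps making progress only while a majority is still running.

```python
def tolerated_failures(n):
    """Largest number of failed nodes that still leaves a majority of n running."""
    return (n - 1) // 2


def can_make_progress(n, failed):
    """True while the surviving nodes still form a majority."""
    return n - failed >= n // 2 + 1


for n in range(1, 8):
    print(n, "nodes tolerate", tolerated_failures(n), "failures")
```

One consequence is that growing a cluster from 3 to 4 nodes adds load without tolerating any extra failure.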
Beyond Raft
56
Case Study: Tango
Tango is designed to be a scalable replication
protocol.
It’s a variant of chain replication + Paxos.
It is leaderless and pushes more work onto clients.
Simple Replication
Client 1
A 7
B 2
Client 2
A 4
B 2 Sequencer
Server 1 Server 2 Server 3
0
A=4
Next: 1
0
A=4
0
A=4
B=5
58
Simple Replication
Client 1
A 7
B 2
Client 2
A 4
B 2 Sequencer
Server 1 Server 2 Server 3
0
A=4
Next: 2
0
A=4
0
A=4
Next?
1
59
B=5
Simple Replication
Client 1
A 7
B 2
Client 2
A 4
B 2 Sequencer
Server 1 Server 2 Server 3
0
A=4
Next: 2
0
A=4
0
A=4
B=5 @ 1
OK
1
B=5
60
B=5
Simple Replication
Client 1
A 7
B 2
Client 2
A 4
B 2 Sequencer
Server 1 Server 2 Server 3
0
A=4
Next: 2
0
A=4
0
A=4
1
B=5
1
B=5
61
B=5 @ 1 OK
Simple Replication
Client 1
A 7
B 2
Client 2
A 4
B 2 Sequencer
Server 1 Server 2 Server 3
0
A=4
Next: 2
0
A=4
0
A=4
1
B=5
1
B=5
1
B=5
62
B=5 @ 1
OK
Simple Replication
Client 1
A 7
B 2
Client 2
A 4
B 5 Sequencer
Server 1 Server 2 Server 3
0
A=4
Next: 2
0
A=4
0
A=4
1
B=5
1
B=5
1
B=5
63
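The write path walked through in these frames can be sketched as follows. It assumes the sequencer only hands out the next free log position and that the client itself writes the command to each server in chain order. All names here are hypothetical, and failure handling (gaps, reconfiguration) is left out.

```python
class Sequencer:
    """Hands out log positions; holds no data of its own."""

    def __init__(self):
        self.next = 0

    def next_position(self):
        position = self.next
        self.next += 1
        return position


class LogServer:
    """Stores commands by position; each position can be written only once."""

    def __init__(self):
        self.slots = {}

    def write(self, position, command):
        if position in self.slots:
            return False
        self.slots[position] = command
        return True

    def read(self, position):
        return self.slots.get(position)


def client_append(sequencer, chain, command):
    """Ask the sequencer for a position, then write down the chain in order."""
    position = sequencer.next_position()
    for server in chain:
        if not server.write(position, command):
            raise RuntimeError(f"position {position} is already written")
    return position


sequencer = Sequencer()
chain = [LogServer(), LogServer(), LogServer()]
client_append(sequencer, chain, ("A", 4))         # lands at position 0
print(client_append(sequencer, chain, ("B", 5)))  # position handed out for B=5
```

The sequencer is touched once per write and never carries the command itself, which is how the design avoids a leader-style bottleneck.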
Fast Read
Client 1
A 7
B 2
Client 2
A 4
B 5 Sequencer
Server 1 Server 2 Server 3
0
A=4
Next: 2
0
A=4
0
A=4
1
B=5
1
B=5
1
B=5
Check?
1
B?
64
Handling Failures
The sequencer is soft state, which can be reconstructed by
querying the head server.
Clients failing between receiving a token and first read
leave gaps in the log. Clients can mark these as empty
and the space (though not the address) is reused.
Clients may fail before completing a write. The next client
can fill in and complete the operation.
Server failures are detected by clients, who initiate a
membership change, using term numbers.
65
Evaluation
Tango is scalable; the leader is no longer the
bottleneck.
Dynamic membership and sharding come for free
with the design.
High latency of chain replication.
66
Next Steps
67
68
wait… we’re not finished yet!
Requirements
• Scalability - High-throughput processing of
operations.
• Latency - Low-latency commit of operations as
perceived by the client.
• Fault-tolerance - Availability in the face of machine
and network failures.
• Linearizable semantics - Operate as if it were a single
server system.
69
Many more examples
• Raft [ATC’14] - Good starting point, understandable
algorithm from SMR + Multi-Paxos variant
• VRR [MIT-TR’12] - Raft with round-robin leadership & more
distributed load
• Tango [SOSP’13] - Scalable algorithm for f+1 nodes, uses
CR + Multi-Paxos variant
• ZooKeeper [ATC'10] - Primary backup replication + atomic
broadcast protocol (Zab [DSN’11])
• EPaxos [SOSP’13] - leaderless Paxos variant for WANs
70
Can we do even better?
• Self-scaling replication - adapting resources to
maintain resilience level.
• Geo replication - strong consistency across wide
area links
• Auto configuration - adapting timeouts and
configuration as the network changes
• Integrated with unikernels, virtualisation, containers
and other such deployment tech
71
Evaluation is hard
• few common evaluation metrics.
• often only one experiment setup is used.
• different workloads
• evaluation to demonstrate protocol strength
72
Lessons Learned
• Reaching consensus in distributed
systems is doable
• Exploit domain knowledge
• Raft is a good starting point but we
can do much better!
Questions?
73

More Related Content

Similar to Reaching reliable agreement in an unreliable world

Unveiling etcd: Architecture and Source Code Deep Dive
Unveiling etcd: Architecture and Source Code Deep DiveUnveiling etcd: Architecture and Source Code Deep Dive
Unveiling etcd: Architecture and Source Code Deep DiveChieh (Jack) Yu
 
BlueHat v18 || Hardening hyper-v through offensive security research
BlueHat v18 || Hardening hyper-v through offensive security researchBlueHat v18 || Hardening hyper-v through offensive security research
BlueHat v18 || Hardening hyper-v through offensive security researchBlueHat Security Conference
 
Microservices interaction at scale using Apache Kafka
Microservices interaction at scale using Apache KafkaMicroservices interaction at scale using Apache Kafka
Microservices interaction at scale using Apache KafkaIvan Ursul
 
HES2011 - Sebastien Tricaud - Capture me if you can
HES2011 - Sebastien Tricaud - Capture me if you canHES2011 - Sebastien Tricaud - Capture me if you can
HES2011 - Sebastien Tricaud - Capture me if you canHackito Ergo Sum
 
Hackito Ergo Sum 2011: Capture me if you can!
Hackito Ergo Sum 2011: Capture me if you can!Hackito Ergo Sum 2011: Capture me if you can!
Hackito Ergo Sum 2011: Capture me if you can!stricaud
 
Cassandra introduction apache con 2014 budapest
Cassandra introduction apache con 2014 budapestCassandra introduction apache con 2014 budapest
Cassandra introduction apache con 2014 budapestDuyhai Doan
 
Moving Toward Deep Learning Algorithms on HPCC Systems
Moving Toward Deep Learning Algorithms on HPCC SystemsMoving Toward Deep Learning Algorithms on HPCC Systems
Moving Toward Deep Learning Algorithms on HPCC SystemsHPCC Systems
 
When DevOps and Networking Intersect by Brent Salisbury of socketplane.io
When DevOps and Networking Intersect by Brent Salisbury of socketplane.ioWhen DevOps and Networking Intersect by Brent Salisbury of socketplane.io
When DevOps and Networking Intersect by Brent Salisbury of socketplane.ioDevOps4Networks
 
2014 OSDC Talk: Introduction to Percona XtraDB Cluster and HAProxy
2014 OSDC Talk: Introduction to Percona XtraDB Cluster and HAProxy2014 OSDC Talk: Introduction to Percona XtraDB Cluster and HAProxy
2014 OSDC Talk: Introduction to Percona XtraDB Cluster and HAProxyBo-Yi Wu
 
Distributed Consensus: Making the Impossible Possible
Distributed Consensus: Making the Impossible PossibleDistributed Consensus: Making the Impossible Possible
Distributed Consensus: Making the Impossible PossibleC4Media
 
Cassandra for the ops dos and donts
Cassandra for the ops   dos and dontsCassandra for the ops   dos and donts
Cassandra for the ops dos and dontsDuyhai Doan
 
The Next Generation of the Consumer Rebalance Protocol with David Jacot
The Next Generation of the Consumer Rebalance Protocol with David JacotThe Next Generation of the Consumer Rebalance Protocol with David Jacot
The Next Generation of the Consumer Rebalance Protocol with David JacotHostedbyConfluent
 
Register Sort Xeon Phi final
Register Sort Xeon Phi finalRegister Sort Xeon Phi final
Register Sort Xeon Phi finalXiaochen Tian
 
Crossing the Boundaries: Development Strategies for (P)SoCs
Crossing the Boundaries: Development Strategies for (P)SoCsCrossing the Boundaries: Development Strategies for (P)SoCs
Crossing the Boundaries: Development Strategies for (P)SoCsAndreas Koschak
 
Apache Hadoop YARN 3.x in Alibaba
Apache Hadoop YARN 3.x in AlibabaApache Hadoop YARN 3.x in Alibaba
Apache Hadoop YARN 3.x in AlibabaDataWorks Summit
 
High Secure Password Authentication System
High Secure Password Authentication SystemHigh Secure Password Authentication System
High Secure Password Authentication SystemAkhil Nadh PC
 
Solving Cross-Cutting Concerns in PHP - DutchPHP Conference 2016
Solving Cross-Cutting Concerns in PHP - DutchPHP Conference 2016 Solving Cross-Cutting Concerns in PHP - DutchPHP Conference 2016
Solving Cross-Cutting Concerns in PHP - DutchPHP Conference 2016 Alexander Lisachenko
 
Life Cycle of Metrics, Alerting, and Performance Monitoring in Microservices
Life Cycle of Metrics, Alerting, and Performance Monitoring in MicroservicesLife Cycle of Metrics, Alerting, and Performance Monitoring in Microservices
Life Cycle of Metrics, Alerting, and Performance Monitoring in MicroservicesSean Chittenden
 
Eliminating SAN Congestion Just Got Much Easier- webinar - Nov 2015
Eliminating SAN Congestion Just Got Much Easier-  webinar - Nov 2015 Eliminating SAN Congestion Just Got Much Easier-  webinar - Nov 2015
Eliminating SAN Congestion Just Got Much Easier- webinar - Nov 2015 Tony Antony
 

Similar to Reaching reliable agreement in an unreliable world (20)

Unveiling etcd: Architecture and Source Code Deep Dive
Unveiling etcd: Architecture and Source Code Deep DiveUnveiling etcd: Architecture and Source Code Deep Dive
Unveiling etcd: Architecture and Source Code Deep Dive
 
BlueHat v18 || Hardening hyper-v through offensive security research
BlueHat v18 || Hardening hyper-v through offensive security researchBlueHat v18 || Hardening hyper-v through offensive security research
BlueHat v18 || Hardening hyper-v through offensive security research
 
Microservices interaction at scale using Apache Kafka
Microservices interaction at scale using Apache KafkaMicroservices interaction at scale using Apache Kafka
Microservices interaction at scale using Apache Kafka
 
HES2011 - Sebastien Tricaud - Capture me if you can
HES2011 - Sebastien Tricaud - Capture me if you canHES2011 - Sebastien Tricaud - Capture me if you can
HES2011 - Sebastien Tricaud - Capture me if you can
 
Hackito Ergo Sum 2011: Capture me if you can!
Hackito Ergo Sum 2011: Capture me if you can!Hackito Ergo Sum 2011: Capture me if you can!
Hackito Ergo Sum 2011: Capture me if you can!
 
Cassandra introduction apache con 2014 budapest
Cassandra introduction apache con 2014 budapestCassandra introduction apache con 2014 budapest
Cassandra introduction apache con 2014 budapest
 
Moving Toward Deep Learning Algorithms on HPCC Systems
Moving Toward Deep Learning Algorithms on HPCC SystemsMoving Toward Deep Learning Algorithms on HPCC Systems
Moving Toward Deep Learning Algorithms on HPCC Systems
 
When DevOps and Networking Intersect by Brent Salisbury of socketplane.io
When DevOps and Networking Intersect by Brent Salisbury of socketplane.ioWhen DevOps and Networking Intersect by Brent Salisbury of socketplane.io
When DevOps and Networking Intersect by Brent Salisbury of socketplane.io
 
2014 OSDC Talk: Introduction to Percona XtraDB Cluster and HAProxy
2014 OSDC Talk: Introduction to Percona XtraDB Cluster and HAProxy2014 OSDC Talk: Introduction to Percona XtraDB Cluster and HAProxy
2014 OSDC Talk: Introduction to Percona XtraDB Cluster and HAProxy
 
Distributed Consensus: Making the Impossible Possible
Distributed Consensus: Making the Impossible PossibleDistributed Consensus: Making the Impossible Possible
Distributed Consensus: Making the Impossible Possible
 
Cassandra for the ops dos and donts
Cassandra for the ops   dos and dontsCassandra for the ops   dos and donts
Cassandra for the ops dos and donts
 
The Next Generation of the Consumer Rebalance Protocol with David Jacot
The Next Generation of the Consumer Rebalance Protocol with David JacotThe Next Generation of the Consumer Rebalance Protocol with David Jacot
The Next Generation of the Consumer Rebalance Protocol with David Jacot
 
Register Sort Xeon Phi final
Register Sort Xeon Phi finalRegister Sort Xeon Phi final
Register Sort Xeon Phi final
 
Crossing the Boundaries: Development Strategies for (P)SoCs
Crossing the Boundaries: Development Strategies for (P)SoCsCrossing the Boundaries: Development Strategies for (P)SoCs
Crossing the Boundaries: Development Strategies for (P)SoCs
 
Apache Hadoop YARN 3.x in Alibaba
Apache Hadoop YARN 3.x in AlibabaApache Hadoop YARN 3.x in Alibaba
Apache Hadoop YARN 3.x in Alibaba
 
High Secure Password Authentication System
High Secure Password Authentication SystemHigh Secure Password Authentication System
High Secure Password Authentication System
 
SOLID principles
SOLID principlesSOLID principles
SOLID principles
 
Solving Cross-Cutting Concerns in PHP - DutchPHP Conference 2016
Solving Cross-Cutting Concerns in PHP - DutchPHP Conference 2016 Solving Cross-Cutting Concerns in PHP - DutchPHP Conference 2016
Solving Cross-Cutting Concerns in PHP - DutchPHP Conference 2016
 
Life Cycle of Metrics, Alerting, and Performance Monitoring in Microservices
Life Cycle of Metrics, Alerting, and Performance Monitoring in MicroservicesLife Cycle of Metrics, Alerting, and Performance Monitoring in Microservices
Life Cycle of Metrics, Alerting, and Performance Monitoring in Microservices
 
Eliminating SAN Congestion Just Got Much Easier- webinar - Nov 2015
Eliminating SAN Congestion Just Got Much Easier-  webinar - Nov 2015 Eliminating SAN Congestion Just Got Much Easier-  webinar - Nov 2015
Eliminating SAN Congestion Just Got Much Easier- webinar - Nov 2015
 

Recently uploaded

Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clashcharlottematthew16
 
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DayH2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DaySri Ambati
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsMiki Katsuragi
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfPrecisely
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...Fwdays
 

Recently uploaded (20)

Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clash
 
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DayH2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
 

Reaching reliable agreement in an unreliable world

  • 1. Reaching reliable agreement in an unreliable world Heidi Howard heidi.howard@cl.cam.ac.uk @heidiann Research Students Lecture Series 13th October 2015 1
  • 2. Introducing Alice Alice is new graduate from Cambridge off to the world of work. She joins a cool new start up, where she is responsible for a key value store. 2
  • 3. Single Server System Server Client 2 A 7 B 2 Client 1 Client 3 3
  • 4. Single Server System Server Client 2 A 7 B 2 Client 1 Client 3 A? 7 4
  • 5. Single Server System Server Client 2 A 7 B 2 Client 1 Client 3 B=3 OK 3 5
  • 6. Single Server System Server Client 2 A 7 B 2 Client 1 Client 3 B? 3 3 6
  • 7. Single Server System Pros • linearizable semantics • durability with write-ahead logging • easy to deploy • low latency (1 RTT in common case) • partition tolerance with retransmission & command cache Cons • system unavailable if server fails • throughput limited to one server 7
  • 8. Backups aka Primary backup replication Primary Client 2 Client 1 Backup 1 Backup 1 Backup 1 A 7 B 2 A? 7 A 7 B 2 A 7 B 2 A 7 B 2 8
  • 9. Backups aka Primary backup replication Primary Client 2 Client 1 Backup 1 Backup 1 Backup 1 A 7 B 1 B=1 A 7 B 2 A 7 B 2 A 7 B 2 A 7 B 1 9
  • 10. Backups aka Primary backup replication Primary Client 2 Client 1 Backup 1 Backup 1 Backup 1 A 7 B 1 OK A 7 B 1 A 7 B 1 A 7 B 1 OK OK OK 10
  • 11. Big Gotcha We are assuming total ordered broadcast 11
  • 12. Totally Ordered Broadcast (aka atomic broadcast) the guarantee that messages are received reliably and in the same order by all nodes. 12
  • 13. Requirements • Scalability - High throughout processing of operations. • Latency - Low latency commit of operation as perceived by the client. • Fault-tolerance - Availability in the face of machine and network failures. • Linearizable semantics - Operate as if a single server system. 13
  • 15. CAP Theorem Pick 2 of 3: • Consistency • Availability • Partition tolerance Proposed by Brewer in 1998, still debated and regarded as misleading. [Brewer’12] [Kleppmann’15] 15 Eric Brewer
  • 16. FLP Impossibility It is impossible to guarantee consensus when messages may be delay if even one node may fail. [JACM’85] 16
  • 18. Aside from Simon PJ Don’t drag your reader or listener through your blood strained path. Simon Peyton Jones 18
  • 19. Paxos Paxos is at the foundation of (almost) all distributed consensus protocols. It is a general approach of using two phases and majority quorums. It takes much more to construct a complete fault-tolerance distributed systems. Leslie Lamport 19
  • 20. Beyond Paxos • Replication techniques - mapping a consensus algorithm to a fault tolerance system • Classes of operations - handling of read vs writes, operating on stale data, soft vs hard state • Leadership - fixed, variable or leaderless, multi-paxos, how to elect a leader, how to discover a leader • Failure detectors - heartbeats & timers • Dynamic membership, Log compaction & GC, sharding, batching etc 20
  • 22. A raft in the sea of confusion 22
  • 23. Case Study: Raft Raft is the understandable replication algorithm. Provides linearisable client semantics, 1 RTT best case latency for clients. A complete(ish) architecture for making our application fault-tolerance. Uses SMR and Paxos 23
  • 26. State Machine Replication Server Client ServerServer A 7 B 2 A 7 B 2 A 7 B 2 B=3 3 3 3 26
  • 28. Ordering Each node stores is own perspective on a value known as the term. Each message includes the sender’s term and this is checked by the recipient. The term orders periods of leadership to aid in avoiding conflict. Each has one vote per term, thus there is at most one leader per term. 28
  • 29. Ordering 29 ID: 1 ID: 2 ID: 3 1 1 1 2 2 2 2
  • 30. ID: 1 Term: 0 Vote: n ID: 2 Term: 0 Vote: n ID: 5 Term: 0 Vote: n ID: 4 Term: 0 Vote: n ID: 3 Term: 0 Vote: n 30
  • 32. ID: 1 Term: 0 Vote: n ID: 2 Term: 0 Vote: n ID: 5 Term: 0 Vote: n ID: 4 Term: 1 Vote: me ID: 3 Term: 0 Vote: n Vote for me in term 1! 32
  • 33. ID: 1 Term: 1 Vote: 4 ID: 2 Term: 1 Vote: 4 ID: 5 Term: 1 Vote: 4 ID: 4 Term: 1 Vote: me ID: 3 Term: 1 Vote: 4 Ok! 33
  • 34. ID: 1 Term: 1 Vote: 4 ID: 2 Term: 1 Vote: 4 ID: 5 Term: 1 Vote: 4 ID: 4 Term: 1 Vote: me ID: 3 Term: 1 Vote: 4 Leader fails 34
  • 35. ID: 1 Term: 2 Vote: 5 ID: 2 Term: 1 Vote: 4 ID: 5 Term: 2 Vote: me ID: 4 Term: 1 Vote: me ID: 3 Term: 1 Vote: 4 Vote for me in term 2! 35
  • 36. ID: 1 Term: 2 Vote: 5 ID: 2 Term: 2 Vote: me ID: 5 Term: 2 Vote: me ID: 4 Term: 1 Vote: me ID: 3 Term: 2 Vote: 2 Vote for me in term 2! 36
  • 37. ID: 1 Term: 2 Vote: 5 ID: 2 Term: 2 Vote: me ID: 5 Term: 2 Vote: me ID: 4 Term: 1 Vote: me ID: 3 Term: 2 Vote: 2 37
  • 39. ID: 1 Term: 3 Vote: 2 ID: 2 Term: 3 Vote: me ID: 5 Term: 3 Vote: 2 ID: 4 Term: 3 Vote: 2 ID: 3 Term: 3 Vote: 2 Vote for me in term 3! 39
  • 40. ID: 1 Term: 3 Vote: 2 ID: 2 Term: 3 Vote: me ID: 5 Term: 3 Vote: 2 ID: 4 Term: 3 Vote: 2 ID: 3 Term: 3 Vote: 2 40
  • 41. Replication Each node has a log of client commands and an index into this log representing which commands have been committed. 41
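One way to picture the commit index is as the highest log position that a majority of nodes already store. The sketch below computes it from the last replicated index on each node; the name `commit_index` and the `match_index` list are assumptions for illustration, and Raft's extra condition that the entry belongs to the leader's current term is left out.

```python
from typing import List


def commit_index(match_index: List[int]) -> int:
    """Highest log index that is stored on a majority of nodes.

    match_index[i] is the last log index known to be stored on node i
    (the leader includes itself in the list).
    """
    ranked = sorted(match_index, reverse=True)
    # The entry at position n // 2 is held by at least n // 2 + 1 nodes.
    return ranked[len(ranked) // 2]


# Three nodes: two have stored index 2, one is still at index 1.
print(commit_index([2, 2, 1]))
```

Commands up to the commit index are safe to apply to the state machine, since a majority must be contacted by any future leader.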
  • 42. ID: 1 ID: 2 ID: 3 A=4 A=4 A=4 A 4 B 2 A 4 B 2 A 4 B 2 Simple Replication 42
  • 43. ID: 1 ID: 2 ID: 3 A=4 A=4 A=4 A 4 B 2 A 4 B 2 A 4 B 2 B=7 Simple Replication 43
  • 44. ID: 1 ID: 2 ID: 3 A=4 A=4 A=4 A 4 B 2 A 4 B 2 A 4 B 2 B=7 B=7 B=7 Simple Replication 44
  • 45. ID: 1 ID: 2 ID: 3 A=4 A=4 A=4 A 4 B 2 A 4 B 7 A 4 B 2 B=7 B=7 B=7 Simple Replication 45
  • 46. ID: 1 ID: 2 ID: 3 A=4 A=4 A=4 A 4 B 7 A 4 B 7 A 4 B 7 B=7 B=7 B=7 Simple Replication 46
  • 47. ID: 1 ID: 2 ID: 3 A=4 A=4 A=4 A 4 B 7 A 4 B 7 A 4 B 7 B=7 B=7 B=7 B? Simple Replication 47
  • 48. ID: 1 ID: 2 ID: 3 A=4 A=4 A=4 A 4 B 7 A 4 B 7 A 4 B 7 B=7 B=7 B=7 B? B? B? Simple Replication 48
  • 49. ID: 1 ID: 2 ID: 3 A=4 A=4 A=4 A 4 B 7 A 4 B 7 A 4 B 7 B=7 B=7 B=7 B? B? B? Simple Replication 49
  • 50. ID: 1 ID: 2 ID: 3 A=4 A=4 A=4 A 4 B 2 A 4 B 7 A 4 B 7 B=7 B=7 B? B? Catchup A=6 50
  • 51. ID: 1 ID: 2 ID: 3 A=4 A=4 A=4 A 4 B 2 A 4 B 7 A 4 B 7 B=7 B=7 B? B? Catchup A=6 A=6 :( 51
  • 52. ID: 1 ID: 2 ID: 3 A=4 A=4 A=4 A 4 B 2 A 4 B 7 A 4 B 7 B=7 B=7 B? B? Catchup A=6 A=6 :( 52
  • 53. ID: 1 ID: 2 ID: 3 A=4 A=4 A=4 A 4 B 2 A 4 B 7 A 4 B 7 B=7 B=7 B? B? Catchup A=6 A=6 :) 53
  • 54. ID: 1 ID: 2 ID: 3 A=4 A=4 A=4 A 4 B 2 A 4 B 7 A 4 B 7 B=7 B=7 B? B? Catchup A=6 A=6 B=7 B? A=6 54
  • 55. Evaluation • The leader is a serious bottleneck -> limited scalability • Can only handle the failure of a minority of nodes • Some rare network partitions can leave the protocol in livelock 55
  • 57. Case Study: Tango Tango is designed to be a scalable replication protocol. It’s a variant of chain replication + Paxos. It is leaderless and pushes more work onto clients 57
  • 58. Simple Replication Client 1 A 7 B 2 Client 2 A 4 B 2 Sequencer Server 1 Server 2 Server 3 0 A=4 Next: 1 0 A=4 0 A=4 B=5 58
  • 59. Simple Replication Client 1 A 7 B 2 Client 2 A 4 B 2 Sequencer Server 1 Server 2 Server 3 0 A=4 Next: 2 0 A=4 0 A=4 Next? 1 59 B=5
  • 60. Simple Replication Client 1 A 7 B 2 Client 2 A 4 B 2 Sequencer Server 1 Server 2 Server 3 0 A=4 Next: 2 0 A=4 0 A=4 B=5 @ 1 OK 1 B=5 60 B=5
  • 61. Simple Replication Client 1 A 7 B 2 Client 2 A 4 B 2 Sequencer Server 1 Server 2 Server 3 0 A=4 Next: 2 0 A=4 0 A=4 1 B=5 1 B=5 61 B=5 @ 1 OK
  • 62. Simple Replication Client 1 A 7 B 2 Client 2 A 4 B 2 Sequencer Server 1 Server 2 Server 3 0 A=4 Next: 2 0 A=4 0 A=4 1 B=5 1 B=5 1 B=5 62 B=5 @ 1 OK
  • 63. Simple Replication Client 1 A 7 B 2 Client 2 A 4 B 5 Sequencer Server 1 Server 2 Server 3 0 A=4 Next: 2 0 A=4 0 A=4 1 B=5 1 B=5 1 B=5 63
  • 64. Fast Read Client 1 A 7 B 2 Client 2 A 4 B 5 Sequencer Server 1 Server 2 Server 3 0 A=4 Next: 2 0 A=4 0 A=4 1 B=5 1 B=5 1 B=5 Check? 1 B? 64
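The write and read paths shown in the preceding slides can be sketched as follows: a client obtains a log position from the sequencer, writes the command to each server in chain order, and later reads from the tail. This is a simplified single-process illustration under stated assumptions, not Tango's actual API; the class and function names are hypothetical.

```python
class Sequencer:
    """Hands out the next free log position (soft state)."""

    def __init__(self):
        self.next = 0

    def next_token(self) -> int:
        pos = self.next
        self.next += 1
        return pos

    def last_issued(self) -> int:
        return self.next - 1


class Server:
    """Write-once log storage, keyed by position."""

    def __init__(self):
        self.log = {}

    def write(self, pos: int, command: str) -> bool:
        if pos in self.log:
            return False  # positions are write-once
        self.log[pos] = command
        return True


def append(sequencer, chain, command):
    """Client-driven append: get a token, then write head to tail."""
    pos = sequencer.next_token()
    for server in chain:
        if not server.write(pos, command):
            raise RuntimeError(f"position {pos} already written")
    return pos


def read(chain, pos):
    """Read from the tail, which only holds fully replicated entries."""
    return chain[-1].log.get(pos)


seq = Sequencer()
chain = [Server(), Server(), Server()]
p0 = append(seq, chain, "A=4")
p1 = append(seq, chain, "B=5")
latest = read(chain, seq.last_issued())
```

The sequencer only orders clients; the data itself flows from the client through the chain, which is why no single server handles every operation.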
  • 65. Handling Failures Sequencer is soft-state which can be reconstructed by querying head server. Clients failing between receiving token and first read leaves gaps in the log. Clients can mark these as empty and space (though not address) is reused. Clients may fail before completing a write. The next client can fill-in and complete the operation. Server failures are detected by clients, who initiate a membership change, using term numbers. 65
  • 66. Evaluation Tango is scalable; the leader is no longer the bottleneck. Dynamic membership and sharding come for free with the design. High latency of chain replication. 66
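Two of these recovery steps can be sketched with plain dictionaries standing in for each server's log. The sketch assumes the chain is ordered head first; `recover_next`, `fill_gap` and the `EMPTY` marker are hypothetical names, and membership changes are not modelled.

```python
EMPTY = "<empty>"  # hypothetical marker for a position declared empty


def recover_next(head_log: dict) -> int:
    """Rebuild the sequencer's counter from the head server's log."""
    return max(head_log, default=-1) + 1


def fill_gap(chain: list, pos: int) -> str:
    """Fill a position left behind by a failed client.

    If the head already holds a value, the write was started, so it is
    completed down the chain. Otherwise the position is marked empty.
    """
    value = chain[0].get(pos, EMPTY)
    for log in chain:
        log.setdefault(pos, value)
    return value


# Head first: position 1 was never written, position 2 only reached the head.
chain = [
    {0: "A=4", 2: "C=1"},
    {0: "A=4"},
    {0: "A=4"},
]
next_token = recover_next(chain[0])
gap = fill_gap(chain, 1)
partial = fill_gap(chain, 2)
```

Both steps are driven by clients rather than by a leader, which matches the protocol's approach of pushing work onto clients.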
  • 68. wait… we’re not finished yet! 68
  • 69. Requirements • Scalability - High throughput processing of operations. • Latency - Low latency commit of operation as perceived by the client. • Fault-tolerance - Availability in the face of machine and network failures. • Linearizable semantics - Operate as if a single server system. 69
  • 70. Many more examples • Raft [ATC’14] - Good starting point, understandable algorithm from SMR + Multi-Paxos variant • VRR [MIT-TR’12] - Raft with round-robin leadership & more distributed load • Tango [SOSP’13] - Scalable algorithm for f+1 nodes, uses CR + Multi-Paxos variant • Zookeeper [ATC’10] - Primary backup replication + atomic broadcast protocol (Zab [DSN’11]) • EPaxos [SOSP’13] - leaderless Paxos variant for WANs 70
  • 71. Can we do even better? • Self-scaling replication - adapting resources to maintain resilience level. • Geo replication - strong consistency across wide area links • Auto configuration - adapting timeouts and configuration as the network changes • Integrated with unikernels, virtualisation, containers and other such deployment tech 71
  • 72. Evaluation is hard • few common evaluation metrics. • often only one experiment setup is used. • different workloads • evaluation to demonstrate protocol strength 72
  • 73. Lessons Learned • Reaching consensus in distributed systems is doable • Exploit domain knowledge • Raft is a good starting point but we can do much better! Questions? 73