3. Introduction
SRP: The Totem Single-Ring Ordering and Membership
Protocol
Supports high-performance, fault-tolerant distributed systems that
continue to operate despite network partitioning and remerging, and
despite processors failing and restarting.
Provides totally ordered message delivery with low overhead, high
throughput and low latency using a logical token-passing ring.
Provides rapid detection of network partitioning and processor failure
together with reconfiguration and membership services
RRP: The Totem Redundant Ring Protocol
Based on SRP (you can think of this layer as a modified recv/send of SRP)
Makes communication more reliable, even if one of a node's network
interfaces goes offline, by configuring an extra network interface
4. Introduction
Processor:
A corosync node that is a member of the CPG (Closed Process Group)
Application:
Programs that use corosync to communicate, for example pacemaker,
dlm, sheepdog
7. Reliable Ordered Delivery Services
Reliable Delivery for Configuration C
each message has unique identifier
if processor p delivers message m, p delivers m only once. If p delivers
two different messages, then p delivers one of those messages strictly
before it delivers the other.
if p originates message m, then p will deliver m or will fail before
delivering a Configuration Change message to install a new regular
configuration
if p is a member of regular configuration C, and no configuration
change occurs, then p will deliver in C all the messages originated in C
if p delivers message m originated in C, then p is a member of C
if p and q are both members of configurations C1 and C2 then p and q
deliver the same set of messages in C1 before delivering a
Configuration Change message that terminates C1 and starts C2.
8. Reliable Ordered Delivery Services
Delivery in Causal Order
delivery order should respect Lamport causality within a configuration
Delivery in Agreed Order
guarantees that processors deliver messages in a consistent total order.
When a processor delivers a message, it has delivered all preceding
messages in the same total order
Delivery in Safe Order
When a processor delivers a message, it has determined that every
processor in the current configuration has received the message and
will deliver that message unless that processor fails.
10. Totem Single Ring Protocol
The Totem Ordering Protocol (OP)
The Membership Protocol (MP)
The Recovery Protocol (RP)
11. The Totem Ordering Protocol
Under Operational state
Ensures that messages are delivered to the Application in Agreed Order or
Safe Order
The Application can request either Agreed Order or Safe Order
Processors use the token to deliver messages one by one in a total order
12. An Example
[Diagram: ring of P1, P2, P3, P4; application A1 hands M3M2M1 to P1]
A1 asks P1 to deliver three messages: M1, M2, M3 (held in P1's request queue)
Suppose P1 holds the token; it will transmit M1, M2, M3
P1 also saves these messages in its own receive queue
13. An Example
[Diagram: ring of P1, P2, P3, P4; A1 is attached to P1]
P2 received only M2M1, while P3 and P4 received M3M2M1
P1 transmits the token to P2; in the token, seq indicates that the max seq
in P1's receive queue is 3
P2 compares seq with its own messages and finds that M3 was lost.
Recv (P1): M3M2M1
Recv (P2): M2M1
Recv (P3): M3M2M1
Recv (P4): M3M2M1
Token
seq:3
aru:3
aru_id:P1
rtr:
14. An Example
[Diagram: ring of P1, P2, P3, P4; A1 is attached to P1; P3 rebroadcasts M3]
P2 updates aru (all-received-up-to) to 2 in the token and sets rtr to 3,
the seq of the missing message
P2 then transmits the token to P3
On receiving the token, P3 broadcasts M3 to the cluster
After clearing rtr, P3 transmits the token to P4
Recv (P1): M3M2M1
Recv (P2): M2M1
Recv (P3): M3M2M1
Recv (P4): M3M2M1
Token
seq:3
aru:2
aru_id:P2
rtr:3
15. An Example
[Diagram: ring of P1, P2, P3, P4; A1 is attached to P1]
P2 receives the M3 broadcast by P3; the other processors ignore M3
P4 gets the token transmitted by P3, has nothing to do, and transmits the
token to P1
Recv (P1, P2, P3, P4): M3M2M1
Token
seq:3
aru:2
aru_id:P2
rtr:
16. An Example
[Diagram: ring of P1, P2, P3, P4; A1 is attached to P1]
P1 receives the token transmitted by P4, has nothing to do, and transmits
it to P2
Recv (P1, P2, P3, P4): M3M2M1
Token
seq:3
aru:2
aru_id:P2
rtr:
17. An Example
[Diagram: ring of P1, P2, P3, P4; A1 is attached to P1]
P1 received the token transmitted by P4, had nothing to do, and transmitted it to P2
P2 finds that the aru_id in the token is itself and that it already has M3
P2 then updates aru to 3 and knows that all nodes have M3M2M1
P2 transmits the token to P3
If P2 now delivers M3M2M1 to the application, the delivery is in Safe
Order
Recv (P1, P2, P3, P4): M3M2M1
Token (as received by P2)
seq:3
aru:2
aru_id:P2
rtr:
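The token walk in this example can be replayed with a small simulation. This is only a sketch of the rules the slides describe, not corosync's implementation: the `Token` class, `my_aru`, `on_token`, and `replay` are hypothetical names, and the rule "lower aru when behind, raise it only if you set it" is a reading of the example rather than the full protocol.

```python
from dataclasses import dataclass, field


@dataclass
class Token:
    seq: int       # highest seq broadcast so far
    aru: int       # all-received-up-to
    aru_id: str    # processor that last lowered aru
    rtr: set = field(default_factory=set)  # seqs requested for retransmission


def my_aru(received):
    """Largest n such that every seq in 1..n is in `received`."""
    n = 0
    while n + 1 in received:
        n += 1
    return n


def on_token(name, received, token):
    """Process one token visit; return the seqs this processor rebroadcasts."""
    # Serve retransmission requests that this processor can satisfy.
    resend = sorted(s for s in token.rtr if s in received)
    token.rtr -= set(resend)
    # Request every message this processor is still missing.
    token.rtr |= {s for s in range(1, token.seq + 1) if s not in received}
    mine = my_aru(received)
    if mine < token.aru:            # behind: lower aru and claim it
        token.aru, token.aru_id = mine, name
    elif token.aru_id == name:      # caught up: raise the aru we lowered
        token.aru = mine
    return resend


def replay():
    """Replay the example: P2 misses M3 and P3 rebroadcasts it."""
    recv = {"P1": {1, 2, 3}, "P2": {1, 2}, "P3": {1, 2, 3}, "P4": {1, 2, 3}}
    token = Token(seq=3, aru=3, aru_id="P1")
    trace = []
    for name in ["P2", "P3", "P4", "P1", "P2"]:
        for s in on_token(name, recv[name], token):
            for queue in recv.values():
                queue.add(s)        # the rebroadcast reaches every processor
        trace.append((name, token.aru, token.aru_id, sorted(token.rtr)))
    return trace
```

Calling `replay()` yields one tuple per token visit, so the aru, aru_id, and rtr values can be compared with the token boxes on the slides.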
18. In Agreed/Safe Order?
Agreed Order
If the processor that holds the token delivers messages to the application
in seq order, then the messages are in Agreed Order.
Safe Order
If the aru in the token is greater than or equal to a message's seq in two
successive transmissions of the token, then that message is in Safe Order.
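The two conditions can be written as small predicates. This is a minimal sketch under assumptions: the function names are hypothetical, and the aru condition is read as "at least the message's seq", which is what the preceding example implies (an aru of 3 covers M3).

```python
def deliverable_agreed(received, delivered_up_to):
    """Seqs that can be delivered in Agreed Order right now.

    Only the contiguous run after `delivered_up_to` qualifies, so a gap
    blocks every later message until the gap is filled.
    """
    out = []
    n = delivered_up_to + 1
    while n in received:
        out.append(n)
        n += 1
    return out


def is_safe(msg_seq, aru_seen):
    """True if the last two observed token aru values both cover msg_seq.

    `aru_seen` lists the aru values this processor saw on successive
    token visits, oldest first.
    """
    return len(aru_seen) >= 2 and all(a >= msg_seq for a in aru_seen[-2:])
```

For example, with seqs 1, 2, and 4 received, only 1 and 2 are deliverable in Agreed Order; message 3 becomes safe once the processor has seen aru at 3 or higher on two successive visits.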
19. The Membership Protocol
Runs in the Gather state and the Commit state
When a new processor joins the cluster or an old processor leaves the
cluster, the protocol forms a new single ring
20. An Example: new node join
[Diagram: ring of P1, P2, P3; P4 is joining]
P4 is a new node that joins the cluster
The old ring is {P1,P2,P3}, and its seq is 100. For nodes in the old ring,
my_proc_set stores the member list
When P4 joins the cluster, it broadcasts a JoinMsg.
Upon receiving the JoinMsg, P1, P2, and P3 enter the Gather state
JoinMsg
sender_id:P4
proc_set: P4
fail_set:
ring_seq:xx
P1, P2, P3: my_proc_set:P1P2P3
P4: my_proc_set:P4
21. An Example: new node join
[Diagram: P1, P2, P3, P4 exchanging JoinMsg]
When P1, P2, and P3 receive the JoinMsg from P4, they merge the proc_set
from the JoinMsg into their own my_proc_set
P1, P2, and P3 then each broadcast a new JoinMsg
Upon receiving a JoinMsg from another node, every node compares the
proc_set in the JoinMsg with its own my_proc_set and marks consensus if they are the same
JoinMsg
sender_id:P1
proc_set: P[1-4]
fail_set:
ring_seq:x
JoinMsg
sender_id:P2
proc_set: P[1-4]
fail_set:
ring_seq:x
JoinMsg
sender_id:P3
proc_set: P[1-4]
fail_set:
ring_seq:x
P1, P2, P3: my_proc_set:P1P2P3P4
P4: my_proc_set:P4
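The merge-and-compare step can be sketched as follows. The function names and the per-sender `consensus` dictionary are illustrative assumptions; resetting the consensus marks whenever my_proc_set grows is also an assumption, made because marks recorded against the older, smaller set no longer mean anything.

```python
def on_join(me, my_proc_set, consensus, sender, proc_set):
    """Handle one JoinMsg; return True if this node must rebroadcast.

    `my_proc_set` (a set) and `consensus` (a dict) are updated in place.
    """
    grew = not (proc_set <= my_proc_set)
    if grew:
        my_proc_set |= proc_set   # merge the sender's view into ours
        consensus.clear()         # old marks referred to the old set
    consensus[me] = True
    if proc_set == my_proc_set:   # sender agrees with our current view
        consensus[sender] = True
    return grew


def reached_consensus(my_proc_set, consensus):
    """True once every member of my_proc_set is marked as agreeing."""
    return all(consensus.get(p, False) for p in my_proc_set)
```

Seen from P1: the JoinMsg from P4 grows the set to four members and forces a rebroadcast, after which matching JoinMsgs from P2, P3, and P4 complete the consensus.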
22. An Example: new node join
[Diagram: P1, P2, P3, P4 in the Gather state]
When a node finds that all members in its my_proc_set have reached consensus, and
the node has the minimum id, it sends the Commit Token and enters the Commit state
CommitToken's ring_id.seq = max(old ring_id.seq, JoinMsg's ring_seq) + 4
Continuing from the previous slide, suppose that after several rounds P1, P3, and P4
have reached consensus
P2 did not receive a message from P3, so in P2's consensus list consensus[P3]=false.
P1: my_proc_set:P1P2P3P4, consensus[All]=true
P2: my_proc_set:P1P2P3P4, consensus[P3]=false, consensus[P1,2,4]=true
P3: my_proc_set:P1P2P3P4, consensus[All]=true
P4: my_proc_set:P1P2P3P4, consensus[All]=true
P1 has the minimum id and transmits the Commit Token, but the token is
discarded by P2; this triggers token loss, and P1 re-sends its JoinMsg
P2's my_proc_set did not reach consensus, so P2 discards the Commit Token;
this triggers a consensus timeout, and P2 re-sends its JoinMsg
Commit Token
ring_id: 104/P1
memb_list:{P1}
memb_idx:P1
memb: {
P1,
old ring_id,
old my_aru,
high_delivered,
received_flg
}
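As a worked instance of the ring_id rule, the sketch below computes the new ring seq and picks the node that sends the Commit Token. The function names are hypothetical, and the JoinMsg ring_seq values in the usage note are made up for illustration, since the slides show them only as placeholders.

```python
def commit_ring_seq(old_ring_seq, join_ring_seqs):
    """New ring seq: the largest seq seen anywhere, plus 4."""
    return max([old_ring_seq, *join_ring_seqs]) + 4


def is_representative(me, my_proc_set):
    """The node with the minimum id sends the Commit Token.

    Assumes ids compare in the intended order; the string ids used here
    ("P1".."P4") happen to sort correctly.
    """
    return me == min(my_proc_set)
```

With an old ring seq of 100 and JoinMsg values no higher than 100, `commit_ring_seq(100, [100, 100, 96])` gives 104, matching the ring_id 104/P1 in the Commit Token.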
23. An Example: new node join
[Diagram: ring of P1, P2, P3, P4]
The normal situation
After several rounds of receiving and sending JoinMsg, every processor's
my_proc_set is marked as consensus.
P2 receives the Commit Token from P1, updates memb_list and memb_idx,
then transmits the Commit Token and enters the Commit state
Commit Token
ring_id: 104/P1
memb_list:{P1,P2}
memb_idx:P2
24. An Example: new node join
[Diagram: ring of P1, P2, P3, P4]
P3 receives the Commit Token from P2, updates memb_list and memb_idx,
then transmits the Commit Token and enters the Commit state
Commit Token
ring_id: 104/P1
memb_list:{P1,P2,P3}
memb_idx:P3
25. An Example: new node join
[Diagram: ring of P1, P2, P3, P4]
P4 receives the Commit Token from P3, updates memb_list and memb_idx,
then transmits the Commit Token and enters the Commit state
Commit Token
ring_id: 104/P1
memb_list:{P1,P2,P3,P4}
memb_idx:P4
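The first rotation of the Commit Token, as described on these slides, can be sketched like this. The `CommitToken` class and function names are hypothetical, and the per-member details (old ring_id, old my_aru, high_delivered, received_flg) are left out to keep the sketch short.

```python
from dataclasses import dataclass, field


@dataclass
class CommitToken:
    ring_id: str
    memb_list: list = field(default_factory=list)
    memb_idx: str = ""


def create(representative, ring_id):
    """The representative starts the token with itself as the only member."""
    return CommitToken(ring_id, [representative], representative)


def on_first_visit(token, me):
    """Add this node to the token and return the state the node enters."""
    token.memb_list.append(me)
    token.memb_idx = me
    return "Commit"


def first_rotation():
    """P1 creates the token; P2, P3, and P4 each add themselves in turn."""
    token = create("P1", "104/P1")
    states = {"P1": "Commit"}
    for name in ["P2", "P3", "P4"]:
        states[name] = on_first_visit(token, name)
    return token, states
```

After `first_rotation()`, the token lists all four members with memb_idx at P4, and every node is in the Commit state.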
26. An Example: new node join
[Diagram: ring of P1, P2, P3, P4]
P1 receives the Commit Token from P4; because P1 is already in the Commit state,
P1 knows that all members are in the Commit state
P1 transmits the Commit Token again, enters the Recovery state,
and sets its ring_id (my_ring_id = CommitToken's ring_id)
P1
state: Recovery
my_ring_id: 104/P1
my_new_memb: {P1,P2,P3,P4}
my_trans_memb: {P1,P2,P3}
…
P2, P3, P4 (one identical box each)
state: Commit
my_ring_id: 100/P1
my_new_memb: {}
my_trans_memb: {}
…
Commit Token
ring_id: 104/P1
memb_list:{P1,P2,P3,P4}
memb_idx:P1
27. An Example: new node join
P2 transmits the Commit Token again, enters the Recovery state,
and sets its ring_id (my_ring_id = CommitToken's ring_id)
[Diagram: ring of P1, P2, P3, P4]
P1, P2 (one identical box each)
state: Recovery
my_ring_id: 104/P1
my_new_memb: {P1,P2,P3,P4}
my_trans_memb: {P1,P2,P3}
…
P3, P4 (one identical box each)
state: Commit
my_ring_id: 100/P1
my_new_memb: {}
my_trans_memb: {}
…
Commit Token
ring_id: 104/P1
memb_list:{P1,P2,P3,P4}
memb_idx:P2
28. An Example: new node join
P3 transmits the Commit Token again, enters the Recovery state,
and sets its ring_id (my_ring_id = CommitToken's ring_id)
[Diagram: ring of P1, P2, P3, P4]
P1, P2, P3 (one identical box each)
state: Recovery
my_ring_id: 104/P1
my_new_memb: {P1,P2,P3,P4}
my_trans_memb: {P1,P2,P3}
…
P4
state: Commit
my_ring_id: 100/P1
my_new_memb: {}
my_trans_memb: {}
…
Commit Token
ring_id: 104/P1
memb_list:{P1,P2,P3,P4}
memb_idx:P3
29. An Example: new node join
P4 transmits the Commit Token again, enters the Recovery state,
and sets its ring_id (my_ring_id = CommitToken's ring_id)
Because P4 is a new member, its my_trans_memb contains only itself
When P1 receives the Commit Token for the third time, every node has reached
the Recovery state
[Diagram: ring of P1, P2, P3, P4]
P4
state: Recovery
my_ring_id: 104/P1
my_new_memb: {P1,P2,P3,P4}
my_trans_memb: {P4}
…
P1, P2, P3 (one identical box each)
state: Recovery
my_ring_id: 104/P1
my_new_memb: {P1,P2,P3,P4}
my_trans_memb: {P1,P2,P3}
…
Commit Token
ring_id: 104/P1
memb_list:{P1,P2,P3,P4}
memb_idx:P4
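The per-node state shown in these boxes can be derived from the Commit Token on its second visit. The sketch below assumes that my_trans_memb is the intersection of the node's old ring membership and the new member list, which matches the values on the slides ({P1,P2,P3} for the old members, {P4} for the newcomer); `enter_recovery` is a hypothetical name.

```python
def enter_recovery(old_ring_members, token_ring_id, token_memb_list):
    """Return the state a node holds after the second Commit Token visit."""
    my_new_memb = set(token_memb_list)
    return {
        "state": "Recovery",
        "my_ring_id": token_ring_id,          # adopt the token's ring_id
        "my_new_memb": my_new_memb,
        # Members of this node's old ring that are also in the new ring.
        "my_trans_memb": set(old_ring_members) & my_new_memb,
    }
```

For P1, whose old ring was {P1,P2,P3}, this gives my_trans_memb = {P1,P2,P3}; for P4, whose old ring held only itself, it gives {P4}.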
30. The Recovery Protocol
Runs in the Recovery state
Handles the transition from the Old Ring to the New Ring: recovers messages
from the Old Ring so that they are delivered in Agreed Order or Safe Order
In the Recovery state, messages that the Application submits for the New Ring
cannot be broadcast (broadcasting requires the Operational state)
31. The Recovery Protocol
Step1:
Exchange messages with the other processors that come from the same
Old Ring (similar to the Operational state)
Note: one New Ring may contain members of multiple Old Rings
Step2:
Deliver to the Application, in the Old configuration, the messages that are
in Agreed/Safe Order (message.seq <= high_ring_delivered)
32. The Recovery Protocol
Step3:
Deliver the 1st ConfigChange Msg (Transitional Configuration) to the
Application
The 1st ConfigChange Msg contains the members of the Old Ring that
belong to the New Ring.
Step4:
Deliver to the Application the messages (in the Transitional Configuration)
that are in Agreed/Safe Order
33. The Recovery Protocol
Step5:
Deliver the 2nd ConfigChange Msg (New Configuration) to the
Application
The 2nd ConfigChange Msg contains the member list of the New Ring
Step6:
Enter the Operational state from the Recovery state
Steps 2 to 6 do not require exchanging messages with other
processors; they are performed as one atomic operation
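Steps 2 to 6 can be sketched as one local function that produces the sequence of events handed to the Application. This is an illustration only: `recovery_events` and the event tuples are hypothetical, and the sketch assumes that Step 1 has already filled every gap, so the recovered seqs are complete and not yet delivered.

```python
def recovery_events(old_seqs, high_ring_delivered, my_trans_memb, my_new_memb):
    """Return, in order, what the Application sees during Steps 2 to 6."""
    seqs = sorted(old_seqs)
    events = []
    # Step 2: messages delivered in the old configuration.
    events += [("deliver", "old", s) for s in seqs if s <= high_ring_delivered]
    # Step 3: first configuration change (transitional configuration).
    events.append(("config_change", "transitional", sorted(my_trans_memb)))
    # Step 4: remaining messages, delivered in the transitional configuration.
    events += [("deliver", "transitional", s)
               for s in seqs if s > high_ring_delivered]
    # Step 5: second configuration change (new regular configuration).
    events.append(("config_change", "regular", sorted(my_new_memb)))
    # Step 6: the processor becomes Operational on the new ring.
    events.append(("state", "Operational"))
    return events
```

Because the function needs no input from other processors, it reflects the point that these steps run locally as a single atomic operation.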
34. Totem Redundant Ring Protocol
Based on SRP (you can think of this layer as a modified recv/send of SRP)
Makes communication more reliable, even if one of a node's network
interfaces goes offline, by configuring an extra network interface
35. Totem Redundant Ring Protocol
Active replication
All messages are transmitted over all N channels
Every message is received N times
The more channels (larger N), the higher the bandwidth cost for a Processor
Passive replication
Every message is transmitted over 1 of the N channels
Every message is received once
The bandwidth is the same as Single-Ring for a Processor
Active-passive replication
A mixture of active and passive: all messages are transmitted over K
channels (1<K<N)
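The three replication styles differ only in how many channels carry each message. The sketch below picks channel indexes for a message; `channels_for` is a hypothetical helper, and the round-robin rotation is an assumption made for illustration, since the slides say only how many channels are used, not which ones.

```python
def channels_for(style, msg_index, n, k=None):
    """Return the channel indexes (0..n-1) used to send message msg_index."""
    if style == "active":
        return list(range(n))                  # every channel
    if style == "passive":
        return [msg_index % n]                 # one channel, rotating
    if style == "active-passive":
        if k is None or not 1 < k < n:
            raise ValueError("active-passive needs 1 < k < n")
        return [(msg_index + i) % n for i in range(k)]  # k channels, rotating
    raise ValueError(f"unknown replication style: {style}")
```

With three channels, active replication sends message 5 on channels 0, 1, and 2, passive replication sends it on channel 2 only, and active-passive with K = 2 sends it on channels 2 and 0.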