Teaching material 
based on Distributed 
Systems: Concepts and 
Design, Edition 3, 
Addison-Wesley 2001. 
Distributed Systems Course 
Replication 
Copyright © George 
Coulouris, Jean Dollimore, 
Tim Kindberg 2001 
email: authors@cdk2.net 
This material is made 
available for private study 
and for direct use by 
individual teachers. 
It may not be included in any 
product or employed in any 
service without the written 
permission of the authors. 
Viewing: These slides 
must be viewed in 
slide show mode. 
14.1 Introduction to replication 
14.2 System model and group 
communication 
14.3 Fault-tolerant services 
14.4 Highly available services 
14.4.1 Gossip architecture 
14.5 Transactions with replicated data
Replication of data: the maintenance of 
copies of data at multiple computers 
Introduction to replication 
 replication can provide the following 
 performance enhancement 
– e.g. several web servers can have the same DNS name and the servers are selected in turn, to share the load
– replication of read-only data is simple, but replication of changing data has 
overheads 
e.g. a user on a train with a laptop with no access to a 
network will prepare by copying data to the laptop, e.g. a 
shared diary. If they update the diary they risk missing 
updates by other people. 
2 
 fault-tolerant service 
– guarantees correct behaviour in spite of certain faults (can include timeliness) 
– if f of f+1 servers crash then 1 remains to supply the service 
– if f of 2f+1 servers have byzantine faults then they can supply a correct service 
 availability is hindered by 
– server failures 
 replicate data at failure-independent servers and when one fails, a client may use another. Note that caches do not help with availability (they are incomplete).
– network partitions and disconnected operation 
 Users of mobile computers deliberately disconnect, and then on re-connection, 
resolve conflicts •
Requirements for replicated data 
3 
 Replication transparency 
– clients see logical objects (not several physical copies) 
 they access one logical item and receive a single result 
 Consistency 
– specified to suit the application, 
 e.g. when a user of a diary disconnects, their local copy may be 
inconsistent with the others and will need to be reconciled when they 
connect again. But connected clients using different copies should get 
consistent results. These issues are addressed in Bayou and Coda. 
• 
What is replication transparency?
14.2.1 System model 
 each logical object is implemented by a collection of physical 
copies called replicas 
– the replicas are not necessarily consistent all the time (some may 
have received updates, not yet conveyed to the others) 
 we assume an asynchronous system where processes fail 
only by crashing and generally assume no network partitions 
 replica managers 
– an RM contains replicas on a computer and accesses them directly 
– RMs apply operations to replicas recoverably 
 i.e. they do not leave inconsistent results if they crash 
– objects are copied at all RMs unless we state otherwise 
– static systems are based on a fixed set of RMs 
– in a dynamic system: RMs may join or leave (e.g. when they crash) 
– an RM can be a state machine, which has the following properties: 
4 
• 
State machine 
applies operations atomically 
its state is a deterministic function of its initial state and the operations applied 
all replicas start identical and carry out the same operations 
Its operations must not be affected by clock readings etc.
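A minimal sketch of the state-machine idea (not from the slides; the account operations and the Python class are assumed purely for illustration):

# A deterministic state machine: its state depends only on the initial state
# and the sequence of operations applied, never on clocks or randomness.
class AccountStateMachine:
    def __init__(self):
        self.balances = {}          # initial state, identical at every replica

    def apply(self, op):
        """Apply one operation atomically and return its result."""
        kind, name, *args = op
        if kind == "setBalance":
            self.balances[name] = args[0]
            return None
        if kind == "getBalance":
            return self.balances.get(name, 0)
        raise ValueError("unknown operation")

# Two replicas given the same operations in the same order end up identical.
ops = [("setBalance", "x", 1), ("setBalance", "y", 2), ("getBalance", "x")]
rm1, rm2 = AccountStateMachine(), AccountStateMachine()
for op in ops:
    rm1.apply(op)
    rm2.apply(op)
assert rm1.balances == rm2.balances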
A basic architectural model for the management of 
replicated data 
A collection of RMs provides a service to clients 
Figure 14.1: clients (C) send requests and receive replies via front ends (FE), which communicate with the service's replica managers (RM).
Clients see a service that gives them access to logical 
objects, which are in fact replicated at the RMs 
Clients request operations: those without updates are called read-only requests; the others are called update requests (they may include reads). 
Client requests are handled by front ends. A front end makes replication transparent. What can the FE hide from a client?
Five phases in performing a request 
FIFO ordering: if a FE issues r then r', then any correct RM that handles r' handles r before r'. 
Causal ordering: if r → r', then any correct RM that handles r' handles r before r'. 
Total ordering: if a correct RM handles r before r', then any correct RM that handles r' handles r before r'. 
Bayou sometimes executes responses tentatively so as to be able to reorder them 
RMs agree - I.e. reach a consensus as to effect of the request. In Gossip, all RMs 
eventually receive updates. 
6 
 issue request 
– the FE either 
 sends the request to a single RM that passes it on to the others 
 or multicasts the request to all of the RMs (in state machine approach) 
 coordination 
– the RMs decide whether to apply the request; and decide on its ordering 
relative to other requests (according to FIFO, causal or total ordering) 
 execution 
– the RMs execute the request (sometimes tentatively) 
 agreement 
– RMs agree on the effect of the request, .e.g perform 'lazily' or immediately 
 response 
– one or more RMs reply to FE. e.g. 
 for high availability give first response to client. 
 to tolerate byzantine faults, take a vote 
•
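An illustrative front-end sketch of these phases (the rm.handle() interface and the loop standing in for a multicast are assumptions; ordering and agreement between RMs are abstracted away):

def perform_request(request, rms):
    # 1. issue request: multicast to all RMs (alternatively, send to a single
    #    RM that passes the request on to the others).
    replies = []
    for rm in rms:
        # 2. coordination: each RM decides whether and in what order to apply
        #    the request (FIFO, causal or total ordering).
        # 3. execution: the RM executes the request (perhaps tentatively).
        # 4. agreement: RMs agree on the effect, eagerly or lazily (not shown).
        replies.append(rm.handle(request))
    # 5. response: for high availability, give the first reply to the client.
    return replies[0]

class EchoRM:                                   # trivial stand-in replica manager
    def handle(self, request):
        return ("ok", request)

print(perform_request("getBalance(x)", [EchoRM(), EchoRM()]))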
We require a membership service to allow dynamic membership of groups. 
Section 11.4 discussed multicast communication (also known as group communication); there we took group membership to be static (although members may crash). 
14.2.2 Group communication 
 process groups are useful for managing replicated data 
– but replication systems need to be able to add/remove RMs 
 group membership service provides: 
– interface for adding/removing members 
 create, destroy process groups, add/remove members. A process can generally 
belong to several groups. 
– implements a failure detector (section 11.1 - not studied in this course) 
 which monitors members for failures (crashes/communication), 
 and excludes them when unreachable 
– notifies members of changes in membership 
– expands group addresses 
 multicasts addressed to group identifiers, 
 coordinates delivery when membership is changing 
 e.g. IP multicast allows members to join/leave and performs 
address expansion, but not the other features 
•
Services provided for process groups 
Figure 14.2: a process group served by group membership management (join, leave, fail) and by multicast communication with group address expansion. 
– a process outside the group can send to the group without knowing the membership 
– the membership service provides leave and join operations 
– the failure detector notes failures and evicts failed processes from the group 
– the group address is expanded 
– members are informed when processes join/leave
We will leave out the details of view delivery and view 
synchronous group communication 
 A full membership service maintains group views, which are 
lists of group members, ordered e.g. as members join group. 
 A new group view is generated each time a process joins or 
leaves the group. 
 View delivery p 561. The idea is that processes can 'deliver 
views' (like delivering multicast messages). 
– ideally we would like all processes to get the same information in the same 
order relative to the messages. 
 view synchronous group communication (p562) with 
reliability. 
– Illustrated in Fig 14.3 
– all processes agree on the ordering of messages and membership changes, 
– a joining process can safely get state from another member. 
– or if one crashes, another will know which operations it had already performed 
– This work was done in the ISIS system (Birman) 
9 
•
Figure 14.3 
View-synchronous group communication 
Figure 14.3: four executions of processes p, q and r in which p crashes and the view changes from (p, q, r) to (q, r); cases a and b are allowed, cases c and d are disallowed.
11 
14.3 Fault-tolerant services 
 provision of a service that is correct even if f 
processes fail 
– by replicating data and functionality at RMs 
– assume communication reliable and no partitions 
– RMs are assumed to behave according to specification or to crash 
– intuitively, a service is correct if it responds despite failures and clients 
can’t tell the difference between replicated data and a single copy 
– but care is needed to ensure that a set of replicas produce the same 
result as a single one would. 
– e.g (next slide). 
•
Example of a naive replication system 
RMs at A and B maintain copies of x and y 
12 
Client 1: setBalanceB(x,1); setBalanceA(y,2) 
Client 2: getBalanceA(y) → 2; getBalanceA(x) → 0 
 initial balance of x and y is $0 
clients use local RM when available, otherwise 
the other one 
RMs propagate updates to one another after 
replying to client 
– client 1 updates X at B (local) then finds B has failed, so uses A 
– client 2 reads balances at A (local) 
 as client 1 updates y after x, client 2 should see $1 for x 
– not the behaviour that would occur if A and B were implemented at a single 
server 
 Systems can be constructed to replicate objects without 
producing this anomalous behaviour. 
 We now discuss what counts as correct behaviour in a 
replication system. 
•
Linearizability (p566) – the strictest criterion for a replication system 
 The correctness criteria for replicated objects are defined by referring to a virtual interleaving of the clients' operations which would be correct for a single copy of the objects: 
– for any set of client operations there is a virtual interleaving (which would be correct for a set of single objects) 
– each client sees a view of the objects that is consistent with this, that is, the results of the client's operations make sense within the interleaving 
 the bank example did not make sense: if the second update is observed, the first update should be observed too 
 Consider a replicated object service with two clients that perform read and update operations. Each client waits for one operation to complete before doing another. Client 1's operations o10, o11, o12 and client 2's operations o20, o21, o22 are interleaved in some order at a single server, e.g. o20, o21, o10, o22, o11, o12 (client 1 does o10 first, and so on). 
 a replicated object service is linearizable if for any execution there is some interleaving of the clients' operations such that: 
– the interleaved sequence of operations meets the specification of a (single) correct copy of the objects 
– the order of operations in the interleaving is consistent with the real times at which they occurred 
Notes: linearizability is not intended to be used with transactional replication systems. The real-time requirement means that clients should receive up-to-date information, but this may not be practical owing to the difficulty of synchronizing clocks; a weaker criterion is sequential consistency.
Sequential consistency (p567) 
it is not linearizable because client 2's getBalance is after client 1's setBalance in real time. 
 a replicated shared object service is sequentially consistent if for any execution 
there is some interleaving of clients’ operations such that: 
– the interleaved sequence of operations meets the specification of a (single) 
correct copy of the objects 
– the order of operations in the interleaving is consistent with the program order 
in which each client executed them 
the following is sequentially consistent but not linearizable 
14 
Client 1: setBalanceB(x,1); setBalanceA(y,2) 
Client 2: getBalanceA(y) → 0; getBalanceA(x) → 0 
this is possible under a naive replication strategy, even if neither A nor B fails – 
the update at B has not yet been propagated to A when client 2 reads it 
• 
but the following interleaving satisfies both criteria for sequential consistency: 
getBalanceA(y) → 0; getBalanceA(x) → 0; setBalanceB(x,1); setBalanceA(y,2)
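A small check of this interleaving against a single-copy specification (an illustrative sketch; the tuple encoding of operations is assumed): replay it against one set of balances and compare each read with the value the client actually observed.

def meets_single_copy_spec(interleaving):
    balances = {}                               # single correct copy, initially all 0
    for op, name, observed in interleaving:
        if op == "set":
            balances[name] = observed
        elif balances.get(name, 0) != observed: # a read must return the single-copy value
            return False
    return True

# The interleaving keeps each client's program order: client 2's two reads,
# then client 1's two writes.
interleaving = [
    ("get", "y", 0),    # getBalanceA(y) -> 0
    ("get", "x", 0),    # getBalanceA(x) -> 0
    ("set", "x", 1),    # setBalanceB(x, 1)
    ("set", "y", 2),    # setBalanceA(y, 2)
]
assert meets_single_copy_spec(interleaving)     # hence sequentially consistent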
The FE has to find the primary, e.g. after it crashes and another takes over 
The passive (primary-backup) model for fault tolerance 
 There is at any time a single primary RM and one or more secondary (backup, slave) RMs 
 FEs communicate with the primary, which executes the operations and sends copies of the updated data and the responses to the backups 
 if the primary fails, one of the backups is promoted to act as the primary 
Figure 14.4: clients (C) send requests through FEs to the primary RM, which propagates updates to the backup RMs.
Passive (primary-backup) replication. Five phases. 
 The five phases in performing a client request are as follows: 
 1. Request: 
– a FE issues the request, containing a unique identifier, to the primary RM 
16 
 2. Coordination: 
– the primary performs each request atomically, in the order in which it receives 
it relative to other requests 
– it checks the unique id; if it has already done the request it re-sends the 
response. 
 3. Execution: 
– The primary executes the request and stores the response. 
 4. Agreement: 
– If the request is an update the primary sends the updated state, the response 
and the unique identifier to all the backups. The backups send an 
acknowledgement. 
 5. Response: 
– The primary responds to the FE, which hands the response back to the client. 
•
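A minimal primary-backup sketch of these phases (illustrative assumptions: in-process backups, a dictionary as the replicated state and synchronous acknowledgements):

class BackupRM:
    def __init__(self):
        self.state, self.responses = {}, {}

    def update(self, state, response, request_id):      # 4. agreement
        self.state = dict(state)
        self.responses[request_id] = response
        return "ack"

class PrimaryRM:
    def __init__(self, backups):
        self.state, self.responses, self.backups = {}, {}, backups

    def handle(self, request_id, op, name, value=None):
        # 2. coordination: requests are handled one at a time in arrival order;
        #    a repeated request id just gets the stored response re-sent.
        if request_id in self.responses:
            return self.responses[request_id]
        # 3. execution
        if op == "set":
            self.state[name] = value
            response = "ok"
        else:
            response = self.state.get(name, 0)
        self.responses[request_id] = response
        # 4. agreement: send updated state, response and id to all backups.
        for b in self.backups:
            assert b.update(self.state, response, request_id) == "ack"
        # 5. response: returned to the FE, which hands it back to the client.
        return response

backups = [BackupRM(), BackupRM()]
primary = PrimaryRM(backups)
assert primary.handle("r1", "set", "x", 1) == "ok"
assert primary.handle("r1", "set", "x", 1) == "ok"       # duplicate id: re-sent, not re-done
assert backups[0].state == {"x": 1}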
Passive (primary-backup) replication (discussion) 
 This system implements linearizability, since the primary 
sequences all the operations on the shared objects 
 If the primary fails, the system remains linearizable if a single backup takes over exactly where the primary left off, i.e.: 
– the primary is replaced by a unique backup 
– surviving RMs agree which operations had been performed at take over 
 view-synchronous group communication can achieve this 
– when surviving backups receive a view without the primary, they use an 
agreed function to calculate which is the new primary. 
– The new primary registers with name service 
– view synchrony also allows the processes to agree which operations were 
performed before the primary failed. 
– e.g. when a FE does not get a response, it retransmits the request to the new primary 
– the new primary continues from phase 2 (coordination) – it uses the unique identifier to discover whether the request has already been performed 
17 
•
Discussion of passive replication 
 To survive f process crashes, f+1 RMs are required 
– it cannot deal with byzantine failures because the client can't get replies from 
the backup RMs 
 To design passive replication that is linearizable 
– View synchronous communication has relatively large overheads 
– Several rounds of messages per multicast 
– After failure of primary, there is latency due to delivery of group view 
 variant in which clients can read from backups 
– which reduces the work for the primary 
– get sequential consistency but not linearizability 
 Sun NIS uses passive replication with weaker guarantees 
– Weaker than sequential consistency, but adequate to the type of data stored 
– achieves high availability and good performance 
– the master receives updates and propagates them to slaves using one-to-one communication. Clients can use either the master or a slave 
– updates are not done via RMs - they are made on the files at the master • 
18
What sort of system do we need to perform totally ordered reliable multicast? 
14.3.2. Active replication for fault tolerance 
 the RMs are state machines all playing the same role and organised as a 
group. 
– all start in the same state and perform the same operations in the same order so that their state remains identical 
 If an RM crashes it has no effect on performance of the service because 
the others continue as normal 
 It can tolerate byzantine failures because the FE can collect and compare 
the replies it receives 
Figure 14.5: FEs multicast each request to the group of RMs; the RMs process each request identically and reply. 
Active replication requires totally ordered reliable multicast so that all RMs perform the same operations in the same order.
Active replication - five phases in performing a client 
request 
20 
 Request 
– FE attaches a unique id and uses totally ordered reliable multicast to send 
request to RMs. FE can at worst, crash. It does not issue requests in parallel 
 Coordination 
– the multicast delivers requests to all the RMs in the same (total) order. 
 Execution 
– every RM executes the request. They are state machines and receive 
requests in the same order, so the effects are identical. The id is put in the 
response 
 Agreement 
– no agreement is required because all RMs execute the same operations in 
the same order, due to the properties of the totally ordered multicast. 
 Response 
– FEs collect responses from RMs. FE may just use one or more responses. If it 
is only trying to tolerate crash failures, it gives the client the first response. 
•
Active replication - discussion 
 As RMs are state machines we have sequential consistency 
– due to reliable totally ordered multicast, the RMs collectively do the same as a 
single copy would do 
– it works in a synchronous system 
– in an asynchronous system reliable totally ordered multicast is impossible – but 
failure detectors can be used to work around this problem. How to do that is 
beyond the scope of this course. 
 this replication scheme is not linearizable 
– because total order is not necessarily the same as real-time order 
21 
 To deal with byzantine failures 
– For up to f byzantine failures, use 2f+1 RMs 
– FE collects f+1 identical responses 
 To improve performance, 
– FEs send read-only requests to just one RM 
•
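Illustrative helpers for the replica counts and the FE's vote (the integer responses are assumed purely for the example):

def rms_needed(f, byzantine=False):
    """RMs required to tolerate f failures."""
    return 2 * f + 1 if byzantine else f + 1

def voted_response(responses, f):
    """Return a response that at least f+1 RMs agree on, if there is one."""
    for r in responses:
        if responses.count(r) >= f + 1:
            return r
    return None

assert rms_needed(1) == 2 and rms_needed(1, byzantine=True) == 3
assert voted_response([42, 42, 7], f=1) == 42   # 2 of 3 agree despite one faulty RM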
Summary for Sections 14.1-14.3 
 Replicating objects helps services to provide good 
performance, high availability and fault tolerance. 
 system model - each logical object is implemented by a set of 
physical replicas 
 linearizability and sequential consistency can be used as 
correctness criteria 
– sequential consistency is less strict and more practical to use 
 fault tolerance can be provided by: 
– passive replication - using a primary RM and backups, 
 but to achieve linearizability when the primary crashes, view-synchronous 
communication is used, which is expensive. Less strict variants can be useful. 
– active replication - in which all RMs process all requests identically 
 needs totally ordered and reliable multicast, which can be achieved in a 
synchronous system 
22 
•
23 
Highly available services 
 we discuss the application of replication techniques to make 
services highly available. 
– we aim to give clients access to the service with: 
 reasonable response times for as much of the time as possible 
 even if some results do not conform to sequential consistency 
 e.g. a disconnected user may accept temporarily inconsistent results if they can 
continue to work and fix inconsistencies later 
 eager versus lazy updates 
– fault-tolerant systems send updates to RMs in an ‘eager’ fashion (as soon as 
possible) and reach agreement before replying to the client 
– for high availability, clients should: 
 only need to contact a minimum number of RMs and 
 be tied up for a minimum time while RMs coordinate their actions 
– weaker consistency generally requires less agreement and makes data more 
available. Updates are propagated 'lazily'. 
•
14.4.1 The gossip architecture 
 the gossip architecture is a framework for implementing highly available 
services 
– data is replicated close to the location of clients 
– RMs periodically exchange ‘gossip’ messages containing updates 
 gossip service provides two types of operations 
– queries - read only operations 
– updates - modify (but do not read) the state 
 FE sends queries and updates to any chosen RM 
– one that is available and gives reasonable response times 
 Two guarantees (even if RMs are temporarily unable to communicate): 
– each client gets a consistent service over time (i.e. the data reflects the updates seen by the client, even if they use different RMs). Vector timestamps are used – with one entry per RM. 
– relaxed consistency between replicas. All RMs eventually receive all updates. RMs use 
ordering guarantees to suit the needs of the application (generally causal ordering). 
Client may observe stale data. 
24 
•
Query and update operations in a gossip service 
 The service consists of a collection of RMs that exchange gossip messages 
 Queries and updates are sent by a client via an FE to an RM 
prev is a vector timestamp for the latest version seen by the FE (and client) 
Figure 14.6: a query (carrying prev) returns a value val and a timestamp new; an update (carrying prev) returns an update id. new is the vector timestamp of the resulting value val; update id is the vector timestamp of the update itself. The RMs exchange gossip messages and causal ordering is used.
Gossip processing of queries and updates 
 The five phases in performing a client request are: 
26 
– request 
 FEs normally use the same RM and may be blocked on queries 
 update operations return to the client as soon as the operation is passed to the FE 
– update response - the RM replies as soon as it has seen the update 
– coordination 
 the RM waits to apply the request until the ordering constraints are satisfied 
 this may involve receiving updates from other RMs in gossip messages 
– execution - the RM executes the request 
– query response - if the request is a query the RM now replies: 
– agreement 
 RMs update one another by exchanging gossip messages (lazily) 
• e.g. when several updates have been collected 
• or when an RM discovers it is missing an update 
•
Front ends propagate their timestamps whenever clients 
communicate directly 
 each FE keeps a vector timestamp of the latest value seen (prev) 
– which it sends in every request 
– clients communicate with one another via FEs which pass vector 
timestamps 
Figure 14.7: FEs exchange vector timestamps with the service's RMs (which gossip among themselves) and with each other; client-to-client communication can lead to causal relationships between operations.
•
A gossip replica manager, showing its main state 
components 
– value: the application state (each RM is a state machine); we consider only a single value here 
– value timestamp: updated each time an update is applied to the value 
– update log: updates are held back until ordering allows them to be applied (i.e. until they become stable), and are also held until they are known to have been received by all other RMs 
– replica timestamp: indicates the updates accepted by the RM into its log (it differs from the value timestamp if some updates are not yet stable) 
– executed operation table: prevents an operation being applied twice, e.g. if it is received from other RMs as well as from the FE 
– timestamp table: a collection of vector timestamps received from other RMs in gossip messages; it is used to establish when all RMs have received updates 
Figure 14.8: a gossip replica manager holds the value, value timestamp, update log, replica timestamp, executed operation table and timestamp table; it receives updates from FEs and exchanges gossip messages with the other replica managers.
Processing of query and update operations 
 The vector timestamp held by RM i consists of: 
– the ith element holds the number of updates received from FEs by RM i 
– the jth element holds the number of updates received by RM j that have been propagated to RM i 
 Query operations contain q.prev 
– they can be applied if q.prev ≤ valueTS (value timestamp) 
– failing this, the RM can wait for gossip message or initiate them 
 e.g. if valueTS = (2,5,5) and q.prev = (2,4,6) - RM 0 has missed an update 
from RM 2 
– Once the query can be applied, the RM returns valueTS (new) to the 
FE. The FE merges new with its vector timestamp 
29 
• 
RMs are numbered 0, 1, 2,… 
e.g. in a gossip system with 3 RMs a value of (2,4,5) at RM 0 means that the 
value there reflects the first 2 updates accepted from FEs at RM 0, the first 4 at 
RM 1 and the first 5 at RM 2.
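Vector-timestamp helpers matching the query rule above (a sketch; plain tuples of integers stand in for the timestamps):

def vt_leq(a, b):
    """a <= b holds when every component of a is <= the same component of b."""
    return all(x <= y for x, y in zip(a, b))

def vt_merge(a, b):
    """Component-wise maximum: how the FE merges 'new' into its own timestamp."""
    return tuple(max(x, y) for x, y in zip(a, b))

valueTS, q_prev = (2, 5, 5), (2, 4, 6)
assert not vt_leq(q_prev, valueTS)      # RM 0 has missed an update from RM 2,
                                        # so the query must wait for gossip
assert vt_merge((2, 4, 6), (2, 5, 5)) == (2, 5, 6)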
30 
Gossip update operations 
 Update operations are processed in causal order 
– A FE sends update operation u.op, u.prev, u.id to RM i 
 A FE can send a request to several RMs, using same id 
– When RM i receives an update request, it checks whether it is new, by looking for the id 
in its executed ops table and its log 
– if it is new, the RM 
 increments by 1 the ith element of its replica timestamp, 
 assigns a unique vector timestamp ts to the update 
 and stores the update in its log 
logRecord = <i, ts, u.op, u.prev, u.id> 
– The timestamp ts is calculated from u.prev by replacing its ith element by the ith 
element of the replica timestamp. 
– The RM returns ts to the FE, which merges it with its vector timestamp 
– For stability, u.prev ≤ valueTS is required 
– That is, the value must already reflect all the updates seen by the FE when it issued the update 
– When the update is stable, the RM applies the operation u.op to the value, updates valueTS and adds u.id to the executed operation table. 
•
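A sketch of an RM accepting an update along these lines (illustrative; the log-record tuples and the dictionary used as the value are assumptions for the example):

def vt_leq(a, b):
    return all(x <= y for x, y in zip(a, b))

class GossipRM:
    def __init__(self, i, n):
        self.i = i                               # this RM's index
        self.replica_ts = [0] * n                # updates accepted into the log
        self.value_ts = [0] * n                  # updates applied to the value
        self.value, self.log, self.executed = {}, [], set()

    def receive_update(self, u_op, u_prev, u_id):
        if u_id in self.executed or any(rec[4] == u_id for rec in self.log):
            return None                          # duplicate: already seen
        self.replica_ts[self.i] += 1             # accept the update into the log
        ts = list(u_prev)
        ts[self.i] = self.replica_ts[self.i]     # unique timestamp for this update
        self.log.append((self.i, tuple(ts), u_op, tuple(u_prev), u_id))
        self._apply_stable()
        return tuple(ts)                         # returned to the FE

    def _apply_stable(self):
        progressed = True
        while progressed:                        # applying one update may make others stable
            progressed = False
            for rec in list(self.log):
                _, ts, u_op, u_prev, u_id = rec
                if vt_leq(u_prev, self.value_ts):   # stable: everything the FE had
                    name, val = u_op                # seen is already applied here
                    self.value[name] = val
                    self.value_ts = [max(a, b) for a, b in zip(self.value_ts, ts)]
                    self.executed.add(u_id)
                    self.log.remove(rec)
                    progressed = True

rm0 = GossipRM(i=0, n=3)
print(rm0.receive_update(("x", 1), u_prev=(0, 0, 0), u_id="u1"))   # (1, 0, 0)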
31 
Gossip messages 
 an RM uses entries in its timestamp table to estimate which 
updates another RM has not yet received 
– The timestamp table contains a vector timestamp for each other replica, 
collected from gossip messages 
 gossip message, m contains log m.log and replica timestamp 
m.ts 
 an RM receiving gossip message m has the following main 
tasks 
– merge the arriving log with its own (omit those with ts ≤ replicaTS) 
– apply in causal order updates that are new and have become stable 
– remove redundant entries from the log and executed operation table when it is 
known that they have been applied by all RMs 
– merge its replica timestamp with m.ts, so that it corresponds to the additions in 
the log 
•
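Continuing the GossipRM sketch above (and reusing its vt_leq and _apply_stable helpers), the main steps on receiving a gossip message might look like:

def receive_gossip(rm, m_log, m_ts):
    # merge the arriving log with the RM's own, omitting records whose
    # timestamps are already covered by the replica timestamp
    for rec in m_log:
        _, ts, *_ = rec
        if not vt_leq(ts, rm.replica_ts) and rec not in rm.log:
            rm.log.append(rec)
    # merge the replica timestamp so it covers the additions to the log
    rm.replica_ts = [max(a, b) for a, b in zip(rm.replica_ts, m_ts)]
    # apply any updates that are new and have now become stable
    rm._apply_stable()
    # (discarding entries known to have been applied by all RMs, using the
    # timestamp table, is omitted here)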
Discussion of Gossip architecture 
 the gossip architecture is designed to provide a highly 
available service 
 clients with access to a single RM can work when other RMs 
are inaccessible 
– but it is not suitable for data such as bank accounts 
– it is inappropriate for updating replicas in real time (e.g. a conference) 
32 
 scalability 
– as the number of RMs grow, so does the number of gossip messages 
– for R RMs, the number of messages per request (2 for the request and the 
rest for gossip) = 2 + (R-1)/G 
 G is the number of updates per gossip message 
 increasing G reduces the number of gossip messages, but makes the propagation latency of updates worse 
 for applications where queries are more frequent than updates, use some read-only 
replicas, which are updated only by gossip messages 
•
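A worked instance of this message count (the values of R and G are assumed purely for illustration):

def messages_per_request(R, G):
    # 2 messages for the request and reply, plus the amortised share of gossip
    return 2 + (R - 1) / G

print(messages_per_request(R=5, G=4))   # 3.0 messages per request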
14.5 Transactions with replicated data 
 objects in transactional systems are replicated to enhance availability and 
performance 
– the effect of transactions on replicated objects should be the same as if they had been 
performed one at a time on a single set of objects. 
– this property is called one-copy serializability. 
– it is similar to, but not to be confused with, sequential consistency. 
 sequential consistency does not take transactions into account 
– each RM provides concurrency control and recovery of its own objects 
 we assume two-phase locking in this section 
– replication makes recovery more complicated 
 when an RM recovers, it restores its objects with information from other RMs 
34 
•
14.5.1 Architectures for replicated transactions 
 We assume that an FE sends requests to one of a group of RMs 
– in the primary copy approach, all FEs communicate with a single RM which propagates updates to the back-ups. 
– In other schemes, FEs may communicate with any RM and coordination between RMs 
is more complex 
– an RM that receives a request is responsible for getting cooperation from the other RMs 
 rules as to how many RMs are involved vary with the replication scheme 
• e.g. in the read one/write all scheme, one RM is required for a read request and all RMs 
for a write request 
 propagate requests immediately or at the end of a transaction? 
– in the primary copy scheme, we can wait until end of transaction (concurrency control is 
applied at the primary) 
– but if transactions access the same objects at different RMs, we need to propagate the 
requests so that concurrency control can be applied 
 two-phase commit protocol 
– becomes a two-level nested 2PC. If a coordinator or worker is an RM it will 
communicate with other RMs that it passed requests to during the transaction 
•
Consider pairs of operations by different 
transactions on the same object. 
Any pair of write operations will require 
conflicting locks at all of the RMs 
a read operation and a write operation will 
require conflicting locks at a single RM. 
Thus one-copy serializability is achieved. 
Transactions on replicated data 
 in read one/write all replication, one RM is required for a 
read request and all RMs for a write request 
Figure 14.10: transactions T and U run at clients with front ends; copies of A and B are held at several replica managers. getBalance(A) is performed at a single RM holding A, while deposit(B,3) is performed at all the RMs holding B. 
– every write operation must be performed at all RMs, each of which applies a write lock 
– each read operation is performed by a single RM, which sets a read lock
14.5.2 Available copies replication 
 the simple read one/write all scheme is not realistic 
– because it cannot be carried out if some of the RMs are unavailable, either because they have crashed or because of a communication failure 
 the available copies replication scheme is designed to allow some RMs to 
be temporarily unavailable 
– a read request can be performed by any available RM 
– write requests are performed by the receiving RM and all the other available RMs in the group 
•
Available copies – read one/write all available 
 local concurrency control achieves one-copy serializability provided the set of RMs does not change 
 but we have RMs failing and recovering 
T's getBalance(A) is performed by X, whereas T's deposit(B,3) is performed by M, N and P. 
Figure 14.11: transactions T and U; RMs X and Y hold copies of A, and RMs M, N and P hold copies of B. T performs getBalance(A) and deposit(B,3); U performs getBalance(B) and deposit(A,3). 
At X, T has read A and has locked it; therefore U's deposit(A,3) is delayed until T finishes.
39 
Available copies 
 Replica manager failure 
– An RM can fail by crashing and is replaced by a new process 
 the new process restores its state from its recovery file 
– FEs use timeouts in case an RM fails 
 then try the request at another RM 
 in the case of a write, the RM passing it on may observe failures 
– If an RM is doing recovery, it rejects requests (& FE tries another RM) 
– For one-copy serializability, failures and recoveries are serialized with 
respect to transactions 
 that is, if a transaction observes a failure, the failure must appear to have occurred before the transaction started or after it finished 
 one-copy serializability is not achieved if different transactions make 
conflicting failure observations 
•
therefore additional concurrency control is required to prevent inconsistent results between a 
read in one transaction and a write in another transaction 
Available copies replication: RM failure example 
 assume that RM X fails just after T has performed getBalance(A), and RM N fails just after U has performed getBalance(B) 
– both failures occur before T and U have performed their deposit operations 
– therefore T's deposit(B,3) will be performed at RMs M and P (all those available) 
– and U's deposit(A,3) will be performed at RM Y (all those available) 
 concurrency control on A at X does not prevent U from updating A at Y 
 concurrency control on B at N does not prevent T from updating B at M and P 
(Figure 14.11 again: RMs X and Y hold copies of A; RMs M, N and P hold copies of B.)
•
–after all the operations of a transaction have been carried out 
the FE will inform the coordinator of failed RMs it knows about 
the coordinator can attempt to communicate with any RMs noted to have failed 
then in doing the 2PC it will discover whether any RMs involved in the transaction 
have subsequently failed 
41 
Available copies replication 
 Local validation (the additional concurrency control) 
– before a transaction commits, it checks for failures and recoveries of the RMs it has 
contacted 
 e.g. T would check if N is still unavailable and that X, M and P are still available. 
 If this is the case, T can commit. 
this implies that X failed after T validated and before U validated 
• i.e. we have: N fails → T commits → X fails → U validates 
• (above, we said X fails before T's deposit, in which case T would have to abort) 
 U checks if N is still available (no) and X still unavailable 
• therefore U must abort 
• 
Network partitions divide RMs into subgroups 
 the subgroups cannot communicate with one another 
e.g. the RMs doing the deposit can't communicate with those doing the withdraw 
Replication schemes assume that partitions will eventually be repaired; reading during a partition would not cause inconsistency, but writing might. 
Optimistic schemes, e.g. available copies with validation, resolve inconsistencies when a partition is repaired. We have to be able to do compensating actions, otherwise the scheme is unsuitable, e.g. for banking. We are not studying this; see section 14.5.4. 
 therefore operations done during a partition must not cause inconsistency 
 pessimistic schemes (e.g. quorum consensus) prevent inconsistency 
Figure 14.12: a network partition separates the RMs handling T's deposit(B,3) from those handling U's withdraw(B,4); each side holds copies of B.
14.5.5 Quorum consensus methods 
 To prevent transactions in different partitions from producing 
inconsistent results 
– make a rule that operations can be performed in only one of the partitions. 
 RMs in different partitions cannot communicate: 
– each subgroup decides independently whether they can perform operations. 
 A quorum is a subgroup of RMs whose size gives it the right 
to perform operations. 
– e.g. having a majority of the RMs could be the criterion 
 in quorum consensus schemes 
– update operations may be performed by a subset of the RMs 
 and the other RMs have out-of-date copies 
 version numbers or timestamps are used to determine which copies are up-to-date 
 operations are applied only to copies with the current version number 
43 
•
Gifford’s quorum consensus file replication scheme 
 a number of ‘votes’ is assigned to each physical copy of a logical file at an 
RM 
– a vote is a weighting giving the desirability of using a particular copy. 
– each read operation must obtain a read quorum of R votes before it can read from any up-to-date copy 
– each write operation must obtain a write quorum of W votes before it can do an update 
operation. 
– R and W are set for a group of replica managers such that 
 W > half the total votes 
 R + W > total number of votes for the group 
– ensuring that any read quorum and write quorum (and any two write quora) have copies in common 
– therefore it is not possible to perform conflicting operations on the same file in different partitions 
•
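A quick check of these constraints (the vote assignments are the three example configurations tabulated later in these slides):

def valid_quorums(votes, R, W):
    total = sum(votes)
    # write quorums overlap each other, and every read quorum overlaps every write quorum
    return W > total / 2 and R + W > total

assert valid_quorums([1, 0, 0], R=1, W=1)   # Example 1
assert valid_quorums([2, 1, 1], R=2, W=3)   # Example 2
assert valid_quorums([1, 1, 1], R=1, W=3)   # Example 3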
Gifford’s quorum consensus - 
performing read and write operations 
 before a read operation, a read quorum is collected 
– by making version number enquiries at RMs to find a set of copies, the sum of whose 
votes is not less than R (not all of these copies need be up to date). 
– as each read quorum overlaps with every write quorum, every read quorum is certain to include at least one current copy 
– the read operation may be applied to any up-to-date copy. 
 before a write operation, a write quorum is collected 
– by making version number enquiries at RMs to find a set with up-to-date copies, the 
sum of whose votes is not less than W. 
– if there are insufficient up-to-date copies, then an out-of-date file is replaced with a 
current one, to enable the quorum to be established. 
– the write operation is then applied by each RM in the write quorum, the version number 
is incremented and completion is reported to the client. 
– the files at the remaining available RMs are then updated in the background. 
 Two-phase read/write locking is used for concurrency control 
– the version number enquiry sets read locks (read and write quora overlap) 
•
Gifford’s quorum consensus: configurability of groups 
of replica managers 
 groups of RMs can be configured to give different 
performance or reliability characteristics 
– once the R and W have been chosen for a set of RMs: 
– the reliability and performance of write operations may be increased by 
decreasing W 
– and similarly for reads by decreasing R 
 the performance of read operations is degraded by the need 
to collect a read consensus 
 examples from Gifford 
– three examples show the range of properties that can be achieved by 
allocating weights to the various RMs in a group and assigning R and W 
appropriately 
– weak representatives (on local disk) have zero votes, get a read quorum from 
RMs with votes and then read from the local copy 
46 
•
Gifford’s quorum consensus examples (1979) 
                           Example 1   Example 2   Example 3 
Latency         Replica 1      75          75          75 
(milliseconds)  Replica 2      65         100         750 
                Replica 3      65         750         750 
Voting          Replica 1       1           2           1 
configuration   Replica 2       0           1           1 
                Replica 3       0           1           1 
Quorum sizes    R               1           2           1 
                W               1           3           3 
Example 1 is configured for a file with high read to write ratio 
with several weak representatives and a single RM. 
Replication is used for performance, not reliability. 
The RM can be accessed in 75 ms and the two clients can access 
their weak representatives in 65 ms, resulting in lower latency and 
less network traffic 
Example 2 is configured for a file with a moderate read to write ratio 
which is accessed mainly from one local network. Local RM has 2 
votes and remote RMs 1 vote each. 
Reads can be done at the local RM, but writes must access one local 
RM and one remote RM. If the local RM fails only reads are allowed 
Example 3 is configured for a file with a very high read to write ratio. 
Reads can be done at any RM and the probability of the file being 
unavailable is small. But writes must access all RMs. 
Derived performance of the file suite (latency, and blocking probability – the probability that a quorum cannot be obtained, assuming a probability of 0.01 that any single RM is unavailable): 
                              Example 1   Example 2   Example 3 
Read    Latency (ms)              65          75          75 
        Blocking probability    0.01      0.0002    0.000001 
Write   Latency (ms)              75         100         750 
        Blocking probability    0.01      0.0101        0.03 
•
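The blocking probabilities in the table can be reproduced by enumerating which RMs are up, assuming (as the slide does) that each RM is independently unavailable with probability 0.01 (an illustrative sketch):

from itertools import product

def blocking_probability(votes, quorum, p_down=0.01):
    """Probability that the available RMs cannot muster `quorum` votes."""
    total = 0.0
    for up in product([True, False], repeat=len(votes)):
        p = 1.0
        for ok in up:
            p *= (1 - p_down) if ok else p_down
        if sum(v for v, ok in zip(votes, up) if ok) < quorum:
            total += p
    return total

# Example 2: votes (2, 1, 1), R = 2, W = 3
print(round(blocking_probability([2, 1, 1], quorum=2), 4))   # 0.0002 (read)
print(round(blocking_probability([2, 1, 1], quorum=3), 4))   # 0.0101 (write)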
Summary for Gossip and replication in transactions 
 the Gossip architecture is designed for highly 
available services 
– it uses a lazy form of replication in which RMs update one another 
from time to time by means of gossip messages 
– it allows clients to make updates to local replicas while partitioned 
– RMs exchange updates with one another when reconnected 
52 
 replication in transactions 
– primary-backup architectures can be used 
– other architectures allow FEs to use any RM 
 available copies allows RMs to fail, but cannot deal with partitions 
 quorum consensus does allow transactions to progress in the presence of 
partitions, but the performance of read operations is degraded by the need 
to collect a read consensus 
•

More Related Content

What's hot

remote procedure calls
  remote procedure calls  remote procedure calls
remote procedure callsAshish Kumar
 
Inter-Process Communication in distributed systems
Inter-Process Communication in distributed systemsInter-Process Communication in distributed systems
Inter-Process Communication in distributed systemsAya Mahmoud
 
Chapter 4 a interprocess communication
Chapter 4 a interprocess communicationChapter 4 a interprocess communication
Chapter 4 a interprocess communicationAbDul ThaYyal
 
Group Communication (Distributed computing)
Group Communication (Distributed computing)Group Communication (Distributed computing)
Group Communication (Distributed computing)Sri Prasanna
 
Clock synchronization in distributed system
Clock synchronization in distributed systemClock synchronization in distributed system
Clock synchronization in distributed systemSunita Sahu
 
4.file service architecture
4.file service architecture4.file service architecture
4.file service architectureAbDul ThaYyal
 
Distributed datababase Transaction and concurrency control
Distributed datababase Transaction and concurrency controlDistributed datababase Transaction and concurrency control
Distributed datababase Transaction and concurrency controlbalamurugan.k Kalibalamurugan
 
Synchronization in distributed computing
Synchronization in distributed computingSynchronization in distributed computing
Synchronization in distributed computingSVijaylakshmi
 
Consistency protocols
Consistency protocolsConsistency protocols
Consistency protocolsZongYing Lyu
 
Design Goals of Distributed System
Design Goals of Distributed SystemDesign Goals of Distributed System
Design Goals of Distributed SystemAshish KC
 
RPC: Remote procedure call
RPC: Remote procedure callRPC: Remote procedure call
RPC: Remote procedure callSunita Sahu
 
clock synchronization in Distributed System
clock synchronization in Distributed System clock synchronization in Distributed System
clock synchronization in Distributed System Harshita Ved
 
Agreement Protocols, distributed File Systems, Distributed Shared Memory
Agreement Protocols, distributed File Systems, Distributed Shared MemoryAgreement Protocols, distributed File Systems, Distributed Shared Memory
Agreement Protocols, distributed File Systems, Distributed Shared MemorySHIKHA GAUTAM
 
Distributed shared memory shyam soni
Distributed shared memory shyam soniDistributed shared memory shyam soni
Distributed shared memory shyam soniShyam Soni
 

What's hot (20)

Process Management-Process Migration
Process Management-Process MigrationProcess Management-Process Migration
Process Management-Process Migration
 
remote procedure calls
  remote procedure calls  remote procedure calls
remote procedure calls
 
Inter-Process Communication in distributed systems
Inter-Process Communication in distributed systemsInter-Process Communication in distributed systems
Inter-Process Communication in distributed systems
 
Chapter 4 a interprocess communication
Chapter 4 a interprocess communicationChapter 4 a interprocess communication
Chapter 4 a interprocess communication
 
Group Communication (Distributed computing)
Group Communication (Distributed computing)Group Communication (Distributed computing)
Group Communication (Distributed computing)
 
Clock synchronization in distributed system
Clock synchronization in distributed systemClock synchronization in distributed system
Clock synchronization in distributed system
 
Message passing in Distributed Computing Systems
Message passing in Distributed Computing SystemsMessage passing in Distributed Computing Systems
Message passing in Distributed Computing Systems
 
4.file service architecture
4.file service architecture4.file service architecture
4.file service architecture
 
Distributed datababase Transaction and concurrency control
Distributed datababase Transaction and concurrency controlDistributed datababase Transaction and concurrency control
Distributed datababase Transaction and concurrency control
 
message passing
 message passing message passing
message passing
 
Synchronization in distributed computing
Synchronization in distributed computingSynchronization in distributed computing
Synchronization in distributed computing
 
Consistency protocols
Consistency protocolsConsistency protocols
Consistency protocols
 
Design Goals of Distributed System
Design Goals of Distributed SystemDesign Goals of Distributed System
Design Goals of Distributed System
 
RPC: Remote procedure call
RPC: Remote procedure callRPC: Remote procedure call
RPC: Remote procedure call
 
Chapter 6 os
Chapter 6 osChapter 6 os
Chapter 6 os
 
clock synchronization in Distributed System
clock synchronization in Distributed System clock synchronization in Distributed System
clock synchronization in Distributed System
 
Agreement Protocols, distributed File Systems, Distributed Shared Memory
Agreement Protocols, distributed File Systems, Distributed Shared MemoryAgreement Protocols, distributed File Systems, Distributed Shared Memory
Agreement Protocols, distributed File Systems, Distributed Shared Memory
 
data replication
data replicationdata replication
data replication
 
Distributed shared memory shyam soni
Distributed shared memory shyam soniDistributed shared memory shyam soni
Distributed shared memory shyam soni
 
Peer to Peer services and File systems
Peer to Peer services and File systemsPeer to Peer services and File systems
Peer to Peer services and File systems
 

Similar to Chapter 14 replication

Client Server Model and Distributed Computing
Client Server Model and Distributed ComputingClient Server Model and Distributed Computing
Client Server Model and Distributed ComputingAbhishek Jaisingh
 
Distributed systems principles and paradigms.pdf
Distributed systems principles and paradigms.pdfDistributed systems principles and paradigms.pdf
Distributed systems principles and paradigms.pdfBizuayehuDesalegn
 
OPERATING SYSTEMS PRESENTATION.pptx
OPERATING SYSTEMS PRESENTATION.pptxOPERATING SYSTEMS PRESENTATION.pptx
OPERATING SYSTEMS PRESENTATION.pptxPokormanZuda
 
19. Distributed Databases in DBMS
19. Distributed Databases in DBMS19. Distributed Databases in DBMS
19. Distributed Databases in DBMSkoolkampus
 
24. Advanced Transaction Processing in DBMS
24. Advanced Transaction Processing in DBMS24. Advanced Transaction Processing in DBMS
24. Advanced Transaction Processing in DBMSkoolkampus
 
Chapter 7-Fault Tolerance.ppt
Chapter 7-Fault Tolerance.pptChapter 7-Fault Tolerance.ppt
Chapter 7-Fault Tolerance.pptHabib246314
 
Talon systems - Distributed multi master replication strategy
Talon systems - Distributed multi master replication strategyTalon systems - Distributed multi master replication strategy
Talon systems - Distributed multi master replication strategySaptarshi Chatterjee
 
CHP-4.pptx
CHP-4.pptxCHP-4.pptx
CHP-4.pptxFamiDan
 
Distributed system
Distributed systemDistributed system
Distributed systemchirag patil
 
Java Abs Dynamic Server Replication
Java Abs   Dynamic Server ReplicationJava Abs   Dynamic Server Replication
Java Abs Dynamic Server Replicationncct
 
18 philbe replication stanford99
18 philbe replication stanford9918 philbe replication stanford99
18 philbe replication stanford99ashish61_scs
 
MC0085 – Advanced Operating Systems - Master of Computer Science - MCA - SMU DE
MC0085 – Advanced Operating Systems - Master of Computer Science - MCA - SMU DEMC0085 – Advanced Operating Systems - Master of Computer Science - MCA - SMU DE
MC0085 – Advanced Operating Systems - Master of Computer Science - MCA - SMU DEAravind NC
 
What is active-active
What is active-activeWhat is active-active
What is active-activeSaif Ahmad
 
New background processes in 11g r2
New background processes in 11g r2New background processes in 11g r2
New background processes in 11g r2Kiran Gali
 
Active-Active Multi-Region Architectures.pdf
Active-Active Multi-Region Architectures.pdfActive-Active Multi-Region Architectures.pdf
Active-Active Multi-Region Architectures.pdfDavid Rostcheck
 
Communication And Synchronization In Distributed Systems
Communication And Synchronization In Distributed SystemsCommunication And Synchronization In Distributed Systems
Communication And Synchronization In Distributed Systemsguest61205606
 
Communication And Synchronization In Distributed Systems
Communication And Synchronization In Distributed SystemsCommunication And Synchronization In Distributed Systems
Communication And Synchronization In Distributed Systemsguest61205606
 
Distributed Systems
Distributed SystemsDistributed Systems
Distributed Systemsguest0f5a7d
 

Similar to Chapter 14 replication (20)

Client Server Model and Distributed Computing
Client Server Model and Distributed ComputingClient Server Model and Distributed Computing
Client Server Model and Distributed Computing
 
Distributed systems principles and paradigms.pdf
Distributed systems principles and paradigms.pdfDistributed systems principles and paradigms.pdf
Distributed systems principles and paradigms.pdf
 
SVCC-2014
SVCC-2014SVCC-2014
SVCC-2014
 
OPERATING SYSTEMS PRESENTATION.pptx
OPERATING SYSTEMS PRESENTATION.pptxOPERATING SYSTEMS PRESENTATION.pptx
OPERATING SYSTEMS PRESENTATION.pptx
 
19. Distributed Databases in DBMS
19. Distributed Databases in DBMS19. Distributed Databases in DBMS
19. Distributed Databases in DBMS
 
24. Advanced Transaction Processing in DBMS
24. Advanced Transaction Processing in DBMS24. Advanced Transaction Processing in DBMS
24. Advanced Transaction Processing in DBMS
 
Chapter 7-Fault Tolerance.ppt
Chapter 7-Fault Tolerance.pptChapter 7-Fault Tolerance.ppt
Chapter 7-Fault Tolerance.ppt
 
Talon systems - Distributed multi master replication strategy
Talon systems - Distributed multi master replication strategyTalon systems - Distributed multi master replication strategy
Talon systems - Distributed multi master replication strategy
 
CHP-4.pptx
CHP-4.pptxCHP-4.pptx
CHP-4.pptx
 
Distributed system
Distributed systemDistributed system
Distributed system
 
Java Abs Dynamic Server Replication
Java Abs   Dynamic Server ReplicationJava Abs   Dynamic Server Replication
Java Abs Dynamic Server Replication
 
18 philbe replication stanford99
18 philbe replication stanford9918 philbe replication stanford99
18 philbe replication stanford99
 
MC0085 – Advanced Operating Systems - Master of Computer Science - MCA - SMU DE
MC0085 – Advanced Operating Systems - Master of Computer Science - MCA - SMU DEMC0085 – Advanced Operating Systems - Master of Computer Science - MCA - SMU DE
MC0085 – Advanced Operating Systems - Master of Computer Science - MCA - SMU DE
 
What is active-active
What is active-activeWhat is active-active
What is active-active
 
New background processes in 11g r2
New background processes in 11g r2New background processes in 11g r2
New background processes in 11g r2
 
Active-Active Multi-Region Architectures.pdf
Active-Active Multi-Region Architectures.pdfActive-Active Multi-Region Architectures.pdf
Active-Active Multi-Region Architectures.pdf
 
Communication And Synchronization In Distributed Systems
Communication And Synchronization In Distributed SystemsCommunication And Synchronization In Distributed Systems
Communication And Synchronization In Distributed Systems
 
Communication And Synchronization In Distributed Systems
Communication And Synchronization In Distributed SystemsCommunication And Synchronization In Distributed Systems
Communication And Synchronization In Distributed Systems
 
Distributed Systems
Distributed SystemsDistributed Systems
Distributed Systems
 
master_seminar
master_seminarmaster_seminar
master_seminar
 

More from AbDul ThaYyal

Chapter 15 distributed mm systems
Chapter 15 distributed mm systemsChapter 15 distributed mm systems
Chapter 15 distributed mm systemsAbDul ThaYyal
 
Chapter 11d coordination agreement
Chapter 11d coordination agreementChapter 11d coordination agreement
Chapter 11d coordination agreementAbDul ThaYyal
 
Chapter 11c coordination agreement
Chapter 11c coordination agreementChapter 11c coordination agreement
Chapter 11c coordination agreementAbDul ThaYyal
 
Chapter 8 distributed file systems
Chapter 8 distributed file systemsChapter 8 distributed file systems
Chapter 8 distributed file systemsAbDul ThaYyal
 
Chapter 3 networking and internetworking
Chapter 3 networking and internetworkingChapter 3 networking and internetworking
Chapter 3 networking and internetworkingAbDul ThaYyal
 
Chapter 1 characterisation of distributed systems
Chapter 1 characterisation of distributed systemsChapter 1 characterisation of distributed systems
Chapter 1 characterisation of distributed systemsAbDul ThaYyal
 
Chapter 2 system models
Chapter 2 system modelsChapter 2 system models
Chapter 2 system modelsAbDul ThaYyal
 
4.file service architecture (1)
4.file service architecture (1)4.file service architecture (1)
4.file service architecture (1)AbDul ThaYyal
 
4. concurrency control
4. concurrency control4. concurrency control
4. concurrency controlAbDul ThaYyal
 
3. distributed file system requirements
3. distributed file system requirements3. distributed file system requirements
3. distributed file system requirementsAbDul ThaYyal
 

More from AbDul ThaYyal (20)

Chapter 17 corba
Chapter 17 corbaChapter 17 corba
Chapter 17 corba
 
Chapter 15 distributed mm systems
Chapter 15 distributed mm systemsChapter 15 distributed mm systems
Chapter 15 distributed mm systems
 
Chapter 11d coordination agreement
Chapter 11d coordination agreementChapter 11d coordination agreement
Chapter 11d coordination agreement
 
Chapter 11c coordination agreement
Chapter 11c coordination agreementChapter 11c coordination agreement
Chapter 11c coordination agreement
 
Chapter 11b
Chapter 11bChapter 11b
Chapter 11b
 
Chapter 11
Chapter 11Chapter 11
Chapter 11
 
Chapter 10
Chapter 10Chapter 10
Chapter 10
 
Chapter 9 names
Chapter 9 namesChapter 9 names
Chapter 9 names
 
Chapter 8 distributed file systems
Chapter 8 distributed file systemsChapter 8 distributed file systems
Chapter 8 distributed file systems
 
Chapter 7 security
Chapter 7 securityChapter 7 security
Chapter 7 security
 
Chapter 5
Chapter 5Chapter 5
Chapter 5
 
Chapter 3 networking and internetworking
Chapter 3 networking and internetworkingChapter 3 networking and internetworking
Chapter 3 networking and internetworking
 
Chapter 1 characterisation of distributed systems
Chapter 1 characterisation of distributed systemsChapter 1 characterisation of distributed systems
Chapter 1 characterisation of distributed systems
 
Chapter 2 system models
Chapter 2 system modelsChapter 2 system models
Chapter 2 system models
 
4.file service architecture (1)
4.file service architecture (1)4.file service architecture (1)
4.file service architecture (1)
 
4. system models
4. system models4. system models
4. system models
 
4. concurrency control
4. concurrency control4. concurrency control
4. concurrency control
 
3. distributed file system requirements
3. distributed file system requirements3. distributed file system requirements
3. distributed file system requirements
 
3. challenges
3. challenges3. challenges
3. challenges
 
2. microkernel new
2. microkernel new2. microkernel new
2. microkernel new
 

Chapter 14 replication

  • 1. Teaching material based on Distributed Systems: Concepts and Design, Edition 3, Addison-Wesley 2001. Distributed Systems Course Replication Copyright © George Coulouris, Jean Dollimore, Tim Kindberg 2001 email: authors@cdk2.net This material is made available for private study and for direct use by individual teachers. It may not be included in any product or employed in any service without the written permission of the authors. Viewing: These slides must be viewed in slide show mode. 14.1 Introduction to replication 14.2 System model and group communication 14.3 Fault-tolerant services 14.4 Highly available services 14.4.1 Gossip architecture 14.5 Transactions with replicated data
  • 2. Replication of data :- the maintenance of copies of data at multiple computers Introduction to replication  replication can provide the following  performance enhancement – e.g. several web servers can have the same DNS name and the servers are selected in turn. To share the load. – replication of read-only data is simple, but replication of changing data has overheads e.g. a user on a train with a laptop with no access to a network will prepare by copying data to the laptop, e.g. a shared diary. If they update the diary they risk missing updates by other people. 2  fault-tolerant service – guarantees correct behaviour in spite of certain faults (can include timeliness) – if f of f+1 servers crash then 1 remains to supply the service – if f of 2f+1 servers have byzantine faults then they can supply a correct service  availability is hindered by – server failures  replicate data at failure- independent servers and when one fails, client may use another. Note that caches do not help with availability(they are incomplete). – network partitions and disconnected operation  Users of mobile computers deliberately disconnect, and then on re-connection, resolve conflicts •
  • 3. Requirements for replicated data 3  Replication transparency – clients see logical objects (not several physical copies)  they access one logical item and receive a single result  Consistency – specified to suit the application,  e.g. when a user of a diary disconnects, their local copy may be inconsistent with the others and will need to be reconciled when they connect again. But connected clients using different copies should get consistent results. These issues are addressed in Bayou and Coda. • What is replication transparency?
  • 4. 14.2.1 System model  each logical object is implemented by a collection of physical copies called replicas – the replicas are not necessarily consistent all the time (some may have received updates, not yet conveyed to the others)  we assume an asynchronous system where processes fail only by crashing and generally assume no network partitions  replica managers – an RM contains replicas on a computer and access them directly – RMs apply operations to replicas recoverably  i.e. they do not leave inconsistent results if they crash – objects are copied at all RMs unless we state otherwise – static systems are based on a fixed set of RMs – in a dynamic system: RMs may join or leave (e.g. when they crash) – an RM can be a state machine, which has the following properties: 4 • State machine applies operations atomically its state is a deterministic function of its initial state and the operations applied all replicas start identical and carry out the same operations Its operations must not be affected by clock readings etc.
  • 5. A basic architectural model for the management of replicated data. A collection of RMs provides a service to clients. [Figure 14.1: clients (C) send requests and receive replies via front ends (FE), which communicate with the replica managers (RM).] Clients see a service that gives them access to logical objects, which are in fact replicated at the RMs. Clients request operations: those without updates are called read-only requests; the others are called update requests (they may include reads). Client requests are handled by front ends. A front end makes replication transparent. What can the FE hide from a client?
  • 6. Five phases in performing a request. Ordering guarantees: FIFO ordering: if a FE issues r and then r', then any correct RM that handles r' handles r before it. Causal ordering: if r → r', then any correct RM that handles r' handles r before it. Total ordering: if a correct RM handles r before r', then any correct RM that handles r' handles r before it. Bayou sometimes executes responses tentatively so as to be able to reorder them. In Gossip, all RMs eventually receive updates.  issue request – the FE either sends the request to a single RM that passes it on to the others, or multicasts the request to all of the RMs (in the state machine approach)  coordination – the RMs decide whether to apply the request, and decide on its ordering relative to other requests (according to FIFO, causal or total ordering)  execution – the RMs execute the request (sometimes tentatively)  agreement – the RMs agree on the effect of the request, i.e. reach a consensus, e.g. perform it 'lazily' or immediately  response – one or more RMs reply to the FE, e.g. for high availability give the first response to the client; to tolerate byzantine faults, take a vote
  • 7. 14.2.2 Group communication. Section 11.4 discussed multicast communication (also known as group communication); there we took group membership to be static (although members may crash). We require a membership service to allow dynamic membership of groups.  process groups are useful for managing replicated data – but replication systems need to be able to add/remove RMs  a group membership service provides: – an interface for adding/removing members  create, destroy process groups, add/remove members. A process can generally belong to several groups. – a failure detector (section 11.1 - not studied in this course)  which monitors members for failures (crashes/communication) and excludes them when unreachable – notifies members of changes in membership – expands group addresses  multicasts addressed to group identifiers,  coordinates delivery when membership is changing  e.g. IP multicast allows members to join/leave and performs address expansion, but not the other features
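To make the membership-service operations concrete, here is a minimal Python sketch of the interface described above (join/leave, failure detection, notification of view changes); the class and method names are illustrative assumptions, not an API from the book.

```python
# A minimal sketch of a group membership service interface; names such as
# MembershipService, join, leave and subscribe are assumptions for illustration.
from dataclasses import dataclass, field
from typing import Callable, Set

@dataclass
class GroupView:
    view_id: int                          # views are numbered as membership changes
    members: Set[str] = field(default_factory=set)

class MembershipService:
    def __init__(self):
        self._view = GroupView(view_id=0)
        self._listeners: list[Callable[[GroupView], None]] = []

    def join(self, process_id: str) -> None:
        self._deliver(self._view.members | {process_id})

    def leave(self, process_id: str) -> None:
        self._deliver(self._view.members - {process_id})

    def on_failure_detected(self, process_id: str) -> None:
        # the failure detector excludes members that become unreachable
        self.leave(process_id)

    def subscribe(self, callback: Callable[[GroupView], None]) -> None:
        # members are notified of changes in membership
        self._listeners.append(callback)

    def _deliver(self, members: Set[str]) -> None:
        # generate and deliver a new group view
        self._view = GroupView(self._view.view_id + 1, members)
        for cb in self._listeners:
            cb(self._view)
```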
  • 8. Services provided for process groups. [Figure 14.2: group membership management (join, leave, fail) and multicast communication (group send, group address expansion) applied to a process group.] A process outside the group sends to the group without knowing the membership. The membership service provides leave and join operations. The failure detector notes failures and evicts failed processes from the group. The group address is expanded. Members are informed when processes join/leave.
  • 9. We will leave out the details of view delivery and view-synchronous group communication.  A full membership service maintains group views, which are lists of group members, ordered e.g. as members join the group.  A new group view is generated each time a process joins or leaves the group.  View delivery (p561): the idea is that processes can 'deliver views' (like delivering multicast messages) – ideally we would like all processes to get the same information in the same order relative to the messages.  view-synchronous group communication (p562) adds reliability – illustrated in Fig 14.3 – all processes agree on the ordering of messages and membership changes – a joining process can safely get state from another member – or if one crashes, another will know which operations it had already performed – this work was done in the ISIS system (Birman)
  • 10. Figure 14.3 View-synchronous group communication: four scenarios for processes p, q and r, in which p crashes and the view changes from (p, q, r) to (q, r); scenarios a and b are allowed, scenarios c and d are disallowed.
  • 11. 14.3 Fault-tolerant services  provision of a service that is correct even if f processes fail – by replicating data and functionality at RMs – assume communication is reliable and there are no partitions – RMs are assumed to behave according to specification or to crash – intuitively, a service is correct if it responds despite failures and clients can't tell the difference between replicated data and a single copy – but care is needed to ensure that a set of replicas produces the same result as a single one would – e.g. (next slide).
  • 12. Example of a naive replication system. RMs at A and B maintain copies of x and y; the initial balance of x and y is $0; clients use the local RM when available, otherwise the other one; RMs propagate updates to one another after replying to the client.
Client 1: setBalanceB(x,1); setBalanceA(y,2)
Client 2: getBalanceA(y) → 2; getBalanceA(x) → 0
– client 1 updates x at B (local), then finds B has failed, so uses A – client 2 reads balances at A (local)  as client 1 updates y after x, client 2 should see $1 for x – not the behaviour that would occur if A and B were implemented at a single server  Systems can be constructed to replicate objects without producing this anomalous behaviour.  We now discuss what counts as correct behaviour in a replication system.
  • 13. Linearizability (p566), the strictest criterion for a replication system. The correctness criteria for replicated objects are defined by referring to a virtual interleaving which would be correct. Consider a replicated service with two clients that perform read and update operations; a client waits for one operation to complete before doing another. Client operations o10, o11, o12 and o20, o21, o22 at a single server are interleaved in some order, e.g. o20, o21, o10, o22, o11, o12 (client 1 does o10 etc.). A replicated object service is linearizable if for any execution there is some interleaving of clients' operations such that: – the interleaved sequence of operations meets the specification of a (single) correct copy of the objects – the order of operations in the interleaving is consistent with the real time at which they occurred. The real-time requirement means clients should receive up-to-date information, but it may not be practical due to difficulties of synchronizing clocks; a weaker criterion is sequential consistency. Linearizability is not intended to be used with transactional replication systems. For any set of client operations there is a virtual interleaving (which would be correct for a set of single objects); each client sees a view of the objects that is consistent with this, that is, the results of client operations make sense within the interleaving.  the bank example did not make sense: if the second update is observed, the first update should be observed too.
  • 14. Sequential consistency (p567).  a replicated shared object service is sequentially consistent if for any execution there is some interleaving of clients' operations such that: – the interleaved sequence of operations meets the specification of a (single) correct copy of the objects – the order of operations in the interleaving is consistent with the program order in which each client executed them. The following is sequentially consistent but not linearizable:
Client 1: setBalanceB(x,1); setBalanceA(y,2)
Client 2: getBalanceA(y) → 0; getBalanceA(x) → 0
It is not linearizable because client 2's getBalance is after client 1's setBalance in real time. This is possible under a naive replication strategy, even if neither A nor B fails - the update at B has not yet been propagated to A when client 2 reads it. But the following interleaving satisfies both criteria for sequential consistency: getBalanceA(y) → 0; getBalanceA(x) → 0; setBalanceB(x,1); setBalanceA(y,2)
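To make the "meets the specification of a single correct copy" clause concrete, here is a small Python sketch that replays the interleaving given above against one copy of the accounts; the helper name and tuple encoding are illustrative assumptions (a full sequential-consistency check would also have to respect each client's program order and search over possible interleavings).

```python
# A sketch (illustrative, not from the book): replay a proposed interleaving
# against a single correct copy and check every getBalance result matches.
def meets_single_copy_spec(interleaving):
    balances = {"x": 0, "y": 0}              # initial balance of x and y is $0
    for op, account, value in interleaving:
        if op == "setBalance":
            balances[account] = value
        elif op == "getBalance":
            if balances[account] != value:   # observed result disagrees
                return False
    return True

# The interleaving from the slide that satisfies sequential consistency:
ok = meets_single_copy_spec([
    ("getBalance", "y", 0),
    ("getBalance", "x", 0),
    ("setBalance", "x", 1),
    ("setBalance", "y", 2),
])
print(ok)  # True - the results make sense against a single copy
```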
  • 15. The passive (primary-backup) model for fault tolerance. [Figure 14.4: clients (C) communicate via FEs with a primary RM, which is backed by backup RMs.]  There is at any time a single primary RM and one or more secondary (backup, slave) RMs  FEs communicate with the primary, which executes the operation and sends copies of the updated data and the result to the backups  if the primary fails, one of the backups is promoted to act as the primary. The FE has to find the primary, e.g. after it crashes and another takes over.
  • 16. Passive (primary-backup) replication. Five phases.  The five phases in performing a client request are as follows:  1. Request: – a FE issues the request, containing a unique identifier, to the primary RM  2. Coordination: – the primary performs each request atomically, in the order in which it receives it relative to other requests – it checks the unique id; if it has already done the request it re-sends the response.  3. Execution: – The primary executes the request and stores the response.  4. Agreement: – If the request is an update the primary sends the updated state, the response and the unique identifier to all the backups. The backups send an acknowledgement.  5. Response: – The primary responds to the FE, which hands the response back to the client.
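A minimal sketch of the primary's side of these five phases, assuming illustrative names (PrimaryRM, handle, update) rather than anything prescribed by the book:

```python
# A sketch of a primary RM following the five phases of passive replication.
class PrimaryRM:
    def __init__(self, backups):
        self.state = {}
        self.responses = {}      # unique request id -> stored response
        self.backups = backups   # backup RM stubs offering an update() method

    def handle(self, request_id, operation):
        # Coordination: requests are handled one at a time, in arrival order;
        # a repeated id means the request was already done - re-send the response.
        if request_id in self.responses:
            return self.responses[request_id]
        # Execution: apply the operation to the state and store the response.
        response = operation(self.state)
        self.responses[request_id] = response
        # Agreement: send the updated state, response and id to all backups
        # (each backup acknowledges in this simplified, synchronous sketch).
        for backup in self.backups:
            backup.update(self.state, response, request_id)
        # Response: return to the FE, which hands it back to the client.
        return response
```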
  • 17. Passive (primary-backup) replication (discussion)  This system implements linearizability, since the primary sequences all the operations on the shared objects  If the primary fails, the system remains linearizable if a single backup takes over exactly where the primary left off, i.e.: – the primary is replaced by a unique backup – surviving RMs agree which operations had been performed at take-over  view-synchronous group communication can achieve this – when surviving backups receive a view without the primary, they use an agreed function to calculate which is the new primary – the new primary registers with the name service – view synchrony also allows the processes to agree which operations were performed before the primary failed – e.g. when a FE does not get a response, it retransmits the request to the new primary – the new primary continues from phase 2 (coordination - it uses the unique identifier to discover whether the request has already been performed).
  • 18. Discussion of passive replication  To survive f process crashes, f+1 RMs are required – it cannot deal with byzantine failures because the client can't get replies from the backup RMs  To design passive replication that is linearizable – view-synchronous communication has relatively large overheads – several rounds of messages per multicast – after failure of the primary, there is latency due to delivery of the group view  variant in which clients can read from backups – which reduces the work for the primary – gives sequential consistency but not linearizability  Sun NIS uses passive replication with weaker guarantees – weaker than sequential consistency, but adequate for the type of data stored – achieves high availability and good performance – the master receives updates and propagates them to slaves using one-to-one communication; clients can use either the master or a slave – updates are not done via RMs - they are made on the files at the master
  • 19. 14.3.2 Active replication for fault tolerance. What sort of system do we need to perform totally ordered reliable multicast?  the RMs are state machines, all playing the same role and organised as a group – all start in the same state and perform the same operations in the same order, so that their state remains identical  If an RM crashes it has no effect on the performance of the service because the others continue as normal  It can tolerate byzantine failures because the FE can collect and compare the replies it receives. [Figure 14.5: each client's FE multicasts requests to the group of RMs.] A FE multicasts each request to the group of RMs; the RMs process each request identically and reply. This requires totally ordered reliable multicast so that all RMs perform the same operations in the same order.
  • 20. Active replication - five phases in performing a client request  Request – the FE attaches a unique id and uses totally ordered reliable multicast to send the request to the RMs. The FE can, at worst, crash. It does not issue requests in parallel  Coordination – the multicast delivers requests to all the RMs in the same (total) order.  Execution – every RM executes the request. They are state machines and receive requests in the same order, so the effects are identical. The id is put in the response  Agreement – no agreement is required because all RMs execute the same operations in the same order, due to the properties of the totally ordered multicast.  Response – FEs collect responses from RMs. A FE may use just one or more of the responses. If it is only trying to tolerate crash failures, it gives the client the first response.
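A sketch of the FE side of these phases, under assumed names (invoke, multicast_send, collect_replies); it shows the response rule: the first reply suffices for crash tolerance, while f+1 matching replies out of 2f+1 RMs are needed for byzantine tolerance (as discussed on the next slide).

```python
# A sketch of an FE in active replication: multicast the request to all RMs,
# then decide how many matching replies are needed before answering the client.
from collections import Counter

def invoke(multicast_send, collect_replies, request, unique_id,
           tolerate_byzantine=False, f=1):
    # Request: attach a unique id and multicast to the whole RM group
    # (multicast_send is assumed to be a totally ordered reliable multicast).
    multicast_send({"id": unique_id, "op": request})
    replies = collect_replies(unique_id)   # iterator of replies as they arrive

    if not tolerate_byzantine:
        # Crash failures only: the first reply is as good as any other,
        # because correct RMs are identical state machines.
        return next(replies)

    # Byzantine failures: wait for f+1 identical replies (replies must be
    # hashable here); with 2f+1 RMs, f+1 matching answers include a correct one.
    counts = Counter()
    for reply in replies:
        counts[reply] += 1
        if counts[reply] == f + 1:
            return reply
    raise RuntimeError("not enough matching replies")
```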
  • 21. Active replication - discussion  As RMs are state machines we have sequential consistency – due to reliable totally ordered multicast, the RMs collectively do the same as a single copy would do – it works in a synchronous system – in an asynchronous system reliable totally ordered multicast is impossible – but failure detectors can be used to work around this problem. How to do that is beyond the scope of this course.  this replication scheme is not linearizable – because total order is not necessarily the same as real-time order  To deal with byzantine failures – for up to f byzantine failures, use 2f+1 RMs – the FE collects f+1 identical responses  To improve performance – FEs send read-only requests to just one RM
  • 22. Summary for Sections 14.1-14.3  Replicating objects helps services to provide good performance, high availability and fault tolerance.  system model - each logical object is implemented by a set of physical replicas  linearizability and sequential consistency can be used as correctness criteria – sequential consistency is less strict and more practical to use  fault tolerance can be provided by: – passive replication - using a primary RM and backups,  but to achieve linearizability when the primary crashes, view-synchronous communication is used, which is expensive. Less strict variants can be useful. – active replication - in which all RMs process all requests identically  needs totally ordered and reliable multicast, which can be achieved in a synchronous system 22 •
  • 23. 23 Highly available services  we discuss the application of replication techniques to make services highly available. – we aim to give clients access to the service with:  reasonable response times for as much of the time as possible  even if some results do not conform to sequential consistency  e.g. a disconnected user may accept temporarily inconsistent results if they can continue to work and fix inconsistencies later  eager versus lazy updates – fault-tolerant systems send updates to RMs in an ‘eager’ fashion (as soon as possible) and reach agreement before replying to the client – for high availability, clients should:  only need to contact a minimum number of RMs and  be tied up for a minimum time while RMs coordinate their actions – weaker consistency generally requires less agreement and makes data more available. Updates are propagated 'lazily'. •
  • 24. 14.4.1 The gossip architecture  the gossip architecture is a framework for implementing highly available services – data is replicated close to the location of clients – RMs periodically exchange 'gossip' messages containing updates  the gossip service provides two types of operations – queries - read-only operations – updates - modify (but do not read) the state  the FE sends queries and updates to any chosen RM – one that is available and gives reasonable response times  Two guarantees (even if RMs are temporarily unable to communicate): – each client gets a consistent service over time (i.e. the data reflects the updates seen by the client, even if they use different RMs). Vector timestamps are used – with one entry per RM. – relaxed consistency between replicas. All RMs eventually receive all updates. RMs use ordering guarantees to suit the needs of the application (generally causal ordering). A client may observe stale data.
  • 25. Query and update operations in a gossip service  The service consists of a collection of RMs that exchange gossip messages (applied in causal order)  Queries and updates are sent by a client via an FE to an RM. [Figure 14.6: the FE sends Query, prev to an RM and receives Val, new; it sends Update, prev and receives Update id; RMs exchange gossip.] prev is a vector timestamp for the latest version seen by the FE (and client); new is the vector timestamp of the resulting value, val; update id is the vector timestamp of the update.
  • 26. Gossip processing of queries and updates  The five phases in performing a client request are: – request  FEs normally use the same RM and may be blocked on queries  update operations return to the client as soon as the operation is passed to the FE – update response - the RM replies as soon as it has seen the update – coordination  the RM waits to apply the request until the ordering constraints apply (causal ordering)  this may involve receiving updates from other RMs in gossip messages – execution - the RM executes the request – query response - if the request is a query, the RM now replies – agreement  RMs update one another by exchanging gossip messages (lazily), e.g. when several updates have been collected, or when an RM discovers it is missing an update
  • 27. Front ends propagate their timestamps whenever clients communicate directly  each FE keeps a vector timestamp of the latest value seen (prev) – which it sends in every request – clients communicate with one another via FEs, which pass vector timestamps. [Figure 14.7: clients, FEs and RMs exchanging vector timestamps; RMs exchange gossip.] Client-to-client communication can lead to causal relationships between operations.
  • 28. A gossip replica manager, showing its main state components. [Figure 14.8: a replica manager holds a value, value timestamp, update log, replica timestamp, executed operation table and timestamp table; it exchanges gossip messages with other replica managers, and FEs submit operations carrying OperationID, Update and Prev.]
value - the application state (each RM is a state machine); we are only talking about one value here
value timestamp - updated each time an update is applied to the value
update log - updates are held back until ordering allows them to be applied (when they become stable); they are also held until they have been received by all other RMs
replica timestamp - indicates the updates accepted by the RM into its log (different from the value's timestamp if some updates are not yet stable)
executed operation table - prevents an operation being applied twice, e.g. if received from other RMs as well as from the FE
timestamp table - a collection of vector timestamps received from other RMs in gossip messages; it is used to know when RMs have received updates
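As a rough illustration, the state components above might be represented as follows; the field names and types are assumptions for the sketch, not the book's notation.

```python
# A sketch of the gossip RM state components as Python dataclasses.
from dataclasses import dataclass, field
from typing import Any, Dict, List

VectorTS = List[int]   # one entry per RM

@dataclass
class LogRecord:
    rm_index: int        # i, the RM that accepted the update
    ts: VectorTS         # unique timestamp assigned to the update
    op: Any              # u.op
    prev: VectorTS       # u.prev, what the FE had seen
    update_id: Any       # u.id

@dataclass
class GossipRMState:
    value: Dict[str, Any] = field(default_factory=dict)   # application state
    value_ts: VectorTS = field(default_factory=list)      # reflects applied updates
    update_log: List[LogRecord] = field(default_factory=list)
    replica_ts: VectorTS = field(default_factory=list)    # updates accepted into the log
    executed: set = field(default_factory=set)            # update ids already applied
    timestamp_table: Dict[int, VectorTS] = field(default_factory=dict)  # per other RM
```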
  • 29. Processing of query and update operations. RMs are numbered 0, 1, 2, …; e.g. in a gossip system with 3 RMs, a value timestamp of (2,4,5) at RM 0 means that the value there reflects the first 2 updates accepted from FEs at RM 0, the first 4 at RM 1 and the first 5 at RM 2.  The vector timestamp held by RM i consists of: – the ith element holds updates received from FEs by that RM – the jth element holds updates received by RM j and propagated to RM i  Query operations contain q.prev – they can be applied if q.prev ≤ valueTS (the value timestamp) – failing this, the RM can wait for gossip messages or initiate them, e.g. if valueTS = (2,5,5) and q.prev = (2,4,6), RM 0 has missed an update from RM 2 – Once the query can be applied, the RM returns valueTS (new) to the FE. The FE merges new with its vector timestamp.
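A small sketch of the vector-timestamp test behind "q.prev ≤ valueTS", using the example values from this slide; the function names are illustrative.

```python
# A sketch of the vector timestamp operations used for gossip queries.
def vts_le(a, b):
    """True if vector timestamp a <= b component-wise."""
    return all(x <= y for x, y in zip(a, b))

def vts_merge(a, b):
    """Component-wise maximum - used when the FE merges 'new' into its prev."""
    return [max(x, y) for x, y in zip(a, b)]

# Example from the slide: RM 0 holds valueTS = (2,5,5) but the query carries
# q.prev = (2,4,6); the third component shows RM 0 has missed an update that
# originated at RM 2, so the query is held back until gossip delivers it.
value_ts = [2, 5, 5]
q_prev = [2, 4, 6]
print(vts_le(q_prev, value_ts))        # False - query must wait
print(vts_merge(q_prev, value_ts))     # [2, 5, 6]
```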
  • 30. Gossip update operations  Update operations are processed in causal order – a FE sends the update operation u.op, u.prev, u.id to RM i  a FE can send a request to several RMs, using the same id – When RM i receives an update request, it checks whether it is new, by looking for the id in its executed operation table and its log – if it is new, the RM  increments by 1 the ith element of its replica timestamp,  assigns a unique vector timestamp ts to the update,  and stores the update in its log as logRecord = <i, ts, u.op, u.prev, u.id> – The timestamp ts is calculated from u.prev by replacing its ith element by the ith element of the replica timestamp. – The RM returns ts to the FE, which merges it with its vector timestamp. – For stability, u.prev ≤ valueTS; that is, valueTS reflects all updates seen by the FE. – When stable, the RM applies the operation u.op to the value, updates valueTS and adds u.id to the executed operation table.
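A sketch of these steps for RM i, with assumed dictionary-based structures (not the book's notation); updates that are not yet stable simply stay in the log, to be applied during gossip processing (sketched after the next slide).

```python
# A sketch of how RM i might accept an update in the gossip architecture.
def receive_update(rm, i, u_op, u_prev, u_id):
    """rm is a dict with keys: replica_ts, value, value_ts, log, executed."""
    # Ignore the request if it has been seen before (executed table or log).
    if u_id in rm["executed"] or any(r["id"] == u_id for r in rm["log"]):
        return None
    # Accept: bump the ith element of the replica timestamp and derive ts
    # from u.prev by replacing its ith element.
    rm["replica_ts"][i] += 1
    ts = list(u_prev)
    ts[i] = rm["replica_ts"][i]
    rm["log"].append({"i": i, "ts": ts, "op": u_op, "prev": u_prev, "id": u_id})
    # Apply immediately if stable: u.prev <= valueTS component-wise.
    if all(p <= v for p, v in zip(u_prev, rm["value_ts"])):
        u_op(rm["value"])                                   # apply to the value
        rm["value_ts"] = [max(a, b) for a, b in zip(rm["value_ts"], ts)]
        rm["executed"].add(u_id)
    return ts   # returned to the FE, which merges it with its timestamp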
  • 31. 31 Gossip messages  an RM uses entries in its timestamp table to estimate which updates another RM has not yet received – The timestamp table contains a vector timestamp for each other replica, collected from gossip messages  gossip message, m contains log m.log and replica timestamp m.ts  an RM receiving gossip message m has the following main tasks – merge the arriving log with its own (omit those with ts ≤ replicaTS) – apply in causal order updates that are new and have become stable – remove redundant entries from the log and executed operation table when it is known that they have been applied by all RMs – merge its replica timestamp with m.ts, so that it corresponds to the additions in the log •
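A sketch of gossip-message processing along these lines; it uses the same assumed structures as the update sketch above, makes a single pass over the log for brevity (a real RM would repeat until no further updates become stable), and omits log/table compaction.

```python
# A sketch of what an RM might do on receiving a gossip message m = (m.log, m.ts).
def receive_gossip(rm, m_log, m_ts):
    # Merge the arriving log with our own, omitting records we already hold
    # (those whose ts is already covered by our replica timestamp).
    for record in m_log:
        if not all(x <= y for x, y in zip(record["ts"], rm["replica_ts"])):
            rm["log"].append(record)
    # Merge our replica timestamp with m.ts so it covers the new log entries.
    rm["replica_ts"] = [max(a, b) for a, b in zip(rm["replica_ts"], m_ts)]
    # Apply updates that are new and have become stable (u.prev <= valueTS),
    # approximating causal order by processing smaller timestamps first.
    for record in sorted(rm["log"], key=lambda r: sum(r["ts"])):
        stable = all(p <= v for p, v in zip(record["prev"], rm["value_ts"]))
        if stable and record["id"] not in rm["executed"]:
            record["op"](rm["value"])
            rm["value_ts"] = [max(a, b)
                              for a, b in zip(rm["value_ts"], record["ts"])]
            rm["executed"].add(record["id"])
```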
  • 32. Discussion of the gossip architecture  the gossip architecture is designed to provide a highly available service  clients with access to a single RM can work even when other RMs are inaccessible – but it is not suitable for data such as bank accounts – it is inappropriate for updating replicas in real time (e.g. a conference)  scalability – as the number of RMs grows, so does the number of gossip messages – for R RMs, the number of messages per request (2 for the request and the rest for gossip) = 2 + (R-1)/G, where G is the number of updates per gossip message – increasing G reduces the number of gossip messages but makes latency worse  for applications where queries are more frequent than updates, use some read-only replicas, which are updated only by gossip messages
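Plugging assumed numbers into the formula above, e.g. R = 10 RMs and G = 3 updates per gossip message:

```python
# Illustrative arithmetic for the slide's formula: messages per request = 2 + (R-1)/G.
R, G = 10, 3       # assumed values, not from the book
print(2 + (R - 1) / G)   # 5.0 messages per request; raising G lowers this but adds latency
```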
  • 33. 14.5 Transactions with replicated data  objects in transactional systems are replicated to enhance availability and performance – the effect of transactions on replicated objects should be the same as if they had been performed one at a time on a single set of objects. – this property is called one-copy serializability. – it is similar to, but not to be confused with, sequential consistency.  sequential consistency does not take transactions into account – each RM provides concurrency control and recovery of its own objects  we assume two-phase locking in this section – replication makes recovery more complicated  when an RM recovers, it restores its objects with information from other RMs 34 •
  • 34. 14.5.1 Architectures for replicated transactions  We assume that an FE sends requests to one of a group of RMs – in the primary copy approach, all FEs communicate with a single RM which propagates updates to back-ups – in other schemes, FEs may communicate with any RM and coordination between RMs is more complex – an RM that receives a request is responsible for getting cooperation from the other RMs  rules as to how many RMs are involved vary with the replication scheme, e.g. in the read one/write all scheme, one RM is required for a read request and all RMs for a write request  propagate requests immediately or at the end of a transaction? – in the primary copy scheme, we can wait until the end of the transaction (concurrency control is applied at the primary) – but if transactions access the same objects at different RMs, we need to propagate the requests so that concurrency control can be applied  two-phase commit protocol – becomes a two-level nested 2PC. If a coordinator or worker is an RM, it will communicate with other RMs that it passed requests to during the transaction
  • 35. Transactions on replicated data  in read one/write all replication, one RM is required for a read request and all RMs for a write request – every write operation must be performed at all RMs, each of which applies a write lock – each read operation is performed by a single RM, which sets a read lock. Consider pairs of operations by different transactions on the same object: any pair of write operations will require conflicting locks at all of the RMs; a read operation and a write operation will require conflicting locks at a single RM. Thus one-copy serializability is achieved. [Figure 14.10: transactions T (getBalance(A)) and U (deposit(B,3)) issued by clients via front ends against replica managers holding copies of A and B.]
  • 36. 14.5.2 Available copies replication  the simple read one/write all scheme is not realistic – because it cannot be carried out if some of the RMs are unavailable – either because they have crashed or because of a communication failure  the available copies replication scheme is designed to allow some RMs to be temporarily unavailable – a read request can be performed by any available RM – write requests are performed by the receiving RM and all other available RMs in the group
  • 37. Available copies – read one/write all available  local concurrency control achieves one-copy serializability provided the set of RMs does not change  but we have RMs failing and recovering. [Figure 14.11: transactions T and U run against replica managers X and Y (copies of A) and M, N and P (copies of B).] T's getBalance is performed by X, whereas T's deposit is performed by M, N and P. At X, T has read A and has locked it; therefore U's deposit is delayed until T finishes.
  • 38. 39 Available copies  Replica manager failure – An RM can fail by crashing and is replaced by a new process  the new process restores its state from its recovery file – FEs use timeouts in case an RM fails  then try the request at another RM  in the case of a write, the RM passing it on may observe failures – If an RM is doing recovery, it rejects requests (& FE tries another RM) – For one-copy serializability, failures and recoveries are serialized with respect to transactions  that is, if a transaction observes that a failure occurs, it must be observed before it started or after it finished  one-copy serializability is not achieved if different transactions make conflicting failure observations •
  • 39. Available copies replication: RM failure example. Assume that RM X fails just after T has performed getBalance and RM N fails just after U has performed getBalance, i.e. both RMs fail before T and U have performed their deposit operations – therefore T's deposit will be performed at RMs M and P (all available) – and U's deposit will be performed at RM Y (all available). Concurrency control at X does not prevent U from updating A at Y; concurrency control at M does not prevent T from updating B at M and P. Therefore additional concurrency control is required to prevent inconsistent results between a read in one transaction and a write in another transaction. [Figure 14.11: as before, T and U run against replica managers X, Y (copies of A) and M, N, P (copies of B).]
  • 40. Available copies replication  Local validation (the additional concurrency control) – before a transaction commits, it checks for failures and recoveries of the RMs it has contacted, e.g. T would check that N is still unavailable and that X, M and P are still available; if this is the case, T can commit  this implies that X failed after T validated and before U validated, i.e. we have N fails → T commits → X fails → U validates (above, we said X fails before T's deposit, in which case T would have to abort)  U checks whether N is still available (no) and X still unavailable, therefore U must abort. Implementation note: after all the operations of a transaction have been carried out, the FE informs the coordinator of failed RMs it knows about; the coordinator can attempt to communicate with any RMs noted to have failed; then, in doing the 2PC, it will discover whether any RMs involved in the transaction have subsequently failed.
  • 41. Network partitions divide RMs into subgroups  the subgroups cannot communicate with one another, e.g. the RMs doing the deposit can't communicate with those doing the withdraw. [Figure 14.12: a network partition separates a client doing deposit(B,3) from a client doing withdraw(B,4), each with access to different replica managers holding copies of B.]  replication schemes assume partitions will be repaired – therefore operations done during a partition must not cause inconsistency: reading during a partition would not cause inconsistency, but writing might  optimistic schemes, e.g. available copies with validation, resolve inconsistencies when a partition is repaired; we have to be able to do compensating actions, otherwise the scheme is unsuitable, e.g. for banking (we are not studying this; see section 14.5.4)  pessimistic schemes (e.g. quorum consensus) prevent inconsistency
  • 42. 14.5.5 Quorum consensus methods  To prevent transactions in different partitions from producing inconsistent results – make a rule that operations can be performed in only one of the partitions  RMs in different partitions cannot communicate: – each subgroup decides independently whether it can perform operations  A quorum is a subgroup of RMs whose size gives it the right to perform operations – e.g. having the majority of the RMs could be the criterion  in quorum consensus schemes – update operations may be performed by a subset of the RMs,  and the other RMs have out-of-date copies  version numbers or timestamps are used to determine which copies are up-to-date  operations are applied only to copies with the current version number
  • 43. Gifford's quorum consensus file replication scheme  a number of 'votes' is assigned to each physical copy of a logical file at an RM – a vote is a weighting giving the desirability of using a particular copy – each read operation must obtain a read quorum of R votes before it can read from any up-to-date copy – each write operation must obtain a write quorum of W votes before it can do an update operation – R and W are set for a group of replica managers such that  W > half the total votes  R + W > total number of votes for the group – ensuring that any pair (a read quorum and a write quorum, or two write quora) contain common copies – therefore it is not possible to perform conflicting operations on the same file in different partitions.
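A small sketch of the two constraints, checked against the voting configurations from the examples slide below; the function name is illustrative.

```python
# A sketch of Gifford's constraints on a voting configuration: W must exceed
# half the total votes, and R + W must exceed the total, so that any read
# quorum overlaps every write quorum and any two write quora overlap.
def valid_configuration(votes, R, W):
    total = sum(votes)
    return W > total / 2 and R + W > total

# Example 2 from the examples slide: votes (2, 1, 1), R = 2, W = 3.
print(valid_configuration([2, 1, 1], R=2, W=3))   # True
# A configuration that would allow conflicts across partitions:
print(valid_configuration([1, 1, 1], R=1, W=2))   # False (R + W not > 3)
```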
  • 44. Gifford's quorum consensus - performing read and write operations  before a read operation, a read quorum is collected – by making version number enquiries at RMs to find a set of copies, the sum of whose votes is not less than R (not all of these copies need be up to date) – as each read quorum overlaps with every write quorum, every read quorum is certain to include at least one current copy – the read operation may be applied to any up-to-date copy  before a write operation, a write quorum is collected – by making version number enquiries at RMs to find a set with up-to-date copies, the sum of whose votes is not less than W – if there are insufficient up-to-date copies, then an out-of-date file is replaced with a current one, to enable the quorum to be established – the write operation is then applied by each RM in the write quorum, the version number is incremented and completion is reported to the client – the files at the remaining available RMs are then updated in the background  Two-phase read/write locking is used for concurrency control – the version number enquiry sets read locks (read and write quora overlap)
  • 45. Gifford’s quorum consensus: configurability of groups of replica managers  groups of RMs can be configured to give different performance or reliability characteristics – once the R and W have been chosen for a set of RMs: – the reliability and performance of write operations may be increased by decreasing W – and similarly for reads by decreasing R  the performance of read operations is degraded by the need to collect a read consensus  examples from Gifford – three examples show the range of properties that can be achieved by allocating weights to the various RMs in a group and assigning R and W appropriately – weak representatives (on local disk) have zero votes, get a read quorum from RMs with votes and then read from the local copy 46 •
  • 46. Gifford's quorum consensus examples (1979)
                                    Example 1   Example 2   Example 3
Latency        Replica 1                 75          75          75
(milliseconds) Replica 2                 65         100         750
               Replica 3                 65         750         750
Voting         Replica 1                  1           2           1
configuration  Replica 2                  0           1           1
               Replica 3                  0           1           1
Quorum sizes   R                          1           2           1
               W                          1           3           3
Derived performance of the file suite:
Read           Latency                   65          75          75
               Blocking probability    0.01      0.0002    0.000001
Write          Latency                   75         100         750
               Blocking probability    0.01      0.0101        0.03
(blocking probability = probability that a quorum cannot be obtained, assuming a probability of 0.01 that any single RM is unavailable)
Example 1 is configured for a file with a high read-to-write ratio, with several weak representatives and a single RM. Replication is used for performance, not reliability. The RM can be accessed in 75 ms and the two clients can access their weak representatives in 65 ms, resulting in lower latency and less network traffic.
Example 2 is configured for a file with a moderate read-to-write ratio which is accessed mainly from one local network. The local RM has 2 votes and the remote RMs 1 vote each. Reads can be done at the local RM, but writes must access one local RM and one remote RM. If the local RM fails, only reads are allowed.
Example 3 is configured for a file with a very high read-to-write ratio. Reads can be done at any RM and the probability of the file being unavailable is small, but writes must access all RMs.
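The blocking probabilities in the table can be reproduced from the reasoning in the editor's notes, assuming each RM is independently unavailable with probability 0.01; the script below is a sketch of that arithmetic.

```python
# Reproducing the table's blocking probabilities, following the editor's-note
# reasoning and an assumed per-RM unavailability of p = 0.01.
p = 0.01

# Example 2 (votes 2,1,1; R = 2, W = 3):
# a read quorum is blocked if the 2-vote RM is down together with either
# 1-vote RM; the note approximates this as 2 * (p * p).
read_blocked_ex2 = 2 * p * p                 # ~0.0002
# a write quorum needs all three RMs: blocked if the 2-vote RM is down, or
# (approximately) if the two 1-vote RMs are both down.
write_blocked_ex2 = p + p * p                # ~0.0101

# Example 3 (votes 1,1,1; R = 1, W = 3):
read_blocked_ex3 = p ** 3                    # all three RMs down: 0.000001
write_blocked_ex3 = 3 * p                    # any one of the three down: 0.03

print(read_blocked_ex2, write_blocked_ex2, read_blocked_ex3, write_blocked_ex3)
```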
  • 47. Summary for Gossip and replication in transactions  the Gossip architecture is designed for highly available services – it uses a lazy form of replication in which RMs update one another from time to time by means of gossip messages – it allows clients to make updates to local replicas while partitioned – RMs exchange updates with one another when reconnected  replication in transactions – primary-backup architectures can be used – other architectures allow FEs to use any RM  available copies allows RMs to fail, but cannot deal with partitions  quorum consensus does allow transactions to progress in the presence of partitions, but the performance of read operations is degraded by the need to collect a read consensus

Editor's Notes

  1. Times taken to present these slides Sections 14.1-14.2 about 1 hour 15 minutes (but omitted the details of view synchronous group communication on page 563) Then sections 14.4 and 14.5 took about 1 hour 15 minutes, but omitted 14.4.2 Bayou and 14.4.3 Coda and 14.5.6 virtual partition algorithm.
  2. note- caching is a form of replication (e.g. web servers and browsers) DNS (chapter 9) uses replication extensively caches have incomplete data sets, so no use for availability mention user that disconnects and takes laptop on a train, accepts inconsistency, then reconnects and resolves conflicts
  3. replication transparency makes a set of replicas appear as a single logical object to a client.
  4. replica managers may be servers or may be part of an application. A state machine is not affected by external stimuli such as clocks or sensors.
  5. The front ends may be in the client's address space or in a separate process (like the client module in NFS). The RMs are in different computers.
  6. Five phases are involved in performing a single request on a replicated object. The actions vary according to the type of replication (e.g. in a transaction, for FT in a state machine, Gossip). coordination - may not apply the request in case of failure
  7. Group communication requires a membership service
  8. e.g. specification of bank account does deposits. withdrawals properly, does not lose money x and y are bank accounts
  9. virtual interleaving does not necessarily occur at any one RM
  10. possible under a naive replication strategy, even if neither A nor B fails - the update at B has not yet been propagated to A when client 2 reads it. It is not linearizable because client 2's getBalance is after client 1's setBalance in real time, but the following interleaving satisfies both criteria for sequential consistency: getBalanceA(y) → 0; getBalanceA(x) → 0; setBalanceB(x,1); setBalanceA(y,2)
  11. secondaries are called ‘back-up’ or ‘slave’
  12. if two different clients use different backups to take over, then the system could behave incorrectly the requirements are met if view-synchronous group communication is used, but that is expensive If using view synchronous group communication, the surviving backups receive a view without the primary, they use an agreed function to calculate who is primary. New primary registers with name service View synchrony also allows processes to agree which operations were performed before primary failed.
  13. queries, FE may need to contact another RM update ops - for higher reliability, block client until update given to f+1 RMs query response - will have waited until ordering constraints apply
  14. client-to-client communication can lead to causal relationships between operations, it also occurs via front ends.
  15. value - application state (each RM is a state machine) - only talking about one value here value timestamp (updated each time an update is applied to the value) update log - held-back until ordering allows it to be applied (when it becomes stable) also held until updates have been received by all other RMs replica timestamp - indicates updates accepted by RM in log (different from value’s timestamp if some updates are not yet stable) executed operation table - prevents an operation being applied twice e.g. if received from other RMs as well as FE timestamp table - collection of vector timestamps received from other RMs in gossip messages. Used to know when RMs have received updates
  16. The example has 3 RMs, numbered 0, 1, 2. RM 0 missed an update from RM 2, as 6 is not ≤ 5; the FE must have used RM 2 last time. On the merge, the FE had not seen the latest update at RM 1 (prev had 4 and the RM had 5). More details of what is stored for updates are on page 578. Total ordering can be implemented by using a primary RM as a sequencer.
  17. the timestamp table provides only an estimate - because the other RM may have received updates from elsewhere
  18. mention quorum consensus in which there are more RMs involved in a read and less in a write note about read one/write all - write operation must be applied at all RMs so write locks are set at all of them read operation at only one RM, but is bound to conflict with one of the write locks one copy serialisability is achieved
  19. T’s getBalance is performed by X and its deposit by M, N and P at X T has read A and it is locked until T finishes
  20. T did getBalance at X (before it failed). Then it did deposit at M and P, noticing that N crashed U did getBalance at N and deposit at Y. (it observes X fails)
  21. e.g. those doing the deposit can’t communicate with those doing the withdraw reading during a partition would not cause inconsistency, writing might optimistic e.g available copies with validation - resolve consistencies when partition repaired . Have to be able to do compensating actions, otherwise unsuitable (e.g. unsuitable for banking) (not doing this see section 14.5.4)
  22. availability - Ex 1: read quorum blocking .01, write quorum blocking .01. Ex 2: read quorum 2, blocked if the 2-vote RM fails together with either 1-vote RM - i.e. 2*(.01*.01); write quorum 3, blocked if the first is missing (.01) or the other two are (.01*.01), total = .0101. Ex 3: read quorum 1, probability of all 3 failing = (.01*.01*.01); write quorum 3, probability of one being unavailable = 3*.01