Distributed Systems Course: Distributed transactions
Teaching material based on Distributed Systems: Concepts and Design, Edition 3, Addison-Wesley 2001.
Copyright © George Coulouris, Jean Dollimore, Tim Kindberg 2001
email: authors@cdk2.net
This material is made available for private study and for direct use by individual teachers.
It may not be included in any product or employed in any service without the written permission of the authors.
13.1 Introduction 
13.2 Flat and nested distributed transactions 
13.3 Atomic commit protocols 
13.4 Concurrency control in distributed 
transactions 
13.5 Distributed deadlocks 
13.6 Transaction recovery
Commitment of distributed transactions - introduction 
 a distributed transaction refers to a flat or nested 
transaction that accesses objects managed by 
multiple servers 
 When a distributed transaction comes to an end
– either all of the servers commit the transaction
– or all of them abort the transaction.
 one of the servers acts as coordinator; it must ensure the same outcome at all of the servers.
 the ‘two-phase commit protocol’ is the most 
commonly used protocol for achieving this 
Distributed transactions 
Figure 13.1: (a) Flat transaction (b) Nested transactions
[Figure: in (a) a client's flat transaction T accesses objects at servers X, Y and Z;
in (b) a client's nested transaction T opens subtransactions T1 (at server X) and T2 (at server Y),
which in turn open subtransactions T11, T12 (at servers M, N) and T21, T22 (at servers N, P).]
A flat client transaction completes each of 
its requests before going on to the next 
one. Therefore, each transaction accesses 
servers’ objects sequentially 
In a nested transaction, the top-level 
transaction can open 
subtransactions, and each 
subtransaction can open further 
subtransactions down to any 
depth of nesting 
In the nested case, 
subtransactions at the same level 
can run concurrently, so T1 and 
T2 are concurrent, and as they 
invoke objects in different servers, 
they can run in parallel. 
Nested banking transaction
 client transfers $10 from A to C and then transfers $20 from B to D
Figure 13.2:
    T = openTransaction
        openSubTransaction
            a.withdraw(10);
        openSubTransaction
            b.withdraw(20);
        openSubTransaction
            c.deposit(10);
        openSubTransaction
            d.deposit(20);
    closeTransaction
[Figure: the top-level transaction T has subtransactions T1, T2, T3 and T4 operating on
accounts A, B, C and D at servers X, Y and Z.]
requests can be run in parallel - with several servers, the nested transaction is more efficient
The coordinator of a flat distributed transaction
Why might a participant abort a transaction?
 Servers execute requests in a distributed transaction
– when it commits, they must communicate with one another to coordinate their actions
– a client starts a transaction by sending an openTransaction request to a coordinator in any server (next slide)
 it returns a TID that is unique in the distributed system (e.g. server ID + local transaction number - see the sketch below)
 at the end, it will be responsible for committing or aborting the transaction
– each server managing an object accessed by the transaction is a participant - it joins the transaction (next slide)
 a participant keeps track of the objects involved in the transaction
 at the end it cooperates with the coordinator in carrying out the commit protocol
– note that a participant can call abortTransaction in the coordinator
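A globally unique TID can be formed, for example, from the coordinator's server identifier plus a local transaction number. A minimal illustrative sketch in Java (the class and field names are assumptions, not the book's API):

    // Sketch of a globally unique transaction identifier (TID):
    // the coordinator's server identifier plus a local sequence number.
    public final class TransactionId {
        private final String serverId;    // e.g. "BranchX"
        private final long localNumber;   // local transaction counter at that server

        public TransactionId(String serverId, long localNumber) {
            this.serverId = serverId;
            this.localNumber = localNumber;
        }

        @Override public String toString() { return serverId + ":" + localNumber; }

        @Override public boolean equals(Object o) {
            if (!(o instanceof TransactionId)) return false;
            TransactionId t = (TransactionId) o;
            return serverId.equals(t.serverId) && localNumber == t.localNumber;
        }

        @Override public int hashCode() { return 31 * serverId.hashCode() + Long.hashCode(localNumber); }
    }

Because the server identifier is embedded in the TID, no two servers can generate the same identifier, and the TID can be passed with every request (e.g. withdraw(T, 3)).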
A flat distributed banking transaction
Figure 13.3: a client's (flat) banking transaction involves accounts A, B, C and D at servers BranchX,
BranchY and BranchZ. Each server is shown with a participant, which joins the transaction by invoking
the join method in the coordinator.
    T = openTransaction
        a.withdraw(4);
        c.deposit(4);
        b.withdraw(3);
        d.deposit(3);
    closeTransaction
[Figure: the client holds T; participants at BranchX (account A), BranchY (account B) and
BranchZ (accounts C and D) each call join on the coordinator.]
 openTransaction goes to the coordinator
 Note that the TID (T) is passed with each request, e.g. b.withdraw(T, 3)
 Note: the coordinator is in one of the servers, e.g. BranchX
The join operation 
 The interface for Coordinator is shown in Figure 12.3 
– it has openTransaction, closeTransaction and abortTransaction 
– openTransaction returns a TID which is passed with each operation so that servers know which transaction is accessing their objects
 The Coordinator interface provides an additional method, join, 
which is used whenever a new participant joins the 
transaction: 
– join(Trans, reference to participant) 
– informs a coordinator that a new participant has joined the transaction Trans. 
– the coordinator records the new participant in its participant list. 
– the fact that the coordinator knows all the participants and each participant 
knows the coordinator will enable them to collect the information that will be 
needed at commit time. 
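The slides describe these operations informally; a hedged Java-style sketch of such a coordinator interface might look as follows (the types TransactionId and ParticipantRef are illustrative assumptions):

    // Sketch of the coordinator interface described above.
    public interface Coordinator {
        TransactionId openTransaction();              // returns a new, globally unique TID
        void closeTransaction(TransactionId trans);   // client asks to commit the transaction
        void abortTransaction(TransactionId trans);   // client or a participant aborts it
        // a new participant informs the coordinator that it has joined transaction trans;
        // the coordinator adds the reference to its participant list
        void join(TransactionId trans, ParticipantRef participant);
    }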
Atomic commit protocols 
 transaction atomicity requires that at the end, 
– either all of its operations are carried out or none of them. 
 in a distributed transaction, the client has requested the 
operations at more than one server 
 one-phase atomic commit protocol 
– the coordinator tells the participants whether to commit or abort 
– what is the problem with that? 
– this does not allow one of the servers to decide to abort – it may have 
discovered a deadlock or it may have crashed and been restarted 
 two-phase atomic commit protocol 
– is designed to allow any participant to choose to abort a transaction 
– phase 1 - each participant votes. If it votes to commit, it is prepared. It cannot 
change its mind. In case it crashes, it must save updates in permanent store 
– phase 2 - the participants carry out the joint decision 
Note: the decision in phase 2 could be commit or abort - participants record it in permanent store
Failure model for the commit protocols 
 Recall the failure model for transactions in Chapter 12 
– this applies to the two-phase commit protocol 
 Commit protocols are designed to work in 
– asynchronous system (e.g. messages may take a very long time) 
– servers may crash 
– messages may be lost. 
– assume corrupt and duplicated messages are removed.
– no byzantine faults – servers either crash or they obey the requests they receive
 2PC is an example of a protocol for reaching a consensus. 
– Chapter 11 says consensus cannot be reached in an asynchronous system if 
processes sometimes fail. 
– however, 2PC does reach consensus under those conditions. 
– because crash failures of processes are masked by replacing a crashed 
process with a new process whose state is set from information saved in 
permanent storage and information held by other processes. 
The two-phase commit protocol 
 During the progress of a transaction, the only 
communication between coordinator and participant 
is the join request 
– The client request to commit or abort goes to the coordinator
 if the client or a participant requests abort, the coordinator informs the participants immediately
 if the client asks to commit, the 2PC comes into use
 2PC 
– voting phase: coordinator asks all participants if they can commit 
 if yes, participant records updates in permanent storage and then votes 
– completion phase: coordinator tells all participants to commit or abort 
– the next slide shows the operations used in carrying out the protocol 
Think about: how many messages are sent between the coordinator and each participant?
Why does a participant record its updates in permanent storage at this stage?
Operations for two-phase commit protocol (Figure 13.4)
canCommit?(trans) -> Yes / No
    Call from coordinator to participant to ask whether it can commit a transaction.
    Participant replies with its vote. (This is a request with a reply.)
doCommit(trans)
    Call from coordinator to participant to tell participant to commit its part of a transaction.
doAbort(trans)
    Call from coordinator to participant to tell participant to abort its part of a transaction.
    (doCommit and doAbort are asynchronous requests, to avoid delays.)
haveCommitted(trans, participant)
    Call from participant to coordinator to confirm that it has committed the transaction.
    (Asynchronous request.)
getDecision(trans) -> Yes / No
    Call from participant to coordinator to ask for the decision on a transaction after it has voted
    Yes but has still had no reply after some delay. Used to recover from server crash or delayed messages.
 participant interface: canCommit?, doCommit, doAbort
 coordinator interface: haveCommitted, getDecision
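These operations could be rendered as Java-style interfaces; a hedged sketch, with TransactionId and ParticipantRef as assumed types and booleans standing in for the Yes/No votes:

    // Participant side of 2PC (cf. Figure 13.4), sketched as a Java interface.
    public interface Participant2PC {
        boolean canCommit(TransactionId trans);   // vote: true = Yes, false = No
        void doCommit(TransactionId trans);       // asynchronous: commit its part
        void doAbort(TransactionId trans);        // asynchronous: abort its part
    }

    // Coordinator side, as seen by participants.
    public interface Coordinator2PC {
        void haveCommitted(TransactionId trans, ParticipantRef participant); // confirmation
        boolean getDecision(TransactionId trans); // true = commit, false = abort
    }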
The two-phase commit protocol 
Figure 13.5 
• Phase 1 (voting phase): 
• 1. The coordinator sends a canCommit? request to each of the participants in 
the transaction. 
• 2. When a participant receives a canCommit? request it replies with its vote 
(Yes or No) to the coordinator. Before voting Yes, it prepares to commit by saving 
objects in permanent storage. If the vote is No the participant aborts immediately. 
• Phase 2 (completion according to outcome of vote): 
• 3. The coordinator collects the votes (including its own). 
(a) If there are no failures and all the votes are Yes, the coordinator decides to
commit the transaction and sends a doCommit request to each of the participants.
(b) Otherwise the coordinator decides to abort the transaction and sends doAbort
requests to all participants that voted Yes.
• 4. Participants that voted Yes are waiting for a doCommit or doAbort request from the 
coordinator. When a participant receives one of these messages it acts accordingly 
and in the case of commit, makes a haveCommitted call as confirmation to the 
coordinator. 
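A minimal sketch of the coordinator's side of this protocol, assuming the Participant2PC interface sketched earlier; failure handling, timeouts and the logging of the decision are omitted:

    import java.util.List;

    // Illustrative coordinator logic for 2PC (no failure handling shown).
    class TwoPhaseCommitCoordinator {
        boolean twoPhaseCommit(TransactionId trans, List<Participant2PC> participants) {
            // Phase 1 (voting): ask every participant whether it can commit.
            boolean allYes = true;
            for (Participant2PC p : participants) {
                if (!p.canCommit(trans)) { allYes = false; break; }
            }
            // The decision would be recorded in permanent storage here (not shown).
            // Phase 2 (completion): announce the joint decision.
            for (Participant2PC p : participants) {
                if (allYes) p.doCommit(trans);
                else p.doAbort(trans);   // strictly only needed for participants that voted Yes
            }
            return allYes;
        }
    }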
Communication in two-phase commit protocol
Figure 13.6: Coordinator - step 1: send canCommit? (status: prepared to commit, waiting for votes);
step 3: send doCommit (status: committed); finally done. Participant - step 2: reply Yes (status:
prepared to commit, uncertain); step 4: send haveCommitted (status: committed).
 Time-out actions in the 2PC
 to avoid blocking forever when a process crashes or a message is lost
– uncertain participant (step 2) has voted Yes; it can't decide on its own
 it uses the getDecision method to ask the coordinator about the outcome
– participant has carried out the client requests, but has not had a canCommit?
from the coordinator. It can abort unilaterally
– coordinator delayed in waiting for votes (step 1). It can abort and send
doAbort to the participants.
Think about the coordinator in step 1 and the participant in step 2 - what is the problem in each case?
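An uncertain participant's time-out action could be sketched as a loop that keeps asking the coordinator for the outcome via getDecision; the retry delay and the helper methods commitLocally/abortLocally are assumptions used only for illustration:

    // Sketch: an uncertain participant (voted Yes, but no doCommit/doAbort yet)
    // repeatedly asks the coordinator for the decision.
    void resolveUncertain(Coordinator2PC coordinator, TransactionId trans) throws InterruptedException {
        while (true) {
            try {
                if (coordinator.getDecision(trans)) commitLocally(trans);   // hypothetical helper
                else abortLocally(trans);                                   // hypothetical helper
                return;
            } catch (Exception coordinatorUnavailable) {
                Thread.sleep(1000);   // coordinator crashed or message delayed; retry later
            }
        }
    }

The participant cannot abort unilaterally at this point, which is why it may be delayed indefinitely while the coordinator is unavailable.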
Performance of the two-phase commit protocol 
 if there are no failures, the 2PC involving N 
participants requires 
– N canCommit? messages and replies, followed by N doCommit 
messages. 
 the cost in messages is proportional to 3N, and the cost in time is three 
rounds of messages. 
 The haveCommitted messages are not counted 
– there may be arbitrarily many server and communication failures
– 2PC is guaranteed to complete eventually, but it is not possible to
specify a time limit within which it will be completed
 delays to participants in the uncertain state
 three-phase commit protocols (3PC) have been designed to alleviate such delays
• they require more messages and more rounds for the normal case
13.3.2 Two-phase commit protocol for nested 
transactions 
 Recall Fig 13.1b, top-level transaction T and subtransactions 
T1, T2, T11, T12, T21, T22 
 A subtransaction starts after its parent and finishes before it 
 When a subtransaction completes, it makes an independent 
decision either to commit provisionally or to abort. 
– A provisional commit is not the same as being prepared: it is a local decision 
and is not backed up on permanent storage. 
– If the server crashes subsequently, its replacement will not be able to carry out 
a provisional commit. 
 A two-phase commit protocol is needed for nested 
transactions 
– it allows servers of provisionally committed transactions that have crashed to 
abort them when they recover. 
Figure 13.7
Operations in coordinator for nested transactions
openSubTransaction(trans) -> subTrans
    Opens a new subtransaction whose parent is trans and returns a unique subtransaction identifier.
getStatus(trans) -> committed, aborted, provisional
    Asks the coordinator to report on the status of the transaction trans. Returns values
    representing one of the following: committed, aborted, provisional.
The TID of a subtransaction is an extension of its parent's TID, so that a subtransaction can work
out the TID of the top-level transaction. The client finishes a set of nested transactions by calling
closeTransaction or abortTransaction in the top-level transaction.
 This is the interface of the coordinator of a subtransaction.
– It allows it to open further subtransactions
– It allows its subtransactions to enquire about its status
 A client starts by using openTransaction to open a top-level transaction.
– This returns a TID for the top-level transaction
– The TID can be used to open a subtransaction
 The subtransaction automatically joins the parent and a TID is returned.
Transaction T decides whether to commit
Figure 13.8: T1 - provisional commit (at X); T2 - aborted (at Y); T11 - aborted (at M);
T12 - provisional commit (at N); T21 - provisional commit (at N); T22 - provisional commit (at P).
 T12 has provisionally committed and T11 has aborted, but the fate of T12
depends on its parent T1 and eventually on the top-level transaction, T.
 Although T21 and T22 have both provisionally committed, T2 has
aborted and this means that T21 and T22 must also abort.
 Suppose that T decides to commit although T2 has aborted, and
that T1 decides to commit although T11 has aborted.
 Recall that
1. a parent can commit even if a subtransaction aborts
2. if a parent aborts, then its subtransactions must abort
– In the figure, each subtransaction has either provisionally committed or aborted
Information held by coordinators of nested transactions
Figure 13.9 (each row gives the coordinator of a transaction, its child transactions, whether it is a
participant, its provisional commit list and its abort list):
    T:        children T1, T2;    participant: yes;                 provisional commit list: T1, T12;   abort list: T11, T2
    T1:       children T11, T12;  participant: yes;                 provisional commit list: T1, T12;   abort list: T11
    T2:       children T21, T22;  participant: no (aborted);                                            abort list: T2
    T11:                          participant: no (aborted);                                            abort list: T11
    T12, T21:                     participant: T12 but not T21;     provisional commit list: T21, T12
    T22:                          participant: no (parent aborted);                                     abort list: T22
 When a top-level transaction commits it carries out a 2PC
 Each coordinator has a list of its subtransactions
 At provisional commit, a subtransaction reports its status and the status of its descendents to its parent
 If a subtransaction aborts, it tells its parent
Notes: T12 and T21 share a coordinator, as they both run at server N. When T2 is aborted it tells T
(no information about descendents). A subtransaction whose ancestor has aborted (e.g. T21 and T22) is
called an orphan; an orphan uses getStatus to ask its parent about the outcome, and should abort if
its parent has aborted.
canCommit? for hierarchic two-phase commit protocol
Figure 13.10
canCommit?(trans, subTrans) -> Yes / No
    Call from a coordinator to the coordinator of a child subtransaction to ask whether it can
    commit a subtransaction subTrans. The first argument trans is the transaction identifier of
    the top-level transaction. Participant replies with its vote Yes / No.
 The top-level transaction is the coordinator of the 2PC.
 participant list:
– the coordinators of all the subtransactions that have provisionally committed
– but do not have an aborted ancestor
– e.g. T, T1 and T12 in Figure 13.8
– if they vote Yes, they prepare to commit by saving state in permanent storage
 The state is marked as belonging to the top-level transaction
 The 2PC may be performed in a hierarchic or a flat manner
 Hierarchic 2PC - T asks canCommit? to T1 and T1 asks canCommit? to T12
– the subTrans argument is used to find the subtransaction to vote on; if absent, vote No
– the trans argument is used when saving the objects in permanent storage
Compare the advantages and disadvantages of the flat and nested approaches.
canCommit? for flat two-phase commit protocol
Figure 13.11
canCommit?(trans, abortList) -> Yes / No
    Call from coordinator to participant to ask whether it can
    commit a transaction. Participant replies with its vote Yes / No.
 Flat 2PC 
– the coordinator of the top-level transaction sends canCommit? messages 
to the coordinators of all of the subtransactions in the provisional commit 
list. 
– in our example, T sends to the coordinators of T1 and T12. 
– the trans argument is the TID of the top-level transaction 
– the abortList argument gives all aborted subtransactions 
 e.g. server N has T12 prov committed and T21 aborted 
– On receiving canCommit?, the participant (see the sketch below)
 looks in its list of transactions for any that match trans (e.g. T12 and T21 at N)
 prepares any that have provisionally committed and are not in abortList, and votes Yes
 if it can't find any, it votes No
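A sketch of this participant-side check, assuming the participant keeps a list of provisionally committed subtransactions for each top-level transaction; provisionalListFor and prepare are hypothetical helpers:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Set;

    // Sketch: participant handling canCommit?(trans, abortList) in the flat nested 2PC.
    boolean canCommit(TransactionId trans, Set<TransactionId> abortList) {
        List<TransactionId> toPrepare = new ArrayList<>();
        // provisionalListFor(trans): this server's provisionally committed subtransactions
        // that belong to top-level transaction trans (hypothetical helper)
        for (TransactionId sub : provisionalListFor(trans)) {
            if (!abortList.contains(sub)) {
                toPrepare.add(sub);
            }
        }
        if (toPrepare.isEmpty()) return false;            // nothing to commit here: vote No
        for (TransactionId sub : toPrepare) prepare(sub); // save state in permanent storage
        return true;                                      // vote Yes
    }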
Time-out actions in nested 2PC 
 With nested transactions delays can occur in the 
same three places as before 
– when a participant is prepared to commit 
– when a participant has finished but has not yet received canCommit? 
– when a coordinator is waiting for votes 
 Fourth place: 
– provisionally committed subtransactions of aborted subtransactions 
e.g. T22 whose parent T2 has aborted 
– use getStatus on parent, whose coordinator should remain active for a 
while 
– If parent does not reply, then abort 
Summary of 2PC 
 a distributed transaction involves several different servers. 
– A nested transaction structure allows 
 additional concurrency and 
 independent committing by the servers in a distributed transaction. 
 atomicity requires that the servers participating in a 
distributed transaction either all commit it or all abort it. 
 atomic commit protocols are designed to achieve this effect, 
even if servers crash during their execution. 
 the 2PC protocol allows a server to abort unilaterally. 
– it includes timeout actions to deal with delays due to servers crashing. 
– 2PC protocol can take an unbounded amount of time to complete but is 
guaranteed to complete eventually. 
13.4 Concurrency control in distributed transactions 
 Each server manages a set of objects and is 
responsible for ensuring that they remain consistent 
when accessed by concurrent transactions 
– therefore, each server is responsible for applying concurrency control 
to its own objects. 
– the members of a collection of servers of distributed transactions are 
jointly responsible for ensuring that they are performed in a serially 
equivalent manner 
– therefore if transaction T is before transaction U in their conflicting 
access to objects at one of the servers then they must be in that order 
at all of the servers whose objects are accessed in a conflicting 
manner by both T and U 
13.4.1 Locking 
 In a distributed transaction, the locks on an object 
are held by the server that manages it. 
– The local lock manager decides whether to grant a lock or make the 
requesting transaction wait. 
– it cannot release any locks until it knows that the transaction has been 
committed or aborted at all the servers involved in the transaction. 
– the objects remain locked and are unavailable for other transactions 
during the atomic commit protocol 
 an aborted transaction releases its locks after phase 1 of the protocol. 
Interleaving of transactions T and U at servers X and Y 
 in the example on page 529, we have 
– T before U at server X and U before T at server Y 
 different orderings lead to cyclic dependencies and 
distributed deadlock 
– detection and resolution of distributed deadlock in next section 
    T                                    U
    Write(A) at X   locks A
                                         Write(B) at Y   locks B
    Read(B) at Y    waits for U
                                         Read(A) at X    waits for T
13.4.2 Timestamp ordering concurrency control 
 Single server transactions 
– coordinator issues a unique timestamp to each transaction before it starts 
– serial equivalence ensured by committing objects in order of timestamps 
 Distributed transactions 
– the first coordinator accessed by a transaction issues a globally unique 
timestamp 
– as before the timestamp is passed with each object access 
– the servers are jointly responsible for ensuring serial equivalence
 that is, if T accesses an object before U, then T is before U at all objects
– coordinators agree on timestamp ordering
 a timestamp consists of a pair <local timestamp, server-id>
 the agreed ordering of pairs of timestamps is based on a comparison in
which the server-id part is less significant – the local timestamps should relate to real time
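The agreed ordering on <local timestamp, server-id> pairs can be expressed as a comparison in which the server-id only breaks ties; an illustrative Java sketch (class and field names are assumptions):

    // Globally agreed timestamp: compare local timestamps first,
    // and use the server-id only as the less significant part (tie-breaker).
    public final class GlobalTimestamp implements Comparable<GlobalTimestamp> {
        final long localTimestamp;   // from the issuing coordinator's clock or counter
        final String serverId;       // identifier of the issuing server

        GlobalTimestamp(long localTimestamp, String serverId) {
            this.localTimestamp = localTimestamp;
            this.serverId = serverId;
        }

        @Override
        public int compareTo(GlobalTimestamp other) {
            int byTime = Long.compare(localTimestamp, other.localTimestamp);
            return (byTime != 0) ? byTime : serverId.compareTo(other.serverId);
        }
    }

All servers applying this comparison see the same total order on transactions whether or not their clocks are synchronized; rough synchronization simply makes the order resemble the real-time order in which transactions started.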
Timestamp ordering concurrency control (continued)
Can the same ordering be achieved at all servers without clock synchronization?
Why is it better to have roughly synchronized clocks?
 The same ordering can be achieved at all servers 
even if their clocks are not synchronized 
– for efficiency it is better if local clocks are roughly synchronized 
– then the ordering of transactions corresponds roughly to the real time 
order in which they were started 
 Timestamp ordering 
– conflicts are resolved as each operation is performed 
– if this leads to an abort, the coordinator will be informed 
 it will abort the transaction at the participants 
– any transaction that reaches the client request to commit should 
always be able to do so 
 participant will normally vote yes 
 unless it has crashed and recovered during the transaction 
Optimistic concurrency control
 each transaction is validated before it is allowed to commit (use backward validation)
– validation checks the conflict rules: 1. write/read, 2. read/write, 3. write/write
– transaction numbers assigned at start of validation
– transactions serialized according to transaction numbers
– validation takes place in phase 1 of the 2PC protocol
 consider the following interleavings of T and U
– T before U at X and U before T at Y
– suppose T and U start validation at about the same time
    T                          U
    Read(A) at X               Read(B) at Y
    Write(A)                   Write(B)
    Read(B) at Y               Read(A) at X
    Write(B)                   Write(A)
X does T first, Y does U first. With backward validation, rule 1 is satisfied and rule 2 is checked;
rule 3 applies when validation is done in parallel (next slide). Without parallel validation this
scenario leads to a commitment deadlock.
Commitment deadlock in optimistic concurrency control 
 servers of distributed transactions do parallel validation 
– therefore rule 3 must be validated as well as rule 2 
 the write set of Tv is checked for overlaps with write sets of earlier 
transactions 
– this prevents commitment deadlock 
– it also avoids delaying the 2PC protocol 
 another problem - independent servers may 
schedule transactions in different orders 
– e.g. T before U at X and U before T at Y 
– this must be prevented - some hints as to how on page 531 
13.5 Distributed deadlocks 
 Single server transactions can experience deadlocks 
– prevent or detect and resolve 
– use of timeouts is clumsy, detection is preferable. 
 it uses wait-for graphs. 
 Distributed transactions lead to distributed deadlocks 
– in theory can construct global wait-for graph from local ones 
– a cycle in a global wait-for graph that is not in local ones is a 
distributed deadlock 
Figure 13.12
Interleavings of transactions U, V and W
 objects A, B managed by X and Y; C and D by Z
    U                                V                                W
    d.deposit(10)   lock D (at Z)
                                     b.deposit(10)   lock B (at Y)
    a.deposit(20)   lock A (at X)
                                                                      c.deposit(30)   lock C (at Z)
    b.withdraw(30)  wait at Y
                                     c.withdraw(20)  wait at Z
                                                                      a.withdraw(20)  wait at X
– next slide has the global wait-for graph
– wait-for edges: U → V at Y; V → W at Z; W → U at X
Figure 13.13
Distributed deadlock
 a deadlock cycle has alternate edges showing wait-for and held-by
 wait-for edges added in order: U → V at Y; V → W at Z and W → U at X
[Figure: (a) at server X, W waits for A, which is held by U; at server Y, U waits for B, held by V;
at server Z, V waits for C, held by W (D is also held by U at Z). (b) the resulting cycle
U → V → W → U in the global wait-for graph.]
Deadlock detection - local wait-for graphs 
 Local wait-for graphs can be built, e.g. 
– server Y: U → V added when U requests b.withdraw(30)
– server Z: V → W added when V requests c.withdraw(20)
– server X: W → U added when W requests a.withdraw(20)
 to find a global cycle, communication between the 
servers is needed 
 centralized deadlock detection 
– one server takes on role of global deadlock detector 
– the other servers send it their local graphs from time to time 
– it detects deadlocks, makes decisions about which transactions to 
abort and informs the other servers 
– usual problems of a centralized service - poor availability, lack of fault 
tolerance and no ability to scale 
Figure 13.14
Local and global wait-for graphs
[Figure: local wait-for graph at server X: T → U; local wait-for graph at server Y: V → T;
the global deadlock detector combines them into V → T → U.]
 Phantom deadlocks
– a 'deadlock' that is detected, but is not really one
– happens when there appears to be a cycle, but one of the transactions
has released a lock, due to time lags in distributing graphs
– in the figure, suppose U releases the object at X and then waits for V at Y
 and the global detector gets Y's graph before X's (T → U → V → T)
Edge chasing - a distributed approach to deadlock 
detection 
 a global graph is not constructed, but each server 
knows about some of the edges 
– servers try to find cycles by sending probes which follow the edges of 
the graph through the distributed system 
– when should a server send a probe? (go back to Fig 13.13)
– edges were added in order U → V at Y; V → W at Z and W → U at X
 when W → U at X was added, U was waiting, but
 when V → W at Z was added, W was not waiting
– send a probe when an edge T1 → T2 is added and T2 is already waiting
– each coordinator records whether its transactions are active or waiting 
 the local lock manager tells coordinators if transactions start/stop waiting 
 when a transaction is aborted to break a deadlock, the coordinator tells the 
participants, locks are removed and edges taken from wait-for graphs 
Edge-chasing algorithms 
 Three steps 
– Initiation: 
 When a server notes that T starts waiting for U, where U is waiting at
another server, it initiates detection by sending a probe containing the
edge < T → U > to the server where U is blocked.
 If U is sharing a lock, probes are sent to all the holders of the lock. 
– Detection: 
 Detection consists of receiving probes and deciding whether deadlock has 
occurred and whether to forward the probes. 
• e.g. when a server receives probe < T → U > it checks if U is waiting,
e.g. U → V; if so it forwards < T → U → V > to the server where V waits
• when a server adds a new edge, it checks whether a cycle is there 
– Resolution: 
 When a cycle is detected, a transaction in the cycle is aborted to break the 
deadlock. 
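A sketch of the detection step at a server, under the assumption that probes are just lists of transaction identifiers and that waitsFor, serverWhereWaiting, forwardTo and abortOneOf are hypothetical local helpers:

    import java.util.ArrayList;
    import java.util.List;

    // Sketch of edge-chasing detection: extend an incoming probe with local
    // wait-for edges, and either report a cycle or forward the longer probe.
    void receiveProbe(List<TransactionId> probe) {
        TransactionId last = probe.get(probe.size() - 1);
        for (TransactionId next : waitsFor(last)) {            // local edges last -> next
            if (probe.contains(next)) {
                abortOneOf(probe);                             // cycle found: break the deadlock
            } else {
                List<TransactionId> extended = new ArrayList<>(probe);
                extended.add(next);
                forwardTo(serverWhereWaiting(next), extended); // pass the probe along the edge
            }
        }
    }

For the example of Figure 13.15, X would initiate with <W → U>, Y would extend and forward <W → U → V>, and Z would find W already in the probe and detect the cycle.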
Figure 13.15
Probes transmitted to detect deadlock
 example of edge chasing starts with X sending <W → U>, then Y sends
<W → U → V>, then Z sends <W → U → V → W>
[Figure: initiation at X (W waits for A held by U); the probe <W → U> goes to Y, where U waits for
B held by V, so Y forwards <W → U → V> to Z; at Z, V waits for C held by W, so the probe becomes
<W → U → V → W> and the deadlock is detected.]
Edge chasing conclusion 
 probe to detect a cycle with N transactions will require 2(N-1) messages. 
– Studies of databases show that the average deadlock involves 2 transactions. 
 the above algorithm detects deadlock provided that 
– waiting transactions do not abort 
– no process crashes, no lost messages 
– to be realistic it would need to allow for the above failures 
 refinements of the algorithm (p 536-7) 
– to avoid more than one transaction causing detection to start and then more than one 
being aborted 
– not time to study these now 
Summary of concurrency control for distributed 
transactions 
 each server is responsible for the serializability of 
transactions that access its own objects. 
 additional protocols are required to ensure that 
transactions are serializable globally. 
– timestamp ordering requires a globally agreed timestamp ordering 
– optimistic concurrency control requires global validation or a means of 
forcing a global ordering on transactions. 
– two-phase locking can lead to distributed deadlocks. 
 distributed deadlock detection looks for cycles in the global wait-for graph. 
 edge chasing is a non-centralized approach to the detection of distributed 
deadlocks 
 . 
41 
•
13.6 Transaction recovery
What is meant by durability? What is meant by failure atomicity?
 Atomicity property of transactions
– durability and failure atomicity
– durability requires that objects are saved in permanent storage and will be available indefinitely
– failure atomicity requires that effects of transactions are atomic even when the server crashes
 Recovery is concerned with
– ensuring that a server's objects are durable and
– that the service provides failure atomicity.
– for simplicity we assume that when a server is running, all of its objects are in volatile memory
– and all of its committed objects are in a recovery file in permanent storage
– recovery consists of restoring the server with the latest committed versions of all of its
objects from its recovery file
Recovery manager 
 The task of the Recovery Manager (RM) is: 
– to save objects in permanent storage (in a recovery file) for committed 
transactions; 
– to restore the server’s objects after a crash; 
– to reorganize the recovery file to improve the performance of recovery; 
– to reclaim storage space (in the recovery file). 
 media failures 
– i.e. disk failures affecting the recovery file 
– need another copy of the recovery file on an independent disk. e.g. 
implemented as stable storage or using mirrored disks 
 we deal with recovery of 2PC separately (at the end) 
– we study logging (13.6.1) but not shadow versions (13.6.2) 
Recovery - intentions lists 
 Each server records an intentions list for each of its 
currently active transactions 
– an intentions list contains a list of the object references and the values 
of all the objects that are altered by a transaction 
– when a transaction commits, the intentions list is used to identify the 
objects affected 
 the committed version of each object is replaced by the tentative one 
 the new value is written to the server’s recovery file 
– in 2PC, when a participant says it is ready to commit, its RM must 
record its intentions list and its objects in the recovery file 
 it will be able to commit later on even if it crashes 
 when a client has been told a transaction has committed, the recovery files 
of all participating servers must show that the transaction is committed, 
• even if they crash between prepare to commit and commit 
Types of entry in a recovery file (Figure 13.18)
    Type of entry       Description of contents of entry
    Object              A value of an object.
    Transaction status  Transaction identifier, transaction status (prepared, committed or
                        aborted) and other status values used for the two-phase commit protocol.
    Intentions list     Transaction identifier and a sequence of intentions, each of which consists
                        of <identifier of object>, <position in recovery file of value of object>.
 For distributed transactions we need information relating to the 2PC as well as object values, that is:
– transaction status (committed, prepared or aborted)
– intentions list
Notes: the object state is flattened to bytes; the first status entry for a transaction says prepared;
the objects need not be next to one another in the recovery file - why is that a good idea?
Logging - a technique for the recovery file 
 the recovery file represents a log of the history of all 
the transactions at a server 
– it includes objects, intentions lists and transaction status 
– in the order that transactions prepared, committed and aborted 
– a recent snapshot + a history of transactions after the snapshot 
– during normal operation the RM is called whenever a transaction 
prepares, commits or aborts 
 prepare - RM appends to recovery file all the objects in the intentions list 
followed by status (prepared) and the intentions list 
 commit/abort - RM appends to recovery file the corresponding status 
 assume append operation is atomic, if server fails only the last write will be 
incomplete 
 to make efficient use of disk, buffer writes. Note: sequential writes are 
more efficient than those to random locations 
 committed status is forced to the log - in case server crashes 
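The recovery manager's actions at prepare and commit time could be sketched as appends to the log; the log object, the entry constructors, Status and forceToDisk are assumptions standing in for whatever append-only file abstraction a real server would use:

    import java.util.List;

    // Sketch: what the recovery manager appends at prepare and at commit.
    void onPrepare(TransactionId trans, List<ObjectValue> intentions) {
        long[] positions = new long[intentions.size()];
        for (int i = 0; i < intentions.size(); i++) {
            positions[i] = log.append(objectEntry(intentions.get(i)));  // object values first
        }
        log.append(statusEntry(trans, Status.PREPARED));                // then the prepared status
        log.append(intentionsListEntry(trans, positions));              // then pointers to the values
    }

    void onCommit(TransactionId trans) {
        log.append(statusEntry(trans, Status.COMMITTED));
        log.forceToDisk();   // the committed status is forced to the log in case the server crashes
    }

Because all writes are appends, they are sequential and can be buffered; only the committed status needs to be forced to disk.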
Log for banking service (Figure 13.19)
 Logging mechanism for Fig 12.7 (there would really be other objects in the log file)
– initial balances of A, B and C are $100, $200 and $300
– T sets A and B to $80 and $220; U sets B and C to $242 and $278
– entries to the left of the checkpoint line represent a snapshot (checkpoint) of the values of
A, B and C before T started; T has committed, but U is only prepared
– the RM gives each object a unique identifier (A, B, C in the diagram)
– each status entry contains a pointer to the previous status entry, so the RM can follow
transactions backwards through the file
Reconstructed contents of the log:
    P0 (checkpoint): Object:A 100; Object:B 200; Object:C 300
    P1: Object:A 80
    P2: Object:B 220
    P3: Trans:T prepared, intentions list <A, P1> <B, P2>, previous status P0
    P4: Trans:T committed, previous status P3
    P5: Object:C 278
    P6: Object:B 242
    P7: Trans:U prepared, intentions list <C, P5> <B, P6>, previous status P4   <- end of log
Recovery of objects - with logging 
 When a server is replaced after a crash 
– it first sets default initial values for its objects 
– and then hands over to its recovery manager. 
 The RM restores the server’s objects to include 
– all the effects of all the committed transactions in the correct order and 
– none of the effects of incomplete or aborted transactions 
– it ‘reads the recovery file backwards’ (by following the pointers) 
 restores values of objects with values from committed transactions 
 continuing until all of the objects have been restored 
– if it started at the beginning, there would generally be more work to do 
– to recover the effects of a transaction use the intentions list to find the value of the 
objects 
 e.g. look at previous slide (assuming the server crashed before T committed) 
– the recovery procedure must be idempotent 
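A sketch of this backward pass, assuming hypothetical RecoveryFile/LogEntry/Intention types that expose the entries newest-first and let the RM fetch intentions lists and object values by position:

    import java.util.HashSet;
    import java.util.Set;

    // Sketch: restore objects by scanning the recovery file backwards.
    void recover(RecoveryFile log) {
        Set<String> restored = new HashSet<>();
        for (LogEntry entry : log.backwards()) {                     // newest entry first
            if (entry.isStatus() && entry.status() == Status.COMMITTED) {
                for (Intention in : log.intentionsListFor(entry.transaction())) {
                    // only the most recent committed value of each object is used
                    if (restored.add(in.objectId())) {
                        setObject(in.objectId(), log.objectValueAt(in.position()));
                    }
                }
            }
        }
        // objects that were never restored keep their default initial values
    }

Entries for prepared (uncommitted) or aborted transactions are simply skipped, and running recovery twice gives the same result, so the procedure is idempotent.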
Logging - reorganising the recovery file 
 RM is responsible for reorganizing its recovery file 
– so as to make the process of recovery faster and 
– to reduce its use of space 
 checkpointing 
– the process of writing the following to a new recovery file 
 the current committed values of a server’s objects, 
 transaction status entries and intentions lists of transactions that have not 
yet been fully resolved 
 including information related to the two-phase commit protocol (see later) 
– checkpointing makes recovery faster and saves disk space 
 done after recovery and from time to time 
 can use old recovery file until new one is ready, add a ‘mark’ to old file 
 do as above and then copy items after the mark to new recovery file 
 replace old recovery file by new recovery file 
Recovery of the two-phase commit protocol 
 The above recovery scheme is extended to deal with transactions doing the 2PC 
protocol when a server fails 
– it uses new transaction status values done, uncertain (see Fig 13.6) 
 the coordinator uses committed when result is Yes; 
 done when 2PC complete ( if a transaction is done its information may be removed 
when reorganising the recovery file) 
 the participant uses uncertain when it has voted Yes; committed when told the result 
(uncertain entries must not be removed from recovery file) 
– It also requires two additional types of entry:
    Type of entry   Description of contents of entry
    Coordinator     Transaction identifier, list of participants (added by the RM when the coordinator prepares)
    Participant     Transaction identifier, coordinator (added by the RM when the participant votes Yes)
Log with entries relating to two-phase commit protocol (Figure 13.21)
Reconstructed sequence of entries (interspersed with values of objects, not shown):
    Trans:T prepared (intentions list) -> Coord'r:T (participant list: ...) -> Trans:T committed ->
    Trans:U prepared (intentions list) -> Part'pant:U (coordinator: ...) -> Trans:U uncertain -> Trans:U committed
 entries in the log for
– T, where the server is coordinator (prepared comes first, followed by the
coordinator entry, then committed - done is not shown)
– and U, where the server is participant (prepared comes first, followed by
the participant entry, then uncertain and finally committed)
– these entries will be interspersed with values of objects
 recovery must deal with 2PC entries as well as restoring objects
– where the server was coordinator, find the coordinator entry and status entries
– where the server was participant, find the participant entry and status entries
Think about recovery of this log: starting at the end, for U we find committed. If the server had
crashed earlier, before the last entry, we would have found U uncertain and participant; earlier
still, just prepared and participant.
Recovery of the two-phase commit protocol (Figure 13.22)
 the most recent entry in the recovery file determines the status of the transaction at the time of failure
 the RM action for each transaction depends on whether the server was coordinator or participant, and on the status:
Coordinator, prepared: No decision had been reached before the server failed. It sends
abortTransaction to all the servers in the participant list and adds the transaction status aborted
to its recovery file. Same action for state aborted. If there is no participant list, the
participants will eventually time out and abort the transaction.
Coordinator, committed: A decision to commit had been reached before the server failed. It sends a
doCommit to all the participants in its participant list (in case it had not done so before) and
resumes the two-phase protocol at step 4 (Fig 13.5).
Participant, committed: The participant sends a haveCommitted message to the coordinator (in case
this was not done before it failed). This will allow the coordinator to discard information about
this transaction at the next checkpoint.
Participant, uncertain: The participant failed before it knew the outcome of the transaction. It
cannot determine the status of the transaction until the coordinator informs it of the decision. It
will send a getDecision to the coordinator to determine the status of the transaction. When it
receives the reply it will commit or abort accordingly.
Participant, prepared: The participant has not yet voted and can abort the transaction.
Coordinator, done: No action is required.
Summary of transaction recovery 
 Transaction-based applications have strong requirements for the long life 
and integrity of the information stored. 
 Transactions are made durable by performing checkpoints and logging in 
a recovery file, which is used for recovery when a server is replaced after 
a crash. 
 Users of a transaction service would experience some delay during 
recovery. 
 It is assumed that the servers of distributed transactions exhibit crash 
failures and run in an asynchronous system, 
– but they can reach consensus about the outcome of transactions because crashed 
servers are replaced with new processes that can acquire all the relevant information 
from permanent storage or from other servers 

More Related Content

What's hot

management of distributed transactions
management of distributed transactionsmanagement of distributed transactions
management of distributed transactionsNilu Desai
 
deadlock handling
deadlock handlingdeadlock handling
deadlock handlingSuraj Kumar
 
File models and file accessing models
File models and file accessing modelsFile models and file accessing models
File models and file accessing modelsishmecse13
 
CS9222 ADVANCED OPERATING SYSTEMS
CS9222 ADVANCED OPERATING SYSTEMSCS9222 ADVANCED OPERATING SYSTEMS
CS9222 ADVANCED OPERATING SYSTEMSKathirvel Ayyaswamy
 
Transactions and Concurrency Control
Transactions and Concurrency ControlTransactions and Concurrency Control
Transactions and Concurrency ControlDilum Bandara
 
distributed shared memory
 distributed shared memory distributed shared memory
distributed shared memoryAshish Kumar
 
Concurrency Control in Distributed Database.
Concurrency Control in Distributed Database.Concurrency Control in Distributed Database.
Concurrency Control in Distributed Database.Meghaj Mallick
 
2 PHASE COMMIT PROTOCOL
2 PHASE COMMIT PROTOCOL2 PHASE COMMIT PROTOCOL
2 PHASE COMMIT PROTOCOLKABILESH RAMAR
 
Distributed Transactions(flat and nested) and Atomic Commit Protocols
Distributed Transactions(flat and nested) and Atomic Commit ProtocolsDistributed Transactions(flat and nested) and Atomic Commit Protocols
Distributed Transactions(flat and nested) and Atomic Commit ProtocolsSachin Chauhan
 
Deadlock Detection in Distributed Systems
Deadlock Detection in Distributed SystemsDeadlock Detection in Distributed Systems
Deadlock Detection in Distributed SystemsDHIVYADEVAKI
 
Introduction to distributed file systems
Introduction to distributed file systemsIntroduction to distributed file systems
Introduction to distributed file systemsViet-Trung TRAN
 
RPC: Remote procedure call
RPC: Remote procedure callRPC: Remote procedure call
RPC: Remote procedure callSunita Sahu
 
Distributed transaction
Distributed transactionDistributed transaction
Distributed transactionMohitKothari26
 
Deadlock management
Deadlock managementDeadlock management
Deadlock managementAhmed kasim
 
Remote invocation
Remote invocationRemote invocation
Remote invocationishapadhy
 
Optimistic concurrency control in Distributed Systems
Optimistic concurrency control in Distributed SystemsOptimistic concurrency control in Distributed Systems
Optimistic concurrency control in Distributed Systemsmridul mishra
 
Distributed system lamport's and vector algorithm
Distributed system lamport's and vector algorithmDistributed system lamport's and vector algorithm
Distributed system lamport's and vector algorithmpinki soni
 
Processes and Processors in Distributed Systems
Processes and Processors in Distributed SystemsProcesses and Processors in Distributed Systems
Processes and Processors in Distributed SystemsDr Sandeep Kumar Poonia
 
DISTRIBUTED DATABASE WITH RECOVERY TECHNIQUES
DISTRIBUTED DATABASE WITH RECOVERY TECHNIQUESDISTRIBUTED DATABASE WITH RECOVERY TECHNIQUES
DISTRIBUTED DATABASE WITH RECOVERY TECHNIQUESAAKANKSHA JAIN
 

What's hot (20)

management of distributed transactions
management of distributed transactionsmanagement of distributed transactions
management of distributed transactions
 
deadlock handling
deadlock handlingdeadlock handling
deadlock handling
 
File models and file accessing models
File models and file accessing modelsFile models and file accessing models
File models and file accessing models
 
CS9222 ADVANCED OPERATING SYSTEMS
CS9222 ADVANCED OPERATING SYSTEMSCS9222 ADVANCED OPERATING SYSTEMS
CS9222 ADVANCED OPERATING SYSTEMS
 
Transactions and Concurrency Control
Transactions and Concurrency ControlTransactions and Concurrency Control
Transactions and Concurrency Control
 
distributed shared memory
 distributed shared memory distributed shared memory
distributed shared memory
 
Concurrency Control in Distributed Database.
Concurrency Control in Distributed Database.Concurrency Control in Distributed Database.
Concurrency Control in Distributed Database.
 
2 PHASE COMMIT PROTOCOL
2 PHASE COMMIT PROTOCOL2 PHASE COMMIT PROTOCOL
2 PHASE COMMIT PROTOCOL
 
11. dfs
11. dfs11. dfs
11. dfs
 
Distributed Transactions(flat and nested) and Atomic Commit Protocols
Distributed Transactions(flat and nested) and Atomic Commit ProtocolsDistributed Transactions(flat and nested) and Atomic Commit Protocols
Distributed Transactions(flat and nested) and Atomic Commit Protocols
 
Deadlock Detection in Distributed Systems
Deadlock Detection in Distributed SystemsDeadlock Detection in Distributed Systems
Deadlock Detection in Distributed Systems
 
Introduction to distributed file systems
Introduction to distributed file systemsIntroduction to distributed file systems
Introduction to distributed file systems
 
RPC: Remote procedure call
RPC: Remote procedure callRPC: Remote procedure call
RPC: Remote procedure call
 
Distributed transaction
Distributed transactionDistributed transaction
Distributed transaction
 
Deadlock management
Deadlock managementDeadlock management
Deadlock management
 
Remote invocation
Remote invocationRemote invocation
Remote invocation
 
Optimistic concurrency control in Distributed Systems
Optimistic concurrency control in Distributed SystemsOptimistic concurrency control in Distributed Systems
Optimistic concurrency control in Distributed Systems
 
Distributed system lamport's and vector algorithm
Distributed system lamport's and vector algorithmDistributed system lamport's and vector algorithm
Distributed system lamport's and vector algorithm
 
Processes and Processors in Distributed Systems
Processes and Processors in Distributed SystemsProcesses and Processors in Distributed Systems
Processes and Processors in Distributed Systems
 
DISTRIBUTED DATABASE WITH RECOVERY TECHNIQUES
DISTRIBUTED DATABASE WITH RECOVERY TECHNIQUESDISTRIBUTED DATABASE WITH RECOVERY TECHNIQUES
DISTRIBUTED DATABASE WITH RECOVERY TECHNIQUES
 

Similar to Chapter 13

3 distributed transactions-cocurrency-query
3 distributed transactions-cocurrency-query3 distributed transactions-cocurrency-query
3 distributed transactions-cocurrency-queryM Rezaur Rahman
 
Management of Distributed Transactions
Management of Distributed TransactionsManagement of Distributed Transactions
Management of Distributed TransactionsAnkita Dubey
 
UNIT IV DIS.pptx
UNIT IV DIS.pptxUNIT IV DIS.pptx
UNIT IV DIS.pptxSamPrem3
 
Recovery in Multi database Systems
Recovery in Multi database SystemsRecovery in Multi database Systems
Recovery in Multi database SystemsMoutasm Tamimi
 
[WSO2Con EU 2018] Managing Transactions in Your Microservices Architecture
[WSO2Con EU 2018] Managing Transactions in Your Microservices Architecture [WSO2Con EU 2018] Managing Transactions in Your Microservices Architecture
[WSO2Con EU 2018] Managing Transactions in Your Microservices Architecture WSO2
 
Ch17 OS
Ch17 OSCh17 OS
Ch17 OSC.U
 
Chapter 18 - Distributed Coordination
Chapter 18 - Distributed CoordinationChapter 18 - Distributed Coordination
Chapter 18 - Distributed CoordinationWayne Jones Jnr
 
deadlock prevention
deadlock preventiondeadlock prevention
deadlock preventionNilu Desai
 
[WSO2Con Asia 2018] Managing Transactions in Your Microservice Architecture
[WSO2Con Asia 2018] Managing Transactions in Your Microservice Architecture[WSO2Con Asia 2018] Managing Transactions in Your Microservice Architecture
[WSO2Con Asia 2018] Managing Transactions in Your Microservice ArchitectureWSO2
 
19. Distributed Databases in DBMS
19. Distributed Databases in DBMS19. Distributed Databases in DBMS
19. Distributed Databases in DBMSkoolkampus
 
Fault Tolerant and Distributed System
Fault Tolerant and Distributed SystemFault Tolerant and Distributed System
Fault Tolerant and Distributed Systemsreenivas1591
 
Transaction Processing Monitors (TPM)
Transaction Processing Monitors (TPM)Transaction Processing Monitors (TPM)
Transaction Processing Monitors (TPM)Peter R. Egli
 
Distributed Coordination
Distributed CoordinationDistributed Coordination
Distributed Coordinationsiva krishna
 

Similar to Chapter 13 (20)

3 distributed transactions-cocurrency-query
3 distributed transactions-cocurrency-query3 distributed transactions-cocurrency-query
3 distributed transactions-cocurrency-query
 
Transaction management transparencies
Transaction management transparenciesTransaction management transparencies
Transaction management transparencies
 
Management of Distributed Transactions
Management of Distributed TransactionsManagement of Distributed Transactions
Management of Distributed Transactions
 
UNIT IV DIS.pptx
UNIT IV DIS.pptxUNIT IV DIS.pptx
UNIT IV DIS.pptx
 
Recovery in Multi database Systems
Recovery in Multi database SystemsRecovery in Multi database Systems
Recovery in Multi database Systems
 
[WSO2Con EU 2018] Managing Transactions in Your Microservices Architecture
[WSO2Con EU 2018] Managing Transactions in Your Microservices Architecture [WSO2Con EU 2018] Managing Transactions in Your Microservices Architecture
[WSO2Con EU 2018] Managing Transactions in Your Microservices Architecture
 
Ch17 OS
Ch17 OSCh17 OS
Ch17 OS
 
OSCh17
OSCh17OSCh17
OSCh17
 
OS_Ch17
OS_Ch17OS_Ch17
OS_Ch17
 
dos.ppt.pptx
dos.ppt.pptxdos.ppt.pptx
dos.ppt.pptx
 
Chapter 18 - Distributed Coordination
Chapter 18 - Distributed CoordinationChapter 18 - Distributed Coordination
Chapter 18 - Distributed Coordination
 
deadlock prevention
deadlock preventiondeadlock prevention
deadlock prevention
 
[WSO2Con Asia 2018] Managing Transactions in Your Microservice Architecture
[WSO2Con Asia 2018] Managing Transactions in Your Microservice Architecture[WSO2Con Asia 2018] Managing Transactions in Your Microservice Architecture
[WSO2Con Asia 2018] Managing Transactions in Your Microservice Architecture
 
19. Distributed Databases in DBMS
19. Distributed Databases in DBMS19. Distributed Databases in DBMS
19. Distributed Databases in DBMS
 
Fault Tolerant and Distributed System
Fault Tolerant and Distributed SystemFault Tolerant and Distributed System
Fault Tolerant and Distributed System
 
Transaction Processing Monitors (TPM)
Transaction Processing Monitors (TPM)Transaction Processing Monitors (TPM)
Transaction Processing Monitors (TPM)
 
Bab8 transaction
Bab8 transactionBab8 transaction
Bab8 transaction
 
Lec13s transaction
Lec13s transactionLec13s transaction
Lec13s transaction
 
Distributed Coordination
Distributed CoordinationDistributed Coordination
Distributed Coordination
 
Managing transactions 11g release 1 (10.3
Managing transactions   11g release 1 (10.3Managing transactions   11g release 1 (10.3
Managing transactions 11g release 1 (10.3
 

More from AbDul ThaYyal

Chapter 15 distributed mm systems
Chapter 15 distributed mm systemsChapter 15 distributed mm systems
Chapter 15 distributed mm systemsAbDul ThaYyal
 
Chapter 14 replication
Chapter 14 replicationChapter 14 replication
Chapter 14 replicationAbDul ThaYyal
 
Chapter 11d coordination agreement
Chapter 11d coordination agreementChapter 11d coordination agreement
Chapter 11d coordination agreementAbDul ThaYyal
 
Chapter 11c coordination agreement
Chapter 11c coordination agreementChapter 11c coordination agreement
Chapter 11c coordination agreementAbDul ThaYyal
 
Chapter 8 distributed file systems
Chapter 8 distributed file systemsChapter 8 distributed file systems
Chapter 8 distributed file systemsAbDul ThaYyal
 
Chapter 4 a interprocess communication
Chapter 4 a interprocess communicationChapter 4 a interprocess communication
Chapter 4 a interprocess communicationAbDul ThaYyal
 
Chapter 3 networking and internetworking
Chapter 3 networking and internetworkingChapter 3 networking and internetworking
Chapter 3 networking and internetworkingAbDul ThaYyal
 
Chapter 1 characterisation of distributed systems
Chapter 1 characterisation of distributed systemsChapter 1 characterisation of distributed systems
Chapter 1 characterisation of distributed systemsAbDul ThaYyal
 
Chapter 2 system models
Chapter 2 system modelsChapter 2 system models
Chapter 2 system modelsAbDul ThaYyal
 
4.file service architecture
4.file service architecture4.file service architecture
4.file service architectureAbDul ThaYyal
 
4.file service architecture (1)
4.file service architecture (1)4.file service architecture (1)
4.file service architecture (1)AbDul ThaYyal
 

More from AbDul ThaYyal (20)

Chapter 17 corba
Chapter 17 corbaChapter 17 corba
Chapter 17 corba
 
Chapter 15 distributed mm systems
Chapter 15 distributed mm systemsChapter 15 distributed mm systems
Chapter 15 distributed mm systems
 
Chapter 14 replication
Chapter 14 replicationChapter 14 replication
Chapter 14 replication
 
Chapter 11d coordination agreement
Chapter 11d coordination agreementChapter 11d coordination agreement
Chapter 11d coordination agreement
 
Chapter 11c coordination agreement
Chapter 11c coordination agreementChapter 11c coordination agreement
Chapter 11c coordination agreement
 
Chapter 11b
Chapter 11bChapter 11b
Chapter 11b
 
Chapter 11
Chapter 11Chapter 11
Chapter 11
 
Chapter 10
Chapter 10Chapter 10
Chapter 10
 
Chapter 9 names
Chapter 9 namesChapter 9 names
Chapter 9 names
 
Chapter 8 distributed file systems
Chapter 8 distributed file systemsChapter 8 distributed file systems
Chapter 8 distributed file systems
 
Chapter 7 security
Chapter 7 securityChapter 7 security
Chapter 7 security
 
Chapter 6 os
Chapter 6 osChapter 6 os
Chapter 6 os
 
Chapter 5
Chapter 5Chapter 5
Chapter 5
 
Chapter 4 a interprocess communication
Chapter 4 a interprocess communicationChapter 4 a interprocess communication
Chapter 4 a interprocess communication
 
Chapter 3 networking and internetworking
Chapter 3 networking and internetworkingChapter 3 networking and internetworking
Chapter 3 networking and internetworking
 
Chapter 1 characterisation of distributed systems
Chapter 1 characterisation of distributed systemsChapter 1 characterisation of distributed systems
Chapter 1 characterisation of distributed systems
 
Chapter 2 system models
Chapter 2 system modelsChapter 2 system models
Chapter 2 system models
 
4.file service architecture
4.file service architecture4.file service architecture
4.file service architecture
 
4.file service architecture (1)
4.file service architecture (1)4.file service architecture (1)
4.file service architecture (1)
 
4. system models
4. system models4. system models
4. system models
 

Chapter 13

  • 1. Teaching material based on Distributed Systems: Concepts and Design, Edition 3, Addison-Wesley 2001. Distributed Systems Course Distributed transactions Copyright © George Coulouris, Jean Dollimore, Tim Kindberg 2001 email: authors@cdk2.net This material is made available for private study and for direct use by individual teachers. It may not be included in any product or employed in any service without the written permission of the authors. Viewing: These slides must be viewed in slide show mode. 13.1 Introduction 13.2 Flat and nested distributed transactions 13.3 Atomic commit protocols 13.4 Concurrency control in distributed transactions 13.5 Distributed deadlocks 13.6 Transaction recovery
  • 2. Commitment of distributed transactions - introduction  a distributed transaction refers to a flat or nested transaction that accesses objects managed by multiple servers  When a distributed transaction comes to an end – the either all of the servers commit the transaction – or all of them abort the transaction.  one of the servers is coordinator, it must ensure the same outcome at all of the servers.  the ‘two-phase commit protocol’ is the most commonly used protocol for achieving this 2 •
  • 3. Distributed transactions Client 3 X Y Z X Y M T1 N T2 T11 Client P T T 12 T21 T22 (a) Flat transaction (b) Nested transactions T T Figure 13.1 A flat client transaction completes each of its requests before going on to the next one. Therefore, each transaction accesses servers’ objects sequentially In a nested transaction, the top-level transaction can open subtransactions, and each subtransaction can open further subtransactions down to any depth of nesting In the nested case, subtransactions at the same level can run concurrently, so T1 and T2 are concurrent, and as they invoke objects in different servers, they can run in parallel. •
  • 4. Nested banking transaction b.withdraw(20) openSubTransaction b.withdraw(20); openSubTransaction c.deposit(10); openSubTransaction d.deposit(20);  client transfers $10 from A to C and then transfers $20 from B to 4 a.withdraw(10) c.deposit(10) d.deposit(20) Client A B C T1 T2 T3 T4 T D X Y Z T = openTransaction openSubTransaction a.withdraw(10); closeTransaction Figure 13.2 requests can be run in parallel - with several servers, the nested transaction is more efficient •
  • 5. Why might a participant abort a transaction? The coordinator of a flat distributed transaction  Servers execute requests in a distributed transaction – when it commits they must communicate with one another to coordinate their actions – a client starts a transaction by sending an openTransaction request to a coordinator in any server (next slide)  it returns a TID unique in the distributed system(e.g. server ID + local transaction number)  at the end, it will be responsible for committing or aborting it – each server managing an object accessed by the transaction is a participant - it joins the transaction (next slide)  a participant keeps track of objects involved in the transaction  at the end it cooperates with the coordinator in carrying out the commit protocol – note that a participant can call abortTransaction in coordinator 5 •
  • 6. openTransaction goes to the coordinator A flat distributed banking transaction join participant A BranchX participant B participant C D Figure 13.3 a client’s (flat) banking transaction involves accounts A, B, C and D at servers BranchX, BranchY and BranchZ Each server is shown with a participant, which joins the transaction by invoking the join method in the coordinator  Note that the TID (T) is passed with each request e.g. withdraw(T,3) 6 . BranchZ Client BranchY join join T a.withdraw(4); b.withdraw(3); c.deposit(4); d.deposit(3); openTransaction b.withdraw(T, 3); closeTransaction T = openTransaction a.withdraw(4); c.deposit(4); b.withdraw(3); d.deposit(3); closeTransaction Note: the coordinator is in one of the servers, e.g. BranchX •
  • 7. The join operation  The interface for Coordinator is shown in Figure 12.3 – it has openTransaction, closeTransaction and abortTransaction – openTransaction returns a TID which is passed with each operation so that servers know which transaction is accessing its objects  The Coordinator interface provides an additional method, join, which is used whenever a new participant joins the transaction: – join(Trans, reference to participant) – informs a coordinator that a new participant has joined the transaction Trans. – the coordinator records the new participant in its participant list. – the fact that the coordinator knows all the participants and each participant knows the coordinator will enable them to collect the information that will be needed at commit time. 7 •
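To make the coordinator's role concrete, here is a rough Java rendering of the operations described on this slide. It is a sketch only: the interface and parameter names are assumptions, and remote-invocation details are omitted.

```java
// Sketch (assumed names, not the book's code) of a coordinator's operations,
// including the extra join method used by participants.
interface TransactionCoordinator {
    // returns a TID that is unique in the distributed system,
    // e.g. <server identifier, local transaction number>
    String openTransaction();

    void closeTransaction(String trans);

    void abortTransaction(String trans);

    // called when a new participant joins the transaction trans;
    // the coordinator records it in its participant list
    void join(String trans, Object participant);
}
```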
  • 8. Atomic commit protocols  transaction atomicity requires that at the end, – either all of its operations are carried out or none of them.  in a distributed transaction, the client has requested the operations at more than one server  one-phase atomic commit protocol – the coordinator tells the participants whether to commit or abort – what is the problem with that? – this does not allow one of the servers to decide to abort – it may have discovered a deadlock or it may have crashed and been restarted  two-phase atomic commit protocol – is designed to allow any participant to choose to abort a transaction – phase 1 - each participant votes. If it votes to commit, it is prepared. It cannot change its mind. In case it crashes, it must save updates in permanent store – phase 2 - the participants carry out the joint decision 8 • The decision could be commit or abort - participants record it in permanent store
  • 9. Failure model for the commit protocols  Recall the failure model for transactions in Chapter 12 – this applies to the two-phase commit protocol  Commit protocols are designed to work in – asynchronous system (e.g. messages may take a very long time) – servers may crash – messages may be lost. – assume corrupt and duplicated messages are removed. – no byzantine faults – servers either crash or they obey their requests  2PC is an example of a protocol for reaching a consensus. – Chapter 11 says consensus cannot be reached in an asynchronous system if processes sometimes fail. – however, 2PC does reach consensus under those conditions. – because crash failures of processes are masked by replacing a crashed process with a new process whose state is set from information saved in permanent storage and information held by other processes. 9 •
  • 10. The two-phase commit protocol  During the progress of a transaction, the only communication between coordinator and participant is the join request – The client request to commit or abort goes to the coordinator  if the client or a participant requests abort, the coordinator informs the participants immediately  if the client asks to commit, the 2PC comes into use 10  2PC – voting phase: coordinator asks all participants if they can commit  if yes, participant records updates in permanent storage and then votes – completion phase: coordinator tells all participants to commit or abort – the next slide shows the operations used in carrying out the protocol • How many messages are sent between the coordinator and each participant? Why does each participant record updates in permanent storage at this stage?
  • 11. Operations for two-phase commit protocol (Figure 13.4)  canCommit?(trans) -> Yes / No – Call from coordinator to participant to ask whether it can commit a transaction; the participant replies with its vote (this is a request with a reply).  doCommit(trans) – Call from coordinator to participant to tell the participant to commit its part of a transaction (asynchronous request, to avoid delays).  doAbort(trans) – Call from coordinator to participant to tell the participant to abort its part of a transaction (asynchronous request).  haveCommitted(trans, participant) – Call from participant to coordinator to confirm that it has committed the transaction (asynchronous request).  getDecision(trans) -> Yes / No – Call from participant to coordinator to ask for the decision on a transaction after it has voted Yes but has still had no reply after some delay; used to recover from server crash or delayed messages.  participant interface: canCommit?, doCommit, doAbort; coordinator interface: haveCommitted, getDecision 11 •
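As an illustration (an assumption, not the book's code), the Figure 13.4 operations could be written down as Java interfaces along these lines; the enum and type names are invented for the sketch.

```java
// Sketch of the Figure 13.4 operations as Java interfaces (assumed names and types).
enum Vote { YES, NO }
enum Decision { COMMIT, ABORT }

interface Participant2PC {
    Vote canCommit(String trans);  // request with a reply: the participant's vote
    void doCommit(String trans);   // asynchronous in the protocol; modelled as plain calls here
    void doAbort(String trans);
}

interface Coordinator2PC {
    // asynchronous confirmation from a participant that has committed
    void haveCommitted(String trans, Participant2PC participant);
    // called by a participant that voted Yes but has had no reply after some delay
    Decision getDecision(String trans);
}
```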
  • 12. The two-phase commit protocol 12 Figure 13.5 • Phase 1 (voting phase): • 1. The coordinator sends a canCommit? request to each of the participants in the transaction. • 2. When a participant receives a canCommit? request it replies with its vote (Yes or No) to the coordinator. Before voting Yes, it prepares to commit by saving objects in permanent storage. If the vote is No the participant aborts immediately. • Phase 2 (completion according to outcome of vote): • 3. The coordinator collects the votes (including its own). – (a) If there are no failures and all the votes are Yes the coordinator decides to commit the transaction and sends a doCommit request to each of the participants. – (b) Otherwise the coordinator decides to abort the transaction and sends doAbort requests to all participants that voted Yes. • 4. Participants that voted Yes are waiting for a doCommit or doAbort request from the coordinator. When a participant receives one of these messages it acts accordingly and in the case of commit, makes a haveCommitted call as confirmation to the coordinator. •
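The two phases might be driven by coordinator code along the lines of this minimal sketch. It assumes the Participant2PC/Vote/Decision types sketched above, runs single-threaded, and ignores timeouts, crash recovery and lost messages, all of which a real implementation must handle.

```java
// Minimal sketch of the coordinator's voting and completion phases (no failure handling).
import java.util.List;

class CoordinatorSketch {
    Decision complete(String trans, List<Participant2PC> participants) {
        // Phase 1 (voting): ask every participant whether it can commit
        boolean allYes = true;
        for (Participant2PC p : participants) {
            if (p.canCommit(trans) == Vote.NO) { allYes = false; break; }
        }
        // The decision would be recorded in permanent storage at this point.
        Decision decision = allYes ? Decision.COMMIT : Decision.ABORT;
        // Phase 2 (completion): tell the participants the joint decision
        for (Participant2PC p : participants) {
            if (decision == Decision.COMMIT) p.doCommit(trans);
            else p.doAbort(trans);   // only participants that voted Yes actually need this
        }
        return decision;
    }
}
```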
  • 13. Communication in two-phase commit protocol (Figure 13.6: coordinator steps 1 and 3 – sends canCommit?, then doCommit – with statuses prepared to commit (waiting for votes), committed, done; participant steps 2 and 4 – sends its Yes vote, then haveCommitted – with statuses prepared to commit (uncertain), committed) 13  Time-out actions in the 2PC  to avoid blocking forever when a process crashes or a message is lost – uncertain participant (step 2) has voted Yes; it cannot decide on its own  it uses the getDecision method to ask the coordinator about the outcome – a participant that has carried out client requests but has not yet received a canCommit? from the coordinator can abort unilaterally – coordinator delayed in waiting for votes (step 1): it can abort and send doAbort to the participants. • Think about the participant in step 2 – what is the problem? Think about the participant before step 2 – what is the problem? Think about the coordinator in step 1 – what is the problem?
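The uncertain participant's time-out action (asking the coordinator for the outcome with getDecision) could look roughly like the following sketch; it reuses the Coordinator2PC/Decision types assumed above, and the retry loop and sleep interval are assumptions.

```java
// Sketch: a prepared ("uncertain") participant repeatedly asks the coordinator for the outcome.
class UncertainParticipantSketch {
    void resolve(String trans, Coordinator2PC coordinator) throws InterruptedException {
        while (true) {
            try {
                Decision d = coordinator.getDecision(trans);
                if (d == Decision.COMMIT) { /* apply the prepared (saved) updates */ }
                else { /* discard the prepared updates */ }
                return;
            } catch (RuntimeException coordinatorUnreachable) {
                // the coordinator cannot be reached yet: the participant remains blocked
                Thread.sleep(5_000);   // assumed retry interval
            }
        }
    }
}
```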
  • 14. Performance of the two-phase commit protocol  if there are no failures, the 2PC involving N participants requires – N canCommit? messages and replies, followed by N doCommit messages.  the cost in messages is proportional to 3N, and the cost in time is three rounds of messages.  The haveCommitted messages are not counted – there may be arbitrarily many server and communication failures – 2PC is guaranteed to complete eventually, but it is not possible to specify a time limit within which it will be completed  delays to participants in uncertain state  some three-phase commit protocols (3PC) have been designed to alleviate such delays • they require more messages and more rounds for the normal case 14 •
  • 15. 13.3.2 Two-phase commit protocol for nested transactions  Recall Fig 13.1b, top-level transaction T and subtransactions T1, T2, T11, T12, T21, T22  A subtransaction starts after its parent and finishes before it  When a subtransaction completes, it makes an independent decision either to commit provisionally or to abort. – A provisional commit is not the same as being prepared: it is a local decision and is not backed up on permanent storage. – If the server crashes subsequently, its replacement will not be able to carry out a provisional commit.  A two-phase commit protocol is needed for nested transactions – it allows servers of provisionally committed transactions that have crashed to abort them when they recover. 15 •
  • 16. Figure 13.7 Operations in coordinator for nested transactions  openSubTransaction(trans) -> subTrans – Opens a new subtransaction whose parent is trans and returns a unique subtransaction identifier.  getStatus(trans) -> committed, aborted, provisional – Asks the coordinator to report on the status of the transaction trans. Returns a value representing one of: committed, aborted, provisional.  The TID of a subtransaction is an extension of its parent's TID, so that a subtransaction can work out the TID of the top-level transaction. The client finishes a set of nested transactions by calling closeTransaction or abortTransaction in the top-level transaction.  This is the interface of the coordinator of a subtransaction. – It allows it to open further subtransactions – It allows its subtransactions to enquire about its status  A client starts by using openTransaction to open a top-level transaction. – This returns a TID for the top-level transaction – The TID can be used to open a subtransaction  The subtransaction automatically joins the parent and a TID is returned. 16 •
  • 17. Transaction T decides whether to commit  T12 has provisionally committed and T11 has aborted, but the fate of T12 depends on its parent T1 and eventually on the top-level transaction, T. Although T21 and T22 have both provisionally committed, T2 has aborted and this means that T21 and T22 must also abort. Suppose that T decides to commit although T2 has aborted, also that T1 decides to commit although T11 has aborted. 17 • Figure 13.8: T1 – provisional commit (at X); T11 – abort (at M); T12 – provisional commit (at N); T2 – aborted (at Y); T21 – provisional commit (at N); T22 – provisional commit (at P).  Recall that 1. A parent can commit even if a subtransaction aborts 2. If a parent aborts, then its subtransactions must abort – In the figure, each subtransaction has either provisionally committed or aborted
  • 18. Information held by coordinators of nested transactions (Figure 13.9) 18
    Coordinator of transaction | Child transactions | Participant | Provisional commit list | Abort list
    T        | T1, T2   | yes                 | T1, T12  | T11, T2
    T1       | T11, T12 | yes                 | T1, T12  | T11
    T2       | T21, T22 | no (aborted)        |          | T2
    T11      |          | no (aborted)        |          | T11
    T12, T21 |          | T12 but not T21     | T21, T12 |
    T22      |          | no (parent aborted) |          | T22
     When a top-level transaction commits it carries out a 2PC  Each coordinator has a list of its subtransactions  At provisional commit, a subtransaction reports its status and the status of its descendants to its parent  If a subtransaction aborts, it tells its parent  T12 and T21 share a coordinator (they both run at server N)  When T2 is aborted it tells T (no information about descendants)  A subtransaction (e.g. T21 and T22) is called an orphan if one of its ancestors aborts; an orphan uses getStatus to ask its parent about the outcome, and it should abort if its parent has aborted •
  • 19. canCommit? for hierarchic two-phase commit protocol  canCommit?(trans, subTrans) -> Yes / No (Figure 13.10) – Call from a coordinator to the coordinator of a child subtransaction to ask whether it can commit a subtransaction subTrans. The first argument, trans, is the transaction identifier of the top-level transaction. The participant replies with its vote Yes / No. 19 •  Top-level transaction is coordinator of 2PC.  participant list: – the coordinators of all the subtransactions that have provisionally committed – but do not have an aborted ancestor – E.g. T, T1 and T12 in Figure 13.8 – if they vote yes, they prepare to commit by saving state in permanent store  The state is marked as belonging to the top-level transaction  The 2PC may be performed in a hierarchic or a flat manner  Hierarchic 2PC – T asks canCommit? to T1 and T1 asks canCommit? to T12. The subTrans argument is used to find the subtransaction to vote on; if it is absent, vote no. The trans argument is used when saving the objects in permanent storage.
  • 20. Compare the advantages and disadvantages of the flat and nested approaches canCommit? for flat two-phase commit protocol canCommit?(trans, abortList) -> Yes / No Figure 13.11 Call from coordinator to participant to ask whether it can commit a transaction. Participant replies with its vote Yes / No. 20 •  Flat 2PC – the coordinator of the top-level transaction sends canCommit? messages to the coordinators of all of the subtransactions in the provisional commit list. – in our example, T sends to the coordinators of T1 and T12. – the trans argument is the TID of the top-level transaction – the abortList argument gives all aborted subtransactions  e.g. server N has T12 prov committed and T21 aborted – On receiving canCommit, participant  looks in list of transactions for any that match trans (e.g. T12 and T21 at N)  it prepares any that have provisionally committed and are not in abortList and votes yes  if it can't find any it votes no
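A sketch of what a participant might do on receiving the flat canCommit?(trans, abortList) call: prepare every provisionally committed subtransaction of trans that has no aborted ancestor, and vote No if nothing qualifies. The class, helper names and the use of string TIDs are assumptions; it relies on the slide's rule that a subtransaction's TID extends its parent's TID, and on the Vote enum from the earlier sketch.

```java
// Sketch of a participant handling canCommit?(trans, abortList) in the flat protocol.
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class FlatNested2PCParticipantSketch {
    // provisionally committed subtransactions held at this server, e.g. {T12, T21} at N
    Set<String> provisionallyCommitted = new HashSet<>();

    Vote canCommit(String topLevelTrans, List<String> abortList) {
        boolean preparedSomething = false;
        for (String sub : provisionallyCommitted) {
            if (!descendsFrom(sub, topLevelTrans)) continue;   // not part of this transaction
            if (hasAncestorIn(sub, abortList)) continue;        // an aborted ancestor, e.g. T21 under T2
            saveToPermanentStorage(sub, topLevelTrans);         // prepare; state belongs to the top level
            preparedSomething = true;
        }
        return preparedSomething ? Vote.YES : Vote.NO;
    }

    boolean descendsFrom(String sub, String top) { return sub.startsWith(top); } // TIDs extend parent TIDs
    boolean hasAncestorIn(String sub, List<String> abortList) {
        return abortList.stream().anyMatch(sub::startsWith);
    }
    void saveToPermanentStorage(String sub, String top) { /* write prepared state to the recovery file */ }
}
```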
  • 21. Time-out actions in nested 2PC  With nested transactions delays can occur in the same three places as before – when a participant is prepared to commit – when a participant has finished but has not yet received canCommit? – when a coordinator is waiting for votes  Fourth place: – provisionally committed subtransactions of aborted subtransactions e.g. T22 whose parent T2 has aborted – use getStatus on parent, whose coordinator should remain active for a while – If parent does not reply, then abort 21 •
  • 22. Summary of 2PC  a distributed transaction involves several different servers. – A nested transaction structure allows  additional concurrency and  independent committing by the servers in a distributed transaction.  atomicity requires that the servers participating in a distributed transaction either all commit it or all abort it.  atomic commit protocols are designed to achieve this effect, even if servers crash during their execution.  the 2PC protocol allows a server to abort unilaterally. – it includes timeout actions to deal with delays due to servers crashing. – 2PC protocol can take an unbounded amount of time to complete but is guaranteed to complete eventually. 22 •
  • 23. 13.4 Concurrency control in distributed transactions  Each server manages a set of objects and is responsible for ensuring that they remain consistent when accessed by concurrent transactions – therefore, each server is responsible for applying concurrency control to its own objects. – the members of a collection of servers of distributed transactions are jointly responsible for ensuring that they are performed in a serially equivalent manner – therefore if transaction T is before transaction U in their conflicting access to objects at one of the servers then they must be in that order at all of the servers whose objects are accessed in a conflicting manner by both T and U 23 •
  • 24. 13.4.1 Locking  In a distributed transaction, the locks on an object are held by the server that manages it. – The local lock manager decides whether to grant a lock or make the requesting transaction wait. – it cannot release any locks until it knows that the transaction has been committed or aborted at all the servers involved in the transaction. – the objects remain locked and are unavailable for other transactions during the atomic commit protocol  an aborted transaction releases its locks after phase 1 of the protocol. 24 •
  • 25. Interleaving of transactions T and U at servers X and Y  in the example on page 529, we have – T before U at server X and U before T at server Y  different orderings lead to cyclic dependencies and distributed deadlock – detection and resolution of distributed deadlock in next section 25 •
    T: write(A) at X – locks A
    U: write(B) at Y – locks B
    T: read(B) at Y – waits for U
    U: read(A) at X – waits for T
  • 26. 13.4.2 Timestamp ordering concurrency control  Single server transactions – coordinator issues a unique timestamp to each transaction before it starts – serial equivalence ensured by committing objects in order of timestamps  Distributed transactions – the first coordinator accessed by a transaction issues a globally unique timestamp – as before the timestamp is passed with each object access – the servers are jointly responsible for ensuring serial equivalence  that is, if T accesses an object before U, then T is before U at all objects – coordinators agree on timestamp ordering  a timestamp consists of a pair <local timestamp, server-id>.  the agreed ordering of pairs of timestamps is based on a comparison in which the server-id part is less significant – they should relate to time 26 •
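The agreed ordering on <local timestamp, server-id> pairs can be captured by a comparator in which the server-id acts only as a tiebreaker. This is a small sketch under that assumption; the record and class names are invented.

```java
// Sketch: globally unique timestamps ordered primarily by the local timestamp,
// with the server id as the less significant part (a tiebreaker).
import java.util.Comparator;

record GlobalTimestamp(long localTime, String serverId) { }

class TimestampOrder {
    static final Comparator<GlobalTimestamp> ORDER =
            Comparator.comparingLong(GlobalTimestamp::localTime)
                      .thenComparing(GlobalTimestamp::serverId);

    public static void main(String[] args) {
        GlobalTimestamp t = new GlobalTimestamp(10, "S1");
        GlobalTimestamp u = new GlobalTimestamp(10, "S2");
        System.out.println(ORDER.compare(t, u) < 0);   // true: same local time, S1 orders before S2
    }
}
```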
  • 27. Timestamp ordering concurrency control (continued)  Can the same ordering be achieved at all servers without clock synchronization? Why is it better to have roughly synchronized clocks?  The same ordering can be achieved at all servers even if their clocks are not synchronized – for efficiency it is better if local clocks are roughly synchronized – then the ordering of transactions corresponds roughly to the real time order in which they were started  Timestamp ordering – conflicts are resolved as each operation is performed – if this leads to an abort, the coordinator will be informed  it will abort the transaction at the participants – any transaction that reaches the client request to commit should always be able to do so  participant will normally vote yes  unless it has crashed and recovered during the transaction 27 •
  • 28. Optimistic concurrency control (use backward validation)  each transaction is validated before it is allowed to commit – the validation rules concern the conflicts 1. write/read, 2. read/write, 3. write/write – transaction numbers assigned at start of validation – transactions serialized according to transaction numbers – validation takes place in phase 1 of 2PC protocol  consider the following interleavings of T and U – T before U at X and U before T at Y – suppose T & U start validation at about the same time 28
    T: read(A) at X; write(A); read(B) at Y; write(B)
    U: read(B) at Y; write(B); read(A) at X; write(A)
    With backward validation, rule 1 is satisfied and rule 2 is checked; rule 3 matters only for parallel validation. If X validates T first and Y validates U first, and validation is not done in parallel, neither server can finish validating the other transaction – commitment deadlock. •
  • 29. Commitment deadlock in optimistic concurrency control  servers of distributed transactions do parallel validation – therefore rule 3 must be validated as well as rule 2  the write set of Tv is checked for overlaps with write sets of earlier transactions – this prevents commitment deadlock – it also avoids delaying the 2PC protocol  another problem - independent servers may schedule transactions in different orders – e.g. T before U at X and U before T at Y – this must be prevented - some hints as to how on page 531 29 •
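Under the assumptions stated on this slide, parallel validation adds a write/write check (rule 3) on top of the usual backward read/write check (rule 2). The following is a rough sketch of that idea; the class, method and parameter names are invented, and object identifiers are modelled as strings.

```java
// Sketch of backward validation extended for parallel validation: the read set of the
// validating transaction is checked against the write sets of overlapping earlier
// transactions (rule 2), and its write set is also checked against the write sets of
// transactions currently being validated or committed (rule 3).
import java.util.Collections;
import java.util.List;
import java.util.Set;

class ParallelValidationSketch {
    static boolean validate(Set<String> readSet, Set<String> writeSet,
                            List<Set<String>> earlierWriteSets,   // overlapping committed transactions
                            List<Set<String>> activeWriteSets) {  // transactions validating in parallel
        for (Set<String> w : earlierWriteSets)
            if (!Collections.disjoint(readSet, w)) return false;  // rule 2 violated
        for (Set<String> w : activeWriteSets)
            if (!Collections.disjoint(writeSet, w)) return false; // rule 3 violated
        return true;   // this transaction may proceed to commit
    }
}
```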
  • 30. 13.5 Distributed deadlocks  Single server transactions can experience deadlocks – prevent or detect and resolve – use of timeouts is clumsy, detection is preferable.  it uses wait-for graphs.  Distributed transactions lead to distributed deadlocks – in theory can construct global wait-for graph from local ones – a cycle in a global wait-for graph that is not in local ones is a distributed deadlock 30 •
  • 31. Figure 13.12 Interleavings of transactions U, V and W  objects A, B managed by X and Y; C and D by Z 31
    the deposits obtain locks: lock D at Z (d.deposit(10)), lock B at Y (b.deposit(10)), lock A at X (a.deposit(20)), lock C at Z (c.deposit(30))
    then each transaction blocks: U's b.withdraw(30) waits at Y, V's c.withdraw(20) waits at Z, W's a.withdraw(20) waits at X
    – next slide has global wait-for graph • U → V at Y, V → W at Z, W → U at X
  • 32. Figure 13.13 Distributed deadlock  a deadlock cycle has alternate edges showing wait-for and held-by  wait-for added in order: U → V at Y; V → W at Z and W → U at X 32 (Figure 13.13: (a) the deadlock cycle U → V → W → U drawn through the objects A, B, C and D with alternating wait-for and held-by edges; (b) the same graph with the objects placed at their servers X, Y and Z) •
  • 33. Deadlock detection - local wait-for graphs  Local wait-for graphs can be built, e.g. – server Y: U → V added when U requests b.withdraw(30) – server Z: V → W added when V requests c.withdraw(20) – server X: W → U added when W requests a.withdraw(20)  to find a global cycle, communication between the servers is needed  centralized deadlock detection – one server takes on role of global deadlock detector – the other servers send it their local graphs from time to time – it detects deadlocks, makes decisions about which transactions to abort and informs the other servers – usual problems of a centralized service - poor availability, lack of fault tolerance and no ability to scale 33 •
  • 34. Figure 13.14 Local and global wait-for graphs (the local wait-for graph at server X contains T → U, the local wait-for graph at server Y contains V → T, and the global deadlock detector combines them) 34  Phantom deadlocks – a ‘deadlock’ that is detected, but is not really one – happens when there appears to be a cycle, but one of the transactions has released a lock, due to time lags in distributing graphs – in the figure suppose U releases the object at X then waits for V at Y  and the global detector gets Y’s graph before X’s (T → U → V → T) •
  • 35. Edge chasing - a distributed approach to deadlock detection  a global graph is not constructed, but each server knows about some of the edges – servers try to find cycles by sending probes which follow the edges of the graph through the distributed system – when should a server send a probe? (go back to Fig 13.13) – edges were added in order U → V at Y; V → W at Z and W → U at X  when W → U at X was added, U was waiting, but  when V → W at Z, W was not waiting – send a probe when an edge T1 → T2 is added and T2 is waiting – each coordinator records whether its transactions are active or waiting  the local lock manager tells coordinators if transactions start/stop waiting  when a transaction is aborted to break a deadlock, the coordinator tells the participants, locks are removed and edges taken from wait-for graphs 35 •
  • 36. Edge-chasing algorithms  Three steps – Initiation:  When a server notes that T starts waiting for U, where U is waiting at another server, it initiates detection by sending a probe containing the edge < T → U > to the server where U is blocked.  If U is sharing a lock, probes are sent to all the holders of the lock. – Detection:  Detection consists of receiving probes and deciding whether deadlock has occurred and whether to forward the probes. • e.g. when a server receives probe < T → U > it checks if U is waiting, e.g. U → V, if so it forwards < T → U → V > to the server where V waits • when a server adds a new edge, it checks whether a cycle is there – Resolution:  When a cycle is detected, a transaction in the cycle is aborted to break the deadlock. 36 •
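The detection step can be sketched as: on receiving a probe, check whether its last transaction is itself waiting locally; if so, extend the probe, report a deadlock if the extension closes a cycle, and otherwise forward it. This is a sketch only; the class, field and helper names are assumptions.

```java
// Sketch of the detection step of edge chasing at one server.
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class EdgeChasingSketch {
    // local wait-for edges at this server: waiting transaction -> transaction it waits for
    Map<String, String> localWaitsFor = new HashMap<>();

    // probe is a path such as [W, U]; returns a deadlock cycle if one is detected, else null
    List<String> receiveProbe(List<String> probe) {
        String last = probe.get(probe.size() - 1);
        String next = localWaitsFor.get(last);          // e.g. the edge U -> V held at this server
        if (next == null) return null;                  // the last transaction is not waiting here
        List<String> extended = new ArrayList<>(probe);
        extended.add(next);
        if (extended.indexOf(next) < extended.size() - 1) return extended;  // cycle: deadlock detected
        forwardToServerWhere(next, extended);           // otherwise forward, e.g. <W -> U -> V> to V's server
        return null;
    }

    void forwardToServerWhere(String trans, List<String> probe) { /* send the probe to that server */ }
}
```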
  • 37. Figure 13.15 Probes transmitted to detect deadlock  example of edge chasing starts with X sending <W → U>, then Y sends <W → U → V>, then Z sends <W → U → V → W> 37 (Figure 13.15: initiation at server X with probe <W → U>, forwarded via server Y as <W → U → V> and via server Z as <W → U → V → W> – deadlock detected) •
  • 38. Edge chasing conclusion  probe to detect a cycle with N transactions will require 2(N-1) messages. – Studies of databases show that the average deadlock involves 2 transactions.  the above algorithm detects deadlock provided that – waiting transactions do not abort – no process crashes, no lost messages – to be realistic it would need to allow for the above failures  refinements of the algorithm (p 536-7) – to avoid more than one transaction causing detection to start and then more than one being aborted – not time to study these now 38 •
  • 39. Summary of concurrency control for distributed transactions  each server is responsible for the serializability of transactions that access its own objects.  additional protocols are required to ensure that transactions are serializable globally. – timestamp ordering requires a globally agreed timestamp ordering – optimistic concurrency control requires global validation or a means of forcing a global ordering on transactions. – two-phase locking can lead to distributed deadlocks.  distributed deadlock detection looks for cycles in the global wait-for graph.  edge chasing is a non-centralized approach to the detection of distributed deadlocks  . 41 •
  • 40. 13.6 Transaction recovery  Atomicity property of transactions (What is meant by durability? What is meant by failure atomicity?) – durability and failure atomicity – durability requires that objects are saved in permanent storage and will be available indefinitely – failure atomicity requires that effects of transactions are atomic even when the server crashes 42  Recovery is concerned with – ensuring that a server’s objects are durable and – that the service provides failure atomicity. – for simplicity we assume that when a server is running, all of its objects are in volatile memory – and all of its committed objects are in a recovery file in permanent storage – recovery consists of restoring the server with the latest committed versions of all of its objects from its recovery file •
  • 41. Recovery manager  The task of the Recovery Manager (RM) is: – to save objects in permanent storage (in a recovery file) for committed transactions; – to restore the server’s objects after a crash; – to reorganize the recovery file to improve the performance of recovery; – to reclaim storage space (in the recovery file).  media failures – i.e. disk failures affecting the recovery file – need another copy of the recovery file on an independent disk. e.g. implemented as stable storage or using mirrored disks  we deal with recovery of 2PC separately (at the end) – we study logging (13.6.1) but not shadow versions (13.6.2) 43 •
  • 42. Recovery - intentions lists  Each server records an intentions list for each of its currently active transactions – an intentions list contains a list of the object references and the values of all the objects that are altered by a transaction – when a transaction commits, the intentions list is used to identify the objects affected  the committed version of each object is replaced by the tentative one  the new value is written to the server’s recovery file – in 2PC, when a participant says it is ready to commit, its RM must record its intentions list and its objects in the recovery file  it will be able to commit later on even if it crashes  when a client has been told a transaction has committed, the recovery files of all participating servers must show that the transaction is committed, • even if they crash between prepare to commit and commit 44 •
  • 43. Types of entry in a recovery file (Figure 13.18) 45
    Object – A value of an object (the object state is flattened to bytes).
    Transaction status – Transaction identifier, transaction status (prepared, committed, aborted) and other status values used for the two-phase commit protocol. The first status entry written for a transaction says prepared.
    Intentions list – Transaction identifier and a sequence of intentions, each of which consists of <identifier of object>, <position in recovery file of value of object>.
     For distributed transactions we need information relating to the 2PC as well as object values, that is: – transaction status (committed, prepared or aborted) – intentions list  Note that the objects need not be next to one another in the recovery file – why is that a good idea? •
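One way to picture the Figure 13.18 entry kinds is as tagged records in an append-only file. The following type sketch is an assumption for illustration, not the book's code; in particular, keeping a pointer to the previous status entry reflects the pointer chain described on the next slides.

```java
// Sketch of the three kinds of recovery-file entry (cf. Figure 13.18) as Java records.
import java.util.List;

sealed interface RecoveryEntry permits ObjectEntry, StatusEntry, IntentionsListEntry { }

// a value of an object, flattened to bytes
record ObjectEntry(String objectId, byte[] value) implements RecoveryEntry { }

// transaction identifier, its status (prepared, committed, aborted) plus any extra
// status values needed by the two-phase commit protocol, and a pointer (file position)
// to the previous status entry
record StatusEntry(String trans, String status, long previousStatusPosition) implements RecoveryEntry { }

// transaction identifier and, for each altered object, the position in the recovery
// file where its new value was written
record IntentionsListEntry(String trans, List<Intention> intentions) implements RecoveryEntry { }

record Intention(String objectId, long positionInRecoveryFile) { }
```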
  • 44. Logging - a technique for the recovery file  the recovery file represents a log of the history of all the transactions at a server – it includes objects, intentions lists and transaction status – in the order that transactions prepared, committed and aborted – a recent snapshot + a history of transactions after the snapshot – during normal operation the RM is called whenever a transaction prepares, commits or aborts  prepare - RM appends to recovery file all the objects in the intentions list followed by status (prepared) and the intentions list  commit/abort - RM appends to recovery file the corresponding status  assume append operation is atomic, if server fails only the last write will be incomplete  to make efficient use of disk, buffer writes. Note: sequential writes are more efficient than those to random locations  committed status is forced to the log - in case server crashes 46 •
  • 45. Log for banking service (Figure 13.19) 47
    Checkpoint (entries to the left of the line, ending at P0): Object:A = 100, Object:B = 200, Object:C = 300, committed status
    P1: Object:A = 80   P2: Object:B = 220   P3: Trans:T prepared, intentions list <A, P1> <B, P2>, previous status P0   P4: Trans:T committed, previous status P3   P5: Object:C = 278   P6: Object:B = 242   P7: Trans:U prepared, intentions list <C, P5> <B, P6>, previous status P4   (end of log)
     Logging mechanism for Fig 12.7 (there would really be other objects in the log file) – initial balances of A, B and C $100, $200, $300 – T sets A and B to $80 and $220. U sets B and C to $242 and $278 – entries to the left of the line represent a snapshot (checkpoint) of the values of A, B and C before T started. T has committed, but U is prepared. – the RM gives each object a unique identifier (A, B, C in the diagram) – each status entry contains a pointer to the previous status entry, so that from the end of the log recovery can follow transactions backwards through the file to the checkpoint •
  • 46. Recovery of objects - with logging  When a server is replaced after a crash – it first sets default initial values for its objects – and then hands over to its recovery manager.  The RM restores the server’s objects to include – all the effects of all the committed transactions in the correct order and – none of the effects of incomplete or aborted transactions – it ‘reads the recovery file backwards’ (by following the pointers)  restores values of objects with values from committed transactions  continuing until all of the objects have been restored – if it started at the beginning, there would generally be more work to do – to recover the effects of a transaction use the intentions list to find the value of the objects  e.g. look at previous slide (assuming the server crashed before T committed) – the recovery procedure must be idempotent 48 •
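Recovery by reading the file backwards might be organised as in this sketch: walk the status entries from the end of the log, and for each committed transaction restore any object not already restored by a later transaction. The RecoveryFile accessor interface and all names are assumptions introduced for the example.

```java
// Sketch of restoring objects by scanning the recovery file backwards.
import java.util.HashMap;
import java.util.Map;

class RecoverySketch {
    Map<String, byte[]> restore(RecoveryFile file) {
        Map<String, byte[]> objects = new HashMap<>();   // objects start from default initial values
        long pos = file.lastStatusEntry();
        while (pos >= 0) {
            if (file.statusAt(pos).equals("committed")) {
                for (Map.Entry<String, Long> i : file.intentionsListFor(pos).entrySet()) {
                    // the most recently committed value wins, so skip objects already restored
                    objects.putIfAbsent(i.getKey(), file.objectValueAt(i.getValue()));
                }
            }
            pos = file.previousStatusEntry(pos);   // follow the backward pointer; -1 when none remain
        }
        return objects;   // the procedure must be idempotent: re-running it gives the same result
    }
}

// Assumed accessor interface over the recovery file.
interface RecoveryFile {
    long lastStatusEntry();                      // position of the last status entry, or -1
    long previousStatusEntry(long pos);          // previous status entry, or -1
    String statusAt(long pos);                   // "prepared", "committed", "aborted", ...
    Map<String, Long> intentionsListFor(long pos); // object id -> position of its new value
    byte[] objectValueAt(long pos);
}
```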
  • 47. Logging - reorganising the recovery file  RM is responsible for reorganizing its recovery file – so as to make the process of recovery faster and – to reduce its use of space  checkpointing – the process of writing the following to a new recovery file  the current committed values of a server’s objects,  transaction status entries and intentions lists of transactions that have not yet been fully resolved  including information related to the two-phase commit protocol (see later) – checkpointing makes recovery faster and saves disk space  done after recovery and from time to time  can use old recovery file until new one is ready, add a ‘mark’ to old file  do as above and then copy items after the mark to new recovery file  replace old recovery file by new recovery file 49 •
  • 48. Recovery of the two-phase commit protocol  The above recovery scheme is extended to deal with transactions doing the 2PC protocol when a server fails – it uses new transaction status values done, uncertain (see Fig 13.6)  the coordinator uses committed when the result of the vote is Yes; done when the 2PC is complete (if a transaction is done its information may be removed when reorganising the recovery file)  the participant uses uncertain when it has voted Yes; committed when told the result (uncertain entries must not be removed from the recovery file) – It also requires two additional types of entry: 51
    Coordinator – Transaction identifier, list of participants (added by the RM when the coordinator prepares)
    Participant – Transaction identifier, coordinator (added by the RM when the participant votes Yes) •
  • 49. Log with entries relating to two-phase commit protocol (Figure 13.21) 52
    Trans:T prepared (intentions list); Coord’r:T (part’pant list: . . .); Trans:T committed; Trans:U prepared (intentions list); Part’pant:U (Coord’r: . .); Trans:U uncertain; Trans:U committed
     entries in the log for – T where server is coordinator (prepared comes first, followed by the coordinator entry, then committed – done is not shown) – and U where server is participant (prepared comes first followed by the participant entry, then uncertain and finally committed) – these entries will be interspersed with values of objects  recovery must deal with 2PC entries as well as restoring objects – where server was coordinator find coordinator entry and status entries – where server was participant find participant entry and status entries  Recovery starts at the end of the log: for U it finds committed; but if the server had crashed earlier, before the last entry was written, it would have found the prepared and participant entries, leaving U uncertain •
  • 50. Recovery of the two-phase commit protocol (Figure 13.22) 53
    Coordinator / prepared – No decision had been reached before the server failed. It sends abortTransaction to all the servers in the participant list and adds the transaction status aborted to its recovery file. Same action for state aborted. If there is no participant list, the participants will eventually time out and abort the transaction.
    Coordinator / committed – A decision to commit had been reached before the server failed. It sends a doCommit to all the participants in its participant list (in case it had not done so before) and resumes the two-phase protocol at step 4 (Fig 13.5).
    Participant / committed – The participant sends a haveCommitted message to the coordinator (in case this was not done before it failed). This will allow the coordinator to discard information about this transaction at the next checkpoint.
    Participant / uncertain – The participant failed before it knew the outcome of the transaction. It cannot determine the status of the transaction until the coordinator informs it of the decision. It will send a getDecision to the coordinator to determine the status of the transaction. When it receives the reply it will commit or abort accordingly.
    Participant / prepared – The participant has not yet voted and can abort the transaction.
    Coordinator / done – No action is required.
    • The most recent entry in the recovery file determines the status of the transaction at the time of failure; the RM action for each transaction depends on whether the server was coordinator or participant and on that status.
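The Figure 13.22 actions can be summarised as a decision table keyed on the server's role and the most recent status entry. The following sketch just encodes that table; the enum and method names are assumptions, and the returned strings paraphrase the actions above.

```java
// Sketch: the recovery manager's 2PC action chosen from the role and the most recent
// status entry found in the recovery file (cf. Figure 13.22).
class TwoPCRecoverySketch {
    enum Role { COORDINATOR, PARTICIPANT }

    static String actionFor(Role role, String status) {
        if (role == Role.COORDINATOR) {
            return switch (status) {
                case "prepared", "aborted" -> "send abortTransaction to the participant list; record aborted";
                case "committed"           -> "resend doCommit; resume the protocol at step 4";
                case "done"                -> "no action required";
                default                    -> "unknown status";
            };
        } else {
            return switch (status) {
                case "committed" -> "send haveCommitted to the coordinator";
                case "uncertain" -> "send getDecision to the coordinator and commit or abort accordingly";
                case "prepared"  -> "abort the transaction (it has not yet voted)";
                default          -> "unknown status";
            };
        }
    }
}
```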
  • 51. Summary of transaction recovery  Transaction-based applications have strong requirements for the long life and integrity of the information stored.  Transactions are made durable by performing checkpoints and logging in a recovery file, which is used for recovery when a server is replaced after a crash.  Users of a transaction service would experience some delay during recovery.  It is assumed that the servers of distributed transactions exhibit crash failures and run in an asynchronous system, – but they can reach consensus about the outcome of transactions because crashed servers are replaced with new processes that can acquire all the relevant information from permanent storage or from other servers 55 •

Editor's Notes

  1. Times taken to present this material: Sections 13.1–13.3 require about 1 hour 40 minutes. Section 13.4 was not presented (slides are included). Sections 13.5 and 13.6 (except 13.6.2 and 13.6.3) took about 40 minutes.
  2. also going to discuss cc for distributed Tx and recovery of distributed Tx
  3. flat transaction: client makes requests to several servers A flat client transaction completes each of its requests before going on to the next one. Therefore, each transaction accesses servers’ objects sequentially. In a nested transaction, the top-level transaction can open subtransactions, and each subtransaction can open further subtransactions down to any depth of nesting In the nested case, subtransactions at the same level can run concurrently, so T1 and T2 are concurrent, and as they invoke objects in different servers, they can run in parallel.
  4. nested Tx - run requests in parallel, more efficient if several servers involved
  5. A participant aborts it if it crashes and then restarts, or if it has a concurrency control problem, e.g. deadlock, failure of validation in optimistic cc, or failure of an operation in timestamp ordering.
  6. a client’s (flat) banking transaction involves accounts A, B, C and D at servers BranchX, BranchY and BranchZ explain openTransaction goes to coordinator (in any of the servers) go back to previous slide Each server is shown with a participant, which joins the transaction by invoking the join method in the coordinator when does join occur ? on first request from client to new server. When the client invokes one of the methods in the transaction, for example b.withdraw(T, 3), the object receiving the invocation (B at BranchY in this case) informs its participant object that the object belongs to the transaction T.
  7. Section 12.1.2 mentions failure of disks, processes and messages (here we are assuming that disk failures can be masked as discussed in Ch 12, e.g. by use of stable storage, and that failure of two blocks is a disaster); processes crash, messages lost
  8. 1. Can you commit?, 2. Yes/No, 3. doCommit/doAbort (4 is just a confirmation). Use permanent storage because it might crash.
  9. The two-phase commit protocol consists of a voting phase and a completion phase By the end of step (2) the coordinator and all the participants that voted Yes are prepared to commit. By the end of step (3) the transaction is effectively completed. At step (3a) the coordinator and the participants are committed, so the coordinator can report a decision to commit to the client. At (3b) the coordinator reports a decision to abort to the client At step (4) participants confirm that they have committed so that the coordinator knows when the information it has recorded about the transaction is no longer needed
  10. Step 2 - participant is uncertain. E.g. coordinator may have crashed Before step 2. Maybe coordinator has crashed In step 1 maybe some participants have crashed
  11. Advantage of flat - simpler set of calls, does not depend on lower levels all replying Disadvantage - need abort list.
  12. In the above ordering T locks A at X then U locks B at Y T tries to access B at Y and waits for U’s lock U tries to access A at X and waits for T’s lock
  13. problems when local orderings far from real time e.g. S1 has 10 and S2 has 100, then transactions at S1 are always too late. Suppose that transactions T and U are started at S1 and S2 with timestamps <S1, 10> and <S2, 100>; we have <S2, 100> > <S1, 10>, similarly <S2, 100> > <S1, 11> etc, so transactions such as T at S1 will find that transactions such as U at S2 have timestamps greater than T's when reading and writing objects, so it will be hard for T to succeed
  14. suppose T and U start validation at about the same time (in different servers) but X validates T first and Y validates U first recall that the validation protocol in each server only does one transaction at a time, so each server will be unable to validate the other transaction until the first has completed. Commitment deadlock
  15. e.g. how to prevent different orderings global validation after local ones use of globally unique transaction numbers with bagreed orderings
  16. a transaction can wait for only one object at a time, therefore, objects may be left out of the wait for graphs (b) when we return from a later slide on edge chasing: Consider the situation at server X. It has just added the edge W → U to its local wait-for graph and at this time, transaction U is waiting to access object B, which transaction V holds at server Y. This edge could possibly be part of a cycle such as V → T1 → T2 → … → W → U → V involving transactions using objects at other servers. There is a potential distributed deadlock cycle, which could be found by sending out a probe to server Y But earlier on when Z added V → W, W was not waiting
  17. in the figure, suppose that transaction U then releases an object at server X and requests the one held by V at server Y. Suppose also that the global detector receives server Y’s local graph before server X’s. In this case, it would detect a cycle T → U → V → T, although the edge T → U no longer exists. Actually with 2 phase locking this can’t happen because U can’t release a lock and then request another. But a phantom deadlock could occur if a transaction in a deadlock cycle aborts while the detection procedure is being carried out
  18. transaction coordinators rectangles X, Y, Z objects circles A, B, C we show probes going directly from one object server to another in reality they go from object server to coordinator and then to the next object server. i.e. two messages per probe probe to detect a cycle with N transactions will require 2(N-1) messages. Studies of databases show that the average deadlock involves 2 transactions.
  19. database servers often just load objects into volatile memory when they are accessed
  20. The RM deals with both durability and failure atomicity - it saves committed objects and can be used to restore server state
  21. openTransaction -&amp;gt; TID which is passed with each operation updates are put in a private set of tentative versions of objects the server makes the intentions list as the transaction progresses
  22. Good idea to be able to write each object as it is convenient. E.g. sometimes might write an object early, or might allow threads in RM - preparing two transactions at once. The object itself must be sequential.
  23. to simplify we have only three objects A, B and C in the server when T prepares, it writes its new values A= $80 and B= $220. at P1 and P2. followed by prepared status and intentions list (refers to A, P1 etc) when T commits, its status is added when U prepares, it writes its new values C= $278 and B= $242. at P5 and P6. followed by prepared status and intentions list (refers to C, P5 etc)
  24. Omit discussion of shadow versions
  25. the most recent entry determines the status of the transaction at the time of failure RM action for each transaction depends on whether server was coordinator or participant and the status as above
  26. Not doing nested transactions (Fig 13.23)