OSes: 16. Dist. Coord 1
Operating Systems
Objectives
– introduce issues such as event ordering, mutual exclusion, atomicity, deadlock
Certificate Program in Software Development
CSE-TC and CSIM, AIT
September -- November, 2003
16. Distributed Coordination (S&G 6th ed., Ch. 17)
Overview
1. Event Ordering
2. Distributed Mutual Exclusion
3. Atomicity
4. Concurrency Control
5. Deadlock Handling
6. Election Algorithms
7. Reaching Agreement
1. Event Ordering
Happened-before relation (denoted by -->).
– If A and B are events in the same process, and A was executed before B, then A --> B.
– If A is the event of sending a message by one process and B is the event of receiving that message by another process, then A --> B.
– If A --> B and B --> C, then A --> C.
Relative Time for 3 Concurrent Processes (Figure 17.1, p.597)
[Figure: space-time diagram; labels: event, message, time]
1.1. Implementation of -->
Associate a timestamp with each system event.
Require that for every pair of events A and B, if A --> B, then the timestamp of A is less than the timestamp of B.
Each process Pi has a logical clock, LCi
– implemented as a counter, incremented when an event occurs
A process advances its logical clock when it receives a message whose timestamp is greater than the current value of its logical clock.
If the timestamps of two events A and B are the same, then the events are concurrent.
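The logical-clock rules can be sketched in a few lines of Python. This is a minimal illustration, not the textbook's code: the class name LogicalClock and its methods are hypothetical, and the receive rule uses the usual max(local, message) + 1 update, which is one common way to satisfy the requirement above.

```python
class LogicalClock:
    """Per-process counter LCi (names here are illustrative)."""

    def __init__(self):
        self.time = 0

    def tick(self):
        # A local event (including sending a message) increments the counter.
        self.time += 1
        return self.time

    def receive(self, msg_ts):
        # Assumed rule: jump past the sender's timestamp if it is ahead,
        # then count the receive itself as an event.
        self.time = max(self.time, msg_ts) + 1
        return self.time


p1, p2 = LogicalClock(), LogicalClock()
a = p1.tick()                  # event A in P1
send_ts = p1.tick()            # P1 sends a message stamped with its clock
p2.tick()                      # unrelated local event in P2
recv_ts = p2.receive(send_ts)  # P2 receives the message
```

Because the send happened before the receive, the receive ends up with the larger timestamp, as the --> requirement demands.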
2. Distributed Mutual Exclusion (DME)
Assumptions:
– the system consists of n processes; each process Pi resides at a different processor;
– each process has a critical section that requires mutual exclusion
Requirement:
– if Pi is executing in its critical section, then no other process Pj is executing in its own critical section
2.1. DME: Centralized Approach
One of the processes in the system is chosen to coordinate the entry to the critical sections.
A process that wants to enter its critical section sends a request message to the coordinator.
The coordinator decides which process can enter the critical section next, and sends that process a reply message.
When the process receives a reply message from the coordinator, it enters its critical section.
After exiting its critical section, the process sends a release message to the coordinator and proceeds with its execution.
This scheme requires three messages per critical-section entry:
– request
– reply
– release
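The request/reply/release exchange can be sketched as a small coordinator object. This is only an illustration under stated assumptions: the class and attribute names are hypothetical, message passing is replaced by method calls, and waiting requests are served in FIFO order (the slides only say that the coordinator decides who goes next).

```python
from collections import deque


class Coordinator:
    """Hypothetical coordinator for the centralized DME scheme."""

    def __init__(self):
        self.holder = None      # process currently in its critical section
        self.waiting = deque()  # queued requests (FIFO is an assumption)
        self.sent = []          # log of reply messages, for inspection

    def request(self, pid):
        # A request is granted at once if nobody holds the critical section.
        if self.holder is None:
            self._grant(pid)
        else:
            self.waiting.append(pid)

    def release(self, pid):
        if pid != self.holder:
            raise ValueError("release from a process that is not the holder")
        self.holder = None
        if self.waiting:
            self._grant(self.waiting.popleft())

    def _grant(self, pid):
        self.holder = pid
        self.sent.append(("reply", pid))


c = Coordinator()
c.request("P1")   # P1 gets a reply immediately
c.request("P2")   # P2 must wait
c.release("P1")   # P2 now receives its reply
```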
2.2. DME: Fully Distributed Approach
When process Pi wants to enter its critical section, it generates a new timestamp, TS, and sends the message request(Pi, TS) to all other processes in the system.
When process Pj receives a request message, it may reply immediately or it may defer sending a reply back.
When process Pi receives a reply message from all other processes in the system, it can enter its critical section.
After exiting its critical section, the process sends reply messages to all its deferred requests.
The decision whether process Pj replies immediately to a request(Pi, TS) message or defers its reply is based on three factors:
– 1. If Pj is in its critical section, then it defers its reply to Pi.
– 2. If Pj does not want to enter its critical section, then it sends a reply immediately to Pi.
– 3. If Pj wants to enter its critical section but has not yet entered it, then it compares its own request timestamp with the timestamp TS.
  If its own request timestamp is greater than TS, then it sends a reply immediately to Pi (Pi asked first).
  Otherwise, the reply is deferred.
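The three-way decision made by Pj can be written as one small function. This is a sketch: the function name and parameters are hypothetical, and requests are modelled as (timestamp, process id) tuples so that equal timestamps are broken by process id, which is an added assumption not stated on the slides.

```python
def should_defer(in_cs, wants_cs, own_request, incoming):
    """Return True if Pj defers its reply to the incoming request.

    own_request and incoming are (timestamp, pid) tuples; own_request
    may be None when Pj has no outstanding request.
    """
    if in_cs:
        return True          # factor 1: Pj is inside its critical section
    if not wants_cs:
        return False         # factor 2: Pj is not competing, reply at once
    # Factor 3: both want in. Reply at once only if the other asked first,
    # i.e. defer when our own request is the earlier one.
    return own_request < incoming


examples = [
    should_defer(True, True, (3, 1), (5, 2)),     # in critical section
    should_defer(False, False, None, (5, 2)),     # not interested
    should_defer(False, True, (10, 1), (5, 2)),   # other asked first
    should_defer(False, True, (5, 1), (10, 2)),   # we asked first
]
```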
Desirable Behavior of Fully Distributed Approach
Freedom from deadlock.
Freedom from starvation, since entry to the critical section is scheduled according to the timestamp ordering
– the timestamp ordering means that processes are served in a FCFS order
The number of messages per critical-section entry is 2(n – 1), where n is the number of processes.
– this is the minimum number of required messages per critical-section entry when processes act independently and concurrently
Three Undesirable Consequences
1. The processes need to know the identity of all other processes in the system
– makes the dynamic addition and removal of processes more complex
2. If one of the processes fails, then the entire scheme collapses
– this can be dealt with by continuously monitoring the state of all the processes
3. Processes that have not entered their critical section must pause frequently to tell other processes that they intend to enter their critical section
This approach is best suited for small, stable sets of cooperating processes.
2.3. Token Passing
When a process has the token, it can enter its critical section (if it wants to), or pass it on.
Problems: ring breaks, token loss
[Figure: processes P0, P1 and P2, each with a critical section (c.s), and a token passed between them]
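A token-passing round can be simulated very simply. This sketch assumes the processes form a logical ring P0 -> P1 -> ... -> Pn-1 -> P0 and that the token moves one hop at a time; the function name and parameters are hypothetical.

```python
def circulate(n, wants, start=0, hops=None):
    """Pass the token around a ring of n processes.

    wants is the set of process numbers that want to enter their
    critical section. Returns the order in which they enter it.
    """
    hops = n if hops is None else hops
    entered = []
    holder = start
    for _ in range(hops):
        if holder in wants:
            entered.append(holder)   # holder uses the token, then passes it on
        holder = (holder + 1) % n    # assumed ring direction: i -> i + 1
    return entered


order_a = circulate(3, {2, 0})
order_b = circulate(3, {1}, start=2)
```

Only the token holder ever enters, so mutual exclusion follows directly; the listed problems (ring breaks, token loss) are not modelled here.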
3. Atomicity
Either all the operations associated with a program unit are executed to completion, or none are performed.
Ensuring atomicity in a distributed system requires a transaction coordinator.
Transaction Coordinator tasks:
– start the transaction
– break the transaction into subtransactions, and distribute them to appropriate sites for execution
– coordinate the termination of the transaction, which may result in the transaction being committed at all sites or aborted at all sites
3.1. Two-Phase Commit Protocol (2PC)
Assumes a fail-stop model.
Execution of 2PC starts after the last step of the transaction.
When 2PC starts, the transaction may still be executing at some sub-sites.
2PC involves all the sub-sites at which the transaction executed.
Example: Let T be a transaction initiated at site Si and let the transaction coordinator at Si be Ci.
Phase 1: Obtaining a Decision
Ci adds a <prepare T> record to the log.
Ci sends a <prepare T> message to all sites at which T executed.
When a sub-site receives a <prepare T> message, the transaction manager determines if it can commit the transaction.
– If no: add a <no T> record to the log and respond to Ci with <abort T>.
– If yes:
  add a <ready T> record to the log.
  force all log records for T onto stable storage.
  the transaction manager sends a <ready T> message to Ci.
Coordinator collects responses
– all respond "ready", decision is commit
– at least one response is "abort", decision is abort
– at least one participant fails to respond within a time-out period, decision is abort.
Phase 2: Record Decision in DB
Coordinator adds a decision record, <abort T> or <commit T>, to its log and forces the record onto stable storage.
Once the record reaches stable storage, the decision is irrevocable (even if failures occur).
Coordinator sends a message to each participant informing it of the decision (commit or abort).
Participants take appropriate action locally.
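The coordinator's side of the two phases can be sketched as follows. This is a simplified model, not a real implementation: the function names are hypothetical, votes arrive as a dictionary instead of messages, a missing entry stands for a participant that timed out, and the log is a plain list rather than stable storage.

```python
def decide(votes, participants):
    """Phase 1 outcome: commit only if every participant voted "ready"."""
    for site in participants:
        # A missing vote models a time-out; both it and "abort" force abort.
        if votes.get(site) != "ready":
            return "abort"
    return "commit"


def run_2pc(votes, participants, log):
    log.append("<prepare T>")                 # phase 1: logged before sending
    decision = decide(votes, participants)
    log.append(f"<{decision} T>")             # phase 2: decision record
    # The decision is then sent to every participant.
    return {site: decision for site in participants}


sites = ["S1", "S2", "S3"]
log_ok, log_bad = [], []
outcome_ok = run_2pc({"S1": "ready", "S2": "ready", "S3": "ready"}, sites, log_ok)
outcome_bad = run_2pc({"S1": "ready", "S2": "abort"}, sites, log_bad)  # S3 timed out
```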
3.2. Failure Handling in 2PC – Site Failure
When a failed site recovers, it examines its log to decide the fate of T:
The log contains a <commit T> record. In this case, the site executes redo(T).
The log contains an <abort T> record. In this case, the site executes undo(T).
The log contains a <ready T> record. In this case, the site must consult Ci. If Ci is down, the site sends a query-status T message to the other sites.
The log contains no control records concerning T. In this case, the site executes undo(T).
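The four recovery cases map directly onto a lookup over the site's log. In this sketch the function name, the string return values, and the coordinator_decision parameter (None meaning Ci cannot be reached) are all illustrative assumptions.

```python
def recover(log_records, coordinator_decision=None):
    """Decide what a recovering site does about transaction T.

    log_records is a collection of control records such as "<ready T>".
    coordinator_decision is "commit", "abort", or None if Ci is down.
    """
    if "<commit T>" in log_records:
        return "redo"
    if "<abort T>" in log_records:
        return "undo"
    if "<ready T>" in log_records:
        # The site voted ready, so it cannot decide on its own.
        if coordinator_decision == "commit":
            return "redo"
        if coordinator_decision == "abort":
            return "undo"
        return "query-status"   # ask the other sites
    # No control records: the site never voted, so T cannot have committed.
    return "undo"


results = [
    recover({"<ready T>", "<commit T>"}),
    recover({"<abort T>"}),
    recover({"<ready T>"}, coordinator_decision="commit"),
    recover({"<ready T>"}),
    recover(set()),
]
```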
Failure Handling in 2PC – Coordinator Ci Failure (the main problem with 2PC)
If an active site contains a <commit T> record in its log, then T must be committed.
If an active site contains an <abort T> record in its log, then T must be aborted.
If some active site does not contain the record <ready T> in its log, then the failed coordinator Ci cannot have decided to commit T. Rather than wait for Ci to recover, it is better to abort T.
All active sites have a <ready T> record in their logs, but no additional control records. In this case we must wait for the coordinator to recover.
– blocking problem – T is blocked pending the recovery of site Si.
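The cases the active sites work through when the coordinator has failed can be sketched as below. It follows the textbook rule that a <commit T> or <abort T> record at any one active site settles the outcome; the function name and return strings are hypothetical.

```python
def resolve_after_coordinator_failure(active_logs):
    """active_logs: one set of control records per active site."""
    if any("<commit T>" in log for log in active_logs):
        return "commit"
    if any("<abort T>" in log for log in active_logs):
        return "abort"
    if any("<ready T>" not in log for log in active_logs):
        # Some site never voted ready, so Ci cannot have decided to commit.
        return "abort"
    # Every active site is ready but none knows the decision.
    return "blocked"


cases = [
    resolve_after_coordinator_failure([{"<ready T>", "<commit T>"}, {"<ready T>"}]),
    resolve_after_coordinator_failure([{"<ready T>"}, {"<abort T>"}]),
    resolve_after_coordinator_failure([{"<ready T>"}, set()]),
    resolve_after_coordinator_failure([{"<ready T>"}, {"<ready T>"}]),
]
```

The last case is the blocking problem: the sites can do nothing but wait for the coordinator to recover.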
4. Concurrency Control
Modify the centralized concurrency schemes to handle the distribution of transactions.
A transaction manager coordinates the execution of transactions (or subtransactions) which access data at local sites.
A local transaction executes only at that site.
A global transaction executes at several sites.
4.1. Locking Protocols
We can use the two-phase locking protocol in a distributed environment by changing how the lock manager is implemented.
A non-replicated scheme
– each site maintains a local lock manager which administers lock and unlock requests for data items stored at that site
– a simple implementation involves two message transfers for handling lock requests, and one message transfer for handling unlock requests
– deadlock handling is more complex
Single-Coordinator Approach
A single lock manager resides at a single chosen site
– all lock and unlock requests are made at that site
Simple implementation
Simple deadlock handling
Possibility of bottleneck
Vulnerable to loss of the concurrency controller if the single site fails
The multiple-coordinator approach distributes the lock-manager function over several sites.
Majority Protocol (for replicated data)
Avoids drawbacks of central control by dealing with replicated data in a decentralized manner.
More complicated to implement
Deadlock handling must be modified
– possible for deadlock to occur when locking only one data item
Biased Protocol
Similar to majority protocol, but requests for shared locks are prioritized over requests for exclusive locks.
Less overhead on read operations than in majority protocol; but has additional overheads for writes.
Like majority protocol, deadlock handling is complex.
Primary Copy
One of the sites at which a replica resides is designated as the primary site.
– a request to lock a data item is made at the primary site of that item
Concurrency control for replicated data is handled in a similar way to unreplicated data.
A simple implementation, but if the primary site fails, then the data item is unavailable, even though other sites may have a copy.
5. Deadlock Handling
Deadlock Prevention
Deadlock Avoidance
Deadlock Detection
5.1. Deadlock Prevention
Resource-ordering deadlock-prevention – define a global ordering among the system resources.
– assign a unique number to all system resources
– a process may request a resource with unique number i only if it is not holding a resource with a unique number greater than i
– simple to implement; requires little overhead
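The resource-ordering rule amounts to a one-line check before each request. The class below is a hypothetical sketch: resources are represented only by their unique numbers, and a refused request raises an error instead of being delayed or retried.

```python
class OrderedProcess:
    """Tracks the numbered resources one process currently holds."""

    def __init__(self):
        self.held = set()

    def may_request(self, i):
        # Allowed only if no held resource has a number greater than i.
        return not any(h > i for h in self.held)

    def acquire(self, i):
        if not self.may_request(i):
            raise RuntimeError(f"request for {i} violates the global ordering")
        self.held.add(i)


p = OrderedProcess()
p.acquire(2)
p.acquire(5)
can_take_3 = p.may_request(3)   # holding 5 > 3, so this is refused
can_take_7 = p.may_request(7)
```

Since every process acquires resources in increasing order, no cycle of waiting processes can form.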
Banker's algorithm in a dist. system
– designate one of the processes in the system as the process that maintains the information necessary to carry out the Banker's algorithm
– also implemented easily, but may require too much overhead
Priority Deadlock-Prevention Scheme
Each process Pi is assigned a unique priority number
Priority numbers are used to decide whether a process Pi should wait for a process Pj: Pi waits if it has a higher priority than Pj; otherwise Pi is rolled back.
The scheme prevents deadlocks.
– for every edge Pi --> Pj in the wait-for graph, Pi has a higher priority than Pj
– so a cycle cannot exist
Wait-Die Scheme
Based on a non-preemptive technique.
If Pi requests a resource currently held by Pj, Pi is allowed to wait only if it has a smaller timestamp than does Pj (Pi is older than Pj). Otherwise, Pi is rolled back (dies).
Example: Suppose that processes P1, P2, and P3 have timestamps 5, 10, and 15 respectively.
– If P1 requests a resource held by P2, then P1 will wait.
– If P3 requests a resource held by P2, then P3 will be rolled back.
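The wait-die rule is a single timestamp comparison. The function name and return strings below are illustrative; the two calls replay the example with timestamps 5, 10 and 15.

```python
def wait_die(requester_ts, holder_ts):
    """Non-preemptive: an older requester waits, a younger one dies."""
    if requester_ts < holder_ts:
        return "wait"
    return "rollback requester"


ts = {"P1": 5, "P2": 10, "P3": 15}
p1_asks_p2 = wait_die(ts["P1"], ts["P2"])
p3_asks_p2 = wait_die(ts["P3"], ts["P2"])
```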
Wound-Wait Scheme
Based on a preemptive technique
– counterpart to the wait-die scheme
If Pi requests a resource currently held by Pj, Pi is allowed to wait only if it has a larger timestamp than does Pj (Pi is younger than Pj). Otherwise Pj is rolled back (Pj is wounded by Pi).
Example: Suppose that processes P1, P2, and P3 have timestamps 5, 10, and 15 respectively.
– If P1 requests a resource held by P2, then the resource will be preempted from P2 and P2 will be rolled back.
– If P3 requests a resource held by P2, then P3 will wait.
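The wound-wait rule mirrors wait-die with the comparison reversed and the holder, not the requester, rolled back. As before, the function name and return strings are illustrative, and the calls replay the example with timestamps 5, 10 and 15.

```python
def wound_wait(requester_ts, holder_ts):
    """Preemptive: a younger requester waits, an older one wounds the holder."""
    if requester_ts > holder_ts:
        return "wait"
    return "rollback holder"


ts = {"P1": 5, "P2": 10, "P3": 15}
p1_asks_p2 = wound_wait(ts["P1"], ts["P2"])
p3_asks_p2 = wound_wait(ts["P3"], ts["P2"])
```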
5.2. Deadlock Detection – Centralized Approach
Each site keeps a local wait-for graph.
– the nodes of the graph correspond to all the processes that are currently either holding or requesting any of the resources local to that site
A global wait-for graph is maintained in a single coordinator process
– this graph is the union of all local wait-for graphs
There are three different options (points in time) when the global wait-for graph may be constructed
– 1. Whenever a new edge is inserted or removed in one of the local wait-for graphs.
– 2. Periodically, when a number of changes have occurred in a wait-for graph.
– 3. Whenever the coordinator needs to invoke the cycle-detection algorithm.
Unnecessary rollbacks may occur as a result of false cycles.
Two Local Wait-For Graphs (Figure 17.3, p.612)
Global Wait-For Graph (Figure 17.4, p.613)
Detection Algorithm Based on Option 3
Append unique identifiers (timestamps) to requests from different sites.
When process Pi, at site A, requests a resource from process Pj, at site B, a request message with timestamp TS is sent.
The edge Pi --> Pj with the label TS is inserted in the local wait-for graph of A. The edge is inserted in the local wait-for graph of B only if B has received the request message and cannot immediately grant the requested resource.
The Algorithm
1. The coordinator sends an initiating message to each site in the system.
2. On receiving this message, a site sends its local wait-for graph to the coordinator.
3. When the coordinator has received a reply from each site, it constructs a graph as follows:
– (a) The constructed graph contains a vertex for every process in the system.
– (b) The graph has an edge Pi --> Pj if and only if (1) there is an edge Pi --> Pj in one of the wait-for graphs, or (2) an edge Pi --> Pj with some label TS appears in more than one wait-for graph.
– If the constructed graph contains a cycle, then the system is deadlocked.
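Graph construction and the cycle test can be sketched together. The representation is an assumption made for illustration: each local graph is a set of (src, dst, label) triples, unlabelled edges (label None) are taken to be purely local waits and are copied directly, and labelled edges are kept only when the same edge and label appear in more than one local graph.

```python
from collections import Counter


def build_global_graph(local_graphs):
    """Return {process: set of processes it waits for}."""
    labelled = Counter()
    edges = {}
    for graph in local_graphs:
        for src, dst, label in graph:
            if label is None:
                edges.setdefault(src, set()).add(dst)   # rule (1)
            else:
                labelled[(src, dst, label)] += 1
    for (src, dst, _label), count in labelled.items():
        if count > 1:                                   # rule (2)
            edges.setdefault(src, set()).add(dst)
    return edges


def has_cycle(edges):
    """Depth-first search; an edge back to an open node closes a cycle."""
    state = {}

    def visit(u):
        state[u] = "open"
        for v in edges.get(u, ()):
            if state.get(v) == "open" or (v not in state and visit(v)):
                return True
        state[u] = "done"
        return False

    return any(u not in state and visit(u) for u in list(edges))


site_a = {("P1", "P2", None), ("P2", "P3", 7)}
site_b = {("P2", "P3", 7), ("P3", "P4", None), ("P4", "P1", 9)}
no_deadlock = has_cycle(build_global_graph([site_a, site_b]))

# Once site A also records the request labelled 9, the cycle is confirmed.
site_a_later = site_a | {("P4", "P1", 9)}
deadlock = has_cycle(build_global_graph([site_a_later, site_b]))
```

Requiring a labelled edge to appear at both ends is what filters out requests that are still in transit, which is how this option avoids false cycles.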
Local and Global Wait-For Graphs (Figure 17.5, p.614)
Fully Distributed Approach
All controllers share equally the responsibility for detecting deadlock.
Every site constructs a wait-for graph that represents a part of the total graph.
We add one additional node Pex to each local wait-for graph.
If a local wait-for graph contains a cycle that does not involve node Pex, then the system is in a deadlock state.
A cycle involving Pex implies the possibility of a deadlock.
– to find out whether a deadlock does exist, a distributed deadlock-detection algorithm must be called
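A site can classify its augmented local graph by checking for cycles with and without Pex. This is a sketch with hypothetical names: the graph is a dictionary from each process to the set of processes it waits for, and the distributed algorithm that must follow a "possible deadlock" result is not shown.

```python
def has_cycle(edges):
    """Depth-first search; an edge back to an open node closes a cycle."""
    state = {}

    def visit(u):
        state[u] = "open"
        for v in edges.get(u, ()):
            if state.get(v) == "open" or (v not in state and visit(v)):
                return True
        state[u] = "done"
        return False

    return any(u not in state and visit(u) for u in list(edges))


def classify(edges, external="Pex"):
    # Drop Pex: any cycle that remains involves only local processes.
    internal = {
        u: {v for v in vs if v != external}
        for u, vs in edges.items()
        if u != external
    }
    if has_cycle(internal):
        return "deadlock"
    if has_cycle(edges):
        return "possible deadlock"   # every cycle passes through Pex
    return "no cycle"


through_pex = classify({"Pex": {"P1"}, "P1": {"P2"}, "P2": {"Pex"}})
local_only = classify({"P1": {"P2"}, "P2": {"P1"}, "Pex": {"P1"}})
acyclic = classify({"P1": {"P2"}})
```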
Augmented Local Wait-For Graphs (Figure 17.6, p.616)
Augmented Local Wait-For Graph in Site S2 (Figure 17.7, p.617)
6. Election Algorithms
Determine where a new copy of the coordinator should be restarted.
Assume that a unique priority number is associated with each active process in the system, and assume that the priority number of process Pi is i.
Assume a one-to-one correspondence between processes and sites.
The coordinator is always the process with the largest priority number
– when a coordinator fails, the algorithm must elect the active process with the largest priority number
Two algorithms, the bully algorithm and the ring algorithm, can be used to elect a new coordinator in case of failures.
OSes: 16. Dist. Coord 68
6.1. Bully Algorithm6.1. Bully Algorithm
Applicable to systems where every process Applicable to systems where every process can send a message to every other process.can send a message to every other process.
If process If process PPii sends a request that is not sends a request that is not
answered by the coordinator within a time answered by the coordinator within a time interval interval TT, assume that the coordinator has , assume that the coordinator has failed; failed; PPii tries to elect itself as the new tries to elect itself as the new
coordinator.coordinator.
continued
OSes: 16. Dist. Coord 69
PPii sends an election message to every sends an election message to every
process with a higher priority number, process with a higher priority number, PPii
then waits for any of these processes to then waits for any of these processes to answer within answer within TT..
continued
OSes: 16. Dist. Coord 70
If there is no response within T, assume that all processes with numbers greater than i have failed; P_i elects itself the new coordinator.
If an answer is received, P_i begins a time interval T', waiting to receive a message that a process with a higher priority number has been elected.
If no such message arrives within T', assume that the process with the higher number has failed; P_i should restart the algorithm.
If P_i is not the coordinator, then, at any time during execution, P_i may receive one of the following two messages from process P_j.
– P_j is the new coordinator (j > i). P_i, in turn, records this information.
– P_j started an election (j < i). P_i sends a response to P_j and begins its own election algorithm, provided that P_i has not already initiated such an election.
After a failed process recovers, it immediately begins execution of the same algorithm.
If there are no active processes with higher numbers, the recovered process forces all processes with lower numbers to let it become the coordinator process, even if there is a currently active coordinator with a lower number.
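The bully rules above can be sketched as a small sequential simulation. This is only an illustration under simplifying assumptions: the function name `bully_election` and the set `alive` are hypothetical, a process missing from `alive` stands in for a time-out after T, and no process fails while the election is running (so the T' restart case never arises).

```python
def bully_election(initiator, alive):
    """Return the priority number of the coordinator that gets elected.

    `initiator` is the process that noticed the coordinator is silent.
    `alive` is the set of priority numbers of processes that still answer;
    a number missing from the set models a time-out after interval T.
    """
    current = initiator
    while True:
        # The election message goes to every process with a higher number.
        responders = [p for p in alive if p > current]
        if not responders:
            # Nobody answered within T: `current` elects itself.
            return current
        # Every responder starts its own election. They all end at the same
        # winner, so the simulation follows just one of them.
        current = min(responders)


# Process 2 notices that the old coordinator (say, 6) has failed.
print(bully_election(2, {1, 2, 3, 5}))
```

With these inputs the election passes from 2 to 3 to 5, and 5 becomes coordinator because no higher-numbered process answers.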
6.2. Ring Algorithm
Applicable to systems organized as a ring (logically or physically).
Assumes that the links are unidirectional, and that processes send their messages to their right neighbors.
Each process maintains an active list, consisting of all the priority numbers of all active processes in the system when the algorithm ends.
If process P_i detects a coordinator failure, it creates a new active list that is initially empty. It then sends a message elect(i) to its right neighbor and adds the number i to its active list.
If P_i receives a message elect(j) from the process on the left, it must respond in one of three ways:
1. If this is the first elect message it has seen or sent, P_i creates a new active list with the numbers i and j. It then sends the message elect(i), followed by the message elect(j).
2. If i ≠ j (the message does not carry P_i's own number), P_i adds j to its active list and forwards the message to its right neighbor.
3. If i = j (P_i receives its own message elect(i) back), the active list for P_i now contains all the active processes in the system; P_i can determine the new coordinator, which is the process with the largest priority number.
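The three responses can be traced with a short simulation. It is a sketch under assumptions not stated in the slides: the ring holds only active processes, links deliver messages in FIFO order, and the names `ring_election`, `active` and `in_flight` are made up for the example.

```python
from collections import deque


def ring_election(ring, initiator):
    """Simulate the ring algorithm on a unidirectional ring.

    `ring` lists the priority numbers of the active processes in ring order;
    the right neighbor of ring[k] is ring[(k + 1) % len(ring)].
    Returns (active lists, coordinator chosen by each process).
    """
    n = len(ring)
    active = {}           # priority number -> its active list
    coordinator = {}      # priority number -> coordinator it decided on
    in_flight = deque()   # (destination index, j) for each elect(j) message

    # The initiator detects the failure: list starts with i, elect(i) goes right.
    start = ring.index(initiator)
    active[initiator] = [initiator]
    in_flight.append(((start + 1) % n, initiator))

    while in_flight:
        idx, j = in_flight.popleft()
        i = ring[idx]
        right = (idx + 1) % n
        if i not in active:
            # Case 1: first elect message seen; send elect(i), then elect(j).
            active[i] = [i, j]
            in_flight.append((right, i))
            in_flight.append((right, j))
        elif j != i:
            # Case 2: record j and forward the message.
            active[i].append(j)
            in_flight.append((right, j))
        else:
            # Case 3: own message came back; the list is complete.
            coordinator[i] = max(active[i])
    return active, coordinator


lists, chosen = ring_election([3, 7, 5], initiator=3)
print(chosen)
```

In this run every process ends with an active list containing 3, 5 and 7, and each one independently picks 7 as coordinator.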
7. Reaching Agreement
There are applications where a set of processes wish to agree on a common "value".
Such agreement may not take place due to:
– a faulty communication medium
– faulty processes
  – processes may send garbled or incorrect messages to other processes
  – a subset of the processes may collaborate with each other in an attempt to defeat the scheme
7.1. Faulty Communications
Process P_i at site A has sent a message to process P_j at site B; to proceed, P_i needs to know if P_j has received the message.
Detect failures using a time-out scheme.
– When P_i sends out a message, it also specifies a time interval during which it is willing to wait for an acknowledgment message from P_j.
– When P_j receives the message, it immediately sends an acknowledgment to P_i.
– If P_i receives the acknowledgment message within the specified time interval, it concludes that P_j has received its message. If a time-out occurs, P_i needs to retransmit its message and wait for an acknowledgment.
– Continue until P_i either receives an acknowledgment, or is notified by the system that B is down.
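A minimal sketch of the sender's side of this time-out scheme follows. The names are hypothetical: `ack_arrives(attempt)` stands in for "send the message and wait one time interval for the acknowledgment", and `max_attempts` is an assumed stand-in for the system's notification that site B is down.

```python
def send_until_acked(ack_arrives, max_attempts):
    """P_i's retransmission loop.

    `ack_arrives(attempt)` returns True if the acknowledgment from P_j came
    back within the time interval on this attempt, False on a time-out.
    Returns the attempt number that succeeded, or None when the attempt
    budget runs out (here modelling "B is down").
    """
    for attempt in range(1, max_attempts + 1):
        if ack_arrives(attempt):
            # P_i concludes that P_j has received the message.
            return attempt
        # Time-out: fall through and retransmit.
    return None


# A channel that loses the first two transmissions (or their acknowledgments).
print(send_until_acked(lambda attempt: attempt >= 3, max_attempts=5))
```

Note that a time-out does not tell P_i whether the message or the acknowledgment was lost, so P_j may see the same message more than once.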
Suppose that P_j also needs to know that P_i has received its acknowledgment message, in order to decide on how to proceed.
– In the presence of failure, it is not possible to accomplish this task.
– It is very hard (time consuming) in a distributed environment for processes P_i and P_j to agree completely on their respective states.
7.2. Faulty Processes (Byzantine Generals Problem)
The communication medium is reliable, but processes can fail in unpredictable ways.
Consider a system of n processes, of which no more than m are faulty.
Suppose that each process P_i has some private value V_i.
Devise an algorithm that allows each non-faulty P_i to construct a vector X_i = (A_{i,1}, A_{i,2}, …, A_{i,n}) such that:
– if P_j is a non-faulty process, then A_{i,j} = V_j.
– if P_i and P_j are both non-faulty processes, then X_i = X_j.
Solutions share the following properties.
– A correct algorithm can be devised only if n >= 3m + 1 (for example, m = 1 requires at least n = 4 processes).
– The worst-case delay for reaching agreement is proportional to m + 1 message-passing delays.
An algorithm for the case where m = 1 and n = 4 requires two rounds of information exchange:
– each process sends its private value to the other 3 processes
– each process sends the information it has obtained in the first round to all other processes
If a faulty process refuses to send messages, a non-faulty process can choose an arbitrary value and pretend that that value was sent by that process.
After the two rounds are completed, a non-faulty process P_i can construct its vector X_i = (A_{i,1}, A_{i,2}, A_{i,3}, A_{i,4}) as follows:
– A_{i,i} = V_i.
– for j ≠ i, if at least two of the three values reported for process P_j agree, then the majority value is used to set the value of A_{i,j}. Otherwise, a default value (nil) is used.
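The two rounds and the majority rule for m = 1, n = 4 can be sketched as below. The names are illustrative assumptions: `build_vectors`, the `faulty` process id, and the `lie(receiver, subject)` callback that decides what the faulty process tells each receiver are all hypothetical, and Python's `None` plays the role of nil.

```python
from collections import Counter


def build_vectors(values, faulty=None, lie=lambda receiver, subject: None):
    """Return {i: X_i} for every non-faulty process P_i.

    `values` maps each process id to its private value V_i.
    `faulty` is the id of the (at most one) faulty process.
    `lie(receiver, subject)` is what the faulty process tells `receiver`
    when it reports a value for process `subject`.
    """
    procs = sorted(values)

    def said(sender, receiver, subject, honest):
        # A non-faulty sender passes on exactly what it knows.
        return lie(receiver, subject) if sender == faulty else honest

    # Round 1: direct[i][j] is the value P_i received from P_j.
    direct = {i: {j: said(j, i, j, values[j]) for j in procs if j != i}
              for i in procs}

    vectors = {}
    for i in procs:
        if i == faulty:
            continue
        x = {i: values[i]}                      # A_{i,i} = V_i
        for j in procs:
            if j == i:
                continue
            reports = [direct[i][j]]            # heard from P_j in round 1
            for k in procs:                     # round 2: relayed by the others
                if k not in (i, j):
                    reports.append(said(k, i, j, direct[k][j]))
            value, count = Counter(reports).most_common(1)[0]
            x[j] = value if count >= 2 else None  # majority of three, else nil
        vectors[i] = x
    return vectors


# P_4 is faulty and tells every receiver something different.
result = build_vectors({1: "a", 2: "b", 3: "c", 4: "d"}, faulty=4,
                       lie=lambda receiver, subject: f"x{receiver}{subject}")
print(result[1] == result[2] == result[3])
```

Here the three non-faulty processes recover each other's values by majority and all fall back to nil for P_4, so their vectors are identical.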