OSes: 16. Dist. Coord

Operating Systems

Objectives
– introduce issues such as event ordering, mutual exclusion, atomicity, deadlock

Certificate Program in Software Development
CSE-TC and CSIM, AIT
September -- November, 2003

16. Distributed Coordination
(S&G 6th ed., Ch. 17)


Overview

1. Event Ordering

2. Distributed Mutual Exclusion

3. Atomicity

4. Concurrency Control

5. Deadlock Handling

6. Election Algorithms

7. Reaching Agreement


1. Event Ordering

Happened-before relation (denoted by -->).
– If A and B are events in the same process, and A was executed before B, then A --> B.
– If A is the event of sending a message by one process and B is the event of receiving that message by another process, then A --> B.
– If A --> B and B --> C then A --> C.


Relative Time for 3 Concurrent Processes (Figure 17.1, p.597)

[Figure: events in three processes plotted against a time axis, with messages drawn between events]


1.1. Implementation of -->

Associate a timestamp with each system event. Require that for every pair of events A and B, if A --> B, then the timestamp of A is less than the timestamp of B.

Each process Pi has a logical clock, LCi
– implemented as a counter, incremented when an event occurs

A process advances its logical clock when it receives a message whose timestamp is greater than the current value of its logical clock.

If the timestamps of two events A and B are the same, then the events are concurrent.
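The logical clock rule can be sketched in a few lines of Python. This is a minimal illustration, not code from the textbook: the class name `LogicalClock` and its methods are hypothetical, and it assumes that a receive counts as an event of its own.

```python
class LogicalClock:
    """Per-process logical clock LCi, kept as a simple counter."""

    def __init__(self):
        self.time = 0

    def tick(self):
        # Any local event (including sending a message) increments the counter.
        self.time += 1
        return self.time

    def receive(self, msg_timestamp):
        # Advance to the sender's timestamp if it is ahead of our clock,
        # then count the receive itself as an event.
        if msg_timestamp > self.time:
            self.time = msg_timestamp
        self.time += 1
        return self.time


p1, p2 = LogicalClock(), LogicalClock()
send_ts = p1.tick()            # P1 sends a message stamped with its clock
recv_ts = p2.receive(send_ts)  # P2 receives it; its clock moves past send_ts
```

With this rule the send event always carries a smaller timestamp than the matching receive, which is what the happened-before relation requires.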


2. Distributed Mutual Exclusion (DME)

Assumptions:
– the system consists of n processes; each process Pi resides at a different processor;
– each process has a critical section that requires mutual exclusion

Requirement:
– if Pi is executing in its critical section, then no other process Pj is executing in its own critical section


2.1. DME: Centralized Approach

One of the processes in the system is chosen to coordinate the entry to the critical sections.

A process that wants to enter its critical section sends a request message to the coordinator.

The coordinator decides which process can enter the critical section next, and it sends that process a reply message.

When the process receives a reply message from the coordinator, it enters its critical section.

After exiting its critical section, the process sends a release message to the coordinator and proceeds with its execution.

This scheme requires three messages per critical-section entry:
– request
– reply
– release
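A rough sketch of the coordinator's bookkeeping follows. The `Coordinator` class is hypothetical, message passing is modelled as plain method calls, and serving queued requests in FIFO order is an assumed policy (the slides only say the coordinator decides who enters next).

```python
from collections import deque


class Coordinator:
    """Hypothetical coordinator: lets one process at a time into its critical section."""

    def __init__(self):
        self.holder = None      # process currently in its critical section
        self.waiting = deque()  # queued requests, served FIFO (assumed policy)
        self.messages = 0       # counts request, reply and release messages

    def request(self, pid):
        self.messages += 1      # the request message
        if self.holder is None:
            return self._grant(pid)
        self.waiting.append(pid)
        return None             # no reply yet; the process must wait

    def release(self, pid):
        assert pid == self.holder, "only the current holder can release"
        self.messages += 1      # the release message
        self.holder = None
        if self.waiting:
            return self._grant(self.waiting.popleft())
        return None

    def _grant(self, pid):
        self.messages += 1      # the reply message
        self.holder = pid
        return pid


c = Coordinator()
first = c.request("P1")   # granted at once
queued = c.request("P2")  # deferred: P1 is inside its critical section
nxt = c.release("P1")     # P2 now receives its reply
c.release("P2")
```

Two critical-section entries cost six messages here, matching the three-messages-per-entry count above.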


2.2. DME: Fully Distributed Approach

When process Pi wants to enter its critical section, it generates a new timestamp, TS, and sends the message request(Pi, TS) to all other processes in the system.

When process Pj receives a request message, it may reply immediately or it may defer sending a reply back.

When process Pi receives a reply message from all other processes in the system, it can enter its critical section.

After exiting its critical section, the process sends reply messages to all its deferred requests.


The decision whether process Pj replies immediately to a request(Pi, TS) message or defers its reply is based on three factors:
– 1. if Pj is in its critical section, then it defers its reply to Pi
– 2. if Pj does not want to enter its critical section, then it sends a reply immediately to Pi
– 3. if Pj wants to enter its critical section but has not yet entered it, then it compares its own request timestamp with the timestamp TS. If its own request timestamp is greater than TS, then it sends a reply immediately to Pi (Pi asked first). Otherwise, the reply is deferred.
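The reply-or-defer decision made by Pj can be written as one small function. This is a sketch under stated assumptions: the function name and the `(timestamp, process_id)` tuple encoding are hypothetical, and breaking timestamp ties by process id is an assumption the slides do not spell out.

```python
def should_reply_now(in_cs, wants_cs, own_request, incoming_request):
    """Decide whether Pj replies immediately to request(Pi, TS).

    Requests are (timestamp, process_id) tuples; comparing the tuples
    breaks equal timestamps by process id (an assumed tie-break).
    """
    if in_cs:
        return False   # factor 1: defer while inside the critical section
    if not wants_cs:
        return True    # factor 2: not competing, so reply at once
    # factor 3: both want to enter; the earlier request wins
    return own_request > incoming_request


# Pj (id 2) asked at time 10, Pi (id 1) asked at time 5: Pi asked first.
reply_to_earlier = should_reply_now(False, True, (10, 2), (5, 1))
# Pj asked at time 3, before Pi: Pj defers its reply.
reply_to_later = should_reply_now(False, True, (3, 2), (5, 1))
```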


Desirable Behavior of Fully Distributed Approach

Freedom from deadlock.

Freedom from starvation, since entry to the critical section is scheduled according to the timestamp ordering
– the timestamp ordering means that processes are served in a FCFS order

The number of messages per critical-section entry is 2(n – 1), where n is the number of processes.
– this is the minimum number of required messages per critical-section entry when processes act independently and concurrently


Three Undesirable Consequences

1. The processes need to know the identity of all other processes in the system
– makes the dynamic addition and removal of processes more complex

2. If one of the processes fails, then the entire scheme collapses
– this can be dealt with by continuously monitoring the state of all the processes

3. Processes that have not entered their critical section must pause frequently to tell other processes that they intend to enter their critical section

This approach is best suited for small, stable sets of cooperating processes.


2.3. Token Passing

When a process has the token, it can enter its critical section (if it wants to), or pass it on.

Problems: ring breaks, token loss

[Figure: processes P0, P1 and P2, each with a critical section (c.s), with a token passed between them]
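A toy simulation of one trip of the token around the ring is sketched below. The function `circulate` and its parameters are hypothetical, and the sketch assumes the failure cases listed above (ring breaks, token loss) do not occur.

```python
def circulate(n, wants_cs, start=0):
    """Pass a single token once around a ring of n processes.

    wants_cs is the set of process indices that want their critical
    section; the return value is the order in which they enter it.
    """
    order = []
    pending = set(wants_cs)
    holder = start
    for _ in range(n):
        if holder in pending:      # only the token holder may enter
            order.append(holder)
            pending.discard(holder)
        holder = (holder + 1) % n  # pass the token to the next process
    return order


entry_order = circulate(3, {2, 0})  # P0 and P2 want to enter, P1 does not
```

Because there is only one token, at most one process can be in its critical section at any moment.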


3. Atomicity

Either all the operations associated with a program unit are executed to completion, or none are performed.

Ensuring atomicity in a distributed system requires a transaction coordinator.

Transaction Coordinator tasks:
– start the transaction
– break the transaction into subtransactions, and distribute them to appropriate sites for execution
– coordinate the termination of the transaction, which may result in commits or aborts at all sites


3.1. Two-Phase Commit Protocol (2PC)

Assumes a fail-stop model.

Execution of 2PC starts after the last step of the transaction.

When 2PC starts, the transaction may still be executing at some sub-sites.

2PC involves all the sub-sites at which the transaction executed.

Example: Let T be a transaction initiated at site Si and let the transaction coordinator at Si be Ci.


Phase 1: Obtaining a Decision

Ci adds <prepare T> record to the log.

Ci sends <prepare T> message to all sites.

When a sub-site receives a <prepare T> message, the transaction manager determines if it can commit the transaction.
– If no: add <no T> record to the log and respond to Ci with <abort T>
– If yes: add <ready T> record to the log, force all log records for T onto stable storage, and send a <ready T> message to Ci.

Coordinator collects responses
– all respond “ready”, decision is commit
– at least one response is “abort”, decision is abort
– at least one participant fails to respond within a time-out period, decision is abort.


Phase 2: Record Decision in DB

Coordinator adds a decision record, <abort T> or <commit T>, to its log and forces the record onto stable storage.

Once the info reaches stable storage, it is irrevocable (even if failures occur).

Coordinator sends a message to each participant informing it of the decision (commit or abort).

Participants take appropriate action locally.
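The coordinator's side of the two phases can be sketched as follows. This is a simplified illustration: the function `two_phase_commit` is hypothetical, the log is an in-memory list standing in for stable storage, and a vote of `None` models a participant that timed out.

```python
def two_phase_commit(votes, log):
    """Run the coordinator side of 2PC for one transaction T.

    votes maps each participant to "ready", "abort" or None (timed out).
    Returns the decision message sent to each participant.
    """
    # Phase 1: log <prepare T>, then collect the participants' votes.
    log.append("<prepare T>")
    all_ready = all(v == "ready" for v in votes.values())
    decision = "commit" if all_ready else "abort"

    # Phase 2: the decision is logged before anyone is told about it.
    log.append(f"<{decision} T>")
    return {site: decision for site in votes}


log = []
outcome = two_phase_commit({"S1": "ready", "S2": "ready"}, log)

timeout_log = []
timeout_outcome = two_phase_commit({"S1": "ready", "S2": None}, timeout_log)
```

A single abort vote or a single missing answer is enough to abort T at every site.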


3.2. Failure Handling in 2PC – Site Failure

When a failed site recovers, it examines its log to decide the fate of T:

The log contains a <commit T> record. In this case, the site executes redo(T).

The log contains an <abort T> record. In this case, the site executes undo(T).

The log contains a <ready T> record; consult Ci. If Ci is down, the site sends a query-status T message to the other sites.

The log contains no control records concerning T. In this case, the site executes undo(T).


Failure Handling in 2PC – Coordinator Ci Failure (the main problem with 2PC)

If an active site contains a <commit T> record in its log, then T must be committed.

If an active site contains an <abort T> record in its log, then T must be aborted.

If some active site does not contain the record <ready T> in its log, then the failed coordinator Ci cannot have decided to commit T. Rather than wait for Ci to recover, it is better to abort T.

All active sites have a <ready T> record in their logs, but no additional control records. In this case we must wait for the coordinator to recover.
– blocking problem – T is blocked pending the recovery of site Si.
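The case analysis for a failed coordinator can be sketched as a decision function over the logs of the active sites. The function name and the set-of-strings log encoding are hypothetical; the sketch treats a <commit T> record at any one active site as sufficient to commit, as in the textbook's formulation.

```python
def resolve_without_coordinator(site_logs):
    """Decide T's fate from the active sites' logs while Ci is down.

    site_logs is a list of sets of control records for T, for example
    {"ready", "commit"}. Returns "commit", "abort" or "blocked".
    """
    if any("commit" in log for log in site_logs):
        return "commit"
    if any("abort" in log for log in site_logs):
        return "abort"
    if any("ready" not in log for log in site_logs):
        return "abort"   # Ci cannot have decided to commit T
    return "blocked"     # every site is ready: wait for Ci to recover


blocked_case = resolve_without_coordinator([{"ready"}, {"ready"}])
```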


4. Concurrency Control

Modify the centralized concurrency schemes to handle the distribution of transactions.

A transaction manager coordinates the execution of transactions (or subtransactions) which access data at local sites.

A local transaction executes only at its own site.

A global transaction executes at several sites.


4.1. Locking Protocols

We can use the two-phase locking protocol in a distributed environment by changing how the lock manager is implemented.

A non-replicated scheme
– each site maintains a local lock manager which administers lock and unlock requests for data items stored at that site
– a simple implementation involves two message transfers for handling lock requests, and one message transfer for handling unlock requests
– deadlock handling is more complex


Single-Coordinator Approach

A single lock manager resides at a single chosen site
– all lock and unlock requests are made at that site

Simple implementation

Simple deadlock handling

Possibility of bottleneck

Vulnerable to loss of concurrency controller if single site fails

Multiple-coordinator approach distributes lock-manager function over several sites.


Majority Protocol (for replicated data)

Avoids drawbacks of central control by dealing with replicated data in a decentralized manner.

More complicated to implement

Deadlock handling must be modified
– possible for deadlock to occur when locking only one data item


Biased Protocol

Similar to majority protocol, but requests for shared locks are prioritized over requests for exclusive locks.

Less overhead on read operations than in majority protocol; but has additional overheads for writes.

Like majority protocol, deadlock handling is complex.


Primary Copy

One of the sites at which a replica resides is designated as the primary site.
– a request to lock a data item is made at the primary site of that item

Concurrency control for replicated data is handled in a similar way to unreplicated data.

A simple implementation, but if the primary site fails, then the data item is unavailable, even though other sites may have a copy.


5. Deadlock Handling

Deadlock Prevention

Deadlock Avoidance

Deadlock Detection


5.1. Deadlock Prevention

Resource-ordering deadlock-prevention – define a global ordering among the system resources.
– assign a unique number to all system resources
– a process may request a resource with unique number i only if it is not holding a resource with a unique number greater than i
– simple to implement; requires little overhead
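The resource-ordering rule reduces to a one-line check. The function name `may_request` and the use of a set of held resource numbers are illustrative assumptions.

```python
def may_request(held, i):
    """Allow a request for resource number i only if no held resource
    has a unique number greater than i."""
    return all(h <= i for h in held)


allowed = may_request({1, 3}, 5)  # holds only lower-numbered resources
refused = may_request({1, 7}, 5)  # already holds resource 7, which is above 5
```

Since every process acquires resources in increasing number order, no circular wait can form.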



Banker’s algorithm in a dist. system
– designate one of the processes in the system as the process that maintains the information necessary to carry out the Banker’s algorithm
– also implemented easily, but may require too much overhead


Priority Deadlock-Prevention Scheme

Each process Pi is assigned a unique priority number

Priority numbers are used to decide whether a process Pi should wait for a process Pj; otherwise Pi is rolled back.

The scheme prevents deadlocks.
– for every edge Pi --> Pj in the wait-for graph, Pi has a higher priority than Pj, so a cycle cannot exist


Wait-Die Scheme

Based on a non-preemptive technique.

If Pi requests a resource currently held by Pj, Pi is allowed to wait only if it has a smaller timestamp than does Pj (Pi is older than Pj). Otherwise, Pi is rolled back (dies).

Example: Suppose that processes P1, P2, and P3 have timestamps 5, 10, and 15 respectively.
– If P1 requests a resource held by P2, then P1 will wait.
– If P3 requests a resource held by P2, then P3 will be rolled back.


Wound-Wait Scheme

Based on a preemptive technique
– counterpart to the wait-die system

If Pi requests a resource currently held by Pj, Pi is allowed to wait only if it has a larger timestamp than does Pj (Pi is younger than Pj). Otherwise Pj is rolled back (Pj is wounded by Pi).

Example: Suppose that processes P1, P2, and P3 have timestamps 5, 10, and 15 respectively.
– If P1 requests a resource held by P2, then the resource will be preempted from P2 and P2 will be rolled back.
– If P3 requests a resource held by P2, then P3 will wait.
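The two timestamp schemes (wait-die and its preemptive counterpart) differ only in which side of the comparison is rolled back. The sketch below uses hypothetical function names and return strings, and assumes timestamps are unique; it replays the P1, P2, P3 examples.

```python
def wait_die(requester_ts, holder_ts):
    # Non-preemptive: an older requester waits, a younger one dies.
    return "wait" if requester_ts < holder_ts else "rollback requester"


def wound_wait(requester_ts, holder_ts):
    # Preemptive: a younger requester waits, an older one wounds the holder.
    return "wait" if requester_ts > holder_ts else "rollback holder"


ts = {"P1": 5, "P2": 10, "P3": 15}  # smaller timestamp means older process

wd_p1 = wait_die(ts["P1"], ts["P2"])    # P1 is older than P2
wd_p3 = wait_die(ts["P3"], ts["P2"])    # P3 is younger than P2
ww_p1 = wound_wait(ts["P1"], ts["P2"])  # P1 is older than P2
ww_p3 = wound_wait(ts["P3"], ts["P2"])  # P3 is younger than P2
```

In both schemes it is always the younger process that is rolled back, so an old process cannot be starved indefinitely.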


5.2. Deadlock Detection – Centralized Approach

Each site keeps a local wait-for graph.
– the nodes of the graph correspond to all the processes that are currently either holding or requesting any of the resources local to that site

A global wait-for graph is maintained in a single coordination process
– this graph is the union of all local wait-for graphs

There are three different options (points in time) when the global wait-for graph may be constructed
– 1. Whenever a new edge is inserted or removed in one of the local wait-for graphs.
– 2. Periodically, when a number of changes have occurred in a wait-for graph.
– 3. Whenever the coordinator needs to invoke the cycle-detection algorithm.

Unnecessary rollbacks may occur as a result of false cycles.


Two Local Wait-For Graphs (Figure 17.3, p.612)


Global Wait-For Graph (Figure 17.4, p.613)


Detection Algorithm Based on Option 3

Append unique identifiers (timestamps) to requests from different sites.

When process Pi, at site A, requests a resource from process Pj, at site B, a request message with timestamp TS is sent.

The edge Pi --> Pj with the label TS is inserted in the local wait-for graph of A. The edge is inserted in the local wait-for graph of B only if B has received the request message and cannot immediately grant the requested resource.


The Algorithm

1. The coordinator sends an initiating message to each site in the system.

2. On receiving this message, a site sends its local wait-for graph to the coordinator.

3. When the coordinator has received a reply from each site, it constructs a graph as follows:
– (a) The constructed graph contains a vertex for every process in the system.
– (b) The graph has an edge Pi --> Pj if and only if (1) there is an edge Pi --> Pj in one of the wait-for graphs, or (2) an edge Pi --> Pj with some label TS appears in more than one wait-for graph.
– If the constructed graph contains a cycle, then the system is in a deadlock state.
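Step 3 can be sketched as a graph merge followed by a cycle check. The encoding is an assumption made for illustration: each local graph is a set of `(src, dst, label)` tuples, where `label` is `None` for an edge between processes at the same site and the request timestamp otherwise. The function names are hypothetical, and the depth-first search is one common way to find a cycle.

```python
from collections import Counter


def build_global_graph(local_graphs):
    """Merge local wait-for graphs following rule (b)."""
    labelled = Counter(e for g in local_graphs for e in g if e[2] is not None)
    edges = {}
    for g in local_graphs:
        for src, dst, label in g:
            # Keep unlabelled edges, and labelled edges reported by
            # more than one site.
            if label is None or labelled[(src, dst, label)] > 1:
                edges.setdefault(src, set()).add(dst)
    return edges


def has_cycle(edges):
    """Depth-first search for a cycle in a directed graph."""
    visiting, done = set(), set()

    def visit(node):
        if node in done:
            return False
        if node in visiting:
            return True          # back edge: a cycle exists
        visiting.add(node)
        found = any(visit(nxt) for nxt in edges.get(node, ()))
        visiting.discard(node)
        done.add(node)
        return found

    return any(visit(n) for n in list(edges))


site_a = {("P1", "P2", None), ("P2", "P3", 7), ("P3", "P1", 9)}
site_b = {("P2", "P3", 7), ("P3", "P1", 9)}
deadlocked = has_cycle(build_global_graph([site_a, site_b]))

# Request 7 has not yet been recorded at site B, so its edge is left out.
site_b_pending = {("P3", "P1", 9)}
in_transit = has_cycle(build_global_graph([site_a, site_b_pending]))
```

Dropping labelled edges seen at only one site is what keeps a request still in transit from producing a false cycle.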


Local and Global Wait-For Graphs (Figure 17.5, p.614)


Fully Distributed Approach

All controllers share equally the responsibility for detecting deadlock.

Every site constructs a wait-for graph that represents a part of the total graph.

We add one additional node Pex to each local wait-for graph.

If a local wait-for graph contains a cycle that does not involve node Pex, then the system is in a deadlock state.

A cycle involving Pex implies the possibility of a deadlock.
– to find out whether a deadlock does exist, a distributed deadlock-detection algorithm must be called


Augmented Local Wait-For Graphs (Figure 17.6, p.616)


Augmented Local Wait-For Graph in Site S2 (Figure 17.7, p.617)


6. Election Algorithms

Determine where a new copy of the coordinator should be restarted.

Assume that a unique priority number is associated with each active process in the system, and assume that the priority number of process Pi is i.

Assume a one-to-one correspondence between processes and sites.

The coordinator is always the process with the largest priority number
– when a coordinator fails, the algorithm must elect the active process with the largest priority number

Two algorithms, the bully algorithm and the ring algorithm, can be used to elect a new coordinator in case of failures.


6.1. Bully Algorithm

Applicable to systems where every process can send a message to every other process.

If process Pi sends a request that is not answered by the coordinator within a time interval T, assume that the coordinator has failed; Pi tries to elect itself as the new coordinator.

Pi sends an election message to every process with a higher priority number. Pi then waits for any of these processes to answer within T.

If no response within T, assume that all processes with numbers greater than i have failed; Pi elects itself the new coordinator.

If an answer is received, Pi begins time interval T´, waiting to receive a message that a process with a higher priority number has been elected.

If no message is sent within T´, assume the process with a higher number has failed; Pi should restart the algorithm.


If Pi is not the coordinator, then, at any time during execution, Pi may receive one of the following two messages from process Pj.

– Pj is the new coordinator (j > i). Pi, in turn, records this information.

– Pj started an election (j < i). Pi sends a response to Pj and begins its own election algorithm, provided that Pi has not already initiated such an election.


After a failed process recovers, it immediately begins execution of the same algorithm.

If there are no active processes with higher numbers, the recovered process forces all processes with lower numbers to let it become the coordinator process, even if there is a currently active coordinator with a lower number.
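The election steps above can be sketched as a small single-threaded simulation. This is only an illustration under stated assumptions: the function name `bully_election` and the `alive` set are hypothetical, a time-out T is modelled as "no higher-numbered process is alive", and only the lowest-numbered responder's election is followed (in the real algorithm every responder starts its own election, and all of them end at the same winner).

```python
def bully_election(initiator, alive):
    """Simulate one election started by `initiator`.

    `alive` is the set of priority numbers of processes that are up
    (an assumption of this sketch, standing in for real time-outs).
    Returns (coordinator, trace), where trace lists the processes that
    ran the election in turn.
    """
    if initiator not in alive:
        raise ValueError("a failed process cannot start an election")
    trace = [initiator]
    current = initiator
    while True:
        # Election messages go to every process with a higher number;
        # only the live ones answer within T.
        responders = sorted(p for p in alive if p > current)
        if not responders:
            # No answer within T: current elects itself coordinator.
            return current, trace
        # A responder takes over and begins its own election.
        current = responders[0]
        trace.append(current)


# Process 2 notices the coordinator is gone; processes 4 and 6 are down.
print(bully_election(2, {1, 2, 3, 5}))
# A recovered process with the highest number takes over immediately.
print(bully_election(7, {1, 2, 5, 7}))
```

Whatever the starting process, the result is the highest-numbered live process, which is why a recovering process with the largest number displaces a currently active coordinator.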


6.2. Ring Algorithm

Applicable to systems organized as a ring (logically or physically).

Assumes that the links are unidirectional, and that processes send their messages to their right neighbors.


Each process maintains an active list, consisting of all the priority numbers of all active processes in the system when the algorithm ends.

If process Pi detects a coordinator failure, it creates a new active list that is initially empty. It then sends a message elect(i) to its right neighbor, and adds the number i to its active list.


If Pi receives a message elect(j) from the process on the left, it must respond in one of three ways:

1. If this is the first elect message it has seen or sent, Pi creates a new active list with the numbers i and j. It then sends the message elect(i), followed by the message elect(j).

2. If i ≠ j, then Pi adds j to its active list and forwards the message to its right neighbor.

3. If i = j, then Pi has received the message elect(i): the active list for Pi now contains all the active processes in the system, and Pi can determine the new coordinator process based on priority.
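A minimal simulation of the ring algorithm is sketched below. It is an illustration, not the textbook's code: the function name `ring_election` is hypothetical, the ring is assumed to contain only active processes, links are assumed to deliver messages in FIFO order, no process fails during the election, and "based on priority" is taken to mean that the largest number wins.

```python
from collections import deque


def ring_election(ring, initiator):
    """Run one election on a unidirectional ring.

    `ring` lists the priority numbers of the active processes in ring
    order; each process sends to the next entry (its right neighbor).
    Returns a dict mapping each process to its final active list.
    """
    position = {p: k for k, p in enumerate(ring)}

    def right(p):
        return ring[(position[p] + 1) % len(ring)]

    # The initiator creates a new active list and adds its own number.
    active = {initiator: [initiator]}
    # Each queue entry (receiver, j) is a message elect(j) in transit.
    queue = deque([(right(initiator), initiator)])
    while queue:
        i, j = queue.popleft()
        if i not in active:
            # Case 1: first elect message seen; send elect(i), then elect(j).
            active[i] = [i, j]
            queue.append((right(i), i))
            queue.append((right(i), j))
        elif j != i:
            # Case 2: record j and forward the message to the right.
            if j not in active[i]:
                active[i].append(j)
            queue.append((right(i), j))
        # Case 3 (j == i): the process's own message came back, so its
        # active list is complete and the message is not forwarded.
    return active


lists = ring_election([3, 7, 1, 5], initiator=1)
coordinators = {p: max(members) for p, members in lists.items()}
print(coordinators)
```

Every process ends up with the same set of numbers in its active list, so each one independently picks the same coordinator without a further round of messages.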


7. Reaching Agreement

There are applications where a set of processes wish to agree on a common “value”.

Such agreement may not take place due to:

– a faulty communication medium

– faulty processes
   – processes may send garbled or incorrect messages to other processes
   – a subset of the processes may collaborate with each other in an attempt to defeat the scheme


7.1. Faulty Communications

Process Pi at site A has sent a message to process Pj at site B; to proceed, Pi needs to know if Pj has received the message.

Detect failures using a time-out scheme.

– When Pi sends out a message, it also specifies a time interval during which it is willing to wait for an acknowledgment message from Pj.

– When Pj receives the message, it immediately sends an acknowledgment to Pi.


– If Pi receives the acknowledgment message within the specified time interval, it concludes that Pj has received its message. If a time-out occurs, Pi needs to retransmit its message and wait for an acknowledgment.

– Continue until Pi either receives an acknowledgment, or is notified by the system that B is down.
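The time-out scheme can be sketched with a scripted lossy link. Everything named here is hypothetical: `LossyLink` replays a fixed list of outcomes instead of using real timers, and `max_attempts` merely stands in for "notified by the system that B is down".

```python
class LossyLink:
    """Scripted link between Pi and Pj (an assumption of this sketch).

    Each transmission consumes one outcome: "msg_lost", "ack_lost" or
    "ok". Once the script runs out, every transmission succeeds.
    """

    def __init__(self, outcomes):
        self.outcomes = list(outcomes)
        self.received_by_pj = 0  # how many copies Pj actually saw

    def transmit(self, message):
        """Return True if Pi gets an acknowledgment before the time-out."""
        outcome = self.outcomes.pop(0) if self.outcomes else "ok"
        if outcome == "msg_lost":
            return False
        # Pj received the message and immediately acknowledged it.
        self.received_by_pj += 1
        return outcome == "ok"


def send_with_retransmit(link, message, max_attempts=5):
    """Pi's side: retransmit after each time-out until acknowledged.

    Returns the number of transmissions used, or None when Pi gives up
    (standing in for being told that site B is down).
    """
    for attempt in range(1, max_attempts + 1):
        if link.transmit(message):
            return attempt
    return None


link = LossyLink(["msg_lost", "ack_lost", "ok"])
print(send_with_retransmit(link, "hello"), link.received_by_pj)
```

In this run Pi needs three transmissions while Pj receives the message twice: when an acknowledgment is lost, Pj cannot tell whether Pi ever learned of the delivery.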



Suppose that Pj also needs to know that Pi has received its acknowledgment message, in order to decide on how to proceed.

– in the presence of failure, it is not possible to accomplish this task

– it is very hard (time consuming) in a distributed environment for processes Pi and Pj to agree completely on their respective states.


7.2. Faulty Processes (Byzantine Generals Problem)

Communication medium is reliable, but processes can fail in unpredictable ways.

Consider a system of n processes, of which no more than m are faulty.

Suppose that each process Pi has some private value V_i.


Devise an algorithm that allows each non-faulty Pi to construct a vector X_i = (A_{i,1}, A_{i,2}, …, A_{i,n}) such that:

– if Pj is a non-faulty process, then A_{i,j} = V_j.

– if Pi and Pj are both non-faulty processes, then X_i = X_j.


Solutions share the following properties.

– a correct algorithm can be devised only if n >= 3m + 1.

– the worst-case delay for reaching agreement is proportional to m + 1 message-passing delays


An algorithm for the case where m = 1 and n = 4 requires two rounds of information exchange:

– each process sends its private value to the other 3 processes

– each process sends the information it has obtained in the first round to all other processes


If a faulty process refuses to send messages, a non-faulty process can choose an arbitrary value and pretend that that value was sent by that process.


After the two rounds are completed, a non-faulty process Pi can construct its vector X_i = (A_{i,1}, A_{i,2}, A_{i,3}, A_{i,4}) as follows:

– A_{i,i} = V_i.

– for j ≠ i, if at least two of the three values reported for process Pj agree, then the majority value is used to set the value of A_{i,j}. Otherwise, a default value (nil) is used.
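The two rounds and the majority rule can be tried out in a short simulation. This is a sketch under assumptions: processes are numbered 0 to 3, the names `byzantine_vectors` and `lie` are hypothetical, at most one process is faulty, and Python's `None` plays the role of nil.

```python
from collections import Counter


def byzantine_vectors(values, faulty, lie):
    """Build X_i for every non-faulty process (m = 1, n = 4).

    values: private values V_0..V_3, indexed by process number.
    faulty: index of the single faulty process, or None.
    lie(sender, receiver, about): what the faulty sender tells
        `receiver` when reporting the value of process `about`.
    """
    n = len(values)

    def said(sender, receiver, about, truth):
        # A non-faulty sender reports the truth; the faulty one may not.
        return lie(sender, receiver, about) if sender == faulty else truth

    # Round 1: direct[i][j] is the value Pi received straight from Pj.
    direct = [
        [values[i] if j == i else said(j, i, j, values[j]) for j in range(n)]
        for i in range(n)
    ]

    vectors = {}
    for i in range(n):
        if i == faulty:
            continue
        x = []
        for j in range(n):
            if j == i:
                x.append(values[i])  # A_{i,i} = V_i
                continue
            # Three reports about Pj: one direct, two relayed in round 2.
            reports = [direct[i][j]]
            for k in range(n):
                if k not in (i, j):
                    reports.append(said(k, i, j, direct[k][j]))
            value, count = Counter(reports).most_common(1)[0]
            x.append(value if count >= 2 else None)  # None stands for nil
        vectors[i] = x
    return vectors


# Process 3 is faulty and tells every receiver something different.
result = byzantine_vectors(
    [10, 20, 30, 40], faulty=3, lie=lambda s, r, a: 100 * r + a
)
print(result)
```

In this run the three non-faulty processes compute identical vectors: the entries for each other are the true private values, and the entry for the faulty process falls back to nil because no two reports about it agree.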
