Chapter 18.3: Distributed Coordination





Chapter 18 Distributed Coordination

Chapter 18.1

Event Ordering

Mutual Exclusion

Atomicity

Chapter 18.2

Concurrency Control

Deadlock Handling

Chapter 18.3

Deadlock Detection – finish up

Election Algorithms – a little bit

Reaching Agreement – a little bit


Chapter Objectives

To present schemes for handling deadlock detection in a distributed system (we have already looked at deadlock prevention and avoidance)

To take a brief look at election algorithms

To take a brief look at Reaching Agreement considerations.


Deadlock Detection


Deadlock Detection

In deadlock prevention, we may implement an algorithm that preempts resources even if no deadlock has occurred.

This is not necessarily good, and we want to avoid unnecessary preemptions wherever possible – and this is a real problem with deadlock prevention...

To help us avoid unnecessary preemptions, we can build a wait-for graph that is used to describe the state of resource allocations.

Remember that we’re considering only a single instance of each resource type; thus, if we have a cycle in our wait-for graph, we are in trouble and have a deadlock.

The wait-for-graph approach is reasonably straightforward; the issue is how to maintain the graph.

Two techniques we consider require each site to keep its own local wait-for graph.

In the wait-for graphs, nodes correspond to processes (local and non-local) currently holding or requesting any resources local to that site.

As we can see in the figure (next page), we have a system consisting of two sites, each maintaining its own local wait-for graph.

Note that P(2) and P(3) appear in both graphs, and this indicates that these processes have requested resources at both sites.

Page 6: Chapter 18.3: Distributed Coordination. 18.2 Silberschatz, Galvin and Gagne ©2005 Operating System Concepts Chapter 18 Distributed Coordination Chapter

18.6 Silberschatz, Galvin and Gagne ©2005Operating System Concepts

Two Local Wait-For Graphs

Both local wait-for graphs are built in the accustomed manner for local processes / resources. When a process P(i) at site S(1) needs a resource held by process P(j) at site S(2), a request message is sent by P(i) to site S(2). The edge P(i) → P(j) is then inserted into the local wait-for graph of site S(2).

Of course, if any local wait-for graph has a cycle, we have deadlock. BUT the fact that there are NO cycles does not mean there are no deadlocks. We must look at a ‘larger picture.’

To show this: note that each graph above is acyclic; nevertheless, a deadlock exists in the system.

To prove that a deadlock has NOT occurred, we must show that the UNION of all local graphs is acyclic. Next slide shows this is not the case…
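To make the idea concrete, here is a minimal sketch (not from the slides; the graph edges are hypothetical, chosen in the spirit of the figure) of detecting deadlock by taking the union of the local wait-for graphs and checking it for a cycle:

```python
# A sketch (not the book's code) of distributed deadlock detection:
# take the union of the per-site wait-for graphs and test it for a cycle.
# Graphs are adjacency dicts: waiting process -> set of processes it waits on.

def union_graphs(local_graphs):
    """Merge all local wait-for graphs into one global wait-for graph."""
    merged = {}
    for graph in local_graphs:
        for waiter, targets in graph.items():
            merged.setdefault(waiter, set()).update(targets)
    return merged

def has_cycle(graph):
    """Depth-first search for a back edge; with a single instance of each
    resource type, a cycle in the wait-for graph means deadlock."""
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {}
    def visit(p):
        color[p] = GRAY
        for q in graph.get(p, ()):
            if color.get(q, WHITE) == GRAY:   # back edge: cycle found
                return True
            if color.get(q, WHITE) == WHITE and visit(q):
                return True
        color[p] = BLACK
        return False
    return any(color.get(p, WHITE) == WHITE and visit(p) for p in graph)

# Hypothetical edges: each local graph is acyclic on its own,
# yet their union contains the cycle P2 -> P3 -> P4 -> P2.
site_s1 = {"P1": {"P2"}, "P2": {"P3"}}
site_s2 = {"P3": {"P4"}, "P4": {"P2"}}
print(has_cycle(site_s1), has_cycle(site_s2))       # False False
print(has_cycle(union_graphs([site_s1, site_s2])))  # True
```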


Global Wait-For Graph

Continuing, when we take the union of the two wait-for graphs, it is clear that we do indeed have a cycle, and this implies that the system is in a deadlocked state.

We have a number of methods to organize the wait-for graph in a distributed system. Some common approaches are centralized approaches and fully distributed approaches.

These approaches are quite detailed, and in the interest of time (and the desire to cover another chapter after this one), we will not go into them in this course. Rather, we will jump to Election Algorithms and Reaching Agreement.


Election Algorithms


Election Algorithms

We have discussed in a number of instances how centralized and fully distributed approaches handle the coordination of transactions.

So, given that we understand the role (and possible distribution) of transaction coordinators, what happens when one such transaction coordinator becomes unavailable?

We must determine where a new copy of the coordinator should be restarted. Hence, enter what are referred to as election algorithms.

These algorithms assume that a unique priority number is associated with each active process in the system; assume also that the priority number of process P(i) is i.

Assume also a one-to-one correspondence between processes and sites.

The coordinator is always the process with the largest priority number. So, when a coordinator fails, the algorithm must elect the active process with the largest priority number.

Then, this number is sent to each active process in the system.

Also, when the former transaction coordinator is restored, it must be able to identify the new transaction coordinator via this algorithm.

Two algorithms are typically used to elect a new coordinator: a bully algorithm and a ring algorithm.


Bully Algorithm (1 of 2)

This algorithm is applicable to systems where every process can send a message to every other process in the system

Given this assumption, if process P(i) sends a request that is not answered by the coordinator within a time interval T, then P(i) assumes that the coordinator has failed; P(i) then acts like a bully and tries to elect itself as the new coordinator.

P(i) sends an election message to every process with a higher priority number, P(j), then waits for any of these processes to answer within some time, T.

If there is no response within T, P(i) assumes that all processes with numbers greater than i have failed; P(i) then elects itself the new coordinator.

If an answer is received, P(i) begins time interval T´, waiting to receive a message that a process with a higher priority number has been elected.

If no message is received within T´, P(i) assumes the process with a higher number has failed; P(i) should restart the algorithm.


Bully Algorithm (Cont.)

If P(i) is not the coordinator, then, at any time during execution, P(i) may receive one of the following two messages from process P(j):

P(j) is the new coordinator (j > i). P(i), in turn, records this information.

P(j) started an election (j > i). P(i) sends a response to P(j) and begins its own election algorithm, provided that P(i) has not already initiated such an election.

The process that completes its algorithm has the highest number and is elected as the coordinator. It will also have sent its number to all active processes with smaller numbers.

After a failed process recovers, it immediately begins execution of the same algorithm – being the bully that it is.

If there are no active processes with higher numbers, the recovered process forces all processes with lower numbers to let it become the coordinator process, even if there is a currently active coordinator with a lower number.

You can go through the detailed example of how these elections occur…
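As one such illustration, here is a hedged sketch of the bully election as a direct simulation (the class and variable names are hypothetical; the real protocol is asynchronous and relies on the time-outs T and T´ described above, whereas liveness is checked directly here):

```python
# A simplified simulation of the bully election, under the assumptions
# named above. Each process has priority number i; the highest-numbered
# live process always ends up as coordinator.

class BullyProcess:
    def __init__(self, pid, system):
        self.pid = pid         # the priority number i of process P(i)
        self.system = system   # shared dict: pid -> BullyProcess
        self.alive = True
        self.coordinator = None

    def start_election(self):
        """Send election messages to all higher-numbered live processes;
        if none answers (none is alive), elect ourselves."""
        higher = [p for pid, p in self.system.items()
                  if pid > self.pid and p.alive]
        if not higher:
            self.become_coordinator()
        else:
            # An answer arrived: a higher-numbered process takes over
            # the election (eventually the highest live one wins).
            max(higher, key=lambda p: p.pid).start_election()

    def become_coordinator(self):
        """Announce the new coordinator's number to every active process."""
        for p in self.system.values():
            if p.alive:
                p.coordinator = self.pid

# Usage: five processes; the coordinator P(5) fails and P(2) notices.
system = {}
for i in range(1, 6):
    system[i] = BullyProcess(i, system)
system[5].alive = False
system[2].start_election()
print(system[1].coordinator)   # 4 -- the highest-numbered live process
```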


Ring Algorithm (1 of 2)

No great surprise here. This election algorithm is based on a ring structure – at least a logical ring, if not a physical one.

Communication proceeds as expected: each process sends its messages to its neighbor on the right.

The active list. The main data structure used by the algorithm is what is called an ‘active list,’ containing the priority numbers of all active processes in the system. Each process maintains its own active list.

If process P(i) detects a coordinator failure, it creates a new, initially empty active list.

It then sends a message elect(i) to its right neighbor and adds the number i to its active list. Note the direction of the communications.


Ring Algorithm (Cont.)

If P(i) receives a message elect(j) from the process on the left, it must respond in one of three ways:

1. If this is the first elect message it has seen or sent, P(i) creates a new active list with the numbers i and j.

It then sends the message elect(i), followed by the message elect(j).

2. If i ≠ j, then P(i) adds j to its active list and forwards the message to its right neighbor.

3. If i = j, then P(i) has received its own message elect(i) back, and its active list now contains the numbers of all the active processes in the system.

P(i) can now easily determine the largest number in the active list to identify the new coordinator process.
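A small sketch may help here as well. This is a deliberately collapsed simulation (the function name is hypothetical; real processes each keep their own active list and pass elect(j) messages one hop at a time), showing why the starter learns the full membership once elect(i) returns to it:

```python
# One full trip around the ring collects every live process number,
# after which the largest one is the new coordinator.

def ring_election(ring, starter):
    """ring: live process numbers in ring order (left to right).
    starter: index of the process P(i) that detected the failure.
    Returns the new coordinator's number."""
    n = len(ring)
    active_list = [ring[starter]]   # P(i) adds its own number i
    pos = (starter + 1) % n         # elect(i) goes to the right neighbor
    while ring[pos] != ring[starter]:
        # Case i != j: the receiver joins in and the message moves on.
        active_list.append(ring[pos])
        pos = (pos + 1) % n
    # Case i = j: elect(i) has come all the way around; the active list
    # is complete, and the largest number wins.
    return max(active_list)

# Usage: live processes P3 -> P6 -> P1 -> P4 arranged in a ring;
# P6 (index 1) detects the coordinator failure and starts the election.
print(ring_election([3, 6, 1, 4], starter=1))   # 6
```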


Reaching Agreement


Reaching Agreement (directly from book)

Normally, application processes wish to agree on a common “value.”

Such agreement, however, may not take place, due to:

A faulty communication medium, which might result in lost or garbled messages, or

Faulty processes

Processes themselves may send garbled or otherwise incorrect messages to other processes.

Processes can also be flawed in other ways, resulting in unpredictable process behavior.

In short, we can have a mess.

We can ‘hope’ that processes fail in a clean manner, but processes can fail miserably and send garbled / incorrect messages to other processes, or even collaborate with other failed processes in an attempt to destroy the integrity of the system.

So let’s look more closely at reaching agreement:


Reaching Agreement – Unreliable Communications

Approach 1: assume that processes fail in a clean manner but that the data communications medium is unreliable.

So let’s assume that some process P(i) at site S(1), which has sent a message to process P(j) at site S(2), needs to know whether P(j) has received the message, so that it can decide how to proceed with, say, its computation.

For example, P(i) may decide to compute a function foo if P(j) has received its message, or to compute a function boo if P(j) has not received the message (because of some hardware failure).

We can use a time-out scheme similar to the one described earlier to detect failures. To implement this, when P(i) sends out a message, it also specifies a time interval during which it is willing to wait for an acknowledgment message from P(j).

When P(j) receives the message, it immediately sends an acknowledgment to P(i). If P(i) receives the acknowledgment message within the specified time interval, it can safely conclude that P(j) has received its message. If, however, the interval expires without an acknowledgment, P(i) retransmits its message and waits for an acknowledgment again. Once the outcome is known, P(i) knows whether to execute foo or boo.

This procedure continues until P(i) either gets the acknowledgment message back or is notified by the system that site S(2) is down. Note that, if these are the only two viable alternatives, P(i) must wait until it has been notified that one of the situations has occurred.
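A brief sketch of this retransmission loop, under assumed names (unreliable_send and site_is_down are hypothetical stand-ins for the real message layer and the system's failure notification):

```python
import random

def unreliable_send(msg):
    """Hypothetical lossy link: the message (or its acknowledgment) gets
    through with some probability; P(j) acks immediately on delivery."""
    return random.random() < 0.7   # True iff the ack came back within T

def send_until_acked(msg, site_is_down):
    """P(i)'s loop: retransmit until an ack arrives or the system reports
    that site S(2) is down; the result selects foo versus boo."""
    while not site_is_down():
        if unreliable_send(msg):   # ack received within T:
            return "foo"           # P(j) definitely got the message
        # Time-out: P(i) cannot tell a lost message from a lost ack,
        # so it simply retransmits and waits for an acknowledgment again.
    return "boo"                   # S(2) is down; P(j) never got the message

print(send_until_acked("m", site_is_down=lambda: False))  # eventually "foo"
```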


Reaching Agreement – Unreliable Communications

Suppose now that P(j) also needs to know that P(i) has received its acknowledgment message, so that it can decide how to proceed with its computation.

For example, P(j) may want to compute foo only if it is assured that P(i) got its acknowledgment.

In other words, P(i) and P(j) will compute foo if and only if both have agreed on it. It turns out that, in the presence of failure, it is not possible to accomplish this task.

More precisely, it is not possible in a distributed environment for processes P(i) and P(j) to agree completely on their respective states.


Reaching Agreement – Unreliable Communications

To prove this claim, let us suppose that a minimal sequence of message transfers exists such that, after the messages have been delivered, both processes agree to compute foo.

Let m’ be the last message sent by P(i) to P(j). Since P(i) does not know whether its message will arrive at P(j) (the message may be lost due to a failure), P(i) will execute foo regardless of the outcome of the message delivery.

Thus, the message m’ could be removed from the communications sequence without affecting the decision procedure.

Hence, the original sequence was not minimal, contradicting our assumption and showing that no such sequence exists (a proof by contradiction).

The processes can never be sure that both will compute foo.
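A compact restatement of the argument, with the sequence σ and its last message m_k introduced here purely for illustration (this notation is not from the slides):

\[
\begin{aligned}
&\text{Assume a minimal sequence } \sigma = \langle m_1,\dots,m_k\rangle \text{ after which } P(i) \text{ and } P(j) \text{ both compute } foo.\\
&\text{The sender of } m_k \text{ cannot know whether } m_k \text{ arrives, so it computes } foo \text{ either way;}\\
&\text{for agreement to hold, the receiver's decision cannot depend on } m_k \text{ either.}\\
&\Rightarrow\ \langle m_1,\dots,m_{k-1}\rangle \text{ also yields agreement on } foo,\ \text{contradicting the minimality of } \sigma.
\end{aligned}
\]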


End of Chapter 18.3