Fault tolerance and related issues in distributed computing Shmuel Zaks GSSI - Feb 2016242

Embed Size (px)

DESCRIPTION

GSSI - Feb shared memory, synchronous x x x x x x x x x x clock synchronization A. An example: Unison

Citation preview

Fault tolerance and related issues in distributed computing Shmuel Zaks GSSI - Feb Part 0: Part 0: An overview Part 1: Part 1: Lower bounds Part 2: Part 2: Computing in spite of faults Part 3: Part 3: Detecting faults Part 4: Part 4: Self-stabilization GSSI - Feb GSSI - Feb shared memory, synchronous x x x x x x x x x x clock synchronization A. An example: Unison Program at each processor p : choose a neighbor x of p ; if t(p) = t(x) then t(p):= t(p)+1; GSSI - Feb Stabilizing if all start with the same time, however GSSI - Feb Is it stabilizing? no! (may even deadlock) if t(p) = t(x) then t(p):= t(p)+1; Program at each processor p : choose a neighbor x of p ; if t(p) = t(x) then t(p):= t(p)+1; GSSI - Feb Is it stabilizing? if t(p) < t(x) then t(p):= t(x); GSSI - Feb x x x x x 8 8 x x 8 x 8 x x 8 Is it stabilizing?yes! if t(p) = t(x) then t(p):= t(p)+1; if t(p) < t(x) then t(p):= t(x); GSSI - Feb x x x x x 9 9 x x 8 x 8 x x 8 Is it stabilizing?no! if t(p) = t(x) then t(p):= t(p)+1; if t(p) < t(x) then t(p):= t(x); GSSI - Feb Program at each processor p : choose a neighbor x of p if t(p) = t(x) then t(p):= t(p)+1; if t(p) < t(x) then t(p):= t(x); fairly; Is it stabilizing? yes! GSSI - Feb proof GSSI - Feb n=5, range(S)=7-4=3, top(S)=2 S: GSSI - Feb n=5, range(S 1 )=7-7=0, top(S 1 )=5 S1:S1: x x x x x GSSI - Feb n=5, range(S 2 )=8-7=1, top(S 2 )=2 S2:S2: x x x x x GSSI - Feb n=5, range(S 3 )=9-8=1, top(S 3 )=2 S3:S3: x x x x x x x x x x GSSI - Feb Use for: correctness (especially termination) + complexity S : a special kind of fault tolerance guarantees an automatic recovery failures Started by Dijkstra. Introduced with the problem of mutual exclusion on a ring of processors Simple case: Distributed system. Steps taken one at a time, determined by some underlying mechanism. GSSI - Feb b. Self stabilization 258 Bidirectional Ring Shared Memory: x i Configuration: (x 0, , x n1 ) p0p0 p1p1 x n1 p i1... xixi x i+1 x1x1 x0x0 x i1 pipi p i+1 p n1 GSSI - Feb 2016 259 Not legitimate configuration = many tokens p0p0 p1p1 x n1 p i1... xixi x i+1 x1x1 x0x0 x i1 pipi p i+1 p n1 GSSI - Feb 2016 260 Legitimate configuration = one token p0p0 p1p1 x n1 p i1... xixi x i+1 x1x1 x0x0 x i1 pipi p i+1 p n1 Stabilized ! GSSI - Feb 2016 261 p0p0 p1p1 x n1 p i1... xixi x i+1 x1x1 x0x0 x i1 pipi p i+1 p n1 IF condition THEN change_state END. p is privileged condition = TRUE p holds GSSI - Feb 2016 A new approach to fault tolerance A unified approach to formally incorporate failures into the design model GSSI - Feb Applications Distributed data structures Digital and analog circuits Genetic algorithms Network protocols Sensor networks GSSI - Feb Some general comments on self stabilization 265 S t a b i l i z e d ! Self-stabilizing System Any configuration Correct output Legitimate configuration Noise Valid execution GSSI - Feb 2016 266 E W Dijkstra. Self-stabilization in Spite of Distributed Control, 1973 Again I beg my intrigued readers to stop reading here and to try to solve the stated problem themselves, for only then they will (slowly!) build up some sympathy with my difficulties: the problem has been with me for many months, while I was oscillating between trying to find a solution and many an at first sight plausible construction turned out to be wrong! and trying to prove the non-existence of a solution. And all the time I had no indication in which of the two directions to aim, nor of the simplicity or complexity of argument if any! that would settle the question. GSSI - Feb 2016 267 E W Dijkstra. Self-stabilizing Systems in Spite of Distributed Control. Communications of the ACM, 1974 For brevitys sake most of the heuristics that lead me to find them and the proofs that they satisfy the requirements have been omitted and to quote Douglas T.Rosss comment on an earlier draft the appreciation is left as an exercise for the reader. GSSI - Feb 2016 268 That same summer I designed the first so-called "self- stabilizing systems". It was a new problem of which no one knew whether it could be solved at all, and the solutions were at the time a tour de force. I submitted a short article to the Communications of the ACM, which published it in 1974; it was ignored until Leslie Lamport drew attention to it in 1984, and since then it has added a new dimension to ways of dealing with error recovery. E W Dijkstra From My Life, EWD1166, 1993 GSSI - Feb 2016 269 E W Dijkstra. A Belated Proof of Self-stabilization. Distributed Computing, 1986 GSSI - Feb 2016 Dijkstras 1 st algorithm N processors 0,1,,N-1 Solution with k-state Machines machine P 0 if x[0] = x[N-1] then x[0] := x[0]+1 mod k machine P i ( i= 1,2,,N-1) if x[i] x[i-1] then x[i] := x[i-1] GSSI - Feb Q: Do we need distint machines? Uniform protocol : same program at each processor Theorem: There is no uniform protocol that solves the mutual exclusion problem (under either a centralized or a distributed scheduler) GSSI - Feb GSSI - Feb p0p0 p3p3 p1p1 p2p2 0 0 p6p6 p5p Uniform protocol : same program at each processor Theorem: There is no uniform protocol that solves the mutual exclusion problem (under either a centralized or a distributed scheduler). However, if n is a prime, it is possible to solve the problem under a centralized scheduler. GSSI - Feb distributed control ring of (finite-state machine) processes for each machine define privileges (when a transition is enabled) nodes are influenced only by neighbors legitimate states in each legitimate state there is at least one privileged machine in each legitimate state each possible move will bring the system again to a legitimate state GSSI - Feb c. Mutual exclusion Dijkstras 1 st algorithms A system is self-stabilizing if, regardless of the initial state and regardless of the privilege selected each time for the next move, at least one privilege will always be present and the system is guaranteed to find itself in a legitimate state after a finite number of moves. GSSI - Feb GSSI - Feb non-stable states stable states Finite number of transitions Token ring. N machines, denoted 0 through N-1. A machine is priviliged (has token) or not priviliged. daemon/scheduler determines which of the priviliged machines will make the next move Legitimate States: all states with exactly one privileged machine. GSSI - Feb In Dijkstras model: Notes to consider: scheduler (daemon) : points at each time to one of the privileged machines, or point at each time to one of the machines. centralized scheduler, distributed scheduler fairness: Do we need fairness, and of what kind? Unconditional Fairness: Each machine gets pointed to infinitely often. Type of fairness: - each machine is pointed at infinitely often, - round robin, or - other. GSSI - Feb Dijkstras 1 st algorithm N processors 0,1,,N-1 Solution with k-state Machines machine P 0 if x[0] = x[N-1] then x[0] := x[0]+1 mod k machine P i ( i= 1,2,,N-1) if x[i] x[i-1] then x[i] := x[i-1] GSSI - Feb GSSI - Feb p0p0 p3p3 p1p1 p2p2 N=4, K=3 State : 0,1,2 GSSI - Feb p0p0 p3p3 p1p1 p2p2 machine P 0 : if x[0] = x[N-1] then x[0] := x[0]+1 mod k GSSI - Feb p0p0 p3p3 p1p1 p2p2 machine P i : if x[i] x[i-1] then x[i] := x[i-1] GSSI - Feb p0p0 p3p3 p1p1 p2p2 machine P i : if x[i] x[i-1] then x[i] := x[i-1] machine P 0 : if x[0] = x[N-1] then x[0] := x[0]+1 mod k So far: GSSI - Feb p0p0 p3p3 p1p1 p2p2 machine P i : if x[i] x[i-1] then x[i] := x[i-1] machine P 0 : if x[0] = x[N-1] then x[0] := x[0]+1 mod k K=3 Etc Example 1 GSSI - Feb p0p0 p3p3 p1p1 p2p2 machine P i : if x[i] x[i-1] then x[i] := x[i-1] machine P 0 : if x[0] = x[N-1] then x[0] := x[0]+1 mod k K= Etc Example 2 GSSI - Feb p0p0 p3p3 p1p1 p2p2 machine P i : if x[i] x[i-1] then x[i] := x[i-1] machine P 0 : if x[0] = x[N-1] then x[0] := x[0]+1 mod k K= Etc Example 3 GSSI - Feb Informally: If machine is privileged, then make it non-privileged. Note: machine P i : if x[i] x[i-1] then x[i] := x[i-1] machine P 0 : if x[0] = x[N-1] then x[0] := x[0]+1 mod k GSSI - Feb Legitimate States: exactly one privileged machine p0p0 p3p3 p1p1 p2p K=2 1 1 machine P i : if x[i] x[i-1] then x[i] := x[i-1] machine P 0 : if x[0] = x[N-1] then x[0] := x[0]+1 mod k 0 Legitimate States GSSI - Feb K1 K1 K1 K-1 K1 K1 K K1 K1 K1 K1 K1 GSSI - Feb Goal: Show that from an arbitrary state, the system stabilizes to a legitimate state in a finite number of steps. Intuition: N=5, K = 5 N=5, K = 4 GSSI - Feb N=5, K = 3 hw: can you get to this configuration? hw: which configurations have the same property? (Check also for all other examples.) GSSI - Feb Theorem: Assuming a centralized scheduler: starting with any initial configuration, Dijkstras 1st algorithm stabilizes iff K > N-2. Lemma 2: (liveness) At every configuration, at least one machine is privileged. GSSI - Feb Lemma 3: Starting from any configuration, P 0 will eventually make a move. Corollary: Starting at any configuration, P 0 will make an infinite number of steps. (this still does not guarantee stabilization.) machine P i : if x[i] x[i-1] then x[i] := x[i-1] machine P 0 : if x[0] = x[N-1] then x[0] := x[0]+1 mod k Lemma 1: If all states are equal then the system is stabilized. Definition: P i is called special in a configuration if GSSI - Feb Lemma 4: If P 0 is special in a configuration, then the system will eventually stabilize. Lemma 5: If in a configuration one of the states is missing, then the system will eventually stabilize. machine P i : if x[i] x[i-1] then x[i] := x[i-1] machine P 0 : if x[0] = x[N-1] then x[0] := x[0]+1 mod k GSSI - Feb Theorem: Assuming a centralized scheduler: starting with any initial configuration, Dijkstras 1st algorithm stabilizes iff K > N-2. Proof: Case 1: K< N-1 Case 2: K = N-1 Case 3: K = N Case 4: K > N GSSI - Feb Theorem: Assuming a centralized scheduler: starting with any initial configuration, Dijkstras 1st algorithm stabilizes iff K > N-2. Proof: (e.g., N=4, k=2) Case 1. k < N-1 GSSI - Feb Theorem: Assuming a centralized scheduler: starting with any initial configuration, Dijkstras 1st algorithm stabilizes iff K > N-2. Proof: Case 4. k > N Lemma 5 GSSI - Feb Theorem: Assuming a centralized scheduler: starting with any initial configuration, Dijkstras 1st algorithm stabilizes iff K > N-2. Proof: Case 3. k = N Lemma 5 Either one state is missing, or all states are distinct. Lemma 4 GSSI - Feb Theorem: Assuming a centralized scheduler: starting with any initial configuration, Dijkstras 1st algorithm stabilizes iff K > N-2. Proof: Case 2. k = N - 1 Lemma 5 Either one state is missing, or one state occurs twice. GSSI - Feb Lemma 4Either P 0 is special, or x[0] = x[i] a. i = 1 b. 1 N-1. hw: prove GSSI - Feb S = a configuration (a system state) r(S) = number of maximal runs of equal states Example: S = , f(S) = 4 r(S) + 1 if first(S) = last(S) f(S) = r(S) otherwise. Note: S is legitimate iff f(S) = 2. Use of a potential function Show: If K > N and f(S) > 2, then f(S) never increases and is eventually decreased. Recall: A common technique to prove termination and correctness GSSI - Feb Hw: 1. finish the proof. 2. use f to analyze complexity N processors 0,1,,N-1 Solution with 3-state Machines P 0: if x[0] +1=x[1] then x[0]:=x[0]-1 P N-1 : if (x[N-2] =x[0]) (x[N-1] x[0]+1) then x[N-1]:=x[0]+1; P i ( i= 1,2,,N-2) : if (x[i]+1=x[i-1]) (x[i]+1=x[i+1]) then x[i] := x[i]+1; GSSI - Feb d. Mutual exclusion Dijkstras 3 rd algorithms GSSI - Feb p0p0 p4p4 p1p1 p2p2 State : 0,1,2 p3p3 0 P 0: if x[0] +1=x[1] then x[0]:=x[0]-1 GSSI - Feb p0p0 p4p4 p1p1 p2p2 p3p3 0 P N-1 : if (x[N-2] =x[0]) (x[N-1] x[0]+1) then x[N-1]:=x[0]+1; GSSI - Feb p0p0 p4p4 p1p1 p2p2 p3p3 0 P i ( i= 1,2,,N-2) : if (x[i]+1=x[i-1]) (x[i]+1=x[i+1]) then x[i] := x[i]+1; GSSI - Feb p0p0 p4p4 p1p1 p2p2 p3p3 0 P i ( i= 1,2,,N-2) : if (x[i]+1=x[i-1]) (x[i]+1=x[i+1]) then x[i] := x[i]+1; GSSI - Feb p0p0 p4p4 p1p1 p2p2 p3p3 0 P i ( i= 1,2,,N-2) : if (x[i]+1=x[i-1]) (x[i]+1=x[i+1]) then x[i] := x[i]+1; GSSI - Feb p0p0 p4p4 p1p1 p2p2 p3p3 0 GSSI - Feb p0p0 p4p4 p1p1 p2p2 p3p3 0 GSSI - Feb p0p0 p4p4 p1p1 p2p2 p3p3 1 If all are in the same state? if (x[N-2] =x[0]) & (x[N-1] x[0]+1) then x[N-1]:=x[0]+1; if (x[i]+1=x[i-1]) v (x[i]+1=x[i+1]) then x[i] := x[i]+1; 2 2 if x[0] +1=x[1] then x[0]:=x[0] GSSI - Feb For two neighbors GSSI - Feb GSSI - Feb s: f(s) = number of + 2 x number of f(s) = x 1 = 6 GSSI - Feb Lemma 1: If there is a unique arrow in a configuration, then it moves back and forth infinitely. GSSI - Feb Corollary: It suffices to show that from any configuration we eventually reach a configuration with a unique arrow. Lemma 2: (liveness) Starting from any initial configuration, at least one process is priveleged. Lemma 3: Starting from any configuration, eventually. GSSI - Feb Proof: if during an execution -. else - f=0, but then there is no arrow, and in the next step there is a unique arrow, and then apply Lemma 1. GSSI - Feb Lemma 4: Between every two successive moves of P N-1 there is at least one move of P 0. Proof: recall: the program for P N-1 : if (x[N-2] =x[0]) (x[N-1] x[0]+1) then x[N-1]:=x[0]+1; After P N-1 makes a step x[N-1]=x[0]+1, And this condition makes P N-1 non-privileged. This condition can become true only after P 0 makes a step [ P 0: if x[0] +1=x[1] then x[0]:=x[0]-1 ]. GSSI - Feb Lemma 5: A sequence of moves in which P 0 does not move is finite. (therefore impossible by Lemma 2) Proof: by Lemma 4 in such a sequence P N-1 makes at most one step. so, from some point only P i makes a steps,. steps 1,2,3,4,5. 1,2 no change in arrows 3,4,5 decrease no. of arrows.. GSSI - Feb Lemma 5: A sequence of moves in which P 0 does not move is finite. (therefore impossible by Lemma 2) Proof (cont.): from some point only steps 1 and 2. Follows from topology. Theorem: Starting from any initial configuration, the system eventually stabilizes. GSSI - Feb Proof: by Lemma 5 P 0 makes infinite no. of moves. P 0 P 0 P 0 P 0 P 0 GSSI - Feb Proof (cont.) P 0 P 0 P 0 P 0 P 0 phases there exists a rightmost arrow, and it points to the right Falsified in each phase. P 0 P 0 P 0 P 0 P 0 F T F T F T F T F T GSSI - Feb This is done only in steps 3, 4, ok so, assume only 3 and 4. in each phase: they have f=-3. 0 and 7 f