Reaching Consensus: Why it can’t be done

For Distributed Algorithms 2014. Presentation by Ziv Ronen

Based on “Impossibility of Distributed Consensus with One Faulty Process” by Michael J. Fischer, Nancy A. Lynch and Michael S. Paterson

2

Main Menu

• The problem
• Why the problem is unsolvable
• If time allows: how to solve the problem with initially faulty processors

3

The Problem:
• Consensus in the real world
• Our mission
• Model:
– Objectives
– Network
– Possible faults

4

Consensus in the real world
• There are many cases in which we want several processors to agree on an action.
• Usually, it is more important that all processors agree on the same action than which action is chosen.
• For example, in a database, we want every transaction to be committed by all processors or by none of them.

5

Consensus in the real world (cont.)

• Such agreement in a fault-free network is trivial.
– For instance, we can choose a leader that tells all the others what to do.
• However, real-world processors are subject to failures:
– They might stop working (good case).
– They might go haywire (bad case).
– They might become malevolent (worse case).

6

Our mission

• We want to find an algorithm that, for any decision in every network, chooses a single action to perform.

• However, we require that there are at least two options, and that each of them can actually be chosen.

7

Our Model - objectives
• We will work on a simplified problem, in which the processors only need to agree on a number that is either 1 (commit) or 0 (discard).
• Initially, each processor chooses its initial number randomly (simulating decisions based on the system condition).
– 1 if it can commit, 0 if it can’t.
• Each processor needs to choose an action. Once the action is chosen, it can’t be undone.
• In the end, all the processors need to agree on the action, meaning they all choose 1 or all choose 0.

8

Our Model – objectives (cont.)

• We require that the algorithm can return both 1 and 0 (possibly in different cases).
– So “always discard” or “always commit” is not a possible policy for our database.

9

Our Model – Network

• We will assume a fully asynchronous network.
– If we send a message to a non-faulty processor, it will reach it after a finite but unbounded time.
• We will also assume the network is fully connected. For generality we will also assume full knowledge of direction
– so any other topology can be simulated.

10

Why asynchronous? If the processors work:

[Figure: timelines of P1 and P2 with message M2, once in a synchronous network (with clock ticks) and once in an asynchronous network.]

11

Why asynchronous? But if one fails…

[Figure: the synchronous (with clock ticks) and asynchronous timelines of P1 and P2 with message M2, annotated “P2 is faulty!”.]

12

Our Model – Possible faults

• We will assume that processors can fail only by stopping entirely.

• We will also assume that at most a single processor can fail in any given run.

• However, we will assume that:
– Other processors can’t tell that a processor has stopped working.
– A processor can fail at any time.

13

Our Model - more formally
• N ≥ 2 processors.
• For each processor p:
– Input value xp ∈ {0,1}, part of the problem input.
– Output value yp ∈ {0,1,b}, initially b; it can change only once.
– Infinite storage.
• Messages are of the form (p,m), where p is the target processor and m is the message content. Any processor can send such a message to any other processor.
• We will assume that every message stays in a “messages buffer” between the time it is sent and the time it is received.
– Initially, the buffer is empty.
• Goal: at the end, for every p1, p2: yp1 = yp2 ≠ b
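The model above can be written down as a small data structure. This is only a sketch under assumptions: the class names `Processor` and `Configuration`, the string "b" for the blank output, and the use of a multiset for the messages buffer are illustrative choices, not part of the original presentation.

```python
from collections import Counter
from dataclasses import dataclass, field

UNDECIDED = "b"  # the slides' blank output value


@dataclass
class Processor:
    x: int                 # input value, 0 or 1
    y: object = UNDECIDED  # output value: "b" until decided, then 0 or 1

    def decide(self, value):
        # The output can change only once.
        if self.y != UNDECIDED:
            raise ValueError("output already decided")
        if value not in (0, 1):
            raise ValueError("decision must be 0 or 1")
        self.y = value


@dataclass
class Configuration:
    processors: dict                                   # processor id -> Processor
    buffer: Counter = field(default_factory=Counter)   # multiset of (target, message)

    def send(self, target, message):
        # A sent message waits in the buffer until it is received.
        self.buffer[(target, message)] += 1

    def goal_reached(self):
        # Goal: all outputs are equal and none of them is blank.
        outputs = {p.y for p in self.processors.values()}
        return len(outputs) == 1 and UNDECIDED not in outputs


# Four processors with inputs 1, 0, 1, 0 and one pending message.
config = Configuration({i: Processor(x) for i, x in enumerate([1, 0, 1, 0], start=1)})
config.send(2, "m1")
```

The write-once check in `decide` mirrors the requirement that a chosen action cannot be undone.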

14

Our model – example, initial state

Processor 1: X1=1, Y1=b
Processor 2: X2=0, Y2=b
Processor 3: X3=1, Y3=b
Processor 4: X4=0, Y4=b
Messages buffer: (empty)

15

Our model – example, different state

Processor 1: X1=1, Y1=b
Processor 2: X2=0, Y2=0
Processor 3: X3=1, Y3=0
Processor 4: X4=0, Y4=b
Messages buffer: (2,m1), (4,m2), (4,m3), (2,m2), (2,m3)

16

Our model – example, final state

Processor 1: X1=1, Y1=0
Processor 2: X2=0, Y2=0
Processor 3: X3=1, Y3=0
Processor 4: X4=0, Y4=0
Messages buffer: (2,m1), (4,m2), (2,m3)

17

Why Consensus is impossible:

• Intuition
• Proof
– Definitions
– Lemma 1
– Lemma 2
– Lemma 3

18

Intuition

• Let us show the intuition for why this is an impossible task.

• I will demonstrate on the problem of database consensus.
– All the databases should have output value 1 if all working databases have input value 1.
– All the databases should have output value 0 if at least one working database has input value 0.
– In this case, working means not failing at the beginning of the algorithm.

19

Initial state

• We will choose an initial state where both results are possible.

• In our case, if processor 1 failed during the algorithm, the result might be 1.

• Otherwise, the result should be 0.

Processor 1: X1=0, Y1=b
Processor 2: X2=1, Y2=b
Processor 3: X3=1, Y3=b
Processor 4: X4=1, Y4=b

20

case 1:

• If 1 sent its first message:

• All processors know that it can’t commit.

• The algorithm should decide 0.

[Figure: processor 1 (X1=0, Y1=b) sends “I failed to commit” to processors 2, 3 and 4 (each with Xi=1, Yi=b); the decision is 0.]

21

case 2:

• If 1 failed before sending this message, the algorithm should decide without it.

• Since all the other processors can commit, the algorithm should decide 1.

[Figure: processor 1 (X1=0, Y1=b) has failed; processors 2, 3 and 4 (each with Xi=1, Yi=b) decide 1.]

22

Quasi failure

• Let us say that a processor “quasi-failed” if:
– It may be alive or dead.
– If it is alive, it will execute its next step only after the algorithm “finished” without it.

[Figure: processor 1 (X1=0, Y1=b), asleep (“ZZ”).]

23

Quasi failure - Intuition

[Figure: processor 1 (X1=0, Y1=b) drawn twice, captioned “Schrödinger's cat processor”.]

24

Quasi failure – our example

• If 1 quasi-failed:
• The algorithm has 3 choices:

[Figure: processor 1 (X1=0, Y1=b) asleep (“ZZ”); processors 2, 3 and 4 each have Xi=1, Yi=b.]

25

Quasi failure choices (1/3)

• Decide 0.
• In this case, if processor 1 actually failed:
• The result will be wrong!

[Figure: processor 1 (X1=0, Y1=b) asleep (“ZZ”); processors 2, 3 and 4 (each with Xi=1, Yi=b) decide 0.]

26

Quasi failure choices (2/3)

• Decide 1.
• In this case, if the processor wakes up:
• The result will be wrong!

[Figure: processor 1 (X1=0, Y1=b) asleep (“ZZ”); processors 2, 3 and 4 (each with Xi=1, Yi=b) decide 1.]

27

Quasi failure choices(3/3)

• Not deciding.
• In this case, if the processor actually failed:
• The algorithm will never decide.

[Figure: processor 1 (X1=0, Y1=b) asleep (“ZZ”); processors 2, 3 and 4 (each with Xi=1, Yi=b) have no decision (“?”).]

28

Intuition – summary
• There is an initial state where both answers are possible (Lemma 2).
• There is an event in a specific processor (in our case, processor 1 starting to work and sending its message) whose occurrence, no matter when (Lemma 1), determines the outcome.
• If a processor quasi-fails, we can’t decide (because the answer depends on whether it actually failed, and we can’t know that).
• If we do not decide, then we will reach another one of those states (Lemma 3) and be stuck forever.

29

Intuition – summary(cont.)

• Remember that in the example, we forced them to agree according to some policy. In the real problem (and in the following proof) we just need them to agree on the same value, no matter which.

30

Proof – definitions (1/6)

• Configuration: the combination of the internal state (input, output, memory) of each processor and the messages in the buffer.

• Step: an action of one processor. For processor p, it consists of:
– Trying to receive a message (removing it from the messages buffer). If it succeeds, it receives (p,m). If it fails, it receives (p,∅).
– Conducting a computation. It may send any finite number of messages.

31

Configuration and step

[Figure: processors 1 (X1=1, Y1=b), 2 (X2=0, Y2 changing from b to 1), 3 and 4, with message (2,m1) in the messages buffer, shown across Step 1 and Step 2.]

32


Proof – definitions (2/6)
• Event e=(p,m): the receiving of message m by p.
– Since our processors are deterministic, the change to the configuration made by a step depends only on the received message.
– The event e=(p,∅) is always possible for any p.
• e(C): the configuration reached from C by the event e.
• Schedule: a finite or infinite sequence σ of events.
– σ(C): the configuration reached by applying a finite schedule σ starting from configuration C.
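Events and schedules can be sketched as functions on configurations. The code below is a minimal illustration, not the presenter's implementation: a configuration is modelled as a pair (states, buffer), `None` stands for the null message ∅, and `toy_transition` is a hypothetical deterministic protocol invented only for this example.

```python
from collections import Counter


def apply_event(config, event, transition):
    """Return e(C): processor p receives m (m=None is the null message)."""
    states, buffer = config
    p, m = event
    buffer = Counter(buffer)
    if m is not None:
        if buffer[(p, m)] == 0:
            raise ValueError("event not applicable: message not in buffer")
        buffer[(p, m)] -= 1
        if buffer[(p, m)] == 0:
            del buffer[(p, m)]
    # Deterministic step: the new state and the sent messages depend
    # only on the processor's state and the received message.
    new_state, sent = transition(p, states[p], m)
    states = dict(states)
    states[p] = new_state
    for target, message in sent:
        buffer[(target, message)] += 1
    return states, buffer


def apply_schedule(config, schedule, transition):
    """Return sigma(C) for a finite schedule sigma."""
    for event in schedule:
        config = apply_event(config, event, transition)
    return config


def toy_transition(p, state, m):
    # Hypothetical protocol: state = (messages received, has started).
    # Processor 1 sends "m1" to processor 2 on its first null step.
    received, started = state
    if m is None and p == 1 and not started:
        return (received, True), [(2, "m1")]
    if m is not None:
        return (received + 1, started), []
    return (received, True), []


C0 = ({1: (0, False), 2: (0, False)}, Counter())
sigma = [(1, None), (2, "m1")]
C = apply_schedule(C0, sigma, toy_transition)
```

Here `sigma` has the same shape as the schedule σ = ((1,∅),(2,m1)) used in the slides: processor 1 takes a null step, then processor 2 receives m1.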

33

Event and sequences

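[Figure: processors 1 (X1=1, Y1=b), 2 (X2=0, Y2 changing from b to 1), 3 and 4, with message (2,m1) in the messages buffer. The two steps are labeled with the events (1,∅) and (2,m1); the schedule is σ = ((1,∅),(2,m1)).]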

34

Proof – definitions (3/6)
• Reachable: configuration C is reachable from C’ if there exists a schedule σ such that σ(C’) = C.
• Accessible configuration: configuration C is accessible if there exists an initial configuration C’ such that C is reachable from C’.
• DV(C): the set {v | v ≠ b and ∃p: v = yp}, i.e. the values that were chosen by some processor.
• A protocol is partially correct if:
– For every accessible configuration C, |DV(C)| ≤ 1.
– There exist two accessible configurations C, C’ such that DV(C) = {0} and DV(C’) = {1}.

35

Partial correctness

[Figure: processors 1 (X1=1) and 2 (X2=0) in three configurations: both undecided, DV(C)={}; Y1=0 only, DV(C)={0}; Y1=0 and Y2=1, DV(C)={0,1}.]

36

Proof – definitions (4/6)
• Nonfaulty: a processor is nonfaulty if it takes an infinite number of steps.
• Faulty: a processor that is not nonfaulty (it stops taking steps after some time).
• Admissible: a run is admissible if it contains at most one faulty processor and the messages buffer is fair (every message sent to a nonfaulty processor is eventually received).
• Deciding: a run is deciding if eventually, for some processor p, yp ≠ b.
• A protocol P is totally correct in spite of one fault if:
– P is partially correct.
– Every admissible run of P is a deciding run.

37

Main Theorem

• No consensus protocol is totally correct in spite of one fault

• We will assume the contrary: assume protocol P’ is totally correct in spite of one fault

38

Lemma 1

• For any two disjoint finite schedules σ1, σ2 that are both applicable to a configuration C: σ1(σ2(C)) = σ2(σ1(C)).
– Disjoint: involving different processors.

• Proof:
– From the system definition, since σ1 and σ2 don’t interact.
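Lemma 1 can be checked on a small simulated system. The sketch below assumes a made-up protocol (each processor logs what it receives and sends an acknowledgement to the next processor); the message names follow the figure on the next slide, but the protocol itself and the function names are hypothetical.

```python
from collections import Counter


def step(states, buffer, event):
    """Apply one event (p, m): p receives m, logs it, and acks the next processor."""
    p, m = event
    states, buffer = dict(states), Counter(buffer)
    if m is not None:
        if buffer[(p, m)] <= 0:
            raise ValueError("message must be in the buffer")
        buffer[(p, m)] -= 1
    states[p] = states[p] + (m,)
    buffer[(p % len(states) + 1, f"ack:{m}")] += 1
    return states, +buffer  # unary plus drops entries whose count is zero


def run(states, buffer, schedule):
    """Apply a finite schedule, event by event."""
    for event in schedule:
        states, buffer = step(states, buffer, event)
    return states, buffer


states = {1: (), 2: (), 3: (), 4: ()}
buffer = Counter({(2, "m1"): 1, (1, "m2"): 1, (1, "m3"): 1, (4, "m4"): 1, (4, "m5"): 1})

sigma1 = [(1, "m2"), (2, "m1"), (1, "m3")]  # involves processors 1 and 2 only
sigma2 = [(4, "m4"), (4, "m5")]             # involves processor 4 only

normal = run(*run(states, buffer, sigma1), sigma2)
opposite = run(*run(states, buffer, sigma2), sigma1)
print(normal == opposite)
```

Because the two schedules touch different processors, neither can consume a message the other needs, so both orders end in the same configuration.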

39

Lemma 1 – visually

[Figure: processors 1 (X1=1, Y1=b), 2 (X2=0, Y2 changing from b to 1), 3 and 4, with messages (2,m1), (1,m2), (1,m3), (4,m4) and (4,m5) in the messages buffer. Sequence 1 is applied first, then Sequence 2.]

40

Lemma 1 – visually (opposite order)

[Figure: processors 1 (X1=1, Y1=b), 2 (X2=0, Y2 changing from b to 1), 3 and 4, with messages (2,m1), (1,m2), (1,m3), (4,m4) and (4,m5) in the messages buffer. Sequence 2 is applied first, then Sequence 1.]

41

Lemma 1 – visually

[Figure: the results of the normal order and of the opposite order, side by side.]

42


Proof – definitions (5/6)
• Let FDV(C) be the union of DV(C’) over every C’ reachable from C.
– If FDV(C) = {0,1}, C is bivalent.
– If |FDV(C)| = 1, C is univalent.
– If FDV(C) = {0}, C is 0-valent.
– If FDV(C) = {1}, C is 1-valent.
– P’ is totally correct, so FDV(C) ≠ ∅.

• Intuitively, FDV(C) is the set of possible decisions from configuration C.
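For a finite, explicitly listed reachability graph, FDV(C) is a plain graph search. The following sketch assumes that configurations are given as labels, `successors` maps each configuration to the configurations one step away, and `dv` gives DV for each of them; all of these names and the toy graph are illustrative assumptions.

```python
def fdv(config, successors, dv):
    """Union of DV(C') over every C' reachable from config (config included)."""
    seen, stack, values = {config}, [config], set()
    while stack:
        current = stack.pop()
        values |= dv[current]
        for nxt in successors.get(current, ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return values


def valency(config, successors, dv):
    values = fdv(config, successors, dv)
    if values == {0, 1}:
        return "bivalent"
    if len(values) == 1:
        return f"{next(iter(values))}-valent"
    return "no decision reachable"  # FDV empty: excluded for a totally correct protocol


# Toy graph: from C the run can still go either way.
successors = {"C": ["A", "B"], "A": ["A0"], "B": ["B1"]}
dv = {"C": set(), "A": set(), "B": set(), "A0": {0}, "B1": {1}}
print(valency("C", successors, dv), valency("A", successors, dv))
```

In the real proof the set of reachable configurations is infinite, so this is only a way to see the definitions at work, not a decision procedure.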

43

Lemma 2

• Lemma: There is a bivalent initial configuration.

44

Lemma 2 – Proof (1/3)

• Assume otherwise.
• From partial correctness, P’ has both 0-valent and 1-valent initial configurations.
• Let us call two initial configurations adjacent if they differ only in a single processor’s input value.
• Any two initial configurations can be joined by a chain of adjacent configurations.
• Hence, there are two adjacent initial configurations, one 0-valent and one 1-valent.
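The chain argument can be sketched as a search along the chain. The code assumes, as the proof does for contradiction, that every initial configuration is univalent, and represents that by a labelling function `valency`; the labelling used in the example (the input of processor 4) is a hypothetical stand-in, not something a real protocol is known to satisfy.

```python
def adjacent_chain(start, end):
    """Yield input vectors from start to end, changing one processor's input per step."""
    current = list(start)
    yield tuple(current)
    for i, target in enumerate(end):
        if current[i] != target:
            current[i] = target
            yield tuple(current)


def find_flip(start, end, valency):
    """Return the first adjacent pair on the chain whose valencies differ."""
    chain = list(adjacent_chain(start, end))
    for a, b in zip(chain, chain[1:]):
        if valency(a) != valency(b):
            return a, b
    return None


# Hypothetical labelling: a configuration is v-valent where v is processor 4's input.
pair = find_flip((1, 1, 1, 0), (0, 0, 0, 1), lambda inputs: inputs[3])
print(pair)
```

Since the chain starts at a 0-valent configuration and ends at a 1-valent one, the label must change somewhere along it, and that step gives the adjacent pair.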

45


Lemma 2 – Proof (2/3)
• Reminder 1: there are two adjacent initial configurations, one 0-valent and one 1-valent.
– Let us call them C0 and C1, respectively.

• C0 and C1 are adjacent, so there is exactly one processor, p, whose input value differs between them.

• Reminder 2: P’ is totally correct in spite of one fault.
– So P’ should reach a decision even if a processor fails.

46


Lemma 2 – Proof (3/3)
• Let R be an admissible run from C0 in which p fails. By total correctness in spite of one fault, R must be a deciding run. Let σ be the corresponding finite schedule, up to the decision.

• If 1 ∈ DV(σ(C0)), then 1 ∈ FDV(C0), but C0 is 0-valent. So 1 ∉ DV(σ(C0)), and therefore DV(σ(C0)) = {0}.

• However, since the only difference between C0 and C1 is the input of p, and p fails, σ is also legal on C1 and σ(C0) ≈ σ(C1) (equal except for p, which failed and therefore didn’t decide). So DV(σ(C0)) = DV(σ(C1)) = {0}, hence 0 ∈ FDV(σ(C1)) ⊆ FDV(C1), but C1 is 1-valent. Contradiction.

47


Proof – definitions (6/6)
• For any configuration C and event e=(p,m) such that e(C) is legal, let Rne(C) be the set of all configurations reachable from C without applying e.
– Note that e can be applied to any C’ ∈ Rne(C).
• Let eR(C) be {e(C’) | C’ ∈ Rne(C)}.
• Let two configurations C, C’ be called neighbors if one is reachable from the other in a single step.
– Equivalently, there exists an event e such that C’ = e(C) or C = e(C’).

48

Lemma 3

• If C is bivalent then, for each event e=(p,m) applicable to C, eR(C) contains a bivalent configuration.

49


• Assume, for contradiction, that every D ∈ eR(C) is univalent.
• C is bivalent, and therefore, for each i ∈ {0,1} there exists an i-valent configuration Ei that is reachable from C. Let σi be a schedule such that Ei = σi(C).
• Let the configuration Fi be:
– If e ∉ σi, Fi = e(Ei).
– If e ∈ σi, then σi = σi’(e(σi’’)), where σi’’ is the part of σi before the first occurrence of e, and Fi = e(σi’’(C)).
• In both cases, Fi ∈ eR(C), and Fi is i-valent
– since Fi is univalent by assumption and either Fi is reachable from Ei or vice versa.

50


• So eR(C) contains both 0-valent and 1-valent configurations.

• By an easy induction on the length of the schedule to Fi (where e(C) is j-valent for j ≠ i), there exist two neighbors C0, C1 such that Di = e(Ci) is i-valent for i ∈ {0,1}.

• Without loss of generality, assume C1 = e’(C0).

51

“Easy Induction” (in pictures) for e(C) is 0-valent: case A (base)

[Figure: C = C0 and its neighbor C1; e(C0) is 0-valent and F1 = e(C1) is 1-valent.]

52

“Easy Induction” (in pictures) for e(C) is 0-valent: case B (step)

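[Figure: the path from C to C1 with e applied along it. e(C) is 0-valent and F1 = e(C1) is 1-valent; when e applied to the next configuration on the path is still 0-valent, the induction continues from that configuration until the neighbors C0, C1 are found.]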

53

“Easy Induction” (in pictures) for e(C) is 0-valent: case C (contradiction )

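[Figure: e(C) is 0-valent, and a configuration R on the path to F1 has e(R) bivalent. Since e(R) ∈ eR(C), this contradicts the assumption that every configuration in eR(C) is univalent.]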

54


• Reminders:
– e=(p,m).
– C0, C1 are neighbors.
– Di = e(Ci) is i-valent for i ∈ {0,1}.
– C1 = e’(C0).
– Lemma 1: if two schedules are disjoint, you can execute them in either order.

55


• Let e’=(p’,m’).
– If p’ ≠ p: the schedules σ=(e), σ’=(e’) are disjoint, so by Lemma 1: D1 = e(e’(C0)) = σ(σ’(C0)) = σ’(σ(C0)) = e’(e(C0)) = e’(D0). But then 1 ∈ FDV(D0), a contradiction.

– If p’ = p: consider a finite deciding run from C0 in which p takes no steps. Since it mimics a single fault (quasi-failure) of p, and P’ is totally correct in spite of one fault, such a run exists.

56

If p’≠p:

From “Impossibility of Distributed Consensus with One Faulty Process” By: Michael J. Fischer, Nancy A. Lynch, Michael S. Paterson

57


• A deciding run in which p quasi-fails:
– Let σ be the corresponding schedule.
– Let A = σ(C0).
– A is a deciding configuration, meaning |DV(A)| > 0 and therefore |FDV(A)| = 1, i.e. A is univalent (from the partial correctness of P’).
– σ’=(e’,e) and σ’’=(e) are disjoint from σ, since σ contains no event of p (p quasi-fails), while σ’ and σ’’ contain only events of p (since p=p’).

58



59


• From Lemma 1: e(A) = σ’’(σ(C0)) = σ(σ’’(C0)) = σ(e(C0)) = σ(D0), so 0 ∈ FDV(A).

• From Lemma 1: e(e’(A)) = σ’(σ(C0)) = σ(σ’(C0)) = σ(D1), so 1 ∈ FDV(A).

• But now A is bivalent, a contradiction!

60

If p’=p:

[Figure, built up over several slides: by Lemma 1, two configurations are reachable from A, one leading to σ(D0) and one to σ(D1). So A is bivalent, but σ is deciding.]

65

Proof – conclusion (1/4)
• To finish the proof, we will now show an execution that never reaches a decision.
• Reminder:
– A protocol P is totally correct in spite of one fault if:
  • P is partially correct.
  • Every admissible run of P is a deciding run.
– A run is admissible if it contains at most one faulty processor and the messages buffer is fair.
– A run is deciding if eventually, for some processor p, yp ≠ b (and therefore it reaches a univalent configuration).
• We will assume that P is partially correct and find an admissible run that is not deciding.

66

Proof – conclusion (2/4)
• First, we define a way to ensure that the run is admissible. Keep a queue of the processors and define stages in the following way:
– A stage ends when the first processor in the processor queue receives the earliest message sent to it (or no message if none was sent).
– At the end of the stage, that processor is removed from the head of the queue and appended to its tail.
• Since each stage ends with the next processor in the queue receiving the earliest message sent to it, infinitely many stages mean:
– Infinitely many steps by each processor.
– Every message is eventually received.
• Therefore, the run will be admissible.
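The queue discipline above can be sketched as follows. This is a simplified illustration under assumptions: it only produces the event that closes each stage and ignores the extra events that a stage may contain before it ends, and the function name and the sample messages are hypothetical.

```python
from collections import deque


def fair_stages(processor_ids, pending, stages):
    """Return the closing event of each stage under the round-robin queue rule."""
    queue = deque(processor_ids)
    inbox = {p: deque(pending.get(p, ())) for p in processor_ids}
    events = []
    for _ in range(stages):
        p = queue.popleft()
        # Earliest message sent to p, or the null message if none is pending.
        m = inbox[p].popleft() if inbox[p] else None
        events.append((p, m))
        queue.append(p)  # p moves from the head of the queue to its tail
    return events


pending = {1: ["m1", "m7"], 2: ["m2"], 3: ["m3"], 4: ["m4"]}
events = fair_stages([2, 3, 1, 4], pending, stages=6)
print(events)
```

With N processors, every processor closes a stage once in every N consecutive stages, which is what gives each processor infinitely many steps in an infinite run.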

67

The run will be admissible

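[Figure: processors 1–4, the processor queue (2, 3, 1, 4), and the message queues of P1–P4 holding messages m1–m10.]

• The processor at entry j of the processor queue will run after at most j stages (here, 3).

• The message at place j of a message queue will be received after at most N · j stages (here, 4 · 3 = 12).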

68

The run will be admissible

[Figure, animated stage by stage: the processor queue rotates 2,3,1,4 → 3,1,4,2 → 1,4,2,3 → 4,2,3,1 → 2,3,1,4, and at the end of each stage the processor at the head of the queue receives the earliest message in its message queue.]

73

Proof – conclusion(3/4)

• We will assume that P is partially correct and find an admissible run that is not deciding.
– Now, let us make sure that it is not deciding:
1. Start from a bivalent initial configuration C (Lemma 2).
2. Let e denote the earliest message in the message queue of the first processor in the processor queue. There is a bivalent configuration C’ reachable from C by a schedule that ends with e (Lemma 3).
3. Set C = C’ (the stage ends).
4. Return to step 2.

74

Proof – conclusion(4/4)

• We will assume that P is partially correct and find an admissible run that is not deciding.
– Since each stage ends in a bivalent configuration, the run is not deciding.
• Therefore, P is not totally correct!

Q.E.D.

75

THE END! Questions?


76

Chain of adjacent configurations

[Figure sequence. Inputs are written as (X1, X2, X3, X4); all outputs are Yi=b.
– d=4: 0-valent (1,1,1,0) and 1-valent (0,0,0,1).
– Changing one input of the 0-valent configuration gives (0,1,1,0), whose valency is not yet known.
– Case 1: (0,1,1,0) is 1-valent. Then (1,1,1,0) and (0,1,1,0) are adjacent and have different valencies.
– Case 2: (0,1,1,0) is 0-valent. Then continue with 0-valent (0,1,1,0) and 1-valent (0,0,0,1), now at d=3.
– Repeating the step (d=3…2…1) ends with 0-valent (0,0,0,0) and 1-valent (0,0,0,1) at d=1, which are adjacent.]

83

Initially dead processors

• Assume:
– N processors.
– At least L = ⌈(N+1)/2⌉ (a majority) processors are alive.
– The processors don’t know who is alive.

• We want to reach a consensus.

84

Two stages Algorithm – stage 1

• In the first stage, we will build a distributed directed graph G.

• The graph is built in the following way:
– Each processor has a corresponding node.
– Each processor sends its id to every other processor.
– Each processor waits for messages from L-1 other processors.
– If a message from processor i reaches processor j (among the L-1 messages j waits for), an edge (i,j) is added to the graph.
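Stage 1 can be simulated centrally to see the shape of G. In this sketch the asynchronous arrival order is imitated by a seeded shuffle, and L is taken to be ⌈(N+1)/2⌉; the function name, the seed parameter and the chosen set of live processors are assumptions made for illustration.

```python
import math
import random


def stage_one(alive, L, seed=0):
    """Return the edges (i, j) of G: j hears from the first L-1 live senders to reach it."""
    if len(alive) < L:
        raise ValueError("fewer than L live processors: nobody would finish waiting")
    rng = random.Random(seed)
    edges = set()
    for j in sorted(alive):
        senders = [i for i in sorted(alive) if i != j]
        rng.shuffle(senders)          # stand-in for an arbitrary arrival order
        for i in senders[:L - 1]:
            edges.add((i, j))
    return edges


N = 7
L = math.ceil((N + 1) / 2)            # 4 for N = 7
alive = {2, 3, 4, 5, 6}               # processors 1 and 7 are initially dead
G = stage_one(alive, L)
in_degree = {j: sum(1 for (_, target) in G if target == j) for j in alive}
print(in_degree)
```

Every live node ends up with in-degree exactly L-1, which is the property the clique argument later relies on.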

85

stage 1 – Example

[Figure sequence: seven processors (1–7), shown first from processor 2’s point of view and then in a global view.]

89

Two stages Algorithm – stage 2

• In the second stage, we will build the graph G+, which is the transitive closure of G, so that every processor knows enough of the graph.

• The graph is built in the following way:
– Each processor sends to all the others its:
1. id.
2. Initial value.
3. L-1 neighbors (the processors it heard from in stage 1).
– Each processor waits until it has received such a message from all its ancestors.

90

stage 2 – Example (processor 2 view point)

[Figure sequence: seven processors (1–7). Processor 2 sends (2, x2, [3,4,5]) and receives (3, x3, [2,4,5]), (4, x4, [2,3,5]) and (5, x5, [2,4,6]). Through the transitive closure it also learns of processor 6 and receives (6, x6, [2,3,5]).]

98

Clique in G+ (1/2)
• Claim: G+ contains one, and only one, clique of size L or more that is not fully contained in another clique.
• Proof, in the following steps. First, G+ contains at least one such clique:
– For each k < N, because the in-degree of each node in G is L-1, if G contains a path of size k then:
  • G contains a cycle of size at least L,
  or
  • G contains a path of size k+1.
– Corollary: if G contains a path of size N, it contains a cycle of size at least L (because the second option is not possible).
– Corollary: G contains a cycle of size at least L.
– Since G+ is the transitive closure of G, if G contains a cycle of size k then G+ contains a clique of size k.

99

Contains at least one clique

[Figure sequence: (1) a path of size L, A1, …, AL, is built node by node using the L-1 incoming edges of each node; (2) induction for k ≥ L on a path of size k, split into a path of size k-(L-1) followed by node A and a path of size L-2, which can hold at most L-2 of A’s incoming neighbors; case 1: a cycle of size at least L; case 2: a new node B, giving a path of size k+1.]

107

Clique in G+ (2/2)

• Contains at most one such clique:
– If G+ contained two, then since L is a majority of the nodes, there is a node in both cliques.
– By transitivity, the union of the node sets of both cliques is itself a clique, so the two cliques are not maximal.

108

Contain at most one clique

[Figure: nodes i and j taken from the two cliques, connected by transitivity through the node the cliques share.]

109

Two stages Algorithm – Finish

• Claim: each living processor knows about the clique.
– That is because each node in the graph is a descendant of a processor in the clique, and therefore all nodes in the clique are its ancestors and it will wait for them.

• The consensus: let f be any function of the form f: ({0,1} × 2^|V|) → {0,1}, known by all processors (part of their state). Then f(unique clique) is a binary value known by all processors.

• Consensus is reached!
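The final step can be sketched on an explicit edge list. The code below assumes the clique is characterised as in the source paper (the nodes that are ancestors of each of their own ancestors), computes ancestors directly from G instead of materialising G+, and uses a hypothetical agreed rule f, namely the input of the lowest-numbered clique member. The graph and inputs are invented for the example, with N=5 and L=3.

```python
def ancestors(edges, node):
    """All nodes with a directed path to `node` in G (its in-neighbors in G+)."""
    found, stack = set(), [node]
    while stack:
        current = stack.pop()
        for i, j in edges:
            if j == current and i not in found:
                found.add(i)
                stack.append(i)
    return found


def initial_clique(edges, nodes):
    """Nodes that are themselves ancestors of every one of their ancestors."""
    anc = {n: ancestors(edges, n) for n in nodes}
    return {k for k in nodes if all(k in anc[j] for j in anc[k])}


def decide(edges, inputs):
    clique = initial_clique(edges, inputs.keys())
    # Hypothetical agreed rule f: the input of the lowest-numbered clique member.
    return inputs[min(clique)]


# Every node has in-degree L-1 = 2.
edges = [(2, 1), (3, 1), (1, 2), (3, 2), (1, 3), (2, 3), (1, 4), (5, 4), (3, 5), (4, 5)]
inputs = {1: 1, 2: 0, 3: 1, 4: 0, 5: 0}
print(initial_clique(edges, inputs.keys()), decide(edges, inputs))
```

Any deterministic rule would do in place of the one in `decide`, as long as every processor applies the same rule to the same clique.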

110

THE END! Questions?
