Impossibility of Consensus in Distributed Systems… and other tales about distributed computing theory

Nancy Lynch, MIT
Adriaan van Wijngaarden lecture, CWI 60th anniversary, February 9, 2006


Page 1

Impossibility of Consensus in Distributed Systems…and other tales about distributed computing theory

Nancy Lynch, MIT
Adriaan van Wijngaarden lecture, CWI 60th anniversary, February 9, 2006

Page 2

1. Prologue

Thank you!

Adriaan van Wijngaarden: Numerical analysis, programming languages, CWI leadership.
My contributions: Distributed computing theory.
This talk:
- A general description of (what I think are) my main contributions, with history + perspective.
- Highlight of a particular result: the impossibility of reaching consensus in a distributed system, in the presence of failures [Fischer, Lynch, Paterson 85].

Page 3

2. My introduction to distributed computing theory

1972-78: Complexity theory.
1978, Georgia Tech: Distributed computing theory.
Dijkstra's mutual exclusion algorithm [Dijkstra 65]:
- Several processes run, with arbitrary interleaving of steps, as if concurrently.
- They share read/write memory.
- They arbitrate the usage of a single higher-level resource:
  - Mutual exclusion: Only one process can "own" the resource at a time.
  - Progress: Someone should always get the resource, when it's available and someone wants it.

Page 4

Dijkstra’s Mutual Exclusion algorithm

Initially: all flags = 0; turn is arbitrary.
To get the resource, process i does the following:
- Phase 1: Set flag(i) := 1. Then repeatedly: if turn = j for some j ≠ i, and flag(j) = 0, set turn := i. When turn = i, move on to Phase 2.
- Phase 2: Set flag(i) := 2. Check everyone else's flag to see if any = 2. If so, go back to Phase 1. If not, move on and get the resource.
To return the resource: Set flag(i) := 0.
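The phases above translate almost line-for-line into executable form. A sketch in Python threads (illustrative: the algorithm assumes atomic reads and writes of the shared variables, which CPython's interpreter lock happens to provide here):

```python
# Dijkstra's mutual exclusion algorithm, as described in the phases above.
# flag values: 0 = idle, 1 = contending (Phase 1), 2 = trying to enter (Phase 2).

import threading, time

N = 3
flag = [0] * N          # shared flags
turn = 0                # shared turn variable (initial value arbitrary)
counter = 0             # read-modify-write protected only by the lock
in_cs = 0               # instrumentation: processes currently inside the CS
max_in_cs = 0           # instrumentation: worst overlap observed

def acquire(i):
    global turn
    while True:
        flag[i] = 1                              # Phase 1
        while turn != i:
            if flag[turn] == 0:                  # turn's owner is idle
                turn = i                         # claim the turn
            time.sleep(0)                        # let other threads run
        flag[i] = 2                              # Phase 2
        if all(flag[j] != 2 for j in range(N) if j != i):
            return                               # no rival in Phase 2: enter
        # some other flag = 2: go back to Phase 1

def release(i):
    flag[i] = 0

def worker(i):
    global counter, in_cs, max_in_cs
    for _ in range(50):
        acquire(i)
        in_cs += 1
        max_in_cs = max(max_in_cs, in_cs)
        counter += 1                             # safe only under exclusion
        in_cs -= 1
        release(i)

threads = [threading.Thread(target=worker, args=(i,)) for i in range(N)]
for t in threads: t.start()
for t in threads: t.join()
print(counter, max_in_cs)    # 150 increments, never two threads inside at once
```

The algorithm guarantees mutual exclusion and progress, but not fairness: a process can in principle starve, which is exactly the distinction drawn on the next slides.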

Page 5

Dijkstra’s Mutual Exclusion algorithm

It is not obvious that this algorithm is correct (mutual exclusion, progress):
- The properties must hold regardless of the order of read and write steps.
- Interleaving complications don't arise in sequential algorithms.
In general, how should we go about arguing correctness of such algorithms?
This got me interested in learning how to prove properties of:
- Algorithms for systems of parallel processes that share memory.
- Algorithms in which processes communicate by channels (with possible delay).
And it led to work on general techniques for:
- Modeling distributed algorithms precisely, using interacting state-machine models.
- Proving their correctness.

Page 6

Impossibility results

Distributed algorithms have inherent limitations, because they must work in badly-behaved settings:
- Arbitrary interleaving of process steps.
- Action based only on local knowledge.
With precise models, we could hope to prove impossibility results, saying that certain problems cannot be solved in certain settings.
First example [Cremers, Hibbard 76]:
- Mutual exclusion with fairness: every process who wants the resource eventually gets it.
- Not solvable for two processes with one shared variable taking two values.
- Even if processes can use operations more powerful than reads/writes.
Burns, Fischer, and I started trying to identify other cases where problems provably could not be solved in distributed settings.
That is, to understand the nature of computability in distributed settings.

Page 7

3. The next 20 years

Lots of work on algorithms: mutual exclusion, resource allocation, clock synchronization, distributed consensus, leader election, reliable communication…
And even more work on impossibility results.
And on modeling and verification methods.

Page 8

Example impossibility result [Burns, Lynch 93]

Mutual exclusion for n processes, using read/write shared memory, requires at least n shared variables.
Even if:
- No fairness is required, just progress.
- Everyone can read and write all the variables.
- The variables can be of unbounded size.
Example, n = 2: Suppose two processes solve mutual exclusion, with progress, using only one read/write shared variable x.
Suppose process 1 arrives alone and wants the resource. By the progress requirement, it must be able to get it.
Along the way, process 1 must write to the shared variable x: if it did not, process 2 wouldn't know that process 1 was there, and process 2 could get the resource too, contradicting mutual exclusion.

[Figure: processes p1 and p2 sharing the single variable x]

Page 9

Impossibility for mutual exclusion

[Figure: the bad execution]
1. p1 arrives, runs alone, and along the way writes x; pause it just before that write.
2. p2 arrives, runs alone (writing x), and gets the resource.
3. p1's pending write of x now overwrites what p2 wrote.
4. p1 continues as if it were running alone, and also gets the resource.
Contradicts mutual exclusion.
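To see the overwrite phenomenon concretely, here is one deliberately naive single-variable protocol (my own illustration, not from the lecture): each process reads x, and if it looked free, writes its id and enters. The schedule above breaks it; the theorem says every one-variable read/write protocol fails some such schedule.

```python
# The adversary schedule from the slide, run against a naive one-variable lock.

x = 0                          # the single shared variable; 0 = free

class Proc:
    def __init__(self, pid):
        self.pid = pid
        self.in_cs = False
    def read(self):
        self.saw = x           # step 1: read the shared variable
    def write_and_enter(self):
        global x
        if self.saw == 0:      # x looked free when we read it
            x = self.pid       # step 2: write (overwrites anything since)
            self.in_cs = True

p1, p2 = Proc(1), Proc(2)
p1.read()                          # p1 runs alone, pause it before its write
p2.read(); p2.write_and_enter()    # p2 runs alone and gets the resource
p1.write_and_enter()               # p1's write overwrites p2's trace
print(p1.in_cs and p2.in_cs)       # True: both inside, exclusion violated
```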

Page 10

Impossibility for mutual exclusion

Mutual exclusion with n processes, using read/write shared memory, requires n shared variables:
- The argument for n > 2 is more intricate.
- The proofs are done in terms of math models.
The example shows the key ideas:
- A write operation to a shared variable overwrites everything previously in the shared variable.
- A process sees only its own state and the values of the variables it reads---its actions depend only on "local knowledge".

[Figure: processes p1, …, pn and their shared variables x1, x2, …]

Page 11

Modeling and proof techniques

More and more clever, complex algorithms appeared:
- [Gallager, Humblet, Spira 83] Minimum Spanning Tree algorithm.
- Communication algorithms in networks with changing connectivity [Awerbuch].
- Concurrency control algorithms for distributed databases.
- Atomic memory algorithms [Burns, Peterson 87], [Vitanyi, Awerbuch 87], [Kirousis, Kranakis, Vitanyi 88],…
We needed:
- A simple, general math foundation for modeling algorithms precisely, and
- Usable, general techniques for proving their correctness.
We worked on these…

Page 12

Modeling techniques: the I/O Automata framework [Lynch, Tuttle, CWI Quarterly 89]

I/O automaton: a state machine that can interact, using input and output actions, with other automata or with an external environment.
Composition:
- Compose I/O automata to yield other I/O automata.
- Model a distributed system as a composition of process and channel automata.
Levels of abstraction:
- Model a system at different levels of abstraction.
- Start from a high-level behavior specification.
- Refine, in stages, to a detailed algorithm description.
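As a toy rendering of these ideas, one can sketch automata and the composition rule in a few lines of Python (class and action names are illustrative; the real framework is a mathematical model, not a library):

```python
# A toy I/O-automaton sketch: each automaton is a state plus named
# transitions; composition synchronizes components on shared action names,
# with input actions always enabled.

class IOAutomaton:
    def __init__(self, state, transitions):
        # transitions: name -> (precondition(state), effect(state, payload))
        self.state = state
        self.transitions = transitions

def step(system, action, payload=None):
    """Composition rule: every component that has `action` transitions on it."""
    for aut in system:
        if action in aut.transitions:
            pre, eff = aut.transitions[action]
            assert pre(aut.state), f"action {action!r} not enabled"
            aut.state = eff(aut.state, payload)

always = lambda s: True    # input actions are always enabled

# Process automaton: output "send" emits a message it holds.
sender = IOAutomaton({"pending": ["hi"]}, {
    "send": (lambda s: bool(s["pending"]),
             lambda s, m: {"pending": s["pending"][1:]}),
})

# Channel automaton: input "send" enqueues; output "deliver" dequeues.
channel = IOAutomaton({"queue": []}, {
    "send":    (always, lambda s, m: {"queue": s["queue"] + [m]}),
    "deliver": (lambda s: bool(s["queue"]),
                lambda s, m: {"queue": s["queue"][1:]}),
})

# Receiver automaton: input "deliver" records the message.
receiver = IOAutomaton({"got": []}, {
    "deliver": (always, lambda s, m: {"got": s["got"] + [m]}),
})

system = [sender, channel, receiver]
step(system, "send", "hi")      # sender and channel move together
step(system, "deliver", "hi")   # channel and receiver move together
print(receiver.state["got"])    # ['hi']
```

The composed system is itself an automaton over the product state, which is what lets a distributed system be modeled as a composition of process and channel automata.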

Page 13

Proof techniques

Invariant assertions: statements about the system state, proved by induction on the number of steps in an execution.
Entropy functions, to argue progress.
Simulation relations:
- Construct an abstract version of the algorithm; it need not be a distributed algorithm.
- The proof then breaks into two pieces:
  - Prove correctness of the abstract algorithm. Interesting, since it involves the deep logical ideas behind the algorithm; tractable, because the abstract version is simple.
  - Prove that the real algorithm emulates the abstract version, via a simulation relation. Tractable, since it is generally a simple step-by-step correspondence that does not involve the logical ideas behind the algorithm.
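On small finite models, the invariant-assertion idea can even be mechanized: exhaustively explore every reachable state and check the assertion in each, a brute-force stand-in for the hand induction. A sketch on an illustrative two-process token system (states and transitions are my own toy example):

```python
# Check an invariant by exhaustive reachability: a state is (token holder,
# (in_cs for process 0, in_cs for process 1)).

def successors(state):
    holder, cs = state
    out = []
    if not cs[holder]:                      # holder enters its critical section
        new = list(cs); new[holder] = True
        out.append((holder, tuple(new)))
    if cs[holder]:                          # holder leaves and passes the token
        new = list(cs); new[holder] = False
        out.append((1 - holder, tuple(new)))
    return out

def check(initial, invariant):
    """Assert `invariant` in every state reachable from `initial`."""
    seen, frontier = {initial}, [initial]
    while frontier:
        s = frontier.pop()
        assert invariant(s), f"invariant violated in {s}"
        for t in successors(s):
            if t not in seen:
                seen.add(t); frontier.append(t)
    return len(seen)

mutex = lambda s: sum(s[1]) <= 1            # at most one process in its CS
n = check((0, (False, False)), mutex)
print(n)                                    # reachable states explored
```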

Page 14

Example: Mutual exclusion in a tree network

From [Lynch, Tuttle, CWI Quarterly 89]. Allocate a resource (fairly) among processes at the nodes of a tree.
Algorithm:
- Use a token to represent the single resource.
- The token traverses the subtree of active requests systematically.
Proof:
- Describe an abstract version: a graph with a moving token.
- Prove the abstract version yields the needed properties.
- Prove a simulation relation between the real algorithm and the abstract version.
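The abstract token idea can be sketched directly (the tree shape and traversal order here are illustrative, not from the paper):

```python
# Abstract view of the tree algorithm: a token traverses the tree and is
# granted at each requesting node it visits, one holder at a time.

tree = {0: [1, 2], 1: [3, 4], 2: [], 3: [], 4: []}   # node -> children

def serve(requests, node=0, order=None):
    """DFS traversal from the root; grant the token at each requesting node."""
    if order is None:
        order = []
    if node in requests:
        order.append(node)          # this node holds the token: uses resource
    for child in tree[node]:
        serve(requests, child, order)
    return order

grants = serve({4, 2, 3})
print(grants)   # each requester is served exactly once, sequentially
```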

Page 15

4. FLP

[Fischer, Lynch, Paterson 85]: Impossibility of consensus in fault-prone distributed systems.
My best-known result… Dijkstra Prize, 2001.

Page 16

Distributed Consensus

A set of processes in a distributed network, operating at arbitrary speeds, want to reach agreement. E.g., about:
- The value of a sensor reading.
- Whether to accept/reject the results of a database transaction.
- Abstractly, a value in some set V.
Each process starts with an initial value in V, and they want to decide on a value in V:
- Agreement: All decide on the same value.
- Validity: The decision should be some process's initial value.
The twist: a (presumably small) number of processes might be faulty, and might not participate correctly in the algorithm.
The problem appeared as the database commit problem [Gray 78] and the Byzantine agreement problem [Pease, Shostak, Lamport 80].

Page 17

FLP Impossibility Result

[Fischer, Lynch, Paterson 85] proved an impossibility result for distributed consensus.
The proof works even for very limited failures:
- At most one process ever fails, and everyone knows this.
- The process may simply stop, without warning.
Original result: processes communicate using channels (with possible delays). The same result, with essentially the same proof, holds for read/write shared memory.
The result seemed counter-intuitive:
- If there are many processes, and at most one can fail, then it seems like the rest could agree, and tell the faulty process the decision later…
- But nonfaulty processes don't know whether the other process has failed.
- Still, it seems like all but one of the processes could agree, then later tell the remaining process the decision (whether or not it has failed).
- But no, this doesn't work!

Page 18

FLP Impossibility proof

Proceed by contradiction---assume an algorithm exists to solve consensus, and argue from the problem requirements that it can't work.
Assume V = {0,1}. Notice that:
- In an "extreme" execution, in which everyone starts with 0, the only allowed decision is 0.
- Likewise, if everyone starts with 1, the only allowed decision is 1.
- For "mixed" inputs, the requirements don't say.

Page 19

FLP Impossibility proof

First prove that the algorithm must have the following pattern of executions, a "Hook": a reachable point such that
- if process i takes the next step, then the only possible decision thereafter is 0;
- if process j takes the next step, followed by i, then the only possible decision is 1.
Thus, we can "localize" the decision point to a particular pattern of executions.
For, if not, we can maneuver the algorithm to continue executing forever, everyone continuing to take steps and no one ever deciding.
That contradicts the requirement that all the nonfaulty processes should eventually decide.

[Figure: a Hook --- after i's step, 0 is the only possible decision; after j's step followed by i's step, 1 is the only possible decision.]

Page 20

FLP Impossibility proof

Now get a contradiction, based on what processes i and j do in their respective steps. Each reads or writes a shared variable.
They must access the same variable x:
- If not, their steps are independent, so the order can't matter.
- So different orders couldn't result in different decisions. Contradiction.
They can't both read x: the order of reads can't matter, since reads don't change x.
That leaves three cases:
1. i reads x and j writes x.
2. i writes x and j reads x.
3. Both i and j write x.

Page 21

FLP Impossibility proof

Case 3: Both write x.
- What is different after the step sequence "i" vs. "j, then i"? In the second case, j writes to the variable x before i does.
- But then i immediately overwrites what j wrote.
- So the only difference is internal to j.
- If we fail j, we can run the rest of the processes after "i" and after "j, then i", and they will do exactly the same thing.
- But this contradicts the fact that they must decide differently in the two cases!
Case 1: i reads x and j writes x. Similar argument.
Case 2: i writes x and j reads x. Similar argument.
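The Case 3 indistinguishability can be made concrete in a toy model where a write simply overwrites x: compare the global state after step "i" with the state after "j, then i".

```python
# Two runs of writes to a single shared variable x; each writer also bumps
# its own internal step counter. (Toy states, for illustration only.)

def run(order):
    x = None                        # the shared variable
    internal = {"i": 0, "j": 0}     # internal state of each writer
    for p in order:
        x = p                       # a write overwrites whatever was in x
        internal[p] += 1
    return x, internal

x1, s1 = run(["i"])                 # i steps first
x2, s2 = run(["j", "i"])            # j steps, then i overwrites
print(x1 == x2, s1["i"] == s2["i"], s1["j"] == s2["j"])
# shared variable and i's state agree; only j's internal state differs
```

Since only j's internal state distinguishes the two runs, failing j makes them indistinguishable to everyone else, yet the Hook says they must lead to different decisions.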

Page 22

Significance of FLP

Significance for distributed computing practice:
- Reaching agreement is sometimes important in practice: agreeing on aircraft altimeter readings, database transaction commit.
- FLP shows limitations on the kind of algorithm one can look for: one cannot hope for a timing-independent algorithm that tolerates even one process stopping failure.
Main impact: distributed computing theory.
1. Variations on the result:
- FLP was proved for distributed networks, with reliable broadcast communication.
- [Loui, Abu-Amara 87] extended FLP to read/write shared memory.
- [Herlihy 91] considered consensus with stronger fault-tolerance requirements (any number of failures); simpler proof.
- New proofs of FLP are still being produced.

Page 23

Significance of FLP

2. Ways to circumvent the impossibility result:
- Using limited timing information [Dolev, Dwork, Stockmeyer 87].
- Using randomness [Ben-Or 83], [Rabin 83], with weaker guarantees: a small probability of a wrong decision, or the probability of not yet having terminated approaches 0 as time approaches infinity.
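A much-simplified sketch of the randomized route, in the spirit of Ben-Or's protocol (the n > 2f threshold is standard, but this benign lock-step simulation is my illustration, not the paper's asynchronous protocol):

```python
# Ben-Or-style binary consensus, simulated in lock-step rounds with up to
# f crashed processes. Each round: report estimates; propose a value only
# if a strict majority of n reported it; decide on a proposal, else flip.

import random
random.seed(1)

def ben_or(inputs, f, crashed=(), max_rounds=100):
    n = len(inputs)
    live = [p for p in inputs if p not in crashed]
    x = {p: inputs[p] for p in live}      # current estimate per process
    decided = {}
    for _ in range(max_rounds):
        # Phase 1: everyone reports its estimate. (Real processes wait for
        # n - f reports; in this benign schedule all live reports arrive.)
        reports = [x[p] for p in live]
        # At most one value can have a strict majority of n.
        proposed = [v for v in (0, 1) if reports.count(v) * 2 > n]
        # Phase 2: with a proposal, decide it (all live processes propose
        # alike here, so the f+1 threshold is met); otherwise flip a coin.
        for p in live:
            if proposed:
                decided.setdefault(p, proposed[0])
                x[p] = proposed[0]
            else:
                x[p] = random.randint(0, 1)   # the local coin
        if len(decided) == len(live):
            return decided
    return decided

# Five processes, one crashed from the start, mixed inputs:
result = ben_or({0: 0, 1: 1, 2: 0, 3: 1, 4: 0}, f=1, crashed={4})
print(result)    # all live processes decide on one common value
```

Termination is only probabilistic: a round of 2-2 coin flips makes no progress, but a lopsided round creates a majority, so the probability of not yet terminating goes to 0, exactly the weaker guarantee described above.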

Page 24

Significance of FLP

3. A new, "stabilizing" version of the requirements:
- Agreement and validity must hold always.
- Termination is required only if system behavior "stabilizes" for a while: no new failures, and timing (of process steps, messages) within "normal" bounds.
This version has good solutions, both theoretically and in practice:
- The [Dwork, Lynch, Stockmeyer 88] algorithm keeps trying to choose a leader, who tries to coordinate agreement. Many attempts can fail; once the system stabilizes, a unique leader is chosen and coordinates agreement. The tricky part: ensuring failed attempts don't lead to inconsistent decisions.
- [Lamport 89] Paxos algorithm. Improves on [DLS] by allowing more concurrency, and by having a funny story. Refined and engineered for practical use.
- [Chandra, Hadzilacos, Toueg 96] Failure detectors: services that encapsulate the use of time in stabilizing algorithms. Developed algorithms like [DLS] and Paxos using failure detectors; studied properties of failure detectors; identified the weakest failure detector that solves consensus.

Page 25

Significance of FLP

4. Characterizing computability in distributed systems, in the presence of failures.
- E.g., k-consensus: at most k different decisions occur overall. The problem was defined by [Chaudhuri 93].
- Characterization of computability in distributed settings: solvable for k-1 process failures, but not for k failures.
- Algorithm for k-1 failures: [Chaudhuri 93].
- Matching impossibility result:
  - [Chaudhuri 93]: partial progress, using arguments like FLP.
  - [Herlihy, Shavit 93], [Borowsky, Gafni 93], [Saks, Zaharoglou 93]: Gödel Prize, 2004. Techniques from algebraic topology (Sperner's Lemma), used to obtain a k-dimensional analogue of the Hook.

Page 26

Open questions related to FLP

- Characterize exactly what problems can be solved in distributed systems, based on problem type, number of processes, and number of failures.
- Which problems can be used to solve which others?
- Exactly what information about timing and/or failures must be provided to processes in order to make various unsolvable problems solvable? For example, what is the weakest failure detector that allows solution of k-consensus with k failures?

Page 27

5. Modeling Frameworks

Recall I/O automata [Lynch, Tuttle 87]: state machines that interact using input and output actions.
Good for describing asynchronous distributed systems, with no timing assumptions:
- Components take steps at arbitrary speeds.
- Steps can interleave arbitrarily.
- Supports system description and analysis using composition and levels of abstraction.
I/O automata are adequate for much of distributed computing theory. But not for everything…

Page 28

Timed I/O Automata

We also need to model and analyze timing aspects of systems.
Timed I/O Automata, an extension of I/O Automata [Lynch, Vaandrager 92, 94, 96], [Kaynar, Segala, Lynch, Vaandrager 05]:
- Trajectories describe the evolution of state over a time interval.
- Can be used to describe time bounds (e.g., on message delay, process speeds) and local clocks, used by processes to schedule steps.
- Used for time performance analysis.
Hybrid I/O Automata [Lynch, Segala, Vaandrager 03]:
- Used to model hybrid systems: real-world objects (vehicles, airplanes, robots,…) + computer programs.
- Also allow continuous interactions between components.
Applications: timing-based distributed algorithms, hybrid systems.

Page 29

Probabilistic I/O Automata,…

[Segala 94]: Probabilistic I/O Automata, Probabilistic Timed I/O Automata. Express random choices and random system behavior.
Current work:
- Improving PIOA composition and simulation relations.
- Integrating PIOA with TIOA and HIOA.
The combination should allow modeling and analysis of any kind of distributed system we can think of.

Page 30

6. New Challenges

[Distributed Algorithms 96] summarizes the basic results of distributed computing theory, ca. 1996:
- Asynchronous algorithms, plus a few timing-dependent algorithms.
- Fixed, wired networks.
Still some open questions, e.g., general characterizations of computability.
New frontiers in distributed computing theory, e.g., algorithms for mobile wireless networks, which are much worse behaved than traditional wired networks:
- No one knows who the participating processes are.
- The set of participants may change.
- Mobility.
Much harder to program. So, this area needs a theory!
- New algorithms.
- New modeling and analysis methods.
- New impossibility results, giving the limits of what is possible in such networks.
The entire area is wide open for new theoretical work.

Page 31

Distributed algorithms for mobile wireless networks

My group (and others) are now working in this area, developing algorithms and proving impossibility results: clock synchronization, consensus, reliable communication,…
One approach to algorithm design: Virtual Node Layers.
- Use the existing network to implement (emulate) a better-behaved network, as a higher level of abstraction.
- Use the Virtual Node Layer to implement applications.
We are exploring VNLs, both theoretically and experimentally*.

*Note: Using CWI's Python language…

Page 32

7. Epilogue

An overview of our work in distributed computing theory, especially:
- Impossibility results.
- Models and proof methods.
Emphasis on the FLP impossibility result, for consensus in fault-prone distributed systems.

Page 33

Thanks to my collaborators:

Yehuda Afek, Myla Archer, Eshrat Arjomandi, James Aspnes, Paul Attie, Hagit Attiya, Ziv Bar-Joseph, Bard Bloom, Alan Borodin, Elizabeth Borowsky, James Burns, Ran Canetti, Soma Chaudhuri, Gregory Chockler, Brian Coan, Ling Cheung, Richard DeMillo, Murat Demirbas, Roberto DePrisco, Harish Devarajan, Danny Dolev, Shlomi Dolev, Ekaterina Dolginova, Cynthia Dwork, Rui Fan, Alan Fekete, Michael Fischer, Rob Fowler, Greg Frederickson, Eli Gafni, Stephen Garland, Rainer Gawlick, Chryssis Georgiou, Seth Gilbert, Kenneth Goldman, Nancy Griffeth, Constance Heitmeyer, Maurice Herlihy, Paul Jackson, Henrik Jensen, Frans Kaashoek, Dilsun Kaynar, Idit Keidar, Roger Khazan, Jon Kleinberg, Richard Ladner, Butler Lampson, Leslie Lamport, Hongping Lim, Moses Liskov, Carolos Livadas, Victor Luchangco, John Lygeros, Dahlia Malkhi, Yishay Mansour, Panayiotis Mavrommatis, Michael Merritt, Albert Meyer, Sayan Mitra, Calvin Newport, Tina Nolte, Michael Paterson, Boaz Patt-Shamir, Olivier Pereira, Gary Peterson, Shlomit Pinter, Anna Pogosyants, Stephen Ponzio, Sergio Rajsbaum, David Ratajczak, Isaac Saias, Russel Schaffer, Roberto Segala, Nir Shavit, Liuba Shrira, Alex Shvartsman, Mark Smith, Jorgen Sogaaard-Andersen, Ekrem Soylemez, John Spinelli, Eugene Stark, Larry Stockmeyer, Joshua Tauber, Mark Tuttle, Shinya Umeno, Frits Vaandrager, George Varghese, Da-Wei Wang, William Weihl, H.P.Weinberg, Jennifer Welch, Lenore Zuck,……and others I have forgotten to list.

Page 34

Thank you!