Fault Tolerance. 2 Fault tolerance terminology “dependability” - extent to which reliance can justifiably be placed on service. General concept “reliability”

Fault Tolerance

Fault tolerance terminology

“dependability” - extent to which reliance can justifiably be placed on service. This is the general concept.

“reliability” - continuity of service
    metric: mean time between failures (MTBF)

“availability” - readiness for usage

“safety” - avoidance of catastrophic effects on environment

“security” - resistance to unauthorized access.

Faults, errors, failures

“fault” - component malfunction

“error” - system state is wrong

“failure” - system departs from specification

fault → error → failure

System

[Figure: System within its Environment; labels: components, fault, failure.]

Coping with faults

Reduce/eliminate faults in components.

Fault tolerance: prevent faults from becoming failures, usually through redundancy.

Types of faults (fault models)

Fault tolerance algorithms depend on the fault model.

“Crash fault” or “stop fault” - faulty component stops responding. No incorrect state changes in component.

“Timing fault” - response is too early or late.

“Byzantine fault” - arbitrary behavior. Can be considered adversarial (imagine worst case).

The agreement problem

Processors may fail

… so, use multiple processors

… but then, processors may disagree, causing failures.

Need a principled approach to distributed agreement

Example: AFTI 16 (from J. Rushby)

“Advanced Fighter Technology Integration” F16

Triple-redundant digital flight-control system (DFCS) with analog backup

DFCS design was “asynchronous”
    processors ran independently
    sample sensor, evaluate control law, send command to actuator
    actuator averages or selects from commands

General Dynamics felt synchronization would introduce a single point of failure.

AFTI 16 problems

Processors can get widely varying sensor readings because of timing differences

Reconfiguration can cause sudden changes in control (“thumps”).
    Need to allow wide range of “plausible values” before declaring a processor “bad”
    Bad sensor reading drags average down
    Sensor finally crosses threshold and is called “bad”
    Average suddenly snaps back when sensor is excluded.
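The “thump” can be illustrated with a small numerical sketch. Everything here is hypothetical: the readings, the plausibility threshold, and the `voted_average` helper are invented for illustration and are not taken from the AFTI design.

```python
def voted_average(readings, threshold):
    """Average the readings, excluding any that differ from the
    median by more than `threshold` (a hypothetical plausibility band)."""
    ordered = sorted(readings)
    median = ordered[len(ordered) // 2]
    kept = [r for r in readings if abs(r - median) <= threshold]
    return sum(kept) / len(kept)


# Hypothetical scenario: two healthy channels read 10.0 while a third drifts down.
THRESHOLD = 4.0
for bad in (10.0, 8.0, 6.5, 5.0):
    print(bad, round(voted_average([10.0, 10.0, bad], THRESHOLD), 2))
```

With these made-up numbers the average sags while the drifting reading is still considered plausible, then jumps straight back to 10.0 once the reading crosses the threshold and is excluded. A wider threshold delays the exclusion and makes the eventual jump larger.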

AFTI 16 problems (cont)

Processor states can diverge rapidly
    especially when different processors go into different control modes.

Design complexity
    70% of application code was for redundancy management
    Control laws had to be modified to ramp changes in and out smoothly

AFTI 16 flight test, Flight 36

“Departure” from control laws for 3 seconds

acceleration exceeded -4g, then +7g

Angle of attack went to -10 degrees, then +20 degrees

Aircraft rolled 360 degrees

Cause: side air probe cut out at high angle of attack

Analysis showed this would cause complete failure of DFCS for several areas of flight envelope

AFTI 16 flight 44

Each channel declared the others failed
    asynchronous operation, timing skew, sensor noise

Analog backup not selected
    simultaneous failure of two channels not anticipated

Aircraft flown home on a single digital channel (not designed for this)

There were no hardware failures.

AFTI 16 Analysis (NASA)

Nearly all failure indications were design oversights related to asynchronous operation

Failures due to lack of understanding of interactions among
    air data system
    redundancy management software
    flight control laws (decision points, thumps, ramp-in/out)

Moral of the story: Reliability through redundancy is a lot harder than it looks.

Distributed consensus

Goal: multiple processors agree on something in the presence of various kinds of faults and errors

Intellectually difficult
    Algorithms are tricky
    Proofs are subtle
    Sensitive to assumptions
        Synchronous vs. asynchronous
        Communication mechanism
        Fault models

Many papers written

Synchronous vs. asynchronous

Synchronous: Processors run in lock-step
    Hard to implement - model may be unrealistic
    Requires clock synchronization.
    Consensus is easier

Asynchronous: Processors run at arbitrary speed
    Easier to implement - model is conservative
    In most models, consensus problem is provably unsolvable.

Synchronous vs. asynchronous (cont)

Semi-synchronous
    Bounds on how far out-of-sync processors can get
    Model is fairly realistic
    Consensus is almost as easy as synchronous

Fault models

Goal: Make claims such as: “the system will continue to function if any single processor stops.”

More conservative fault models:
    Fault tolerance is harder
    But, if successful, stronger claims can be made
    Fewer assumptions = simpler FMEA, easier “certification”

A lot of models have been proposed.

Process fault models

“Stopping fault” - process stops sending messages
    does not restart
    does not send wrong messages
    liberal (easy) model

“Byzantine fault” - process behaves arbitrarily
    Name comes from cute “Byzantine generals” metaphor
    May send arbitrary messages, enter arbitrary states
    Equivalent to “evil” behavior, for our purposes

Synchronous agreement with stopping faults

Multiple processes want to “agree” on a value

Applications
    sensor readings among redundant processors
    decide what time it is
    decide which of a group of processors are broken and should be removed from system.

Synchronous agreement - properties

Each process starts with an initial value; processes end with a decision value.

Agreement: all good processes decide on the same value.

Validity: if all processes start with the same value, that value is the final decision value.

Termination: all good processes eventually decide.
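As a rough sketch, the agreement and validity properties can be written as checks over the outcome of a run. The function names and the dictionary representation (process name mapped to value) are assumptions made for this illustration only.

```python
def check_agreement(decisions):
    """Agreement: every good process decided on the same value.
    `decisions` maps each good process to its decision value."""
    return len(set(decisions.values())) <= 1


def check_validity(initial, decisions):
    """Validity: if all processes started with the same value,
    that value must be the decision of every good process."""
    starts = set(initial.values())
    if len(starts) != 1:
        return True  # the property places no constraint on this run
    (value,) = starts
    return all(d == value for d in decisions.values())


def check_termination(good_processes, decisions):
    """Termination (for a finished run): every good process has decided."""
    return all(p in decisions for p in good_processes)
```

Checks like these are useful as assertions around a simulation of an agreement algorithm; they do not prove anything about runs that were not simulated.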

Flood set algorithm

Assumption: There is a dedicated link between each pair of processes

No more than f processes can stop

Each process has an initial value v

Each process accumulates a set W of all the values it has ever seen.
    On each round, every process sends its W set to every other process
    Every process sets W to the union of the old value and all the new values coming in from others.

Flood set

After f+1 rounds, every process looks at W.
    If W has only one value, choose that value.
    Else, choose 0 (a predetermined default).

Flood set correctness

In f+1 rounds, there must be at least one round in which no processes stop
    At most f processes can stop, and processes cannot stop more than once.

If no process stops in round r, W will be the same in all good processes in subsequent rounds.
    All good processes successfully send all values in W to all other good processes, so all processes will have same W after the round.

After this, nothing can get added to any W sets, so it doesn’t matter whether more processes stop.

Flood set correctness (cont)

So, after f+1 rounds, all non-stopped processes have same W sets
    If W has only one value, all processes pick this value.
    Else all processes pick the default value 0.

Flood set example

3 processes, 1 fault, default value = 0

                P1       P2       P3
V0              A        A        B
W in round 0    {A}      {A}      {B}
W in round 1    {A,B}    {A}      -
W in round 2    {A,B}    {A,B}    -
final           0        0        -

P3 dies after sending W to P1 but not P2.

After round 2, the W sets for P1 and P2 are the same.

P1 and P2 choose the default because |W| > 1.
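The flood set algorithm and this example can be sketched as a small round-based simulation. This is a minimal sketch, not a reference implementation: the `flood_set` function and the `crashes` parameter, which says in which round a process stops and which recipients it still reaches in that round, are assumptions made for illustration. It runs f+1 rounds, as the correctness argument requires.

```python
def flood_set(initial, f, crashes=None, default=0):
    """Simulate flood set under stopping faults.

    initial: dict mapping process name -> initial value
    f:       maximum number of processes that may stop
    crashes: dict mapping process name -> (round, recipients reached
             in that round before stopping); hypothetical fault script
    """
    crashes = crashes or {}
    W = {p: {v} for p, v in initial.items()}
    alive = set(initial)
    for rnd in range(1, f + 2):            # rounds 1 .. f+1
        inbox = {p: set() for p in alive}
        stopping = set()
        for sender in alive:
            recipients = alive - {sender}
            if sender in crashes and crashes[sender][0] == rnd:
                # The sender stops mid-round: only some messages get out.
                recipients &= set(crashes[sender][1])
                stopping.add(sender)
            for r in recipients:
                inbox[r] |= W[sender]
        alive -= stopping
        for p in alive:                    # union of old W and everything received
            W[p] |= inbox[p]
    # Decide: the single value in W, or the default if W has several values.
    return {p: (next(iter(W[p])) if len(W[p]) == 1 else default) for p in alive}


result = flood_set({"P1": "A", "P2": "A", "P3": "B"}, f=1,
                   crashes={"P3": (1, ["P1"])})
print(result)
```

In this run P3 reaches only P1 before stopping, so P1 and P2 disagree after round 1 and converge on {A, B} after round 2; both then choose the default 0.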

Flood set efficiency

O((f+1) n^2) messages
    f+1 rounds
    n processes send n messages per round

O((f+1) n^3) values are sent (each message may have a set of up to n values)

Optimized flood set

Note: If W has more than one element, process doesn’t need to know what is in it.

Idea: Every process sends only first two distinct values.
    Every process sends its initial value on first round
    If process receives a different value, it sends it out on next round

Correctness proof: run Flood and OptFlood in parallel
    same initial values, stopping pattern
    W sets have more than one value iff OptFlood process gets two values.

OptFlood efficiency

2n^2 messages
    n processes send at most two messages to n other processes.

O(n^2) values are sent
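A minimal sketch of the optimized variant, with a message counter, might look as follows. The `opt_flood` function and its `crashes` parameter (round in which a process stops, plus the recipients it still reaches) are assumptions for illustration. Each process broadcasts its initial value once and, at most once more, the first different value it hears.

```python
def opt_flood(initial, f, crashes=None, default=0):
    """Simulate OptFlood; returns (decisions, number of messages sent)."""
    crashes = crashes or {}
    seen = {p: [v] for p, v in initial.items()}   # at most two distinct values
    outbox = dict(initial)                        # value each process sends this round
    alive = set(initial)
    messages = 0
    for rnd in range(1, f + 2):                   # rounds 1 .. f+1
        inbox = {p: [] for p in alive}
        stopping = set()
        for sender in sorted(alive):
            recipients = alive - {sender}
            if sender in crashes and crashes[sender][0] == rnd:
                recipients &= set(crashes[sender][1])
                stopping.add(sender)
            if sender in outbox:                  # silent if nothing new to report
                for r in recipients:
                    inbox[r].append(outbox[sender])
                    messages += 1
        alive -= stopping
        outbox = {}
        for p in alive:
            for v in inbox[p]:
                if len(seen[p]) < 2 and v not in seen[p]:
                    seen[p].append(v)
                    outbox[p] = v                 # forward the second value next round
    decisions = {p: (seen[p][0] if len(seen[p]) == 1 else default) for p in alive}
    return decisions, messages


decisions, messages = opt_flood({"P1": "A", "P2": "A", "P3": "B"}, f=1,
                                crashes={"P3": (1, ["P1"])})
print(decisions, messages)
```

On the three-process scenario where P3 stops after reaching only P1, this sketch reaches the same decisions as plain flood set while each process sends at most two broadcasts, which is where the 2n^2 bound comes from.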

Byzantine agreement

Goal: non-faulty processes should agree on a value.
    e.g., message received
    e.g., sensor value

Faults may cause arbitrary behavior
    arbitrary values communicated
    different values communicated to different receivers

Advantage: reduces fault analysis

Disadvantage: hard or impossible to do.

Byzantine agreement properties

Agreement: All good processes agree on a value

Validity: If source of value was non-faulty, agreed upon value is the same.

Asynchronous agreement

Asynchronous model:
    Message transmission takes arbitrary time.
    Processes run at arbitrary speeds.

Theorem: There is no algorithm that reaches agreement in an asynchronous model with even one Byzantine failure
    Fine print: Details of conditions, communication

This is one of the most important results about distributed systems.

Synchronous agreement

Synchronous model: Processes can communicate in a sequence of rounds. All processes complete a round before next round begins.

The agreement problem is solvable in this model.

Theorem: Tolerating k Byzantine faults requires > 3k processes.

So “Triple modular redundancy” can’t handle Byzantine faults.

Practical case: 1 Byzantine fault, 4 processes.

Assumes full connectivity (connections between each pair of processors).

Synchronous agreement with one fault

Single transmitter communicates value to all processes.

Round 0: Transmitter sends value to n-1 receivers.
    Values are sent correctly if transmitter is not faulty.

Round 1: Each receiver sends value to n-2 other receivers.
    Receivers record all values separately.
    Intuition: receivers compare notes on what transmitter told them.

Each receiver chooses the majority value of all values it received.
    If no majority, use pre-arranged default value.
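The two rounds can be sketched in a few lines. This is an illustrative sketch under stated assumptions: `one_fault_agreement`, `majority`, and the `relay` hook used to model a faulty receiver are hypothetical names, and the value 99 stands in for any bogus value.

```python
from collections import Counter


def majority(values, default=0):
    """Return the strict-majority value, or the default if there is none."""
    value, count = Counter(values).most_common(1)[0]
    return value if count > len(values) / 2 else default


def one_fault_agreement(sent, relay=None, default=0):
    """sent:  dict receiver -> value the transmitter sent it in round 0
    relay: optional function (sender, recipient, value) -> value actually
           relayed in round 1; used here to model a faulty receiver."""
    relay = relay or (lambda sender, recipient, value: value)
    decisions = {}
    for p in sent:
        # Own round 0 value plus what each other receiver claims it was told.
        heard = [sent[p]] + [relay(q, p, sent[q]) for q in sent if q != p]
        decisions[p] = majority(heard, default)
    return decisions


# Faulty transmitter sends 1, 1, 2: receivers still agree (on 1).
print(one_fault_agreement({"P1": 1, "P2": 1, "P3": 2}))
# Faulty transmitter sends 1, 2, 3: no majority, everyone uses the default.
print(one_fault_agreement({"P1": 1, "P2": 2, "P3": 3}))
# Faulty receiver P1 relays a bogus value: P2 and P3 still decide 1.
print(one_fault_agreement({"P1": 1, "P2": 1, "P3": 1},
                          relay=lambda s, r, v: 99 if s == "P1" else v))
```

In the third call the simulated decision for P1 is not meaningful, since a faulty process is not required to decide correctly; only the decisions of P2 and P3 matter.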

Example 1 - faulty transmitter

Round 0: faulty xmtr sends varying results to rcvrs.

Round 1: rcvrs exchange values (reliably).

Finally, receivers take majority of all answers.

                   Rcvr P1   Rcvr P2   Rcvr P3
round 0 values     1         1         2
value from P1      1         1         1
value from P2      1         1         1
value from P3      2         2         2
consensus          1         1         1

Example 2 - faulty transmitter

Round 0: faulty xmtr sends varying results to rcvrs.

Round 1: rcvrs exchange values (reliably).

There is no majority, so rcvrs use default.

                   Rcvr P1   Rcvr P2   Rcvr P3
round 0 values     1         2         3
value from P1      1         1         1
value from P2      2         2         2
value from P3      3         3         3
consensus          0         0         0

Example 3 - faulty receiver

Round 0: xmtr sends 1 to P1, P2 and P3.

Round 1: Process 1 sends bogus values; P2 and P3 relay 1.

Majority computes correct values for processes 2, 3.

Process 1 is broken, so its result is not required to be correct.

[Figure: matrix of values exchanged between Xmtr and Rcvr P1, P2, P3, showing the round 0 values and the consensus row.]

General case

Previous algorithm can be generalized to handle more Byzantine faults.

General results: k faults require k+1 rounds and at least 3k+1 processors

Number of messages grows exponentially with number of rounds
    Intuition: “pn said that pn-1 said that ... p1 said that p0 said that the value was x”
    There are exponentially many chains pn ... p0.
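To get a feel for the growth, the relay chains can be counted directly. This sketch assumes that p0 is the fixed transmitter and that no process appears twice in a chain; `relay_chains` is a hypothetical helper, and it counts only the chains completed in the final round, not every message sent.

```python
from math import perm


def relay_chains(n, rounds):
    """Number of chains p0 -> p1 -> ... -> p_rounds, where p0 is the fixed
    transmitter and the other processes in the chain are distinct, chosen
    from the remaining n-1 processes."""
    return perm(n - 1, rounds)


# k faults, using 3k+1 processors and k+1 rounds.
for k in (1, 2, 3):
    n, rounds = 3 * k + 1, k + 1
    print(k, n, rounds, relay_chains(n, rounds))
```

For k = 1 this gives the (n-1)(n-2) = 6 relayed values of the four-process algorithm; with each additional round the count is multiplied by roughly n.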

Hybrid Byzantine agreement

Idea: Free bonus reliability with the purchase of Byzantine agreement.

Handles Byzantine faults, plus some simpler faults

Symmetric fault: process sends same wrong value to everyone.

Nonmalicious fault: process sends a recognizable error value.

Advantages:
    If processors have these faults, we can tolerate more faulty processors
    These faults are more probable than true Byzantine faults - so this increases reliability

Hybrid Byzantine agreement (cont)

Modify previous algorithm by adding special error value “E”.
    Nonmalicious faults send E value (other faults may send E, also).
    Majority algorithm first removes E values.

Theorem: Algorithm reaches agreement if n > 2a + 2s + b + r
    a = Byzantine, s = symmetric, b = nonmalicious, r = number of rounds (excluding first transmission).
    Previous case: a=1, s=0, b=0, r=1, so n > 3
    With 6 processors, can deal with 1 Byzantine + 2 nonmalicious faults,
    or 1 Byzantine and 1 symmetric ... but just 1 Byzantine in previous algorithm
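The bound and the modified majority step can be checked with a short sketch. The names `tolerates` and `hybrid_majority`, and the use of the string "E" for the error value, are assumptions for illustration; the inequality itself is the one stated in the theorem.

```python
from collections import Counter

E = "E"  # the special, recognizable error value


def hybrid_majority(values, default=0):
    """Majority vote that first discards E values."""
    kept = [v for v in values if v != E]
    if not kept:
        return default
    value, count = Counter(kept).most_common(1)[0]
    return value if count > len(kept) / 2 else default


def tolerates(n, a, s, b, r):
    """Condition of the theorem: n > 2a + 2s + b + r, with
    a = Byzantine, s = symmetric, b = nonmalicious faults and
    r = number of rounds (excluding first transmission)."""
    return n > 2 * a + 2 * s + b + r


print(tolerates(4, a=1, s=0, b=0, r=1))   # previous case: n > 3
print(tolerates(6, a=1, s=0, b=2, r=1))   # 1 Byzantine + 2 nonmalicious
print(tolerates(6, a=1, s=1, b=0, r=1))   # 1 Byzantine + 1 symmetric
print(hybrid_majority([1, E, 1, 2]))      # E is dropped before voting
```

The function only evaluates the inequality as stated; it does not check that r rounds are otherwise sufficient for the chosen number of Byzantine faults.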

Variations

Synchronous communication is difficult
    Compromise between synchronous and asynchronous: real-time constraints.

“Authentication” - agreement can be made less costly by using digital signatures
    transmitter digitally signs messages
    processes can’t lie about who said what.
    can handle any number of faults (in synchronous model).

May assume different network connectivity
    Some links in network missing

Summary

Fault tolerance is tricky. Redundancy does not necessarily buy reliability.

Byzantine models can account for unforeseen fault types.

Byzantine agreement is impossible in some models.

There exist practical algorithms for Byzantine agreement if synchronous communication is available.

There are deep theoretical results in this area.
