Fault Tolerance

Fault tolerance terminology
“dependability” - extent to which reliance can justifiably be placed on a service. General concept.
“reliability” - continuity of service
  metric: mean time between failures (MTBF)
“availability” - readiness for usage
“safety” - avoidance of catastrophic effects on environment
“security” - resistance to unauthorized access.

Faults, errors, failures
“fault” - component malfunction
“error” - system state is wrong
“failure” - system departs from specification
fault → error → failure

System
[Figure: a System made up of components, inside its Environment; labels: fault, failure.]

Coping with faults
Reduce/eliminate faults in components.
Fault tolerance: prevent faults from becoming failures, usually through redundancy.

Types of faults (fault models)
Fault tolerance algorithms depend on the fault model.
“Crash fault” or “stop fault” - faulty component stops responding. No incorrect state changes in component.
“Timing fault” - response is too early or late.
“Byzantine fault” - arbitrary behavior. Can be considered adversarial (imagine worst case).

The agreement problem
Processors may fail
… so, use multiple processors
… but then, processors may disagree, causing failures.
Need a principled approach to distributed agreement

Example: AFTI 16 (from J. Rushby)
“Advanced Fighter Technology Integration” F16
Triple-redundant digital flight-control system (DFCS) with analog backup
DFCS design was “asynchronous”
  processors ran independently
  sample sensor, evaluate control law, send command to actuator
  actuator averages or selects from commands
General Dynamics felt synchronization would introduce a single point of failure.

AFTI 16 problems
Processors can get widely varying sensor readings because of timing differences
Reconfiguration can cause sudden changes in control (“thumps”).
  Need to allow a wide range of “plausible values” before declaring a processor “bad”
  Bad sensor reading drags average down
  Sensor finally crosses threshold and is called “bad”
  Average suddenly snaps back when sensor is excluded.
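The "thump" can be illustrated with a small numeric sketch. The exclusion rule (distance from the median), the threshold and the readings below are all assumptions chosen for illustration, not the AFTI 16 logic.

```python
def voted_average(readings, threshold):
    """Average the readings, excluding any that differ from the median
    by more than `threshold` (a hypothetical plausibility rule)."""
    ordered = sorted(readings)
    median = ordered[len(ordered) // 2]
    accepted = [r for r in readings if abs(r - median) <= threshold]
    return sum(accepted) / len(accepted)

threshold = 3.0
# Two healthy sensors read 10.0; the third drifts downward step by step.
for bad in (10.0, 8.5, 7.5, 6.5):
    print(bad, round(voted_average([10.0, 10.0, bad], threshold), 2))
```

While the drifting reading is still "plausible" it drags the average down; once it crosses the threshold it is excluded and the output jumps back in a single step.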

AFTI 16 problems (cont)
Processor states can diverge rapidly
  especially when different processors go into different control modes.
Design complexity
  70% of application code was for redundancy management
  Control laws had to be modified to ramp changes in and out smoothly

AFTI 16 flight test, Flight 36
“Departure” from control laws for 3 seconds
Acceleration exceeded -4g, then +7g
Angle of attack went to -10 degrees, then +20 degrees
Aircraft rolled 360 degrees
Cause: side air probe cut out at high angle of attack
Analysis showed this would cause complete failure of DFCS for several areas of flight envelope

AFTI 16 flight 44
Each channel declared the others failed
  asynchronous operation, timing skew, sensor noise
Analog backup not selected
  simultaneous failure of two channels not anticipated
Aircraft flown home on a single digital channel (not designed for this)
There were no hardware failures.

AFTI 16 Analysis (NASA)
Nearly all failure indications were design oversights related to asynchronous operation
Failures due to lack of understanding of interactions among
  air data system
  redundancy management software
  flight control laws (decision points, thumps, ramp-in/out)
Moral of the story: Reliability through redundancy is a lot harder than it looks.

Distributed consensus
Goal: multiple processors agree on something in the presence of various kinds of faults and errors
Intellectually difficult
  Algorithms are tricky
  Proofs are subtle
  Sensitive to assumptions
    Synchronous vs. asynchronous
    Communication mechanism
    Fault models
Many papers written

Synchronous vs. asynchronous
Synchronous: Processors run in lock-step
  Hard to implement - model may be unrealistic
  Requires clock synchronization.
  Consensus is easier
Asynchronous: Processors run at arbitrary speed
  Easier to implement - model is conservative
  In most models, consensus problem is provably unsolvable.

Synchronous vs. asynchronous
Semi-synchronous
  Bounds on how far out-of-sync processors can get
  Model is fairly realistic
  Consensus is almost as easy as synchronous

Fault models
Goal: Make claims such as: “the system will continue to function if any single processor stops.”
More conservative fault models:
  Fault tolerance is harder
  But, if successful, stronger claims can be made
  Fewer assumptions = simpler FMEA, easier “certification”
A lot of models have been proposed.

Process fault models
“Stopping fault” - process stops sending messages
  does not restart
  does not send wrong messages
  liberal (easy) model
“Byzantine fault” - process behaves arbitrarily
  Name comes from cute “Byzantine generals” metaphor
  May send arbitrary messages, enter arbitrary states
  Equivalent to “evil” behavior, for our purposes

Synchronous agreement with stopping faults
Multiple processes want to “agree” on a value
Applications
  sensor readings among redundant processors
  decide what time it is
  decide which of a group of processors are broken and should be removed from system.

Synchronous agreement - properties
Each process starts with an initial value; processes end with a decision value.
Agreement: all good processes decide on the same value.
Validity: if all processes start with the same value, that value is the final decision value.
Termination: all good processes eventually decide.
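Agreement and validity can be checked mechanically on the outcome of a finished run. A minimal sketch follows; the function names and the dict-based representation are assumptions made for illustration. Termination concerns whether a run finishes at all, so it is not checked here.

```python
def agreement_holds(decisions):
    """decisions: dict mapping each good process to its decision value."""
    return len(set(decisions.values())) <= 1

def validity_holds(initial_values, decisions):
    """If every process starts with the same value, every good process
    must decide that value; otherwise validity imposes no constraint."""
    starts = set(initial_values.values())
    if len(starts) != 1:
        return True
    (v,) = starts
    return all(d == v for d in decisions.values())

print(agreement_holds({"P1": 0, "P2": 0}))
print(validity_holds({"P1": "A", "P2": "A"}, {"P1": "A", "P2": "A"}))
```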

Flood set algorithm
Assumption: There is a dedicated link between each pair of processes
No more than f processes can stop
Each process has an initial value v
Each process accumulates a set W of all the values it has ever seen.
  On each round, every process sends its W set to every other process
  Every process sets W to the union of the old value and all the new values coming in from others.

Flood set
After f+1 rounds, every process looks at W.
  If W has only one value, choose that value.
  Else, choose 0 (a predetermined default).
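The flood set steps can be sketched as a small round-by-round simulation. It runs f+1 rounds, which is what the correctness argument relies on. The `crashes` parameter is a hypothetical way to describe a stopping pattern: for each crashing process, the round in which it stops and the receivers it still reaches in that round.

```python
def flood_set(initial, f, crashes=None, default=0):
    """Simulate flood set for f+1 synchronous rounds.

    initial: dict process -> initial value
    crashes: dict process -> (round, receivers reached before stopping)
    Returns the decisions of the processes that never stopped.
    """
    crashes = crashes or {}
    W = {p: {v} for p, v in initial.items()}
    alive = set(initial)
    for rnd in range(1, f + 2):
        inbox = {p: [] for p in initial}
        for sender in sorted(alive):
            if sender in crashes and crashes[sender][0] == rnd:
                receivers = crashes[sender][1]       # partial send, then stop
            else:
                receivers = set(initial) - {sender}  # send W to everyone else
            for dest in receivers:
                inbox[dest].append(set(W[sender]))
        alive -= {p for p, (r, _) in crashes.items() if r == rnd}
        for p in alive:
            for w in inbox[p]:
                W[p] |= w                            # union old W with new values
    return {p: (next(iter(W[p])) if len(W[p]) == 1 else default) for p in alive}

# Three processes, f = 1; P3 stops in round 1 after reaching only P1.
result = flood_set({"P1": "A", "P2": "A", "P3": "B"}, f=1,
                   crashes={"P3": (1, {"P1"})})
print(result)
```

In this run P1 learns B directly from P3, and P2 learns it from P1 in the second round, so both surviving processes see two values and fall back to the default.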

Flood set correctness
In f+1 rounds, there must be at least one round in which no processes stop
  At most f processes can stop, and processes cannot stop more than once.
If no process stops in round r, W will be the same in all good processes in subsequent rounds.
  All good processes successfully send all values in W to all other good processes, so all processes will have same W after the round.
After this, nothing can get added to any W sets, so it doesn’t matter whether more processes stop.

Flood set correctness
So, after f+1 rounds, all non-stopped processes have the same W sets
  If W has only one value, all processes pick this value.
  Else all processes pick the default, 0.

Flood set example
3 processes, 1 fault, default value = 0

                P1      P2      P3
V0              A       A       B
W in round 0    {A}     {A}     {B}
W in round 1    {A,B}   {A}     -       P3 dies after sending W to P1 but not P2
W in round 2    {A,B}   {A,B}   -       W sets for P1, P2 are same
final           0       0       -       Choose default because |W|>1

Flood set efficiency
O((f+1) n^2) messages
  f+1 rounds
  n processes send n messages per round
O((f+1) n^3) values are sent (each message may have a set of up to n values)
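As a worked instance of these bounds, the counts can be computed directly. The sketch assumes each process sends to the n-1 other processes (the slide rounds this to n) and that no process stops early, so it gives an upper bound.

```python
def flood_set_cost(n, f):
    """Upper bounds on messages and values sent by flood set."""
    rounds = f + 1
    messages = rounds * n * (n - 1)   # every process sends W to the n-1 others
    max_values = messages * n         # each W holds at most n distinct values
    return messages, max_values

print(flood_set_cost(3, 1))
print(flood_set_cost(10, 3))
```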

Optimized flood set
Note: If W has more than one element, process doesn’t need to know what is in it.
Idea: Every process sends only first two distinct values.
  Every process sends its initial value on first round
  If process receives a different value, it sends it out on next round
Correctness proof: run Flood and OptFlood in parallel
  same initial values, stopping pattern
  W sets have more than one value iff OptFlood process gets two values.
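A sketch of the optimized variant, under the same hypothetical stopping-pattern description (`crashes` maps a process to the round it stops in and the receivers it still reaches). Each process broadcasts at most twice: its initial value, and the first different value it hears. Values are assumed never to be `None`, which the sketch uses to mean "nothing to send".

```python
def opt_flood(initial, f, crashes=None, default=0):
    """Simulate OptFlood for f+1 rounds; returns (decisions, messages sent)."""
    crashes = crashes or {}
    seen = {p: [v] for p, v in initial.items()}   # at most two distinct values
    to_send = dict(initial)                       # value to broadcast, or None
    alive = set(initial)
    messages = 0
    for rnd in range(1, f + 2):
        inbox = {p: [] for p in initial}
        for sender in sorted(alive):
            if to_send[sender] is None:
                continue
            if sender in crashes and crashes[sender][0] == rnd:
                receivers = crashes[sender][1]
            else:
                receivers = set(initial) - {sender}
            for dest in receivers:
                inbox[dest].append(to_send[sender])
                messages += 1
        alive -= {p for p, (r, _) in crashes.items() if r == rnd}
        for p in alive:
            to_send[p] = None
            for v in inbox[p]:
                if v not in seen[p] and len(seen[p]) < 2:
                    seen[p].append(v)
                    to_send[p] = v                # forward the second value once
    decisions = {p: (seen[p][0] if len(seen[p]) == 1 else default) for p in alive}
    return decisions, messages

decisions, messages = opt_flood({"P1": "A", "P2": "A", "P3": "B"}, f=1,
                                crashes={"P3": (1, {"P1"})})
print(decisions, messages)
```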

OptFlood efficiency
2 n^2 messages
  n processes send at most two messages to n other processes.
O(n^2) values are sent

Byzantine agreement
Goal: non-faulty processes should agree on a value.
  e.g., message received
  e.g., sensor value
Faults may cause arbitrary behavior
  arbitrary values communicated
  different values communicated to different receivers
Advantage: reduces fault analysis
Disadvantage: hard or impossible to do.

Byzantine agreement properties
Agreement: All good processes agree on a value
Validity: If the source of the value was non-faulty, the agreed-upon value is the value the source sent.

Asynchronous agreement
Asynchronous model:
  Message transmission takes arbitrary time.
  Processes run at arbitrary speeds.
Theorem: There is no algorithm that reaches agreement in an asynchronous model with even one Byzantine failure
  Fine print: Details of conditions, communication
This is one of the most important results about distributed systems.

Synchronous agreement
Synchronous model: Processes can communicate in a sequence of rounds. All processes complete a round before next round begins.
The agreement problem is solvable in this model.
Theorem: Tolerating k Byzantine faults requires > 3k processes.
So “Triple modular redundancy” can’t handle Byzantine faults.
Practical case: 1 Byzantine fault, 4 processes.
Assumes full connectivity (connections between each pair of processors).

Synchronous agreement with one fault
Single transmitter communicates value to all processes.
Round 0: Transmitter sends value to n-1 receivers.
  Values are sent correctly if transmitter is not faulty.
Round 1: Each receiver sends value to n-2 other receivers.
  Receivers record all values separately.
  Intuition: receivers compare notes on what transmitter told them.
Each receiver chooses the majority value of all values it received.
  If no majority, use pre-arranged default value.
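The two rounds can be sketched as follows. The transmitter is modelled only by what each receiver got in round 0; the `relay` parameter is a hypothetical way to let a faulty receiver forward something other than what it received.

```python
from collections import Counter

def majority(values, default=0):
    """Strict majority of the values, or the default if there is none."""
    value, count = Counter(values).most_common(1)[0]
    return value if count > len(values) / 2 else default

def one_fault_agreement(round0, relay=None, default=0):
    """round0: dict receiver -> value received from the transmitter.
    relay: dict (sender, receiver) -> value a faulty receiver forwards
           instead of its round 0 value (illustrative fault injection).
    """
    relay = relay or {}
    decisions = {}
    for p in round0:
        heard = [round0[p]]                       # what the transmitter told p
        for q in round0:
            if q != p:
                heard.append(relay.get((q, p), round0[q]))
        decisions[p] = majority(heard, default)
    return decisions

# Faulty transmitter sends 1, 1, 2: receivers still agree (on 1).
print(one_fault_agreement({"P1": 1, "P2": 1, "P3": 2}))
# Faulty transmitter sends 1, 2, 3: no majority, everyone uses the default.
print(one_fault_agreement({"P1": 1, "P2": 2, "P3": 3}))
# Faulty receiver P1 relays made-up values; P2 and P3 still decide 1.
print(one_fault_agreement({"P1": 1, "P2": 1, "P3": 1},
                          relay={("P1", "P2"): 2, ("P1", "P3"): 3}))
```

The decision the sketch computes for a faulty receiver is not meaningful; only the decisions of the good receivers are constrained.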

Example 1 - faulty transmitter
[Figure: transmitter (Xmtr) and receivers (Rcvr) P1, P2, P3]
Round 0: faulty xmtr sends varying results to rcvrs.
  Round 0 values: P1 = 1, P2 = 1, P3 = 2
Round 1: rcvrs exchange values (reliably)
Finally, receivers take majority of all answers
  Consensus: P1 = 1, P2 = 1, P3 = 1

Example 2 - faulty transmitter
[Figure: transmitter (Xmtr) and receivers (Rcvr) P1, P2, P3]
Round 0: faulty xmtr sends varying results to rcvrs.
  Round 0 values: P1 = 1, P2 = 2, P3 = 3
Round 1: rcvrs exchange values (reliably)
There is no majority, so rcvrs use default
  Consensus: P1 = 0, P2 = 0, P3 = 0

Example 3 - faulty receiver
[Figure: transmitter (Xmtr) and receivers (Rcvr) P1, P2, P3]
Round 0: xmtr sends the same value to all rcvrs.
  Round 0 values: P1 = 1, P2 = 1, P3 = 1
Round 1: Process 1 sends bogus values
Majority computes correct values for processes 2, 3
  Consensus: P2 = 1, P3 = 1
Process 1 is broken, so its result is not required to be correct

General case
Previous algorithm can be generalized to handle more Byzantine faults.
General results: k faults require k+1 rounds (counting the initial transmission as a round), 3k+1 processors
Number of messages grows exponentially with number of rounds
Intuition: “Pn said that Pn-1 said that ... P1 said that P0 said that the value was x”
  There are exponentially many chains Pn ... P0.
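The growth can be made concrete by counting messages per round. The sketch assumes the pattern of the one-fault algorithm simply continues: the transmitter sends to n-1 receivers in round 0, and in each later round every chain is extended to each process not already on it. That continuation is an assumption, not something stated above.

```python
from math import prod

def messages_in_round(n, r):
    """Messages sent in round r (round 0 is the transmitter's broadcast),
    assuming each relay chain is extended to every process not yet on it."""
    return prod(n - 1 - i for i in range(r + 1))

for r in range(3):
    print(r, messages_in_round(7, r))
```

For n = 4 this gives 3 messages in round 0 and 6 in round 1, matching the n-1 and (n-1)(n-2) counts of the one-fault algorithm.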

Hybrid Byzantine agreement
Idea: Free bonus reliability with the purchase of Byzantine agreement.
Handles Byzantine faults, plus some simpler faults
  Symmetric fault: process sends same wrong value to everyone.
  Nonmalicious fault: process sends a recognizable error value.
Advantages:
  If processors have these faults, we can tolerate more faulty processors
  These faults are more probable than true Byzantine faults - so this increases reliability

Hybrid Byzantine agreement
Modify previous algorithm by adding special error value “E”.
  Nonmalicious faults send E value (other faults may send E, also).
  Majority algorithm first removes E values.
Theorem: Algorithm reaches agreement if n > 2a + 2s + b + r
  a = Byzantine, s = symmetric, b = nonmalicious, r = number of rounds (excluding first transmission).
  Previous case: a=1, s=0, b=0, r=1, so n > 3
  With 6 processors, can deal with 1 Byzantine + 2 nonmalicious faults,
  or 1 Byzantine and 1 symmetric
  ... but just 1 Byzantine in previous algorithm
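The bound n > 2a + 2s + b + r and the E-filtering majority can be written down directly. The function names and the string "E" as the error marker are illustrative choices.

```python
from collections import Counter

ERROR = "E"  # recognizable error value sent by nonmalicious faults

def tolerates(n, a, s, b, r):
    """True if n processes satisfy n > 2a + 2s + b + r
    (a Byzantine, s symmetric, b nonmalicious faults, r rounds)."""
    return n > 2 * a + 2 * s + b + r

def hybrid_majority(values, default=0):
    """Drop E values first, then take a strict majority of what remains."""
    kept = [v for v in values if v != ERROR]
    if not kept:
        return default
    value, count = Counter(kept).most_common(1)[0]
    return value if count > len(kept) / 2 else default

print(tolerates(4, a=1, s=0, b=0, r=1))   # the previous one-fault case
print(tolerates(6, a=1, s=0, b=2, r=1))   # 1 Byzantine + 2 nonmalicious
print(tolerates(6, a=1, s=1, b=0, r=1))   # 1 Byzantine + 1 symmetric
print(hybrid_majority([1, ERROR, 1, 2]))
```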

Variations
Synchronous communication is difficult
  Compromise between synchronous and asynchronous: real-time constraints.
“Authentication” - agreement can be made less costly by using digital signatures
  transmitter digitally signs messages
  processes can’t lie about who said what.
  can handle any number of faults (in synchronous model).
May assume different network connectivity
  Some links in network missing

Summary
Fault tolerance is tricky. Redundancy does not necessarily buy reliability.
Byzantine models can account for unforeseen fault types.
Byzantine agreement is impossible in some models.
There exist practical algorithms for Byzantine agreement if synchronous communication is available.
There are deep theoretical results in this area.