Fault Tolerant Computing

Acknowledgements
• The following lectures are based on materials from the following sources:
– S. Kulkarni
– J. Rushby
– J. Knight

Objectives
• Exposure to the area of critical systems
• What it means to have a fault-tolerant system
• Specification techniques for representing critical properties
• How to design fault tolerance into a system

Reliability and Recovery
• Reliability – probability that a system operates without failure up to time t, given that it was operating properly at time 0
• Recovery – process of restoring consistency after a failure

Dependability
• Dependability:
– How much one may rely on the quality of the services delivered
– Quality of service depends on:
  • Correctness
  • Continuity of service

Terms
• Failure: malfunction
• Fault: condition that might lead to failure
• Error: an incorrect response that indicates a fault is present
• Faults may be:
  o permanent
  o intermittent
  o transient

Terms (cont’d)
• Graceful Degradation
– system is operational, but degraded, after faults
• Fail-safe
– system execution is safe after the fault
• Stabilizing
– system recovers to a consistent state after the fault
• Masking
– the user of the system does not see any unintended behavior due to faults

Terms (cont’d)
• Mean Time to Failure (MTTF) – expected value of system failure time
• Mean Time to Repair (MTTR) – expected value of system repair time
• Mean Time Between Failures (MTBF) – expected time between successive failures
– MTBF = MTTF + MTTR
• Fault Tolerance – ability to continue operation after occurrence of faults
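
The MTBF relation above can be checked with a small calculation. The sketch below is illustrative only: the function names and the figures (990 hours up, 10 hours to repair) are assumptions, and the steady-state availability ratio MTTF / (MTTF + MTTR) is a standard companion formula that does not appear on the slide.

```python
def mtbf(mttf_hours, mttr_hours):
    """MTBF = MTTF + MTTR, as defined on the slide."""
    return mttf_hours + mttr_hours


def availability(mttf_hours, mttr_hours):
    """Fraction of time the system is up (standard formula, not from the slide)."""
    return mttf_hours / mtbf(mttf_hours, mttr_hours)


# Hypothetical figures: 990 hours of operation, then 10 hours of repair.
example_mtbf = mtbf(990.0, 10.0)
example_availability = availability(990.0, 10.0)
print(example_mtbf, example_availability)
```

Shortening the repair time raises availability even when the time to failure stays the same, which is why repair and reconfiguration appear among the design decisions.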

Design Decisions
• Fault detection
• Fault confinement
• Fault diagnosis
• Repair and/or reconfigure
• Redundancy
– Hardware: extra hardware
– Information: redundancy bits
– Software: diagnosis software, extra software
– Temporal: re-execute software to recover from intermittent faults
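
As one possible illustration of information redundancy, the sketch below appends a single even-parity bit to a word. The choice of parity (rather than a stronger code) and the helper names are assumptions made for the example, not something the slides prescribe.

```python
def add_parity(bits):
    """Append one redundancy bit so that the number of 1s in the word is even."""
    return bits + [sum(bits) % 2]


def parity_ok(word):
    """Fault detection: a word with an odd number of 1s has been corrupted."""
    return sum(word) % 2 == 0


def flip(word, index):
    """Model a bit-flip fault at the given position (hypothetical helper)."""
    return word[:index] + [1 - word[index]] + word[index + 1:]


word = add_parity([1, 0, 1, 1])          # data bits plus parity bit
one_fault = flip(word, 2)                # detected by parity_ok
two_faults = flip(flip(word, 0), 1)      # not detected: parity is even again
print(parity_ok(word), parity_ok(one_fault), parity_ok(two_faults))
```

A single parity bit only supports detection of an odd number of flipped bits; confinement, diagnosis, and repair need additional mechanisms.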

Safety vs Reliability
• Reliability: concerns occurrence of failures
– System failures defined in terms of system services
• Safety: concerns occurrence of accidents
– Unplanned events that result in death, injury, illness, damage, loss of property, or environmental harm
– Defined in terms of external consequences

Types of Faults
• Omission failure – server omits to respond to an input (fail-silent failure)
• Timing failure – response is functionally correct, but untimely
– can be an early timing failure or a late timing failure (performance failure)
• Response failure – incorrect response
– output value incorrect (value failure)
– state transition incorrect (state transition failure)

Types of Faults (cont’d)
• Crash failure – after a first omission, a server omits to produce output until it restarts
• Amnesia crash – server restarts in a predefined initial state that does not depend on the inputs seen before the crash
• Partial amnesia crash – some part of the state is the same as before the crash; the rest is in a predefined initial state
• Pause crash – server restarts in the state it had before the crash
• Halting crash – crashed server never restarts
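
The four crash behaviors differ only in the state a server has after it restarts. A minimal sketch of that difference follows; the `Crash` enum, the `state_after_restart` function, and the `preserved` parameter are hypothetical names introduced for illustration.

```python
from enum import Enum


class Crash(Enum):
    AMNESIA = "amnesia"
    PARTIAL_AMNESIA = "partial amnesia"
    PAUSE = "pause"
    HALTING = "halting"


def state_after_restart(kind, before, initial, preserved=()):
    """Return the server state after restart, or None if it never restarts."""
    if kind is Crash.HALTING:
        return None                      # crashed server never restarts
    if kind is Crash.PAUSE:
        return dict(before)              # same state as before the crash
    if kind is Crash.AMNESIA:
        return dict(initial)             # predefined initial state
    # Partial amnesia: the preserved part survives, the rest is reset.
    after = dict(initial)
    for key in preserved:
        after[key] = before[key]
    return after


before = {"committed": 7, "in_flight": 3}
initial = {"committed": 0, "in_flight": 0}
print(state_after_restart(Crash.PARTIAL_AMNESIA, before, initial, ("committed",)))
```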

Examples
• OS crash followed by a reboot in the initial state
• Database server crash followed by recovery of a database state that reflects all transactions before the crash
• Communication server that occasionally loses messages but does not delay messages (omission failure)
• Excessive message transmission or message processing delay (communication performance failure)
• Alteration of a message due to random noise during transmission (response failure)

Hierarchical Failure Masking
• A failure of a certain type at a lower level can propagate as a different kind of failure at a higher level of abstraction.
• A value error at the physical layer (e.g., 2 bits corrupted) propagates as an omission error at the data link layer.
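
The physical-layer example can be sketched as follows, assuming the data link layer protects each frame with a CRC-32 checksum and silently drops frames that fail the check. The frame layout and the function names are assumptions made for this illustration; `zlib.crc32` is the Python standard-library routine.

```python
import zlib


def make_frame(payload):
    """Data link sender: append a 4-byte CRC-32 to the payload."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")


def receive_frame(frame):
    """Data link receiver: deliver the payload, or drop the frame (return None)."""
    payload, checksum = frame[:-4], frame[-4:]
    if zlib.crc32(payload).to_bytes(4, "big") != checksum:
        return None                      # value error below becomes an omission here
    return payload


frame = make_frame(b"hello")
# Physical-layer value error: two bits of the first byte are corrupted.
corrupted = bytes([frame[0] ^ 0b00000011]) + frame[1:]

print(receive_frame(frame))
print(receive_frame(corrupted))
```

The layer above never sees a wrong value; it only sees that a message is missing, which it can handle with retransmission.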

Group Failure Masking
• To ensure a service remains available to clients despite server failure,
– one can implement a group of redundant, physically independent servers.
• The group masks the failure of a member.
• Hierarchical masking requires
– users to implement resource failure-masking attempts as exception handling code.
• In group masking,
– individual member failures are entirely hidden from users by group management mechanisms.

Group Failure Masking (cont’d)
• Group output is a function of the outputs of individual group members:
– fastest member
– distinguished member
– result of majority vote
• A server group able to mask any k concurrent member failures will be termed k-fault tolerant
– e.g., a primary/standby group of k servers with members ranked as primary, 1st backup, 2nd backup, ..., can mask k-1 failures.
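
Two of the group output functions above can be sketched in a few lines. In this illustration a member that has crashed or omitted its response is represented by `None`; that encoding and the function names are assumptions, not part of the slides.

```python
from collections import Counter


def majority_vote(outputs):
    """Return the value produced by a strict majority of all members, else None."""
    answers = [o for o in outputs if o is not None]
    if not answers:
        return None
    value, count = Counter(answers).most_common(1)[0]
    return value if count > len(outputs) / 2 else None


def primary_standby(outputs):
    """Return the output of the highest-ranked member that responded."""
    for output in outputs:               # ordered: primary, 1st backup, 2nd backup, ...
        if output is not None:
            return output
    return None


print(majority_vote([42, 42, 7]))        # one faulty value is outvoted
print(primary_standby([None, None, 5]))  # 3 servers mask 2 crash failures
```

Majority voting can mask wrong values but needs more members per masked failure; primary/standby only masks crash or omission failures, since a wrong answer from the primary would be passed on unchanged.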

Some Formalism

Programs
• A program consists of:
– a finite set of variables
– a finite set of actions of the form
    guard => statement
  where
  • guard is a boolean expression over program variables, and
  • statement updates program variables
• Modifications
– guards may contain receives from channels
– statements may contain sends/receives

Computation
• A program computation is a "fair" sequence of steps, where in each step an action whose guard is true has its statement executed
– In one step, multiple guards may be true.
– If the guard of some action is true continuously, then that action is eventually chosen for execution.
Notes
• A program computation is a sequence of states
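
A program in this guarded-action model, and one of its computations, can be simulated directly. The sketch below assumes a round-robin scheduler, which is one simple way to obtain fairness (every action is considered once per round); the two example actions over x and y are invented for the illustration.

```python
def run(state, actions, max_steps=100):
    """Execute enabled actions round-robin and return the sequence of states."""
    trace = [dict(state)]
    idle = 0                             # consecutive actions found disabled
    i = 0
    while idle < len(actions) and len(trace) <= max_steps:
        guard, statement = actions[i % len(actions)]
        if guard(state):
            statement(state)             # one step: guard true, statement executed
            trace.append(dict(state))
            idle = 0
        else:
            idle += 1
        i += 1
    return trace                         # stops when no guard is true


# Each action is a (guard, statement) pair over the program variables x and y.
actions = [
    (lambda s: s["x"] < 3, lambda s: s.update(x=s["x"] + 1)),
    (lambda s: s["y"] < s["x"], lambda s: s.update(y=s["y"] + 1)),
]

trace = run({"x": 0, "y": 0}, actions)
print(trace[-1])
```

The returned `trace` is the computation viewed as a sequence of states, which is the form that specifications are stated over.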

Specification
• A specification is a set of sequences of states.
• What does it mean for a program p to satisfy a specification sp from a set of states S?
– Every computation of p that starts from a state in S is in sp.

Examples of specifications
• Let S be a predicate.
– Invariant
  • Invariant(S) = {seq : S is true in each state of seq}
  • A sequence seq is in Invariant(S) iff S is true in each state of seq.
– Closure
  • Closed(S) = {seq : (Ai : i >= 0 : ‘S is true in the ith state of seq’ => ‘S is true in the (i+1)th state of seq’)}
  • If S ever becomes true, it continues to be true.
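
Both definitions can be checked mechanically on a finite sequence of states. The sketch below treats a predicate as a Python function from a state to a boolean; the function names and the integer-valued example states are assumptions made for illustration.

```python
def in_invariant(seq, S):
    """seq is in Invariant(S): S is true in each state of seq."""
    return all(S(state) for state in seq)


def in_closed(seq, S):
    """seq is in Closed(S): if S holds in state i, it holds in state i+1."""
    return all(S(nxt) for cur, nxt in zip(seq, seq[1:]) if S(cur))


def at_least_two(x):
    return x >= 2


print(in_invariant([0, 1, 2, 3], at_least_two))  # S is false in the first states
print(in_closed([0, 1, 2, 3], at_least_two))     # once true, S stays true
print(in_closed([0, 2, 1], at_least_two))        # S becomes true, then false
```

Closure is weaker than invariance: a sequence in Closed(S) may start outside S, but one in Invariant(S) may not.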

Examples of specifications (cont’d)
• Let R and S be predicates.
– leads-to
  • R leads-to S =
    {seq : (Ai : i >= 0 :
        ‘R is true in the ith state of seq’
        => (Ek : k >= i : ‘S is true in the kth state of seq’))}
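
The leads-to definition can be checked on a finite sequence in the same style. This is only a sketch: on a finite prefix of a longer computation a False result is inconclusive, because S might still hold later, so the check assumes the sequence is complete. The function name and the string-valued states are illustrative assumptions.

```python
def in_leads_to(seq, R, S):
    """For every i where R holds, S holds at some k >= i."""
    return all(
        any(S(later) for later in seq[i:])
        for i, state in enumerate(seq)
        if R(state)
    )


def requesting(state):
    return state == "req"


def in_critical_section(state):
    return state == "cs"


print(in_leads_to(["idle", "req", "req", "cs", "idle"], requesting, in_critical_section))
print(in_leads_to(["idle", "req", "idle"], requesting, in_critical_section))
```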

Examples of specifications (cont’d)
• Mutual Exclusion
– invariant( (j <> k) => ~(cs.j /\ cs.k) )
– (Aj :: (req.j leads-to cs.j))
• Leader Election
– invariant( (j <> k) => ~(leader.j /\ leader.k) )
– true leads-to (Ej :: leader.j)
• Load Balancing
– true leads-to (Aj,k :: |load.j - load.k| <= bound)

Safety and Liveness
• Safety specification
– A sequence "does nothing bad"
– No sequence in the specification has a bad prefix
• Let sp be a specification.
– sp is a safety specification iff
  (A s :: s ~element_of sp => (E a : a is a prefix of s : (A b :: ab ~element_of sp)))

Liveness Specification
• Liveness specification
– A sequence "does something good"
– Every finite prefix has a good extension
• Let sp be a specification.
– sp is a liveness specification iff
  (A a :: (E b :: ab element_of sp))

Faults
• A fault is an action that can change the program state
• All faults (be they crash, fail-stop, omission, corruption, timing, Byzantine, intruders, or ...) can thus be viewed as perturbations on the system

Faults (cont’d)
• A program computation in the presence of faults is a sequence of steps where
– in each step either a program action executes or a fault action executes
– the program actions are fairly executed
– the fault occurrences are finite

Representation of Faults
• Communication faults
– Let c denote the sequence of messages on a channel.
– Let m1 and m2 be messages, and let seqm be a sequence of messages.
• Message Loss
– c = <seqm, m1> => c = <seqm>
• Message Duplication
– c = <seqm, m1> => c = <seqm, m1, m1>
• Message Reorder
– c = <seqm, m1, m2> => c = <seqm, m2, m1>
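
The three communication fault actions translate directly into functions on a list that models the channel c. In this sketch each function returns the channel unchanged when its guard is false (the channel is too short to match the pattern); the function names are assumptions.

```python
def lose(c):
    """Message loss: <seqm, m1> becomes <seqm>."""
    return c[:-1] if len(c) >= 1 else c


def duplicate(c):
    """Message duplication: <seqm, m1> becomes <seqm, m1, m1>."""
    return c + [c[-1]] if len(c) >= 1 else c


def reorder(c):
    """Message reorder: <seqm, m1, m2> becomes <seqm, m2, m1>."""
    return c[:-2] + [c[-1], c[-2]] if len(c) >= 2 else c


channel = ["a", "b", "c"]
print(lose(channel), duplicate(channel), reorder(channel))
```

Because each fault is just another action on program state, a fault-affected computation can be produced by interleaving these functions with the program's own actions.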

Representation of Faults (cont’d)
• Amnesia/Transient faults
– Let c denote all the variables of a process.
– true => c = ??

Representation of Permanent Faults
• Fail-stop fault:
– Upon fail-stop, a process does nothing;
– it does not execute any action and
– it does not send any messages.
• Introduce an auxiliary variable up.j at process j
• Add up.j to the guard of each action of j
– If processes can detect failure of other processes, then they can do so using variable up.
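
The fail-stop construction can be sketched as a guard transformation. The slide does not spell out the fault action itself; the sketch assumes it simply sets up to false. The helper names and the counter process are likewise illustrative.

```python
def guard_with_up(guard):
    """Strengthen a guard with the auxiliary variable up."""
    return lambda s: s["up"] and guard(s)


def fail_stop(s):
    """Fault action (assumed form): if the process is up, set up to false."""
    if s["up"]:
        s["up"] = False


def step(s, action):
    """Execute the action if its guard is true; report whether it executed."""
    guard, statement = action
    if guard(s):
        statement(s)
        return True
    return False


# A process that counts to 5, with up added to the guard of its only action.
increment = (guard_with_up(lambda s: s["n"] < 5),
             lambda s: s.update(n=s["n"] + 1))

p = {"up": True, "n": 0}
ran_before = step(p, increment)   # executes: process is up
fail_stop(p)
ran_after = step(p, increment)    # disabled: process has fail-stopped
print(ran_before, ran_after, p)
```

Other processes that are able to detect the failure would do so by reading `p["up"]`.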

Representation of Permanent Faults (cont’d)
• Byzantine Faults:
– Introduce an auxiliary variable b.j at process j
– Add these actions as faults:
    ~b.j => b.j = true
    b.j => state.j = ??

Goal of Fault-tolerance Design
• Starting from some initial states, S,
– If the program executes alone, then the original specification, sp, is satisfied.
– If the program executes in the presence of faults, then the fault-tolerance specification, sp', is satisfied.
• The fault-tolerance specification depends upon the type of fault-tolerance desired, e.g.,
– for masking, sp' = sp
– for fail-safe, sp' = safety specification of sp

Representation of Permanent Faults
• Fault-tolerant systems are rarely designed from scratch!!!
• One needs to modify a fault-intolerant system to add fault-tolerance
– Need to reuse the fault-intolerant program.
• Fault-tolerant systems need to be modified to deal with new faults.
– Need for incremental design
• Several activities need to be performed while developing fault-tolerant systems.
– manual or automated design, testing, verification, synthesis, ...
– desirable to have a unified framework that allows one to perform these activities.

Overall Design
[Figure: overall design activities: Manual Design, Testing, Automated Synthesis, Theorem Proving, Model Checking, Refinement]

Overall Design (cont’d)
• Should separate concerns of functionality and fault-tolerance.
– Should use components that are responsible for fault-tolerance alone.
• Should provide structural continuity while performing these tasks.
– Should be able to use the same components while performing the above tasks.