Fault Tolerant Computing

Acknowledgements

• These lectures are based on materials from the following sources:
  – S. Kulkarni
  – J. Rushby
  – J. Knight

Objectives

• Exposure to the area of critical systems
• What it means to have a fault-tolerant system
• Specification techniques for representing critical properties
• How to design fault tolerance into a system

Reliability and Recovery

• Reliability – Probability that a system operates without failure up to time t, given that it was operating properly at time 0.

• Recovery – Process of restoring consistency after a failure

Dependability

• Dependability:
  – How much one may rely on the quality of services delivered
  – Quality of service depends on:
    • Correctness
    • Continuity of service

Terms

• Failure: malfunction
• Fault: condition that might lead to failure
• Error: an incorrect response that indicates a fault is present
• Faults may be:
  o permanent
  o intermittent
  o transient

Terms (cont’d)

• Graceful degradation
  – system is operational, but degraded, after faults
• Fail-safe
  – system execution is safe after the fault
• Stabilizing
  – system recovers to a consistent state after the fault
• Masking
  – the user of the system does not see any unintended behavior due to faults

Terms (cont’d)

• Mean Time to Failure (MTTF) – expected value of system failure time

• Mean Time to Repair (MTTR) – expected value of system repair time

• Mean Time Between Failures (MTBF) – expected time between successive failures; MTBF = MTTF + MTTR

• Fault Tolerance – ability to continue operation after occurrence of faults
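The relation MTBF = MTTF + MTTR can be checked with a few lines of Python. This is a minimal sketch with made-up figures (990 hours to failure, 10 hours to repair). The steady-state availability function is not on the slide; it is the standard derived quantity MTTF / (MTTF + MTTR) and is included here as an addition.

```python
def mtbf(mttf_hours: float, mttr_hours: float) -> float:
    """Mean time between failures, using MTBF = MTTF + MTTR."""
    return mttf_hours + mttr_hours


def steady_state_availability(mttf_hours: float, mttr_hours: float) -> float:
    """Fraction of time the system is up (standard formula, not from the slide)."""
    return mttf_hours / (mttf_hours + mttr_hours)


# Hypothetical figures chosen only for illustration.
mttf, mttr = 990.0, 10.0
print("MTBF (hours):", mtbf(mttf, mttr))
print("Availability:", steady_state_availability(mttf, mttr))
```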

Design Decisions

• Fault detection
• Fault confinement
• Fault diagnosis
• Repair and/or reconfigure
• Redundancy
  – Hardware: extra hardware
  – Information: redundancy bits
  – Software: diagnosis software, extra software
  – Temporal: re-execute software to recover from intermittent faults
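As one illustration of information redundancy, the sketch below appends a single even-parity bit to a word. The function names are hypothetical, and a single parity bit is only one of many possible schemes: it detects any odd number of flipped bits but misses an even number.

```python
def add_even_parity(bits: list[int]) -> list[int]:
    """Append one redundancy bit so the total number of 1s is even."""
    return bits + [sum(bits) % 2]


def parity_ok(word: list[int]) -> bool:
    """Return True if the word (data bits plus parity bit) has even parity."""
    return sum(word) % 2 == 0


word = add_even_parity([1, 0, 1, 1])

one_flip = word.copy()
one_flip[2] ^= 1          # a single corrupted bit is detected

two_flips = word.copy()
two_flips[0] ^= 1
two_flips[1] ^= 1         # two corrupted bits cancel out and go undetected

print(parity_ok(word), parity_ok(one_flip), parity_ok(two_flips))
```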

Safety vs Reliability

• Reliability: concerns occurrence of failures
  – System failures defined in terms of system services
• Safety: concerns occurrence of accidents
  – Unplanned events that result in death, injury, illness, damage, loss of property, or environmental harm
  – Defined in terms of external consequences

Types of Faults

• Omission failure
  – server omits to respond to an input (fail-silent failure)
• Timing failure
  – response is functionally correct, but untimely
  – can be an early timing failure or a late timing failure (performance failure)
• Response failure
  – incorrect response
  – value failure: the output value is incorrect
  – state transition failure: the state transition is incorrect

Types of Faults (cont’d)

• Crash failure
  – after a first omission, the server omits to produce output until it restarts
• Amnesia crash
  – server restarts in a predefined initial state that does not depend on the inputs seen before the crash
• Partial amnesia crash
  – some part of the state is the same as before the crash; the rest is in a predefined initial state
• Pause crash
  – server restarts in the state it had before the crash
• Halting crash
  – crashed server never restarts

Examples

• OS crash followed by a reboot in the initial state

• Database server crash followed by recovery of a database state that reflects all transactions before the crash

• Communication server occasionally loses messages but does not delay messages (omission failure)

• Excessive message transmission or message processing delay (communication performance failure)

• Alteration of a message due to random noise during transmission (response failure)

Hierarchical Failure Masking

• A failure of a certain type at a lower level can propagate as a different kind of failure at a higher level of abstraction.

• A value error at the physical layer (e.g., 2 bits corrupted) propagates as an omission error at the data link layer.
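The physical-layer example can be sketched in Python as follows. This is an illustration under assumptions, not a real link-layer protocol: the frame layout (payload followed by a 4-byte CRC-32) and the function names are hypothetical. A corrupted frame fails the checksum and is dropped, so the layer above sees a missing frame rather than a wrong value.

```python
import zlib


def send_frame(payload: bytes) -> bytes:
    """Append a CRC-32 checksum to the payload (hypothetical frame layout)."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")


def receive_frame(frame: bytes):
    """Return the payload, or None if the checksum does not match.

    Returning None models the omission seen by the higher layer:
    the corrupted frame is silently discarded.
    """
    payload, checksum = frame[:-4], frame[-4:]
    if zlib.crc32(payload).to_bytes(4, "big") != checksum:
        return None
    return payload


frame = send_frame(b"hello")

corrupted = bytearray(frame)
corrupted[0] ^= 0b00000011   # value error: 2 bits flipped in transit

print(receive_frame(frame))             # intact frame is delivered
print(receive_frame(bytes(corrupted)))  # corrupted frame is dropped
```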

Group Failure Masking

• To ensure a service remains available to clients despite server failure, one can implement a group of redundant, physically independent servers.
• The group masks the failure of a member.
• Hierarchical masking requires users to implement resource failure-masking attempts as exception-handling code.
• In group masking, individual member failures are entirely hidden from users by group management mechanisms.

Group Failure Masking (cont’d)

• Group output is a function of the outputs of individual group members:
  – fastest member
  – distinguished member
  – result of majority vote
• A group able to mask any k concurrent member failures is termed k-fault tolerant
  – e.g., a primary/standby group of k servers with members ranked as primary, 1st backup, 2nd backup, ..., can mask k-1 failures.
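Two of the group output functions listed above can be sketched in Python. Both functions are hypothetical illustrations: `majority_vote` returns the value reported by a strict majority of members, and `primary_standby` returns the output of the highest-ranked member that has not crashed, with `None` assumed to stand for a crashed member.

```python
from collections import Counter


def majority_vote(outputs: list):
    """Return the value produced by a strict majority of members, else None."""
    if not outputs:
        return None
    value, count = Counter(outputs).most_common(1)[0]
    return value if count > len(outputs) // 2 else None


def primary_standby(outputs_by_rank: list):
    """Return the output of the highest-ranked live member.

    outputs_by_rank is ordered primary, 1st backup, 2nd backup, ...;
    None models a crashed member (an assumption of this sketch).
    """
    for output in outputs_by_rank:
        if output is not None:
            return output
    return None


# One faulty member out of three is outvoted.
print(majority_vote([42, 42, 7]))

# A group of k = 3 members still answers after k - 1 = 2 crashes.
print(primary_standby([None, None, "reply"]))
```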

Some Formalism

Programs
• A program consists of:
  – a finite set of variables
  – a finite set of actions, each of the form guard => statement, where
    • guard is a boolean expression over program variables, and
    • statement updates program variables
• Modifications
  – guards may contain receives from channels
  – statements may contain sends/receives

Computation

• A program computation is a "fair" sequence of steps, where in each step an action whose guard is true has its statement executed
  – In one step, multiple guards may be true.
  – If the guard of some action is true continuously, then that action will eventually be chosen for execution.

Notes
• A program computation is a sequence of states
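The notion of a fair computation can be made concrete with a small interpreter for guarded actions. This is a minimal sketch, not the lecture's formal model: the scheduler is an assumed round-robin scan, which is only one way to obtain fairness, and the names `run`, `actions`, and `max_steps` are hypothetical. The function returns the computation as a sequence of states.

```python
from typing import Callable

State = dict
Action = tuple[Callable[[State], bool], Callable[[State], None]]


def run(state: State, actions: list[Action], max_steps: int = 100) -> list[State]:
    """Execute guarded actions and return the sequence of states visited.

    The scan for an enabled action starts just after the action executed
    last, so an action whose guard stays true is chosen within
    len(actions) steps (round-robin fairness, an assumption of this sketch).
    """
    trace = [dict(state)]
    start = 0
    for _ in range(max_steps):
        for offset in range(len(actions)):
            i = (start + offset) % len(actions)
            guard, statement = actions[i]
            if guard(state):
                statement(state)
                trace.append(dict(state))
                start = i + 1
                break
        else:
            break  # no guard is true: the program has reached a fixpoint
    return trace


# Two actions: x < 2 => x := x + 1   and   y < 2 => y := y + 1
actions = [
    (lambda s: s["x"] < 2, lambda s: s.update(x=s["x"] + 1)),
    (lambda s: s["y"] < 2, lambda s: s.update(y=s["y"] + 1)),
]
trace = run({"x": 0, "y": 0}, actions)
for st in trace:
    print(st)
```

With both guards initially true, the round-robin scan alternates between the two actions instead of running the first one to completion.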

Specification

• A specification is a set of sequences of states.

• What does it mean for a program p to satisfy a specification sp from a set of states S?
  – Every computation of p that starts from a state in S is in sp.

Examples of specifications

• Let S be a predicate.
  – Invariant:
    invariant(S) = {seq: S is true in each state of seq}
    A sequence seq is in invariant(S) iff S is true in each state of seq.
  – Closure:
    closed(S) = {seq: (Ai: i >= 0: ‘S is true in the ith state of seq’ => ‘S is true in the (i+1)th state of seq’)}
    If S ever becomes true, it continues to be true.
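The two definitions can be checked on a finite sequence of states with a few lines of Python. This is a sketch under an assumption: specifications are sets of possibly infinite sequences, whereas these hypothetical helpers only examine a finite list, so they test a finite prefix of a computation.

```python
def in_invariant(seq: list, S) -> bool:
    """S is true in each state of seq."""
    return all(S(state) for state in seq)


def in_closed(seq: list, S) -> bool:
    """Whenever S is true in the ith state, it is true in the (i+1)th state."""
    return all(S(nxt) for cur, nxt in zip(seq, seq[1:]) if S(cur))


def non_negative(x: int) -> bool:
    return x >= 0


print(in_invariant([0, 1, 2], non_negative))   # S holds everywhere
print(in_invariant([-1, 0, 1], non_negative))  # S fails in the first state
print(in_closed([-1, 0, 1], non_negative))     # once true, S stays true
print(in_closed([1, -1], non_negative))        # S becomes false again
```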

Examples of specifications (cont’d)

• Let R and S be predicates.
  – leads-to:
    R leads-to S = {seq: (Ai: i >= 0: ‘R is true in the ith state of seq’ => (Ek: k >= i: ‘S is true in the kth state of seq’))}
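The leads-to definition translates almost directly into Python for a finite sequence. This is a sketch with a hypothetical helper name. On a finite prefix a `False` result only means the obligation has not been met yet, since S could still become true later in the full computation.

```python
def in_leads_to(seq: list, R, S) -> bool:
    """For every i where R holds, S holds at some k >= i (within seq)."""
    return all(
        any(S(seq[k]) for k in range(i, len(seq)))
        for i in range(len(seq))
        if R(seq[i])
    )


def requesting(state: str) -> bool:
    return state == "req"


def in_critical_section(state: str) -> bool:
    return state == "cs"


print(in_leads_to(["idle", "req", "req", "cs"], requesting, in_critical_section))
print(in_leads_to(["req", "idle"], requesting, in_critical_section))
```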

Examples of specifications (cont’d)

• Mutual Exclusion
  – invariant( (j <> k) => ~(cs.j /\ cs.k) )
  – (Aj:: (req.j leads-to cs.j))
• Leader Election
  – invariant( (j <> k) => ~(leader.j /\ leader.k) )
  – true leads-to (Ej:: leader.j)
• Load Balancing
  – true leads-to (Aj,k:: |load.j - load.k| <= bound)

Safety and Liveness

• Safety specification
  – A sequence "does nothing bad"
  – No sequence has a bad prefix
• Let sp be a specification.
  – sp is a safety specification iff
    (As:: s ~element_of sp => (Ea: a is a prefix of s: (Ab:: ab ~element_of sp)))

Liveness Specification

• Liveness specification
  – A sequence “does something good”
  – Every finite prefix has a good extension
• Let sp be a specification.
  – sp is a liveness specification iff
    (Aa:: (Eb:: ab element_of sp))
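For readers more used to standard quantifier notation, the safety and liveness definitions can be written as below. This is a transcription of the two slide formulas, assuming that a ranges over finite sequences of states, that a ⊑ s means "a is a prefix of s", and that ab is the concatenation of a and b.

```latex
\[
sp \text{ is a safety specification} \iff
\forall s :\; s \notin sp \;\Rightarrow\;
\bigl(\exists a :\; a \sqsubseteq s :\; (\forall b :\; ab \notin sp)\bigr)
\]
\[
sp \text{ is a liveness specification} \iff
\forall a :\; (\exists b :\; ab \in sp)
\]
```

Read this way, a safety violation is always witnessed by a finite prefix that no extension can repair, while a liveness specification never rules out any finite prefix.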

Faults

• A fault is an action that can change the program state
• All faults (be they crash, fail-stop, omission, corruption, timing, Byzantine, intruders, or ...) can thus be viewed as perturbations on the system

Faults (cont’d)

• A program computation in the presence of faults is a sequence of steps where
  – in each step either a program action executes or a fault action executes
  – the program actions are fairly executed
  – the number of fault occurrences is finite

Representation of Faults

• Communication faults
  – Let c denote the sequence of messages on a channel.
  – Let m1 and m2 be messages, and let seqm be a sequence of messages.
• Message loss:         c = <seqm, m1>      =>  c = <seqm>
• Message duplication:  c = <seqm, m1>      =>  c = <seqm, m1, m1>
• Message reorder:      c = <seqm, m1, m2>  =>  c = <seqm, m2, m1>
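The three communication faults can be written as functions on a channel modelled as a Python list. This is a minimal sketch with hypothetical function names. Each function returns the channel unchanged when the guard of the fault action does not hold, for example when the channel is too short.

```python
def lose(c: list) -> list:
    """Message loss: <seqm, m1> becomes <seqm>."""
    if len(c) >= 1:
        return c[:-1]
    return c


def duplicate(c: list) -> list:
    """Message duplication: <seqm, m1> becomes <seqm, m1, m1>."""
    if len(c) >= 1:
        return c + [c[-1]]
    return c


def reorder(c: list) -> list:
    """Message reorder: <seqm, m1, m2> becomes <seqm, m2, m1>."""
    if len(c) >= 2:
        return c[:-2] + [c[-1], c[-2]]
    return c


channel = ["a", "b", "c"]
print(lose(channel))
print(duplicate(channel))
print(reorder(channel))
```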

Representation of Faults (cont’d)

• Amnesia/Transient faults
  – Let c denote all the variables of a process.
  – true => c = ??   (where ?? denotes an arbitrary value)

Representation of Permanent Faults

• Fail-stop fault:
  – Upon fail-stop, a process does nothing;
  – it does not execute any action and
  – it does not send any messages.
• Introduce an auxiliary variable up.j at process j
• Add up.j to the guard of each action of j
  – If processes can detect failure of other processes, then they can do so using variable up.
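The fail-stop construction can be sketched in Python as follows. The `Process` class, its single counting action, and the outbox are hypothetical. The fault action itself (setting up.j to false) is not spelled out on the slide and is an assumption of this sketch.

```python
class Process:
    """A process with one action, guarded by the auxiliary variable up."""

    def __init__(self, pid: int):
        self.pid = pid
        self.up = True        # auxiliary variable up.j
        self.counter = 0
        self.outbox = []

    def step(self) -> bool:
        """Program action with up.j added to its guard.

        Returns True if the action executed, False if it was disabled.
        """
        if self.up:
            self.counter += 1
            self.outbox.append((self.pid, self.counter))
            return True
        return False

    def fail_stop(self) -> None:
        """Fault action (assumed form): up.j => up.j = false."""
        self.up = False


def detect_failed(processes: list) -> list:
    """Failure detection by reading the up variable of other processes."""
    return [p.pid for p in processes if not p.up]


p1, p2 = Process(1), Process(2)
p1.step()
p1.fail_stop()
p1.step()                      # disabled: no state change, no message sent
print(p1.counter, p1.outbox, detect_failed([p1, p2]))
```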

Representation of Permanent Faults (cont’d)

• Byzantine faults:
  – Introduce an auxiliary variable b.j at process j
  – Add these actions as faults:
    ~b.j => b.j = true
    b.j => state.j = ??
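The two Byzantine fault actions can be sketched in Python as follows. This is an illustration under assumptions: the `Process` class is hypothetical, and the "arbitrary" new state is modelled as a random choice from a finite `domain`, supplied by the caller, which the slide does not specify.

```python
import random


class Process:
    def __init__(self, state: int):
        self.b = False        # auxiliary variable b.j: is j Byzantine?
        self.state = state    # state.j


def become_byzantine(p: Process) -> None:
    """Fault action: ~b.j => b.j = true."""
    if not p.b:
        p.b = True


def corrupt_state(p: Process, rng: random.Random, domain: list) -> None:
    """Fault action: b.j => state.j = ?? (any value from domain)."""
    if p.b:
        p.state = rng.choice(domain)


rng = random.Random(0)
domain = [10, 20, 30]
p = Process(state=0)

corrupt_state(p, rng, domain)   # guard false: a non-Byzantine process is untouched
state_before = p.state

become_byzantine(p)
corrupt_state(p, rng, domain)   # guard true: state becomes arbitrary
print(state_before, p.b, p.state in domain)
```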

Goal of Fault-tolerance Design

• Starting from some initial states, S,
  – If the program executes alone, then the original specification, sp, is satisfied.
  – If the program executes in the presence of faults, then the fault-tolerance specification, sp', is satisfied.
• The fault-tolerance specification depends upon the type of the desired fault-tolerance, e.g.,
  – for masking, sp' = sp
  – for fail-safe, sp' = the safety specification of sp

Representation of Permanent Faults

• Fault-tolerant systems are rarely designed from scratch!
• One needs to modify a fault-intolerant system to add fault-tolerance
  – Need to reuse the fault-intolerant program.
• Fault-tolerant systems need to be modified to deal with new faults.
  – Need for incremental design
• Need to perform several activities while developing fault-tolerant systems.
  – manual or automated design, testing, verification, synthesis, ...
  – desirable to have a unified framework that allows these activities to be performed.

Overall Design

[Figure: design activities shown as Manual Design, Testing, Automated Synthesis, Theorem Proving, Model Checking, Refinement]

Overall Design (cont’d)

• Should separate concerns of functionality and fault-tolerance.
  – Should use components that are responsible for fault-tolerance alone.
• Should provide structural continuity while performing these tasks.
  – Should be able to use the same components while performing the above tasks.
