SADIYA FARHEEN

8/6/2019 SADIYA FARHEEN


Session :

Feb-Jun 2011

FAULT TOLERANCE & FAULT TOLERANCE ARCHITECTURES

In Critical Systems Development

Under the guidance of

Mr. Manjunath C.R.

Asst. Prof., SBMJCE

By, Sadiya Farheen

10MT6ECS10

SBMJCE, Jain University



FAULT TOLERANCE

In critical situations, software systems must be fault tolerant.

Fault tolerance is required where there are high availability requirements or where system failure costs are very high.

Fault tolerance means that the system can continue in operation in spite of software failure.


FAULT TOLERANCE ACTIONS

Fault detection

Damage assessment

Fault recovery

Fault repair


FAULT DETECTION

The first stage of fault tolerance is to detect that a fault (an erroneous system state) has occurred or will occur.

Ex. Insulin pump software:


// The dose of insulin to be delivered must always be greater
// than zero and less than some defined maximum single dose

insulin_dose >= 0 & insulin_dose <= insulin_reservoir_contents

// The total amount of insulin delivered in a day must be less
// than or equal to a defined daily maximum dose

cumulative_dose
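State constraints of this kind can be written as boolean checks that run before a dose is delivered. The following is a minimal sketch, not the pump's actual software: the class name, method names and the value of MAX_DAILY_DOSE are assumptions made for illustration.

```java
// Hypothetical sketch of the insulin pump state constraints.
// Names and the numeric limit are assumptions, not real pump values.
final class InsulinDoseCheck {
    // Assumed daily maximum; a real system would take this from its configuration.
    static final int MAX_DAILY_DOSE = 25;

    // First constraint: the single dose is non-negative and no larger
    // than what the reservoir currently holds.
    static boolean singleDoseValid(int insulinDose, int reservoirContents) {
        return insulinDose >= 0 && insulinDose <= reservoirContents;
    }

    // Second constraint: the total delivered today stays within the daily maximum.
    static boolean dailyDoseValid(int cumulativeDose) {
        return cumulativeDose <= MAX_DAILY_DOSE;
    }

    // The state is acceptable only if both constraints hold.
    static boolean stateValid(int insulinDose, int reservoirContents, int cumulativeDose) {
        return singleDoseValid(insulinDose, reservoirContents)
                && dailyDoseValid(cumulativeDose);
    }
}
```

A caller would evaluate stateValid before committing a dose and treat a false result as a detected fault.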



Types of fault detection

Preventative fault detection

- The fault detection mechanism is initiated before the state change is committed.

Retrospective fault detection

- The fault detection mechanism is initiated after the system state has been changed.
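The difference between the two types can be seen by applying both to the same state variable. This is a minimal sketch under assumed names: BoundedCounter, its 0..MAX constraint and both method names are hypothetical and only illustrate where the check sits relative to the state change.

```java
// Hypothetical state holder with the assumed constraint 0 <= value <= MAX.
final class BoundedCounter {
    static final int MAX = 100;
    private int value = 0;

    // Preventative: the proposed state is checked before it is committed,
    // so an invalid change never reaches the stored state.
    boolean addPreventative(int delta) {
        int proposed = value + delta;
        if (proposed < 0 || proposed > MAX) {
            return false; // fault detected, state untouched
        }
        value = proposed;
        return true;
    }

    // Retrospective: the state is changed first and checked afterwards;
    // on a detected fault the previous value is put back.
    boolean addRetrospective(int delta) {
        int previous = value;
        value += delta;
        if (value < 0 || value > MAX) {
            value = previous;
            return false;
        }
        return true;
    }

    int get() {
        return value;
    }
}
```

In the retrospective variant the invalid value exists briefly in the stored state, which is why that style needs a recovery step after detection.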


Implementation of preventative fault detection


// NumericException is assumed to be defined elsewhere.
class PositiveEvenInteger {

    int val = 0;

    PositiveEvenInteger (int n) throws NumericException
    {
        // Reject the value before it is stored (preventative detection)
        if (n < 0 || n % 2 == 1)
            throw new NumericException ();
        else
            val = n;
    } // PositiveEvenInteger

    public void assign (int n) throws NumericException
    {
        if (n < 0 || n % 2 == 1)
            throw new NumericException ();
        else
            val = n;
    } // assign

    int toInteger ()
    {
        return val;
    } // toInteger

    boolean equals (PositiveEvenInteger n)
    {
        return (val == n.val);
    } // equals

} // PositiveEven



Damage Assessment

Analyse system state to judge the extent of corruption caused by a system failure.

The assessment must check what parts of the state space have been affected by the failure.

Generally based on validity functions that can be applied to the state elements to assess if their value is within an allowed range.


interface CheckableObject {

    public boolean check ();
}

class RobustArray {

    // Checks that all the objects in an array of objects
    // conform to some defined constraint

    boolean [] checkState;
    CheckableObject [] theRobustArray;

    RobustArray (CheckableObject [] theArray)
    {
        checkState = new boolean [theArray.length];
        theRobustArray = theArray;
    } // RobustArray

    public void assessDamage () throws ArrayDamagedException
    {
        boolean hasBeenDamaged = false;

        for (int i = 0; i
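The listing above breaks off inside the loop of assessDamage. The following is a minimal sketch of how the class might be completed. The names CheckableObject, RobustArray, checkState, hasBeenDamaged and ArrayDamagedException come from the slide; the loop body, the meaning of checkState (true marks an element that failed its check) and the definition of the exception are assumptions.

```java
// Constraint that every stored object must be able to verify for itself.
interface CheckableObject {
    boolean check();
}

// Assumed definition; the slide only names this exception.
class ArrayDamagedException extends Exception {
    ArrayDamagedException() {
        super("array contains damaged elements");
    }
}

class RobustArray {
    // Assumption: checkState[i] is true when element i failed its check.
    boolean[] checkState;
    CheckableObject[] theRobustArray;

    RobustArray(CheckableObject[] theArray) {
        checkState = new boolean[theArray.length];
        theRobustArray = theArray;
    }

    // Applies the validity function to every element, records which
    // elements are damaged, and signals damage with an exception.
    public void assessDamage() throws ArrayDamagedException {
        boolean hasBeenDamaged = false;
        for (int i = 0; i < theRobustArray.length; i++) {
            checkState[i] = !theRobustArray[i].check();
            if (checkState[i]) {
                hasBeenDamaged = true;
            }
        }
        if (hasBeenDamaged) {
            throw new ArrayDamagedException();
        }
    }
}
```

After the exception is caught, checkState tells the recovery code which elements need repair.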



Damage assessment techniques

Checksums

Pointers

Watchdog timers
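As an illustration of the first technique, a checksum stored alongside the data lets later code assess whether the data has been damaged. This is a minimal sketch using the standard java.util.zip.CRC32 class; the ChecksummedRecord class and its corruptByte helper are hypothetical and exist only to demonstrate the idea.

```java
import java.util.zip.CRC32;

// Hypothetical record that keeps a CRC-32 checksum next to its data.
final class ChecksummedRecord {
    private final byte[] data;
    private final long storedChecksum;

    ChecksummedRecord(byte[] data) {
        this.data = data.clone();
        this.storedChecksum = crcOf(this.data);
    }

    // Computes the CRC-32 of the given bytes.
    static long crcOf(byte[] bytes) {
        CRC32 crc = new CRC32();
        crc.update(bytes, 0, bytes.length);
        return crc.getValue();
    }

    // Damage assessment: recompute the checksum and compare with the stored one.
    boolean isDamaged() {
        return crcOf(data) != storedChecksum;
    }

    // Demonstration helper that simulates corruption by flipping one bit.
    void corruptByte(int index) {
        data[index] ^= 0x01;
    }
}
```

A checksum only reveals that damage exists; locating or repairing it needs additional redundancy.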


Fault recovery and repair

Forward recovery

- Apply repairs to a corrupted system state.

Backward recovery

- Restore the system state to a known safe state.

Forward recovery is usually application specific: domain knowledge is required to compute possible state corrections.

Backward error recovery is simpler. Details of a safe state are maintained and this replaces the corrupted system state.


Forward recovery

Corruption of data coding

- Error coding techniques which add redundancy to coded data can be used for repairing data corrupted during transmission.

Redundant pointers

- When redundant pointers are included in data structures (e.g. two-way lists), a corrupted list or filestore may be rebuilt if a sufficient number of pointers are uncorrupted.

- Often used for database and file system repair.
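The redundant-pointer idea can be illustrated with a two-way list whose backward links are rebuilt from its forward links. This is a minimal sketch, assuming the forward chain is intact; the TwoWayList class and its method names are hypothetical.

```java
// Hypothetical two-way (doubly linked) list of integers.
final class TwoWayList {
    static final class Node {
        final int value;
        Node next; // forward pointer
        Node prev; // redundant backward pointer

        Node(int value) {
            this.value = value;
        }
    }

    Node head;
    Node tail;

    void append(int value) {
        Node n = new Node(value);
        if (head == null) {
            head = n;
            tail = n;
        } else {
            tail.next = n;
            n.prev = tail;
            tail = n;
        }
    }

    // Forward recovery: walk the forward chain (assumed uncorrupted) and
    // rewrite every backward pointer that disagrees with it.
    // Returns the number of backward pointers that were repaired.
    int repairBackwardLinks() {
        int repaired = 0;
        Node previous = null;
        for (Node n = head; n != null; n = n.next) {
            if (n.prev != previous) {
                n.prev = previous;
                repaired++;
            }
            previous = n;
        }
        tail = previous;
        return repaired;
    }
}
```

The same walk in the opposite direction would rebuild damaged forward pointers from intact backward ones.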


Backward recovery

Transactions are a frequently used method of backward recovery. Changes are not applied until computation is complete. If an error occurs, the system is left in the state preceding the transaction.

Periodic checkpoints allow the system to 'roll back' to a correct state.
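Both mechanisms can be sketched on a small in-memory structure. This is an illustration under assumed names, not a real transaction manager: CheckpointedLedger, its methods and the rule that negative amounts are invalid are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical ledger supporting a transaction-style update and checkpoints.
final class CheckpointedLedger {
    private List<Integer> entries = new ArrayList<>();
    private List<Integer> checkpoint = new ArrayList<>();

    // Transaction: changes are made on a working copy and only become
    // visible if every step succeeds.
    boolean applyAll(int[] amounts) {
        List<Integer> working = new ArrayList<>(entries);
        for (int amount : amounts) {
            if (amount < 0) {
                return false; // assumed validity rule; entries is left untouched
            }
            working.add(amount);
        }
        entries = working; // commit
        return true;
    }

    // Unprotected update, used here to show what a rollback undoes.
    void add(int amount) {
        entries.add(amount);
    }

    // Checkpoint: remember a copy of the current, presumed correct, state.
    void checkpoint() {
        checkpoint = new ArrayList<>(entries);
    }

    // Roll back: replace the current state with the last checkpoint.
    void rollBack() {
        entries = new ArrayList<>(checkpoint);
    }

    int size() {
        return entries.size();
    }
}
```

The transaction never exposes a partial update, while the checkpoint discards everything done since it was taken.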


Safe sort procedure

A sort operation monitors its own execution and assesses if the sort has been correctly executed.

It maintains a copy of its input so that if an error occurs, the input is not corrupted.

Based on identifying and handling exceptions.

Possible in this case as the condition for a valid sort is known. However, in many cases it is difficult to write validity checks.


// Sort.bubblesort, Sort.ascending and SortError are assumed to be defined elsewhere.
class SafeSort {

    static void sort (int [] intarray, int order) throws SortError
    {
        int [] copy = new int [intarray.length];

        // copy the input array
        for (int i = 0; i < intarray.length; i++)
            copy [i] = intarray [i];

        try {
            Sort.bubblesort (intarray, intarray.length, order);

            // check that adjacent elements are in the requested order
            if (order == Sort.ascending) {
                for (int i = 0; i < intarray.length - 1; i++)
                    if (intarray [i] > intarray [i+1])
                        throw new SortError ();
            } else {
                for (int i = 0; i < intarray.length - 1; i++)
                    if (intarray [i+1] > intarray [i])
                        throw new SortError ();
            }
        } // try block
        catch (SortError e)
        {
            // restore the original input before reporting the failure
            for (int i = 0; i < intarray.length; i++)
                intarray [i] = copy [i];
            throw new SortError ("Array not sorted");
        } // catch
    } // sort

} // SafeSort



Fault tolerant architecture

Defensive programming cannot cope with faults that involve interactions between the hardware and the software.

Where systems have high availability requirements, a specific architecture designed to support fault tolerance may be required.

This must tolerate both hardware and software failure.


Hardware fault tolerance

Triple Modular Redundancy (TMR) to cope with hardware failure
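In TMR the outputs of three replicated units are compared and the value that at least two of them agree on is passed forward. The comparison step can be modelled in software as follows. This is only a sketch of the voting logic, with a hypothetical class name, not a description of any particular hardware.

```java
// Hypothetical software model of a TMR output comparator.
final class TmrVoter {
    // Returns the value that at least two of the three units agree on.
    // Throws if all three outputs differ, since no majority exists.
    static int vote(int a, int b, int c) {
        if (a == b || a == c) {
            return a;
        }
        if (b == c) {
            return b;
        }
        throw new IllegalStateException("no majority among the three units");
    }
}
```

A single faulty unit is outvoted by the other two; two simultaneous faults are beyond what this scheme can mask.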


Software analogies to TMR

N-version programming

- The same specification is implemented in a number of different versions by different teams. All versions compute simultaneously and the majority output is selected using a voting system.

- This is the most commonly used approach, e.g. in many models of the Airbus commercial aircraft.

Recovery blocks

- A number of explicitly different versions of the same specification are written and executed in sequence.

- An acceptance test is used to select the output to be transmitted.


N-version programming
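The voting arrangement can be sketched in code. This is a simplified illustration: the NVersionExecutor class is hypothetical, each version is modelled as a function from int to int, and the versions are called one after another rather than simultaneously to keep the example short.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.function.IntUnaryOperator;

// Hypothetical N-version executor with a strict-majority voter.
final class NVersionExecutor {
    // Runs every version on the same input and returns the output produced
    // by more than half of the versions, or empty if there is no majority.
    static Optional<Integer> run(List<IntUnaryOperator> versions, int input) {
        Map<Integer, Integer> votes = new HashMap<>();
        for (IntUnaryOperator version : versions) {
            try {
                votes.merge(version.applyAsInt(input), 1, Integer::sum);
            } catch (RuntimeException e) {
                // A version that crashes casts no vote.
            }
        }
        for (Map.Entry<Integer, Integer> entry : votes.entrySet()) {
            if (entry.getValue() * 2 > versions.size()) {
                return Optional.of(entry.getKey());
            }
        }
        return Optional.empty();
    }
}
```

With three versions, one wrong or crashed version is outvoted; the scheme depends on the versions not sharing the same fault.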


Recovery blocks
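The control flow of a recovery block can be sketched as a loop over alternates guarded by an acceptance test. This is a minimal illustration: RecoveryBlock and AcceptanceTest are hypothetical names, and the restoration of system state before each retry is left out because the alternates here are pure functions.

```java
import java.util.List;
import java.util.Optional;
import java.util.function.IntUnaryOperator;

// Hypothetical acceptance test that judges an output against its input.
interface AcceptanceTest {
    boolean accept(int input, int output);
}

// Hypothetical recovery block: alternates are tried in sequence.
final class RecoveryBlock {
    // Returns the first output that passes the acceptance test,
    // or empty if every alternate fails or is rejected.
    static Optional<Integer> run(List<IntUnaryOperator> alternates,
                                 AcceptanceTest test, int input) {
        for (IntUnaryOperator alternate : alternates) {
            try {
                int output = alternate.applyAsInt(input);
                if (test.accept(input, output)) {
                    return Optional.of(output);
                }
            } catch (RuntimeException e) {
                // A crashing alternate counts as a failed attempt.
            }
        }
        return Optional.empty();
    }
}
```

Unlike the voting approach, only as many versions run as are needed to produce an acceptable result, so the quality of the acceptance test determines how much protection is gained.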


Key points

Exceptions are used to support error management in dependable systems.

The four aspects of program fault tolerance are failure detection, damage assessment, fault recovery and fault repair.

N-version programming and recovery blocks are alternative approaches to fault-tolerant architectures.


QUERIES??


THANK YOU FOLKS!!!!
