SADIYA FARHEEN


  • 8/6/2019 SADIYA FARHEEN

    1/25

    Session :

    Feb-Jun 2011

    FAULT TOLERANCE & FAULT TOLERANCE ARCHITECTURES

    In Critical Systems Development

    Under the guidance of

    Mr. Manjunath C.R.

    Asst. Prof., SBMJCE

    By, Sadiya Farheen

    10MT6ECS10

    SBMJCE, Jain University

  • 8/6/2019 SADIYA FARHEEN

    2/25

    FAULT TOLERANCE

    In critical situations, software systems must be fault tolerant.

    Fault tolerance is required where there are high availability requirements or where system failure costs are very high.

    Fault tolerance means that the system can continue in operation in spite of software failure.


  • 8/6/2019 SADIYA FARHEEN

    3/25

    FAULT TOLERANCE

    ACTIONS

    Fault detection

    Damage assessment

    Fault recovery

    Fault repair


  • 8/6/2019 SADIYA FARHEEN

    4/25

    FAULT DETECTION

    The first stage of fault tolerance is to detect that a fault (an

    erroneous system state) has occurred or will occur.

    Ex. Insulin pump software:


    // The dose of insulin to be delivered must always be greater
    // than zero and less than some defined maximum single dose
    insulin_dose >= 0 & insulin_dose <= insulin_reservoir_contents

    // The total amount of insulin delivered in a day must be less
    // than or equal to a defined daily maximum dose
    cumulative_dose
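As a rough illustration, the two state constraints above could be written as Java predicates. The class, method and parameter names (`DoseConstraints`, `reservoirContents`, `maxDailyDose` and so on) are hypothetical. The second check follows the wording of the comment, because the expression itself is cut off on the slide.

```java
// Hypothetical sketch of the insulin pump state constraints as Java checks.
// All names here are assumptions made for illustration.
final class DoseConstraints {
    private DoseConstraints() {}

    // A single dose must be non-negative and no larger than the reservoir contents.
    static boolean singleDoseOk(int insulinDose, int reservoirContents) {
        return insulinDose >= 0 && insulinDose <= reservoirContents;
    }

    // The total delivered in a day must not exceed the defined daily maximum.
    static boolean dailyDoseOk(int cumulativeDose, int maxDailyDose) {
        return cumulativeDose <= maxDailyDose;
    }

    // A fault is detected when either constraint is violated.
    static boolean faultDetected(int insulinDose, int reservoirContents,
                                 int cumulativeDose, int maxDailyDose) {
        return !(singleDoseOk(insulinDose, reservoirContents)
                 && dailyDoseOk(cumulativeDose, maxDailyDose));
    }
}
```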

  • 8/6/2019 SADIYA FARHEEN

    5/25

    Types of fault detection

    Preventative fault detection

    - The fault detection mechanism is initiated

    before the state change is committed.

    Retrospective fault detection

    - The fault detection mechanism is initiated after

    the system state has been changed.
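The difference between the two types can be sketched with a small example. `TankLevel` and its `CAPACITY` limit are invented for illustration only: one method checks the proposed value before committing it, the other commits first and then checks the resulting state.

```java
// Illustrative only: a level that must stay within 0..CAPACITY.
final class TankLevel {
    static final int CAPACITY = 100;   // assumed limit for the example
    private int level = 0;

    int level() { return level; }

    // Preventative: validate the proposed value before committing it.
    boolean addPreventative(int amount) {
        int proposed = level + amount;
        if (proposed < 0 || proposed > CAPACITY) {
            return false;              // rejected, state untouched
        }
        level = proposed;
        return true;
    }

    // Retrospective: commit first, then check the resulting state
    // and undo the change if the check fails.
    boolean addRetrospective(int amount) {
        int previous = level;
        level = level + amount;        // state is changed here
        if (level < 0 || level > CAPACITY) {
            level = previous;          // fault found after the change
            return false;
        }
        return true;
    }
}
```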


  • 8/6/2019 SADIYA FARHEEN

    6/25

    Implementation of

    preventative fault detection


    // NumericException is assumed to be a user-defined exception class
    class PositiveEvenInteger {
        int val = 0;

        PositiveEvenInteger(int n) throws NumericException
        {
            if (n < 0 || n % 2 == 1)
                throw new NumericException();
            else
                val = n;
        } // PositiveEvenInteger

  • 8/6/2019 SADIYA FARHEEN

    7/25


        public void assign(int n) throws NumericException
        {
            if (n < 0 || n % 2 == 1)
                throw new NumericException();
            else
                val = n;
        } // assign

        int toInteger()
        {
            return val;
        } // toInteger

        boolean equals(PositiveEvenInteger n)
        {
            return (val == n.val);
        } // equals
    } // PositiveEvenInteger
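A self-contained version of this class might look like the sketch below. `NumericException` is not a standard Java class, so it is declared here as an assumption, and the demo class and message text are also invented for illustration.

```java
// NumericException is not part of the Java library; it is defined here
// as an assumption so that the example compiles on its own.
class NumericException extends Exception {
    NumericException(String message) { super(message); }
}

final class PositiveEvenInteger {
    private int val = 0;

    PositiveEvenInteger(int n) throws NumericException {
        assign(n);
    }

    // The check runs before the state change is committed.
    public void assign(int n) throws NumericException {
        if (n < 0 || n % 2 != 0) {
            throw new NumericException("not a non-negative even integer: " + n);
        }
        val = n;
    }

    int toInteger() { return val; }
}

class PositiveEvenIntegerDemo {
    public static void main(String[] args) {
        try {
            PositiveEvenInteger p = new PositiveEvenInteger(4);
            p.assign(7);               // rejected because 7 is odd
        } catch (NumericException e) {
            System.out.println("Rejected: " + e.getMessage());
        }
    }
}
```

Because the check happens before the assignment, a rejected call leaves the previous value in place.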

  • 8/6/2019 SADIYA FARHEEN

    8/25

    Damage Assessment

    Analyse system state to judge the extent of corruption

    caused by a system failure.

    The assessment must check what parts of the state

    space have been affected by the failure.

    Generally based on validity functions that can be

    applied to the state elements to assess if their value is

    within an allowed range.


  • 8/6/2019 SADIYA FARHEEN

    9/25


    interface CheckableObject {
        public boolean check();
    }

    class RobustArray {
        // Checks that all the objects in an array of objects
        // conform to some defined constraint
        boolean[] checkState;
        CheckableObject[] theRobustArray;

        RobustArray(CheckableObject[] theArray)
        {
            checkState = new boolean[theArray.length];
            theRobustArray = theArray;
        } // RobustArray

  • 8/6/2019 SADIYA FARHEEN

    10/25


        public void assessDamage() throws ArrayDamagedException
        {
            boolean hasBeenDamaged = false;

            for (int i = 0; i
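The listing is cut off at this point in the source. One possible way to complete the idea is sketched below. The loop body, the `isValid` helper and the `ArrayDamagedException` declaration are assumptions and should not be read as the original slide code.

```java
// Sketch only: the original listing is truncated, so the loop body and
// the exception class below are assumptions, not the slide's code.
interface CheckableObject {
    boolean check();
}

class ArrayDamagedException extends Exception {
    ArrayDamagedException(String message) { super(message); }
}

class RobustArray {
    private final boolean[] checkState;
    private final CheckableObject[] theRobustArray;

    RobustArray(CheckableObject[] theArray) {
        checkState = new boolean[theArray.length];
        theRobustArray = theArray;
    }

    // Records per-element validity and reports damage if any element fails.
    public void assessDamage() throws ArrayDamagedException {
        boolean hasBeenDamaged = false;
        for (int i = 0; i < theRobustArray.length; i++) {
            checkState[i] = theRobustArray[i].check();
            if (!checkState[i]) {
                hasBeenDamaged = true;
            }
        }
        if (hasBeenDamaged) {
            throw new ArrayDamagedException("one or more elements failed check()");
        }
    }

    // True if element i passed its check in the last assessment.
    boolean isValid(int i) { return checkState[i]; }
}
```

After an `ArrayDamagedException`, the per-element results show which parts of the state were affected.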

  • 8/6/2019 SADIYA FARHEEN

    11/25

    Damage assessment

    techniques

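    Checksums

    - Used for damage assessment in data transmission.

    Redundant pointers

    - Used to check the integrity of data structures.

    Watchdog timers

    - Used to check for non-terminating processes.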
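As an example of the first technique, a checksum can be stored with the data and recomputed later. The sketch below uses the standard `java.util.zip.CRC32` class. The `ChecksummedRecord` class and its `corruptByte` helper are invented for the demonstration.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

// Sketch: store a CRC-32 checksum with a record and recompute it later
// to see whether the stored bytes have been damaged.
final class ChecksummedRecord {
    private final byte[] data;
    private final long checksum;

    ChecksummedRecord(String text) {
        data = text.getBytes(StandardCharsets.UTF_8);
        checksum = crcOf(data);
    }

    private static long crcOf(byte[] bytes) {
        CRC32 crc = new CRC32();
        crc.update(bytes, 0, bytes.length);
        return crc.getValue();
    }

    // Simulates corruption by flipping one bit (for demonstration only).
    void corruptByte(int index) { data[index] ^= 0x01; }

    // Damage is assumed whenever the recomputed checksum differs.
    boolean isDamaged() { return crcOf(data) != checksum; }
}
```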


  • 8/6/2019 SADIYA FARHEEN

    12/25

    Fault recovery and repair

    Forward recovery

    - Apply repairs to a corrupted system state.

    Backward recovery

    - Restore the system state to a known safe state.

    Forward recovery is usually application specific

    - domain knowledge is required to compute

    possible state corrections.

    Backward error recovery is simpler. Details of a

    safe state are maintained and this replaces the

    corrupted system state.


  • 8/6/2019 SADIYA FARHEEN

    13/25

    Forward recovery

    Corruption of data coding

    - Error coding techniques which add redundancy to coded

    data can be used for repairing data corrupted during

    transmission.

    Redundant pointers

    - When redundant pointers are included in data structures

    (e.g. two-way lists), a corrupted list or filestore may be

    rebuilt if a sufficient number of pointers are uncorrupted

    - Often used for database and file system repair.
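The redundant-pointer idea can be sketched with a two-way list. The `TwoWayList` class below is hypothetical and assumes that the backward (`prev`) links and the `tail` reference are still intact, so the forward links can be rebuilt from them.

```java
// Sketch: a doubly linked list keeps both next and prev pointers.
// If the forward (next) links are damaged, they can be rebuilt by
// walking backwards from the tail along the intact prev links.
final class TwoWayList {
    static final class Node {
        final int value;
        Node next;
        Node prev;
        Node(int value) { this.value = value; }
    }

    Node head;
    Node tail;

    void append(int value) {
        Node n = new Node(value);
        if (tail == null) {
            head = n;
        } else {
            tail.next = n;
            n.prev = tail;
        }
        tail = n;
    }

    // Forward recovery: recompute every next pointer (and head) from prev.
    void rebuildNextFromPrev() {
        Node later = null;
        for (Node n = tail; n != null; n = n.prev) {
            n.next = later;
            later = n;
        }
        head = later;
    }

    // Counts nodes reachable by following next pointers from head.
    int forwardLength() {
        int count = 0;
        for (Node n = head; n != null; n = n.next) {
            count++;
        }
        return count;
    }
}
```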


  • 8/6/2019 SADIYA FARHEEN

    14/25

    Backward recovery

    Transactions are a frequently used method of

    backward recovery. Changes are not applied until

    computation is complete. If an error occurs, the

    system is left in the state preceding the transaction.

    Periodic checkpoints allow the system to roll back to a correct state.
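A checkpoint-style rollback might be sketched as follows. `CheckpointedStore` is a hypothetical class. Unlike a true transaction, which holds changes back until the computation completes, this sketch applies changes directly and restores a saved copy if the update fails.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of backward recovery: keep a copy of the last known-good state
// and restore it if an update fails part-way through.
final class CheckpointedStore {
    private Map<String, Integer> state = new HashMap<>();
    private Map<String, Integer> checkpoint = new HashMap<>();

    void put(String key, int value) { state.put(key, value); }

    Integer get(String key) { return state.get(key); }

    // Save a copy of the current state as the safe state.
    void checkpoint() { checkpoint = new HashMap<>(state); }

    // Replace the (possibly corrupted) state with the saved copy.
    void rollBack() { state = new HashMap<>(checkpoint); }

    // Runs the update as a unit: either it completes, or the state
    // is restored to what it was before the update started.
    boolean runAtomically(Runnable update) {
        checkpoint();
        try {
            update.run();
            return true;
        } catch (RuntimeException e) {
            rollBack();
            return false;
        }
    }
}
```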


  • 8/6/2019 SADIYA FARHEEN

    15/25

    Safe sort procedure

    A sort operation monitors its own execution and

    assesses if the sort has been correctly executed.

    It maintains a copy of its input so that if an error

    occurs, the input is not corrupted.

    Based on identifying and handling exceptions.

    Possible in this case as the condition for a valid sort is

    known. However, in many cases it is difficult to write

    validity checks.


  • 8/6/2019 SADIYA FARHEEN

    16/25


    class SafeSort {
        static void sort(int[] intarray, int order) throws SortError
        {
            int[] copy = new int[intarray.length];

            // copy the input array
            for (int i = 0; i < intarray.length; i++)
                copy[i] = intarray[i];
            try {
                Sort.bubblesort(intarray, intarray.length, order);

  • 8/6/2019 SADIYA FARHEEN

    17/25


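                // Validity check: confirm the array is in the requested order
                if (order == Sort.ascending) {
                    for (int i = 0; i < intarray.length - 1; i++)
                        if (intarray[i] > intarray[i+1])
                            throw new SortError();
                } else {
                    for (int i = 0; i < intarray.length - 1; i++)
                        if (intarray[i+1] > intarray[i])
                            throw new SortError();
                }
            } // try block
            catch (SortError e)
            {
                // Restore the original input before reporting the failure
                for (int i = 0; i < intarray.length; i++)
                    intarray[i] = copy[i];
                throw new SortError("Array not sorted");
            } // catch
        } // sort
    } // SafeSort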
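The slide listing depends on `Sort` and `SortError` classes that are not shown. The sketch below is a self-contained variant under stated assumptions: `SortError` is declared locally, only ascending order is checked, and the `Sorter` interface is a hypothetical hook that lets a faulty sort routine be plugged in for demonstration.

```java
import java.util.Arrays;

// Assumed stand-in for the SortError class used on the slides.
class SortError extends Exception {
    SortError(String message) { super(message); }
}

final class SafeSort {
    private SafeSort() {}

    // Hypothetical hook so that a faulty sort can be simulated.
    interface Sorter { void sort(int[] a); }

    static void sort(int[] intarray, Sorter sorter) throws SortError {
        int[] copy = Arrays.copyOf(intarray, intarray.length); // keep the input
        sorter.sort(intarray);
        // Validity check: every element must be <= its successor.
        for (int i = 0; i < intarray.length - 1; i++) {
            if (intarray[i] > intarray[i + 1]) {
                // Backward recovery: restore the original contents.
                System.arraycopy(copy, 0, intarray, 0, copy.length);
                throw new SortError("Array not sorted");
            }
        }
    }
}
```

A call such as `SafeSort.sort(data, a -> Arrays.sort(a))` leaves `data` sorted. If the supplied sorter produces an unsorted result, the caller gets a `SortError` and `data` holds its original contents.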

  • 8/6/2019 SADIYA FARHEEN

    18/25

    Fault tolerant architecture

    Defensive programming cannot cope with faults that

    involve interactions between the hardware and the

    software.

    Where systems have high availability requirements, a

    specific architecture designed to support fault

    tolerance may be required.

    This must tolerate both hardware and software failure.


  • 8/6/2019 SADIYA FARHEEN

    19/25

    Hardware fault tolerance

    Triple Modular Redundancy (TMR) to cope with hardware failure


  • 8/6/2019 SADIYA FARHEEN

    20/25

    Software analogies to TMR

    N-version programming

    - The same specification is implemented in a number of different versions by different teams. All versions compute simultaneously and the majority output is selected using a voting system.

    - This is the most commonly used approach e.g. in many models of the Airbus commercial aircraft.

    Recovery blocks

    - A number of explicitly different versions of the same specification are written and executed in sequence.

    - An acceptance test is used to select the output to be transmitted.


  • 8/6/2019 SADIYA FARHEEN

    21/25

    N-version programming
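The voting step could be sketched as below. `NVersionVoter` is a hypothetical class, and each version is modelled as an `IntUnaryOperator` purely for illustration. The versions are also run one after another here for simplicity, whereas in a real N-version system they would be independently developed and would execute in parallel.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.function.IntUnaryOperator;

// Sketch of an output comparator for N-version programming.
final class NVersionVoter {
    private NVersionVoter() {}

    // Runs every version on the same input and returns the result that a
    // strict majority agrees on, or empty if there is no majority.
    static Optional<Integer> vote(List<IntUnaryOperator> versions, int input) {
        Map<Integer, Integer> tally = new HashMap<>();
        for (IntUnaryOperator version : versions) {
            int result = version.applyAsInt(input);
            tally.merge(result, 1, Integer::sum);
        }
        for (Map.Entry<Integer, Integer> entry : tally.entrySet()) {
            if (entry.getValue() * 2 > versions.size()) {
                return Optional.of(entry.getKey());
            }
        }
        return Optional.empty();
    }
}
```

With three versions, a single faulty version is outvoted by the two that agree.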


  • 8/6/2019 SADIYA FARHEEN

    22/25

    Recovery blocks
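A recovery block can be sketched as a loop over alternates with an acceptance test. `RecoveryBlock` and its signature are assumptions made for this example. An alternate that throws an exception is treated here in the same way as one whose result fails the test.

```java
import java.util.List;
import java.util.Optional;
import java.util.function.IntPredicate;
import java.util.function.IntUnaryOperator;

// Sketch of a recovery block: alternates run one after another and the
// first result that passes the acceptance test is returned.
final class RecoveryBlock {
    private RecoveryBlock() {}

    static Optional<Integer> execute(List<IntUnaryOperator> alternates,
                                     IntPredicate acceptanceTest,
                                     int input) {
        for (IntUnaryOperator alternate : alternates) {
            try {
                int result = alternate.applyAsInt(input);
                if (acceptanceTest.test(result)) {
                    return Optional.of(result);
                }
            } catch (RuntimeException e) {
                // A crashing alternate is treated like a failed test.
            }
        }
        return Optional.empty(); // every alternate failed
    }
}
```

Unlike N-version programming, only as many alternates run as are needed to produce an acceptable result.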


  • 8/6/2019 SADIYA FARHEEN

    23/25

    Key points

    Exceptions are used to support error management in

    dependable systems.

    The four aspects of program fault tolerance are fault detection, damage assessment, fault recovery and fault repair.

    N-version programming and recovery blocks are

    alternative approaches to fault-tolerant architectures.


  • 8/6/2019 SADIYA FARHEEN

    24/25

    QUERIES??


  • 8/6/2019 SADIYA FARHEEN

    25/25

    THANK YOU FOLKS!!!!
