Dugan Comp Sys Fta Tutor


    RELIABILITY and MAINTAINABILITY Symposium

    Fault Tree Analysis of Computer-Based Systems

    Joanne Bechta Dugan, Professor of Electrical & Computer Engineering

    University of Virginia ([email protected])


    Presentation Outline

    I. Introduction to fault trees

    II. Fault tree analysis of an example control system

    III. Fault trees as design aid for software systems

    IV. Adapting the fault tree to analysis of computer-based systems

    V. Dynamic fault trees for modeling sequential behavior

    VI. Modular approach to fault tree analysis

    VII. Sensitivity analysis

    VIII. Summary and Conclusions


    Introduction to fault tree analysis

    Fault trees provide a good framework for both qualitative and quantitative analysis because they have both a logical (boolean algebra) and probabilistic basis.

    What is a fault tree?

    - not a tree (in the graph-theoretic sense)
    - a graphical representation of a logical function
    - shows logical relationship between an event (failure) and its causes
    - provides a logical framework for expressing combinations of component failures that can lead to system failure


    Why use fault tree analysis?

    A fault tree model provides a logical framework for analyzing the failure behavior of a system.

    A fault tree model precisely documents which failure scenarios have been considered and which have not.

    Fault tree analysis can be used to support engineering and management decisions, trade-off analysis and risk assessment.

    The fault tree model has a well-defined boolean algebraic and probabilistic basis which relates probability calculations to boolean logic functions.


    Basic Static Fault Tree Constructs

    Basic Events

    Basic Event (labelled name): corresponds to a basic failure event (usually a component failure) in the system. Characterized by failure rate or failure probability.

    Undeveloped Basic Event (labelled name): a basic event that is not completely developed, usually because of unavailable information. Characterized by failure rate or failure probability.

    Replicated Basic Event (labelled k * name): represents k statistically identical copies of a component. Characterized by failure rate or failure probability.


    Static fault tree gates

    AND gate - output event occurs only if ALL input events occur

    OR gate - output event occurs if one or more input events occur

    m/n gate - output event occurs if m or more of the n inputs occur
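
    The three static gates can be sketched as boolean functions. This is a minimal illustration, not part of the original slides; the function names are hypothetical, and each input is True when the corresponding event (failure) has occurred.

```python
def and_gate(inputs):
    """Output event occurs only if ALL input events occur."""
    return all(inputs)


def or_gate(inputs):
    """Output event occurs if one or more input events occur."""
    return any(inputs)


def m_of_n_gate(m, inputs):
    """Output event occurs if m or more of the n input events occur."""
    return sum(bool(x) for x in inputs) >= m


# Two of three inputs have occurred: a 2/3 gate fires, an AND gate does not.
events = [True, False, True]
print(and_gate(events), or_gate(events), m_of_n_gate(2, events))
```

    Note that AND is the n/n case and OR is the 1/n case of the m/n gate.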


    Example Fault Tree

    Washing Machine Overflows (top event, OR gate)

    - valve stuck open
    - fill mode too long (AND gate)
        - timeout control failed
        - full sensor failed

    Structure Function:

    Fail = valve_failed OR (timer_failed AND sensor_failed)

    F = A + BC


    Probabilistic Fault Tree Analysis of Example

    F = A + BC

    Pr[F] = Pr[A + BC] = Pr[A] + Pr[BC] - Pr[ABC]

    Suppose (all failures are independent)

    Pr[A] = 0.01

    Pr[B] = 0.05

    Pr[C] = 0.075

    Then

    Pr[F] = 0.01 + 0.00375 - 0.0000375 = 0.0137125
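
    The arithmetic on this slide can be checked with a few lines of Python. This is only a sketch of the calculation for F = A + BC under the stated independence assumption; the variable names are illustrative.

```python
# Basic event probabilities from the slide (all failures independent).
pr_a = 0.01    # valve stuck open
pr_b = 0.05    # timeout control failed
pr_c = 0.075   # full sensor failed

pr_bc = pr_b * pr_c     # Pr[BC]  = Pr[B] * Pr[C]
pr_abc = pr_a * pr_bc   # Pr[ABC] = Pr[A] * Pr[B] * Pr[C]

# Pr[A + BC] = Pr[A] + Pr[BC] - Pr[ABC]
pr_f = pr_a + pr_bc - pr_abc
print(f"Pr[F] = {pr_f:.7f}")
```

    The same value follows from the complement form 1 - (1 - Pr[A])(1 - Pr[BC]).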


    Fault Tree Analysis - Cutsets

    Most fault tree analysis techniques start with the generation of cutsets.

    A cutset is a set of basic events; if all the basic events in a cutset occur, then the top event (system failure) occurs.

    A mincut (minimal cutset) is one that contains no redundant elements. If an element is removed from a mincut, the remaining events no longer form a cutset.

    Cutsets for the washing machine example:

    {valve} (single point of failure)

    {timer, sensor}


    Cutset generation by example

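    [Figure: fault tree with top gate G1, intermediate gates G2-G5, and basic events A1-A5]

    Top-down expansion of the cutsets, starting from G1:

    G2 -> {G4,G5} -> {A1,G5}, {A2,G5} -> {A1,A3}, {A1,A4}, {A2,A3}, {A2,A4}

    G3 -> {A4,A5}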
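
    The top-down expansion can be sketched in Python. The gate types below are an assumption inferred from the cutset listing on this slide (G1 and the two lowest gates as OR, G2 and G3 as AND), since the figure itself did not survive; the function names are hypothetical.

```python
from itertools import product

# Assumed structure, inferred from the cutsets listed on the slide.
TREE = {
    "G1": ("OR", ["G2", "G3"]),
    "G2": ("AND", ["G4", "G5"]),
    "G3": ("AND", ["A4", "A5"]),
    "G4": ("OR", ["A1", "A2"]),
    "G5": ("OR", ["A3", "A4"]),
}


def cutsets(node, tree):
    """Return the cutsets of `node` as a set of frozensets of basic events."""
    if node not in tree:                       # basic event
        return {frozenset([node])}
    kind, children = tree[node]
    child_sets = [cutsets(child, tree) for child in children]
    if kind == "OR":                           # any child's cutset suffices
        return set().union(*child_sets)
    # AND: choose one cutset from every child and merge the choices
    return {frozenset().union(*combo) for combo in product(*child_sets)}


def minimize(sets):
    """Keep only cutsets that do not strictly contain another cutset."""
    return {s for s in sets if not any(other < s for other in sets)}


mincuts = minimize(cutsets("G1", TREE))
for cut in sorted(sorted(s) for s in mincuts):
    print(cut)
```

    For this tree every generated cutset is already minimal, so the minimization step removes nothing; it matters for trees with repeated events.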


    Probabilistic Analysis using Cutsets

    The probability of system failure is simply the probability that one or more of the cutsets occur.

    But the cutsets are not disjoint, so we cannot sum their individual probabilities. We must account for the overlap of the events.

    Prob(failure) = Prob(valve failure) + Prob(both timer and sensor fail) - Prob(all three fail)

    Cutsets:

    {valve} (single point of failure)

    {timer, sensor}


    Probabilistic Analysis using Inclusion-Exclusion

    Pr(A or B or C) = Pr(A) + Pr(B) + Pr(C) - Pr(A and B) - Pr(B and C) - Pr(A and C) + Pr(A and B and C)

    [Figure: Venn diagram of events A, B and C showing the overlaps AB, BC, AC and ABC]
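
    Inclusion-exclusion generalizes to any list of cutsets. The sketch below assumes independent basic events, so the probability that several cutsets all occur is the product over the union of their events; the function name and the dictionary layout are illustrative choices, not from the slides.

```python
from itertools import combinations
from math import prod


def top_event_probability(cutsets, p):
    """Inclusion-exclusion over cutsets; `p` maps basic event -> probability."""
    total = 0.0
    for k in range(1, len(cutsets) + 1):
        sign = 1.0 if k % 2 == 1 else -1.0     # +singles, -pairs, +triples, ...
        for group in combinations(cutsets, k):
            events = frozenset().union(*group)  # all events in the k cutsets
            total += sign * prod(p[e] for e in events)
    return total


# Washing machine example: cutsets {valve} and {timer, sensor}.
p = {"valve": 0.01, "timer": 0.05, "sensor": 0.075}
washer = [{"valve"}, {"timer", "sensor"}]
print(f"{top_event_probability(washer, p):.7f}")
```

    The number of terms grows as 2^n in the number of cutsets, which is why the sum-of-disjoint-products and BDD methods on the following slides are used for larger trees.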


    Probabilistic Analysis using Sum-of-Disjoint-Products

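    Pr(A or B or C) = Pr(A) + Pr(not A and B) + Pr(not A and not B and C)

    [Figure: Venn diagram of events A, B and C partitioned into the disjoint regions A, (not A) B, and (not A)(not B) C]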
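
    For independent events the disjoint terms can be accumulated in a single pass. This is a sketch under that independence assumption (the slide itself does not state it); the function name is hypothetical.

```python
def sdp_union(probs):
    """Pr(E1 or E2 or ...) as a sum of disjoint products, for independent events."""
    total = 0.0
    none_so_far = 1.0            # Pr(no earlier event occurred)
    for p in probs:
        total += none_so_far * p  # this event occurs, all earlier ones did not
        none_so_far *= 1.0 - p
    return total


# Pr(A) + Pr(not A and B) + Pr(not A and not B and C)
print(sdp_union([0.01, 0.05, 0.075]))
```

    Because the terms are disjoint, they are simply added; no subtraction of overlaps is needed as in inclusion-exclusion.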


    Probabilistic Analysis using Binary Decision Diagrams

    [Figure: a fault tree model over basic events A1, A2, B1 and B2, shown next to its BDD representation with terminal nodes 0 and 1]
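
    A BDD evaluates the top event probability through Shannon decomposition on one variable at a time: Pr[f] = p_x Pr[f | x = 1] + (1 - p_x) Pr[f | x = 0]. The sketch below applies that recursion directly to the washing machine structure function F = A + BC rather than to the tree in the figure, and it omits the node sharing and reduction that a real BDD package performs; all names are illustrative.

```python
def shannon_probability(f, order, p, assignment=None):
    """Evaluate Pr[f] by Shannon expansion along the variable `order`.

    `f` takes a dict variable -> bool; `p` maps variable -> failure probability.
    Events are assumed independent.
    """
    assignment = assignment or {}
    if len(assignment) == len(order):           # terminal node: 0 or 1
        return 1.0 if f(assignment) else 0.0
    var = order[len(assignment)]
    high = shannon_probability(f, order, p, {**assignment, var: True})
    low = shannon_probability(f, order, p, {**assignment, var: False})
    return p[var] * high + (1.0 - p[var]) * low


def washer_fails(x):
    return x["A"] or (x["B"] and x["C"])


p = {"A": 0.01, "B": 0.05, "C": 0.075}
print(f"{shannon_probability(washer_fails, ['A', 'B', 'C'], p):.7f}")
```

    The result does not depend on the variable order, but the size of a reduced BDD does, which is why ordering heuristics matter in practice.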

    II. Fault tree analysis of an example control system


    An example control system including software

    Consider a simple tank level and flow control system.

    The key features of this system are:

    - a water tank, fed by a water pump on the inflow and regulated by control and stop valves on the inflow and outflow pipes
    - a tank and level control system with three sensors (level, inflow and outflow) implemented in software
    - a tank bypass to prevent overflow, controlled by the three stop valves
    - valve actuation and control implemented in software


    Function of example system

    The function is to maintain the water level and downstream flow rate at particular values by opening and closing control valves cv1 and cv2.

    The controller receives inputs from the three sensors, implements the control logic and then gives commands to the two control valves and the two stop valves.

    (This example is adapted from an example in: S. Guarro, M. Yau and M. Motamed, Development of tools for safety analysis of control software in advanced reactors, NUREG/CR-6465, April 1996.)


    Diagram of example system

    [Figure: system diagram showing the pump, control valves cv1 and cv2, stop valves v1 (normally open), v2 (normally closed) and v3 (normally open), two check valves, two flow sensors, the level sensor and the digital controller]


    Control Flow

    [Figure: control flow chart]

    Inputs: level measurement, upstream flow measurement, downstream flow measurement, flow set point.

    Compare level with level set points:

    - level too high: drain water from tank (close v1, open v2, open v3, cv1 = min, cv2 = max)
    - level within bounds: calculate positions for control valves (open v1, close v2, open v3, calculate cv1, calculate cv2)
    - level too low: replenish water in tank (open v1, close v2, close v3, cv1 = max, cv2 = min)


    Software Operational modes

    [Figure: tank level regions, from Underflow through very low, low, normal, high and very high to Overflow, each marked as correct operation, non-critical failure behavior or critical failure behavior]


    System Failure modes

    The control system is not designed to be fault tolerant, so that most hardware failures present either an overflow or underflow hazard. These are the two system-level failure modes being considered. If either of the control valves or any of the three sensors fail, the system fails, as the software will be unable to control the system. A tank leak, pipe leak or pump failure are also considered single points of failure.

    Further, an underflow can occur if valve v1 or v2 fails, thus preventing proper inflow, unless valve v3 can be closed. Therefore, if v3 and either v1 or v2 fail, the system fails. Overflow can occur if valve v3 fails, thus preventing outflow, unless v1 can be closed. Thus the failure of both v1 and v3 leads to system failure.


    Software Failure Modes

    Software failures have been characterized as an improper change of state. Software failures are further classified as critical and non-critical.

    Non-critical software failures lead to less than optimal performance but do not lead to system failure.

    A software failure when the tank water level is very low or very high can lead to system failure.


    Fault Tree Model for Example System

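    [Figure: fault tree for the example system, with these labeled groups of events:]

    - Overflow: SW-VH-Failure; Stop Valve 3 with Stop Valve 1
    - Underflow: SW-VL-Failure; Stop Valve 3 with Stop Valve 2; Stop Valve 3 with Stop Valve 1
    - Control valves: Control Valve 1, Control Valve 2
    - Sensors: level sensor, inflow sensor, outflow sensor
    - Mechanical systems: pump failure, pipe leak, tank leak
    - Processor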

    III. Fault trees as design aid for software systems


    Fault trees as a design aid for software systems

    Fault tree analysis can help to ensure that the software system does not do what it is not supposed to do. (As contrasted with a formal design review, which helps ensure that the software does what it is supposed to do.)

    For robust software systems, fault trees can help identify high-risk areas (either quantitatively or qualitatively).

    Can manage risk by preventive or protective measures applied to identified high-risk areas:

    - exhaustive testing
    - formal methods
    - exception handling
    - acceptance tests
    - interlocks
    - redesign


    Using fault trees to manage risk

    AND gates can be protected by disallowing one of the inputs [1]:

    - exhaustive testing or formal proof to show module cannot fail
    - test for failure condition and provide recovery routine

    OR gates can be protected by disallowing all inputs or by providing a detection and recovery point. (The detection and recovery routines must be simple enough to be certifiably correct.)

    [1] Herbert Hecht and Myron Hecht, Fault Tolerant Software. In D.K. Pradhan, editor, Fault-Tolerant Computing: Theory and Techniques, volume 2, pages 658-696. Prentice-Hall, 1986.


    Example Risk Mitigation

    [Figure: fault tree with top gate G1, intermediate gates G2-G5, and basic events A1-A5]

    Suppose basic events represent software modules.

    Can protect G3 by preventing failure of module A4:

    - exhaustive testing
    - proof of correctness

    Then can protect G2 by:

    - preventing failure of A3
    - or by providing detection and recovery handler for G4
    - or by preventing failure of both A1 and A2

    IV. Adapting the fault tree to analysis of computer-based systems


    Modeling Fault-Tolerant Computer Systems

    Fault Tolerant Computer (FTC) systems can actively handle many faults and errors that may occur.

    Because FTC systems are adaptive and flexible, faulty components can be switched out automatically, and spares switched in.

    However, adaptability and flexibility often result in increased complexity. Increased complexity can mean decreased reliability.

    If the fault tolerance mechanisms (error detection, recovery, reconfiguration) fail, this failure could lead to overall system failure, even if adequate functioning resources remain.

    A coverage model is used to analyze the behavior of the computer system in the presence of a fault. The results of the coverage model are then incorporated into the overall system model.


    Covered vs. Uncovered Faults

    A covered fault is one from which the system can automatically recover.

    - Recovery from a transient fault does not change system state.
    - Recovery from a permanent fault discards the faulty component.

    An uncovered fault is one which leads to immediate system failure, regardless of the state of the system.


    How to estimate coverage?

    If a working or prototype version of the system exists, or if enough information is available about a system being designed, then coverage probabilities can be estimated.

    A model of the recovery process can be developed. The parameters for the model can be measured from fault injection on the working prototype and/or estimated from data collected in the field.

    A detailed simulation model of the system recovery process can be developed.

    If the details of the recovery process are not known, reasonable parameters can be deduced from other, similar systems.


    General structure of a coverage model

    The entry point to the model is the occurrence of the fault, and the three exits (R, C, and S) are the three possible outcomes.

    [Figure: coverage model box, entered when a fault occurs, with three exits]

    - R exit: transient restoration (covered fault)
    - C exit: permanent coverage (covered failure)
    - S exit: single-point failure (uncovered failure)


    R exit: Transient Restoration

    Correct recognition of and recovery from a transient fault. A transient is usually caused by external or environmental factors, such as excessive heat or a glitch in the power line.

    The vast majority of faults are transient.

    Successful recovery from a transient fault restores the system to an operational state without discarding any components, for example by masking the error, retrying an instruction, or rolling back to a previous checkpoint.

    Reaching this exit successfully requires:

    - timely detection of an error produced by the fault;
    - performance of an effective recovery procedure; and
    - swift disappearance of the fault (the cause of the error).


    C exit: Permanent coverage

    Determination of the permanent nature of the fault, and the successful isolation and removal of the faulty component.

    S exit: Single Point failure

    A single fault causes the system to fail, generally when an undetected error propagates through the system, or if the faulty unit cannot be isolated and the system cannot be reconfigured.


    Typical fault recovery for a processor

    A processor contains built-in test circuitry so that error checking occurs concurrently with instruction execution. If an error is detected, the instruction is retried immediately. Partial results are stored in case the retry is unsuccessful, so that the computation can be continued from some intermediate point (called a checkpoint).

    The process of continuing a computation from a previously saved checkpoint is called a rollback. In some cases the fault is such that the rollback is not successful, so the computation must start over after a system-level recovery procedure is invoked.


    Example coverage model for processors

    [Figure: coverage model with the stages Wait, Retry, Rollback and Permanent Recovery, leading to Exit R (Transient Restoration), Exit C (Permanent Coverage) and Exit S (Failure)]


    Example of transient restoration

    Transient Restoration attempt: Assume that the fault is transient, and begin a multi-step recovery procedure that continues as long as an error is detected. If an error persists after all three steps have been performed, then a permanent recovery procedure must be invoked.

    Step 1: Wait for 0.1 second and do nothing. If the fault is transient it may disappear during this time, allowing rollback to succeed.

    Step 2: Retry the current instruction several times, for as long as a half-second. The probability that the retry will be successful (i.e., no error is detected) is 0.5.

    Step 3: If an error persists, perform a rollback to a previous checkpoint, followed by recomputation, taking 2 sec. total. The rollback succeeds in removing the error 80% of the time.


    Example of permanent coverage and single-point failure

    If an error still persists after the rollback, it is assumed to be caused by a permanent fault, and a system level permanent fault recovery process is begun, to remove the offending processor from the set of active units and to reconfigure the system to continue without it. The permanent fault recovery process succeeds with probability 0.875.

    The permanent coverage procedure is invoked against a persistent transient fault as well as against a permanent fault.

    If the permanent fault recovery process fails, then a single-point failure is said to occur.


    Coverage model for memories

    [Figure: coverage model for memories, flattened here as a decision tree]

    Error occurs:

    - Single bit memory error (0.98): error masked in zero time -> Exit R (Transient Restoration)
    - Multiple bit memory error (0.02):
        - not detected (0.05) -> Exit S (Failure)
        - detected (0.95): attempt recovery
            - successful (0.85) -> Exit C (Permanent Coverage)
            - unsuccessful (0.15) -> Exit S (Failure)


    Example of recovery process for memory faults

    The memory uses an error correcting code, so a single-bit error is always detectable and correctable, and no reconfiguration is required. If 98% of all memory faults affect only a single bit, then the probability of reaching the R exit is 0.98.

    The 2% of faults that affect more than one memory bit are 95% detectable. When a multiple memory error is detected, the affected portion of memory is discarded, the memory mapping function is updated, and the needed information is reloaded from a previous checkpoint and updated to represent the current state of the system.

    Experimentation on a prototype system revealed that this recovery from the detected multiple memory errors works 85% of the time.

    Thus, the probability of reaching the C exit is the probability that a multiple fault occurs, is detected, and is recovered from:

    c = 0.02 x 0.95 x 0.85 = 0.01615


    Single point failure for memory faults

    There are two paths to the single point failure exit:

    - A multiple-bit memory error is not detected, with probability 0.02 x 0.05 = 0.001.
    - A multiple-bit memory error is detected, but the attempted recovery is not successful, with probability 0.02 x 0.95 x 0.15 = 0.00285.

    Thus the probability of single point failure is the sum of these two cases, or 0.00385.
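
    The three exit probabilities of the memory coverage model can be recomputed from the branch probabilities given on these slides. The variable names below are illustrative; the numbers are the ones stated in the example.

```python
# Branch probabilities from the memory coverage model.
p_single_bit = 0.98      # fault affects a single bit (masked by the ECC)
p_multi_bit = 0.02       # fault affects more than one bit
p_detected = 0.95        # multiple-bit error is detected
p_recovered = 0.85       # recovery from a detected multiple-bit error works

r = p_single_bit                                    # R exit
c = p_multi_bit * p_detected * p_recovered          # C exit
s = (p_multi_bit * (1 - p_detected)                 # S exit: not detected
     + p_multi_bit * p_detected * (1 - p_recovered))  # or recovery failed

print(f"r = {r:.5f}, c = {c:.5f}, s = {s:.5f}, sum = {r + c + s:.5f}")
```

    The exits are exhaustive and mutually exclusive, so r + c + s must equal 1; checking this is a quick way to catch a missing path in a coverage model.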


    Example - 3P2M system

    3 processors and 2 memories connected via a bus. 2 processors, one memory and the bus are needed for correct operation.

    [Figure: block diagram of the processors and the memories attached to the bus]


    3P2M fault tree

    System Failure (OR gate)

    - Processors: 2/3 gate over P1, P2, P3
    - Memories: AND gate over M1, M2
    - Bus: basic event B


    Adding coverage to fault tree

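    [Figure: the 3P2M fault tree (System Failure over Processors 2/3, Memories and Bus B) with a coverage model, with exits R, C and S, attached to each of the basic events P1, P2, P3, M1 and M2]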


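    Probabilities for covered and uncovered basic events

    Note that the events "fail covered" and "fail uncovered" are mutually exclusive (i.e. not independent).

    Each basic event has three mutually exclusive outcomes: no fault or transient restoration, covered fault, and uncovered fault. Here q = 1 - p, and r, c and s are the probabilities of the R, C and S exits of the coverage model, so r + c + s = 1.

    Pr[component fault] = p

    Pr[component operational] = q

    Pr[No fault or transient restoration] = q + rp

    Pr[covered failure] = cp

    Pr[Uncovered failure] = sp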
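
    To see how these three-way outcomes enter a system calculation, the 3P2M example can be evaluated by brute-force enumeration. This is a sketch, not the solution method used in the tutorial: it assumes any uncovered failure fails the system immediately, treats the bus as a plain two-state component without a coverage model, and uses hypothetical fault probabilities and processor coverage values (only the memory values r = 0.98, c = 0.01615, s = 0.00385 come from the slides).

```python
from itertools import product

OK, COVERED, UNCOVERED = "ok", "covered", "uncovered"
names = ("P1", "P2", "P3", "M1", "M2")


def state_probs(p, r, c, s):
    """Three mutually exclusive outcomes for one component."""
    q = 1.0 - p
    return {OK: q + r * p, COVERED: c * p, UNCOVERED: s * p}


def system_failed(states, bus_failed):
    if bus_failed or UNCOVERED in states.values():
        return True                      # bus loss or any uncovered failure
    procs_down = sum(states[n] == COVERED for n in ("P1", "P2", "P3"))
    mems_down = sum(states[n] == COVERED for n in ("M1", "M2"))
    return procs_down >= 2 or mems_down == 2


def unreliability(component_probs, p_bus):
    total = 0.0
    for combo in product((OK, COVERED, UNCOVERED), repeat=len(names)):
        states = dict(zip(names, combo))
        weight = 1.0
        for n in names:
            weight *= component_probs[n][states[n]]
        for bus_failed, p_b in ((True, p_bus), (False, 1.0 - p_bus)):
            if system_failed(states, bus_failed):
                total += weight * p_b
    return total


# Hypothetical inputs: fault probability 0.01 per component, 0.001 for the bus.
probs = {n: state_probs(0.01, 0.90, 0.09, 0.01) for n in ("P1", "P2", "P3")}
probs.update({n: state_probs(0.01, 0.98, 0.01615, 0.00385) for n in ("M1", "M2")})
print(unreliability(probs, p_bus=0.001))
```

    Setting r = 0, c = 1, s = 0 for every component recovers the ordinary static fault tree result, which is a useful sanity check.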

    V. Dynamic fault trees for modeling sequential behavior


    Sequence dependencies

    Traditional fault trees cannot model sequence dependent failures, in which the order that events occur is important.

    We define special purpose gates for modeling sequence dependencies, and solve the resulting fault tree as a Markov chain.

    The development of a correct Markov model for a complex system can be difficult. Our approach is to use the fault tree for model development and automatically convert the fault tree to the equivalent Markov chain. The dynamic fault tree model is considerably simpler than the equivalent Markov chain.

    Coverage models are automatically added to the resulting Markov chain, which is solved via a numerical differential equation solver.


    Example of sequence dependency

    If the switch fails after the primary fails (and after the spare is activated), then the system is still operational.

    If the switch fails before the primary fails, then the spare cannot be activated and the system fails, even though the spare is operational.

    The failure criteria depend on the order in which failures occur. This system can be solved correctly via a Markov model.

    [Figure: a primary unit and a spare unit connected through a switch]
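
    A minimal Markov model of this switch example can be integrated numerically. The sketch below rests on assumptions that are not in the slides: exponential failure times, a cold spare that cannot fail before activation, a switch that only matters at the moment of switchover, and hypothetical failure rates. The four states are: all working, running on the spare, switch failed while the primary still runs, and system failed.

```python
import math


def derivatives(state, lam_primary, lam_switch, lam_spare):
    """Kolmogorov forward equations for the four-state chain."""
    ok, on_spare, switch_down, failed = state
    return (
        -(lam_primary + lam_switch) * ok,
        lam_primary * ok - lam_spare * on_spare,
        lam_switch * ok - lam_primary * switch_down,
        lam_spare * on_spare + lam_primary * switch_down,
    )


def solve(lam_primary, lam_switch, lam_spare, t_end, steps=2000):
    """Integrate the chain from t = 0 to t_end with classical RK4."""
    rates = (lam_primary, lam_switch, lam_spare)
    state = (1.0, 0.0, 0.0, 0.0)
    h = t_end / steps
    for _ in range(steps):
        k1 = derivatives(state, *rates)
        k2 = derivatives([s + 0.5 * h * k for s, k in zip(state, k1)], *rates)
        k3 = derivatives([s + 0.5 * h * k for s, k in zip(state, k2)], *rates)
        k4 = derivatives([s + h * k for s, k in zip(state, k3)], *rates)
        state = tuple(
            s + h / 6.0 * (a + 2 * b + 2 * c + d)
            for s, a, b, c, d in zip(state, k1, k2, k3, k4)
        )
    return state


def closed_form_failure(lp, ls, lsp, t):
    """Analytic failure probability, valid when lp + ls != lsp."""
    d = lp + ls - lsp
    p_ok = math.exp(-(lp + ls) * t)
    p_switch_down = math.exp(-lp * t) * (1.0 - math.exp(-ls * t))
    p_on_spare = lp / d * (math.exp(-lsp * t) - math.exp(-(lp + ls) * t))
    return 1.0 - p_ok - p_switch_down - p_on_spare


# Hypothetical rates (per hour) and mission time.
ok, on_spare, switch_down, failed = solve(1e-3, 5e-4, 1e-3, t_end=1000.0)
print(f"numerical   Pr[failed] = {failed:.6f}")
print(f"closed form Pr[failed] = {closed_form_failure(1e-3, 5e-4, 1e-3, 1000.0):.6f}")
```

    A static AND of primary and spare would ignore the switch entirely, and a static OR with the switch would count switch failures that happen harmlessly after switchover; the separate "switch failed first" state is what captures the order dependence.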


    Sequence dependency gates

    Several special purpose gates have been added to the traditional fault tree gates. These special dynamic gates capture sequence dependencies which frequently arise when modeling fault tolerant computer systems. If a dynamic gate is part of a fault tree then it is solved via a Markov chain, rather than by using traditional methods.

    The special dynamic gates include:

    - Functional dependency gate for modeling situations where one component's correct operation is dependent upon the correct operation of some other component
    - Spare gate for modeling cold, warm and hot pooled spares
    - Priority-AND gate for modeling ordered ANDing of events. Note that many traditional fault trees include the Priority-AND gate; most simply approximate it with an AND gate


    Functional Dependency gate

    [Figure: FDEP gate with a trigger event input (may be subtree) and dependent basic events]

    Dependent basic events are forced to occur when the trigger event occurs.

    FDEP produces no logical output. Its only effect is on propagating failures.


    Spare gate

    [Figure: SPARE gate whose inputs are the primary component and its spare units]

    Spare units are assumed to fail at a (possibly) reduced rate before being switched into active use.


    Priority-AND gate

    [Figure: Priority-AND gate with inputs A and B]

    Inputs (may be subtrees) which may occur in any order.

    Output occurs if A and B occur, and A occurs before B.
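
    For two independent basic events with exponential failure times, the Priority-AND output probability has a closed form. This is a sketch under those assumptions (the slides do not give a formula); the rates used in the example are hypothetical.

```python
import math


def pand_probability(rate_a, rate_b, t):
    """Pr[A fails before B, and both have failed by time t].

    Obtained by integrating over the failure time u of A:
    integral from 0 to t of rate_a * exp(-rate_a * u)
                            * (exp(-rate_b * u) - exp(-rate_b * t)) du
    """
    total = rate_a + rate_b
    return ((rate_a / total) * (1.0 - math.exp(-total * t))
            - math.exp(-rate_b * t) * (1.0 - math.exp(-rate_a * t)))


def and_probability(rate_a, rate_b, t):
    """Pr[both A and B have failed by time t], in any order."""
    return (1.0 - math.exp(-rate_a * t)) * (1.0 - math.exp(-rate_b * t))


a_then_b = pand_probability(2e-3, 1e-3, 500.0)
b_then_a = pand_probability(1e-3, 2e-3, 500.0)
print(a_then_b, b_then_a, and_probability(2e-3, 1e-3, 500.0))
```

    The two orderings partition the AND event, so their probabilities add up to the plain AND probability. Replacing a Priority-AND gate by an AND gate therefore overestimates the output probability.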


    Cascading Priority-AND gates

    [Figure: cascaded Priority-AND gates over inputs A, B and C; the output represents "A before B before C"]


    HECS: Hypothetical Example Computer System

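    [Figure: HECS block diagram showing processors A1 and A2 with cold spare A; memory units M1-M5 attached through Memory Interface Unit 1 and Memory Interface Unit 2; the redundant bus; and the operator console, operator and software]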


    HECS system description

    HECS consists of dual-redundant processors A1 and A2 and a cold spare which can replace either upon failure. A cold spare is one which is assumed not to fail before being used.

    HECS has 5 memory units; three are required. These memory units are connected to the bus via two memory interface units. If a memory interface unit fails, the memory units connected to it are unusable. Memory unit 3 (M3) is connected to both interfaces for redundancy; thus M3 is accessible as long as either interface unit is operational.

    There is also a human operator who interfaces with the system via a console, and runs some software application.

    HECS requires at least one of the three A processors, at least 3 of the memory units, at least one of the redundant busses, and the operator, console and software to be operating correctly.
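
    The effect of the memory interface units on the 3-of-5 memory requirement can be sketched by enumerating component states. This is a static snapshot at one point in time, not the Markov solution used for dynamic fault trees. The wiring of M1 and M2 to interface unit 1 and of M4 and M5 to interface unit 2 is an assumption (the text only states that M3 is connected to both), and the probabilities are hypothetical.

```python
from itertools import product

# Assumed wiring; only M3's dual connection is stated in the description.
MEMORY_INTERFACES = {
    "M1": ["MIU1"],
    "M2": ["MIU1"],
    "M3": ["MIU1", "MIU2"],
    "M4": ["MIU2"],
    "M5": ["MIU2"],
}


def memory_subsystem_fails(failed):
    """True if fewer than 3 memory units are both working and reachable."""
    usable = 0
    for mem, interfaces in MEMORY_INTERFACES.items():
        reachable = any(miu not in failed for miu in interfaces)
        if mem not in failed and reachable:
            usable += 1
    return usable < 3


def failure_probability(p):
    """Sum the probability of every failing combination of component states."""
    names = list(p)
    total = 0.0
    for combo in product((False, True), repeat=len(names)):
        failed = {n for n, is_failed in zip(names, combo) if is_failed}
        if memory_subsystem_fails(failed):
            weight = 1.0
            for n, is_failed in zip(names, combo):
                weight *= p[n] if is_failed else 1.0 - p[n]
            total += weight
    return total


# Hypothetical failure probabilities.
p = {name: 0.1 for name in MEMORY_INTERFACES}
p.update({"MIU1": 0.02, "MIU2": 0.02})
print(failure_probability(p))
```

    A single interface failure removes two memory units at once but still leaves three usable ones, so it is not by itself a failure of the memory subsystem; it does, however, remove all remaining redundancy.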


    Modeling the cold spares

    Notice that the cold spare is shared between the two processors. The first to fail is replaced with the spare; the spare is then unavailable if the other fails.

    [Figure: "A processors and spare" gate over two Cold Spare gates, one for A1 and one for A2, which share the cold spare A]


    Modeling the memory units

    [Figure: 3/5 gate ("memory units") over M1-M5, with three Functional Dependency gates driven by MIU 1 and MIU 2]


    HECS system level fault tree model

    [Figure: HECS system-level fault tree. The top event, hypothetical system failure, combines four parts: operator, console & SW (operator, console, console software); A processors and spare (Cold Spare gates for A1 and A2 sharing cold spare A); 3/5 memory units (M1-M5, with Functional Dependency gates driven by MIU 1 and MIU 2); and the replicated basic event 2*Bus]


    Example: Fault Tolerant Parallel Processor

    [Figure: processing elements attached to four network elements, NE1-NE4]


    FTPP configuration #1

    One spare per triad

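    [Figure: FTPP configuration #1. Triads A, B, C and D each have three members (A1 to A3, B1 to B3, C1 to C3, D1 to D3) and one spare (AS, BS, CS, DS), attached to network elements NE1 to NE4.]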

    Fault tree for FTPP configuration #1

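    [Figure: fault tree for FTPP configuration #1. Four 3/4 gates, one per triad, over {A1, A2, A3, AS}, {B1, B2, B3, BS}, {C1, C2, C3, CS} and {D1, D2, D3, DS}. Four FDEP gates, triggered by NE1 to NE4, with dependent events grouped as A1 B1 C1 D1, A2 B2 C2 D2, A3 B3 C3 D3 and AS BS CS DS.]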

    FTPP configuration #2

    One spare per NE

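    [Figure: FTPP configuration #2. Each network element hosts members of three different triads plus one spare: NE1 hosts A1, B2, C3 and S1; NE2 hosts A2, B3, D1 and S2; NE3 hosts A3, C1, D2 and S3; NE4 hosts B1, C2, D3 and S4.]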

    Failure conditions for FTPP configuration #2

    Consider, as an example, the first member of the A triad, specifically component A1. A1 will fail if

    both A1 and its spare (S1) fail, OR

    either of the other processors on the same NE (B2 or C3) fails before A1 does, thus using the spare first. In this case there will be no spare available when A1 fails.

    [Figure: fault tree fragment for A1, with inputs A1, S1, B2 and C3.]
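    The failure condition for A1 can be written as a predicate over failure times. The sketch below is a direct transcription of the two conditions on the slide, not a full model of the configuration; the dictionary of failure times, the use of `math.inf` for "never fails", and the function name are assumptions made for illustration.

```python
import math

def a1_member_failed(t_fail, t):
    """True if the A1 position of the A triad is lost by time t.

    t_fail maps component names to failure times (math.inf = no failure).
    """
    a1 = t_fail["A1"]
    if a1 > t:
        return False  # A1 itself is still working
    spare_failed = t_fail["S1"] <= t
    # B2 or C3 failing before A1 consumes the spare S1 first.
    spare_taken = min(t_fail["B2"], t_fail["C3"]) < a1
    return spare_failed or spare_taken
```

    The comparison of failure times is what makes this condition order-dependent: the same set of failed components can leave A1 covered or uncovered depending on which failed first.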

    Fault tree for FTPP configuration #2

    [Figure: fault tree for FTPP configuration #2. Four 2/3 gates, one per triad. Each triad member is modeled with its own processor, the spare on its NE, and the other two processors on that NE. Four FDEP gates: NE1 triggers A1, B2, C3, S1; NE2 triggers A2, B3, D1, S2; NE3 triggers A3, C1, D2, S3; NE4 triggers B1, C2, D3, S4.]

    Alternative Fault Tree for configuration #2

    [Figure: alternative fault tree. Four 2/3 gates over Spare gates for A1 to A3, B1 to B3, C1 to C3 and D1 to D3, with spares AS, BS, CS and DS. Four FDEP gates: NE1 triggers A1, B1, C1, D1; NE2 triggers A2, B2, C2, D2; NE3 triggers A3, B3, C3, D3; NE4 triggers AS, BS, CS, DS.]

    Mission Avionics System Example

    The success/failure of the system is driven by the need to provide certain software functionality:

    crew station management
    scene & obstacle processing
    local path generation
    system management functions
    vehicle management

    Fault tolerance is achieved via redundant processors (hot spares), pools of cold spares and redundant buses.

    MAS system architecture

    [Figure: MAS system architecture. Units attached to the Background Data Bus, Mission Management Bus and Vehicle Management Bus: crew station a/b, scene & obstacle a/b, local path gen. a/b, system mgmt a/b, SPARE 1 and SPARE 2, Memory 1 and Memory 2, vehicle mgmt 1a/1b/2a/2b, VM SPARE 1 and VM SPARE 2.]

    Redundant Software Architecture

    Next consider the path generation (PathGen) and scene&obstacle (S&O) functions:

    Each function needs a single processor to provide full functionality.

    There is also a reduced version of each function that can provide minimum functionality (PathGenMin and S&OMin).

    In the event of a detected software fault in PathGen, the system can switch to PathGenMin (similarly for S&O).

    Further, if there are no longer 2 full processors available, the system will switch to PathGenMin and S&OMin running on a single processor.
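    The switching rules above can be summarized as a small decision function. This is only a sketch of the described behavior: the function name, argument names and return format are hypothetical, and the slides do not say what happens if a minimum version also suffers a software fault, so that case is not modeled.

```python
def select_versions(processors_available, pathgen_fault=False, so_fault=False):
    """Pick which PathGen and S&O versions run, per the rules on the slide.

    Returns None when no processor is left for these two functions.
    """
    if processors_available <= 0:
        return None
    if processors_available == 1:
        # Fewer than two full processors: both minimum versions share one.
        return {"PathGen": "PathGenMin", "S&O": "S&OMin", "processors_used": 1}
    return {
        # A detected software fault in a full version triggers its minimum version.
        "PathGen": "PathGenMin" if pathgen_fault else "PathGen",
        "S&O": "S&OMin" if so_fault else "S&O",
        "processors_used": 2,
    }
```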

    Redundant Software MAS model

    [Figure: redundant software MAS fault tree. Top event Loss. CSP gates for Crew 1a/1b, S&O 1a/1b, Path Gen 1a/1b and Sys Mgt 1a/1b, with spares S1 and S2; CSP gates for S&O SW (S&OSWfull, S&OSWMin) and Path SW (PathSWfull, PathSWMin); an FDEP gate labeled Minimize, with events One Proc and Both Proc.]

    Modular Approach to fault tree analysis

    Given a fault tree model as input:

    Find independent subtrees

    Solve each subtree separately

    Combine the results
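    The three steps can be sketched as a short pipeline. The sketch assumes the independent subtrees have already been identified, that the top gate is an OR over statistically independent modules, and that a module counts as static when it contains only AND, OR and K-of-M gates. The module format, gate labels and the `solvers` dictionary are hypothetical placeholders, not the interface of any particular tool.

```python
STATIC_GATES = {"AND", "OR", "KOFM"}

def classify(gate_types):
    """A module is static if it uses only static gates, dynamic otherwise."""
    return "static" if set(gate_types) <= STATIC_GATES else "dynamic"

def combine_or(module_unreliabilities):
    """Unreliability of an OR gate over independent modules."""
    survive = 1.0
    for p in module_unreliabilities:
        survive *= 1.0 - p
    return 1.0 - survive

def solve_modular(modules, solvers):
    """Solve each module with the solver for its type, then combine."""
    results = {}
    for name, module in modules.items():
        kind = classify(module["gates"])
        results[name] = solvers[kind](module)  # caller-supplied solver
    return combine_or(results.values()), results
```

    In practice the two entries of `solvers` would wrap a combinatorial method and a state-space method respectively; here they are left to the caller.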

    The best of both worlds

    Divide-and-conquer helps keep the models small enough to solve.

    For static modules (containing AND, OR and K-of-M gates), use the fast and efficient BDD (Binary Decision Diagram) approach.

    For dynamic modules, convert to an equivalent Markov model for solution.

    Different solution methods can aid in validation and testing.

    Modularization allows consideration of different solution methods (e.g. simulation).
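    For a static module, the idea behind the BDD approach is Shannon decomposition: pick a basic event, evaluate the tree with that event failed and with it working, and weight the two results by the event's probability. The sketch below applies that decomposition recursively with memoization. It is not a full BDD package (there is no explicit node table), and the nested-tuple tree format and function names are assumptions made for illustration.

```python
from functools import lru_cache

def restrict(node, var, value):
    """Fix one basic event to failed (True) or working (False) and simplify."""
    if isinstance(node, bool):
        return node
    if isinstance(node, str):
        return value if node == var else node
    op, *args = node
    if op == "KOFM":
        k, args = args[0], args[1:]
    elif op == "AND":
        k = len(args)
    elif op == "OR":
        k = 1
    else:
        raise ValueError(f"unsupported gate: {op}")
    kids = [restrict(a, var, value) for a in args]
    failed = sum(1 for c in kids if c is True)
    open_kids = tuple(c for c in kids if not isinstance(c, bool))
    need = k - failed  # how many more inputs must fail
    if need <= 0:
        return True
    if need > len(open_kids):
        return False
    return ("KOFM", need) + open_kids  # every gate is normalized to K-of-M

def failure_probability(tree, p, order):
    """P(top event) for independent basic events with probabilities p[name].

    order must list every basic event that appears in the tree.
    """
    @lru_cache(maxsize=None)
    def walk(node, i):
        if isinstance(node, bool):
            return 1.0 if node else 0.0
        var = order[i]
        high = walk(restrict(node, var, True), i + 1)
        low = walk(restrict(node, var, False), i + 1)
        return p[var] * high + (1.0 - p[var]) * low
    return walk(tree, 0)
```

    Because each event is fixed once along any path, repeated events such as `a` in `("OR", ("AND", "a", "b"), ("AND", "a", "c"))` are handled correctly, which simple gate-by-gate multiplication would not do.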

    Modular Solution of HECS

    [Figure: modular solution of HECS. The top event, hypothetical system failure, is split into four independent subtrees, each labeled static or dynamic. Subtrees 1 and 2 cover the operator, console & SW branch and the A processors and spare branch (Cold Spare gates on A1 and A2); one is labeled static and the other dynamic. Independent subtree 3 (memories: 3/5 memory units with Functional Dependency gates on MIU 1 and MIU 2) is labeled dynamic. Independent subtree 4 (buses: 2*Bus) is labeled static.]

    Sensitivity analysis

    Reliability Analysis tells only part of the story.

    What are the weak points in the system?

    How do my results change with changing input parameters?

    What is the most cost-effective way to improve reliability?

    These questions require sensitivity analysis of reliability analysis results.

    Modular approach to sensitivity analysis

    Sensitivity analysis (also called importance analysis) can use partial derivatives.

    Sensitivity results from different submodels can be easily combined using the chain rule.

    Sensitivity analysis for a BDD is almost free while calculating reliability.

    Sensitivity analysis for a Markov chain is more troublesome, but we have developed an interesting, efficient approximation.
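    The chain-rule combination can be illustrated for the simplest case: a system whose top gate is an OR over independent modules. The sensitivity of system unreliability to a parameter inside module m is then the derivative of the system with respect to that module, times the derivative of the module with respect to the parameter. The OR-gate top level and the function names are assumptions for this sketch; it does not reproduce the approximation mentioned for Markov chains.

```python
def system_unreliability(module_probs):
    """OR gate over independent modules."""
    survive = 1.0
    for p in module_probs:
        survive *= 1.0 - p
    return 1.0 - survive

def d_system_d_module(module_probs, m):
    """Partial derivative of system unreliability w.r.t. module m."""
    prod = 1.0
    for j, p in enumerate(module_probs):
        if j != m:
            prod *= 1.0 - p
    return prod

def chain_rule(module_probs, m, d_module_d_param):
    """Sensitivity of the system to a parameter that lives inside module m."""
    return d_system_d_module(module_probs, m) * d_module_d_param
```

    For example, if module 1 is an AND of two components with failure probabilities 0.2 and 0.3, its derivative with respect to the first is 0.3, and `chain_rule([0.1, 0.06], 1, 0.3)` gives the sensitivity of the whole system to that component.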

    Example: Cardiac Assist System

    [Figure: cardiac assist system block diagram. Components: TEDTS Controller, TEDTS Coil, Battery, Power Supply, Pace Leads, Primary CPU, Backup CPU, Motor Amplifier, Motor Cable, Motor, Pump, Crossbar Switch, WSP.]

    Cardiac Assist System Fault Tree

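    [Figure: cardiac assist system fault tree. Top event System Failure, with branches labeled Pump/Motor, Leads, Power, TEDTS, CPU and Motor Section. Basic events include Pump, Motor, Motor Amp., Motor Cable, Pace Leads, Power Supply, Battery, TEDTS Coil, TEDTS Contr., Primary CPU, Backup CPU, Crossbar Switch and System Superv. The tree includes an FDEP gate with a Trigger input and a WSP gate; modules are labeled M1 to M5.]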

    Summary

    The DFT (dynamic fault tree) methodology is ideally suited for the analysis of computer-based systems.

    DFT uses a modular approach to FTA, detecting modules using a fast and efficient algorithm.

    Modules are classified as static or dynamic, depending on the types of gates included.

    Static modules are solved using the BDD approach; dynamic modules are solved using Markov chain methods.

    Coverage models can assess the effect of complex recovery mechanisms.

    Dynamic gates can allow modeling of sequence dependencies that arise from complex redundancy management.

    Software for Dynamic Fault Tree Analysis

    Galileo/ASSAP is a software package for fault tree analysis which embodies the DFT approach.

    (Being developed for NASA Langley Research Center, expected completion Nov. 2001. Beta version available for evaluation.)