Dugan Comp Sys Fta Tutor

8/14/2019 Dugan Comp Sys Fta Tutor


RELIABILITY and MAINTAINABILITY Symposium

Fault Tree Analysis of Computer-Based Systems

Joanne Bechta Dugan, Professor of Electrical & Computer Engineering

University of Virginia ([email protected])




Presentation Outline

I. Introduction to fault trees

II. Fault tree analysis of an example control system

III. Fault trees as design aid for software systems

IV. Adapting the fault tree to analysis of computer-based systems

V. Dynamic fault trees for modeling sequential behavior

VI. Modular approach to fault tree analysis

VII. Sensitivity analysis

VIII. Summary and Conclusions




Introduction to fault tree analysis

Fault trees provide a good framework for both qualitative and quantitative analysis because they have both a logical (boolean algebra) and probabilistic basis.

What is a fault tree?

- not a tree (in the graph-theoretic sense)
- a graphical representation of a logical function
- shows the logical relationship between an event (failure) and its causes
- provides a logical framework for expressing combinations of component failures that can lead to system failure




Why use fault tree analysis?

A fault tree model provides a logical framework for analyzing the failure behavior of a system.

A fault tree model precisely documents which failure scenarios have been considered and which have not.

Fault tree analysis can be used to support engineering and management decisions, trade-off analysis and risk assessment.

The fault tree model has a well-defined boolean algebraic and probabilistic basis which relates probability calculations to boolean logic functions.




Basic Static Fault Tree Constructs

Basic Events

Basic Event: corresponds to a basic failure event (usually a component failure) in the system. Characterized by failure rate or failure probability.

Undeveloped Basic Event: a basic event that is not completely developed, usually because of unavailable information.

Replicated Basic Event (drawn as "k * name"): represents k statistically identical copies of a component.





Static fault tree gates

AND gate - output event occurs only if ALL input events occur

OR gate - output event occurs if one or more input events occur

m/n gate - output event occurs if m or more of the n inputs occur
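The three gate definitions can be turned into probability calculations when the inputs are statistically independent. The following is a minimal sketch under that assumption; the function names are illustrative and do not come from the tutorial.

```python
from math import prod


def and_gate(probs):
    """Probability that ALL independent input events occur."""
    return prod(probs)


def or_gate(probs):
    """Probability that one or more independent input events occur."""
    return 1.0 - prod(1.0 - p for p in probs)


def m_of_n_gate(m, probs):
    """Probability that at least m of the independent inputs occur."""
    n = len(probs)
    # dist[k] = probability that exactly k of the inputs seen so far occurred
    dist = [1.0] + [0.0] * n
    for p in probs:
        for k in range(n, 0, -1):
            dist[k] = dist[k] * (1.0 - p) + dist[k - 1] * p
        dist[0] *= 1.0 - p
    return sum(dist[m:])
```

With this formulation the OR gate is the 1/n case and the AND gate is the n/n case of the m/n gate.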




Example Fault Tree

Washing Machine Overflows (top event, OR gate):

- valve stuck open
- fill mode too long (AND gate):
  - timeout control failed
  - full sensor failed

Structure Function:

Fail = valve_failed OR (timer_failed AND sensor_failed)

F = A + BC




Probabilistic Fault Tree Analysis of Example

F = A + BC

Suppose (all failures are independent)

Pr[A] = 0.01

Pr[B] = 0.05

Pr[C] = 0.075

Pr[F] = Pr[A + BC]

= Pr[A] + Pr[BC] - Pr[ABC]

= 0.01 + 0.00375 - 0.0000375

= 0.0137125
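The arithmetic on this slide can be checked with a few lines of code. This is only a sketch of the calculation for F = A + BC with independent events; the function name is illustrative.

```python
def top_event_probability(p_a, p_b, p_c):
    """Pr[A + BC] for independent basic events A, B and C."""
    p_bc = p_b * p_c                 # Pr[BC]
    return p_a + p_bc - p_a * p_bc   # Pr[A] + Pr[BC] - Pr[ABC]


pr_f = top_event_probability(0.01, 0.05, 0.075)
print(f"Pr[F] = {pr_f:.7f}")
```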




Fault Tree Analysis - Cutsets

Most fault tree analysis techniques start with the generation of cutsets.

A cutset is a set of basic events; if all the basic events in a cutset occur, then the top event (system failure) occurs.

A mincut (minimum cutset) is one that contains no redundant elements. If an element is removed from a mincut, it ceases to be a cutset.

[Figure: washing machine fault tree (valve stuck open; fill mode too long = timeout control failed AND full sensor failed)]

Cutsets:

{valve} (single point of failure)

{timer, sensor}




Cutset generation by example

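[Figure: fault tree with gates G1-G5 and basic events A1-A5]

Top-down expansion, starting from the top gate G1:

G1 -> G2, G3

G2 -> {G4,G5} -> {A1,G5}, {A2,G5} -> {A1,A3}, {A2,A3}, {A2,A4}, {A1,A4}

G3 -> {A4,A5}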
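The top-down expansion can be sketched in code. The gate structure below (G1 = G2 OR G3, G2 = G4 AND G5, G3 = A4 AND A5, G4 = A1 OR A2, G5 = A3 OR A4) is an assumption inferred from the cutset listing on this slide, since the figure itself is not recoverable from the text; the function names are illustrative.

```python
# Gate structure inferred from the slide's cutset listing (an assumption).
GATES = {
    "G1": ("OR", ["G2", "G3"]),
    "G2": ("AND", ["G4", "G5"]),
    "G3": ("AND", ["A4", "A5"]),
    "G4": ("OR", ["A1", "A2"]),
    "G5": ("OR", ["A3", "A4"]),
}


def cutsets(node):
    """Return the cutsets of `node` as a set of frozensets of basic events."""
    if node not in GATES:                      # basic event
        return {frozenset([node])}
    kind, inputs = GATES[node]
    child_sets = [cutsets(child) for child in inputs]
    if kind == "OR":                           # any child's cutset suffices
        return set().union(*child_sets)
    result = {frozenset()}                     # AND: combine one cutset per child
    for sets in child_sets:
        result = {a | b for a in result for b in sets}
    return result


def minimal(sets):
    """Drop every cutset that strictly contains another cutset."""
    return {s for s in sets if not any(other < s for other in sets)}


mincuts = minimal(cutsets("G1"))
for cut in sorted(sorted(c) for c in mincuts):
    print(cut)
```

Under the assumed structure this yields the five two-element cutsets listed on the slide.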




Probabilistic Analysis using Cutsets

The probability of system failure is simply the probability that one or more of the cutsets occur.

But the cutsets are not disjoint, so we cannot sum their individual probabilities. We must account for the overlap of the events.

Prob(failure) = Prob(valve failure) + Prob(both timer and sensor fail) - Prob(all three fail)

[Figure: washing machine fault tree (valve stuck open; fill mode too long = timeout control failed AND full sensor failed)]

Cutsets:

{valve} (single point of failure)

{timer, sensor}




Probabilistic Analysis using Inclusion-Exclusion

Pr(A or B or C) = Pr(A) + Pr(B) + Pr(C) - Pr(A and B) - Pr(B and C) - Pr(A and C) + Pr(A and B and C)

[Figure: Venn diagram of events A, B and C with overlap regions AB, BC, AC and ABC]
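Inclusion-exclusion applies directly to a list of cutsets: the probability of a conjunction of cutsets is the product over the union of their basic events, provided the basic events are independent. The sketch below assumes independence and uses the washing machine probabilities from the earlier slide; the function name is illustrative.

```python
from itertools import combinations
from math import prod


def inclusion_exclusion(cutsets, prob):
    """Probability that at least one cutset occurs (independent basic events)."""
    total = 0.0
    for k in range(1, len(cutsets) + 1):
        sign = 1.0 if k % 2 == 1 else -1.0
        for group in combinations(cutsets, k):
            events = frozenset().union(*group)   # all events in the conjunction
            total += sign * prod(prob[e] for e in events)
    return total


prob = {"valve": 0.01, "timer": 0.05, "sensor": 0.075}
pr_fail = inclusion_exclusion([{"valve"}, {"timer", "sensor"}], prob)
print(f"Prob(failure) = {pr_fail:.7f}")
```

The number of terms grows exponentially with the number of cutsets, which is one reason the later slides turn to other methods.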




Probabilistic Analysis using Sum-of-Disjoint-Products

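Pr(A or B or C) = Pr(A) + Pr(not A and B) + Pr(not A and not B and C)

[Figure: Venn diagram partitioned into the disjoint regions A, (not A) B, and (not A)(not B) C]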
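For independent events the disjoint-product expansion on this slide becomes a running product. The sketch below assumes independent events A, B, C, ... and is only meant to illustrate the expansion; the function name is illustrative.

```python
def sum_of_disjoint_products(probs):
    """Pr(A or B or C ...) = Pr(A) + Pr(not A)Pr(B) + Pr(not A)Pr(not B)Pr(C) + ..."""
    total = 0.0
    none_so_far = 1.0  # probability that no earlier event occurred
    for p in probs:
        total += none_so_far * p
        none_so_far *= 1.0 - p
    return total


print(sum_of_disjoint_products([0.01, 0.05, 0.075]))
```

Because the terms are disjoint they can simply be added, with no alternating-sign corrections.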




Probabilistic Analysis using Binary Decision Diagrams

[Figure: a fault tree model over basic events A1, A2, B1 and B2, shown next to its BDD representation with terminal nodes 0 and 1]
















An example control system including software

Consider a simple tank level and flow control system.

The key features of this system are:

- a water tank, fed by a water pump on the inflow and regulated by control and stop valves on the inflow and outflow pipes
- a tank and level control system with three sensors (level, inflow and outflow) implemented in software
- a tank bypass to prevent overflow, controlled by the three stop valves
- valve actuation and control implemented in software




Function of example system

The function is to maintain the water level and downstream flow rate at particular values by opening and closing control valves cv1 and cv2.

The controller receives inputs from the three sensors, implements the control logic and then gives commands to the two control valves and the two stop valves.

(This example is adapted from an example in: S. Guarro, M. Yau and M. Motamed, Development of tools for safety analysis of control software in advanced reactors, NUREG/CR-6465, April 1996.)




Diagram of example system

[Figure: example system. Components shown: pump, control valves cv1 and cv2, two flow sensors, level sensor, two check valves, digital controller, normally open valve v1, normally closed valve v2, normally open valve v3]




Control Flow

Measurements: level, upstream flow, downstream flow

Compare level with level set points:

- level too high: drain water from tank (close v1, open v2, open v3, cv1 = min, cv2 = max)
- level within bounds: calculate control valve positions for flow set point (open v1, close v2, open v3, calculate cv1, calculate cv2)
- level too low: replenish water in tank (open v1, close v2, close v3, cv1 = max, cv2 = min)
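The control flow can be sketched as a single decision function. The pairing of each command group with a branch (drain, within bounds, replenish) is an assumption read from the flowchart labels, and the function name, set-point parameters and the "calculate" placeholder are illustrative; the actual cv1/cv2 calculation is not given in the slides.

```python
def select_valve_commands(level, low_set_point, high_set_point):
    """Map a level measurement to stop-valve and control-valve commands."""
    if level > high_set_point:
        # level too high: drain water from tank
        return {"v1": "close", "v2": "open", "v3": "open",
                "cv1": "min", "cv2": "max"}
    if level < low_set_point:
        # level too low: replenish water in tank
        return {"v1": "open", "v2": "close", "v3": "close",
                "cv1": "max", "cv2": "min"}
    # level within bounds: control valve positions follow the flow set point
    return {"v1": "open", "v2": "close", "v3": "open",
            "cv1": "calculate", "cv2": "calculate"}


print(select_valve_commands(level=9.5, low_set_point=2.0, high_set_point=8.0))
```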




Software Operational modes

[Figure: software operational modes by tank level (very low, low, normal, high, very high), bounded by Underflow and Overflow. Legend: correct operation, non-critical failure behavior, critical failure behavior]




System Failure modes

The control system is not designed to be fault tolerant, so that most hardware failures present either an overflow or underflow hazard. These are the two system-level failure modes being considered. If either of the control valves or any of the three sensors fail, the system fails, as the software will be unable to control the system. A tank leak, pipe leak or pump failure are also considered single points of failure.

Further, an underflow can occur if valve v1 or v2 fails, thus preventing proper inflow, unless valve v3 can be closed. Therefore, if v3 and either v1 or v2 fail, the system fails. Overflow can occur if valve v3 fails, thus preventing outflow, unless v1 can be closed. Thus the failure of both v1 and v3 leads to system failure.




Software Failure Modes

Software failures have been characterized as an improper change of state. Software failures are further classified as critical and non-critical.

Non-critical software failures lead to less than optimal performance but do not lead to system failure.

A software failure when the tank water level is very low or very high can lead to system failure.




Fault Tree Model for Example System

[Figure: fault tree for the example system with Overflow and Underflow subtrees. Events shown: SW-VH-Failure, SW-VL-Failure, Stop Valve 1 with Stop Valve 3, Stop Valve 2 with Stop Valve 3, control valves (Control Valve 1, Control Valve 2), level sensor, inflow sensor, outflow sensor, mechanical systems (pump failure, pipe leak, tank leak), Processor]
















Fault trees as a design aid for software systems

Fault tree analysis can help to ensure that the software system does not do what it is not supposed to do. (As contrasted with a formal design review, which helps ensure that the software does what it is supposed to do.)

For robust software systems, fault trees can help identify high-risk areas (either quantitatively or qualitatively).

Risk can be managed by preventive or protective measures applied to identified high-risk areas:

- exhaustive testing
- formal methods
- exception handling
- acceptance tests
- interlocks
- redesign




Using fault trees to manage risk

AND gates can be protected by disallowing one of the inputs [1]:

- exhaustive testing or formal proof to show module cannot fail
- test for failure condition and provide recovery routine

OR gates can be protected by disallowing all inputs or by providing a detection and recovery point. (The detection and recovery routines must be simple enough to be certifiably correct.)

[1] Herbert Hecht and Myron Hecht, Fault Tolerant Software. In D.K. Pradhan, editor, Fault-Tolerant Computing: Theory and Techniques, volume 2, pages 658-696. Prentice-Hall, 1986.




Example Risk Mitigation

[Figure: fault tree with gates G1-G5 and basic events A1-A5]

Suppose basic events represent software modules.

Can protect G3 by preventing failure of module A4

- exhaustive testing
- proof of correctness

Then can protect G2 by

- preventing failure of A3
- or by providing detection and recovery handler for G4
- or by preventing failure of both A1 and A2
















Modeling Fault-Tolerant Computer Systems

Fault Tolerant Computer (FTC) systems can actively handle many faults and errors that may occur.

Because FTC systems are adaptive and flexible, faulty components can be switched out automatically, and spares switched in.

However, adaptability and flexibility often result in increased complexity. Increased complexity can mean decreased reliability.

If the fault tolerance mechanisms (error detection, recovery, reconfiguration) fail, this failure could lead to overall system failure, even if adequate functioning resources remain.

A coverage model is used to analyze the behavior of the computer system in the presence of a fault. The results of the coverage model are then incorporated into the overall system model.




Covered vs. Uncovered Faults

A covered fault is one from which the system can automatically recover.

- Recovery from a transient fault does not change the system state.
- Recovery from a permanent fault discards the faulty component.

An uncovered fault is one which leads to immediate system failure, regardless of the state of the system.




How to estimate coverage?

If a working or prototype version of the system exists, or if enough information is available about a system being designed, then coverage probabilities can be estimated.

A model of the recovery process can be developed. The parameters for the model can be measured from fault injection on the working prototype and/or estimated from data collected in the field.

A detailed simulation model of the system recovery process can be developed.

If the details of the recovery process are not known, reasonable parameters can be deduced from other, similar systems.




General structure of a coverage model

The entry point to the model is the occurrence of the fault, and the three exits (R, C, and S) are the three possible outcomes.

[Figure: coverage model. Entry: fault occurs. R exit: transient restoration (covered fault). C exit: permanent coverage (covered failure). S exit: single-point failure (uncovered failure)]




R exit: Transient Restoration

Correct recognition of and recovery from a transient fault. A transient is usually caused by external or environmental factors, such as excessive heat or a glitch in the power line.

The vast majority of faults are transient.

Successful recovery from a transient fault restores the system to an operational state without discarding any components - for example by masking the error, retrying an instruction, or rolling back to a previous checkpoint.

Reaching this exit successfully requires:

- timely detection of an error produced by the fault;
- performance of an effective recovery procedure; and
- swift disappearance of the fault (the cause of the error).




C exit: Permanent coverage

Determination of the permanent nature of the fault, and the successful isolation and removal of the faulty component.

S exit: Single Point failure

A single fault causes the system to fail, generally when an undetected error propagates through the system, or if the faulty unit cannot be isolated and the system cannot be reconfigured.




Typical fault recovery for a processor

A processor contains built-in test circuitry so that error checking occurs concurrently with instruction execution. If an error is detected, the instruction is retried immediately. Partial results are stored in case the retry is unsuccessful, so that the computation can be continued from some intermediate point (called a checkpoint).

The process of continuing a computation from a previously saved checkpoint is called a rollback. In some cases the fault is such that the rollback is not successful, so the computation must start over after a system-level recovery procedure is invoked.




Example coverage model for processors

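[Figure: coverage model for processors. The Wait, Retry and Rollback steps lead to Exit R (Transient Restoration); otherwise Permanent Recovery leads either to Exit C (Permanent Coverage) or to Exit S (Failure)]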




Example of transient restoration

Transient Restoration attempt: Assume that the fault is transient, and begin a multi-step recovery procedure that continues as long as an error is detected. If an error persists after all three steps have been performed, then a permanent recovery procedure must be invoked.

Step 1: Wait for 0.1 second and do nothing. If the fault is transient it may disappear during this time, allowing rollback to succeed.

Step 2: Retry the current instruction several times, for as long as a half-second. The probability that the retry will be successful (i.e., no error is detected) is 0.5.

Step 3: If an error persists, perform a rollback to a previous checkpoint, followed by recomputation, taking 2 sec. total. The rollback succeeds in removing the error 80% of the time.




Example of permanent coverage and single-point failure

If an error still persists after the rollback, it is assumed to be caused by a permanent fault, and a system level permanent fault recovery process is begun, to remove the offending processor from the set of active units and to reconfigure the system to continue without it. The permanent fault recovery process succeeds with probability 0.875.

The permanent coverage procedure is invoked against a persistent transient fault as well as against a permanent fault.

If the permanent fault recovery process fails, then a single-point failure is said to occur.
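The step probabilities quoted for the processor example can be chained into rough exit probabilities. The sketch below is a simplification and not the tutorial's coverage model: it assumes an error is still present after the wait step, treats the quoted success probabilities (0.5, 0.8, 0.875) as applying to every fault, and ignores the step durations. The function name is illustrative.

```python
def processor_exit_probabilities(p_retry=0.5, p_rollback=0.8, p_permanent=0.875):
    """Rough (r, c, s) exit probabilities from the quoted step probabilities."""
    # R exit: retry succeeds, or retry fails and rollback succeeds
    r = p_retry + (1.0 - p_retry) * p_rollback
    # the error persists through both retry and rollback
    persists = (1.0 - p_retry) * (1.0 - p_rollback)
    c = persists * p_permanent            # C exit: permanent recovery succeeds
    s = persists * (1.0 - p_permanent)    # S exit: permanent recovery fails
    return r, c, s


r, c, s = processor_exit_probabilities()
print(f"r = {r:.4f}, c = {c:.4f}, s = {s:.4f}")
```

The three values sum to one because every fault leaves the model through exactly one exit.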




Coverage model for memories

Error occurs:

- Single bit memory error (0.98): error masked in zero time -> Transient Restoration (Exit R)
- Multiple bit memory error (0.02):
  - not detected (0.05) -> Failure (Exit S)
  - detected (0.95) -> attempt recovery:
    - successful (0.85) -> Permanent Coverage (Exit C)
    - unsuccessful (0.15) -> Failure (Exit S)




Example of recovery process for memory faults

The memory uses an error correcting code, so a single-bit error is always detectable and correctable, and no reconfiguration is required. If 98% of all memory faults affect only a single bit, then the probability of reaching the R exit is 0.98.

The 2% of faults that affect more than one memory bit are 95% detectable. When a multiple memory error is detected, the affected portion of memory is discarded, the memory mapping function is updated, and the needed information is reloaded from a previous checkpoint and updated to represent the current state of the system.

Experimentation on a prototype system revealed that this recovery from the detected multiple memory errors works 85% of the time.

Thus, the probability of reaching the C exit is the probability that a multiple fault occurs, is detected, and is recovered from:

c = 0.02 x 0.95 x 0.85 = 0.01615




Single point failure for memory faults

There are two paths to the single point failure exit.

The memory fault causes a single-point failure if a multiple-bit error is not detected (with probability 0.02 x 0.05).

A multiple-bit memory error is detected, but the attempted recovery is not successful, with probability 0.02 x 0.95 x 0.15.

Thus the probability of single point failure is the sum of these two cases, or 0.00385.
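The three exit probabilities of the memory coverage model follow from the branch probabilities quoted in the example. The sketch below just reproduces that arithmetic; the constant and variable names are illustrative.

```python
P_SINGLE_BIT = 0.98     # fraction of memory faults affecting a single bit
P_DETECT_MULTI = 0.95   # multiple-bit errors that are detected
P_RECOVER = 0.85        # detected multiple-bit errors recovered successfully

p_multi = 1.0 - P_SINGLE_BIT

r = P_SINGLE_BIT                                   # R exit: masked by the ECC
c = p_multi * P_DETECT_MULTI * P_RECOVER           # C exit: detected and recovered
s = (p_multi * (1.0 - P_DETECT_MULTI)              # S exit: not detected ...
     + p_multi * P_DETECT_MULTI * (1.0 - P_RECOVER))  # ... or recovery fails

print(f"r = {r:.5f}, c = {c:.5f}, s = {s:.5f}")
```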




Example - 3P2M system

3 processors and 2 memories connected via a bus. 2 processors, one memory and the bus are needed for correct operation.

[Figure: three processors and two memories attached to a shared bus]




3P2M fault tree

System Failure (OR gate):

- Processors: 2/3 gate over P1, P2, P3
- Memories: AND gate over M1, M2
- Bus: B




Adding coverage to fault tree

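[Figure: 3P2M fault tree (System Failure over Processors 2/3, Memories, Bus) with a coverage model attached to each of the basic events P1, P2, P3, M1 and M2; each coverage model has R, C and S exits]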




Probabilities for covered and uncovered basic events

Note that the events fail covered and fail uncovered are mutually exclusive (i.e. not independent).

[Figure: a component's probability space split into three regions: no fault or transient restoration, covered fault, uncovered fault]

Pr[component fault] = p

Pr[component operational] = q

Pr[covered failure] = cp

Pr[uncovered failure] = sp

Pr[no fault or transient restoration] = q + rp
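One way to see how these three-way probabilities enter the 3P2M model is brute-force enumeration over component outcomes. The sketch below is an illustration, not the tutorial's solution method: it assumes independent components, treats the bus as a plain basic event without a coverage model, and counts the system as failed if any fault is uncovered, if two or more processors have failed, or if both memories have failed. Function and parameter names are illustrative.

```python
from itertools import product

STATES = ("ok", "covered", "uncovered")


def component_states(p, r, c, s):
    """Three mutually exclusive outcomes for one component with fault probability p."""
    q = 1.0 - p
    return {"ok": q + r * p, "covered": c * p, "uncovered": s * p}


def system_failure_probability(p_proc, p_mem, p_bus, proc_cov, mem_cov):
    """proc_cov and mem_cov are (r, c, s) tuples from the coverage models."""
    proc = component_states(p_proc, *proc_cov)
    mem = component_states(p_mem, *mem_cov)
    fail = 0.0
    for states in product(STATES, repeat=5):      # P1, P2, P3, M1, M2
        prob = 1.0
        for i, st in enumerate(states):
            prob *= (proc if i < 3 else mem)[st]
        procs_failed = sum(st != "ok" for st in states[:3])
        mems_failed = sum(st != "ok" for st in states[3:])
        exhausted = procs_failed >= 2 or mems_failed == 2
        if "uncovered" in states or exhausted:
            fail += prob
    # the bus fails independently of the processors and memories
    return 1.0 - (1.0 - p_bus) * (1.0 - fail)
```

With perfect coverage (c = 1, r = s = 0) the result reduces to the ordinary static fault tree value; any s > 0 increases it.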
















Sequence dependencies

Traditional fault trees cannot model sequence dependent failures, in which the order that events occur is important.

We define special purpose gates for modeling sequence dependencies, and solve the resulting fault tree as a Markov chain.

The development of a correct Markov model for a complex system can be difficult. Our approach is to use the fault tree for model development and automatically convert the fault tree to the equivalent Markov chain. The dynamic fault tree model is considerably simpler than the equivalent Markov chain.

Coverage models are automatically added to the resulting Markov chain, which is solved via a numerical differential equation solver.




Example of sequence dependency

If the switch fails after the primary fails (and after the spare is activated) then the system is still operational.

If the switch fails before the primary fails, then the spare cannot be activated and the system fails, even though the spare is operational.

The failure criterion depends on the order in which failures occur. This system can be solved correctly via a Markov model.

[Figure: primary and spare units connected through a switch]
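The order dependence can be made concrete with a small function over failure times. This is a sketch under stated assumptions: each unit has a single absolute failure time, the spare can fail while waiting (as a powered spare would), and switching is instantaneous. The function name is illustrative.

```python
def system_failure_time(t_switch, t_primary, t_spare):
    """Time at which the primary/spare pair behind the switch is lost."""
    if t_switch < t_primary:
        # switch already failed: the spare cannot be activated
        return t_primary
    # spare takes over at t_primary, unless it has already failed by then
    return max(t_primary, t_spare)


# Same three failure times, different order of switch and primary failure
print(system_failure_time(t_switch=5.0, t_primary=10.0, t_spare=20.0))
print(system_failure_time(t_switch=15.0, t_primary=10.0, t_spare=20.0))
```

A static AND/OR combination of the three events cannot distinguish the two calls above, which is why a sequence-aware model is needed.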




Sequence dependency gates

Several special purpose gates have been added to the traditional fault tree gates. These special dynamic gates capture sequence dependencies which frequently arise when modeling fault tolerant computer systems. If a dynamic gate is part of a fault tree then it is solved via a Markov chain, rather than by using traditional methods.

The special dynamic gates include:

- Functional dependency gate for modeling situations where one component's correct operation is dependent upon the correct operation of some other component
- Spare gate for modeling cold, warm and hot pooled spares
- Priority-AND gate for modeling ordered ANDing of events. Note that many traditional fault trees include the Priority-AND gate; most simply approximate it with an AND gate




Functional Dependency gate

[Figure: FDEP gate with a trigger event (may be subtree) and dependent basic events]

Dependent basic events are forced to occur when the trigger event occurs.

FDEP produces no logical output. Its only effect is on propagating failures.




Spare gate

[Figure: SPARE gate with a primary component and spare units]

Spare units are assumed to fail at a (possibly) reduced rate before being switched into active use.




Priority-AND gate

[Figure: Priority-AND gate with inputs A and B (may be subtrees), which may occur in any order]

Output occurs if A and B occur, and A occurs before B.
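For two independent inputs with exponentially distributed failure times, the Priority-AND output probability has a closed form, which shows how much a plain AND gate overestimates it. The exponential assumption and the rate parameters are illustrative choices, not part of the tutorial.

```python
from math import exp


def pand_probability(rate_a, rate_b, t):
    """Pr[A fails before B and both have failed by time t]."""
    total = rate_a + rate_b
    return ((rate_a / total) * (1.0 - exp(-total * t))
            - exp(-rate_b * t) * (1.0 - exp(-rate_a * t)))


def and_probability(rate_a, rate_b, t):
    """Pr[both A and B have failed by time t], in any order."""
    return (1.0 - exp(-rate_a * t)) * (1.0 - exp(-rate_b * t))


p_ordered = pand_probability(0.001, 0.002, 100.0)
p_any_order = and_probability(0.001, 0.002, 100.0)
print(p_ordered, p_any_order)
```

The two orderings (A before B, B before A) partition the AND event, so their Priority-AND probabilities add up to the AND probability.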




Cascading Priority-AND gates

[Figure: cascaded Priority-AND gates over inputs A, B and C; output occurs if A occurs before B and B occurs before C]



HECS: Hypothetical Example Computer System

[Figure: HECS architecture. Processors A1 and A2 with Cold Spare A; redundant bus; Memory Interface Unit 1 and Memory Interface Unit 2 connecting memory units M1-M5; operator console; operator and software]



HECS system description

HECS consists of dual-redundant processors A1 and A2 and a cold spare which can replace either upon failure. A cold spare is one which is assumed not to fail before being used.

HECS has 5 memory units; three are required. These memory units are connected to the bus via two memory interface units. If a memory interface unit fails, the memory units connected to it are unusable. Memory unit 3 (M3) is connected to both interfaces for redundancy; thus M3 is accessible as long as either interface unit is operational.

There is also a human operator who interfaces with the system via a console, and runs some software application.

HECS requires at least one of the three A processors, at least 3 of the memory units, at least one of the redundant busses, and the operator, console and software to be operating correctly.
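The memory requirement combines a functional dependency on the interface units with a 3-of-5 condition. The sketch below assumes M1 and M2 hang off interface unit 1 and M4 and M5 off interface unit 2 (the description only states the M3 wiring explicitly), and uses illustrative names such as "MIU1".

```python
def usable_memory_units(failed):
    """Memory units that are working and reachable, given a set of failed components."""
    miu1_ok = "MIU1" not in failed
    miu2_ok = "MIU2" not in failed
    reachable = {
        "M1": miu1_ok,             # assumed to be on interface unit 1
        "M2": miu1_ok,             # assumed to be on interface unit 1
        "M3": miu1_ok or miu2_ok,  # connected to both interface units
        "M4": miu2_ok,             # assumed to be on interface unit 2
        "M5": miu2_ok,             # assumed to be on interface unit 2
    }
    return {m for m, ok in reachable.items() if ok and m not in failed}


def memory_subsystem_failed(failed):
    """HECS needs at least 3 usable memory units."""
    return len(usable_memory_units(failed)) < 3


print(memory_subsystem_failed({"MIU1"}))         # M3, M4, M5 remain
print(memory_subsystem_failed({"MIU1", "M3"}))   # only M4, M5 remain
```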



Modeling the cold spares

Notice that the cold spare is shared between the two processors. The first to fail is replaced with the spare; the spare is then unavailable if the other fails.

[Figure: two Cold Spare gates, one for A1 and one for A2, sharing spare A and feeding the gate "A processors and spare"]



Modeling the memory units

[Figure: memory units M1-M5 under a 3/5 gate ("memory units"), with Functional Dependency gates triggered by MIU 1 and MIU 2]



HECS system level fault tree model

[Figure: HECS fault tree. Top event: hypothetical system failure. Inputs: operator, console & SW (operator, console, console software); A processors and spare (A1, A2, Cold Spare A); 3/5 memory units (M1-M5 with Functional Dependency gates triggered by MIU 1 and MIU 2); 2*Bus]



Example: Fault Tolerant Parallel Processor

[Figure: FTPP architecture with processing elements attached to four network elements NE1-NE4]

FTPP configuration #1





One spare per triad

[Figure: FTPP configuration #1. Triads A, B, C and D each have members 1-3 plus a spare (AS, BS, CS, DS), distributed over network elements NE1-NE4]

Fault tree for FTPP configuration #1




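[Figure: fault tree for FTPP configuration #1. One 3/4 gate per triad over its three members and spare (A1 A2 A3 AS; B1 B2 B3 BS; C1 C2 C3 CS; D1 D2 D3 DS). FDEP gates triggered by NE1, NE2, NE3 and NE4 act on (A1 B1 C1 D1), (A2 B2 C2 D2), (A3 B3 C3 D3) and (AS BS CS DS)]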




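One spare per NE

[Figure: FTPP configuration with one spare per network element. NE1: A1, B2, C3 and spare S1; NE2: A2, B3, D1 and spare S2; NE3: A3, C1, D2 and spare S3; NE4: B1, C2, D3 and spare S4]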

Failure conditions for FTPP configuration #2



Consider as an example the first member of the A triad, specifically component A1. Now A1 will fail if

- both A1 and its spare (S1) fail, OR
- either of the other processors on the same NE fails before A1 does, thus using the spare first. In this case there will be no spare available when A1 fails.

[Figure: fault tree for A1 with inputs A1, S1, B2 and C3]




[Figure: fault tree for FTPP configuration #2. The failure of each processor combines the processor, the spare on its NE and the other two processors on that NE; each triad feeds a 2/3 gate. FDEP gates: NE1 triggers A1, B2, C3, S1; NE2 triggers A2, B3, D1, S2; NE3 triggers A3, C1, D2, S3; NE4 triggers B1, C2, D3, S4]

Alternative Fault Tree for configuration #2




[Figure: alternative fault tree built from Spare gates, one per processor (A1-A3, B1-B3, C1-C3, D1-D3), with spares AS, BS, CS and DS, one 2/3 gate per triad, and FDEP gates triggered by NE1-NE4]

Mission Avionics System Example




The success/failure of the system is driven by the need to provide certain software functionality:

- crew station management
- scene & obstacle processing
- local path generation
- system management functions
- vehicle management

Fault tolerance is achieved via redundant processors (hot spares), pools of cold spares and redundant buses.




MAS system architecture

[Figure: MAS architecture. Buses: Background Data Bus, Mission Management Bus, Vehicle Management Bus. Units: Memory 1, Memory 2; crew station a/b; scene & obstacle a/b; local path gen. a/b; system mgmt a/b; SPARE 1, SPARE 2; vehicle mgmt 1a, 1b, 2a, 2b; VM SPARE 1, VM SPARE 2]



Redundant Software Architecture




Next consider the path generation (PathGen) and scene & obstacle (S&O) functions:

Each function needs a single processor to provide full functionality.

There is also a reduced version of each function that can provide minimum functionality (PathGenMin and S&OMin).

In the event of a detected software fault in PathGen the system can switch to PathGenMin (similarly for S&O).

Further, if there are no longer 2 full processors available, the system will switch to PathGenMin and S&OMin running on a single processor.

Redundant Software MAS model




[Figure: MAS fault tree with redundant software. Top event: Loss. Subtrees for Crew, Sys Mgt, Path Gen and S&O, each with 1a and 1b units on CSP gates and shared spares S1 and S2; Path SW and S&O SW each modeled as a CSP gate over a full and a Min version; an FDEP gate labeled "Minimize" with "One Proc" and "Both Proc" events]












Modular Approach to fault tree analysis




- Given a fault tree model as input
- Find independent subtrees
- Solve each subtree separately
- Combine the results

The best of both worlds




Divide-and-conquer helps produce models that may not be too large to solve.

For static modules (containing AND, OR, K-of-M gates), use the fast and efficient BDD (Binary Decision Diagram) approach.

For dynamic modules, convert to equivalent Markov model for solution.

Different solution methods can aid in validation and testing.

Modularization allows consideration of different solution methods (e.g. simulation).
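The solve-then-combine step can be sketched as a dispatch over module types followed by a combination at the top gate. Everything concrete here is hypothetical: the solver callables stand in for a BDD solver and a Markov solver, the module failure probabilities are made-up placeholders, and the top gate is taken to be an OR over independent modules.

```python
def solve_modules(modules, solvers):
    """modules: name -> (kind, model); solvers: kind -> callable(model) -> failure probability."""
    return {name: solvers[kind](model) for name, (kind, model) in modules.items()}


def combine_or(module_probs):
    """Top-level OR gate over independent modules."""
    survive = 1.0
    for p in module_probs:
        survive *= 1.0 - p
    return 1.0 - survive


# Hypothetical stand-ins: a real tool would run a BDD or Markov solution here.
solvers = {
    "static": lambda model: model["p_fail"],
    "dynamic": lambda model: model["p_fail"],
}
modules = {
    "static module": ("static", {"p_fail": 0.01}),    # placeholder value
    "dynamic module": ("dynamic", {"p_fail": 0.02}),  # placeholder value
}

results = solve_modules(modules, solvers)
print(combine_or(results.values()))
```

Keeping the solver behind a per-type callable is what lets a module be re-solved by a different method, such as simulation, for cross-checking.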

Modular Solution of HECS




[Figure: HECS fault tree (top event: hypothetical system failure) divided into independent subtrees: subtrees 1 and 2 (one static, one dynamic) for the operator, console & SW branch and the A processors and spare branch; independent subtree 3 (memories), type: dynamic; independent subtree 4 (buses), type: static]













Sensitivity analysis




Reliability Analysis tells only part of the story.

What are the weak points in the system?

How do my results change with changing input parameters?

What is the most cost-effective way to improve reliability?

These questions require sensitivity analysis of reliability analysis results.

Modular approach to sensitivity analysis



Sensitivity analysis (also called importance analysis) can use partial derivatives.

Sensitivity results from different submodels can be easily combined using the chain rule.

Sensitivity analysis for BDD is almost free while calculating reliability.

Sensitivity analysis for Markov chains is more troublesome, but we have developed an interesting, efficient approximation.
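For a static model the partial derivative with respect to one basic-event probability can be computed exactly, because the top-event probability is linear in each individual probability. The sketch below reuses the washing machine example F = A + BC with independent events; it is an illustration of the partial-derivative idea, not the BDD-based procedure mentioned above, and the function names are illustrative.

```python
def top_event_probability(p):
    """Pr[A + BC] for independent basic events with probabilities p['A'], p['B'], p['C']."""
    p_bc = p["B"] * p["C"]
    return p["A"] + p_bc - p["A"] * p_bc


def sensitivity(p, name):
    """Partial derivative of Pr[F] with respect to p[name].

    Pr[F] is linear in each basic-event probability, so the derivative
    equals Pr[F | event occurs] - Pr[F | event does not occur].
    """
    occurs = dict(p, **{name: 1.0})
    absent = dict(p, **{name: 0.0})
    return top_event_probability(occurs) - top_event_probability(absent)


p = {"A": 0.01, "B": 0.05, "C": 0.075}
for name in p:
    print(name, sensitivity(p, name))
```

In this example the valve (A) has by far the largest sensitivity, which matches its role as a single point of failure.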

Example: Cardiac Assist System




[Figure: cardiac assist system. Components: TEDTS controller, TEDTS coil, battery, power supply, pace leads, primary CPU, backup CPU, motor amplifier, motor cable, motor, pump, crossbar switch, WSP]

Cardiac Assist System Fault Tree




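[Figure: cardiac assist system fault tree. Top event: System Failure, with modules M1-M5. Intermediate events: Pump, Motor; Leads; Power; TEDTS; CPU; Motor Section; Trigger. Basic events: TEDTS coil, TEDTS controller, battery, power supply, motor, motor cable, motor amplifier, pump, pace leads, primary CPU, backup CPU, system supervisor, crossbar switch. Gates labeled FDEP and WSP are included]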



Summary




The DFT (dynamic fault tree) methodology is ideally suited for the analysis of computer-based systems.

DFT uses a modular approach to FTA, detecting modules using a fast and efficient algorithm.

Modules are classified as static or dynamic, depending on the types of gates included.

Static modules are solved using the BDD approach; dynamic modules are solved using Markov chain methods.

Coverage models can assess the effect of complex recovery mechanisms.

Dynamic gates can allow modeling of sequence dependencies that arise from complex redundancy management.




Software for Dynamic Fault Tree Analysis

Galileo/ASSAP is a software package for fault tree analysis which embodies the DFT approach.

(Being developed for NASA Langley Research Center, expected completion Nov. 2001. Beta version available for evaluation.)
