Safety Critical Systems

Eight steps to safety

• Identify the hazards

• Determine the risks

• Define the safety measures

• Create safe requirements

• Create safe designs

• Implement safety

• Assure the safety process

• Test, test, test










Safety analysis

Handled at the architectural level and the mechanistic level

Safety Analysis

• You must identify the hazards of the system

• You must identify the faults that can lead to hazards

• You must define safety control measures to handle hazards

• These culminate in the Hazard Analysis

• The Hazard Analysis feeds into the Requirements Specification










Hazard Causes

• Release of energy
– electromagnetism (microwave oven)
– radiation (nuclear power plant)
– electricity (electrocution hazard from ECG leads)
– heat (infant warmer)
– kinetic (runaway train)

• Release of toxins

• Interference with life support or other safety-related function

• Misleading safety personnel

• Failure to alarm
– alarming too much is also a hazard: on the Therac-25, frequent alarms were ignored and people were killed

Types of Hazards

• Actions
– inappropriate system actions taken
  • F-18 pilot pulling up landing gear
– appropriate system actions not taken

• Timing
– too soon
– too late
– fault latency time

• Sequence
– skipping actions
– actions out of order

• Amount
– too much
– too little

Example Hazards

• Actions
– incorrectly energizing a medical treatment laser
– failure to engage landing gear

• Timing
– cardiac pacemaker paces too fast
– flight control surface adjusted too slowly

• Sequence
– empty the vat, THEN add the reagent
– out of sequence network packets controlling industrial robot

• Amount
– electrocution from muscle stimulator
– too little oxygen delivered to ventilator patient

Means of Hazard Control

• Obviation: the possibility of the hazard can be removed by being made physically impossible
– use incompatible fasteners to prevent cross connections

• Education: the hazard can be handled by educating the users so that they won’t create hazardous conditions through equipment misuse
– don’t look down the barrel when cleaning your rifle

• Alarming: announcing the hazard to the user when it appears so that they can take appropriate action
– alarming when the heart stops beating

• Interlocks: the hazard can be removed by using secondary devices and/or logic to intercede when a hazard presents itself
– car won’t start unless it is in “Park”

• Internal checking: the hazard can be handled by ensuring that a system can detect that it is malfunctioning prior to an incident
– CRC checks data for corruption whenever it is accessed

• Safety equipment
– goggles, gloves

• Restricting access to potential hazards so that only knowledgeable users have such access
– using passwords to prevent inadvertently starting service mode

• Labelling
– “High Voltage -- DO NOT TOUCH”
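
The interlock example (a car that won't start unless it is in "Park") can be sketched in a few lines of C++. This is only an illustration: the `Gear` enumeration and the `StarterInterlock` class are hypothetical names, not part of any real vehicle API.

```cpp
#include <cassert>

// Hypothetical gear positions, for illustration only.
enum class Gear { Park, Reverse, Neutral, Drive };

// Interlock: secondary logic that intercedes before the hazardous action.
// The start request is only passed on when the transmission is in Park.
class StarterInterlock {
public:
    explicit StarterInterlock(Gear gear) : gear_(gear) {}

    void setGear(Gear gear) { gear_ = gear; }

    // Returns true if the start request may reach the starter motor.
    bool requestStart() const { return gear_ == Gear::Park; }

private:
    Gear gear_;
};
```

The point of the design is that the check sits between the request and the actuator, so the hazard is blocked regardless of what the user does.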

Hazard Analysis

Hazard | Level of risk | Tolerance time T1 | Fault | Likelihood | Detection time | Control measure | Exposure time
Hypoventilation | Severe | 5 min | Ventilator fails | rare | 30 sec | Independent pressure alarm, action by doctor | 1 min
(as above) | | | Esophageal intubation | often | 30 sec | CO2 sensor alarm | 1 min
(as above) | | | User misattaches breathing circuit | often | 0 | Noncompatible mechanical fasteners used | 0
Overpressure | Severe | 250 ms | Release valve failure | rare | 50 ms | Secondary valve opens | 55 ms

What each column answers:

• Hazard: the hazardous condition
• Level of risk: how bad if it occurs?
• Tolerance time: how long can it be tolerated?
• Fault: how can this happen?
• Likelihood: how frequently?
• Detection time: how long to discover?
• Control measure: what do you do about it?
• Exposure time: how long is the exposure to the hazard?
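
The timing columns of a hazard analysis lend themselves to a mechanical check: a control measure is only adequate if the exposure ends before the tolerance time runs out. The sketch below assumes that the exposure time already includes detection plus the time for the control measure to act. The struct and function names are illustrative, not taken from the slides.

```cpp
#include <cassert>
#include <chrono>
#include <string>
#include <vector>

using Millis = std::chrono::milliseconds;

// One row of a hazard analysis; field names are illustrative.
struct HazardEntry {
    std::string fault;
    Millis toleranceTime;  // how long the hazard can be tolerated
    Millis detectionTime;  // how long it takes to discover the fault
    Millis exposureTime;   // assumed: detection plus time for the measure to act
};

// Adequate only if the fault is detected within the exposure window
// and the exposure ends before the tolerance time.
inline bool isControlled(const HazardEntry& e) {
    return e.detectionTime <= e.exposureTime && e.exposureTime < e.toleranceTime;
}

// Returns the faults whose control measures are too slow.
inline std::vector<std::string> uncontrolledFaults(const std::vector<HazardEntry>& table) {
    std::vector<std::string> result;
    for (const auto& e : table) {
        if (!isControlled(e)) result.push_back(e.fault);
    }
    return result;
}
```

With the overpressure row (tolerance 250 ms, detection 50 ms, exposure 55 ms) the check passes. A hypothetical row with a 300 ms detection time against the same tolerance would be reported as uncontrolled.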

When is a system safe enough?

• (Minimal) No hazards in the absence of faults

• (Minimal) No hazards in the presence of any single point failure
– a common mode failure is a single point failure that affects multiple channels
– a latent fault is an undetected fault which allows another fault to cause a hazard

• Your mileage may vary depending on the risk introduced by your system

Safety Measures

• You cannot depend on a safety measure that you cannot test!

• Example: a CAN bus with 2 nodes provides a CRC on messages, checked at the chip level, but the chips provide no way of testing whether the check is working.

• Therefore, it cannot be relied on as a safety measure

Fail-Safe States

• Off
– Emergency stop -- immediately cut power
– Production stop -- stop after the current task
– Protection stop -- shut down without removing power

• Partial Shutdown
– Degraded level of functionality

• Hold
– No functionality, but with safety actions taken

• Manual or External control

• Restart (reboot)










Risk Assessment

• For each hazard
– determine the potential severity
– determine the likelihood of the hazard
– determine how long the user is exposed to the hazard
– determine whether the risk can be removed

TUV Risk Level Determination Chart

S | E | G | W3 | W2 | W1
S1 | | | 1 | - | -
S2 | E1 | G1 | 2 | 1 | -
S2 | E1 | G2 | 3 | 2 | 1
S2 | E2 | G1 | 4 | 3 | 2
S2 | E2 | G2 | 5 | 4 | 3
S3 | E1 | | 6 | 5 | 4
S3 | E2 | | 7 | 6 | 5
S4 | | | 8 | 7 | 6

Risk parameters

S: Extent of damage
– S1: slight injury
– S2: severe irreversible injury to one or more persons, or the death of a single person
– S3: death of several persons
– S4: catastrophic consequences, several deaths

E: Exposure time
– E1: seldom to relatively infrequent
– E2: frequent to continuous

G: Hazard prevention
– G1: possible under certain conditions
– G2: hardly possible

W: Occurrence probability of hazardous event
– W1: very low
– W2: low
– W3: relatively high
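
A chart like this can be turned into a small lookup function. The sketch below rests on an assumption about how the chart is read: eight rows in the order S1, S2-E1-G1, S2-E1-G2, S2-E2-G1, S2-E2-G2, S3-E1, S3-E2, S4, with the W3 column running 1 to 8 and each lower-probability column shifted down by one. The enumerations and function names are illustrative.

```cpp
#include <cassert>

enum class Severity { S1, S2, S3, S4 };
enum class Exposure { E1, E2 };
enum class Prevention { G1, G2 };
enum class Probability { W1, W2, W3 };

// Row of the risk graph (0..7), under the assumed row order
// S1 | S2-E1-G1 | S2-E1-G2 | S2-E2-G1 | S2-E2-G2 | S3-E1 | S3-E2 | S4.
inline int riskRow(Severity s, Exposure e, Prevention g) {
    switch (s) {
        case Severity::S1: return 0;
        case Severity::S2:
            return 1 + (e == Exposure::E2 ? 2 : 0) + (g == Prevention::G2 ? 1 : 0);
        case Severity::S3: return e == Exposure::E1 ? 5 : 6;
        case Severity::S4: return 7;
    }
    return 7;  // not reached for valid enum values; err on the severe side
}

// Risk level 1..8, or 0 where the chart shows "-".
inline int riskLevel(Severity s, Exposure e, Prevention g, Probability w) {
    int shift = (w == Probability::W3) ? 1 : (w == Probability::W2) ? 0 : -1;
    int level = riskRow(s, e, g) + shift;
    return level < 1 ? 0 : level;
}
```

Under this reading, S2/E2/G2/W3 yields level 5 and S3/E1/W3 yields level 6.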

Sample Risk Assessments

Device | Hazard | Extent of damage | Exposure time | Hazard prevention | Probability | TUV risk level
Microwave oven | Irradiation | S2 | E2 | G2 | W3 | 5
Pacemaker | Pace too slowly | S2 | E2 | G2 | W3 | 5
Pacemaker | Pace too fast | S2 | E2 | G2 | W3 | 5
Power station | Explosion | S3 | E1 | -- | W3 | 6
Airliner | Crash | S4 | E2 | G2 | W2 | 8










Safety Measures

• Safety measures do one of the following
– remove the hazard
– reduce the risk
– identify the hazard to supervisory control

• The purpose of the safety measure is to ensure the system remains in a safe state

Safety Measures

Component | Fault/Error | Software class (1, 2) | Examples of acceptable measures
Interrupt handling and execution | no interrupt or too frequent | rq | functional test; or time-slot monitoring
Interrupt handling and execution | no interrupt or too frequent, and interrupt related to different sources | rq | comparison of redundant functional channels by either: reciprocal comparison; independent hardware comparator; independent time-slot and logical monitoring

• Adequacy of measures

• safety measures must be able to reliably detect the fault

• safety measures must be able to take appropriate actions

Risk Reduction

• Identify the fault

• Take corrective action, either
– use redundancy to correct and move on
  • feedforward error correction (Hamming codes)
– redo the computational step
  • feedback error detection (take corrective action first)
– go to a fail-safe state
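
As an illustration of "use redundancy to correct and move on", here is a minimal sketch of a Hamming(7,4) code, which adds three parity bits to four data bits and can correct any single flipped bit. The function names and the bit layout (codeword position i + 1 stored in bit i) are choices made for this example.

```cpp
#include <cassert>
#include <cstdint>

// Encode 4 data bits (low nibble) into a 7-bit Hamming(7,4) codeword.
// Bit i of the result (i = 0..6) holds codeword position i + 1:
// positions 1, 2, 4 are parity bits; positions 3, 5, 6, 7 are data bits.
inline std::uint8_t hammingEncode(std::uint8_t data) {
    std::uint8_t d1 = (data >> 0) & 1, d2 = (data >> 1) & 1;
    std::uint8_t d3 = (data >> 2) & 1, d4 = (data >> 3) & 1;
    std::uint8_t p1 = d1 ^ d2 ^ d4;  // covers positions 1, 3, 5, 7
    std::uint8_t p2 = d1 ^ d3 ^ d4;  // covers positions 2, 3, 6, 7
    std::uint8_t p4 = d2 ^ d3 ^ d4;  // covers positions 4, 5, 6, 7
    return static_cast<std::uint8_t>(p1 | (p2 << 1) | (d1 << 2) | (p4 << 3) |
                                     (d2 << 4) | (d3 << 5) | (d4 << 6));
}

// Decode a codeword, correcting at most one flipped bit.
// 'corrected' reports whether a correction was applied.
inline std::uint8_t hammingDecode(std::uint8_t code, bool& corrected) {
    // The syndrome is the XOR of the 1-based positions of all set bits:
    // 0 for a valid codeword, otherwise the position of the flipped bit.
    unsigned syndrome = 0;
    for (unsigned pos = 1; pos <= 7; ++pos) {
        if ((code >> (pos - 1)) & 1u) syndrome ^= pos;
    }
    corrected = (syndrome != 0);
    if (corrected) code ^= static_cast<std::uint8_t>(1u << (syndrome - 1));
    return static_cast<std::uint8_t>(((code >> 2) & 1) | (((code >> 4) & 1) << 1) |
                                     (((code >> 5) & 1) << 2) | (((code >> 6) & 1) << 3));
}
```

Note the limit of the technique: two flipped bits produce a nonzero syndrome that points at the wrong position, so this code corrects single-bit faults only.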

Fault Identification at Run-Time

• Faults must be identified in less than the tolerance time

• Fault identification requires redundancy

• Redundancy can be in terms of
– channel (architectural)
– device (architectural)
– data (detailed design)
– control (detailed design)

• Redundancy may be either
– homogeneous (random faults only)
  • does not detect errors
  • perform functions the same way on the same thing multiple times
– heterogeneous (systematic and random faults)
  • includes errors, which are present in all channels of a homogeneous design
  • perform processing differently and hopefully you didn’t make the same mistake!

Fault Tree Analysis Symbology

[Figure: fault tree symbols, one per entry below]

• An event that results from a combination of events through a logic gate
• A basic fault event that requires no further development
• A fault event that is not developed further because the event is inconsequential or the necessary information is not available
• An event that is expected to occur normally
• A condition that must be present to produce the output of a gate
• Transfer
• AND gate (also OR gate)
• NOT gate

Subset of Pacemaker Fault Analysis

[Figure: fault tree for the condition to avoid, "Pacing too slowly". Levels from top to bottom: condition or event to avoid; secondary conditions or events; primary or fundamental faults. Events shown: Time-base fault, Invalid pacing rate, Shutdown fault, Watchdog failure, Crystal failure, Bad commanded rate, Data corrupted in vivo, Rate command corrupted, CRC hardware failure, Software failure, CPU H/W failure, combined through OR and AND gates.]










Safe Requirements

• Requirements specification follows initial hazard analysis

• Specific requirements should track back to hazard analysis
– must be shown to FDA, etc.

• Architectural framework should be selected with safety needs in mind
– has the hooks in place










Use Good Design Practices

• Good design practices allow you to
– manage complexity
– view the system at various levels of abstraction
– zoom in on a particular area of interest
– identify hot spots of special concern
– have consistent quality
– easily test
– build and use high quality components

• Regulatory agencies look at this!!

• Manage your requirements
– trace requirements to design elements
– trace design elements back to requirements

[Figure: traceability between the requirements specification (use cases "adjust trajectory" and "remote communication") and the design model (remote communications; class a, class b, class c, class d, class e)]


• Use iterative development
– integrating many times finds more defects
– iterative prototypes can result in more reliable and safe systems


• Use component-based design architectures
– third party components may be very well tested if they are in wide use
– require bug lists from component vendors
  • this bit Microsoft once


• Use Visual Modeling
– UML
– Ward-Mellor

• Use executable models
– animate models
– execute and debug at modeling level of abstraction


• Use frameworks
– a framework is a partially completed application which is specialized by the user
  • Microsoft Foundation Classes
  • Object Execution Framework
– frameworks reduce the work of developing new applications
– frameworks rely on well-tested patterns


[Figure: Framework (a set of patterns) + User Model = System. 80-90% of application code is housekeeping code.]


• Use Configuration Management
– only use unit-tested components in builds

[Figure: data acquisition, drivers, OS, and parameters are taken from the CM Database to build the SYSTEM]


• Design for test
– product testing
– built-in-testing to ensure
  • invariants are truly invariant
    – functional invariants
    – quality of service invariants (e.g. performance)
  • faults are detected

• Isolate Safety Functions
– Safety-relevant systems are 200-300% more effort to produce
– Isolation of safety systems allows more expedient development
– Care must be taken that the safety system is truly isolated so that a defect in the non-safety system cannot affect the safety system
  • different processor
  • different heavy-weight tasks (depends on the OS)

Safety Critical Patterns

Safety Architecture Patterns

• Protected Single-Channel Pattern

• Dual-Channel Pattern
– Homogeneous Dual-Channel Pattern
– Heterogeneous Peer-Channel Pattern
– Sanity Check Pattern
– Monitor-Actuator Pattern

• Voting Multichannel Pattern

Protected Single Channel Pattern

• Within the single channel, mechanisms exist to identify and handle faults

• All faults must be detected within the fault tolerance time

• May be impossible
– to test for all faults within the fault tolerance time
– to remove common mode failures from the single channel

• Generally, less recurring system cost
– no additional hardware required

[Figure: Single Channel Train Braking System. “If I’m not getting life ticks, I’ll shut down!”]
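
The life-tick idea (a channel shuts itself down when it stops hearing from its peer) can be sketched as a small monitor class. This is a simplified illustration: `LifeTickMonitor` is a hypothetical name, and time is passed in explicitly in milliseconds so the sketch does not depend on any particular clock or RTOS API.

```cpp
#include <cassert>
#include <cstdint>

// Monitors "life ticks" from a channel and latches a shutdown on timeout.
class LifeTickMonitor {
public:
    LifeTickMonitor(std::uint32_t timeoutMs, std::uint32_t nowMs)
        : timeoutMs_(timeoutMs), lastTickMs_(nowMs) {}

    // Called by the monitored channel to signal that it is still alive.
    // Ticks are ignored once the shutdown has been latched.
    void tick(std::uint32_t nowMs) {
        if (!shutDown_) lastTickMs_ = nowMs;
    }

    // Called periodically by the monitor. Returns true while the channel
    // is considered alive; latches the fail-safe state on timeout.
    bool check(std::uint32_t nowMs) {
        // Unsigned subtraction stays correct across timer wrap-around.
        if (nowMs - lastTickMs_ > timeoutMs_) shutDown_ = true;
        return !shutDown_;
    }

    bool isShutDown() const { return shutDown_; }

private:
    std::uint32_t timeoutMs_;
    std::uint32_t lastTickMs_;
    bool shutDown_ = false;
};
```

Latching matters here: once the timeout has fired, a late tick must not silently bring the system back out of its fail-safe state.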

Dual Channel Architecture Patterns

• Separation of safety-relevant from non-safety relevant where possible

• Separation of monitoring from control

• Generally easier to meet safety requirements
– timing
– common mode failures

• Generally higher recurring system cost
– additional hardware required

Homogeneous Dual-Channel Pattern

• Identical channels used

• Channels may operate simultaneously (Multichannel Vote Pattern)

• Channels may operate in series (Backup Pattern)

• Good at identifying random faults but not systematic faults

• Low R&D cost, higher recurring cost

Heterogeneous Peer-Channel Pattern

• Equal weight, differently implemented channels
– may use algorithmic inversion to recreate initial data
– may use different algorithm
– may use different teams (not fool proof because of hot spots that can cause failures)

• Good at identifying both random and systematic faults


• Generally safest, but higher R&D and recurring cost
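
Algorithmic inversion can be illustrated with a deliberately simple computation. In the sketch below, one channel computes an integer square root by binary search and the other channel checks it by squaring the result to recover the range the input must lie in. The function names are hypothetical, and a real system would run the two channels on separate hardware or tasks.

```cpp
#include <cassert>
#include <cstdint>
#include <stdexcept>

// Channel A: integer square root by binary search.
inline std::uint32_t isqrtChannelA(std::uint32_t n) {
    std::uint32_t lo = 0, hi = 65535;  // 65535 squared still fits below 2^32
    while (lo < hi) {
        std::uint32_t mid = lo + (hi - lo + 1) / 2;
        if (static_cast<std::uint64_t>(mid) * mid <= n) {
            lo = mid;
        } else {
            hi = mid - 1;
        }
    }
    return lo;
}

// Channel B: algorithmic inversion. Squares the result and checks that
// the original input lies in [root^2, (root + 1)^2).
inline bool invertAndCheck(std::uint32_t n, std::uint32_t root) {
    std::uint64_t r = root;
    return r * r <= n && n < (r + 1) * (r + 1);
}

// Returns the result only if both channels agree; otherwise signals a fault.
inline std::uint32_t checkedIsqrt(std::uint32_t n) {
    std::uint32_t root = isqrtChannelA(n);
    if (!invertAndCheck(n, root)) throw std::runtime_error("channel disagreement");
    return root;
}
```

Because the two channels share no algorithm, a systematic mistake in the search loop is unlikely to be repeated in the squaring check.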


Sanity Check Pattern

• A primary actuator channel does real computations

• A light-weight secondary channel checks the reasonableness of the primary channel

• Good for detection of both random and systematic faults

• May not detect faults which result in small variance

• Relatively inexpensive to implement
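
A reasonableness check of this kind might look like the following sketch. It does not recompute the primary result; it only tests whether the value is plausible. The `SanityChecker` struct and its limits are illustrative assumptions.

```cpp
#include <cassert>

// Light-weight secondary channel: checks plausibility, not correctness.
struct SanityChecker {
    double minValue;  // lowest believable output
    double maxValue;  // highest believable output
    double maxStep;   // largest believable change between consecutive outputs

    bool isReasonable(double previous, double current) const {
        if (current < minValue || current > maxValue) return false;
        double step = current - previous;
        if (step < 0) step = -step;
        return step <= maxStep;
    }
};
```

For example, with limits of 30 to 180 and a maximum step of 20, a jump from 70 to 250 or from 70 to 120 is rejected, while a wrong value of 71 instead of 70 passes. That is the "small variance" limitation named in the list above.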

Monitor-Actuator Pattern

• Separates actuation from the monitoring of that actuation

• If the actuator channel fails, the monitor channel detects it

• If the monitor channel fails, the actuator channel continues correctly

• Requires fault isolation to be single-fault tolerant
– actuator channel cannot use the monitor itself

[Figure: Dual-Channel Design Architecture]

Safety Executive Pattern

• Large scale architectural pattern

• Controller subsystem (safety executive)

• One or more watchdog subsystems
– check on system health
– ensure proper actuation is occurring

• One or more actuation channels

• Recovery subsystem (fail-safe processing channel)


• Appropriate when
– A set of fail-safe system states needs to be entered when failures are identified
– Determination of failures is complex
– Several safety-related system actions are controlled simultaneously
– Safety-related actions are not independent
– Determining proper safety action in the event of a failure can be complex











Detailed Design for Safety

• Make it right before you make it fast
– simple, clear algorithms and code
– optimize only the 10%-20% of code which affects performance
– use “safe” language subsets
– ensure you haven’t introduced any common mode failures


• Thoroughly test
– unit test and peer review
– integration test
– validation test


• Verify that it remains right throughout program execution
– exceptions
– invariant assertions
– range checking
– index and boundary checking

• When it’s not right during execution, then make it right with corrective or protective measures


• Use “safe” language subsets
– strong compile-time checking
  • if you use C, use “lint”
– strong run-time checking
– exception handling
– avoid void*
– avoid error prone statements and syntax
  • you can make C++ safe but it’s not safe out of the box


• Language choice
– Compile time checking (C versus Ada)
– Run-time checking (C versus Ada)
  • Exceptions versus error codes

• Language selection
– “C treats you like a consenting adult. Pascal treats you like a naughty child. Ada treats you like a criminal”

Pascal example

program WontCompile;
type
  MySubRange = 0 .. 20;
  Day = (Monday, Tuesday, Wednesday, Thursday, Friday, Saturday, Sunday);
var
  MyVar: MySubRange;
  MyDate: Day;
begin
  MyVar := 24;  { will not compile -- range error! }
  MyDate := 0;  { will not compile -- wrong type! }
end.

Ada example

procedure MyProc is
   type Byte is range 0 .. 255;
   MyArray : array (1 .. 10) of Integer;
   b : Byte;
begin
   for j in 0 .. 10 loop
      MyArray(j) := j ** 6;  -- raises exception on first time through: index 0 is out of range
   end loop;
   b := Byte(MyArray(10));   -- will fail run-time range check
end MyProc;

Exceptions

• Some languages (Pascal, Modula-2) have a draconian error handling policy
– exception raised and program terminated
– not good for embedded systems

• Ada and C++ allow run time recovery through user-defined exceptions and exception handlers

• A lot of extra code to check the statement

a[j] := b;


• Do not allow ignoring of error indications
– checking of return values is a manual process
– user of the function must remember each and every time
– easy to circumvent this error handling system

• Separate normal code from error handling code


• Handle errors at the lowest level with sufficient context to correct the problem

Error handling code

a = getfone(&b, &c);
if (a) {            /* non-zero return value signals an error */
    switch (a) {
        case 1: ..
        case 2: ..
    }
}
d = getftwo(b, c);
if (d) {            /* each call needs its own error branch */
    switch (d) {
        case 1: ..
        case 2: ..
    }
}

In this code the normal execution path is:

a = getfone(&b, &c);
d = getftwo(b, c);

Built-in exception types

procedure enqueue (q: in out queue; v: in FLOAT) is
begin
   if full(q) then
      raise overflow;
   end if;
   q.buffer((q.head + q.length) mod qSize) := v;
   q.length := q.length + 1;
end enqueue;

Caller of the sequence handles exception

procedure testQ (q: in out queue) is
begin
   for j in 1 .. 10 loop
      enqueue(q, random(1000));
   end loop;
exception
   when overflow =>
      puts("Test failed due to queue overflow");
end testQ;

C++ exception handling

• Extends capabilities beyond that of Ada

• Exceptions extended by type rather than value
– possible to create hierarchies of exception classes and catch by thrown subclass type
– class can contain different types of information about the kind of device that failed
– this facilitates error recovery, debugging, and user error reporting
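
Catching by exception type can be shown with a small, hypothetical hierarchy. `DeviceFault`, `SensorFault`, and `classify` are names invented for this sketch; only `std::runtime_error` and `std::exception` come from the standard library.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Hypothetical base class: carries the name of the device that failed.
class DeviceFault : public std::runtime_error {
public:
    DeviceFault(const std::string& device, const std::string& what)
        : std::runtime_error(device + ": " + what), device_(device) {}
    const std::string& device() const { return device_; }

private:
    std::string device_;
};

// Hypothetical subclass: adds information specific to sensors.
class SensorFault : public DeviceFault {
public:
    SensorFault(const std::string& device, int lastReading)
        : DeviceFault(device, "sensor reading invalid"), lastReading_(lastReading) {}
    int lastReading() const { return lastReading_; }

private:
    int lastReading_;
};

// Handlers are tried in order, so the most specific type comes first.
inline std::string classify(void (*operation)()) {
    try {
        operation();
        return "ok";
    } catch (const SensorFault& f) {
        return "sensor fault on " + f.device();
    } catch (const DeviceFault& f) {
        return "device fault on " + f.device();
    } catch (const std::exception&) {
        return "unknown fault";
    }
}
```

The normal path inside `classify` is a single call, and all error handling sits in the handlers after it, which is the separation of normal code from error handling code that return-code checking makes hard.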

Making C++ safe

• Overloading the [ ] operator with index range checking improves the safety of arrays

• Making classes of scalars and overloading the assignment operator allows additional range and value checking
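
Both ideas can be sketched with two small templates. `CheckedArray` and `RangedInt` are illustrative names for this example, not standard library classes. Out-of-range accesses and assignments throw `std::out_of_range`.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>

// Fixed-size array whose operator[] checks the index on every access.
template <typename T, std::size_t N>
class CheckedArray {
public:
    T& operator[](std::size_t i) {
        if (i >= N) throw std::out_of_range("CheckedArray index");
        return data_[i];
    }
    const T& operator[](std::size_t i) const {
        if (i >= N) throw std::out_of_range("CheckedArray index");
        return data_[i];
    }

private:
    T data_[N] = {};
};

// Scalar wrapper whose assignment operator enforces a subrange,
// similar in spirit to a Pascal or Ada subrange type.
template <int Lo, int Hi>
class RangedInt {
public:
    RangedInt(int v = Lo) { *this = v; }

    RangedInt& operator=(int v) {
        if (v < Lo || v > Hi) throw std::out_of_range("RangedInt value");
        value_ = v;
        return *this;
    }

    operator int() const { return value_; }

private:
    int value_ = Lo;
};
```

A rejected assignment leaves the old value in place, so a `RangedInt<0, 20>` holding 9 still holds 9 after an attempt to assign 24.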


• Data Validity Checks
– CRC (16 bit or 32 bit)
  • identifies all single or dual bit errors
  • detects high percentage of multiple bit errors
  • table or compute-driven
  • chips are available
– checksum
– redundant storage
  • ones complement


• Redundancy should be set on every write access

• Data should be checked on every read access
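
The write/read rule can be sketched with a compute-driven CRC-32 and a wrapper that stores the CRC next to the value. The `Protected` template and its `rawForFaultSeeding` test hook are hypothetical names. The sketch assumes `T` is a simple type without padding bytes, since the CRC is taken over its raw memory.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>

// Bitwise (compute-driven) CRC-32 with the common reflected polynomial 0xEDB88320.
inline std::uint32_t crc32(const void* data, std::size_t len) {
    const unsigned char* p = static_cast<const unsigned char*>(data);
    std::uint32_t crc = 0xFFFFFFFFu;
    for (std::size_t i = 0; i < len; ++i) {
        crc ^= p[i];
        for (int bit = 0; bit < 8; ++bit) {
            // Shift right; apply the polynomial if the bit shifted out was 1.
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
        }
    }
    return ~crc;
}

// Stores a value together with its CRC.
// The CRC is set on every write and checked on every read.
template <typename T>
class Protected {
public:
    explicit Protected(const T& v) { set(v); }

    void set(const T& v) {
        value_ = v;
        crc_ = crc32(&value_, sizeof value_);
    }

    T get() const {
        if (crc32(&value_, sizeof value_) != crc_) {
            throw std::runtime_error("data corrupted");
        }
        return value_;
    }

    // Test hook only: lets a test corrupt the stored value (fault seeding).
    T* rawForFaultSeeding() { return &value_; }

private:
    T value_;
    std::uint32_t crc_;
};
```

Flipping a single bit through the test hook makes the next `get()` throw, which is also a small instance of testing a safety measure by fault seeding.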

ANSI C++ Exception Class Hierarchy

exception
  logic_error
    domain_error
    invalid_argument
    length_error
    out_of_range
  runtime_error
    range_error
    overflow_error










Safety Process (Development)

• Do Hazard Analysis early and often

• Track safety measures from hazard analysis to
– requirements specification
– design
– code
– validation tests

• Test safety measures with fault seeding

Safety Process (Deployment)

• Install safely
– ensure proper means are used to set up system
– safety measures are installed and checked

• Deploy safely
– ensure safety measures are periodically checked and serviced
– do not turn off safety measures

• Decommission safely
– removal of hazardous materials

IEC Overall Safety Lifecycle

[Figure: lifecycle phases shown in the diagram]

• Concept
• Overall scope definition
• Hazard and risk analysis
• Overall safety requirements
• Safety requirements allocation
• Overall planning
– operation and maintenance planning
– validation planning
– installation planning
• SRS: E/E/PES realization
• SRS: other technology realization
• External risk reduction facilities
• Overall installation and commissioning
• Overall safety validation
• Overall operation and maintenance
• Overall modification and retrofit
• Decommissioning

SRS: Safety Related System
E/E/PES: Electrical/Electronic/Programmable electronic system










Safety in Testing in R&D

• Use fault-seeding
– Unit (class) testing
  • white box
  • procedural invariant violation assertions
  • peer reviews
– Integration testing
  • grey box
– Validation testing
  • black box
  • externally caused faults
  • (grey box) internally seeded faults

Safety Testing During Operation

• Power on Self-Test (POST)
– Check for latent faults
– All safety measures must be tested at power on and periodically
  • RAM (stuck-at, shorts, cell failures)
  • ROM
  • Flash
  • Disks
  • CPU
  • Interfaces
  • Buses

• Built-In Tests
– Repeats some of POST
– Data integrity checks
– Index and pointer validity checking
– Subrange value invariant assertions
– Proper functioning
  • Watchdogs
  • Reasonableness checks
  • Lifeticks

A Simplified Example: A Linear Accelerator

[Figure: Unsafe Linear Accelerator. Elements shown: CPU; Beam Intensity and Beam Duration; sequence 1. Set Dose, 2. Start Beam, 3. End Beam; Sensor; Radiation Dose.]

Fault Tree Analysis

[Figure: fault tree with top event "Over radiation", built from OR and AND gates. Events shown: Radiation command invalid (EMI, Software defect); Shutoff timer failure; CPU halted (EMI, CPU failure, Software defect); Beam engaged.]

Hazards of the Linear Accelerator

Hazard | Level of risk | Tolerance time T1 | Fault | Likelihood | Detection time | Control measure | Exposure time
Overradiation | Severe | 100 ms | CPU locks up | rare | 50 ms | Safety CPU checks lifetick at 25 ms | 50 ms
(as above) | | | Corrupt data settings | often | 10 ms | 32-bit CRCs on data, checked every access | 15 ms
Underradiation | Moderate | 2 weeks | Corrupt data setting | often | 10 ms | 32-bit CRCs on data, checked every access | 15 ms
Inadvertent radiation on power on | Severe | 100 ms | Beam left engaged during power down | often | n/a | Curtain mechanically shuts at power down | 0 ms

Safe Linear Accelerator

[Figure: Safe Linear Accelerator. Elements shown: CPU; Beam Intensity and Beam Duration; sequence 1. Set Dose, 2. Start Beam, 3. End Beam; Sensor; Radiation Dose; Safety CPU; periodic watchdog service; self test results shared prior to operation; deenergize; mechanical shutoff when curtain low.]

Summary

• Safety is a system issue

• It is cheaper and more effective to include safety early on than to add it later

• Safety architectures provide programming-in-the-large safety

• Safe coding rules and detailed design provide programming-in-the-small safety
