Safety Critical Systems
Eight steps to safety
• Identify the hazards
• Determine the risks
• Define the safety measures
• Create safe requirements
• Create safe designs
• Implement safety
• Assure the safety process
• Test, test, test
Safety analysis
Handled at the architectural level and mechanistic level
Safety Analysis
• You must identify the hazards of the system
• You must identify the faults that can lead to hazards
• You must define safety control measures to handle hazards
• These culminate in the Hazard Analysis
• The Hazard Analysis feeds into the Requirements Specification
Hazard Causes
• Release of energy
  – electromagnetism (microwave oven)
  – radiation (nuclear power plant)
  – electricity (electrocution hazard from ECG leads)
  – heat (infant warmer)
  – kinetic (runaway train)
• Release of toxins
• Interference with life support or other safety-related function
• Misleading safety personnel
• Failure to alarm
  – alarming too much (Therac-25): the alarms were ignored and people were killed
Types of Hazards
• Actions
  – inappropriate system actions taken
    • F-18 pilot pulling up landing gear
  – appropriate system actions not taken
• Timing
  – too soon
  – too late
  – fault latency time
• Sequence
  – skipping actions
  – actions out of order
• Amount
  – too much
  – too little
Example Hazards
• Actions
  – incorrectly energizing a medical treatment laser
  – failure to engage landing gear
• Timing
  – cardiac pacemaker paces too fast
  – flight control surface adjusted too slowly
• Sequence
  – empty the vat, THEN add the reagent
  – out of sequence network packets controlling industrial robot
• Amount
  – electrocution from muscle stimulator
  – too little oxygen delivered to ventilator patient
Means of Hazard Control
• Obviation: the possibility of the hazard can be removed by being made physically impossible
  – use incompatible fasteners to prevent cross connections
• Education: the hazard can be handled by educating the users so that they won't create hazardous conditions through equipment misuse
  – don't look down the barrel when cleaning your rifle
• Alarming: announcing the hazard to the user when it appears so that they can take appropriate action
  – alarming when the heart stops beating
• Interlocks: the hazard can be removed by using secondary devices and/or logic to intercede when a hazard presents itself
  – car won't start unless it is in "Park"
• Internal checking: the hazard can be handled by ensuring that a system can detect that it is malfunctioning prior to an incident
  – CRC checks data for corruption whenever it is accessed
• Safety equipment
  – goggles, gloves
• Restricting access to potential hazards so that only knowledgeable users have such access
  – using passwords to prevent inadvertently starting service mode
• Labelling
  – "High Voltage -- DO NOT TOUCH"
Hazard Analysis
Hazard | Level of risk | Tolerance time T1 | Fault | Likelihood | Detection time | Control measure | Exposure time
Hypoventilation | Severe | 5 min | Ventilator fails | rare | 30 sec | Independent pressure alarm, action by doctor | 1 min
 | | | Esophageal intubation | often | 30 sec | CO2 sensor alarm | 1 min
 | | | User misattaches breathing circuit | often | 0 | Noncompatible mechanical fasteners used | 0
Overpressure | Severe | 250 ms | Release valve failure | rare | 50 ms | Secondary valve opens | 55 ms
What each column answers:
• Hazard: the hazardous condition
• Level of risk: how bad if it occurs?
• Tolerance time: how long can it be tolerated?
• Fault: how can this happen?
• Likelihood: how frequently?
• Detection time: how long to discover?
• Control measure: what do you do about it?
• Exposure time: how long is the exposure to the hazard?
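The timing columns of the hazard analysis can be checked mechanically: a control measure only helps if the exposure time stays below the tolerance time. The following is a minimal sketch of that check. The names FaultEntry and isControlledInTime are illustrative, and it assumes that exposure time includes detection time, which is consistent with the rows shown but is not stated on the slide.

```cpp
#include <cassert>
#include <chrono>
#include <string>

using Millis = std::chrono::milliseconds;

// One fault row of a hazard analysis (field names are illustrative).
struct FaultEntry {
    std::string fault;
    Millis toleranceTime;  // how long the hazard can be tolerated
    Millis detectionTime;  // how long it takes to discover the fault
    Millis exposureTime;   // assumed: detection plus time for the control measure to act
};

// A control measure is adequate only if the hazard is removed
// before the tolerance time runs out.
bool isControlledInTime(const FaultEntry& e) {
    assert(e.detectionTime <= e.exposureTime);  // assumption: exposure includes detection
    return e.exposureTime < e.toleranceTime;
}
```

With the overpressure row (tolerance 250 ms, detection 50 ms, exposure 55 ms) the check passes. A hypothetical entry whose exposure exceeded 250 ms would fail it and would need a faster control measure.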
When is a system safe enough?
• (Minimal) No hazards in the absence of faults
• (Minimal) No hazards in the presence of any single point failure
  – a common mode failure is a single point failure that affects multiple channels
  – a latent fault is an undetected fault which allows another fault to cause a hazard
• Your mileage may vary depending on the risk introduced by your system
Safety Measures
• You cannot depend on a safety measure that you cannot test!
• Example: a CAN bus with 2 nodes provides a CRC on messages checked at the chip level, but the chips provide no way of testing to see if it is working. Therefore, it cannot be relied on as a safety measure.
Fail-Safe States
• Off
  – Emergency stop -- immediately cut power
  – Production stop -- stop after the current task
  – Protection stop -- shut down without removing power
• Partial Shutdown
  – Degraded level of functionality
• Hold
  – No functionality, but with safety actions taken
• Manual or External control
• Restart (reboot)
Risk Assessment
• For each hazard
  – determine the potential severity
  – determine the likelihood of the hazard
  – determine how long the user is exposed to the hazard
  – determine whether the risk can be removed
TUV Risk Level Determination Chart
Extent of damage | Exposure time | Hazard prevention | W3 | W2 | W1
S1 | | | 1 | - | -
S2 | E1 | G1 | 2 | 1 | -
S2 | E1 | G2 | 3 | 2 | 1
S2 | E2 | G1 | 4 | 3 | 2
S2 | E2 | G2 | 5 | 4 | 3
S3 | E1 | | 6 | 5 | 4
S3 | E2 | | 7 | 6 | 5
S4 | | | 8 | 7 | 6
Risk parameters
• S: Extent of damage
  – S1: slight injury
  – S2: severe irreversible injury to one or more persons, or the death of a single person
  – S3: death of several persons
  – S4: catastrophic consequences, several deaths
• E: Exposure time
  – E1: seldom to relatively infrequent
  – E2: frequent to continuous
• G: Hazard prevention
  – G1: possible under certain conditions
  – G2: hardly possible
• W: Occurrence probability of hazardous event
  – W1: very low
  – W2: low
  – W3: relatively high
Sample Risk Assessments
Device | Hazard | Extent of damage | Exposure time | Hazard prevention | Probability | TUV risk level
Microwave oven | Irradiation | S2 | E2 | G2 | W3 | 5
Pacemaker | Pace too slowly | S2 | E2 | G2 | W3 | 5
Pacemaker | Pace too fast | S2 | E2 | G2 | W3 | 5
Power station | Explosion | S3 | E1 | -- | W3 | 6
Airliner | Crash | S4 | E2 | G2 | W2 | 8
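The risk level determination can be written as a small lookup. This is a sketch only. It assumes the chart layout in which S1 occupies one row, S2 four rows (E and G), S3 two rows (E only) and S4 one row, with each step down in probability (W3 to W2 to W1) lowering the level by one. The enum and function names are illustrative.

```cpp
#include <cassert>
#include <optional>

enum class Severity { S1, S2, S3, S4 };
enum class Exposure { E1, E2 };
enum class Prevention { G1, G2 };
enum class Probability { W1, W2, W3 };

// Row index into the chart: S1 is row 0, S2 uses rows 1-4 (E and G),
// S3 uses rows 5-6 (E only), S4 is row 7.
int chartRow(Severity s, Exposure e, Prevention g) {
    switch (s) {
        case Severity::S1: return 0;
        case Severity::S2: return 1 + 2 * static_cast<int>(e) + static_cast<int>(g);
        case Severity::S3: return 5 + static_cast<int>(e);
        case Severity::S4: return 7;
    }
    return 7;  // not reached for valid enum values; err on the severe side
}

// Returns the risk level 1..8, or nullopt where the chart shows "-".
std::optional<int> riskLevel(Severity s, Exposure e, Prevention g, Probability w) {
    // W3 gives row + 1; W2 and W1 each lower the level by one more.
    const int level = chartRow(s, e, g) + 1 - (2 - static_cast<int>(w));
    if (level < 1) return std::nullopt;
    return level;
}
```

Under these assumptions the microwave oven row (S2, E2, G2, W3) evaluates to level 5 and the power station row (S3, E1, W3) to level 6, matching the sample table.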
Safety Measures
• Safety measures do one of the following
  – remove the hazard
  – reduce the risk
  – identify the hazard to supervisory control
• The purpose of the safety measure is to ensure the system remains in a safe state
Safety Measures
Component | Fault/error | Software class (1, 2) | Examples of acceptable measures
Interrupt handling and execution | no interrupt or too frequent | rq | functional test; or time-slot monitoring
Interrupt handling and execution | no interrupt or too frequent, and interrupt related to different sources | rq | comparison of redundant functional channels by either reciprocal comparison, independent hardware comparator, or independent time-slot and logical monitoring
• Adequacy of measures
  – safety measures must be able to reliably detect the fault
  – safety measures must be able to take appropriate actions
Risk Reduction
• Identify the fault
• Take corrective action, either
  – use redundancy to correct and move on
    • feedforward error correction (Hamming codes)
  – redo the computational step
    • feedback error detection (take corrective action first)
  – go to a fail-safe state
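As an illustration of feedforward error correction, here is a minimal sketch of a Hamming(7,4) code, which adds three parity bits to four data bits and can correct any single flipped bit without redoing the computation. The function names and the bit layout (parity at codeword positions 1, 2 and 4) are choices made for this sketch.

```cpp
#include <cassert>
#include <cstdint>

// Encode 4 data bits into a 7-bit Hamming codeword.
// Bit i of the result (0-based) holds codeword position i+1;
// positions 1, 2 and 4 are parity, positions 3, 5, 6 and 7 are data.
std::uint8_t hammingEncode(std::uint8_t data) {
    assert(data < 16);
    const unsigned d1 = (data >> 0) & 1u;  // position 3
    const unsigned d2 = (data >> 1) & 1u;  // position 5
    const unsigned d3 = (data >> 2) & 1u;  // position 6
    const unsigned d4 = (data >> 3) & 1u;  // position 7
    const unsigned p1 = d1 ^ d2 ^ d4;      // covers positions 1, 3, 5, 7
    const unsigned p2 = d1 ^ d3 ^ d4;      // covers positions 2, 3, 6, 7
    const unsigned p4 = d2 ^ d3 ^ d4;      // covers positions 4, 5, 6, 7
    return static_cast<std::uint8_t>(p1 | (p2 << 1) | (d1 << 2) | (p4 << 3) |
                                     (d2 << 4) | (d3 << 5) | (d4 << 6));
}

// Decode a 7-bit codeword, correcting a single flipped bit if there is one.
std::uint8_t hammingDecode(std::uint8_t code) {
    auto bit = [&code](int pos) { return (code >> (pos - 1)) & 1u; };
    const unsigned s1 = bit(1) ^ bit(3) ^ bit(5) ^ bit(7);
    const unsigned s2 = bit(2) ^ bit(3) ^ bit(6) ^ bit(7);
    const unsigned s4 = bit(4) ^ bit(5) ^ bit(6) ^ bit(7);
    const unsigned syndrome = s1 | (s2 << 1) | (s4 << 2);  // position of the bad bit, 0 if none
    if (syndrome != 0) {
        code ^= static_cast<std::uint8_t>(1u << (syndrome - 1));  // flip it back
    }
    return static_cast<std::uint8_t>(bit(3) | (bit(5) << 1) | (bit(6) << 2) | (bit(7) << 3));
}
```

This code corrects one flipped bit per codeword. Two flipped bits are silently miscorrected, so where that risk matters it would need an extra overall parity bit or a fallback to a fail-safe state.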
Fault Identification at Run-Time
• Faults must be identified in less than the tolerance time (< TO)
• Fault identification requires redundancy
• Redundancy can be in terms of
  – channel, device (architectural)
  – data, control (detailed design)
• Redundancy may be either
  – homogeneous (random faults only)
    • does not detect errors
    • perform functions the same way on the same thing multiple times
  – heterogeneous (systematic and random faults)
    • includes errors, which are present in all channels
    • perform processing differently and hopefully you didn't make the same mistake!
Fault Tree Analysis Symbology
[Figure: fault tree symbols; only the descriptions survive]
• An event that results from a combination of events through a logic gate
• A basic fault event that requires no further development
• A fault event that is not developed further because the event is inconsequential or the necessary information is not available
• An event that is expected to occur normally
• A condition that must be present to produce the output of a gate
• Transfer
• AND gate (also OR gate)
• NOT gate
Subset of Pacemaker Fault Analysis
[Figure: fault tree; events are combined through OR and AND gates]
• Levels, top to bottom: condition or event to avoid; secondary conditions or events; primary or fundamental faults
• Events shown: Pacing too slowly, Time-base fault, Invalid pacing rate, Shutdown fault, Watchdog failure, Crystal failure, Bad command rate, Data corrupted in vivo, Rate command corrupted, CRC hardware failure, Software failure, CPU H/W failure
Safe Requirements
• Requirements specification follows initial hazard analysis
• Specific requirements should track back to hazard analysis
  – must be shown to FDA, etc.
• Architectural framework should be selected with safety needs in mind
  – has the hooks in place
Use Good Design Practices
• Good design practices allow you to
  – manage complexity
  – view the system at various levels of abstraction
  – zoom in on a particular area of interest
  – identify hot spots of special concern
  – have consistent quality
  – easily test
  – build and use high quality components
• Regulatory agencies look at this!!
• Manage your requirements
  – trace requirements to design elements
  – trace design elements back to requirements
[Figure: traceability between the requirements specification (use cases: adjust trajectory, remote communication) and the design model (class a through class e)]
Use Good Design Practices
• Use iterative development
  – integrating many times finds more defects
  – iterative prototypes can result in more reliable and safe systems
• Use component-based design architectures
  – third party components may be very well tested if they are in wide use
  – require bug lists from component vendors
    • this bit Microsoft once
• Use Visual Modeling
  – UML
  – Ward-Mellor
• Use executable models
  – animate models
  – execute and debug at modeling level of abstraction
• Use frameworks
  – a framework is a partially completed application which is specialized by the user
    • Microsoft Foundation Classes
    • Object Execution Framework
  – frameworks reduce the work of developing new applications
  – frameworks rely on well-tested patterns
Use Good Design Practices
[Figure: Framework (built from patterns) + User Model = System]
• 80-90% of application code is housekeeping code
Use Good Design Practices
• Use Configuration Management
  – only use unit-tested components in builds
[Figure: CM database feeding the system build: data acquisition, drivers, OS, parameters]
Use Good Design Practices
• Design for test
  – product testing
  – built-in testing to ensure
    • invariants are truly invariant: functional invariants, quality of service invariants (e.g. performance)
    • faults are detected
• Isolate Safety Functions
  – Safety-relevant systems are 200-300% more effort to produce
  – Isolation of safety systems allows more expedient development
  – Care must be taken that the safety system is truly isolated so that a defect in the non-safety system cannot affect the safety system
    • different processor
    • different heavy-weight tasks (depends on the OS)
Safety Critical Patterns
Safety Architecture Patterns
• Protected Single-Channel Pattern
• Dual-Channel Pattern
  – Homogeneous Dual-Channel Pattern
  – Heterogeneous Peer-Channel Pattern
  – Sanity Check Pattern
  – Monitor-Actuator Pattern
• Voting Multichannel Pattern
Protected Single Channel Pattern
• Within the single channel, mechanisms exist to identify and handle faults
• All faults must be detected within the fault tolerance time
• May be impossible
  – to test for all faults within the fault tolerance time
  – to remove common mode failures from the single channel
• Generally, less recurring system cost
  – no additional hardware required
[Figure: Protected Single Channel Pattern, single channel train braking system. Callout: "If I'm not getting life ticks, I'll shut down!"]
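The life-tick idea in the braking example can be sketched as a small monitor that shuts down once ticks stop arriving. This is a minimal sketch, not the design from the slides: LifetickMonitor and its methods are hypothetical names, the shutdown is latched, and time is passed in explicitly so the logic can be tested without a real clock.

```cpp
#include <cassert>
#include <chrono>

// Monitors life ticks from a control channel (hypothetical class).
class LifetickMonitor {
public:
    using Millis = std::chrono::milliseconds;

    LifetickMonitor(Millis timeout, Millis now) : timeout_(timeout), lastTick_(now) {
        assert(timeout > Millis(0));
    }

    // Called whenever the monitored channel reports that it is alive.
    // Ticks that arrive after a shutdown are ignored.
    void tick(Millis now) {
        if (!shutDown_) lastTick_ = now;
    }

    // Called periodically; latches the shutdown state once ticks stop arriving.
    bool poll(Millis now) {
        if (now - lastTick_ > timeout_) shutDown_ = true;
        return shutDown_;
    }

    bool isShutDown() const { return shutDown_; }

private:
    Millis timeout_;
    Millis lastTick_;
    bool shutDown_ = false;
};
```

Latching is a design choice here: once the monitor has decided the channel is dead, a late tick does not silently bring the system back, and leaving the shutdown state would need an explicit restart.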
Dual Channel Architecture Patterns
• Separation of safety-relevant from non-safety relevant where possible
• Separation of monitoring from control
• Generally easier to meet safety requirements
  – timing
  – common mode failures
• Generally higher recurring system cost
  – additional hardware required
Homogeneous Dual-Channel Pattern
• Identical channels used
• Channels may operate simultaneously (Multichannel Vote Pattern)
• Channels may operate in series (Backup Pattern)
• Good at identifying random faults but not systematic faults
• Low R&D cost, higher recurring cost
[Figure: Homogeneous Dual-Channel Pattern]
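When identical channels run simultaneously, their outputs have to be compared or voted on. The following is a minimal sketch of a majority voter over channel outputs. The function name is illustrative, and integer outputs are assumed so that exact comparison is meaningful.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <vector>

// Returns the value reported by a strict majority of channels, or nullopt
// if no value has a majority (the caller should then enter a fail-safe state).
std::optional<int> majorityVote(const std::vector<int>& outputs) {
    for (std::size_t i = 0; i < outputs.size(); ++i) {
        std::size_t agree = 0;
        for (int other : outputs) {
            if (other == outputs[i]) ++agree;
        }
        if (2 * agree > outputs.size()) return outputs[i];
    }
    return std::nullopt;
}
```

With three channels a single random fault is outvoted. With only two channels a disagreement can be detected but not resolved, so the voter returns no value. Because the channels are identical, a systematic fault produces the same wrong answer in all of them and passes the vote.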
Heterogeneous Peer-Channel Pattern
• Equal weight, differently implemented channels
  – may use algorithmic inversion to recreate initial data
  – may use different algorithm
  – may use different teams (not fool proof because of hot spots that can cause failures)
• Good at identifying both random and systematic faults
• Generally safest, but higher R&D and recurring cost
[Figure: Heterogeneous Peer-Channel Pattern]
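Algorithmic inversion can be illustrated with an integer square root: one channel computes the root, and the peer channel does not recompute it but inverts the operation and checks that the result recreates the input. This is a sketch with hypothetical function names, and both channels run in one process here only to keep the example short.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>

// Primary channel: integer square root by bitwise search.
std::uint32_t isqrtPrimary(std::uint32_t x) {
    std::uint32_t result = 0;
    for (std::uint32_t bit = 1u << 15; bit != 0; bit >>= 1) {
        const std::uint32_t candidate = result | bit;
        if (static_cast<std::uint64_t>(candidate) * candidate <= x) result = candidate;
    }
    return result;
}

// Peer channel: inverts the operation and checks r*r <= x < (r+1)*(r+1).
bool inversionAgrees(std::uint32_t x, std::uint32_t r) {
    const std::uint64_t wide = r;
    const std::uint64_t lo = wide * wide;
    const std::uint64_t hi = (wide + 1) * (wide + 1);
    return lo <= x && x < hi;
}

// Returns the root only if both channels agree; nullopt signals a fault.
std::optional<std::uint32_t> checkedIsqrt(std::uint32_t x) {
    const std::uint32_t r = isqrtPrimary(x);
    if (!inversionAgrees(x, r)) return std::nullopt;
    return r;
}
```

Because the check uses multiplication rather than the search loop, a systematic defect in the primary algorithm is unlikely to be repeated in the checker, which is the point of heterogeneous channels.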
Sanity Check Pattern
• A primary actuator channel does real computations
• A light-weight secondary channel checks the reasonableness of the primary channel
• Good for detection of both random and systematic faults
• May not detect faults which result in small variance
• Relatively inexpensive to implement
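A reasonableness check can be as small as comparing the primary output against a cheap independent estimate. The sketch below assumes a fixed absolute tolerance. SanityChecker and its parameter names are illustrative.

```cpp
#include <cassert>
#include <cmath>

// Light-weight secondary channel: it does not redo the primary computation,
// it only checks the result against a cheap, independent estimate.
class SanityChecker {
public:
    explicit SanityChecker(double tolerance) : tolerance_(tolerance) {
        assert(tolerance >= 0.0);
    }

    // True if the primary output is finite and within tolerance of the estimate.
    bool isReasonable(double primaryOutput, double roughEstimate) const {
        return std::isfinite(primaryOutput) &&
               std::fabs(primaryOutput - roughEstimate) <= tolerance_;
    }

private:
    double tolerance_;
};
```

With a tolerance of 5.0 around an estimate of 100.0, an output of 150.0 is rejected, but an output of 104.0 is accepted even if the correct value was 100.0. That is the small-variance limitation listed above.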
Monitor-Actuator Pattern
• Separates actuation from the monitoring of that actuation
• If the actuator channel fails, the monitor channel detects it
• If the monitor channel fails, the actuator channel continues correctly
• Requires fault isolation to be single-fault tolerant
  – actuator channel cannot use the monitor itself
[Figure: Monitor-Actuator Pattern, dual-channel design architecture]
Safety Executive Pattern
• Large scale architectural pattern
• Controller subsystem (safety executive)
• One or more watchdog subsystems
  – check on system health
  – ensure proper actuation is occurring
• One or more actuation channels
• Recovery subsystem (fail-safe processing channel)
• Appropriate when
  – A set of fail-safe system states needs to be entered when failures are identified
  – Determination of failures is complex
  – Several safety-related system actions are controlled simultaneously
  – Safety-related actions are not independent
  – Determining proper safety action in the event of a failure can be complex
[Figure: Safety Executive Pattern]
Detailed Design for Safety
• Make it right before you make it fast
  – simple, clear algorithms and code
  – optimize only the 10%-20% of code which affects performance
  – use "safe" language subsets
  – ensure you haven't introduced any common mode failures
• Thoroughly test
  – unit test and peer review
  – integration test
  – validation test
• Verify that it remains right throughout program execution
  – exceptions
  – invariant assertions
  – range checking
  – index and boundary checking
• When it's not right during execution, then make it right with corrective or protective measures
• Use "safe" language subsets
  – strong compile-time checking
    • if you use C, use "lint"
  – strong run-time checking
  – exception handling
  – avoid void*
  – avoid error prone statements and syntax
    • you can make C++ safe but it's not safe out of the box
• Language choice
  – Compile-time checking (C versus Ada)
  – Run-time checking (C versus Ada)
    • Exceptions versus error codes
• Language selection
  – "C treats you like a consenting adult. Pascal treats you like a naughty child. Ada treats you like a criminal."
Pascal example
program WontCompile;
type
  MySubRange = 0 .. 20;
  Day = (Monday, Tuesday, Wednesday, Thursday, Friday, Saturday, Sunday);
var
  MyVar: MySubRange;
  MyDate: Day;
begin
  MyVar := 21;   { will not compile -- range error! }
  MyDate := 0;   { will not compile -- wrong type! }
end.
Ada example
procedure MyProc is
   type Byte is range 0 .. 255;
   MyArray : array (1 .. 10) of Integer;
   b : Byte;
begin
   for j in 0 .. 10 loop
      MyArray(j) := j**6;   -- raises exception on first time through: index 0 is out of range
   end loop;
   b := Byte(MyArray(10));  -- will fail run-time range check
end MyProc;
Exceptions
• Some languages (Pascal, Modula-2) have a draconian error handling policy
  – exception raised and program terminated
  – not good for embedded systems
• Ada and C++ allow run-time recovery through user-defined exceptions and exception handlers
• Without exceptions, it takes a lot of extra code to check the statement
  a[j] := b;
Detailed Design for Safety
• Do not allow ignoring of error indications
  – checking of return values is a manual process
  – user of the function must remember each and every time
  – easy to circumvent this error handling system
• Separate normal code from error handling code
• Handle errors at the lowest level with sufficient context to correct the problem
Error handling code
a = getfone(&b, &c);
if (a) {
    switch (a) {
        case 1: /* ... */ break;
        case 2: /* ... */ break;
    }
}
d = getftwo(b, c);
if (d) {
    switch (d) {
        case 1: /* ... */ break;
        case 2: /* ... */ break;
    }
}
In this code the normal execution path is:
a = getfone(&b, &c);
d = getftwo(b, c);
Built-in exception types
procedure enqueue (q: in out queue; v: in FLOAT) is
begin
   if full(q) then
      raise overflow;
   end if;
   q.body((q.head + q.length) mod qSize) := v;
   q.length := q.length + 1;
end enqueue;
Caller of the sequence handles exception
procedure testQ (q: in out queue) is
begin
   for j in 1 .. 10 loop
      enqueue(q, random(1000));
   end loop;
exception
   when overflow =>
      puts("Test failed due to queue overflow");
end testQ;
C++ exception handling
• Extends capabilities beyond that of Ada
• Exceptions extended by type rather than value
  – possible to create hierarchies of exception classes and catch by thrown subclass type
  – class can contain different types of information about the kind of device that failed
  – this facilitates error recovery, debugging, and user error reporting
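Catching by type can be sketched with a two-level hierarchy in which each exception class carries information about the failed device. DeviceFault, SensorFault, readSensor and the 0..1023 range are hypothetical names and values chosen for this sketch.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Hypothetical base class for device faults; records which device failed.
class DeviceFault : public std::runtime_error {
public:
    DeviceFault(const std::string& device, const std::string& what)
        : std::runtime_error(device + ": " + what), device_(device) {}
    const std::string& device() const { return device_; }
private:
    std::string device_;
};

// More specific fault that also carries the offending raw reading.
class SensorFault : public DeviceFault {
public:
    SensorFault(const std::string& device, int rawReading)
        : DeviceFault(device, "implausible reading"), rawReading_(rawReading) {}
    int rawReading() const { return rawReading_; }
private:
    int rawReading_;
};

// Hypothetical read that rejects values outside an assumed 0..1023 range.
int readSensor(int raw) {
    if (raw < 0 || raw > 1023) throw SensorFault("pressure sensor", raw);
    return raw;
}

// Recovery is chosen by exception type: most specific handler first.
std::string readOrRecover(int raw) {
    try {
        return "ok " + std::to_string(readSensor(raw));
    } catch (const SensorFault& e) {
        return "sensor fault, raw=" + std::to_string(e.rawReading());
    } catch (const DeviceFault& e) {
        return "device fault: " + e.device();
    }
}
```

A handler for DeviceFault also catches SensorFault, so generic recovery code keeps working when new, more specific fault classes are added later.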
Making C++ safe
• Overloading the [ ] operator with index range checking improves the safety of arrays
• Making classes of scalars and overloading the assignment operator allows additional range and value checking
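Both techniques can be sketched in a few lines. CheckedArray and RangedInt are illustrative names rather than library classes, and throwing a standard exception on a violation is one possible policy among several.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>

// Fixed-size array whose operator[] checks the index on every access.
template <typename T, std::size_t N>
class CheckedArray {
public:
    T& operator[](std::size_t i) {
        if (i >= N) throw std::out_of_range("CheckedArray index");
        return data_[i];
    }
    const T& operator[](std::size_t i) const {
        if (i >= N) throw std::out_of_range("CheckedArray index");
        return data_[i];
    }
private:
    T data_[N] = {};
};

// Integer restricted to [Lo, Hi]; construction and assignment both check.
template <int Lo, int Hi>
class RangedInt {
public:
    RangedInt(int v) : value_(check(v)) {}
    RangedInt& operator=(int v) {
        value_ = check(v);  // value_ is left unchanged if check throws
        return *this;
    }
    operator int() const { return value_; }
private:
    static int check(int v) {
        if (v < Lo || v > Hi) throw std::range_error("RangedInt value");
        return v;
    }
    int value_;
};
```

RangedInt<0, 20> behaves much like the Pascal subrange 0 .. 20 shown earlier, except that the violation is reported at run time rather than at compile time.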
Detailed Design for Safety
• Data Validity Checks
  – CRC (16 bit or 32 bit)
    • identifies all single or dual bit errors
    • detects high percentage of multiple bit errors
    • table or compute-driven
    • chips are available
  – checksum
  – redundant storage
    • ones complement
• Redundancy should be set every write access
• Data should be checked every read access
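The set-on-write, check-on-read rule can be sketched with a wrapper that stores a CRC next to the value. This is a minimal sketch: it uses a bitwise CRC-16 with polynomial 0x1021 and initial value 0xFFFF, CrcProtected and seedFault are hypothetical names, and it assumes a type without padding bytes.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <stdexcept>
#include <type_traits>

// Bitwise CRC-16 (polynomial 0x1021, initial value 0xFFFF).
std::uint16_t crc16(const unsigned char* data, std::size_t len) {
    std::uint16_t crc = 0xFFFF;
    for (std::size_t i = 0; i < len; ++i) {
        crc ^= static_cast<std::uint16_t>(data[i] << 8);
        for (int bit = 0; bit < 8; ++bit) {
            crc = (crc & 0x8000) ? static_cast<std::uint16_t>((crc << 1) ^ 0x1021)
                                 : static_cast<std::uint16_t>(crc << 1);
        }
    }
    return crc;
}

// Value whose CRC is set on every write and checked on every read.
template <typename T>
class CrcProtected {
    static_assert(std::is_trivially_copyable<T>::value, "raw bytes are checksummed");
public:
    explicit CrcProtected(const T& v) { set(v); }

    void set(const T& v) {
        value_ = v;
        crc_ = computeCrc();
    }

    const T& get() const {
        if (computeCrc() != crc_) throw std::runtime_error("data corrupted");
        return value_;
    }

    // Test hook for fault seeding: flips one bit without updating the CRC.
    void seedFault(std::size_t byteIndex) {
        unsigned char* raw = reinterpret_cast<unsigned char*>(&value_);
        raw[byteIndex % sizeof(T)] ^= 0x01;
    }

private:
    std::uint16_t computeCrc() const {
        unsigned char raw[sizeof(T)];
        std::memcpy(raw, &value_, sizeof(T));
        return crc16(raw, sizeof(T));
    }
    T value_;
    std::uint16_t crc_;
};
```

The seedFault hook exists so that a test can corrupt the stored value and confirm that the next read is rejected. A production version would more likely use a table-driven CRC or a hardware CRC unit.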
ANSI C++ Exception Class Hierarchy
• exception
  – logic error
    • domain error
    • invalid argument
    • length error
    • out of range
  – runtime error
    • range error
    • overflow error
Safety Process (Development)
• Do Hazard Analysis early and often
• Track safety measures from hazard analysis to
  – requirements specification
  – design
  – code
  – validation tests
• Test safety measures with fault seeding
Safety Process (Deployment)
• Install safely
  – ensure proper means are used to set up system
  – safety measures are installed and checked
• Deploy safely
  – ensure safety measures are periodically checked and serviced
  – do not turn off safety measures
• Decommission safely
  – removal of hazardous materials
IEC Overall Safety Lifecycle
[Figure: lifecycle flowchart; phases shown]
• Concept
• Overall scope definition
• Hazard and risk analysis
• Overall safety requirements
• Safety requirements allocation
• Overall planning
  – operation and maintenance planning
  – validation planning
  – installation planning
• SRS: E/E/PES realization
• SRS: other technology realization
• External risk reduction facilities
• Overall installation and commissioning
• Overall safety validation
• Overall operation and maintenance
• Overall modification and retrofit
• Decommissioning
SRS: Safety Related System. E/E/PES: Electrical/Electronic/Programmable Electronic System.
Safety in Testing in R&D
• Use fault-seeding
  – Unit (class) testing
    • white box
    • procedural invariant violation assertions
    • peer reviews
  – Integration testing
    • grey box
  – Validation testing
    • black box
    • externally caused faults
    • (grey box) internally seeded faults
Safety Testing During Operation
• Power on Self-Test (POST)
  – Check for latent faults
  – All safety measures must be tested at power on and periodically
    • RAM (stuck-at, shorts, cell failures)
    • ROM
    • Flash
    • Disks
    • CPU
    • Interfaces
    • Buses
• Built-In Tests
  – Repeats some of POST
  – Data integrity checks
  – Index and pointer validity checking
  – Subrange value invariant assertions
  – Proper functioning
    • Watchdogs
    • Reasonableness checks
    • Lifeticks
A Simplified Example: A Linear Accelerator
Unsafe Linear Accelerator
[Figure: a single CPU sets beam intensity and beam duration (1. Set Dose, 2. Start Beam, 3. End Beam); a sensor reports the radiation dose]
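Fault Tree Analysis
[Figure: fault tree for the top event "Over radiation"; events are combined through OR and AND gates]
• Events shown: Radiation command invalid (EMI, software defect), Shutoff timer failure, CPU halted (EMI, CPU failure, software defect), Beam engaged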
Hazards of the Linear Accelerator
Hazard | Level of risk | Tolerance time T1 | Fault | Likelihood | Detection time | Control measure | Exposure time
Over radiation | Severe | 100 ms | CPU locks up | rare | 50 ms | Safety CPU checks lifetick at 25 ms | 50 ms
 | | | Corrupt data settings | often | 10 ms | 32-bit CRCs on data checked every access | 15 ms
Under radiation | Moderate | 2 weeks | Corrupt data setting | often | 10 ms | 32-bit CRCs on data checked every access | 15 ms
Inadvertent radiation on power on | Severe | 100 ms | Beam left engaged during power down | often | n/a | Curtain mechanically shuts at power down | 0 ms
Safe Linear Accelerator
[Figure: the same CPU, beam intensity and duration control (1. Set Dose, 2. Start Beam, 3. End Beam), sensor and radiation dose, with an added Safety CPU]
• Periodic watchdog service
• Self test results shared prior to operation
• Deenergize
• Mechanical shutoff when curtain low
Summary
• Safety is a system issue
• It is cheaper and more effective to include safety early on than to add it later
• Safety architectures provide programming-in-the-large safety
• Safe coding rules and detailed design provide programming-in-the-small safety