Safety in large technology systems
• Technology Residential College
• October 13, 1999
• Dan Little
Technology failure
• Why do large, complex systems sometimes fail so spectacularly? Do the easy explanations of “operator error,” “faulty technology,” or “complexity” suffice? Are there managerial causes of technology failure? Are there design principles and engineering protocols that can enhance large system safety? What is the role of software in safety and failure?
Surprising failures
• Franco-Prussian war, Israeli intelligence failure in Yom Kippur war
• The Mercedes A-Class sedan and the moose test
• Chernobyl nuclear power meltdown
Therac-25
• high energies
• computer control rather than electromechanical control
• positioning the turntable: x-ray beam flattener
• 15,000 rad administered rather than 200 rad
Causes of failure
• Complexity and multiple causal pathways and relations
• defective procedures
• defective training systems
• “human” error
• faulty design
Technology failure
• sources of failure
– management failures
– design failures
– proliferating random failures
– “storming” the system
• design for “soft landings”
• crisis management
Information and decision-making
• Information flow and management of complex technology systems
• complex organizations pursue multiple objectives simultaneously
• complex organizations pursue the same objective along different and conflicting paths
Sources of potential failure
• hardware interlocks replaced with software checks on turntable position
• cryptic malfunction codes; frequent messages
• excessive operator confidence in safety systems
• lack of effective mechanism for reporting and investigating failures
• poor software engineering practices
Causes of failure
• “The causes of accidents are frequently, if not almost always, rooted in the organization--its culture, management, and structure. These factors are all critical to the eventual safety of the engineered system” (Leveson, 47).
Organizational factors
• “Large-scale engineered systems are more than just a collection of technological artifacts: They are a reflection of the structure, management, procedures, and culture of the engineering organization that created them, and they are also, usually, a reflection of the society in which they were created” (Leveson, 47).
Advice for better software design
• design for the worst case
• avoid “single point of failure” designs
• design “defensively”
• investigate failures carefully and extensively
• look for “root cause,” not symptom or specific transient cause
• embed audit trails; design for simplicity
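The “embed audit trails” advice can be sketched in a few lines: record every safety-critical command before it is executed, so that investigators can reconstruct the sequence of events after a failure. The names below (`AuditLog`, `set_beam_energy`) are illustrative assumptions, not from the lecture.

```python
import json
import time

class AuditLog:
    """Append-only record of every safety-critical command."""

    def __init__(self):
        self.entries = []

    def record(self, command, parameters):
        # Log *before* acting, so the trail survives even if the action crashes.
        self.entries.append({
            "time": time.time(),
            "command": command,
            "parameters": parameters,
        })

def set_beam_energy(log, energy_mev):
    # Hypothetical actuator call; the audit entry is written first.
    log.record("set_beam_energy", {"energy_mev": energy_mev})
    # ... hardware command would go here ...

log = AuditLog()
set_beam_energy(log, 6.0)
print(json.dumps(log.entries[0], default=str))
```

Writing the entry before the action is the key design choice: a trail written only on success tells investigators nothing about the command that failed.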
Design for safety
• hazard elimination
• hazard reduction
• hazard control
• damage reduction
System safety
• builds in safety, not simply adding it on to a completed design
• deals with systems as a whole rather than subsystems or components
• takes a larger view of hazards than just failures
• emphasizes analysis rather than past experience and standards
System safety (2)
• emphasizes qualitative rather than quantitative approaches
• recognizes the importance of tradeoffs and conflicts in system design
• more than just system engineering
Hazard analysis
• development: identify and assess potential hazards
• operations: examine an existing system to improve its safety
• licensing: examine a planned system to demonstrate acceptable safety to a regulatory authority
Hazard analysis (2)
• construct an exhaustive inventory of hazards early in design
• classify by severity and probability
• construct causal pathways that lead to hazards
• design so as to eliminate, reduce, control, or ameliorate
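The classification step above can be sketched as a simple severity-by-probability matrix that maps each hazard to one of the four responses (eliminate, reduce, control, ameliorate). The category names follow the slides; the numeric scores and thresholds are illustrative assumptions, not a standard.

```python
# Severity and probability rankings; higher numbers are worse / more likely.
SEVERITY = {"catastrophic": 4, "critical": 3, "marginal": 2, "negligible": 1}
PROBABILITY = {"frequent": 4, "probable": 3, "occasional": 2, "remote": 1}

def risk_response(severity, probability):
    """Combine severity and probability into a coarse design response."""
    score = SEVERITY[severity] * PROBABILITY[probability]
    if score >= 12:
        return "eliminate"   # redesign to remove the hazard entirely
    if score >= 6:
        return "reduce"      # lower likelihood or severity
    if score >= 3:
        return "control"     # interlocks, procedures, monitoring
    return "ameliorate"      # plan for damage reduction only

# A toy hazard inventory in the spirit of the Therac-25 discussion.
hazards = [
    ("overdose from beam misconfiguration", "catastrophic", "occasional"),
    ("spurious malfunction message", "negligible", "frequent"),
]
for name, sev, prob in hazards:
    print(f"{name}: {risk_response(sev, prob)}")
```

The point of the matrix is ordering the engineering effort: the rarest catastrophic hazard can still outrank a frequent nuisance.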
Safe software design
• control software should be designed with maximum simplicity (408)
• design should be testable; limited number of states
• avoid multitasking; use polling rather than interrupts
• design should be easy to read and understand
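The advice above (a limited number of states, polling rather than interrupts) can be illustrated with a minimal polled control loop: every state is enumerated, inputs are read once per cycle, and any bad reading forces a safe state. The state names and inputs are invented for illustration; a real controller would read hardware registers.

```python
STATES = ("IDLE", "ARMED", "FIRING", "FAULT")  # every state enumerated

def step(state, door_closed, dose_ok):
    """One polling cycle: read inputs, return the next state."""
    if not dose_ok:
        return "FAULT"                     # any bad reading forces a safe state
    if state == "IDLE":
        return "ARMED" if door_closed else "IDLE"
    if state == "ARMED":
        return "FIRING" if door_closed else "IDLE"
    if state == "FIRING":
        return "FIRING" if door_closed else "FAULT"
    return "FAULT"                         # FAULT is absorbing

state = "IDLE"
for inputs in [(True, True), (True, True), (False, True)]:
    state = step(state, *inputs)
print(state)  # door opened during firing -> FAULT
```

Because the whole behavior is one small pure function over a finite state set, worst-case timing and every transition can be checked by reading the code, which is exactly what the testability bullet asks for.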
Safe software (2)
• interactions between components should be limited and straightforward
• worst-case timing should be determinable by review of code
• code should include only the minimum features and capabilities required by the system; no unnecessary or undocumented features
Safe software (3)
• critical decisions (launch a missile) should not be made on values often taken by failed components -- 0 or 1.
• Messages should be designed in ways to eliminate the possibility of computer hardware failures having hazardous consequences (missile launch example)
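One common way to realize this advice is to authorize a critical action only on a specific multi-bit arming pattern, so that a stuck-at-0 or stuck-at-1 hardware failure (which produces all-zero or all-one words) cannot accidentally look like a command. The particular pattern value below is an illustrative assumption.

```python
ARM_PATTERN = 0xA5C3  # deliberately neither 0x0000 nor 0xFFFF

def authorize(command_word):
    """Permit the critical action only on the exact arming pattern."""
    return command_word == ARM_PATTERN

print(authorize(0x0000))  # all bits stuck low  -> False
print(authorize(0xFFFF))  # all bits stuck high -> False
print(authorize(0xA5C3))  # genuine command     -> True
```

The same idea applies to messages between components: a message format in which "fire" is any distinguished bit invites hazard, while one that requires an improbable pattern makes random hardware failure overwhelmingly likely to be refused.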
Safe software (4)
• strive for maximal decoupling of parts of a software control system
• accidents in tightly coupled systems are a result of unplanned interactions
• the flexibility of software encourages coupling and multiple functions; important to resist this impulse.
Safe software (5)
• “Adding computers to potentially dangerous systems is likely to increase accidents unless extra care is put into system design” (411).
Human interface considerations
• unambiguous error messages (Therac-25)
• operator needs extensive knowledge about the “theory” of the system
• alarms need to be comprehensible (TMI); spurious alarms minimized
• operator needs knowledge about timing and sequencing of events
• design of control board is critical
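The Therac-25 showed operators cryptic codes such as “MALFUNCTION 54” with no explanation. A sketch of the unambiguous alternative: every operator-visible fault states what happened, its consequence, and the required action. The message texts below are illustrative assumptions, not the machine's actual wording.

```python
# code -> (what happened, consequence, required operator action)
MESSAGES = {
    54: ("Delivered dose differs from prescribed dose.",
         "Treatment halted.",
         "Do not resume. Contact the physicist before further beam use."),
}

def describe(code):
    what, consequence, action = MESSAGES.get(
        code, ("Unknown fault.", "Treatment halted.", "Call service."))
    return f"FAULT {code}: {what} {consequence} {action}"

print(describe(54))
print(describe(99))  # unknown codes still get a safe default message
```

Note the safe default: an unrecognized code never produces a blank or misleading display, it produces the most conservative instruction.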
Control panel anomalies
Risk assessment and prediction
• What is involved in assessing risk?
– probability of failure
– prediction of consequences of failure
– failure pathways
Reasoning about risk
• How should we reason about risk?
• Expected utility: probability of outcome × utility of outcome
• Probability and science
• How to anticipate failure scenarios?
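The expected-utility rule from the slide is just a probability-weighted sum over outcomes. The scenario numbers below are invented for illustration; they are chosen so a frequent small harm and a rare catastrophic harm come out with the same expected utility, which is exactly the case where decision theory and actual decision-making diverge.

```python
def expected_utility(outcomes):
    """outcomes: list of (probability, utility) pairs summing to probability 1."""
    return sum(p * u for p, u in outcomes)

# Frequent small harm vs. rare catastrophic harm with equal expected value.
routine = [(0.10, -10.0), (0.90, 0.0)]
catastrophic = [(0.001, -1000.0), (0.999, 0.0)]

print(expected_utility(routine))       # -1.0
print(expected_utility(catastrophic))  # -1.0
```

On the decision-theory view these two risks are equivalent; the later slides argue that in practice the large harm is harder to absorb even at equal expected cost.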
Compare scenarios
• nuclear power vs. coal power
• automated highway system vs. routine traffic accidents
Ordinary reasoning and judgment
• well-known “fallacies” of ordinary reasoning:
– time preference
– framing
– risk aversion
Large risks and small risks
• the decision-theory approach: minimize expected harms
• the decision-making reality: large harms are more difficult to absorb, even if smaller in overall consequence
• example: JR West railway
Scope and limits of simulations
• Computer simulations permit “experiments” on different scenarios presented to complex systems
• Simulations are not reality
• Simulations represent some factors and exclude others
• Simulations rely on a mathematicization of the process that may be approximate or even false.