
UCRL-ID-114839

Software Reliability and Safety in Nuclear Reactor Protection Systems

Prepared by
J. Dennis Lawrence

Prepared for
U.S. Nuclear Regulatory Commission

FESSP

Fission Energy and Systems Safety Program

Lawrence Livermore National Laboratory

Disclaimer

This document was prepared as an account of work sponsored by an agency of the United States Government. Neither the United States Government nor any agency thereof, nor any of their employees, makes any warranty, express or implied, or assumes any legal liability or responsibility for the accuracy, completeness, or usefulness of any information, apparatus, product, or process disclosed, or represents that its use would not infringe privately owned rights. Reference herein to any specific commercial product, process, or service by trade name, trademark, manufacturer, or otherwise, does not necessarily constitute or imply its endorsement, recommendation, or favoring by the United States Government or any agency thereof. The views and opinions of authors expressed herein do not necessarily state or reflect those of the United States Government or any agency thereof.

This work was supported by the United States Nuclear Regulatory Commission under a Memorandum of Understanding with the United States Department of Energy.

UCRL-ID-114839

Software Reliability and Safety in Nuclear Reactor Protection Systems

Manuscript date: June 11, 1993

Prepared by
J. Dennis Lawrence
Lawrence Livermore National Laboratory
7000 East Avenue
Livermore, CA 94550

Prepared for
U.S. Nuclear Regulatory Commission


ABSTRACT

Planning the development, use, and regulation of computer systems in nuclear reactor protection systems in such a way as to enhance reliability and safety is a complex issue. This report is one of a series of reports from the Computer Safety and Reliability Group, Lawrence Livermore National Laboratory, that investigates different aspects of computer software in reactor protection systems. There are two central themes in the report. First, software considerations cannot be fully understood in isolation from computer hardware and application considerations. Second, the process of engineering reliability and safety into a computer system requires activities to be carried out throughout the software life cycle. The report discusses the many activities that can be carried out during the software life cycle to improve the safety and reliability of the resulting product. The viewpoint is primarily that of the assessor, or auditor.


CONTENTS

1. Introduction
   1.1. Purpose
   1.2. Scope
   1.3. Report Organization
2. Terminology
   2.1. Systems Terminology
   2.2. Software Reliability and Safety Terminology
      2.2.1. Faults, Errors, and Failures
      2.2.2. Reliability and Safety Measures
      2.2.3. Safety Terminology
   2.3. Life Cycle Models
      2.3.1. Waterfall Model
      2.3.2. Phased Implementation Model
      2.3.3. Spiral Model
   2.4. Fault and Failure Classification Schemes
      2.4.1. Fault Classifications
      2.4.2. Failure Classifications
   2.5. Software Qualities
3. Life Cycle Software Reliability and Safety Activities
   3.1. Planning Activities
      3.1.1. Software Project Management Plan
      3.1.2. Software Quality Assurance Plan
      3.1.3. Software Configuration Management Plan
      3.1.4. Software Verification and Validation Plan
      3.1.5. Software Safety Plan
      3.1.6. Software Development Plan
      3.1.7. Software Integration Plan
      3.1.8. Software Installation Plan
      3.1.9. Software Maintenance Plan
      3.1.10. Software Training Plan
   3.2. Requirements Activities
      3.2.1. Software Requirements Specification
      3.2.2. Requirements Safety Analysis
   3.3. Design Activities
      3.3.1. Hardware and Software Architecture
      3.3.2. Software Design Specification
      3.3.3. Software Design Safety Analysis
   3.4. Implementation Activities
      3.4.1. Code Safety Analysis
   3.5. Integration Activities
      3.5.1. System Build Documents
      3.5.2. Integration Safety Analysis
   3.6. Validation Activities
      3.6.1. Validation Safety Analysis
   3.7. Installation Activities
      3.7.1. Operations Manual
      3.7.2. Installation Configuration Tables
      3.7.3. Training Manuals
      3.7.4. Maintenance Manuals
      3.7.5. Installation Safety Analysis
   3.8. Operations and Maintenance Activities: Change Safety Analysis
4. Recommendations, Guidelines, and Assessment
   4.1. Planning Activities
      4.1.1. Software Project Management Plan
      4.1.2. Software Quality Assurance Plan
      4.1.3. Software Configuration Management Plan
      4.1.4. Software Verification and Validation Plan
      4.1.5. Software Safety Plan
      4.1.6. Software Development Plan
      4.1.7. Software Integration Plan
      4.1.8. Software Installation Plan
      4.1.9. Software Maintenance Plan
   4.2. Requirements Activities
      4.2.1. Software Requirements Specification
      4.2.2. Requirements Safety Analysis
   4.3. Design Activities
      4.3.1. Hardware/Software Architecture Specification
      4.3.2. Software Design Specification
      4.3.3. Design Safety Analysis
   4.4. Implementation Activities
      4.4.1. Code Listings
      4.4.2. Code Safety Analysis
   4.5. Integration Activities
      4.5.1. System Build Documents
      4.5.2. Integration Safety Analysis
   4.6. Validation Activities
      4.6.1. Validation Safety Analysis
   4.7. Installation Activities
      4.7.1. Installation Safety Analysis
Appendix: Technical Background
   A.1. Software Fault Tolerance Techniques
      A.1.1. Fault Tolerance and Redundancy
      A.1.2. General Aspects of Recovery
      A.1.3. Software Fault Tolerance Techniques
   A.2. Reliability and Safety Analysis and Modeling Techniques
      A.2.1. Reliability Block Diagrams
      A.2.2. Fault Tree Analysis
      A.2.3. Event Tree Analysis
      A.2.4. Failure Modes and Effects Analysis
      A.2.5. Markov Models
      A.2.6. Petri Net Models
   A.3. Reliability Growth Models
      A.3.1. Duane Model
      A.3.2. Musa Model
      A.3.3. Littlewood Model
      A.3.4. Musa-Okumoto Model
References
   Standards
   Books, Articles, and Reports
Bibliography


Figures

Figure 2-1. Documents Produced During Each Life Cycle Stage
Figure 2-2. Waterfall Life Cycle Model
Figure 2-3. Spiral Life Cycle Model
Figure 3-1. Software Planning Activities
Figure 3-2. Outline of a Software Project Management Plan
Figure 3-3. Outline of a Software Quality Assurance Plan
Figure 3-4. Outline of a Software Configuration Management Plan
Figure 3-5. Verification and Validation Activities
Figure 3-6. Outline of a Software Verification and Validation Plan
Figure 3-7. Outline of a Software Safety Plan
Figure 3-8. Outline of a Software Development Plan
Figure 3-9. Outline of a Software Integration Plan
Figure 3-10. Outline of a Software Installation Plan
Figure 3-11. Outline of a Software Maintenance Plan
Figure 3-12. Outline of a Software Requirements Plan
Figure A-1. Reliability Block Diagram of a Simple System
Figure A-2. Reliability Block Diagram of Single, Duplex, and Triplex Communication Line
Figure A-3. Reliability Block Diagram of Simple System with Duplexed Communication Line
Figure A-4. Reliability Block Diagram that Cannot Be Constructed from Serial and Parallel Parts
Figure A-5. Simple Fault Tree
Figure A-6. AND Node Evaluation in a Fault Tree
Figure A-7. OR Node Evaluation in a Fault Tree
Figure A-8. Example of a Software Fault Tree
Figure A-9. Simple Event Tree
Figure A-10. A Simple Markov Model of a System with Three CPUs
Figure A-11. Markov Model of a System with CPUs and Memories
Figure A-12. Simple Markov Model with Varying Failure Rates
Figure A-13. Markov Model of a Simple System with Transient Faults
Figure A-14. An Unmarked Petri Net
Figure A-15. Example of a Marked Petri Net
Figure A-16. The Result of Firing Figure A-15
Figure A-17. A Petri Net for the Mutual Exclusion Problem
Figure A-18. Petri Net for a Railroad Crossing
Figure A-19. Execution Time Between Successive Failures of an Actual System

Tables

Table 2-1. Persistence Classes and Fault Sources
Table A-1. Failure Rate Calculation


ABBREVIATIONS AND ACRONYMS

ANSI    American National Standards Institute
CASE    Computer-Assisted Software Engineering
CCB     Configuration Control Board
CI      Configuration Item
CM      Configuration Management
CPU     Central Processing Unit
ETA     Event Tree Analysis
FBD     Functional Block Diagram
FLBS    Functional Level Breakdown Structure
FMEA    Failure Modes and Effects Analysis
FMECA   Failure Modes, Effects and Criticality Analysis
FTA     Fault Tree Analysis
I&C     Instrumentation and Control
I/O     Input/Output
IEEE    Institute of Electrical and Electronic Engineers
MTTF    Mean Time To Failure
PDP     Previously Developed or Purchased
PERT    Program Evaluation and Review Technique
QA      Quality Assurance
RAM     Random Access Memory
ROM     Read Only Memory
SCM     Software Configuration Management
SCMP    Software Configuration Management Plan
SPMP    Software Project Management Plan
SQA     Software Quality Assurance
SQAP    Software Quality Assurance Plan
SRS     Software Requirements Specification
SSP     Software Safety Plan
TMR     Triple Modular Redundancy
UCLA    University of California at Los Angeles
UPS     Uninterruptable Power Supply
V&V     Verification and Validation
WBS     Work Breakdown Structure


EXECUTIVE SUMMARY

The development, use, and regulation of computer systems in nuclear reactor protection systems to enhance reliability and safety is a complex issue. This report is one of a series of reports from the Computer Safety and Reliability Group, Lawrence Livermore National Laboratory, which investigates different aspects of computer software in reactor protection systems.

There are two central themes in this report. First, software considerations cannot be fully understood in isolation from computer hardware and application considerations. Second, the process of engineering reliability and safety into a computer system requires activities to be carried out throughout the software life cycle. These two themes affect both the structure and the content of this report.

Reliability and safety are concerned with faults, errors, and failures. A fault is a triggering event that causes things to go wrong; a software bug is an example. The fault may cause a change of state in the computer, which is termed an error. The error remains latent until the incorrect state is used; it is then termed effective. It may then cause an externally visible failure. Only the failure is visible outside the computer system. Preventing or correcting the failure can be done at any of these levels: preventing or correcting the causative fault, preventing the fault from causing an error, preventing the error from causing a failure, or preventing the failure from causing damage. The techniques for achieving these goals are termed fault prevention, fault correction, and fault tolerance.

Reliability and safety are related, but not identical, concepts. Reliability, as defined in this report, is a measure of how long a system will run without failure of any kind, while safety is a measure of how long a system will run without catastrophic failure. Thus safety is directly concerned with the consequences of failure, not merely the existence of failure. As a result, safety is a system issue, not simply a software issue, and must be analyzed and discussed as a property of the entire reactor protection system.

Faults and failures can be classified in several different ways. Faults can be described as design faults, operational faults, or transient faults. All software faults are design faults; however, hardware faults may occur in any of the three classes. This is important in a safety-related system, since the software may be required to compensate for the operational faults of the hardware. Faults can also be classified by the source of the fault; software and hardware are two of the possible sources discussed in the report. Others are: input data, system state, system topology, people, environment, and unknown. For example, the source of many transient faults is unknown.

Failures are classified by mode and scope. A failure mode may be sudden or gradual, partial or complete; all four combinations of these are possible. The scope of a failure describes the extent within the system of the effects of the failure. This may range from an internal failure, whose effect is confined to a single small portion of the system, to a pervasive failure, which affects much of the system.

Many different life cycle models exist for developing software systems. These differ in the timing of the various activities that must be done in order to produce a high-quality software product, but the actual activities must be done in any case. No particular life cycle is recommended here, but there are extensive comments on the activities that must be carried out. These have been divided into eight categories, termed sets of activities in the report. These sets are used merely to group related activities; there is no implication that the activities in any one set must all be carried out at the same time, or that activities in later sets must follow those of earlier sets. The eight categories are as follows:

Planning activities result in the creation of a number of documents that are used to control the development process. Eleven are recommended here: a Software Project Management Plan, a Software Quality Assurance Plan, a Software Configuration Management (CM) Plan, a Software Verification and Validation (V&V) Plan, a Software Safety Plan, a Software Development Plan, a Software Integration Plan, a Software Installation Plan, a Software Maintenance Plan, a Software Training Plan, and a Software Operations Plan. Many of these plans are discussed in detail, relying on various ANSI/IEEE standards when these exist for the individual plans.

The second set of activities relates to documenting the requirements for the software system. Four documents are recommended: the Software Requirements Specification, a Requirements Safety Analysis, a V&V Requirements Analysis, and a CM Requirements Report. These documents will fully capture all the requirements of the software project, and relate these requirements to the overall protection system functional requirements and protection system safety requirements.

The design activities include five recommended documents. The Hardware and Software Architecture will describe the computer system design at a fairly high level, giving hardware devices and mapping software activities to those devices. The Software Design Specification provides the complete design of the software products. Design analyses include the Design Safety Analysis, the V&V Design Analysis, and the CM Design Report.

Implementation activities include writing and analyzing the actual code, using some programming language. Documents include the actual code listings, the Code Safety Analysis, the V&V Implementation Analysis and Test Report, and the CM Implementation Report.

Integration activities are those activities that bring software, hardware, and instrumentation together to form a complete computer system. Documents include the System Build Documents, the Integration Safety Analysis, the V&V Integration Analysis and Test Report, and the CM Integration Report.

Validation is the process of ensuring that the final complete computer system achieves the original goals that were imposed by the protection system design. The final system is matched against the original requirements and the protection system safety analysis. Documents include the Validation Safety Analysis, the V&V Validation and Test Report, and the CM Validation Report.

Installation is the process of moving the completed computer system from the developer's site to the operational environment, within the actual reactor protection system. The completion of installation provides the operator with a documented operational computer system. Seven documents are recommended: the Operations Manual, the Installation Configuration Tables, Training Manuals, Maintenance Manuals, an Installation Safety Analysis, a V&V Installation Analysis and Test Report, and a CM Installation Report.

The operations and maintenance activities involve the actual use of the computer system in the operating reactor, and the making of any required changes to it. Changes may be required due to errors in the system that were not found during the development process, changes to hardware, or requirements for additional functionality. Safety analyses, V&V analyses, and CM activities are all recommended as part of the maintenance process.

Three general methods exist that may be used to achieve software fault tolerance: n-version programming, recovery block, and exception handling. Each of these attempts to achieve fault tolerance by using more than one algorithm or program module to perform a calculation, with some means of selecting the preferred result. In n-version programming, three or more program modules that implement the same function are executed in parallel, and voting is used to select the correct result. In recovery block, two or more modules are executed in series, with an acceptance algorithm used after each module is executed to decide if the result should be accepted or the next module executed. In exception handling, a single module is executed, with corrections made when exceptions are detected. Serious questions exist as to the applicability of the n-version programming and recovery block techniques to reactor protection systems, because of the assumptions underlying the techniques, the possibility of common-mode failures in the voting or decision programs, and the cost and time of implementing them.

One means of assessing system reliability or safety is to create a mathematical model of the system and analyze the properties of that model. This can be very effective, provided that the model captures all the relevant factors of the reality. Reliability models have been used for many years for electronic and mechanical systems. The use of reliability models for software is fairly new, and their effectiveness has not yet been fully demonstrated. Fault tree models, event tree models, failure modes and effects analysis, Markov models, and Petri net models all have possibilities. Of particular interest are reliability growth models, since software bugs tend to be corrected as they are found. Reliability growth models can be very useful in understanding the growth of reliability through a testing activity, but cannot be used alone to justify software for use in a safety-related application, since such applications require a much higher level of reliability than can be convincingly demonstrated during a test-correct-test activity.
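As a rough illustration of the n-version idea, consider the following minimal C sketch (it is not drawn from the report; the routine names, the trip threshold, and the deliberately different comparison styles are hypothetical). Three independently written routines compute the same trip decision, and a two-out-of-three majority vote selects the output:

    /* Minimal sketch of 2-out-of-3 voting over three independently
     * implemented versions of the same trip computation. In a real
     * system the versions would be written by separate teams from a
     * common specification. */
    #include <stdio.h>

    static int version_a(double p) { return p > 2250.0; }
    static int version_b(double p) { return p >= 2250.1; }
    static int version_c(double p) { return (p - 2250.0) > 0.0; }

    /* Majority voter: trip if at least two versions agree. Note that
     * the voter itself is a single point where a common-mode failure
     * could defeat the redundancy, as cautioned above. */
    static int vote(int a, int b, int c) { return (a + b + c) >= 2; }

    int main(void) {
        double pressure = 2280.0; /* example sensor reading, psi */
        int trip = vote(version_a(pressure), version_b(pressure),
                        version_c(pressure));
        printf("trip = %d\n", trip);
        return 0;
    }

A recovery block would instead run version_a alone and fall back to version_b only if an acceptance test on the result failed; the structural difference is serial execution with a test, rather than parallel execution with a vote.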


1. INTRODUCTION

1.1. Purpose

Reliability and safety are related, but not identical, concepts. Reliability can be thought of as the probability that a system fails in any way whatever, while safety is concerned with the consequences of failure. Both are important in reactor protection systems. When a protection system is controlled by a computer, the impact of the computer system on reliability and safety must be considered in the reactor design. Because software is an integral part of a computer system, software reliability and software safety become a matter of concern to the organizations that develop software for protection systems and to the government agencies that regulate the developers. This report is oriented toward the assessment process. The viewpoint is that of a person who is assessing the reliability and safety of a computer software system that is intended to be used in a reactor protection system.

The report is specifically directed toward enhancing the reliability and safety of computer-controlled reactor protection systems. Almost anything can affect safety, so it is difficult to bound the contents of the report. Consequently, material is included that may seem tangential to the topic. In these cases the focus is on reliability and safety; other aspects of such material are summarized or ignored. More complete discussions of these secondary issues may be found in the references. This report is one of a series of reports prepared by the Computer Safety and Reliability Group, Fission Energy and System Safety Program, Lawrence Livermore National Laboratory. Aspects of software reliability and safety engineering that are covered in the other reports are treated briefly in this report, if at all. The reader is referred to the following additional reports:

1. Robert Barter and Lin Zucconi, Verification and Validation Techniques and Auditing Criteria for Critical System-Control Software, Lawrence Livermore National Laboratory, Livermore, CA (February 1993).

2. George G. Preckshot, Real-Time Systems Complexity and Scalability, Lawrence Livermore National Laboratory, Livermore, CA (August 1992).

3. George G. Preckshot and Robert H. Wyman, Communications Systems in Nuclear Power Plants, Lawrence Livermore National Laboratory, Livermore, CA (August 1992).

4. George G. Preckshot, Real-Time Performance, Lawrence Livermore National Laboratory, Livermore, CA (November 1992).

5. Debra Sparkman, Techniques, Processes, and Measures for Software Safety and Reliability, Lawrence Livermore National Laboratory, Livermore, CA (April 1992).

6. Lloyd G. Williams, Formal Methods in the Development of Safety Critical Software Systems, SERM-014-91, Software Engineering Research, Boulder, CO (April 1992).

7. Lloyd G. Williams, Assessment of Formal Specifications for Safety-Critical Systems, Software Engineering Research, Boulder, CO (February 1993).

8. Lloyd G. Williams, Considerations for the Use of Formal Methods in Software-Based Safety Systems, Software Engineering Research, Boulder, CO (February 1993).

9. Lin Zucconi and Booker Thomas, Testing Existing Software for Safety-Related Applications, Lawrence Livermore National Laboratory, Livermore, CA (January 1993).

1.2. Scope

Software is only one portion of a computer system. The other portions are the computer hardware and the instrumentation (sensors and actuators) to which the computer is connected. The combination of software, hardware, and instrumentation is frequently referred to as the Instrumentation and Control (I&C) System. Nuclear reactors have at least two I&C systems: one controls the reactor operation, and the other controls the reactor protection. The latter, termed the Protection Computer System, is the subject of this report.

This report assumes that the computer system as a whole, as well as the hardware and instrumentation subsystems, will be subject to careful development, analysis, and assessment in a manner similar to that given here for the software. That is, it is assumed that there will be appropriate plans, requirements and design specifications, procurement and installation, and testing and analysis for the complete computer system, as well as for the hardware, software, and instrumentation subsystems. The complete computer system and the hardware and instrumentation subsystems are discussed here only as they relate to the software subsystem.

1.3. Report Organization

Section 2 contains background on several topics relating to software reliability and software safety. Terms are defined, life cycle models are discussed briefly, and two classification schemes are presented.

Section 3 provides detail on the many life cycle activities that can be done to improve reliability and safety. Development activities are divided into eight sets of activities: planning, requirements specification, design specification, software implementation, integration with hardware and instrumentation, validation, installation and operations, and maintenance. Each set of activities includes a number of tasks that can be undertaken to enhance reliability and safety. Because the report is oriented toward assessment, the tasks are discussed in terms of the documents they produce and the actions necessary to create the document contents.

Section 4 discusses specific motivations, recommendations, guidelines, and assessment questions. The motivation sections describe particular concerns of the assessor when examining the safety of software in a reactor protection system. Recommendations consist of actions the developer should or should not do in order to address such concerns. Guidelines consist of suggestions that are considered good engineering practice when developing software. Finally, the assessment sections consist of lists of questions that the assessor may use to guide the assessment of a particular aspect of the software system.

From the viewpoint of the assessor, software development consists of the organization that does the development, the process used in the development, and the products of that development. Each is subject to analysis, assessment, and judgment. This report discusses all three aspects in various places within the framework of the life cycle. Process and product are the primary emphasis.

Following the main body of the report, the appendix provides information on software fault tolerance techniques and software reliability models. A bibliography of information relating to software reliability and safety is also included.

2. TERMINOLOGY

This section includes discussions of the basic terminology used in the remainder of the report. The section begins with a description of the terms used to describe systems. Section 2.2 provides careful definitions of the basic terminology for reliability and safety. Section 2.3 contains brief descriptions of several of the life cycle models commonly used in software development, and defines the various activities that must be carried out during any software development project. Section 2.4 describes various classification schemes for failures and faults, and provides the terms used in these schemes. Finally, Section 2.5 discusses the terms used to describe software qualities that are used in following sections.

2.1. Systems Terminology

The word system is used in many different ways in computer science. The basic definition, given in IEEE Standard 610.12, is "a collection of components organized to accomplish a specific function or set of functions." In the context of a nuclear reactor, the word could mean, depending on context, the society using the reactor, the entire reactor itself, the portion devoted to protection, the computer hardware and software responsible for protection, or just the software.

In this report the term system, without modifiers, will consistently refer to the complete application with which the computer is directly concerned. Thus a system should generally be understood as a reactor protection system. When portions of the protection system are meant, and the meaning isn't clear from context, a modifier will be used. Reference could be made to the computer system (a portion of the protection system), the software system (in the computer system), the hardware system (in the computer system), and so forth. In some cases, the term application system is used to emphasize that the entire reactor protection system is meant.

A computer system is itself composed of subsystems. These include the computer hardware, the computer software, operators who are using the computer system, and the instruments to which the computer is connected. The definition of instrument is taken from ANSI/ISA Standard S5.1: "a device used directly or indirectly to measure and/or control a variable. The term includes primary elements, final control elements, computing devices and electrical devices such as annunciators, switches, and pushbuttons. The term does not apply to parts that are internal components of an instrument."

Since this report is concerned with computer systems in general, and software systems in particular, instruments are restricted to those that interact with the computer system. There are two types: sensors and actuators. Sensors provide information to the software on the state of the reactor, and actuators provide commands to the rest of the reactor protection system from the software.

2.2. Software Reliability and Safety Terminology

2.2.1. Faults, Errors, and Failures

The words fault, error, and failure have a plethora of definitions in the literature. This report uses the following definitions, specialized to computer systems (Laprie 1985; Randell 1978; Siewiorek 1982).

A fault is a deviation of the behavior of a computer system from the authoritative specification of its behavior. A hardware fault is a physical change in hardware that causes the computer system to change its behavior in an undesirable way. A software fault is a mistake (also called a bug) in the code. A user fault consists of a mistake by a person in carrying out some procedure. An environmental fault is a deviation from expected behavior of the world outside the computer system; electric power interruption is an example. The classification of faults is discussed further in Subsection 2.4.1.

An error is an incorrect state of hardware, software, or data resulting from a fault. An error is, therefore, that part of the computer system state that is liable to lead to failure. Upon occurrence, a fault creates a latent error, which becomes effective when it is activated, leading to a failure. If never activated, the latent error never becomes effective and no failure occurs.

A failure is the external manifestation of an error. That is, a failure is the external effect of the error, as seen by a (human or physical device) user, or by another program.

Some examples may clarify the differences among the three terms. A fault may occur in a circuit (a wire breaks), causing a bit in memory always to be a 1 (an error, since memory is part of the state), resulting in a failed calculation.

A programmer's mistake is a fault; the consequence is a latent error in the written software (erroneous instruction). Upon activation of the module where the error resides, the error becomes effective. If this effective error causes a divide by zero, a failure occurs and the program aborts.

A maintenance or operating manual writer's mistake is a fault; the consequence is an error in the corresponding manual, which will remain latent as long as the directives are not acted upon.
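The programmer's-mistake example can be made concrete in code. The following hypothetical C sketch is not from the report: the fault is a wrong guard condition, the zero divisor it allows is the latent error, and the error becomes effective and produces a visible failure only when that code path is finally executed.

    /* Hypothetical illustration of fault -> error -> failure.
     * The FAULT: the programmer wrote "count >= 0" instead of
     * "count > 0", so a zero divisor can be stored. */
    #include <stdio.h>

    static double average(const double *samples, int count) {
        int divisor = 0;
        if (count >= 0) {    /* FAULT: should be count > 0 */
            divisor = count; /* divisor == 0 is the latent ERROR */
        }
        double sum = 0.0;
        for (int i = 0; i < count; i++) sum += samples[i];
        /* The error becomes effective only when this division runs
         * with divisor == 0; the NaN result is the FAILURE seen by
         * the caller. */
        return sum / divisor;
    }

    int main(void) {
        double data[] = {1.0, 2.0, 3.0};
        printf("%f\n", average(data, 3)); /* 2.0; error stays latent */
        printf("%f\n", average(data, 0)); /* failure becomes visible */
        return 0;
    }

Run as written, the first call returns 2.0 and the fault does no harm; the second call activates the latent error and the failure becomes visible outside the function.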

The view summarized here enables fault pathology to be made precise. The creation and action mechanisms of faults, errors, and failures may be summarized as follows:

1. A fault creates one or more latent errors in the computer system component where it occurs. Physical faults can directly affect only the physical layer components, whereas other types of faults may affect any component.

2. There is always a time delay between the occurrence of a fault and the occurrence of the resulting latent error(s). This may be measured in nanoseconds or years, depending on the situation. Some faults may not cause errors at all; for example, a bug in a portion of a program that is never executed. It is convenient to consider this to be an extreme case in which an infinite amount of time elapses between fault and latent error.

3. The properties governing errors may be stated as follows:

   a. A latent error becomes effective once it is activated.

   b. An error may cycle between its latent and effective states.

   c. An effective error may, and in general does, propagate from one component to another. By propagating, an error creates other (new) errors.

   From these properties it may be deduced that an effective error within a component may originate from:

   - An effective error propagating within the same component or from another component.

   - Activation of a latent error within the same component.

4. A component failure occurs when an error affects the service delivered (as a response to requests) by the component. There is always a time delay between the occurrence of the error and the occurrence of the resulting failure. This may vary from nanoseconds to infinity (if the failure never actually occurs).

5. These properties apply to any component of the computer system. In a hierarchical system, failures at one level can usefully be thought of as faults by the next higher level.

Most reliability, availability, and safety analysis and modeling assume that each fault causes at most a single failure; that is, failures are statistically independent. This is not always true. A common-mode failure occurs when multiple components of a computer system fail due to a single fault. If common-mode failures do occur, an analysis that assumes that they do not will be excessively optimistic; a brief numeric illustration follows the list below. There are a number of reasons for common-mode failures (Dhillon 1983):

- Environmental causes, such as dirt, temperature, moisture, and vibrations.

- Equipment failure that results from an unexpected external event, such as fire, flood, earthquake, or tornadoes.

- Design deficiencies, where some failures were not anticipated during design. An example is multiple telephone circuits routed through a single equipment box. Software design errors, where identical software is being run on multiple computers, are of particular concern in this report.

- Operational errors, due to factors such as improper maintenance procedures, carelessness, or improper calibration of equipment.

- Multiple items purchased from the same vendor, where all of the items have the same manufacturing defect.

- Common power supply used for redundant units.

- Functional deficiencies, such as misunderstanding of process variable behavior, inadequately designed protective actions, or inappropriate instrumentation.
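As a numeric sketch of why the independence assumption matters (the figures are illustrative, not taken from the report): suppose two redundant channels each fail with probability $p = 10^{-3}$, and a single shared fault, such as a common power supply or an identical software bug in both channels, disables both with probability $q = 10^{-4}$. Then

\[
P(\text{both fail} \mid \text{independence}) = p^2 = 10^{-6}, \qquad
P(\text{both fail}) \approx p^2 + q \approx 10^{-4},
\]

so the common-mode contribution dominates, and the independence-based estimate is optimistic by roughly a factor of one hundred.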


2.2.2. Reliability and Safety Measures

Reliability and safety measurements are inherently statistical, so the fundamental quantities are defined statistically. The four basic terms are reliability, availability, maintainability, and safety. These and other related terms are defined in the following text. Note that the final three definitions are qualitative, not quantitative (Siewiorek 1982; Smith 1972). Most of these definitions apply to arbitrary systems. The exception is safety; since this concept is concerned with the consequences of failure, rather than the simple fact of failure, the definition applies only to a system that can have major impacts on people or equipment. More specifically, safety applies to reactors, not to components of a reactor.

The reliability, R(t), of a system is the conditional probability that the system has survived the interval [0, t], given that it was operating at time 0. Reliability is often given in terms of the failure rate (also referred to as the hazard rate), λ(t), or the mean time to failure, mttf. If the failure rate is constant, mttf = 1/λ (closed forms for this constant-rate case are collected just after these definitions). Reliability is a measure of the success with which the system conforms to some authoritative specification of its behavior, and cannot be measured without such a specification.

The availability, A(t), of a system is the probability that the system is operational at the instant of time t. For nonrepairable systems, availability and reliability are equal. For repairable systems, they are not. As a general rule, 0 ≤ R(t) ≤ A(t) ≤ 1.

The maintainability, M(t), of a system is the conditional probability that the system will be restored to operational effectiveness by time t, given that it was not functioning at time 0. Maintainability is often given in terms of the repair rate, μ(t), or the mean time to repair, mttr. If the repair rate is constant, mttr = 1/μ.

The safety, S(t), of a system is the conditional probability that the system has survived the interval [0, t] without an accident, given that it was operating without catastrophic failure at time 0.

The dependability of a system is a measure of its ability to commence and complete a mission without failure. It is therefore a function of both reliability and maintainability. It can be thought of as the quality of the system that permits the user to rely on it for service.

The capability of a system is a measure of its ability to satisfy the user's requirements.

System effectiveness is the product of capability, availability, and dependability. System cost effectiveness is the quotient of system effectiveness and cost.
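For the constant-rate case referred to in the definitions above, the standard closed forms (well-known results; the report itself states only mttf = 1/λ and mttr = 1/μ) are

\[
R(t) = e^{-\lambda t}, \qquad \mathit{mttf} = \int_0^{\infty} R(t)\,dt = \frac{1}{\lambda},
\]

and, for a repairable system with constant failure rate λ and constant repair rate μ, the steady-state availability is

\[
A(\infty) = \frac{\mu}{\lambda + \mu} = \frac{\mathit{mttf}}{\mathit{mttf} + \mathit{mttr}}.
\]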

2.2.3. Safety Terminology

Safety engineering has special terminology of its own. The following definitions, based on those developed by the IEEE Draft Standard 1228, are used in this report. They are reasonably standard definitions, but specialized to computer software in a few places.

An accident is an unplanned event or series of events that results in death, injury, illness, environmental damage, or damage to or loss of equipment or property. (The word mishap is sometimes used to mean an accident, financial loss, or public relations loss.)

A system hazard is an application system condition that is a prerequisite to an accident. That is, the system states can be divided into two sets. No state in the first set (of nonhazardous states) can directly cause an accident, while accidents may result from any state in the second set (of hazardous states). Note that a system can be in a hazardous state without an accident occurring; it is the potential for causing an accident that creates the hazard, not necessarily the actuality.

The term risk is used to designate a measure that combines the likelihood that a system hazard will occur, the likelihood that the hazard will cause an accident, and the severity of the worst plausible accident. The simplest measure is simply to multiply the probability that a hazard occurs, the probability that the hazard will cause an accident (given that the hazard occurs), and the worst-case severity of the accident; this product is written out as a formula just after these definitions.

Safety-critical software is software whose inadvertent response to stimuli, failure to respond when required, response out of sequence, or response in unplanned combination with others can result in an accident. This includes software whose operation or failure to operate can lead to a hazardous state, software intended to recover from hazardous states, and software intended to mitigate the severity of, or recover from, an accident.

The term safety is used to mean the extent to which a system is free from system hazard. This is a less precise definition than that given in Section 2.2.2, which is generally preferred in this report.
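Written out in symbols (the notation is mine; the report states the rule in prose above), the simple risk measure is

\[
\text{risk} = P(\text{hazard}) \times P(\text{accident} \mid \text{hazard}) \times (\text{worst-case severity}).
\]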

It is also useful to consider the word critical when used to describe systems. A critical system is a system whose failure may have very unpleasant consequences (mishaps). The results of failure may affect the developers of the system, its direct users, their customers, or the general public. The consequences may involve loss of life or property, financial loss, legal liability (such as jail), regulatory threats, or even the loss of good will (if that is extremely important). The term safety critical refers to a system whose failure could cause an accident.

A good brief discussion of accidents is found in Leveson 1991:

   Despite the usual oversimplification of the causes of particular accidents (human error is often the identified culprit despite the all-encompassing nature and relative uselessness of such a categorization), accidents are caused almost without exception by multiple factors, and the relative contribution of each is usually not clear. An accident may be thought of as a set of events combining together in random fashion or, alternatively, as a dynamic mechanism that begins with the activation of a hazard and flows through the system as a series of sequential and concurrent events in a logical sequence until the system is out of control and a loss is produced (the domino theory). Either way, major incidents often have more than one single cause, and it is usually difficult to place blame on any one event or component of the system. The high frequency of complex, multifactorial accidents may arise from the fact that the simpler potentials have been anticipated and handled. But the very complexity of events leading to an accident implies that there may be many opportunities to intervene or interrupt the sequence.

   A second characteristic of accidents is that they often involve problems in subsystem interfaces. It appears to be easier to deal with failures of components than failures in the interfaces between components. This should not be a surprise to software engineers; consider the large number of operational software faults that can be traced back to requirements problems. The software requirements are the specific representation of the interface between the software and the processes or devices being controlled.

   A third important characteristic claimed for accidents is that they are intimately intertwined with complexity and coupling. Perrow has argued that accidents are "normal" in complex and tightly coupled systems. Unless great care is taken, the addition of computers to control these systems is likely to increase both complexity and coupling, which will increase the potential for accidents.

2.3. Life Cycle Models

Many different software life cycles have been proposed. These have different motivations, strengths, and weaknesses. The life cycle models generally require the same types of tasks to be carried out; they differ in the ordering of these tasks in time. No particular life cycle is assumed here. There is an assumption that the activities that occur during the developer's life cycle yield the products indicated in Figure 2-1. Each of the life cycle activities produces one or more products, mostly documents, that can be assessed. The development process itself is subject to assessment.

The ultimate result of software development, as considered in this report, is a suite of computer programs that run on computers and control the reactor protection system. These programs will have characteristics deemed desirable by the developer or customer, such as reliability, performance, usability, and functionality. This report is only concerned with reliability and safety; however, that concern does spill over into other qualities.

The development model used here suggests one or more audits of the products of each set of life cycle activities. The number of audits depends, among other things, on the specific life cycle model used by the developer. The audit will assess the work done that relates to the set of activities being audited. Many reliability, performance, and safety problems can be resolved only by careful design of the software product, so they must be addressed early in the life cycle, no matter which life cycle is used. Any errors or oversights can require difficult and expensive retrofits, so they are best found as early as possible. Consequently, an incremental audit process is believed to be more cost effective than a single audit at the end of the development process. In this way, problems can be detected early in the life cycle and corrected before large amounts of resources have been wasted.

Three of the many life cycle models are described briefly in subsections 2.3.1 through 2.3.3. No particular life cycle model is advocated. Instead, a model should be chosen to fit the style of the development organization and the nature of the problem being solved.

2.3.1. Waterfall Model

The classic waterfall model of software development assumes that each phase of the life cycle can be completed before the next phase is begun (Pressman 1987). This is illustrated in Figure 2-2. The actual phases of the waterfall model differ among the various authors who discuss the model; the figure shows phases appropriate to reactor protection systems. Note that the model permits the developer to return to previous phases. However, this is considered to be an exceptional condition to the normal forward flow, included to permit errors in previous stages to be corrected. For example, if a requirements error is discovered during the implementation phase, the developer is expected to halt work, return to the requirements phase, fix the problem, change the design accordingly, and then restart the implementation from the revised design. In practice, one stops only the implementation affected by the newly discovered requirement.

The waterfall model has been severely criticized as not being realistic to many software development situations, and this criticism is frequently justified. It remains an excellent model for those situations where the requirements are known and stable before development begins, and where little change to requirements is anticipated.

2.3.2. Phased Implementation Model

... external requirements change slowly. Operating systems and language compilers are examples.

2.3.3. Spiral Model

The spiral model was developed at TRW (Boehm 1988) in an attempt to solve some of the perceived difficulties with earlier models. This model assumes that software development can be modeled as a sequence of activities, as shown in Figure 2-3. Each time around the spiral (phase), the product is developed to a more complete degree. Four broad steps are required:

1. Determine the objectives for the phase. Consider alternatives to meeting the objectives.

2. Evaluate the alternatives. Identify risks to completing the phase, and perform a risk analysis. Make a decision to proceed or stop.

3. Develop the product for the particular phase.

4. Plan for the next phase.

The products for each phase may match those of the previous models. In such circumstances, the first loop around the spiral results in a concept of operations; the next, a requirements specification; the next, a design; and so forth. Alternatively, each loop may contain a complete development cycle for one phase of the product; here, the spiral model looks somewhat like the phased implementation model. Other possibilities exist.

The spiral model is particularly appropriate when considerable financial, schedule, or technical risk is involved in the product development. This is because an explicit risk analysis is carried out as part of each phase, with an explicit decision to continue or stop.

2.4. Fault and Failure ClassificationSchemesFaults and failures can be classified in several differentways. Those that are considered useful in safetyrelated applications are described briefly here. Faultsare classified by persistence and by the source of thefault. There is some interaction between these, in thesense that not all persistence classes may occur for allsources. Table 2-1 provides the interrelationship.

2.3.2. Phased Implementation ModelThis model assumes that the development will takeplace as a sequence of versions, with a release aftereach version is completed. Each version has its ownlife cycle model. If new requirements are generatedduring the development of a version, they willgenerally be delayed until the next version, so awaterfall model may be appropriate to each version.(Marketing pressures may modify such delays.)

Failures are classified by mode, scope, and the effecton safety. These classification schemes consider theeffect of a failure, both on the environment withinwhich the computer system operates, and on thecomponents of the system.

This model is appropriate to commercial products thatare evolving over long periods of time, or for which

7

Section 2. Terminology

[Figure 2-1 (two pages). Documents Produced During Each Life Cycle Stage: for each set of life cycle activities (planning, requirements, design, implementation, integration, validation, installation, and operation & maintenance), the figure shows the software developer products (plans, specifications, code listings, system build documents, installation configuration tables, and manuals), the V&V analysis and test reports, the CM reports, the safety analyses, and the corresponding software audit activities (audits and conformance reviews).]

[Figure 2-2. Waterfall Life Cycle Model: pre-development, requirements specification, software design, software implementation, integration, validation, installation, and operation and maintenance.]

[Figure 2-3. Spiral Life Cycle Model (Boehm 1988): repeated loops through four quadrants (determine objectives, alternatives, and constraints; evaluate alternatives and identify and resolve risks; develop and verify the next-level product; plan the next phases), with prototypes, simulations, models, and benchmarks supporting each loop and cumulative cost growing with progress through the steps.]

Table 2-1. Persistence Classes and Fault Sources

Fault source          Design   Operational   Transient
Hardware component      X          X             X
Software component      X
Input data              X          X
Permanent state         X          X
Temporary state         X          X
Topological             X
Operator                X          X             X
User                    X          X             X
Environmental           X          X             X
Unknown                                          X
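
Table 2-1 is small enough to encode directly in software, which can be useful when screening fault reports for implausible classifications (for example, a "transient software fault" contradicts the table). The following Python sketch is our own illustration, not something the report prescribes; it simply restates the table as data:

from enum import Enum

class Persistence(Enum):
    DESIGN = "design"
    OPERATIONAL = "operational"
    TRANSIENT = "transient"

class Source(Enum):
    HARDWARE = "hardware component"
    SOFTWARE = "software component"
    INPUT_DATA = "input data"
    PERMANENT_STATE = "permanent state"
    TEMPORARY_STATE = "temporary state"
    TOPOLOGICAL = "topological"
    OPERATOR = "operator"
    USER = "user"
    ENVIRONMENTAL = "environmental"
    UNKNOWN = "unknown"

# Allowed persistence classes for each fault source, per Table 2-1.
ALLOWED = {
    Source.HARDWARE:        {Persistence.DESIGN, Persistence.OPERATIONAL, Persistence.TRANSIENT},
    Source.SOFTWARE:        {Persistence.DESIGN},
    Source.INPUT_DATA:      {Persistence.DESIGN, Persistence.OPERATIONAL},
    Source.PERMANENT_STATE: {Persistence.DESIGN, Persistence.OPERATIONAL},
    Source.TEMPORARY_STATE: {Persistence.DESIGN, Persistence.OPERATIONAL},
    Source.TOPOLOGICAL:     {Persistence.DESIGN},
    Source.OPERATOR:        {Persistence.DESIGN, Persistence.OPERATIONAL, Persistence.TRANSIENT},
    Source.USER:            {Persistence.DESIGN, Persistence.OPERATIONAL, Persistence.TRANSIENT},
    Source.ENVIRONMENTAL:   {Persistence.DESIGN, Persistence.OPERATIONAL, Persistence.TRANSIENT},
    Source.UNKNOWN:         {Persistence.TRANSIENT},
}

def is_plausible(source: Source, persistence: Persistence) -> bool:
    """Return True if Table 2-1 permits this source/persistence pairing."""
    return persistence in ALLOWED[source]

# A software fault reported as transient is suspect and should prompt
# re-examination of the fault report.
assert not is_plausible(Source.SOFTWARE, Persistence.TRANSIENT)
assert is_plausible(Source.UNKNOWN, Persistence.TRANSIENT)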

2.4.1. Fault Classifications

Faults and failures can be classified by several more-or-less orthogonal measures. This is important, because the classification may affect the depth and method of analysis and problem resolution, as well as the preferred modeling technique.

Faults can be classified by the persistence and by the source of the fault, as described in the two subsections below. Terms defined in each subsection are used in the other.

2.4.1.1. Fault Persistence

Any fault falls into one of the following three classes (Kopetz 1985):

A design fault is a fault that can be corrected by redesign. Most software and topological faults fall into this class, but relatively few hardware faults do. Design faults are sometimes called removable faults, and are generally modeled by reliability growth models (see Appendix A.3). One design fault can cause many errors and failures before it is diagnosed and corrected. Design faults are usually quite expensive to correct if they are not discovered until the product is in operation.

An operational fault is a fault in which some portion of the computer system breaks and must be repaired in order to return the system to a state that meets the design specifications. Examples include electronic and mechanical faults, database corruption, and some operator faults. Operational faults are sometimes called non-removable faults. When calculating fault rates for operational faults, it is generally assumed that the entity that has failed is in the steady-state portion of its life, so operational fault rates are constant. As with design faults, an operational fault may cause many errors before being identified and repaired.

A transient fault is a fault that does cause a computer system failure, but is no longer present when the system is restarted. Frequently the basic cause of a transient fault cannot be determined. Redesign or repair has no effect in this case, although redesign can affect the frequency of transient faults. Examples include power supply noise and operating system timing errors. While an underlying problem may actually exist, no action is taken to correct it (or the fault would fall into one of the other classes). In some computer systems, 50-80% of all faults are transient. The frequency of operating system faults, for example, is typically dependent on system load and composition.

The class of transient faults actually includes two different types of event; they are grouped together here since it is generally impossible to distinguish between them. Some events are truly transient; a classic (though speculative) example is a cosmic ray that flips a single memory bit. The other type is an event that really is a design or operational fault, but this is not known when it occurs; that is, it looks like the first type of transient event. If the cause is never discovered, no real harm is done in placing it in this class. However, if the cause is eventually determined, the event should be classified properly; this may well require recalculation of reliability measures.

A computer system is constructed according to some specification. If the system fails but still meets the specification, then the specification was wrong; this is a design fault. If, however, the system ceases to meet the specification and fails, then the underlying fault is an operational fault; a broken wire is an example. If the specification is correct, but the system fails momentarily and then recovers on its own, the fault is transient.

Many electronic systems, and some mechanical systems, have a three-stage life cycle with respect to fault persistence. When a device is first constructed, it will have a fairly high fault rate due to undetected design faults and burn-in operational faults. This fault rate decreases for a period of time, after which the device enters its normal life period. During this (hopefully quite long) period, the failure rate is approximately constant, and is due primarily to operational and transient faults, with perhaps a few remaining design faults. Eventually the device begins to wear out, and enters the terminal stage of its life. Here the fault rate increases rapidly as the probability of an operational fault goes up at an increasing rate. It should be noted that in many cases the end of a product's useful life is defined by this increase in the fault rate.

The behavior described in the last paragraph results in a failure rate curve termed the bathtub curve, which was originally designed to model electronic failure rates. There is a somewhat analogous situation for software. When a software product is first released, there may be many failures in the field for some period of time. As the underlying faults are corrected and new releases are sent to the customers, the failure rate should decrease until a more-or-less steady state is reached. Over time, the maintenance and enhancement process may perturb the software structure sufficiently that new faults are introduced faster than old ones are removed. The failure rate may then go up, and a complete redesign is in order.

While this behavior looks similar to that described for electronic systems, the causal factors are quite different. One should be very careful when attempting to extrapolate from one to the other.

2.4.1.2. Source of Faults in Computer Systems

Fault sources can be classified into a number of categories; ten are given here. For each, the source is described briefly and the possible persistence types are discussed.

A hardware fault is a fault in a hardware component, and can be of any of the three persistence types. Application systems rarely encounter hardware design faults. Transient hardware faults are very frequent in some systems.

A software fault is a bug in a program. In theory, all such faults are design faults. Dhillon (1987) classifies software faults into the following eight categories:

- Logic faults
- Interface faults
- Data definition faults
- Database faults
- Input/output faults
- Computational faults
- Data handling faults
- Miscellaneous faults

An input data fault is a mistake in the input. It could be a design fault (connecting a sensor to the wrong device is an example) or an operational fault (if a user supplies the wrong data).

A permanent state fault is a fault in state data that is recorded on non-volatile storage media (such as disk). Both design and operational faults are possible. The use of a data structure definition that does not accurately reflect the relationships among the data items is an example of a design fault. The failure of a program might cause an erroneous value to be stored in a file, causing an operational fault in the file.

A temporary state fault is a fault in state data that is recorded on volatile media (such as main memory). Both design and operational faults are possible. The primary reason to separate this from permanent state faults is to allow for the possibility of different failure rates.

A topological fault is a fault caused by a mistake in the computer system architecture, not in the component parts. All such faults are design faults. Notice that the failure of a cable is considered a hardware operational fault, not a topological fault.

An operator fault is a mistake by the operator. Any of the three types is possible. A design fault occurs if the instructions provided to the operator are incorrect; this is sometimes called a procedure fault. An operational fault occurs if the instructions are correct, but the operator misunderstands and does not follow them. A transient fault occurs if the operator is attempting to follow the instructions, but makes an unintended mistake; hitting the wrong key on a keyboard is an example. (One goal of display screen design is to reduce the probability of transient operator errors.)

A user fault differs from an operator fault only because of the different type of person involved; operators and users can be expected to have different fault rates.

An environmental fault is a fault that occurs outside the boundary of the computer system but that affects the system. Any of the three types is possible. Failure to provide an uninterruptible power supply (UPS) would be a design fault, while failure of the UPS would be an operational fault. A voltage spike on a power line is an example of an environmentally induced transient fault.

An unknown fault is any fault whose source class is never identified. Unfortunately, in some computer systems many faults occur whose source cannot be identified. All such faults are transient (more or less by definition), and this category may well include a plurality of system faults. Another problem is that the underlying cause may be identified at a later time (possibly months later), so there is a certain impermanence about this category. It generally happens that some information is available about the source of the fault, but not sufficient information to allow the source to be completely identified; for example, it might only be known that there is a fault in a communication system.

Table 2-1 shows which persistence classes may occur for each of the ten fault sources.
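
The three-stage bathtub behavior described in Section 2.4.1.1 can be made concrete with a small numerical model. The sketch below is an illustration only: the decomposition into a decreasing burn-in hazard, a constant useful-life hazard, and an increasing wear-out hazard is a standard reliability-engineering idealization, and all parameter values are arbitrary assumptions rather than data from this report.

def bathtub_hazard(t, burn_in=(0.5, 0.02), useful=0.01, wear_out=(3.0, 1e-5)):
    """Illustrative overall hazard rate h(t); all parameters are arbitrary.

    burn_in  = (shape < 1, scale): decreasing hazard from burn-in faults
    useful   = constant hazard during normal life
    wear_out = (shape > 1, scale): increasing hazard as the device wears out
    """
    b_shape, b_scale = burn_in
    w_shape, w_scale = wear_out
    h_burn = b_scale * b_shape * t ** (b_shape - 1.0)  # falls as t grows
    h_wear = w_scale * w_shape * t ** (w_shape - 1.0)  # rises as t grows
    return h_burn + useful + h_wear

# The printed rates fall, flatten, and then rise again: the bathtub shape.
for t in (0.1, 1.0, 10.0, 100.0):
    print(f"t = {t:6.1f}   h(t) = {bathtub_hazard(t):.5f}")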

2.4.2. Failure Classifications

Three aspects of classifying failures are given below; there are others. These are particularly relevant to later discussion in this report.

2.4.2.1. Failure Modes

Different failure modes can have different effects on a computer system. The following definitions apply (Smith 1972):

A sudden failure is a failure that could not be anticipated by prior examination; that is, the failure is unexpected.

A gradual failure is a failure that could be anticipated by prior examination; that is, the system goes into a period of degraded operation before the failure actually occurs.

A partial failure is a failure resulting in deviations in characteristics beyond specified limits, but not such as to cause complete lack of the required function.

A complete failure is a failure resulting in deviations in characteristics beyond specified limits such as to cause complete lack of the required function. The limits referred to here are special limits specified for this purpose.

A catastrophic failure is a failure that is both sudden and complete.

A degradation failure is a failure that is both gradual and partial.
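
These six modes are really two independent dimensions, timing (sudden or gradual) and extent (partial or complete), plus two compound names. A minimal sketch of that structure, our own illustration built only from the Smith 1972 definitions above:

from enum import Enum

class Timing(Enum):
    SUDDEN = "sudden"      # could not be anticipated by prior examination
    GRADUAL = "gradual"    # preceded by a period of degraded operation

class Extent(Enum):
    PARTIAL = "partial"    # beyond specified limits, function not fully lost
    COMPLETE = "complete"  # required function completely lost

def compound_mode(timing: Timing, extent: Extent) -> str:
    """Name the compound failure mode implied by the two dimensions."""
    if timing is Timing.SUDDEN and extent is Extent.COMPLETE:
        return "catastrophic"
    if timing is Timing.GRADUAL and extent is Extent.PARTIAL:
        return "degradation"
    return f"{timing.value} {extent.value}"

print(compound_mode(Timing.SUDDEN, Extent.COMPLETE))   # catastrophic
print(compound_mode(Timing.GRADUAL, Extent.PARTIAL))   # degradation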

2.4.2.2. The Scope of Failures

Failures can be assigned to one of three classes, depending on the scope of their effects (Anderson 1983):


A failure is internal if it can be adequately handled by the device or process in which the failure is detected.

A failure is limited if it is not internal, but its effects are limited to that device or process.

A failure is pervasive if it results in failures of other devices or processes.


2.4.2.3. The Effects of Failures on Safety

Finally, it is possible to classify application systems by the effect of failures on safety.

A system is intrinsically safe if the system has no hazardous states.

A system is termed fail safe if a hazardous state may be entered, but the system will prevent an accident from resulting from the hazard. An example would be a facility in a reactor that forces a controlled shutdown in case a hazardous state is entered, so that no radiation escapes.

A system controls accidents if a hazardous state may be entered and an accident may occur, but the system will mitigate the consequences of the accident. An example is the containment shell of a reactor, designed to preclude a radiation release into the environment if an accident did occur.

A system gives warning of hazards if a failure may result in a hazardous state, but the system issues a warning that allows trained personnel to apply procedures outside the system to recover from the hazard or mitigate the accident. For example, a reactor computer protection system might notify the operator that a hazardous state has been entered, permitting the operator to hit the "panic button" and force a shutdown in such a way that the computer system is not involved.

Finally, a system is fail dangerous, or creates an uncontrolled hazard, if system failure can cause an uncontrolled accident.

2.5. Software Qualities

A large number of factors that affect the quality of software have been identified by various theoreticians and practitioners. Many of these are very difficult to quantify. The discussion here is based on IEEE 610.12, Evans 1987, Pressman 1987, and Vincent 1988; the latter two references based their own discussion on McCall 1977. The discussion concentrates on defining those terms that appear important to the design of reactor protection computer systems. Quotations in this section come from the references listed above.

Access Control. The term access control relates to those attributes of the software that provide for control of access to software and data. In a reactor protection system, this refers to the ability of the utility to prevent unauthorized changes to either software or data within the computer system, incorrect input signals being sent to the computer system by intervention of a human agent, incorrect commands from the operator, and any other forms of tampering. Access control should consider both inadvertent and malicious penetration.

Accuracy. Accuracy refers to those attributes of the software that provide the required precision in calculations and outputs. In some situations, this can require a careful error analysis of numerical algorithms.

Auditability. Auditability refers to the ease with which conformance to standards can be checked. The careful development of project plans, adherence to those plans, and proper record keeping can help make audits easier, more thorough, and less intrusive. Sections 3 and 4 discuss this topic in great depth.

Completeness. Completeness properties are those attributes of the software that provide full implementation of the functions required. A software design is complete if all requirements are fulfilled in the design. A software implementation is complete if the code fully implements the design.

Consistency. Consistency is the degree of uniformity, standardization, and freedom from contradictions among the documents or parts of a system or component. Standardized error handling is an example of consistency. Requirements are consistent if they do not require the system to carry out some function and, under the same conditions, to carry out its negation. An inconsistent design might cause the system to send incompatible signals to one or more actuators, causing the protection system to attempt contradictory actions; an example would be starting a pump but not opening the intake valve.

Correctness. Correctness refers to the extent to which a program satisfies its specifications and fulfills the user's mission objectives. This is a broader definition than that given for completeness. It is worth noting that some of the documents referenced at the beginning of this section essentially equate correctness with completeness, while others distinguish between them. IEEE Standard 610.12 gives both forms of definition.

Expandability. Expandability attributes are those attributes of the software that provide for expansion of data storage requirements or computational functions. The word "extendibility" is sometimes used as a synonym.

Generality. Generality is the degree to which a system or component performs a broad range of functions. This is not necessarily a desirable attribute of a reactor protection system if the generality encompasses functionality beyond simply protecting the reactor.

Modularity. Modularity attributes are those attributes of the software that provide a structure of highly independent modules. To achieve modularity, the protection computer system should be divided into discrete hardware and software components in such a way that a change to one component has minimal impact on the remaining modules. Modularity is measured by cohesion and coupling (Yourdon 1979).

Operability. Operability refers to those attributes of the software that determine operation and procedures concerned with the operation of the software. This quality is concerned with the man-machine interface, and measures the ease with which the operators can use the system. This is particularly a concern during off-normal and emergency conditions, when confusion may be high and mistakes may be unfortunate.

Robustness. Robustness refers to the degree to which a system or component can function correctly in the presence of invalid inputs or stressful environmental conditions. This quality is sometimes referred to as "error tolerance" and may be implemented by fault tolerance or design diversity.

Simplicity. Simplicity attributes are those attributes that provide implementation of functions in the most understandable manner. It can be thought of as the absence of complexity. This is one of the more important design qualities for a reactor computer protection system, and is quite difficult to quantify. See Preckshot 1992 for additional information on complexity and scalability.

A particularly important aspect of complexity is the distinction between functional complexity and structural complexity. The former refers to a system that attempts to carry out many disparate functions, and is controlled by limiting the goals of the system. The latter refers to the method of carrying out the functions, and may be controlled by redesigning the system to carry out the same functions in a simpler way.

Software Instrumentation. Instrumentation refers to those attributes of the software that provide for measurement of usage or identification of errors. A well-instrumented system can monitor its own operation, and detect errors in that operation. Software instrumentation can be used to monitor the hardware operation as well as the software's own operation. A hardware device such as a watchdog timer can be used to help monitor the software operation; a sketch of this arrangement appears at the end of this section. If instrumentation is required for a computer system, it may have a considerable effect on the system design, so it must be considered as part of that design.

Testability. Testability refers to the degree to which a system or component facilitates the establishment of test criteria and the performance of tests to determine whether those criteria have been met.

Traceability. Traceability attributes are those attributes of the software that provide a thread from the requirements to the implementation, with respect to the specific development and operational environment.
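As a concrete illustration of the software instrumentation quality above, the following sketch shows the common arrangement in which a periodic task services a hardware watchdog timer; if the software hangs or overruns, the timer expires and the hardware forces a safe state without any cooperation from the failed software. This is a generic illustration only: the pet_watchdog hook, the timing constants, and the control_cycle placeholder are hypothetical, not drawn from this report or from any particular protection system.

import time

WATCHDOG_TIMEOUT_S = 0.5   # hypothetical hardware watchdog timeout
CYCLE_PERIOD_S = 0.1       # hypothetical control-cycle period (well below timeout)

def pet_watchdog():
    """Hypothetical hook: on real hardware this would write the
    restart pattern to a watchdog register."""
    pass

def control_cycle():
    """Hypothetical placeholder for one pass of the protection logic
    (read sensors, evaluate trip conditions, command actuators)."""
    pass

def main_loop():
    # The watchdog is serviced once per cycle. If control_cycle() ever
    # hangs or overruns the watchdog timeout, pet_watchdog() is not
    # reached in time, the timer expires, and the hardware forces a
    # safe state; the software need not detect its own failure.
    while True:
        start = time.monotonic()
        control_cycle()
        pet_watchdog()
        elapsed = time.monotonic() - start
        time.sleep(max(0.0, CYCLE_PERIOD_S - elapsed))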

3. LIFE CYCLE SOFTWARE RELIABILITY AND SAFETY ACTIVITIES

Much has been written about software engineering and how a well-structured development life cycle can help in the production of correct, maintainable software systems. Many standard software engineering activities should be performed for any software project, so they are not discussed in this report. Instead, the report concentrates on the additional activities required for a software project in which safety is a prime concern. Refer to a general text, such as Macro 1990 or Pressman 1987, for general information on software engineering.

Any software development project can be discussed from a number of different viewpoints. Examples include those of the customer, the user, the developer, the project manager, the general manager, and the assessor. The viewpoint that is presumed will have a considerable effect on the topics discussed, and particularly on the emphasis placed on different aspects of those topics. The interest here is in the viewpoint of the assessor. This is a person (or group of people) who evaluates both the development process and the products of that process for assurance that they meet some externally imposed standard. In this report, those standards will relate to the reliability of the software products and the safety of the application in which the software is embedded. The assessor may be a person in the development organization charged with the duty of assuring reliability and safety, a person in an independent auditing organization, or an employee of a regulatory agency. The difference among these assessors should be in the reporting paths, not in the technical activities that are carried out. Consequently, no distinction is made here among the different types of assessor.

Since this report is written from the viewpoint of the assessor, the production of documents is emphasized. The documents provide the evidence that required activities have actually taken place. There is some danger that the software developer will concentrate on the creation of the documents rather than on the creation of safe, reliable software; the assessor must be constantly on guard for this. The software, not the documents, runs the protection system. There is heavy emphasis below on planning: creating and following the plans that are necessary to the development of software where safety is a particular concern.

The documents that an assessor should expect to have available, and their contents, are the subject of this section of the report. The process of assessing these documents is discussed in Section 4.

3.1. Planning Activities

Fundamental to the effective management of any engineering project is the planning that goes into the project. This is especially true where extreme reliability and safety are of concern. While there are general issues of avoiding cost and schedule overruns, the particular concern here is safety. Unless a management plan exists, and is followed, the probability is high that some safety concerns will be overlooked at some point in the project lifetime, that lack of time or money near the end of the development period will cause safety concerns to be ignored, or that testing will be abridged. It should be noted that the time/money/safety tradeoff is a very difficult management issue requiring very wise judgment. No project manager should be allowed to claim safety as an excuse for unconscionable cost or schedule overruns. On the other hand, the project manager should also not be allowed to compromise safety in an effort to meet totally artificial schedule and budget constraints.

Planning a software development project can be a complex process involving a hierarchy of activities; the entire process is beyond the scope of this report. Figure 3-1, reprinted from Evans 1983 (copyright 1983 by Michael Evans, Pamela Piazza, and James Dolkas; reprinted by permission of John Wiley & Sons), gives a hint as to the activities involved. Planning is discussed in detail in Pressman 1987.

[Figure 3-1 (Evans 1983): a hierarchy of software development planning activities, including software design, production, integration, test, and documentation; financial and resource planning; budget assignment; resource availability; and cost account development.]

For a computer-based safety system, a number of documents will result from the planning activity. These are discussed in this section, insofar as safety is an issue. For example, a software management plan will generally involve non-safety aspects of the development project, which go beyond the discussion in Section 3.1.1.

Software project planning cannot take place in isolation from the rest of the reactor development. It is assumed that a number of documents are available to the software project team. At minimum, the following must exist:

Hazards analysis. This identifies hazardous reactor system states, sequences of actions that can cause the reactor to enter a hazardous state, sequences of actions intended to return the reactor from a hazardous state to a nonhazardous state, and actions intended to mitigate the consequences of an accident.

Interfaces between the protection computer system and the rest of the reactor protection system. That is, what signals must be obtained from sensors and what signals must be provided to actuators by the computer system. Interfaces also include display devices intended for man-machine interaction.

High level reactor system design. This identifies those functions that will be performed by the protection system, and includes a specification of those safety-related actions that will be required of the software in order to prevent the reactor from entering a hazardous state, move the reactor from a hazardous state to a