Software Failure Modes and Effects Analysis

IEEE TRANSACTIONS ON RELIABILITY, VOL. R-28, NO. 3, AUGUST 1979 247

Donald J. Reifer, Member IEEE
Software Management Consultants, Torrance

Key Words-Software reliability, Software failure modes and effects analysis, Fault tolerant software, Self checking software

Reader Aids-
Purpose: Widen state of the art
Special math needed: None
Results useful to: Software and reliability engineers

Abstract-This concept paper discusses the possible use of failure modes and effects analysis (FMEA) as a means to produce more reliable software. FMEA is a fault avoidance technique whose objective is to identify hazards in requirements that have the potential to either endanger mission success or significantly impact life-cycle costs. FMEA techniques can be profitably applied during the analysis stage to identify potential hazards in requirements and design. As hazards are identified, software defenses can be developed using fault tolerant or self-checking techniques to reduce the probability of their occurrence once the program is implemented. Critical design features can also be demonstrated a priori analytically using proof of correctness techniques prior to their implementation if warranted by cost and criticality.

0018-9529/79/0800-247 $00.75 © 1979 IEEE

I. INTRODUCTION

Over the last decade, considerable research has been undertaken to improve the reliability of software. Software error, reliability, and complexity models have been formulated and applied experimentally to a variety of projects [1 - 3]. Most of the current work in software reliability is directed toward producing quantitative models that can be used to measure, manage, and predict the level of software perfection, primarily during the test phase. Yet, the software engineering community has agreed that the greatest leverage on error reduction and thereby cost avoidance can be exerted during the requirements and design stages of the software development cycle [4]. The experience of the SAFEGUARD project provides a useful indication of the magnitude of this leverage. According to Lewis [5], more than 62.5 percent of all changes occurring during test and integration resulted from latent requirements and design errors. Data in [9; pp 4-39] are in striking agreement. These test and integration corrections were 36 times more expensive than they would have been if made during requirements formulation. New and inventive techniques based upon principles proven in other discipline areas need to be devised to capitalize on this leverage. Software FMEA is one such technique.

Failure Modes and Effects Analysis (FMEA) is a systematic process aimed at identification and elimination or compensation of failure modes for reliability improvement. It has been used successfully for many years to identify, rank, and compensate for known failure modes of critical functions in space and missile systems where the consequences of failure are often catastrophic (e.g., inadvertent detonation of a nuclear missile near a populated area). Standards [6, 7] and a handbook [8] have been published to direct its orderly application on these military projects. The standards give useful insight into why FMEA is employed. The five major objectives for performing FMEA are:

1. To identify single point failure modes and define their effects.
2. To identify those areas of a design where redundancy should be implemented.
3. To identify compensating features for those single point failure modes where elimination is impractical.
4. To identify redundancy which is not or cannot be tested.
5. To assist in ranking the most serious failure modes and for establishing a critical items list.

Unfortunately, no direction is given in these standards for handling software failure modes. Yet, software failure modes can be a significant hazard. For example, an aircraft autoland system used during bad weather landings may not be safe because of a problem in convergence of its vectoring algorithms when singularities are introduced under extreme operating conditions. Or, a computerized safety system used to monitor a nuclear power-plant reactor-control system may have a logic error which allows a failure to switch-in hardware (needed to dampen an overloaded reactor) to go undetected. As a last example, an error in a spacecraft program used to control reentry angle closely could cause skipout and loss of mission. All of these examples represent present day situations where the cost of failure is high and where an FMEA for software may be warranted.

II. SOFTWARE FMEA CONCEPTS

Several techniques can be used for FMEA. Those that seem applicable to the requirements analysis phase of software development revolve around conducting 1) a hazards analysis to identify failure modes and their consequences and 2) a feasibility study to determine how these failure modes can be eliminated or guarded against.

The hazard analysis is performed in five steps. Each of these steps is explained in subsequent paragraphs.

1) The software requirements specification is carefully examined and mission-essential requirements and failure-critical factors are identified. Mission-essential requirements are those that must be performed in order for the system, of which software is part, to perform its mission successfully. An example of such a requirement taken from a missile application is: schedule attitude control commands to be issued every 50 milliseconds. Without this update, the stability of the missile during critical phases of flight will be jeopardized. Failure-critical factors are software errors that are serious enough to cause the program, when executed, to either abort or degrade before the mission objective is realized. For example: the program enters an infinite loop (due to a logic error) and consequently never exits to perform the remainder of the calculations needed to fulfill the mission. Mission-essential requirements and failure-critical factors are determined using checklists derived by analyzing previous projects and software reliability data [9]. Table I illustrates data reported on three major projects [10]. Because little public data exist in usable form, experienced personnel familiar with the application must be used in this determination.

Table I
MAJOR ERROR CATEGORIES FOR THREE LARGE SOFTWARE PROJECTS*

                                                  Project III  Project III  Project III
                                                  Operating    Application  Simulator
Major Error Category     Project I   Project II   System       Software     Software
Computational            9.0         1.7          2.5          13.5         19.6
Logic                    26.0        34.5         34.6         17.1         20.9
Data I/O                 16.4        8.9          8.6          7.3          9.3
Data Handling            18.2        27.2         21.0         10.9         8.4
Interface                17.0        22.5         7.4          9.8          6.7
Data Definition          0.8         3.0          7.4          7.3          13.8
Data Base                4.1         2.2          4.9          24.7         16.4
Other                    8.5         0.0          13.6         9.4          4.9

Total Problem Reports    2019        405          81           275          225

*From [10; Table 17.6] with copyright permission of the authors/publishers. The body of the table gives the percent of errors in each category for that project.

2) Mission-critical requirements are analyzed to determine if and how they interrelate with each other and what time-dependencies exist over the projected operating time. Analytic and simulation models are used to make the behavior of the functional requirements clear and easy to understand. The analysis is performed to develop a critical items list where failure-critical factors are ranked by seriousness and frequency of occurrence for each mission-essential requirement. Once the list is developed, an attempt is made to generate a 'hazard' function for each mission-essential requirement. For single-event requirements like setting an enabling condition, the hazard function can be expressed in the form of the ratio of the forecasted number of failures to the number of trials of the event. For requirements which operate over some time period, an appropriate software reliability hazard function must be employed. For example, [11] considers nine hazard functions. Next, the interrelationships between mission-essential requirements are examined to determine whether failures in one function affect other functions. The analytic and simulation models devised and N² charts [12] are useful tools in analyzing the functional interactions and interfaces between software requirements. In mathematical terms, secondary effects can be represented as the conditional probability of a failure in one function given that a failure has occurred in another. The probability models (hazard functions) associated with each identified mission-essential requirement must be modified according to the findings of this analysis.

3) Mission-essential requirements are ranked in terms of the probability of experiencing a critical failure during the mission.

4) The process is iterated based upon further refinement of the requirements and detailed FMEAs. Failure modes identified are induced analytically and failure effects are evaluated. Random failures are induced and their effects are noted. As each failure mode is evaluated, the corresponding effect at the software system level is determined. Based upon this approach, probabilities for the occurrence of the system effect can be calculated from the individual hazard functions. Criticality numerics can then be assigned to serve as a basis for corrective action.

5) The process is iterated to account for changes in the requirements and failure data.

The feasibility study takes the ranked list of mission-essential requirements and investigates how each critical failure mode can be either eliminated or its effect reduced. Its objective is to restate the requirements in such a manner that the risk associated with a critical failure's occurring during the mission is reduced to an acceptable level. The feasibility study evaluates the results of the hazard analysis and attempts to eliminate the causes of failure. It investigates whether failure modes induced analytically can be corrected by modifying the requirements. In many instances the analysis must be supplemented with a detailed evaluation of the engineering equations (requirements contain equations in engineering form, while designs contain algorithms). This is necessary because accuracy is typically one of the prerequisites for satisfactory performance in most real-time systems. If the results of both the analytic and equation evaluation indicate that the failure mode has been eliminated, checks should be inserted into the requirements to ensure that the design, when implemented, does not reintroduce the problem (e.g., filter equations implemented to prevent singularities reintroduce them due to scaling considerations).

These checks can be assertions that test the correctness of a critical portion of a program either symbolically [13] or mathematically [14] using proof of correctness (based upon inductive assertion) techniques. Recognized limitations in the proof technology [15, 16] make its use infeasible for all but small, critical segments of the program. Proof techniques are recommended only when the costs are justified by the additional assurance. If the results of the analytic and equation evaluation indicate that the failure mode cannot be eliminated economically, safeguards must be built into the program to reduce the probability that a failure mode will diminish the chances of mission success.

Requiring the use of self-checking or fault-tolerant software under such circumstances can provide the appropriate level of safeguard. Because both self-checking and fault-tolerant software are discussed elsewhere in this Special Issue, only a brief summary of each approach is offered. The criticality of the potential fault must justify the additional cost for either option selected. Self-checking [17] uses redundant software to check the dynamic behavior of the program (during its execution) for proper operation. When improper behavior is detected, the error is isolated and recovery is attempted using classical rollback and reentry operations. Fault-tolerance [18] uses deliberately different software to back up critical modules. Recovery blocks [19] are the usual mechanism by which checks are made of the acceptability of the intermediate stages of program execution and for determining which option should be taken if the checks indicate failure.

III. OPERATIONAL EXPERIENCE

Software FMEA is being investigated as part of a multi-faceted internal research program aimed at improving the reliability of tomorrow's weapon systems. The software FMEA concept was developed and reported in 1978 [20]. Its viability will be assessed when a software requirements specification for an operational program is evaluated using the methodology. An analysis of failure data will indicate whether the failure modes identified were indeed those that caused the system to be unsuccessful.

IV. ACKNOWLEDGEMENT

I acknowledge the encouragement and assistance provided by my colleague, Myron Lipow. While I suggested that it was premature to publish work that had not been validated, he countered with an appeal to place it in the public domain because of its importance.

REFERENCES

[1] M.L. Shooman, H. Ruston, Software Modeling Studies, Summary of Technical Progress, RADC-TR-78-4, 1978 January.
[2] H. Hecht, W.A. Sturm, S. Trattner, "Reliability measurement during software development", Proc. AIAA/NASA/IEEE/ACM Computers in Aerospace Conference, Los Angeles, 1977 Oct 31 - Nov 2.
[3] B.H. Yin, J.W. Winchester, "The establishment and use of measures to evaluate the quality of software designs", Proc. Software Quality and Assurance Workshop, ACM, 1978 Nov, pp 45-52.
[4] Anthony I. Wasserman, L.A. Belady, "Software engineering - the turning point", Computer, 1978 Sep, pp 30-41.
[5] R.O. Lewis, unpublished private correspondence.
[6] MIL-STD-1543 (USAF), Reliability Program Requirements for Space and Missile Systems, 1974 July 15.
[7] SAMSO-STD 77-2, Failure Modes and Effects Analysis For Satellite, Launch Vehicle and Reentry Systems, 1977 November 22.
[8] R.T. Anderson, Reliability Design Handbook, IIT Research Institute, Catalog No. RDH-376, 1976 March.
[9] T.A. Thayer, M. Lipow, E.C. Nelson, Software Reliability - A Study of Large Project Reality, TRW Software Technology Series, Volume 2, North-Holland Publishing Co., New York, 1978.
[10] David K. Lloyd, Myron Lipow, Reliability: Management, Methods, and Mathematics, Second Edition, Published by the Authors, Redondo Beach, California, 1977, p 502.
[11] Alan N. Sukert, "An investigation of software reliability models", Proc. 1977 Reliability and Maintainability Symposium, 1977.
[12] R.J. Lano, The N2 Chart, TRW-SS-77-04, 1977 November.
[13] M.S. Fuju, M.A. Ikezawa, "Use of symbolic execution in verification and validation", Tools for Embedded Computer System Software, NASA Conf. Publ. 2064, 1978 Nov, pp 113-116.
[14] R.L. London, "A view of program verification," in Proc. International Conf Reliable Software, IEEE Catalog No. 75CH0940-7CSR, 1975 Apr, pp 534-545.
[15] K.N. Levitt, et al., "A panel session-formal methods in programming - when will they be practical?" Proc. 1978 National Computer Conf, AFIPS Press, 1978 Jun, pp 665-668.
[16] S.L. Gerhart, Program Verification in the 1980s: Problems, Perspectives and Opportunities, USC/ISI, Report ISI/RR-78-71, 1978 August.
[17] S.S. Yau, R.C. Cheung, D.C. Cochrane, "An approach to error resistant software design," Proc. 2nd International Conf Software Engineering, IEEE Cat. No. 76CH1125-4C, 1976 Oct, pp 429-436.
[18] H. Hecht, "Fault-tolerant software for real-time applications," ACM Computing Surveys, vol 8, 1976 Dec, pp 391-407.
[19] B. Randall, "System structure for software fault tolerance," IEEE Trans. Software Engineering, vol 1, 1975 Jun, pp 220-232.
[20] A. Frimtzis, M. Lipow, D. Reifer, "Software failure modes and effects analysis," Proc. Industry/SAMSO Conf and Workshop on Mission Assurance, Los Angeles, 1978 Apr, pp 154-154.

AUTHOR

Donald J. Reifer; Software Management Consultants; 2922 West 227th Street; Torrance, CA 90505 USA.

Mr. Reifer (S'67, M'75) received a BS in Electrical Engineering from Newark College of Engineering, an MS in Operations Research from the University of Southern California, and the Certificate in Business Management for Technical Personnel from the University of California at Los Angeles. He has over 12 years of management experience in software engineering. Mr. Reifer is President of Software Management Consultants. Previously, he was a Senior Staff Engineer with TRW Digital Avionics Laboratory where he served as Deputy Program Manager on a U.S. AFLC study assessing computer resources for the 1980's. Prior to that, Mr. Reifer managed all software activities in support of the Aerospace Corporation's Space Transportation Systems efforts which involved over $100 million of software products. He is a member of Eta Kappa Nu, Omicron Delta Kappa, Alpha Sigma Mu, the American Institute of Aeronautics and Astronautics, the Association for Computing Machinery, and the Data Processing Management Association. He has over 30 major publications and has received many honors including the Hughes Aircraft Company Fellowship, and membership in Who's Who in the West. Mr. Reifer recently visited the Soviet Union as a member of the U.S. Popov Society delegation, and the Peoples Republic of China as a member of an IEEE scientific exchange delegation.

Manuscript SI79-11 received 1978 December 8; revised 1979 January.