


QUALITY AND RELIABILITY ENGINEERING INTERNATIONAL

Qual. Reliab. Engng. Int. 2005; 21:329–344

Published online 27 January 2005 in Wiley InterScience (www.interscience.wiley.com). DOI: 10.1002/qre.646

Application

Total System Reliability: Integrated Model for Growth and Test Termination

John Donovan1,∗,† and Eamonn Murphy2

1 School of Engineering, Institute of Technology, Ballinode, Sligo, Ireland
2 University of Limerick, Ireland

Reliability demonstration testing is not the most efficient method of assuring product reliability prior to shipment. It is costly, time consuming and has inherent technical and social limitations. The dilemma facing the reliability and quality engineer is whether to continue demonstration testing and risk shipping a product late or ship the product and risk warranty and field service returns. Either option can cause the company to lose significant market share and profit. This paper sets out to solve this dilemma by meeting both the time to market constraints and the product reliability goals. The weaknesses of existing reliability demonstration techniques are explored and a comprehensive methodology is introduced involving controlled development processes, stress testing, root cause determination and process change feedback mechanisms. All prototype products are manufactured on the final volume process line, resulting in the early identification and correction of process-related problems. Testing commences on the first available prototypes, with system stress/robust testing being employed to stimulate failures, determine their root cause and correct them. Reliability growth modelling assesses the ongoing improvements occurring in reliability during the development cycle, while a statistical stopping rule is used to determine the optimal product release time without risking product warranty. The approach is applicable to systems incorporating both hardware and software elements. The methodology has been validated on three development projects of telecommunication systems comprising hardware and software. In addition to enhancing team behaviour and performance, the development times have been reduced by 14% and the ramp-up time to full worldwide product shipments has been reduced by 50%. Copyright © 2005 John Wiley & Sons, Ltd.

KEY WORDS: new product introduction; reliability demonstration; reliability growth; time to market; stress testing; stopping rule; SURGE

∗ Correspondence to: John Donovan, School of Engineering, Institute of Technology, Ballinode, Sligo, Ireland.
† E-mail: [email protected]

Received 20 July 2003

Revised 22 January 2004


1. INTRODUCTION

When companies ship a product late there is a loss in profit. Companies typically incur a 33% reduction in after-tax profit when they ship a product six months late1. Kumar and McCaffrey2 report that a company operating in the Enterprise Storage Business has around only six quarters to recover a typical investment of between $100 million and $500 million. With a typical product price erosion of 9–12% each quarter, the penalty for being one quarter late to market is a gross margin reduction of $150 million. Concurrent with this, customers are demanding increased quality and reliability in their products. Gonzalez and Palacios3 report that product success is largely related to product quality. Shorter development times alone will have no effect on product success if the new product does not satisfy the customers' needs. This dilemma places the reliability engineer in a difficult position as traditional reliability techniques fail to satisfy these competing expectations. Before a product can be manufactured in volume and shipped to customers, it is first necessary to develop tests that demonstrate that the product has attained a specific level of reliability. The traditional method used to assess this is by means of a reliability demonstration test such as Sequential Probability Ratio Testing (SPRT) applied at the end of the development cycle. These test plans are included in MIL-HDBK-781A4.

This testing is applicable for repairable equipment that exhibits an exponential failure rate distribution and represents a hypothesis test whose objective is to test whether the measured mean time between failures (MTBF) equals an upper test (acceptable) MTBF. The alternate hypothesis is that the MTBF equals a lower test (unacceptable) MTBF.

This testing is costly, time consuming and wasteful of resources. For example, if an acceptable MTBF is 90 000 h and an unacceptable MTBF is 30 000 h, then with 20% consumer's and producer's risk, the product is required to operate successfully, without error, for 80 100 h. If even one failure occurs during this period then the test time is automatically extended to 129 600 h. If three failures occur then the product is rejected as being incapable of achieving an MTBF of 90 000 h per the test plan.

As there are 168 h in a week, and approximately 700 h per month, the above test plan would indicate that with a sample of 20 units, it would take almost 6 months without failure to pass the test. With just a single failure, this testing would extend to 9 months. To speed up the reliability testing, more and more units are placed on test. Yang and Zunzanyika5 have reported using up to 400 units in an attempt to reduce this test period.
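To put these figures in concrete terms, the short sketch below converts the accumulated test hours demanded by such a plan into calendar months for a given sample size. The 80 100 h and 129 600 h totals, the 20-unit sample and the approximate 700 test hours per unit per month are taken from the example above; the helper function itself is purely illustrative.

```python
# Rough calendar-time estimate for a fixed accumulated-hours test plan.
# Figures follow the example in the text; the function itself is illustrative.

def calendar_months(required_total_hours: float, n_units: int,
                    hours_per_unit_per_month: float = 700.0) -> float:
    """Months needed to accumulate the required test hours across n_units."""
    hours_per_unit = required_total_hours / n_units
    return hours_per_unit / hours_per_unit_per_month

if __name__ == "__main__":
    # Zero-failure plan: 80 100 accumulated hours on 20 units
    print(f"No failures : {calendar_months(80_100, 20):.1f} months")   # ~5.7 months
    # One failure extends the plan to 129 600 accumulated hours
    print(f"One failure : {calendar_months(129_600, 20):.1f} months")  # ~9.3 months
```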

2. LIMITATIONS OF RELIABILITY DEMONSTRATION TESTING

Reliability demonstration testing is by no means the most effective and inclusive test that can be performed as it creates many difficulties for product development teams attempting to operate in a concurrent engineering environment. The emphasis of the test is on the accumulation of test hours and the minimization of interactive human involvement with the product under evaluation. It has been observed by Meeker and Hamada6 that 'for high reliability products, with limitations on the number of samples and amount of available testing time, such demonstrations are expensive, impracticable and impossible'. The problems with sequential probability ratio testing are of both a technical and social nature. A number of these problems have been discussed by other authors7–9 and include the following.

• It is economically costly in terms of units and time.
• Sequential probability ratio testing assumes a constant failure rate. The concept of a constant failure rate is currently being questioned within the electronics industry, where a decreasing failure rate is now considered more appropriate10.
• The infancy mortality period is ignored. This is a period with a decreasing failure rate and can be quite lengthy. Peck11 reports this period extending to 10 000 h while McLinn12 reports that the Society of Automotive Engineers (SAE) places the infant mortality period at up to 100 000 h. The effect of ignoring this period is twofold. On the one hand, the resulting failure distribution is not consistent with the constant failure rate assumption. On the other hand, a single infancy failure can cause the termination of the test and the rejection of the null hypothesis.


• Electronic systems now contain more complex embedded software than they have ever previously contained. Demonstration tests tend to ignore this fact and place the emphasis on the testing of the hardware. The reliability of such systems is inadequately assessed during reliability demonstration tests. More appropriate testing would involve operational testing such as interactive use, robust testing and the testing of previously untried paths in the code. The accumulation of hours involving a repeat of the same sequence of tests does not adequately assess the software.

• The test focus is on measurement rather than improvement. Clearly, if measurement indicates that the test plan cannot be successfully completed, then the reliability must be improved. However, if the product can meet the acceptable upper MTBF per the test plan, there is no further focus on improvement. Kromholtz and Condra13 point out that one of the problems associated with the use of military standards is that they ironically result in 'reduced attention to actual reliability improvement'. This is because the standard defines what is acceptable and provides no impetus to move beyond this.

• The test creates dysfunctional behaviour in project teams, where the emphasis is on transferring blame rather than solving the problems. Design engineers tend to blame the manufacturing process while manufacturing engineers will blame the design.

• It is inappropriate in a concurrent engineering environment where the emphasis is on performing tasks in parallel.

• It is analogous to an end-of-line reliability audit. Modern quality assurance techniques have moved from end-of-line audits towards the management of internal processes. Indeed, Montgomery14 reports that the maturity of a company is in inverse proportion to the amount of inspection that it performs.

• There is no emphasis on controlled processes as a means of attaining reliability. If one takes the viewpoint that the successful completion of a reliability demonstration test represents the 'end point' of a reliability programme, then one has missed out on many of the factors that influence and improve reliability. Controlling a process ensures that its output is predictable in terms of time, quality and reliability. If good controls and monitoring are absent, even the successful completion of a reliability demonstration test is almost meaningless as it purely represents the outcome of the test under a specific set of conditions and time. There is absolutely no guarantee that such conditions, and therefore the reliability of the product, will hold into the future.

Donovan et al.15 analysed the data from 16 reliability demonstration tests performed using the SPRT and showed that the effective outcome of reliability demonstration testing is merely the debugging of products. Defects were found in the design, components and manufacturing process that could have been detected much earlier. It was not necessary to wait for a reliability demonstration test to identify defects that were due to poor designs and poor manufacturing processes. As the SPRT test occurs at the very end of the development cycle, the defects could all have been found much earlier and off the critical path if a more proactive and inclusive test methodology had proceeded in parallel with the product development cycle, as shown in Figure 1. This approach has been represented as SURGE (Stress Unveiled Reliability Growth Enhancement) and a minimum saving of 2 months' development time is envisaged.

3. ELEMENTS OF THE SURGE APPROACH

3.1. Introduction

The weaknesses of the reliability demonstration test via the SPRT approach force one to search for a viable alternative. Reliability demonstration testing is a concept of a bygone era when components and processes were inherently unreliable, resulting in low system reliability. It is now more appropriate to move from the traditional probabilistic approach to a deterministic approach as components and processes become more reliable.

The SURGE process represents a comprehensive methodology that focuses on reliability improvement rather than reliability measurement. The experience of Donovan et al.15 has identified that reliability testing must start as early as possible and continue for the remaining duration of the development phase. By controlling the development process, generic improvements can be made to the development process that builds upon the experience gained on each project. While eliminating the root cause of failure mechanisms in the product, the root cause of the deficiencies in the development process can also be eliminated.


[Figure 1 contrasts the existing product development cycle (design, including design verification, system testing and regulatory approvals; reliability demonstration tests; manufacture of shipment units; shipment; 14 months in total) with the cycle under SURGE, in which SURGE testing runs in parallel with design and manufacturing, shortening the cycle to 12 months and saving approximately 2 months.]

Figure 1. Comparison of the product development cycle prior to and post SURGE

[Figure 2 lists the features involved in the transition from reliability demonstration testing (on the critical path, costly, reactive, unpredictable, measures, simulates, no prior controls, dysfunctional, tests to specification, extends cycle time) to the SURGE process (a concurrent process, less costly, proactive, a predictable process, improves reliability, stimulates defects, prior controls, teamwork, tests beyond specification, reduces cycle time).]

Figure 2. Reliability demonstration versus SURGE process


All prototype products should be manufactured using the final volume process, ensuring that from a very early stage both the product and the manufacturing process are assessed.

The SURGE approach focuses on understanding why the failure occurred and eliminating the cause rather than predicting times to failure. It strives to ensure reliability by controlling the input variables/parameters of the processes rather than by measuring the failure rate of the final product. One observes in Figure 2 that the features of the alternate approach (SURGE) are designed to assure reliability rather than measure it.


[Figure 3 shows the elements of the SURGE process: controlled design, manufacturing and procurement processes; stress/robust testing; reliability growth monitoring; and a statistical stopping rule.]

Figure 3. Overview of SURGE process elements

The constrained measurement focus associated with reliability demonstration testing has been converted to a broader holistic philosophy. All efforts are made through defined controls and teamwork to move reliability assurance upstream in the product development cycle. The objective of the SURGE process is to satisfy the needs of the customer by moving beyond the numerical approach of reliability demonstration testing towards a reliability enhancement philosophy. By moving in this direction, the time to market is reduced while at the same time the quality and reliability of both the product and the process are improved. The focus is shifted from reliability measurement towards reliability improvement.

Controlled design, manufacturing and procurement processes represent the basic building block for the SURGE methodology, as shown in Figure 3.

Continual stress/robust tests involving extreme data traffic rates and conditions are applied to the prototype product over the remainder of the development phase with the purpose of precipitating, identifying and correcting defects in the design, manufacturing process and off-the-shelf components/assemblies. As these defects are resolved, the reliability growth model graphically portrays the reliability improvements occurring as a result of the defect stimulation and removal exercise. The term 'cumulative MTBF' continues to be used in the growth model simply because it is an historical term in the graphical portrayal of reliability growth. It is not the intention of the reliability growth model to quantify or demonstrate a particular reliability goal, but rather to graphically illustrate the improvements to the development team. Finally, the stopping rule provides guidance to management on the risks associated with releasing the product onto the marketplace and whether any potential problems exist that would cause warranty concerns. It is acknowledged that the reliability data have been achieved using accelerated test conditions and that any failure rates observed in the field should be significantly less than the failure rates observed during the accelerated reliability testing.

3.2. Controlled processes

Modern quality assurance techniques have moved from end-of-line audits to the management of the internal process. A more appropriate approach in the present environment is to introduce internal controls into the design and development phase and to continually monitor and assess the product reliability during development and introduction into manufacturing. The importance of controlled development processes for assuring reliability cannot be overstated. It is also the foundation on which the Software Engineering Institute (SEI) builds the Capability Maturity Model for the development of software16. The necessity for controlling the processes can be further demonstrated by looking at the changes that are taking place in the Very Large Scale Integration (VLSI) industry. Crook17 and Schaft et al.18 argue that controlling the processes should represent the prime means of ensuring reliability. At low hazard rates, too few failures are encountered to provide statistically valid results in economically realistic situations. They suggest that if one can no longer measure with confidence the reliability of a device, then the rules must be changed and a new approach is required. Soden and Anderson19 have reported that the traditional methods for qualifying components for military and space applications are becoming impractical due to time and cost. As a solution, such industries are moving towards qualifying the manufacturing line that produces the components rather than the individual components themselves.

3.3. Stress/robust testing and corrective action mechanisms

Continual stress/robust tests are applied to prototypes for the entire development phase to identify weak points in the design and manufacturing process. This involves applying both environmental and operational stresses, including extreme data traffic rates and conditions, to precipitate defect detection. As defects in the design and manufacturing process are identified and corrected, the reliability of the product begins to grow. Due to feedback mechanisms inherent in the methodology, the ability of the process to produce a reliable product also grows. This growth is modelled and allows the development team to monitor the reliability improvement. Finally, a mathematical stopping rule determines the optimum time for product release.

Figure 4 represents in diagrammatic form the control/monitoring and feedback mechanisms associated with the SURGE process. If the complete development cycle is treated as a process that can be controlled, and similarly all inputs (i.e. the design, materials, manufacturing) to that process are likewise controlled, then the output is controlled and representative of future products. By measuring the outputs (i.e. prototype product) of the controlled development cycle at the earliest opportunity, one can assess how the product and process will behave in volume production. Measuring the output of an uncontrolled process only provides us with a glimpse of what the process is like at any one time, while measuring the output of a process under control provides us with a vision of its future behaviour. If one can control and monitor these processes, then reliability demonstration tests to assess reliability become unnecessary. The emphasis on controlled processes was introduced, not with a view to restricting the creativity of engineers, but rather with the view to ensuring that the product development cycle was repeatable and predictable.

Features of SURGE represented in Figure 4 include the following.

• A design process that is fully controlled such that it is systematic, predictable, reliable and follows defined methodologies and procedures.
• All prototype components/materials are procured from approved vendors.
• All prototype units are manufactured via the final volume process.
• Stress/robust testing commences as soon as the first prototype is available and continues until the final product is ready for shipment (six to seven months typically).
• A sample of each prototype build is used for stress/robust testing.
• The testing involves the application of environmental and operational stresses, including extreme data traffic rates and conditions, to identify weak points in new designs. This includes the robust testing of software.
• As changes are made to the design, these are incorporated into the units undergoing test.

With such a process, each prototype product is as representative, in terms of quality and reliability of the final product, as that which will be produced in volume production.

The SURGE process has as its philosophy the detection and elimination of all potential failures that will impact on the reliability of products. By continually monitoring/testing prototypes and introducing the appropriate changes in the development process, the product reliability will grow in addition to setting correct team behaviour. This method is more cost effective while at the same time introducing process changes that will be reflected in increased reliability and quality of future products. The SURGE process represents a feedback process for improving a product's reliability by testing to its design limit and beyond. When failures occur there are two stages of resolution, as observed in Figure 4.


[Figure 4 depicts the development cycle as controlled inputs (controlled design, controlled materials, controlled manufacturing processes) producing the prototype product, which is subjected to environmental tests, network tests and operational/abuse tests over six to seven months as part of the SURGE process. Failures feed back through stage 1 (product resolution) and stage 2 (process resolution) before the volume product is released.]

Figure 4. Representation of SURGE control/feedback mechanisms

• Stage 1 resolution: root causes are determined and appropriate changes are implemented until no unaddressed failure modes remain on the product.
• Stage 2 resolution: once testing is completed and the product has been released onto the marketplace, a second stage of resolution occurs whereby changes are made to the design and manufacturing process. This ensures that generic sources of failure will not recur on future products.

As defects are identified and corrected in the design, components and manufacturing process, the product's reliability begins to grow.

3.4. The reliability growth model

The purpose of the reliability growth model is to graphically portray the growth, or otherwise, in reliability to the project team as the development phase progresses.

Duane20 suggested a reliability growth model in which the cumulative MTBF (θ) and the cumulative time (T) are related by Equation (1):

\[
\ln \theta = \ln \alpha_2 + \beta_2 \ln T \qquad (1)
\]

Although the term 'cumulative MTBF' has little true meaning, it is nonetheless instructive in displaying a growth in reliability due to defect root cause elimination. As there are a number of limitations with the Duane model, Donovan and Murphy21 developed a replacement model shown in Equation (2), based on variance stabilization transformations, that continues to use the conventional cumulative MTBF term:

\[
\theta = \alpha_1 + \beta_1 \sqrt{T} \qquad (2)
\]

This model plots as a straight line with an intercept α1 and a slope β1. The cumulative MTBF (θ) and the square root of cumulative time (√T) are plotted on the y-axis and x-axis, respectively. Simulation by Donovan and Murphy21 has shown that the new model provides superior performance to the more traditional Duane model.
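As a minimal illustration of fitting the two growth models, the sketch below performs an ordinary least-squares fit of Equation (1) on log-transformed data and of Equation (2) on cumulative MTBF against the square root of cumulative time. The failure times used are a short hypothetical series rather than the case study data, and the code is only an illustrative sketch of the fitting step.

```python
# Illustrative least-squares fits of the Duane model (Equation (1)) and the
# Donovan-Murphy model (Equation (2)). The failure times below are hypothetical.
import numpy as np

# Cumulative times to failure T_i (hours), in order of occurrence (hypothetical).
T = np.array([60.0, 410.0, 1200.0, 2600.0, 4900.0, 8200.0, 12500.0, 18000.0])
i = np.arange(1, len(T) + 1)
theta = T / i                      # cumulative MTBF after each failure

# Duane model:          ln(theta) = ln(alpha2) + beta2 * ln(T)
beta2, ln_alpha2 = np.polyfit(np.log(T), np.log(theta), 1)

# Donovan-Murphy model: theta = alpha1 + beta1 * sqrt(T)
beta1, alpha1 = np.polyfit(np.sqrt(T), theta, 1)

print(f"Duane fit:          alpha2 = {np.exp(ln_alpha2):.1f}, beta2 = {beta2:.3f}")
print(f"Donovan-Murphy fit: alpha1 = {alpha1:.1f},  beta1 = {beta1:.3f}")
```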


The Duane model cannot be used to graphically display reliability progress to the development team on a weekly or biweekly basis as log scales are used on both the x- and y-axes. As testing progresses and the cumulative time on testing increases, little change is observed on the Duane plot because there is little difference evident between the natural logs of large values of cumulative time. This causes a clustering effect on the Duane model, as failures occurring at high values of cumulative time will tend to be clustered together. This clustering is avoided in the Donovan–Murphy model, as there is no transformation of cumulative MTBF and only a square root transformation of cumulative time.

Donovan and Murphy21,22 have shown that their model provides a much better fit to failure data over the range of Duane slopes (β2) typically observed during reliability growth programmes. In addition, it allows the development team to monitor and display the reliability improvements at weekly or biweekly planning sessions.

It is worthwhile stating that it is not the intention to use the reliability growth model to quantify or extrapolate any reliability data or statistics, but rather to act as a graphical tool to display continual reliability test progress to the development team. Because of this, it is acceptable to combine data from various stress and operational tests.

3.5. The statistical stopping rule

Finally, by means of a statistical stopping rule, one can assess when the product has achieved a pre-defined level of reliability that allows one to terminate SURGE testing and proceed with product shipment. This stopping rule of Donovan and Murphy23 is an adaptation of the Ross24 stopping rule. The rule introduces the concept of a defect hazard rate, whereby a system has an unknown number of defects present, each of which has its own defect hazard rate. The assumptions of the rule can be stated as follows.

• The system has an unknown number of defects m.
• Each defect i (i = 1, . . . , m) is independent and has an associated hazard rate λi, which occurs in accordance with a Poisson process.
• When a failure occurs, its cause is investigated and the defect found and corrected.

One problem with the Ross stopping rule is that it is possible for the rule to indicate test termination at the same time as a defect is detected. It was considered reasonable by the authors that testing should never terminate immediately after a failure. This resulted in a modification to the Ross stopping rule without adversely affecting the structure of the original rule.

All types of defects can be accommodated within this stopping rule and it is applicable to the total system incorporating both hardware and software. When assessing the reliability of a new product, development engineers recognize that failures arise as a result of defects. These defects may be the result of deficiencies in the design, materials or manufacturing processes. It is never acceptable to ignore a failure on the basis that it is a random occurrence and nothing can be done about it. This is not the way engineers approach the problem. The emphasis must always be aimed at getting to the root cause of the defect and correcting it. Defects or errors in the design or the manufacturing process are essentially similar to software defects. Each is correctable and, when successfully corrected in the design or manufacturing process, should not recur. As all failures arise from correctable defects, these assumptions can be applied to electronic systems containing both hardware and software elements.

The assumptions listed above indicate that each defect i in the system has an associated defect hazard rate λi. The summation of individual defect hazard rates is totally dissimilar to the concept of a constant hazard rate for a system. A system that initially contained m defects will exhibit a decreasing hazard rate as the number of defects is reduced from m due to defect detection and correction. If defects are not corrected, or a failing component is replaced with one containing the same defect, then clearly that defect and its associated hazard rate remain within the system. The concept of defects and their subsequent removal having the effect of reducing system hazard rates has previously been discussed by Wong25, although he has referred to these as flaws rather than defects.

Whereas software has no wearout mechanisms, neither is wearout generally considered to be an issue during the service life of modern electronic equipment. Pecht and Nash26 report that once 'wearout mechanisms were understood, it proved possible to "design them out," with the result that wearout failures are no longer likely to occur during the normal service life of microelectronic devices'. Wong25 also reports, 'The wearout life for electronic systems is so long that for most equipment it becomes academic to even investigate it'. He further acknowledges '. . . from a reliability viewpoint it is not necessary to address gross wearout failures. However the knowledge of wearout mechanisms is very important because the same mechanisms are also causing the failures during the useful lifetime of electronics at sites of built-in flaws'. If a wearout mechanism has been identified during the development phase testing then it has arisen due to a defect, either due to design overstress or a defect within the component. In any case, the emphasis must be on identifying and correcting this underlying defect.

The stopping rule advocates a minimum test time (Tmin) to avoid terminating the test too early. This minimum test time, shown in Equation (3), was developed by Donovan and Murphy23:

\[
T_{\min} = \frac{-\ln \delta}{\text{acceptable hazard rate}} \qquad (3)
\]

where δ represents the probability of no failure occurring by Tmin. It is preferable to have δ small as there is no wish to terminate the test prior to encountering failures. A δ = 0.05 indicates that there is only a 5% chance that Tmin will have been reached without encountering any failure.
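As a minimal numerical sketch of Equation (3), the snippet below computes Tmin for δ = 0.05 and an acceptable hazard rate of 1 × 10−4 failures per hour (the reciprocal of the 10 000 h goal used in the case study later); both values appear in the paper, while the function name is an illustrative choice.

```python
# Minimum test time from Equation (3); values follow the case study
# (delta = 0.05, acceptable hazard rate = 1/10 000 h = 1e-4 per hour).
import math

def minimum_test_time(delta: float, acceptable_hazard_rate: float) -> float:
    """T_min = -ln(delta) / acceptable hazard rate."""
    return -math.log(delta) / acceptable_hazard_rate

if __name__ == "__main__":
    t_min = minimum_test_time(delta=0.05, acceptable_hazard_rate=1e-4)
    print(f"T_min = {t_min:.0f} h")   # roughly 30 000 h, as used in the case study
```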

Once the minimum test time is accumulated, the stopping rule shown in Equation (4) indicates that testing should terminate at the earliest time t such that

\[
\frac{1}{t - T_{D(t)}} + 3 \left( \sum_{i=1}^{D(t)} \frac{e^{-t/T_i}}{T_i^2 \, (1 - e^{-t/T_i})^2} \right)^{1/2} \le \text{acceptable hazard rate} \qquad (4)
\]

Ti represents the time to failure of defect i. D(t) is the number of defects by time t. The acceptable hazard rate represents a pre-defined hazard rate that the product should attain before testing is terminated and the product is released to the marketplace. 1/(t − TD(t)) represents the point estimate of the product hazard rate in the period since the last failure. Simulation of this stopping rule by Donovan and Murphy has shown that it is more effective than the Ross stopping rule.
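To make the termination criterion concrete, the sketch below implements the inequality in Equation (4): given the cumulative times to failure observed so far and an acceptable hazard rate, it searches forward in time for the earliest t, at or beyond the minimum test time and the last failure, at which the left-hand side falls to or below the target. The function names and the simple fixed-step search are illustrative choices rather than part of the published rule.

```python
# Sketch of the stopping rule in Equation (4). Function names and the coarse
# step-based search are illustrative; only the inequality follows the paper.
import math

def lhs(t: float, failure_times: list[float]) -> float:
    """Left-hand side of Equation (4) at cumulative test time t."""
    last = max(failure_times)                     # T_D(t): time of the last failure
    point_estimate = 1.0 / (t - last)
    s = sum(math.exp(-t / Ti) / (Ti**2 * (1.0 - math.exp(-t / Ti))**2)
            for Ti in failure_times)
    return point_estimate + 3.0 * math.sqrt(s)

def stopping_time(failure_times: list[float], acceptable_hazard_rate: float,
                  t_min: float = 0.0, step: float = 10.0) -> float:
    """Earliest t (>= t_min and after the last failure) satisfying Equation (4)."""
    t = max(t_min, max(failure_times) + step)
    while lhs(t, failure_times) > acceptable_hazard_rate:
        t += step
    return t
```

Any root-finding method could replace the coarse step search; the sketch only aims to show how the inequality is evaluated.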

The stopping rule is also quite robust to the time of failure occurrence and it is not necessarily adversely affected by a small number of failures, providing of course that they do not occur at the end of the development and test phase. This represents an advantage for the stopping rule over conventional reliability demonstration techniques as an aid in determining when to terminate development testing and release products onto the market.

4. CASE STUDY EXAMPLE

The SURGE principles outlined earlier have been used in the development of network communications systems. These systems are comprised of hardware and software components. The hardware includes a number of network interfaces/cards and a power supply. The data from one of these projects are used to illustrate the SURGE process. While the product had a predicted MTBF in the region of 100 000 h, it was considered acceptable that 10 000 h should be demonstrated using the stopping rule. A minimum test time of 30 000 h was identified using a value of δ = 0.05. The sample size grew to a total of 30 units over a six-month period and 26 distinct defects or failures were encountered; these are shown in Table I. The data do not include any repeat occurrences of the same defect due to the same root cause. Indeed, it is one of the assumptions of Ross' stopping rule that once a failure occurs its root cause is found and corrected. As a result, repeat occurrences have not been included in the mathematical development of the stopping rule. Inclusion of these identical defects would actually have skewed the data, for the development team, as part of the debugging process, will attempt to deliberately reproduce a failure as a means of identifying its prevalence and providing information as to its root cause. For example, intermittent contacts on two distinct areas of a printed circuit board are not considered as repeat occurrences, while repeated intermittent contacts on the same pin of the same device are considered as repeat occurrences. The cumulative time to failure for each defect is represented in column 2 of Table I.


Table I. Failure information and stopping times for the case study example

| Defect/failure number i | Cumulative time to failure Ti (h) | Stopping time t (h) | Cumulative MTBF θi = Ti/i | √Ti | Failure attributed to | Stress applied when failure detected |
|---|---|---|---|---|---|---|
| 1 | 1 | 10 010 | 1 | 1 | Process | Temperature/humidity cycling |
| 2 | 60 | 10 070 | 30 | 8 | Component | Temperature/humidity cycling |
| 3 | 8230 | 31 130 | 2743 | 91 | Process | Temperature/humidity cycling |
| 4 | 8300 | 35 030 | 2075 | 91 | Design | Temperature/humidity cycling |
| 5 | 8350 | 37 570 | 1670 | 91 | Process | Temperature/humidity cycling |
| 6 | 12 568 | 42 510 | 2095 | 112 | Design | Operational |
| 7 | 15 556 | 47 260 | 2222 | 125 | Design | Operational |
| 8 | 19 876 | 52 590 | 2485 | 141 | Component | Random vibration |
| 9 | 19 900 | 56 150 | 2211 | 141 | Process | Random vibration |
| 10 | 19 910 | 59 220 | 1991 | 141 | Component | Random vibration |
| 11 | 27 200 | 64 570 | 2473 | 165 | Process | Temperature/humidity cycling |
| 12 | 27 210 | 67 770 | 2268 | 165 | Component | Temperature/humidity cycling |
| 13 | 27 700 | 70 850 | 2131 | 166 | Process | Temperature/humidity cycling |
| 14 | 28 660 | 73 840 | 2047 | 169 | Component | Temperature/humidity cycling |
| 15 | 34 450 | 78 080 | 2297 | 186 | Component | Temperature/humidity cycling |
| 16 | 37 400 | 81 660 | 2338 | 193 | Process | Power on/off |
| 17 | 37 410 | 84 350 | 2201 | 193 | Process | Temperature/humidity cycling |
| 18 | 41 250 | 87 920 | 2292 | 203 | Component | Temperature/humidity cycling |
| 19 | 42 000 | 90 660 | 2211 | 205 | Process | Temperature/humidity cycling |
| 20 | 42 100 | 93 130 | 2105 | 205 | Process | Operational |
| 21 | 44 020 | 95 940 | 2096 | 210 | Process | Temperature/humidity cycling |
| 22 | 48 600 | 99 360 | 2209 | 220 | Process | Temperature/humidity cycling |
| 23 | 51 600 | 102 430 | 2243 | 227 | Design | Operational |
| 24 | 55 100 | 105 600 | 2296 | 235 | Component | Temperature/humidity cycling |
| 25 | 82 100 | 117 120 | 3284 | 287 | Process | Temperature/humidity cycling |
| 26 | 108 300 | 133 740 | 4165 | 329 | Component | Temperature/humidity cycling |

This information is transformed in columns 4 and 5 of Table I and is used to develop the reliability growth plot. This allows one to assess whether there is evidence of reliability growth.

Figures 5 and 6 show the growth in reliability that is observed when the Donovan–Murphy and Duane models are applied, respectively. The use of the Donovan–Murphy model allows the development team to observe any reliability progress that takes place at biweekly meetings. This progress cannot be observed in the Duane model as it is dominated by the earliest failures. Indeed, in the Duane model, the later 20 failures are all clustered together and the only way to observe what is occurring is to draw a growth plot that ignores the very early failures. In the Donovan–Murphy model, one can easily observe the periods of reliability growth followed by reliability decay. The periods of reliability decay are generally related to the incorporation of new product features and functionality. Much of the additional functionality had inherent defects that were detected during the stress/robust testing described earlier. It is important to note that all added functionality had passed its design verification tests prior to being incorporated. One can observe from Table I and Figure 5 that sustained reliability growth occurred after approximately 42 000 cumulative test hours (20th failure).

Column 3 of Table I indicates the proposed stopping times when the stopping rule outlined in Equation (4) is applied. For example, the proposed stopping time for failure number 3 is 31 130 cumulative test hours. These calculations are shown in Equation (5):

\[
\frac{1}{t - 8230} + 3 \left( \frac{e^{-t}}{(1 - e^{-t})^2} + \frac{e^{-t/60}}{60^2 (1 - e^{-t/60})^2} + \frac{e^{-t/8230}}{8230^2 (1 - e^{-t/8230})^2} \right)^{1/2} \le 1 \times 10^{-4} \qquad (5)
\]


[Figure 5 plots cumulative MTBF against the square root of cumulative time for the case study data.]

Figure 5. The Donovan–Murphy reliability growth model

[Figure 6 plots ln(cumulative MTBF) against ln(cumulative time) for the same data.]

Figure 6. The Duane reliability growth model

The smallest value of t that satisfied this equation is 31 130 h. This result indicates that testing should continue until 31 130 cumulative test hours have elapsed before testing is terminated. However, we observed that before this time period had been reached another defect was detected, pushing the stopping time up to 35 030 h.
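Re-using the stopping_time() function from the sketch after Equation (4), the proposed stopping time for failure number 3 can be checked numerically from the first three failure times in Table I; with the 1 × 10−4 per hour acceptable hazard rate and the 30 000 h minimum test time, the search should land close to the 31 130 h quoted above.

```python
# Numerical check of Equation (5), re-using stopping_time() from the earlier sketch.
first_three_failures = [1.0, 60.0, 8230.0]       # cumulative hours, Table I rows 1-3
t_stop = stopping_time(first_three_failures,
                       acceptable_hazard_rate=1e-4,
                       t_min=30_000.0)           # minimum test time from Equation (3)
print(f"Proposed stopping time after failure 3: {t_stop:.0f} h")  # close to 31 130 h
```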

The stopping rule is quite robust to the time that a defect is detected. For example, let us consider all the defects that were detected by random vibration and operational testing. If all these defects had occurred at half or even double the number of cumulative test hours listed in Table I, then the overall effect on the test and the proposed stopping time is quite minimal. The stopping times would have been 132 000 and 138 000 cumulative test hours, respectively. This is not significantly different from the 133 740 cumulative test hours currently proposed.
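A sensitivity check of this kind can also be sketched with the same stopping_time() function: scale the cumulative failure times of the random vibration and operational defects (rows 6–10, 20 and 23 of Table I) by 0.5 or 2.0, leave the other rows unchanged, and re-run the search after the final failure. The row selection is read off Table I, but the loop itself is an illustrative reconstruction of the check described above, not the authors' published calculation.

```python
# Sensitivity sketch: rescale the vibration/operational failure times and re-run
# the stopping-time search from the earlier sketch. Row data are from Table I.
table1_times = [1, 60, 8230, 8300, 8350, 12_568, 15_556, 19_876, 19_900, 19_910,
                27_200, 27_210, 27_700, 28_660, 34_450, 37_400, 37_410, 41_250,
                42_000, 42_100, 44_020, 48_600, 51_600, 55_100, 82_100, 108_300]
# Defects found by random vibration or operational testing (1-based row numbers).
vib_or_operational = {6, 7, 8, 9, 10, 20, 23}

for scale in (0.5, 1.0, 2.0):
    times = [t * scale if (i + 1) in vib_or_operational else float(t)
             for i, t in enumerate(table1_times)]
    t_stop = stopping_time(times, acceptable_hazard_rate=1e-4, t_min=30_000.0)
    print(f"scale = {scale:3.1f}  ->  proposed stopping time ~ {t_stop:,.0f} h")
```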

The attributable causes of failure and the type of stress being applied when the failures occurred are shown in columns 6 and 7 of Table I, respectively. These are also graphically presented in Figures 7 and 8. There is no distinction made between hardware and software design failures. These failures are collectively attributed to the design process.


[Figure 7 is a pie chart of the attributable causes of failure: manufacturing process 50%, vendor/component 35%, design 15%.]

Figure 7. The attributable causes of failure for the case study example

[Figure 8 is a pie chart of the detection methods that identified the failures: temperature/humidity cycling 69.3%, operational 15.4%, random vibration 11.5%, power on/off 3.8%.]

Figure 8. The successful detection method used to identify the failures in the case study example

The root causes of the design defects (15%) fell into the following categories:

• misreading of the component programming specification by design engineering;
• initialization problem;
• occurrence of an unforeseen state while transmitting data.

The problems were largely the responsibility of the software design.

The root causes of the manufacturing process defects (50%) fell into the following categories:

• deficiencies in the surface mount process;
• incorrect/inadequate process documentation;
• poor rework and workmanship.

Resolution of these problems at an early stage during product development ensured that the worldwide volume build could proceed at a much smoother and more rapid rate.

The root causes of the vendor/component defects (35%) were as follows.

• Two distinct design defects were found on a power supply provided by an external vendor. This power supply was already shipping with other products within the corporation. The vendor accepted and corrected these design defects.


• Connector defects arising from:
  – contact pins outside the specification;
  – a missing shroud leading to intermittent contact.
• An undocumented design bug in a programmable device.
• A component outside its specification parameters, which was therefore sensitive to the applied voltage.
• Excessive epoxy applied during manufacture of crystal oscillators, resulting in brittle parts.

The root causes of these vendor/component defects were resolved by the suppliers. Successful resolution was confirmed through correspondence, analysis and retesting.

5. ANALYSIS OF THE SURGE METHODOLOGY

5.1. Analysis of defects and detection methods

The SURGE methodology was validated using three distinct product development projects of telecommunication systems comprising electronic hardware and software components. These development projects were largely conducted sequentially over a four-year period within the same design and manufacturing facility. The effect of changes to the development processes could then be observed over an extended period of time. The data shown in the earlier example represent the second product through the SURGE process.

The total number of defects detected during SURGE on the three products broke down relatively equally between the following three categories:

• design defects (35%);
• manufacturing process defects (35%);
• vendor/component defects (30%).

The design and manufacturing defects were due to the internal corporate design and manufacturing process. The vendor/component defects were due to non-conforming hardware components or assemblies supplied by external vendors, which failed to meet specified requirements. Such defects were due to design and manufacturing issues at the supplier's facilities. In most cases these same components and assemblies were already in use on existing products and had been evaluated as part of earlier SPRTs without any detection of their inherent defects. It is interesting that design defects on these same components/assemblies were only uncovered using the SURGE process.

Design defects accounted for 35% of the total number of defects found. Hardware defects accounted for only 6% of these while software accounted for 94%. Operational testing, including robust testing, was found to be the optimum method as this detected 73% of all software defects. The operational testing involved subjecting the product to network load variations and interactive operations designed to 'break' the system. This included increasing the traffic rates beyond those typically observed in practice.

5.2. Analysis of business improvements

Overall, a more cohesive and supportive team behaviour between design and manufacturing was evident during SURGE. This developed as a result of the non-threatening approach that the SURGE process fosters. There is no longer a defensive attitude adopted whenever defects are detected but rather a collective approach to identifying and solving the root causes of problems. Once the drastic implications of a single failure were removed, there was no longer any need for defensive postures.

The quality and reliability of many products have improved as a result of the SURGE methodology. SURGE testing detected inherent design defects on subassemblies, circuits and software modules that had successfully completed reliability demonstration testing on earlier products. As enhancements were made to these assemblies, the reliability of existing products also improved.


Improvements were also made to the design and manufacturing processes. These improvements were sequential in nature, as recommendations arising from the stage 2 resolution progressively corrected inferior procedures and practices. A reduction was observed in the number of design defects and software defects from product to product, as generic sources of defects were removed due to stage 2 resolution. The majority of the problems encountered in the first of the three products were due to software design issues. By adopting new software development processes, these generic defects were largely eliminated in the remaining two products. This reduction is reflected in the number of design defects being reduced from 91% of the total in the first product to 15% in the second product.

The adoption of the SURGE techniques has also resulted in the following business benefits to the organization.

1. The development cycle has been shortened by 14%.
2. The time taken to ramp up to full worldwide volume production has been reduced by 50%, i.e. from 12 weeks to 6 weeks. This was considered a significant breakthrough and was achieved because the final volume process was fully evaluated and debugged during SURGE. The confidence and experience of the manufacturing personnel were increased and resulted in fewer problems during the volume start-up phase.
3. The predictability in forecasting the product release date has been increased from a pre-SURGE accuracy figure of 64% to a current post-SURGE accuracy of 100%.
4. Since the release of the three products, there has been no engineering change order (ECO) required to correct any design or manufacturing deficiency on these products. Prior to SURGE, there was an average of one ECO per product after shipment.
5. A 60% reduction in the number of units required for reliability testing. On average, this has been reduced from 60 to 25 units.
6. The reliability of existing products was also improved as SURGE detected inherent design and manufacturing problems on subassemblies and circuitry that had previously completed two reliability demonstration tests per MIL-HDBK-781A. The detection and correction of such defects clearly leads to increased reliability of assemblies, benefiting many existing and future products.
7. The time taken to transfer the product to Support Engineering was reduced by 50%.

The time it takes to reach full production volume is referred to as the time-to-volume and, although largely overlooked, this is as critical to product success as time-to-market. In the case of a disk drive manufacturer, the ramp-up phase can represent a quarter of the product life cycle.

Terwiesch and Bohn27 reported a number of published examples where production problems during the volume ramp-up phase resulted in significant financial losses. It is characterized as a period when production personnel are learning about the process. Machine set-ups may be slow and special operations may be required to correct product and process oversights. One of the significant aspects of the SURGE process is that all prototypes are manufactured using the volume process. Any deficiencies in this process are corrected during SURGE and the learning period during volume production is thereby minimized. Success in reducing the time-to-volume has a knock-on effect on the transfer of the product to Support Engineering, who will maintain the product for the remainder of its life cycle. In the case of the three products identified, the time taken to transfer the products to Support Engineering was reduced by 50%. This is highly desirable from a design engineering perspective as it allows the development team to move onto the next project quickly without the hindrance of also providing continued engineering support.

6. CONCLUSIONS

The weaknesses of reliability demonstration testing per MIL-HDBK-781A have been presented. These limitations make it an impractical approach as reliability increases and development cycle times decrease. An alternative strategy called SURGE was developed to meet the requirements of the internal and external customer.

The SURGE process represents a change in direction and philosophy from reliability demonstration to reliability improvement of the product and process. It involves a shift from assessing the reliability of hardware and software separately to assessing the reliability of the system, including the hardware, software, manufacturing process and the processes that developed these.


The elements and models of the SURGE process have been presented in this paper. The approach has been validated using three products, and the results from one of these products have been presented. During this validation, defects were detected on subassemblies that had already completed reliability demonstrations on other products. The process has integrated the analysis of hardware, software and process failures within the one growth model. This allows the reliability to be plotted on readable scales and to be reviewed at weekly sessions by technical and non-technical staff. With the concept of defect failure rates, all types of defects can be accommodated within the one stopping rule.

From a business perspective, the SURGE process has also achieved its intended goals of increased reliability, reduced time-to-market and reduced time-to-volume.

REFERENCES

1. House C, Price R. The return map: Tracking product teams. Harvard Business Review 1991; 69:92–100.
2. Kumar S, McCaffrey TR. Engineering economics at a hard disk drive manufacturer. Technovation 2003; 23:749–755.
3. Gonzalez FJM, Palacios TMB. The effect of new product development techniques on new product success in Spanish firms. Industrial Marketing Management 2002; 31:261–271.
4. U.S. Department of Defense. MIL-HDBK-781A: Handbook for Reliability Test Methods, Plans and Environments for Engineering, Development Qualification and Production. U.S. Department of Defense, 1996.
5. Yang JJ, Zunzanyika XK. Field reliability projection during New Product Introduction (NPI) process. Proceedings of the Institute of Environmental Sciences, 1996; 58–67.
6. Meeker WQ, Hamada M. Statistical tools for the rapid development and evaluation of high-reliability products. IEEE Transactions on Reliability 1995; 44:187–198.
7. Evans RA. Mil-Std-781C and statistics [editorial]. IEEE Transactions on Reliability 1980; 29:1.
8. Schmee J. Correspondence. IEEE Transactions on Reliability 1980; 29:221.
9. O'Connor PDT. Practical Reliability Engineering. Wiley: New York, 2002; 356–357.
10. Mortin DE, Krolewski JG, Cushing MJ. Consideration of component failure mechanisms in reliability assessment of electronic equipment—addressing the constant failure rate assumption. Proceedings of the Annual Reliability and Maintainability Symposium. IEEE Press: Piscataway, NJ, 1995; 54–59.
11. Peck DS. New concerns about integrated circuit reliability. IEEE Transactions on Electron Devices 1979; 26:38–43.
12. McLinn JA. Constant failure rate—a paradigm in transition. Quality and Reliability Engineering International 1990; 6:237–241.
13. Kromholtz GA, Condra LW. A new approach to reliability of commercial and military aerospace products: Beyond military quality/reliability standards. Quality and Reliability Engineering International 1993; 9:211–215.
14. Montgomery DC. Introduction to Statistical Quality Control. Wiley: New York, 2002; 15–17.
15. Donovan J, Murphy E, Stephenson MI. SURGE process—a time-to-market approach to reliability improvement. International Journal of Quality and Reliability Management 1997; 14:834–848.
16. Paulk MC, Curtis B, Chrissis MB, Weber CV. Capability maturity model version 1.1. IEEE Software 1993; 10:18–27.
17. Crook DL. Evolution of VLSI reliability engineering. Proceedings of the European Symposium on Reliability of Electron Devices, Failure Physics and Analysis, Bordeaux. Elsevier: Amsterdam, 1991; 293–312.
18. Schaft HA, Baglee DA, Kennedy PE. Building in reliability: Making it work. Proceedings of the European Symposium on Reliability of Electron Devices, Failure Physics and Analysis, Bordeaux. Elsevier: Amsterdam, 1991; 19–32.
19. Soden JM, Anderson RE. IC failure analysis: Techniques and tools for quality and reliability improvement. Proceedings of the IEEE 1993; 81:703–715.
20. Duane JT. Learning curve approach to reliability monitoring. IEEE Transactions on Aerospace 1964; 2:553–566.
21. Donovan J, Murphy E. Reliability growth—a new graphical model. Quality and Reliability Engineering International 1999; 15:167–174.
22. Donovan J, Murphy E. Improvements in reliability growth modeling. Proceedings of the Annual Reliability and Maintainability Symposium, Philadelphia, PA. IEEE Press: Piscataway, NJ, 2001; 296–301.
23. Donovan J, Murphy E. An infrequently used stopping rule revisited. Quality Engineering 2001; 13:367–376.
24. Ross SM. Software reliability: The stopping rule problem. IEEE Transactions on Software Engineering 1985; 11:1472–1476.
25. Wong KL. The physical basis for the roller-coaster hazard rate curve for electronics. Quality and Reliability Engineering International 1991; 7:489–495.
26. Pecht MG, Nash FR. Predicting the reliability of electronic equipment. Proceedings of the IEEE 1994; 82:992–1004.
27. Terwiesch C, Bohn RE. Learning and process improvement during production ramp-up. International Journal of Production Economics 2001; 70:1–19.


Authors’ biographies

John Donovan PhD has been a lecturer in engineering at the Institute of Technology, Sligo since 1993. He received his primary degree in Electronic Engineering from the University of Limerick in 1982. He received a PhD in Reliability Engineering from the same university in 1998. Between 1983 and 1993 he worked for Digital Equipment Corporation in Clonmel and Galway, reaching the position of Design Assurance Manager. He is a member of the IEEE and the Irish Statistical Association. His research interests include statistical process control, experimental design, simulation and reliability engineering.

Eamonn Murphy PhD is the Boart Longyear Professor of Quality and Applied Statistics at the University of Limerick. He holds a BSc, MSc and PhD from University College Dublin. He has consulted extensively in quality management and quality systems throughout Europe. He is the Research Director of the National Center for Quality Management (NCQM) in Ireland. His research interests focus on adapting and deploying statistical tools and techniques as a means of significantly improving quality and reliability.
