Reliability Growth Planning: Its Concept, Applications, and Challenges

Tongdan Jin

Assistant Prof. of Industrial Engineering, Ingram School of Engineering

Texas State University-San Marcos

©2011 ASQ & Presentation Tongdan Jin

Presented live on Nov 11th, 2010

http://reliabilitycalendar.org/The_Reliability_Calendar/Webinars_‐_English/Webinars_‐_English.html

ASQ Reliability Division English Webinar Series

One of the monthly webinars on topics of interest to reliability engineers.

To view the recorded webinar (available to ASQ Reliability Division members only), visit asq.org/reliability.

To sign up for the live webinars, which are free and open to anyone, visit reliabilitycalendar.org and select English Webinars to find links to register for upcoming events.


Contents

•  RGT vs. RGP

•  Design for Reliability

•  New Reliability Monitoring Metrics

•  Reliability Growth under Budget Constraints

•  Conclusion

RGT vs. RGP

Design and Development

Prototype and Pilot Phase

Volume Production, Field Use and End of Life

Product Life Cycle

Reliability Growth Testing (RGT)

Reliability Growth Planning (RGP)

Why Need RGP?

•  Design Cycle Shrinks

•  Cut-off of Testing Budget

•  Different Design/Development Schedule

[Figure 3: Compressed System Design Cycle, illustrated for automatic test equipment. Time axis from t0 to t4; basic subsystems 1-3 (basic design, then volume manufacturing and shipping) and advanced subsystems 4-6.]

System Reliability vs. Shipment

[Figure: system MTBF and field system populations (system installs) plotted against chronological time, with the target MTBF marked.]

Driving Reliability Growth

[Figure: Reliability Growth Planning Across Lifecycle Time. Labels shown: design for reliability, design, hardware, software, mfg, process, NFF, failure mode Pareto, CA effectiveness, budget, optimization.]

Note: mfg=manufacturing, NFF=no fault found, CA=corrective action

Topic One: Design for Reliability

•  Component/Hardware Failures

•  Non-Component Failures: design weakness, software failures, manufacturing defects, process/handling issues, no-fault-found (NFF)

System Failure Mode Categories

[Figure: Failures Breakdown by Root-Cause Category. Bar chart from 0% to 50% over the categories Hardware (components), Design, Mfg, Process, Software, and NFF, with series A, B, C, and D.]

Different MTBF Scenarios

[Figure: MTBF scenarios plotted against time, relative to the target MTBF.]

Modeling Hardware Failure Rate

$$\lambda = \lambda_0 \pi_T \pi_E \pi_Q \pi_{FT} \pi_R$$

Where

$\lambda_0$ = base failure rate.

$\pi_T$ = temperature factor.

$\pi_E$ = electrical stress factor.

$\pi_Q$ = quality factor.

$\pi_{FT}$ = fault tolerance factor.

$\pi_R$ = redundancy factor.

For a given design, $\pi_T$ and $\pi_E$ play essential roles in the actual component reliability.

Aggregate Failure Rate for Hardware

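$$\lambda_{hw} = \sum_{i=1}^{k} n_i \lambda_i = \sum_{i=1}^{k} n_i \lambda_{0i} \pi_{Ti} \pi_{Ei}$$

$$E[\lambda_{hw}] = \sum_{i=1}^{k} n_i \lambda_{0i} E[\pi_{Ti}] E[\pi_{Ei}]$$

$$\mathrm{var}(\lambda_{hw}) = \sum_{i=1}^{k} n_i^2 \lambda_{0i}^2 \, \mathrm{var}(\pi_{Ti} \pi_{Ei})$$

Where

k = number of types of devices used in the product.

$n_i$ = quantity of the ith type of device used in the product.

$\lambda_{0i}$ = base failure rate for the ith type of device.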
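The mean and variance formulas above can be sketched in a few lines of Python. This is a minimal illustration, not the presenter's implementation: it assumes the temperature and electrical stress factors of each device type are independent (which the product of expectations suggests), and the `DeviceType` class, the helper names, and all numeric values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class DeviceType:
    n: int             # n_i: quantity of this device type in the product
    base_rate: float   # lambda_0i: base failure rate (faults/hour)
    mean_pi_t: float   # E[pi_Ti]
    var_pi_t: float    # var(pi_Ti)
    mean_pi_e: float   # E[pi_Ei]
    var_pi_e: float    # var(pi_Ei)


def product_variance(m1, v1, m2, v2):
    """var(XY) for independent X and Y with the given means and variances."""
    return (v1 + m1 ** 2) * (v2 + m2 ** 2) - (m1 * m2) ** 2


def hw_rate_mean_var(devices):
    """Return (E[lambda_hw], var(lambda_hw)) summed over all device types."""
    mean = sum(d.n * d.base_rate * d.mean_pi_t * d.mean_pi_e for d in devices)
    var = sum(
        (d.n * d.base_rate) ** 2
        * product_variance(d.mean_pi_t, d.var_pi_t, d.mean_pi_e, d.var_pi_e)
        for d in devices
    )
    return mean, var


# Hypothetical bill of materials with two device types.
devices = [
    DeviceType(n=4, base_rate=1e-8, mean_pi_t=1.5, var_pi_t=0.04,
               mean_pi_e=1.2, var_pi_e=0.01),
    DeviceType(n=10, base_rate=2e-9, mean_pi_t=1.0, var_pi_t=0.0,
               mean_pi_e=1.0, var_pi_e=0.0),
]
hw_mean, hw_var = hw_rate_mean_var(devices)
print(hw_mean, hw_var)
```

A device type whose stress factors have zero variance contributes to the mean but not to the variance, as the second entry shows.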

[Figure: ASIC Temperature Distribution. Histogram of quantity (0 to 14) by temperature bin in degrees Celsius (<65, [65, 70), [70, 75), [75, 80), [80, 85), [85, 90), >90), overlaid with a pdf curve (0 to 0.08).]

Challenges in Modeling Non-Hardware Failures

1.  Quite often data is not well recorded

2.  Varies from one product line to another

3.  Process related

4.  Design experience

5.  Other random factors

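Triangle Models for Non-Hardware Failures

$$g(\lambda) = \begin{cases} \dfrac{2(\lambda - a)}{(b - a)(c - a)} & a \le \lambda \le c \\ \dfrac{2(b - \lambda)}{(b - a)(b - c)} & c < \lambda \le b \\ 0 & \text{otherwise} \end{cases}$$

Where:

a = the smallest possible value of the failure rate

b = the largest possible value of the failure rate

c = the most likely value, with $c = 3\bar{\lambda} - b - a$

$\bar{\lambda}$ = the sample mean for the dataset

[Figure: triangular density g(λ) over [a, b], with peak height h at λ = c.]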

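Example for Non-Hardware Failure Estimate

Example: Historical data from predecessor products show that the failure rates pertaining to manufacturing issues are (faults/hour) $1.2 \times 10^{-6}$, $1.4 \times 10^{-6}$, and $2.4 \times 10^{-6}$. Then:

$\bar{\lambda} = (1.2 \times 10^{-6} + 1.4 \times 10^{-6} + 2.4 \times 10^{-6})/3 \approx 1.67 \times 10^{-6}$

$a = 1.2 \times 10^{-6}$

$b = 2.4 \times 10^{-6}$

$c = 3\bar{\lambda} - b - a = 1.4 \times 10^{-6}$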
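As a rough sketch of the triangle model, the parameters can be derived from a list of historical failure rates by taking a as the minimum, b as the maximum, and c = 3 × mean − a − b. The function names are illustrative assumptions; with exactly three samples, c works out to the middle value.

```python
def triangle_parameters(samples):
    """Return (a, b, c, mean) for a triangular model fitted to the samples."""
    a = min(samples)
    b = max(samples)
    mean = sum(samples) / len(samples)
    c = 3 * mean - a - b  # mean of a triangular distribution is (a + b + c) / 3
    return a, b, c, mean


def triangle_pdf(lam, a, b, c):
    """Triangular density g(lambda) with support [a, b] and mode c."""
    if a <= lam < c:
        return 2 * (lam - a) / ((b - a) * (c - a))
    if lam == c:
        return 2 / (b - a)  # peak height
    if c < lam <= b:
        return 2 * (b - lam) / ((b - a) * (b - c))
    return 0.0


# Manufacturing failure rates from the example (faults/hour).
rates = [1.2e-6, 1.4e-6, 2.4e-6]
a, b, c, mean = triangle_parameters(rates)
print(a, b, c, mean, triangle_pdf(c, a, b, c))
```

For larger datasets the computed c can fall outside [a, b]; a real implementation would need to decide how to handle that case.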

Combining HW and Non-HW Failure Rate

$$\lambda_{sys} = \lambda_d + \lambda_s + \lambda_m + \lambda_p + \lambda_o + \sum_{i=1}^{k} n_i \lambda_i$$

Where:

$\lambda_d$ = failure rate of design weakness

$\lambda_s$ = failure rate of software

$\lambda_m$ = failure rate of manufacturing

$\lambda_p$ = failure rate of process

$\lambda_o$ = failure rate of other issues (e.g. NFF)

k = total number of HW component types

$\lambda_i$ = failure rate for component type i

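Confidence Intervals for Failure Rate

$$\lambda_{sys} = \lambda_d + \lambda_s + \lambda_m + \lambda_p + \lambda_o + \sum_{i=1}^{k} n_i \lambda_i$$

$$\sigma^2_{\lambda_{sys}} = \sigma^2_{\lambda_d} + \sigma^2_{\lambda_s} + \sigma^2_{\lambda_m} + \sigma^2_{\lambda_p} + \sigma^2_{\lambda_o} + \sum_{i=1}^{k} n_i^2 \sigma^2_{\lambda_i}$$

[Figure: distribution of $\lambda_{sys}$, with $-2\sigma_{\lambda_{sys}}$ and $2\sigma_{\lambda_{sys}}$ marked about the mean.]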

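Application to Reliability Design (cont'd)

$$\mu_{sys} = E[\lambda_{HW}] + E[\lambda_{non\text{-}HW}] = 1.13 \times 10^{-5}$$

$$\sigma^2_{sys} = \mathrm{var}(\lambda_{HW}) + \mathrm{var}(\lambda_{non\text{-}HW}) = 2.23 \times 10^{-11}$$

[Figure: distribution of the system failure rate about $\mu_{sys}$, with an upper tail of 0.3% beyond $2.43 \times 10^{-5}$.]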

MTBF with 99.7% Confidence

$$\Pr\{MTBF \ge t\} \ge 99.7\%$$

$$\Pr\{\lambda_{sys} \le 1/t\} \ge 99.7\%$$

where $\lambda_{sys} = 1/MTBF$.

MTBF estimate with confidence: MTBF(99.7%) = 41,115 hours

Neutral MTBF estimate: the mean of the PCB failure rate is $1.13 \times 10^{-5}$ faults/hour, so MTBF = $1/(1.13 \times 10^{-5})$ ≈ 88,100 hours
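The confidence-bounded MTBF can be approximated as follows. This sketch assumes the system failure rate is approximately normally distributed, which the slides do not state explicitly, and uses the rounded mean and variance quoted earlier, so the result lands near but not exactly on the 41,115 hours shown. The function name is an illustrative assumption.

```python
from statistics import NormalDist


def mtbf_lower_bound(mean_rate, var_rate, confidence):
    """MTBF exceeded with the given confidence, using a one-sided normal
    upper bound on the failure rate."""
    z = NormalDist().inv_cdf(confidence)  # one-sided standard normal quantile
    upper_rate = mean_rate + z * var_rate ** 0.5
    return 1.0 / upper_rate


mean_rate = 1.13e-5   # faults/hour
var_rate = 2.23e-11   # (faults/hour)^2

mtbf_997 = mtbf_lower_bound(mean_rate, var_rate, 0.997)
mtbf_neutral = 1.0 / mean_rate
print(round(mtbf_997), round(mtbf_neutral))
```

The gap between the two estimates shows how much the variation in the failure rate matters: the neutral estimate is roughly twice the confidence-bounded one.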


Topic Two:

Failure Mode Rate &

Failure-In-Time

Pareto Chart for Failure Modes

Difficulties:

•  Static View

•  No Trend of Each Failure Mode

•  Fail to Reflect Product MTBF

[Figure: Pareto by Failure Mode From January to March. Quantity (0 to 14) and cumulative percentage (0% to 100%) for Relays, Resistors, No Fault Found, Cold Solder, Software Bug, and Op-Amp; bars split into No C/A, C/A In Process, and C/A Complete.]

[Figure: Pareto Chart by Failure Mode From April to June. Quantity (0 to 28) and cumulative percentage (0% to 100%) for Op-Amp, Resistors, Cold Solder, Relays, Software Bug, and No Fault Found; bars split into No C/A, C/A In Process, and C/A Complete.]

Note: C/A = corrective action

Failure Mode Rate (FMR)

$$FMR = \frac{\text{failures for a type of FM}}{\text{field product installations}}$$

FMR Estimation: Example

$$FMR = \frac{\text{failure quantity for a type of FM}}{\text{field product installations}}$$

For example, assume 120 PCBs were shipped and installed in the field in the first quarter, and 5 failures were returned due to poor solder joints. The FMR for poor solder joints in the first quarter is then

$$FMR = \frac{5}{120} = 0.042 \text{ faults/board/quarter}$$
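The FMR calculation is a single division; the short sketch below reproduces the solder-joint example. The function name is an illustrative assumption.

```python
def failure_mode_rate(failures, installations):
    """Failures of one failure mode per installed product over the period."""
    if installations <= 0:
        raise ValueError("installations must be positive")
    return failures / installations


# 5 poor-solder-joint returns out of 120 boards installed in the first quarter.
fmr_solder = failure_mode_rate(5, 120)
print(round(fmr_solder, 3))  # faults/board/quarter
```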

FMR Run Chart

[Figure: Failure Mode Rate (FMR) by Quarter. Failures per board (0.00 to 0.06) for relays, resistor, and op-amp over the 1st to 4th quarters, with cumulative PCB shipment (0 to 400) on the secondary axis.]

Estimate MTBF using FMR Chart

Quarter               1st       2nd       3rd       4th
Cumulative Shipment   120       200       220       264
Cum Run Hours         262,080   436,800   480,480   576,576
Cum FM Rate           0.117     0.150     0.057     0.051
Defective Boards      14        30        12        13
MTBF (hours)          18,720    14,560    38,541    42,856

[Figure: 13 Weeks Rolling MTBF. MTBF in hours (0 to 45,000) and defective boards (failures, 0 to 100) by quarter, 1st to 4th.]
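The run hours in the table equal the cumulative shipment multiplied by 13 weeks of continuous operation (13 × 7 × 24 = 2,184 hours), and MTBF is run hours divided by defective boards. A minimal sketch under that reading, with illustrative names:

```python
HOURS_PER_QUARTER = 13 * 7 * 24  # 13 weeks of continuous operation


def quarterly_mtbf(cum_shipment, defective_boards):
    """Return (run_hours, mtbf_hours) for one quarter."""
    run_hours = cum_shipment * HOURS_PER_QUARTER
    return run_hours, run_hours / defective_boards


q1_hours, q1_mtbf = quarterly_mtbf(120, 14)
q2_hours, q2_mtbf = quarterly_mtbf(200, 30)
print(q1_hours, q1_mtbf, q2_hours, q2_mtbf)
```

The first two quarters reproduce the table exactly. The third and fourth quarters come out slightly higher than the tabulated MTBF when whole defective-board counts are used, which may mean the table was computed from unrounded counts.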

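Estimate for PCB Failure Rate

$$\lambda_{sys} = \lambda_d + \lambda_s + \lambda_m + \lambda_p + \lambda_o + \sum_{i=1}^{k} n_i \lambda_i$$

$$FIT_{sys} = FIT_d + FIT_s + FIT_m + FIT_p + FIT_o + \sum_{i=1}^{k} FIT_i$$

Notice: $FIT = \lambda \times 10^{9}$

For the two sums to agree, $FIT_i$ is read here as the total for component type i, that is, $n_i \lambda_i \times 10^{9}$.

Where

$\lambda_d$ = failure rate of design errors

$\lambda_s$ = failure rate of software bugs

$\lambda_m$ = failure rate of manufacturing

$\lambda_p$ = failure rate of process

$\lambda_o$ = failure rate of other issues

$\lambda_i$ = failure rate for component type i

k = total number of new component types

$n_i$ = quantity of component type i used in the product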

FIT-Based Reliability Driven: Example (1)

FM Category             Target MTBF (hrs)   Target FIT
Overall Product         50,000              20,000
Components (hardware)   117,647             8,500
Others (NFF)            250,000             4,000
Design                  333,333             3,000
Manufacturing           500,000             2,000
Process                 666,667             1,500
Software                1,000,000           1,000

Notice: $FIT = \dfrac{10^{9}}{MTBF}$
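The conversion between FIT and MTBF, and the way the category targets add up to the product target, can be checked with a short sketch. The dictionary layout and function names are illustrative assumptions; the numbers are the targets from the table.

```python
def fit_to_mtbf(fit):
    """MTBF in hours from failures per 10^9 hours."""
    return 1e9 / fit


def mtbf_to_fit(mtbf_hours):
    """Failures per 10^9 hours from MTBF in hours."""
    return 1e9 / mtbf_hours


category_target_fit = {
    "Components (hardware)": 8500,
    "Others (NFF)": 4000,
    "Design": 3000,
    "Manufacturing": 2000,
    "Process": 1500,
    "Software": 1000,
}

product_fit = sum(category_target_fit.values())
print(product_fit, fit_to_mtbf(product_fit))
for name, fit in category_target_fit.items():
    print(name, round(fit_to_mtbf(fit)))
```

Because FIT values add across categories while MTBF values do not, allocating targets in FIT keeps the budget arithmetic simple.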

FIT-Based Reliability Driven: Example (2)

Product target FIT: PCB (20,000)

Categorical FM FIT      Failure Mode         Target FIT   Current FIT   Ownership
Component (8,500)       Relay                2,000        2,491         Tom
                        Op-Amp               3,000        4,097         Jones
                        Resistor             1,500        2,786         Carlos
                        DC-DC converter      800          1,393         Jesson
                        ASICs                1,200        1,716         Jim
Design (3,000)          Eng Change Order     1,300        2,383         David
                        FPGA Rev Upgrade     900          1,643         Kim
                        Change relay type    800          1,498         John
Manufacturing (2,000)   Cold solder          1,600        3,092         Tony
                        Backward component   250          355           Joe
                        Faked component      150          255           Paul
Process (1,500)         Broken part          700          942           Jen
                        Missing part         300          447           Chris
                        OES                  500          515           Andrew
Software (1,000)        Severe bugs          200          398           Eileen
                        Medium bugs          400          665           Ed
                        Trivial bugs         400          497           Eric
Others (4,000)          NFF                  3,000        457           Mark
                        PCFD                 1,000        1,669         Jeff

Topic Three:

Reliability Growth Prediction &

Corrective Action under Budget/Cost Constraints

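Crow/AMSAA Growth Model

$$\hat{\beta} = \frac{N}{\sum_{i=1}^{N} \ln\left(\dfrac{t_s}{t_i}\right)}$$

$$\hat{\alpha} = \frac{N}{t_s^{\hat{\beta}}}$$

Failure intensity:

$$\hat{\lambda}(t) = \hat{\alpha} \hat{\beta} t^{\hat{\beta} - 1}$$

Where $t_s$ = termination time, $t_i$ = ith failure arrival time

Hypothesis testing:

H0: β = 1, HPP

H1: β ≠ 1, NHPP

Reject H0 if

$$\frac{2N}{\hat{\beta}} < \chi^2_{2N,\,1-\theta/2} \quad \text{or} \quad \frac{2N}{\hat{\beta}} > \chi^2_{2N,\,\theta/2}$$

[Figure: Various Failure Intensity Models. Failure intensity (0 to 6) versus time (0 to 5) for beta = 0.5, 1, and 1.5, with α = 1 for all.]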

An Example

N = 10

Cumulative   Failure Arrival   Interarrival   ln(ts/ti)
Failures     Time (hours)      Time (hours)
1            67                67             3.23
2            150               83             2.43
3            234               84             1.98
4            360               126            1.55
5            533               173            1.16
6            720               187            0.86
7            912               192            0.62
8            1102              190            0.43
9            1345              243            0.23
10           1632              287            0.04
ts           1700                             sum 12.55

$$\hat{\beta} = \frac{N}{\sum_{i=1}^{N} \ln\left(\dfrac{t_s}{t_i}\right)} = 0.797$$

$$\hat{\alpha} = \frac{N}{t_s^{\hat{\beta}}} = 0.0266$$
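The estimates in this example can be reproduced with the time-terminated Crow/AMSAA formulas. The sketch below uses the ten failure arrival times from the table and a termination time of 1700 hours; the function names are illustrative assumptions, and the chi-square hypothesis test is left out.

```python
import math


def crow_amsaa_mle(failure_times, t_end):
    """Time-terminated estimates (alpha, beta) for the power-law model."""
    n = len(failure_times)
    beta = n / sum(math.log(t_end / t) for t in failure_times)
    alpha = n / t_end ** beta
    return alpha, beta


def failure_intensity(t, alpha, beta):
    """Estimated failure intensity alpha * beta * t^(beta - 1)."""
    return alpha * beta * t ** (beta - 1)


arrival_times = [67, 150, 234, 360, 533, 720, 912, 1102, 1345, 1632]
alpha_hat, beta_hat = crow_amsaa_mle(arrival_times, t_end=1700)
print(round(beta_hat, 3), round(alpha_hat, 4))
print(failure_intensity(1700, alpha_hat, beta_hat))
```

A shape estimate below 1 corresponds to a decreasing failure intensity, which matches the lengthening interarrival times in the table.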

Failure Modes (FM) Pareto Chart

Cumulative operating time is 4800 hours and the total number of failures is 14, so the current MTBF = 4800/14 = 343 hours.

Which FM should be fixed, given a limited budget?

Given a $10 budget for corrective actions:

Option one: fix relays, MTBF = 4800/(14-2.5) = 417 hours

Option two: fix all others, MTBF = 4800/(14-9) = 960 hours
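The comparison of the two options reduces to dividing the operating hours by the failures that remain after the fix. A small sketch with an illustrative function name:

```python
def mtbf_after_fix(operating_hours, total_failures, failures_removed=0.0):
    """MTBF once the given number of failures is removed by corrective action."""
    return operating_hours / (total_failures - failures_removed)


current = mtbf_after_fix(4800, 14)
option_one = mtbf_after_fix(4800, 14, failures_removed=2.5)  # fix relays
option_two = mtbf_after_fix(4800, 14, failures_removed=9)    # fix all others
print(round(current), round(option_one), round(option_two))
```

Under the same budget, the option that removes more failures gives the larger MTBF, which is why the choice of failure mode matters more than its rank on the Pareto chart.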


New Reliability Growth Model

1.  Failure mode based growth prediction

2.  Reliability growth subject to CA budget constraints

3.  No assumption of parametric models

4.  CA effectiveness function

Why Need the CA Effectiveness Function?

Limit resources ($) spent on CA due to

1.  Retrofit

2.  ECO

Maximize reliability growth

[Diagram: the CA effectiveness function connects the limit on CA spending with the goal of maximizing reliability growth.]

An Example: ECO or Retrofit

A type of relay used on a PCB module fails constantly due to a known failure mechanism. Two options are available for corrective action:

1.  Replace all on-board relays upon the failure return of the module

2.  Proactively recall all modules and replace the relays with new types having much higher reliability

CA Option   Cost ($)   CA Effectiveness
ECO         Low        Low
Retrofit    High       High

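Modeling CA Effectiveness

$$\text{Effectiveness} = \frac{\text{Failure rate before CA} - \text{Failure rate after CA}}{\text{Failure rate before CA}}$$

Effectiveness model:

$$h(x) = \left(\frac{x}{c}\right)^{b}$$

where x is the CA budget ($); b and c are to be determined.

[Figure: effectiveness h(x), from 0 to 1, versus CA budget x, from 0 to c, with curves for b>1, b=1, and b<1.]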

An Example

The current failure rate of a type of relay is $2 \times 10^{-8}$ faults per hour. Upon the implementation of CA, the rate is reduced to $5 \times 10^{-9}$. The CA effectiveness is therefore 0.75, that is

$$\frac{2 \times 10^{-8} - 5 \times 10^{-9}}{2 \times 10^{-8}} = 0.75$$
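Both views of CA effectiveness can be sketched together: the observed effectiveness from failure rates before and after the corrective action, and the power-law model h(x) = (x/c)^b. The function names and the values b = 0.5 and c = 1000 are hypothetical, chosen only to illustrate the shape of the curve.

```python
def observed_effectiveness(rate_before, rate_after):
    """Fraction of the failure rate removed by the corrective action."""
    return (rate_before - rate_after) / rate_before


def modeled_effectiveness(x, b, c):
    """h(x) = (x / c) ** b for a CA budget x between 0 and c."""
    if not 0 <= x <= c:
        raise ValueError("budget x must lie between 0 and c")
    return (x / c) ** b


relay_effectiveness = observed_effectiveness(2e-8, 5e-9)
print(relay_effectiveness)

# Hypothetical parameters: b < 1 gives diminishing returns on spending.
for budget in (0, 250, 500, 1000):
    print(budget, modeled_effectiveness(budget, b=0.5, c=1000))
```

In this model c is the budget at which the failure mode is fully removed, and b controls how quickly effectiveness rises with spending.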

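Incorporate h(x) into System Failure Rate

$$\lambda_s(t) = \sum_{i=1}^{k} n_i \lambda_i(t) + \sum_{i=k+1}^{m} \lambda_i(t)$$

$$h(x) = \left(\frac{x}{c}\right)^{b}$$

$$\lambda_{s,CA}(t) = \sum_{i=1}^{k} \left(1 - h_i(x_i)\right) n_i \lambda_i(t) + \sum_{i=k+1}^{m} \left(1 - h_i(x_i)\right) \lambda_i(t)$$

The first sum covers the HW failure modes and the second the non-HW failure modes.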
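A minimal sketch of the post-CA system failure rate, evaluated at a single point in time, is given below. Each failure mode carries its own effectiveness value h_i(x_i); the tuple layout, function name, and all numbers are hypothetical.

```python
def system_rate_after_ca(hw_modes, non_hw_modes):
    """System failure rate after corrective actions.

    hw_modes:     iterable of (n_i, lambda_i, h_i) for hardware component types
    non_hw_modes: iterable of (lambda_i, h_i) for non-hardware failure modes
    """
    hw = sum((1 - h_i) * n_i * lam_i for n_i, lam_i, h_i in hw_modes)
    non_hw = sum((1 - h_i) * lam_i for lam_i, h_i in non_hw_modes)
    return hw + non_hw


# Hypothetical rates (faults/hour) and effectiveness values.
hw_modes = [(2, 1e-6, 0.5)]     # two units of one component type
non_hw_modes = [(4e-6, 0.75)]   # one non-hardware failure mode

rate_before = system_rate_after_ca(
    [(n, lam, 0.0) for n, lam, _ in hw_modes],
    [(lam, 0.0) for lam, _ in non_hw_modes],
)
rate_after = system_rate_after_ca(hw_modes, non_hw_modes)
print(rate_before, rate_after)
```

Setting every effectiveness to zero recovers the rate without corrective action, which gives a convenient baseline for comparing CA plans.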


Making The Prediction via MS Excel (I)

Week No.              Cum failures      1      2      3      4      5      6      7      8
                      by FM
Cum operating hours                  1680   3360   5040   6720   8400  10080  11760  13440
Relay                            7      2      0      1      0      3      0      1      0
Resistors                        6      1      1      0      1      0      2      0      1
Op-amp                           4      0      0      0      1      0      1      1      1
Capacitor                        5      1      0      0      1      1      0      0      2
Design error                     2      0      0      1      0      0      0      1      0
Software bugs                    4      1      0      0      0      1      1      1      0
Cold solder                      6      0      2      0      1      0      0      1      2
Bad process                      2      0      0      1      0      0      0      0      1
NFF                              4      1      2      0      0      0      1      0      0
Latent failure mode
Weekly cum failures                     6      5      3      4      5      5      5      7
Actual MTBF                           280    305    360    373    365    360    356    336
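The spreadsheet arithmetic behind this table can be mirrored in Python: sum the weekly failures across failure modes, accumulate them, and divide the cumulative operating hours by the cumulative failures. The variable names are illustrative; the counts are the weekly values from the table, with 1680 operating hours added each week.

```python
from itertools import accumulate

HOURS_PER_WEEK = 1680  # cumulative operating hours grow by 1680 per week

weekly_failures = {
    "relay":         [2, 0, 1, 0, 3, 0, 1, 0],
    "resistors":     [1, 1, 0, 1, 0, 2, 0, 1],
    "op-amp":        [0, 0, 0, 1, 0, 1, 1, 1],
    "capacitor":     [1, 0, 0, 1, 1, 0, 0, 2],
    "design error":  [0, 0, 1, 0, 0, 0, 1, 0],
    "software bugs": [1, 0, 0, 0, 1, 1, 1, 0],
    "cold solder":   [0, 2, 0, 1, 0, 0, 1, 2],
    "bad process":   [0, 0, 1, 0, 0, 0, 0, 1],
    "NFF":           [1, 2, 0, 0, 0, 1, 0, 0],
}

# Column sums give the failures observed in each week.
weekly_totals = [sum(week) for week in zip(*weekly_failures.values())]
cum_failures = list(accumulate(weekly_totals))
cum_hours = [HOURS_PER_WEEK * (week + 1) for week in range(len(weekly_totals))]
actual_mtbf = [round(h / f) for h, f in zip(cum_hours, cum_failures)]

print(weekly_totals)
print(actual_mtbf)
```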


Making The Prediction via MS Excel (II)

Budget = required budget ($); Cost = cost to fix the FM ($); Target = target FM %; Cum = cumulative failures by FM through week 8; the remaining columns are week numbers.

Failure mode         Budget  Cost  Target  Cum     1     8     9    10    11    12    13    14    15    16
Cum operating hours                             1680 13440 15120 16800 18480 20160 21840 23520 25200 26880
Relay                   150   300     50%    7     2     0   1.0     0   0.5     0   1.5     0   0.5     0
Resistors               500   500      0%    6     1     1     0     0     0     0     0     0     0     0
Op-amp                  100   200     50%    4     0     1     0     0     0   0.5     0   0.5   0.5   0.5
Capacitor               350   350      0%    5     1     2     0     0     0     0     0     0     0     0
Design error            700   700      0%    2     0     0     0     0     0     0     0     0     0     0
Software bugs           125   250     50%    4     1     0   0.5     0     0     0   0.5   0.5   0.5     0
Cold solder             100   100      0%    6     0     2     0   0.0     0   0.0     0     0   0.0   0.0
Bad process               0    50    100%    2     0     1     0     0   1.0     0     0     0     0   1.0
NFF                     225   450     50%    4     1     0   0.5   1.0     0     0     0   0.5     0     0
Latent failure mode                                          0.3   0.2   0.2   0.1   0.3   0.2   0.2   0.2
Weekly cum failures                                6     7     2     1     2     1     2     2     2     2
Actual MTBF                                      280   336
Predicted MTBF                                         336   318   348   372   404   420   439   457   473
Total budget           2250
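In this table, each required budget equals the cost to fix the failure mode multiplied by one minus the target FM percentage, and the required budgets add up to the 2250 total. The sketch below checks that relationship; it is an observation about the tabulated numbers rather than a formula stated on the slide, and the names are illustrative.

```python
# (failure mode, cost to fix in $, target FM fraction remaining after CA)
ca_plan = [
    ("relay", 300, 0.5),
    ("resistors", 500, 0.0),
    ("op-amp", 200, 0.5),
    ("capacitor", 350, 0.0),
    ("design error", 700, 0.0),
    ("software bugs", 250, 0.5),
    ("cold solder", 100, 0.0),
    ("bad process", 50, 1.0),
    ("NFF", 450, 0.5),
]


def required_budget(cost, target_fraction):
    """Budget needed to bring a failure mode down to its target fraction."""
    return cost * (1 - target_fraction)


budgets = {name: required_budget(cost, target) for name, cost, target in ca_plan}
total_budget = sum(budgets.values())
print(budgets)
print(total_budget)
```

A target of 100% means the failure mode is left as it is and costs nothing, while a target of 0% means paying the full cost to remove it.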


Reliability Growth Planning Process

[Diagram: system manufacturer, repair center, in-service systems, stocks, and retrofit team, connected by a retrofit loop and an ECO loop, with FRACAS supporting the process.]

Steps shown:

1.  Failure analysis

2.  CA decisions

3.  Reliability prediction

ECO=Engineering Change Order CA=Corrective Actions

Conclusions

1.  Design for reliability (DFR) should incorporate hardware and non-hardware issues along with the variation of the failure rates.

2.  Trade-off should be made between the reliability growth and the associated availability of CA resources.

3.  The CA effectiveness function links the CA budget with the expected failure mode reduction rate.

4.  A reliability database system such as FRACAS is essential for performing RGP.


Thanks ! &

Questions/Comments ?
