DESCRIPTION
Reliability growth planning (RGP) is emerging as a promising technique for addressing the reliability challenges that arise in a distributed manufacturing environment. Unlike reliability growth testing (RGT), RGP drives the reliability growth of new products across the whole product lifecycle, from design and prototyping through manufacturing to field use. It is a lifetime commitment to product reliability through systematic failure analysis, rigorous corrective actions, and cost-effective financial investment. RGP has been shown to be very effective, particularly for new product introductions under fast time-to-market requirements. The RGP process will be introduced based on the three-phase product lifecycle: 1) design for reliability during early product development; 2) accelerated lifetime testing and corrective actions in the pilot line stage; and 3) continuous reliability improvement following volume shipment. Trade-offs among reliability investment, warranty cost reduction, and customer satisfaction will be investigated from the perspectives of the manufacturer and the customer. Reliability growth tools such as Crow/AMSAA, Pareto graphs, failure mode run charts, FIT (failure-in-time), and FMECA will be reviewed, and their roles in the RGP process will be discussed and demonstrated. Case studies drawn from the electronics equipment industry will be used to demonstrate RGP applications and justify its benefits. In parallel with RGP, efforts have been devoted to developing optimal preventive maintenance programs, using either time-based or usage-based strategies. Recently, condition-based maintenance (CBM) has shown great potential for achieving just-in-time maintenance or zero-downtime equipment. RGP and maintenance strategies share a common objective: achieving high system reliability and availability. In this presentation, optimal maintenance policies will be devised in the context of system reliability growth.
Reliability Growth Planning: Its Concept, Applications, and Challenges
Tongdan Jin
Assistant Prof. of Industrial Engineering, Ingram School of Engineering
Texas State University-San Marcos
©2011 ASQ & Presentation Tongdan Jin
Presented live on Nov 11th, 2010
http://reliabilitycalendar.org/The_Reliability_Calendar/Webinars_‐_English/Webinars_‐_English.html
ASQ Reliability Division English Webinar Series
One of the monthly webinars on topics of interest to reliability engineers.
To view the recorded webinar (available to ASQ Reliability Division members only), visit asq.org/reliability
To sign up for the live webinars, which are free and open to anyone, visit reliabilitycalendar.org and select English Webinars to find links to register for upcoming events
Contents
• RGT vs. RGP
• Design for Reliability
• New Reliability Monitoring Metrics
• Reliability Growth under Budget Constraints
• Conclusion
RGT vs. RGP
[Figure: Product Life Cycle timeline with three phases (Design and Development; Prototype and Pilot Phase; Volume Production, Field Use and End of Life), showing the spans of Reliability Growth Testing (RGT) and Reliability Growth Planning (RGP)]
Why Need RGP?
• Design Cycle Shrinks
• Cut-off of Testing Budget
• Different Design/Development Schedule
[Figure 3: Compressed System Design Cycle for automatic test equipment. Basic subsystems 1 to 3 and advanced subsystems 4 to 6 are laid out along a time axis marked t0 to t4, spanning basic design through volume manufacturing and shipping]
System Reliability vs. Shipment
[Figure: system MTBF and field system populations (system installs) plotted against chronological time, with the target MTBF marked]
Reliability Growth Planning Across Lifecycle Time
[Figure: diagram linking Design for Reliability and Driving Reliability Growth; labels include hardware, design, software, mfg, process, NFF, failure mode Pareto, CA effectiveness, budget, and optimization]
Note: mfg=manufacturing, NFF=no fault found, CA=corrective action
Topic One: Design for Reliability
• Component/Hardware Failures
• Non-Component Failures
  - Design weakness
  - Software failures
  - Manufacturing defects
  - Process/handling issues
  - No-fault-found (NFF)
System Failure Mode Categories
[Figure: bar chart "Failures Breakdown by Root-Cause Category" (0% to 50%) for Hardware (components), Design, Mfg, Process, Software, and NFF, with series A, B, C, and D]
Different MTBF Scenarios
[Figure: MTBF versus time under different scenarios, relative to the target MTBF]
Modeling Hardware Failure Rate
$$\lambda = \lambda_0 \, \pi_T \, \pi_E \, \pi_Q \, \pi_{FT} \, \pi_R$$
$\lambda_0$ = base failure rate.
$\pi_T$ = temperature factor.
$\pi_E$ = electrical stress factor.
$\pi_Q$ = quality factor.
$\pi_{FT}$ = fault tolerance factor.
$\pi_R$ = redundancy factor.
For a given design, $\pi_T$ and $\pi_E$ play essential roles in the actual component reliability.
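The multiplicative model above can be sketched in a few lines of Python. This is only an illustration: the function name and the numeric factors are hypothetical, not values from the presentation.

```python
def component_failure_rate(base_rate, pi_t=1.0, pi_e=1.0, pi_q=1.0,
                           pi_ft=1.0, pi_r=1.0):
    """Return lambda = lambda_0 * pi_T * pi_E * pi_Q * pi_FT * pi_R (faults/hour)."""
    return base_rate * pi_t * pi_e * pi_q * pi_ft * pi_r


# Hypothetical device: base rate 2e-8 faults/hour, with elevated temperature
# and electrical stress factors; the other factors are left at 1.0.
rate = component_failure_rate(2e-8, pi_t=1.5, pi_e=1.2)
print(f"{rate:.2e} faults/hour")
```

Leaving a factor at 1.0 means that stress has no effect on the base rate, which makes it easy to see how much the temperature and electrical stress factors alone move the result.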
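Aggregate Failure Rate for Hardware
$$\lambda_{hw} = \sum_{i=1}^{k} n_i \lambda_i = \sum_{i=1}^{k} n_i \lambda_{0i} \, \pi_{Ti} \, \pi_{Ei}$$
$$E[\lambda_{hw}] = \sum_{i=1}^{k} n_i \lambda_{0i} \, E[\pi_{Ti}] \, E[\pi_{Ei}]$$
$$\mathrm{var}(\lambda_{hw}) = \sum_{i=1}^{k} n_i^2 \lambda_{0i}^2 \, \mathrm{var}(\pi_{Ti} \pi_{Ei})$$
Where
k = number of types of devices used in the product.
$n_i$ = quantity of ith type of device used in the product.
$\lambda_{0i}$ = base failure rate for ith type of device.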
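A minimal sketch of the mean and variance formulas above. It assumes the temperature and electrical stress factors of each device type are independent, so that var(XY) = E[X²]E[Y²] − (E[X]E[Y])²; the dictionary keys and all numbers are hypothetical.

```python
def product_variance(mean_x, var_x, mean_y, var_y):
    """var(X*Y) for independent X and Y (an assumption of this sketch)."""
    return (var_x + mean_x**2) * (var_y + mean_y**2) - (mean_x * mean_y) ** 2


def hardware_rate_moments(devices):
    """Return (E[lambda_hw], var(lambda_hw)) summed over device types."""
    mean = 0.0
    var = 0.0
    for d in devices:
        scale = d["n"] * d["base_rate"]  # n_i * lambda_0i
        mean += scale * d["mean_pi_t"] * d["mean_pi_e"]
        var += scale**2 * product_variance(
            d["mean_pi_t"], d["var_pi_t"], d["mean_pi_e"], d["var_pi_e"]
        )
    return mean, var


# Hypothetical bill of materials with a single device type.
devices = [
    {"n": 10, "base_rate": 1e-8,
     "mean_pi_t": 1.5, "var_pi_t": 0.04,
     "mean_pi_e": 1.2, "var_pi_e": 0.01},
]
hw_mean, hw_var = hardware_rate_moments(devices)
print(hw_mean, hw_var)
```

In practice the factor means and variances would come from measured distributions, such as a temperature histogram for each device type.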
[Figure: ASIC Temperature Distribution. Histogram of quantity (0 to 14) by temperature bin in degrees Celsius (<65, [65, 70), [70, 75), [75, 80), [80, 85), [85, 90), >90), overlaid with a pdf curve on a secondary axis (0 to 0.08)]
Challenges in Modeling Non-Hardware Failures
1. Quite often data is not well recorded
2. Varies from one product line to another
3. Process related
4. Design experience
5. Other random factors
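Triangle Models for Non-Hardware Failures
$$
g(\lambda) =
\begin{cases}
\dfrac{2(\lambda - a)}{(b - a)(c - a)}, & a \le \lambda \le c \\
\dfrac{2(b - \lambda)}{(b - a)(b - c)}, & c < \lambda \le b \\
0, & \text{otherwise}
\end{cases}
$$
Where:
a = the smallest possible value of the failure rate
b = the largest possible value of the failure rate
c = the most likely value, with $c = 3\bar{\lambda} - b - a$
$\bar{\lambda}$ = the sample mean for the dataset
[Figure: triangular density g(λ) over [a, b] with peak height h at λ = c]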
Example for Non-Hardware Failure Estimate
Example: Historical data from predecessor products show that the failure rates pertaining to manufacturing issues are (faults/hour) $1.2\times10^{-6}$, $1.4\times10^{-6}$, and $2.4\times10^{-6}$. Then:
$\bar{\lambda} = (1.2\times10^{-6} + 1.4\times10^{-6} + 2.4\times10^{-6})/3 \approx 1.67\times10^{-6}$
$a = 1.2\times10^{-6}$
$b = 2.4\times10^{-6}$
$c = 3\bar{\lambda} - b - a = 1.4\times10^{-6}$
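The triangle fit can be sketched as below, applied to the three historical rates listed in the example. The function names are illustrative, and the density assumes a < c < b.

```python
def fit_triangle(rates):
    """Return (a, b, c): smallest, largest, and most likely failure rate."""
    a, b = min(rates), max(rates)
    mean = sum(rates) / len(rates)
    c = 3 * mean - a - b  # from mean = (a + b + c) / 3
    return a, b, c


def triangle_pdf(lam, a, b, c):
    """Triangular density g(lambda); assumes a < c < b."""
    if a <= lam <= c:
        return 2 * (lam - a) / ((b - a) * (c - a))
    if c < lam <= b:
        return 2 * (b - lam) / ((b - a) * (b - c))
    return 0.0


# Historical manufacturing failure rates (faults/hour) from the example.
a, b, c = fit_triangle([1.2e-6, 1.4e-6, 2.4e-6])
peak = triangle_pdf(c, a, b, c)  # height of the triangle at lambda = c
print(a, b, c, peak)
```

With exactly three observations, c always equals the middle observation; with more data, the relation c = 3·mean − a − b can place c outside [a, b], in which case the triangle model would need to be revisited.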
Combining HW and Non-HW Failure Rate
$$\lambda_{sys} = \lambda_d + \lambda_s + \lambda_m + \lambda_p + \lambda_o + \sum_{i=1}^{k} n_i \lambda_i$$
Where:
$\lambda_d$ = failure rate of design weakness
$\lambda_s$ = failure rate of software
$\lambda_m$ = failure rate of manufacturing
$\lambda_p$ = failure rate of process
$\lambda_o$ = failure rate of other issues (e.g. NFF)
k = total number of HW component types
$\lambda_i$ = failure rate for component type i
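Confidence Intervals for Failure Rate
$$\lambda_{sys} = \lambda_d + \lambda_s + \lambda_m + \lambda_p + \lambda_o + \sum_{i=1}^{k} n_i \lambda_i$$
$$\sigma^2_{\lambda_{sys}} = \sigma^2_{\lambda_d} + \sigma^2_{\lambda_s} + \sigma^2_{\lambda_m} + \sigma^2_{\lambda_p} + \sigma^2_{\lambda_o} + \sum_{i=1}^{k} n_i^2 \sigma^2_{\lambda_i}$$
[Figure: probability density of $\lambda_{sys}$ centered at its mean, with interval bounds expressed in terms of $\sigma_{\lambda_{sys}}$]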
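A short sketch of how the system mean and variance could be accumulated from the formulas above. The data layout and every number here are hypothetical placeholders, not figures from the presentation.

```python
def system_rate_moments(non_hw, hw):
    """Combine non-hardware and hardware contributions.

    non_hw: dict mapping category name -> (mean, variance)
    hw:     list of (n_i, mean_i, variance_i) per component type
    """
    mean = sum(m for m, _ in non_hw.values()) + sum(n * m for n, m, _ in hw)
    var = sum(v for _, v in non_hw.values()) + sum(n**2 * v for n, _, v in hw)
    return mean, var


# Hypothetical inputs (faults/hour and (faults/hour)^2).
non_hw = {"design": (2e-6, 1e-12), "software": (1e-6, 4e-12)}
hw = [(4, 5e-7, 1e-14)]
sys_mean, sys_var = system_rate_moments(non_hw, hw)
print(sys_mean, sys_var)
```

The hardware variance is scaled by the square of the quantity, matching the $n_i^2$ term in the variance formula.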
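Application to Reliability Design (cont'd)
$$\mu_{sys} = E[\lambda_{HW}] + E[\lambda_{non\text{-}HW}] = 1.13\times10^{-5}$$
$$\sigma^2_{sys} = \mathrm{var}(\lambda_{HW}) + \mathrm{var}(\lambda_{non\text{-}HW}) = 2.23\times10^{-11}$$
[Figure: distribution of the system failure rate with mean $\mu_{sys}$; the upper 0.3% tail begins at $2.43\times10^{-5}$]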
MTBF with 99.7% Confidence
$$\lambda_{sys} = \frac{1}{MTBF}$$
MTBF Estimate with Confidence:
$$\Pr\{MTBF \ge t\} \ge 99.7\%$$
$$\Pr\{\lambda_{sys} \le 1/t\} \ge 99.7\%$$
MTBF(99.7%) = 41,115 hours
Neutral MTBF Estimate:
The mean of PCB failure rate is $1.13\times10^{-5}$ faults/hour
MTBF = $1/(1.13\times10^{-5})$ ≈ 88,100 hours
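One way to arrive at a confidence-based MTBF is sketched below. It assumes the system failure rate is approximately normally distributed, which is an assumption of this sketch rather than something the slides state, and it uses the mean and variance quoted earlier in the deck.

```python
import math
from statistics import NormalDist

mu_sys = 1.13e-5    # mean system failure rate (faults/hour)
var_sys = 2.23e-11  # variance of the system failure rate

# One-sided 99.7% upper bound on the failure rate under a normal approximation.
z = NormalDist().inv_cdf(0.997)
lam_upper = mu_sys + z * math.sqrt(var_sys)

mtbf_conf = 1.0 / lam_upper  # MTBF met with about 99.7% confidence
mtbf_neutral = 1.0 / mu_sys  # MTBF based on the mean failure rate only

print(f"upper failure rate: {lam_upper:.3e} faults/hour")
print(f"MTBF (99.7%): {mtbf_conf:,.0f} hours, neutral MTBF: {mtbf_neutral:,.0f} hours")
```

The confidence-based result lands near 41,000 hours, in line with the figure on the slide; small differences are expected because the inputs are rounded.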
Topic Two: Failure Mode Rate & Failure-In-Time
Pareto Chart for Failure Modes
Difficulties:
• Static View
• No Trend of Each Failure Mode
• Fails to Reflect Product MTBF
[Figure: Pareto by Failure Mode From January to March. Quantity (0 to 14) and cumulative percentage for Relays, Resistors, No Fault Found, Cold Solder, Software Bug, and Op-Amp; legend: No C/A, C/A In Process, C/A Complete, Percentage]
[Figure: Pareto Chart by Failure Mode From April to June. Quantity (0 to 28) and cumulative percentage for Op-Amp, Resistors, Cold Solder, Relays, Software Bug, and No Fault Found; legend: No C/A, C/A In Process, C/A Complete, Percentage]
Note: C/A = corrective action
Failure Mode Rate (FMR)
$$FMR = \frac{\text{failures for a type of FM}}{\text{field product installations}}$$
FMR Estimation: Example
For example: Assume 120 PCBs were shipped and installed in the field in the first quarter and 5 failures were returned due to poor solder joints. Then the FMR for poor solder joints in the first quarter is
$$FMR = \frac{5}{120} = 0.042 \ \text{faults/board/quarter}$$
$$FMR = \frac{\text{failure quantity for a type of FM}}{\text{field product installations}}$$
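The same calculation as a small Python helper; the function name is illustrative.

```python
def failure_mode_rate(failures, installed_units):
    """FMR = failures of one failure mode / field product installations."""
    if installed_units <= 0:
        raise ValueError("installed_units must be positive")
    return failures / installed_units


# Solder-joint example: 5 returns out of 120 installed PCBs in one quarter.
fmr = failure_mode_rate(5, 120)
print(f"{fmr:.3f} faults/board/quarter")
```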
FMR Run Chart
[Figure: Failure Mode Rate (FMR) by Quarter. Failures per board (0.00 to 0.06, left axis) for relays, resistor, and op-amp, with cumulative PCB shipment (0 to 400, right axis), for the 1st through 4th quarters]
Estimate MTBF using FMR Chart

| Quarter | 1st | 2nd | 3rd | 4th |
| --- | --- | --- | --- | --- |
| Cumulative Shipment | 120 | 200 | 220 | 264 |
| Cum Run Hours | 262,080 | 436,800 | 480,480 | 576,576 |
| Cum FM rate | 0.117 | 0.150 | 0.057 | 0.051 |
| Defective Boards | 14 | 30 | 12 | 13 |
| MTBF (hours) | 18,720 | 14,560 | 38,541 | 42,856 |

[Figure: 13 Weeks Rolling MTBF. MTBF in hours (0 to 45,000, left axis) and defective boards (0 to 100, right axis) for the 1st through 4th quarters]
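The table's run hours and MTBF can be approximated with the sketch below. It assumes each quarter is 13 weeks of continuous operation (13 × 7 × 24 = 2,184 hours per board), which is inferred from the run-hour figures rather than stated on the slide.

```python
HOURS_PER_QUARTER = 13 * 7 * 24  # 13 weeks of continuous operation per board

shipments = [120, 200, 220, 264]  # cumulative boards in the field, by quarter
defective = [14, 30, 12, 13]      # defective boards, by quarter

run_hours = [n * HOURS_PER_QUARTER for n in shipments]
mtbf = [hours / failures for hours, failures in zip(run_hours, defective)]

for quarter, (hours, value) in enumerate(zip(run_hours, mtbf), start=1):
    print(f"Q{quarter}: {hours:,} run hours, MTBF = {value:,.0f} hours")
```

The first two quarters reproduce the tabulated MTBF values. The last two come out a few percent higher than the table, which may reflect rounding of the defective-board counts shown there.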
Estimate for PCB Failure Rate
$$\lambda_{sys} = \lambda_d + \lambda_s + \lambda_m + \lambda_p + \lambda_o + \sum_{i=1}^{k} n_i \lambda_i$$
Notice: $FIT = \lambda \times 10^{9}$
$$FIT_{sys} = FIT_d + FIT_s + FIT_m + FIT_p + FIT_o + \sum_{i=1}^{k} n_i \, FIT_i$$
Where
$\lambda_d$ = failure rate of design errors
$\lambda_s$ = failure rate of software bugs
$\lambda_m$ = failure rate of manufacturing
$\lambda_p$ = failure rate of process
$\lambda_o$ = failure rate of other issues
$\lambda_i$ = failure rate for component type i
k = total number of new component types
$n_i$ = quantity of component type i used in the product
FIT-Based Reliability Driven: Example (1)

| FM Category | Target MTBF (hrs) | Target FIT |
| --- | --- | --- |
| Overall Product | 50,000 | 20,000 |
| Components (hardware) | 117,647 | 8,500 |
| Others (NFF) | 250,000 | 4,000 |
| Design | 333,333 | 3,000 |
| Manufacturing | 500,000 | 2,000 |
| Process | 666,667 | 1,500 |
| Software | 1,000,000 | 1,000 |

Notice: $FIT = \dfrac{10^{9}}{MTBF}$
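Because FIT values add while MTBF values do not, the category targets can be checked against the product target with a few lines of Python. The targets are taken from the table above; the function names are illustrative.

```python
def mtbf_to_fit(mtbf_hours):
    """FIT = failures per 10^9 device-hours = 1e9 / MTBF."""
    return 1e9 / mtbf_hours


def fit_to_mtbf(fit):
    """MTBF in hours for a given FIT value."""
    return 1e9 / fit


category_fit = {
    "Components (hardware)": 8500,
    "Others (NFF)": 4000,
    "Design": 3000,
    "Manufacturing": 2000,
    "Process": 1500,
    "Software": 1000,
}

total_fit = sum(category_fit.values())  # category FIT targets add up
product_mtbf = fit_to_mtbf(total_fit)   # overall product MTBF target
print(total_fit, product_mtbf)
```

Summing the category targets gives the overall product target of 20,000 FIT, equivalent to a 50,000-hour MTBF.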
FIT-Based Reliability Driven: Example (2)

| Product Target FIT | Categorical FM FIT | Failure Mode | Target FIT | Current FIT | Ownership |
| --- | --- | --- | --- | --- | --- |
| PCB (20,000) | Component (8,500) | Relay | 2,000 | 2,491 | Tom |
| | | Op-Amp | 3,000 | 4,097 | Jones |
| | | Resistor | 1,500 | 2,786 | Carlos |
| | | DC-DC converter | 800 | 1,393 | Jesson |
| | | ASICs | 1,200 | 1,716 | Jim |
| | Design (3,000) | Eng Change Order | 1,300 | 2,383 | David |
| | | FPGA Rev Upgrade | 900 | 1,643 | Kim |
| | | Change relay type | 800 | 1,498 | John |
| | Manufacturing (2,000) | Cold solder | 1,600 | 3,092 | Tony |
| | | Backward component | 250 | 355 | Joe |
| | | Fake component | 150 | 255 | Paul |
| | Process (1,500) | Broken part | 700 | 942 | Jen |
| | | Missing part | 300 | 447 | Chris |
| | | OES | 500 | 515 | Andrew |
| | Software (1,000) | Severe bugs | 200 | 398 | Eileen |
| | | Medium bugs | 400 | 665 | Ed |
| | | Trivial bugs | 400 | 497 | Eric |
| | Others (4,000) | NFF | 3,000 | 457 | Mark |
| | | PCFD | 1,000 | 1,669 | Jeff |
Topic Three: Reliability Growth Prediction & Corrective Action under Budget/Cost Constraints
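Crow/AMSAA Growth Model
$$\hat{\beta} = \frac{N}{\sum_{i=1}^{N} \ln\left(\dfrac{t_s}{t_i}\right)}, \qquad \hat{\alpha} = \frac{N}{t_s^{\hat{\beta}}}$$
Failure Intensity:
$$\hat{\lambda}(t) = \hat{\alpha} \hat{\beta} \, t^{\hat{\beta} - 1}$$
Where $t_s$ = termination time, $t_i$ = ith failure arrival time
Hypothesis Testing:
H0: β = 1, HPP
H1: β ≠ 1, NHPP
Reject H0 if
$$\hat{\beta} < \frac{2N}{\chi^2_{2N,\,1-\theta/2}} \quad \text{or} \quad \hat{\beta} > \frac{2N}{\chi^2_{2N,\,\theta/2}}$$
[Figure: Various Failure Intensity Models. Failure intensity (0 to 6) versus time (0 to 5) for β = 1, β = 0.5, and β = 1.5, with α = 1 for all curves]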
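An Example
N = 10

| Cumulative Failures | Failure Arrival Time (hours) | Interarrival Time (hours) | ln(ts/ti) |
| --- | --- | --- | --- |
| 1 | 67 | 67 | 3.23 |
| 2 | 150 | 83 | 2.43 |
| 3 | 234 | 84 | 1.98 |
| 4 | 360 | 126 | 1.55 |
| 5 | 533 | 173 | 1.16 |
| 6 | 720 | 187 | 0.86 |
| 7 | 912 | 192 | 0.62 |
| 8 | 1102 | 190 | 0.43 |
| 9 | 1345 | 243 | 0.23 |
| 10 | 1632 | 287 | 0.04 |
| ts | 1700 | sum | 12.55 |

$$\hat{\beta} = \frac{N}{\sum_{i=1}^{N} \ln\left(\dfrac{t_s}{t_i}\right)} = 0.797$$
$$\hat{\alpha} = \frac{N}{t_s^{\hat{\beta}}} = 0.0266$$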
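The estimates in this example can be reproduced with a short script. It implements the time-truncated Crow/AMSAA estimators shown in the deck; the function names are illustrative.

```python
import math


def crow_amsaa_mle(failure_times, t_end):
    """Estimate (alpha, beta) for a test terminated at time t_end."""
    n = len(failure_times)
    beta = n / sum(math.log(t_end / t) for t in failure_times)
    alpha = n / t_end**beta
    return alpha, beta


def failure_intensity(t, alpha, beta):
    """lambda(t) = alpha * beta * t^(beta - 1)."""
    return alpha * beta * t ** (beta - 1)


# Failure arrival times (hours) from the example; test terminated at 1700 hours.
arrival_times = [67, 150, 234, 360, 533, 720, 912, 1102, 1345, 1632]
alpha, beta = crow_amsaa_mle(arrival_times, 1700)

print(f"beta = {beta:.3f}, alpha = {alpha:.4f}")
print(f"intensity at 1700 h = {failure_intensity(1700, alpha, beta):.5f} failures/hour")
```

A shape estimate below 1 indicates a decreasing failure intensity, that is, reliability growth over the test period.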
Failure Modes (FM) Pareto Chart
Cumulative operating time is 4800 hours and the total number of failures is 14, so the current MTBF = 4800/14 = 343 hours.
Which FM should be fixed, given a limited budget?
Given a $10 budget for corrective actions:
Option one: fix relays, MTBF = 4800/(14 − 2.5) = 417 hours
Option two: fix all others, MTBF = 4800/(14 − 9) = 960 hours
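The comparison between the two options reduces to one formula, sketched below with the slide's numbers. The function name is illustrative.

```python
def mtbf_after_fix(operating_hours, total_failures, failures_removed=0.0):
    """Projected MTBF if a corrective action removes some of the failures."""
    return operating_hours / (total_failures - failures_removed)


current = mtbf_after_fix(4800, 14)
option_one = mtbf_after_fix(4800, 14, failures_removed=2.5)  # fix relays
option_two = mtbf_after_fix(4800, 14, failures_removed=9)    # fix all others

print(round(current), round(option_one), round(option_two))
```

For the same budget, the option that removes more failures yields the larger projected MTBF, which is why the choice of failure mode matters.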
New Reliability Growth Model
1. Failure mode based growth prediction
2. Reliability growth subject to CA budget constraints
3. No assumption of parametric models
4. CA effectiveness function
Why Need the CA Effectiveness Function?
[Diagram: limited resources ($) spent on CA due to (1) Retrofit and (2) ECO → CA Effectiveness Function → Maximize Reliability Growth]
An Example: ECO or Retrofit
A type of relay used on a PCB module fails constantly due to a known failure mechanism. Two options are available for corrective action:
1. Replace all on-board relays when a failed module is returned
2. Proactively recall all modules and replace the relays with a new type having much higher reliability

| CA Option | Cost ($) | CA Effectiveness |
| --- | --- | --- |
| ECO | Low | Low |
| Retrofit | High | High |
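Modeling CA Effectiveness
Effectiveness Model:
$$h(x) = \left(\frac{x}{c}\right)^{b}$$
b and c to be determined
$$\text{Effectiveness} = \frac{\text{Failure rate before CA} - \text{Failure rate after CA}}{\text{Failure rate before CA}}$$
[Figure: effectiveness h(x) versus CA budget x ($), rising from 0 at x = 0 to 1 at x = c, with curves for b > 1, b = 1, and b < 1]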
An Example
The current failure rate of a type of relay is $2\times10^{-8}$ faults per hour. Upon the implementation of CA, the rate is reduced to $5\times10^{-9}$. The CA effectiveness can be expressed as 0.75, that is
$$\frac{2\times10^{-8} - 5\times10^{-9}}{2\times10^{-8}} = 0.75$$
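Both the measured effectiveness and the budget-based model h(x) = (x/c)^b can be sketched as follows. Clamping the budget to the range [0, c] is an assumption of this sketch, and the budget figures in the usage lines are hypothetical.

```python
def ca_effectiveness(rate_before, rate_after):
    """Fractional reduction in failure rate achieved by a corrective action."""
    return (rate_before - rate_after) / rate_before


def effectiveness_model(x, b, c):
    """h(x) = (x / c)^b, with the budget x clamped to [0, c] (sketch assumption)."""
    x = min(max(x, 0.0), c)
    return (x / c) ** b


# Relay example: 2e-8 faults/hour before CA, 5e-9 faults/hour after.
measured = ca_effectiveness(2e-8, 5e-9)

# Hypothetical budgets: half of the full budget c under linear and convex curves.
linear = effectiveness_model(50, b=1, c=100)
convex = effectiveness_model(50, b=2, c=100)
print(measured, linear, convex)
```

With b > 1, early spending buys little effectiveness and most of the gain arrives near the full budget c; with b < 1 the opposite holds.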
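Incorporate h(x) into System Failure Rate
$$\lambda_s(t) = \sum_{i=1}^{k} n_i \lambda_i(t) + \sum_{i=k+1}^{m} \lambda_i(t)$$
$$h(x) = \left(\frac{x}{c}\right)^{b}$$
$$\lambda_{s,CA}(t) = \underbrace{\sum_{i=1}^{k} n_i \bigl(1 - h_i(x_i)\bigr) \lambda_i(t)}_{\text{HW}} + \underbrace{\sum_{i=k+1}^{m} \bigl(1 - h_i(x_i)\bigr) \lambda_i(t)}_{\text{Non-HW}}$$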
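A minimal sketch of the post-CA system failure rate at a fixed point in time. The tuple layout and all numbers are hypothetical; each effectiveness value stands for h_i(x_i) already evaluated at the chosen budget.

```python
def system_rate_after_ca(hw, non_hw):
    """Failure rate after corrective actions.

    hw:     list of (n_i, lambda_i, h_i) for hardware component types
    non_hw: list of (lambda_i, h_i) for non-hardware failure modes
    """
    hw_part = sum(n * (1 - h) * lam for n, lam, h in hw)
    non_hw_part = sum((1 - h) * lam for lam, h in non_hw)
    return hw_part + non_hw_part


# Hypothetical system: two relays with CA effectiveness 0.75, plus one
# non-hardware failure mode with CA effectiveness 0.5.
hw = [(2, 2e-8, 0.75)]
non_hw = [(1e-8, 0.5)]
rate_after = system_rate_after_ca(hw, non_hw)
print(f"{rate_after:.2e} faults/hour")
```

Setting every effectiveness value to zero recovers the system failure rate before any corrective action.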
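Making The Prediction via MS Excel (I)

| Cum Failures by FM | Failure Mode | Wk 1 | Wk 2 | Wk 3 | Wk 4 | Wk 5 | Wk 6 | Wk 7 | Wk 8 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| | Cum Operating Hours | 1680 | 3360 | 5040 | 6720 | 8400 | 10080 | 11760 | 13440 |
| 7 | Relay | 2 | 0 | 1 | 0 | 3 | 0 | 1 | 0 |
| 6 | Resistors | 1 | 1 | 0 | 1 | 0 | 2 | 0 | 1 |
| 4 | Op-amp | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 1 |
| 5 | Capacitor | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 2 |
| 2 | Design error | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |
| 4 | Software bugs | 1 | 0 | 0 | 0 | 1 | 1 | 1 | 0 |
| 6 | Cold solder | 0 | 2 | 0 | 1 | 0 | 0 | 1 | 2 |
| 2 | Bad process | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 |
| 4 | NFF | 1 | 2 | 0 | 0 | 0 | 1 | 0 | 0 |
| | Latent Failure Mode | | | | | | | | |
| | Weekly failures | 6 | 5 | 3 | 4 | 5 | 5 | 5 | 7 |
| | Actual MTBF | 280 | 305 | 360 | 373 | 365 | 360 | 356 | 336 |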
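The "Actual MTBF" row follows from the weekly failure counts: cumulative operating hours divided by cumulative failures. A minimal sketch, assuming 1,680 operating hours are added each week as the hours row indicates:

```python
from itertools import accumulate

HOURS_PER_WEEK = 1680  # fleet operating hours added each week
weekly_failures = [6, 5, 3, 4, 5, 5, 5, 7]

cum_failures = list(accumulate(weekly_failures))
cum_hours = [HOURS_PER_WEEK * week for week in range(1, len(weekly_failures) + 1)]
actual_mtbf = [hours / failures for hours, failures in zip(cum_hours, cum_failures)]

print([round(value) for value in actual_mtbf])
```

The rounded values match the Actual MTBF row of the spreadsheet for weeks 1 through 8.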
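Making The Prediction via MS Excel (II)

| Required Budget ($) | Cost to fix FM ($) | Target FM % | Cum Failures by FM | Failure Mode | Wk 1 | Wk 8 | Wk 9 | Wk 10 | Wk 11 | Wk 12 | Wk 13 | Wk 14 | Wk 15 | Wk 16 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| | | | | Cum Operating Hours | 1680 | 13440 | 15120 | 16800 | 18480 | 20160 | 21840 | 23520 | 25200 | 26880 |
| 150 | 300 | 50% | 7 | Relay | 2 | 0 | 1.0 | 0 | 0.5 | 0 | 1.5 | 0 | 0.5 | 0 |
| 500 | 500 | 0% | 6 | Resistors | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 100 | 200 | 50% | 4 | Op-amp | 0 | 1 | 0 | 0 | 0 | 0.5 | 0 | 0.5 | 0.5 | 0.5 |
| 350 | 350 | 0% | 5 | Capacitor | 1 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 700 | 700 | 0% | 2 | Design error | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 125 | 250 | 50% | 4 | Software bugs | 1 | 0 | 0.5 | 0 | 0 | 0 | 0.5 | 0.5 | 0.5 | 0 |
| 100 | 100 | 0% | 6 | Cold solder | 0 | 2 | 0 | 0.0 | 0 | 0.0 | 0 | 0 | 0.0 | 0.0 |
| 0 | 50 | 100% | 2 | Bad process | 0 | 1 | 0 | 0 | 1.0 | 0 | 0 | 0 | 0 | 1.0 |
| 225 | 450 | 50% | 4 | NFF | 1 | 0 | 0.5 | 1.0 | 0 | 0 | 0 | 0.5 | 0 | 0 |
| | | | | Latent Failure Mode | | | 0.3 | 0.2 | 0.2 | 0.1 | 0.3 | 0.2 | 0.2 | 0.2 |
| | | | | Weekly failures | 6 | 7 | 2 | 1 | 2 | 1 | 2 | 2 | 2 | 2 |
| | | | | Actual MTBF | 280 | 336 | | | | | | | | |
| 2250 (total) | | | | Predicted MTBF | | 336 | 318 | 348 | 372 | 404 | 420 | 439 | 457 | 473 |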
Reliability Growth Planning Process
[Figure: process diagram connecting the System Manufacturer, Repair Center, In-service Systems, Stocks, and Retrofit Team through a Retrofit Loop and an ECO Loop, supported by FRACAS. The ECO Loop covers: 1. Failure analysis, 2. CA decisions, 3. Reliability prediction]
ECO=Engineering Change Order CA=Corrective Actions
Conclusions
1. Design for reliability (DFR) should incorporate hardware and non-hardware issues along with the variation of the failure rates.
2. Trade-off should be made between the reliability growth and the associated availability of CA resources.
3. The CA effectiveness function links the CA budget with the expected failure mode reduction rate.
4. A reliability database system such as FRACAS is essential for performing RGP.
Thanks! Questions/Comments?