72
1 © Life Cycle Engineering 2008 © Life Cycle Engineering 2008 Root Cause (Failure) Analysis Eugene T. Cottle Reliability Engineer

1 © Life Cycle Engineering 2008 Root Cause (Failure) Analysis Eugene T. Cottle Reliability Engineer

Embed Size (px)

Citation preview

Page 1: 1 © Life Cycle Engineering 2008 Root Cause (Failure) Analysis Eugene T. Cottle Reliability Engineer

1© Life Cycle Engineering 2008© Life Cycle Engineering 2008

Root Cause (Failure) AnalysisEugene T. CottleReliability Engineer

Page 2: 1 © Life Cycle Engineering 2008 Root Cause (Failure) Analysis Eugene T. Cottle Reliability Engineer

2© Life Cycle Engineering 2008

Page 3: 1 © Life Cycle Engineering 2008 Root Cause (Failure) Analysis Eugene T. Cottle Reliability Engineer

3© Life Cycle Engineering 2008

Page 4: 1 © Life Cycle Engineering 2008 Root Cause (Failure) Analysis Eugene T. Cottle Reliability Engineer

4© Life Cycle Engineering 2008

A confluence of events, factors and conditions which conspire to produce an outcome.

Page 5: 1 © Life Cycle Engineering 2008 Root Cause (Failure) Analysis Eugene T. Cottle Reliability Engineer

5© Life Cycle Engineering 2008

A confluence of events, factors and conditions which conspire to produce an (undesirable) outcome.

Event(s), factor(s) or condition(s) which are under your control and which, if corrected or eliminated, will prevent

recurrence of the undesirable outcome.

Page 6: 1 © Life Cycle Engineering 2008 Root Cause (Failure) Analysis Eugene T. Cottle Reliability Engineer

6© Life Cycle Engineering 2008

More terminology…

• RCA (Root Cause Analysis)– A disciplined process for focusing ideas to identify

root cause(s). A class of problem solving methods

• RCFA (Root Cause Failure Analysis)– Reactive, in response to a failure

• RCCA (Root Cause and Corrective Action)– Incorporates preventive corrective action into the

process (i.e., elimination of special causes)

Page 7: 1 © Life Cycle Engineering 2008 Root Cause (Failure) Analysis Eugene T. Cottle Reliability Engineer

7© Life Cycle Engineering 2008

Root Cause Analysis

• Safety-based RCA– accident analysis and – occupational safety and

health

• Production-based RCA – quality control for

industrial manufacturing

• Process-based RCA– Expanded scope to

include business processes

• Failure-based RCA– Based on failure analysis– employed in engineering

and maintenance.

• Systems-based RCA – amalgamation of the all

the others, and includes• change management,

• risk management, and

• systems analysis

Page 8: 1 © Life Cycle Engineering 2008 Root Cause (Failure) Analysis Eugene T. Cottle Reliability Engineer

8© Life Cycle Engineering 2008

Objectives

• Prevent recurrence• Responsibility

– “Hand-off” the investigation• Begins with an assumption of “cause”

– Liability– Blame

Page 9: 1 © Life Cycle Engineering 2008 Root Cause (Failure) Analysis Eugene T. Cottle Reliability Engineer

9© Life Cycle Engineering 2008

Deming’s 14 points

1. Create constancy of purpose toward improvement of product and service, with the aim to become competitive and stay in business, and to provide jobs.

2. Adopt the new philosophy. We are in a new economic age. Western management must awaken to the challenge, must learn their responsibilities, and take on leadership for change.

3. Cease dependence on inspection to achieve quality. Eliminate the need for inspection on a mass basis by building quality into the product in the first place.

4. End the practice of awarding business on the basis of price tag. Instead, minimize total cost. Move towards a single supplier for any one item, on a long-term relationship of loyalty and trust.

5. Improve constantly and forever the system of production and service, to improve quality and productivity, and thus constantly decrease cost.

6. Institute training on the job. 7. Institute leadership (see Point 12 and Ch. 8 of "Out

of the Crisis"). The aim of supervision should be to help people and machines and gadgets to do a better job. Supervision of management is in need of overhaul, as well as supervision of production workers.

8. Drive out fear, so that everyone may work effectively for the company. (See Ch. 3 of "Out of the Crisis")

9. Break down barriers between departments. People in research, design, sales, and production must work as a team, to foresee problems of production and in use that may be encountered with the product or service.

10. Eliminate slogans, exhortations, and targets for the work force asking for zero defects and new levels of productivity. Such exhortations only create adversarial relationships, as the bulk of the causes of low quality and low productivity belong to the system and thus lie beyond the power of the work force.

11. a. Eliminate work standards (quotas) on the factory floor. Substitute leadership.b. Eliminate management by objective. Eliminate management by numbers, numerical goals. Substitute workmanship.

12. a. Remove barriers that rob the hourly worker of his right to pride of workmanship. The responsibility of supervisors must be changed from sheer numbers to quality.b. Remove barriers that rob people in management and in engineering of their right to pride of workmanship. This means, inter alia, abolishment of the annual or merit rating and of management by objective (See CH. 3 of "Out of the Crisis").

13. Institute a vigorous program of education and self-improvement.

14. Put everyone in the company to work to accomplish the transformation. The transformation is everyone's work.

Page 10: 1 © Life Cycle Engineering 2008 Root Cause (Failure) Analysis Eugene T. Cottle Reliability Engineer

10© Life Cycle Engineering 2008

Deming’s 14 points

1. Create constancy of purpose toward improvement of product and service, with the aim to become competitive and stay in business, and to provide jobs.

2. Adopt the new philosophy. We are in a new economic age. Western management must awaken to the challenge, must learn their responsibilities, and take on leadership for change.

3. Cease dependence on inspection to achieve quality. Eliminate the need for inspection on a mass basis by building quality into the product in the first place.

4. End the practice of awarding business on the basis of price tag. Instead, minimize total cost. Move towards a single supplier for any one item, on a long-term relationship of loyalty and trust.

5. Improve constantly and forever the system of production and service, to improve quality and productivity, and thus constantly decrease cost.

6. Institute training on the job. 7. Institute leadership (see Point 12 and Ch. 8 of "Out

of the Crisis"). The aim of supervision should be to help people and machines and gadgets to do a better job. Supervision of management is in need of overhaul, as well as supervision of production workers.

8. Drive out fear, so that everyone may work effectively for the company. (See Ch. 3 of "Out of the Crisis")

9. Break down barriers between departments. People in research, design, sales, and production must work as a team, to foresee problems of production and in use that may be encountered with the product or service.

10. Eliminate slogans, exhortations, and targets for the work force asking for zero defects and new levels of productivity. Such exhortations only create adversarial relationships, as the bulk of the causes of low quality and low productivity belong to the system and thus lie beyond the power of the work force.

11. a. Eliminate work standards (quotas) on the factory floor. Substitute leadership.b. Eliminate management by objective. Eliminate management by numbers, numerical goals. Substitute workmanship.

12. a. Remove barriers that rob the hourly worker of his right to pride of workmanship. The responsibility of supervisors must be changed from sheer numbers to quality.b. Remove barriers that rob people in management and in engineering of their right to pride of workmanship. This means, inter alia, abolishment of the annual or merit rating and of management by objective (See CH. 3 of "Out of the Crisis").

13. Institute a vigorous program of education and self-improvement.

14. Put everyone in the company to work to accomplish the transformation. The transformation is everyone's work.

Page 11: 1 © Life Cycle Engineering 2008 Root Cause (Failure) Analysis Eugene T. Cottle Reliability Engineer

11© Life Cycle Engineering 2008

Deming’s 14 points

1. Create constancy of purpose toward improvement of product and service, with the aim to become competitive and stay in business, and to provide jobs.

2. Adopt the new philosophy. We are in a new economic age. Western management must awaken to the challenge, must learn their responsibilities, and take on leadership for change.

3. Cease dependence on inspection to achieve quality. Eliminate the need for inspection on a mass basis by building quality into the product in the first place.

4. End the practice of awarding business on the basis of price tag. Instead, minimize total cost. Move towards a single supplier for any one item, on a long-term relationship of loyalty and trust.

5. Improve constantly and forever the system of production and service, to improve quality and productivity, and thus constantly decrease cost.

6. Institute training on the job. 7. Institute leadership (see Point 12 and Ch. 8 of "Out

of the Crisis"). The aim of supervision should be to help people and machines and gadgets to do a better job. Supervision of management is in need of overhaul, as well as supervision of production workers.

8. Drive out fear, so that everyone may work effectively for the company. (See Ch. 3 of "Out of the Crisis")

9. Break down barriers between departments. People in research, design, sales, and production must work as a team, to foresee problems of production and in use that may be encountered with the product or service.

10. Eliminate slogans, exhortations, and targets for the work force asking for zero defects and new levels of productivity. Such exhortations only create adversarial relationships, as the bulk of the causes of low quality and low productivity belong to the system and thus lie beyond the power of the work force.

11. a. Eliminate work standards (quotas) on the factory floor. Substitute leadership.b. Eliminate management by objective. Eliminate management by numbers, numerical goals. Substitute workmanship.

12. a. Remove barriers that rob the hourly worker of his right to pride of workmanship. The responsibility of supervisors must be changed from sheer numbers to quality.b. Remove barriers that rob people in management and in engineering of their right to pride of workmanship. This means, inter alia, abolishment of the annual or merit rating and of management by objective (See CH. 3 of "Out of the Crisis").

13. Institute a vigorous program of education and self-improvement.

14. Put everyone in the company to work to accomplish the transformation. The transformation is everyone's work.

Page 12: 1 © Life Cycle Engineering 2008 Root Cause (Failure) Analysis Eugene T. Cottle Reliability Engineer

12© Life Cycle Engineering 2008

DMAIC

Page 13: 1 © Life Cycle Engineering 2008 Root Cause (Failure) Analysis Eugene T. Cottle Reliability Engineer

13© Life Cycle Engineering 2008

Page 14: 1 © Life Cycle Engineering 2008 Root Cause (Failure) Analysis Eugene T. Cottle Reliability Engineer

14© Life Cycle Engineering 2008

5 Whys

• 5 Whys1. Why? - The battery is dead. (first

why) 2. Why? - The alternator is not

functioning. (second why) 3. Why? - The alternator belt has

broken. (third why) 4. Why? - The alternator belt was well

beyond its useful service life and has never been replaced. (fourth why)

5. Why? - I have not been maintaining my car according to the recommended service schedule. (fifth why, root cause)

Sakichi Toyoda ( 豊田 佐吉 Toyoda Sakichi,

February 14, 1867 – October 30, 1930)

Page 15: 1 © Life Cycle Engineering 2008 Root Cause (Failure) Analysis Eugene T. Cottle Reliability Engineer

15© Life Cycle Engineering 2008

5 why’s continued

• Why 5?– Nothing magic about the number 5– After about 5 it can get absurd or go out of scope– Do we have control over this cause?– Will eliminating this cause prevent recurrence?

• Shortcomings– Oversimplifies cause and effect relationships

• Multiple causal and contributing factors• Confluence of events

– Not a structured method for effective investigations• Other methods help identify possible factors

• Fundamental idea underlying all RCA’s (Cause Effect)

Page 16: 1 © Life Cycle Engineering 2008 Root Cause (Failure) Analysis Eugene T. Cottle Reliability Engineer

16© Life Cycle Engineering 2008

Ishikawa Diagram(also “fish-bone” diagram)

• Can come at any point in the process• Helps direct activities• Brainstorming tool• Followed by data collection, verification,

tests, etc.Tague’s, Nancy R. The Quality Toolbox, Second Edition, ASQ Quality Press, 2004, pages 247-249

Page 17: 1 © Life Cycle Engineering 2008 Root Cause (Failure) Analysis Eugene T. Cottle Reliability Engineer

17© Life Cycle Engineering 2008

Ishikawa diagrams

• The 6 “M”s1. Machine,

2. Method,

3. Materials,

4. Maintenance,

5. Man and

6. Mother Nature (Environment)

• The 8 “P”s1. Price,

2. Promotion,

3. People,

4. Processes,

5. Place / Plant,

6. Policies,

7. Procedures, and

8. Product (or Service)

• The 4 “S”s1. Surroundings,

2. Suppliers,

3. Systems,

4. Skills

Page 18: 1 © Life Cycle Engineering 2008 Root Cause (Failure) Analysis Eugene T. Cottle Reliability Engineer

18© Life Cycle Engineering 2008

Failure Model

• The level at which any root cause should be identified is the level at which it is possible to identify an appropriate failure management policy

Page 19: 1 © Life Cycle Engineering 2008 Root Cause (Failure) Analysis Eugene T. Cottle Reliability Engineer

19© Life Cycle Engineering 2008

“8 Disciplines” or “8D”

• The 8 Disciplines1. Use Team Approach2. Describe the Problem3. Implement and Verify Short-Term Corrective Actions4. Define and Verify Root Causes5. Verify Corrective Actions6. Implement Permanent Corrective Actions7. Prevent Recurrence8. Congratulate Your Team

• Other tools can be incorporated into the steps of an 8D

Page 20: 1 © Life Cycle Engineering 2008 Root Cause (Failure) Analysis Eugene T. Cottle Reliability Engineer

20© Life Cycle Engineering 2008

Page 21: 1 © Life Cycle Engineering 2008 Root Cause (Failure) Analysis Eugene T. Cottle Reliability Engineer

21© Life Cycle Engineering 2008

Kepner-Tregoe (KT) analysisPioneered in early 1960’s

USAF and NASA“built on the premise that people can be taught

to think critically”

• Invite someone from a different area as a “fresh set of eyes”– “Could you please explain…?”– “How do you know…?– “Do you have any data to show that…”

Page 22: 1 © Life Cycle Engineering 2008 Root Cause (Failure) Analysis Eugene T. Cottle Reliability Engineer

22© Life Cycle Engineering 2008

What is acceptable?

What do you expect?

Everything fails …

If you push it hard enough

If you run it long enough

If it gets hot enough

Etc.

It will fail.

• Also, “failure probability distribution” –Answers simultaneously– “How many as a portion of the population?” and– At what point in their life (age, cycles, etc.)

• You need good data to answer these questions

Page 23: 1 © Life Cycle Engineering 2008 Root Cause (Failure) Analysis Eugene T. Cottle Reliability Engineer

23© Life Cycle Engineering 2008

Statistical Analysis

• Important to understand…– Coincidence– Correlation– Cause

Page 24: 1 © Life Cycle Engineering 2008 Root Cause (Failure) Analysis Eugene T. Cottle Reliability Engineer

24© Life Cycle Engineering 2008

Statistical Analysis

• Important to understand…– Coincidence– Correlation– Cause

Page 25: 1 © Life Cycle Engineering 2008 Root Cause (Failure) Analysis Eugene T. Cottle Reliability Engineer

25© Life Cycle Engineering 2008

Statistical Analysis

• Important to understand…– Coincidence– Correlation– Cause

Page 26: 1 © Life Cycle Engineering 2008 Root Cause (Failure) Analysis Eugene T. Cottle Reliability Engineer

26© Life Cycle Engineering 2008

Statistical Analysis

• Important to understand…– Coincidence– Correlation– Cause

Page 27: 1 © Life Cycle Engineering 2008 Root Cause (Failure) Analysis Eugene T. Cottle Reliability Engineer

27© Life Cycle Engineering 2008

Statistical Analysis

• Important to understand…– Coincidence– Correlation– Cause

Page 28: 1 © Life Cycle Engineering 2008 Root Cause (Failure) Analysis Eugene T. Cottle Reliability Engineer

28© Life Cycle Engineering 2008

Statistical Analysis

• Important to understand…– Coincidence– Correlation– Cause

Page 29: 1 © Life Cycle Engineering 2008 Root Cause (Failure) Analysis Eugene T. Cottle Reliability Engineer

29© Life Cycle Engineering 2008

Statistical Analysis

• Important to understand…– Coincidence– Correlation– Cause

• Tools…– Design of Experiments (DOE)– Analysis of Variance (ANOVA)– Correlation analyses– Hypothesis testing

“Smoking is one of the leading causes of statistics.”-- Fletcher Knebel

Page 30: 1 © Life Cycle Engineering 2008 Root Cause (Failure) Analysis Eugene T. Cottle Reliability Engineer

30© Life Cycle Engineering 2008

Selecting and prioritizing actions

• Requires some knowledge of probability of occurrence - Data

SEVERITY Catastrophic Critical Marginal Negligible

From To Definition Probability

~1 8 x 10 –2 Likely to occur frequently Frequent 1 3 6 10

8 x 10 –2 8 x 10 –3 Will occur several times in life of an item Probable 2 5 9 14

8 x 10 –3 8 x 10 –4 Likely to occur sometime in life of an item Occasional 4 8 13 17

8 x 10 –4 8 x 10 –5 Unlikely but possible to occur in the life of an item Remote 7 12 16 19

8 x 10 –5 ~0 So unlikely it may be assumed that it won't occur Improbable 11 15 18 20

Probability Range

Customer Notification Containment Corrective Action1~5 Immediate Restrict field use. Purge existing

stock.Complete field retrofit as quickly as

possible.6~10 Immediate Warn customer to avoid conditions

leading to the failure. Hold shipments till design change is incorporated.

Complete paced field retrofit at earliest opportunity.

11~15 Service Bulletin No containment required Change design, offer upgrade to customer.

16~20 Revision notes No containment required Change design at next opportunity, or correct the problem in the next

generation product.

Page 31: 1 © Life Cycle Engineering 2008 Root Cause (Failure) Analysis Eugene T. Cottle Reliability Engineer

31© Life Cycle Engineering 2008

Selecting and prioritizing actions

• FMEA (Failure Modes and Effects Analysis• Requires some knowledge of probabilities

Page 32: 1 © Life Cycle Engineering 2008 Root Cause (Failure) Analysis Eugene T. Cottle Reliability Engineer

32© Life Cycle Engineering 2008

Keys for Success

• You aren’t the expert– Challenge everything– Speak with data, act on fact

• Have the data – and use it• Don’t let motivations drive conclusions• Resources

– Always resource-constrained– Depends on risk and criticality

• Finish the job - verification

Page 33: 1 © Life Cycle Engineering 2008 Root Cause (Failure) Analysis Eugene T. Cottle Reliability Engineer

33© Life Cycle Engineering 2008

“In theory, there is no difference between theory and practice; In practice, there is.”-- Chuck Reid

Page 34: 1 © Life Cycle Engineering 2008 Root Cause (Failure) Analysis Eugene T. Cottle Reliability Engineer

34© Life Cycle Engineering 2008

Page 35: 1 © Life Cycle Engineering 2008 Root Cause (Failure) Analysis Eugene T. Cottle Reliability Engineer

35© Life Cycle Engineering 2008

Department of Defense Inspector General Auditing Report 99-193, C-17 Landing Gear Durability and Parts Support, June 24, 1999

Page 36: 1 © Life Cycle Engineering 2008 Root Cause (Failure) Analysis Eugene T. Cottle Reliability Engineer

36© Life Cycle Engineering 2008

• Designed for and subjected to 600 hour durability test, vibration and thermal (Specification requirement)

• A couple of redesigns already– Identified location and mechanism of failure– Made it more robust both times

• Discarding 10-13 sensors per month• Problem: Solve the high failure rate.

Page 37: 1 © Life Cycle Engineering 2008 Root Cause (Failure) Analysis Eugene T. Cottle Reliability Engineer

37© Life Cycle Engineering 2008

• Discarding 10-13 sensors per month• A couple of redesigns already• Problem: Solve the high failure rate.• Although each redesign had made the sensor

stronger, there was never clear definition of the requirement

• Initial problem was an inadequate specification• Most of the sensors currently being discarded had not

failed• “Swaptronics”

• Resolution: Improve troubleshooting

Page 38: 1 © Life Cycle Engineering 2008 Root Cause (Failure) Analysis Eugene T. Cottle Reliability Engineer

38© Life Cycle Engineering 2008

Blue Screen of Death (BSOD)

Page 39: 1 © Life Cycle Engineering 2008 Root Cause (Failure) Analysis Eugene T. Cottle Reliability Engineer

39© Life Cycle Engineering 2008

BSOD continued…

Page 40: 1 © Life Cycle Engineering 2008 Root Cause (Failure) Analysis Eugene T. Cottle Reliability Engineer

40© Life Cycle Engineering 2008

Key take-aways

• Conclusion– Micro-bubbles forming on the disk drives– Only happens if the (computer) is left on all the time– Corrective action was to turn off the computers at

and restart them once every 24 hours• Not a true corrective action

• Lessons for RCFA– Took about 18 months from initiation of activity to

report– Dedicated (and determined) engineer

Page 41: 1 © Life Cycle Engineering 2008 Root Cause (Failure) Analysis Eugene T. Cottle Reliability Engineer

41© Life Cycle Engineering 2008

Conveyor Failures

• High failure rate– Motors tripping– Gearbox failures

• Solve the high failure rate

Page 42: 1 © Life Cycle Engineering 2008 Root Cause (Failure) Analysis Eugene T. Cottle Reliability Engineer

42© Life Cycle Engineering 2008

Conveyor Failures continued…

• Problem definition– The corrective action team determined that the

failures were generally of two types,

1. premature wear out consistent with long term, slightly elevated loading, and

2. failures consistent with transient torque overloads.

– One side has a higher failure rate than the other– Load?

Page 43: 1 © Life Cycle Engineering 2008 Root Cause (Failure) Analysis Eugene T. Cottle Reliability Engineer

43© Life Cycle Engineering 2008

Strain gauges applied at the couplings on both conveyors

• Setup a remote data acquisition system (WebDaq)

• Began gathering long-term data– About 8 days of continuous

data– Then about 137 hours of

intermittent (triggered) data

Cutouts for strain gauge

s.

Strain gauge locatio

ns

Page 44: 1 © Life Cycle Engineering 2008 Root Cause (Failure) Analysis Eugene T. Cottle Reliability Engineer

44© Life Cycle Engineering 2008

Normal operation Overload event

Overload event (zoom)

• Frequency indicates coupling slipping

Page 45: 1 © Life Cycle Engineering 2008 Root Cause (Failure) Analysis Eugene T. Cottle Reliability Engineer

45© Life Cycle Engineering 2008

Conveyor Failures conclusion

• Life difference between drives is normal wear-out due to higher load during normal operation

• Premature failures due to overload events…– “Clamping” of the belts due to

programming errors in control system

• Latent causes not addressed…– Development, installation and

run-off process that permitted the programming errors

– Process that failed to catch the errors

• Fundamental Principles / Lessons Learned (for Root Cause Failure Analysis…)– Devoted adequate

resources– Did not do a design

change based on initial “apparent” cause

– Problem definition / Data collection

– Time commitment• 10 Months from

identification of failure for RCFA to final report

Page 46: 1 © Life Cycle Engineering 2008 Root Cause (Failure) Analysis Eugene T. Cottle Reliability Engineer

46© Life Cycle Engineering 2008

Mobile hydraulic pumpsTruck pumps leakage

• Problem– Reported substantial increase in

failure rate due to leakage– Initial conclusion (assumed) – faulty

pump– Initiated a campaign to replace all the

pumps

• Very good data– Extensive details on every failure

• Model, serial number, application, hours in service, calendar time in service…

Page 47: 1 © Life Cycle Engineering 2008 Root Cause (Failure) Analysis Eugene T. Cottle Reliability Engineer

47© Life Cycle Engineering 2008

Mobile hydraulic pumps continued…Truck pumps leakage continued…

• Established 13 year timeline showing entire history of design and application

• Reviewed detailed removal history and failure probability distributions

• Identified 2 different failure modes…

1.00 100.0010.00

0.10

0.50

1.00

5.00

10.00

50.00

90.00

99.00

0.10

0.5

0.6

0.7

0.8

0.9

1.0

1.2

1.4

1.6

2.0

3.0

4.0

6.0

ReliaSoft's Weibull++ 6.0 - www.Weibull.com

Probability - Weibull

Time in Service (months)

Unr

elia

bilit

y, F

(t)

478

610

809

938

11531193

13141289

13781307

1251134111691120105393388877267163755350743241538330332126621322817118915411812292433227149

x 115

40

x 164

34

x 121

12

x 151

91

x 134

49

x 124

10

x 101

44

x 114

59

x 110

72

x 123

95

x 119

13

x 851

7

x 931

6

x 820

6

x 122

31

x 138

65

x 602

9

x 818

9

x 571

1

x 696

9

x 547

2

x 501

4

x 571

2

x 607

8

x 472

8

x 325

2

x 516

6

x 646

6

x 547

1

x 561

4

x 430

0

x 610

9

x 540

3

x 596

1

x 618

1

x 530

4

x 451

7

x 418

0

x 468

1

x 560

6

x 582

2

x 651

2

WeibullAllParts_Months_Warr

W5 MLE - SRM MEDF=24896 / S=334701CB[FM]@90.00%1-Sided-U [T1]

Page 48: 1 © Life Cycle Engineering 2008 Root Cause (Failure) Analysis Eugene T. Cottle Reliability Engineer

48© Life Cycle Engineering 2008

1.00 100.0010.00

0.01

0.05

0.10

0.50

1.00

5.00

10.00

50.00

90.00

99.00

0.01

0.5

0.6

0.7

0.8

0.9

1.0

1.2

1.4

1.6

2.0

3.0

4.0

6.0

ReliaSoft's Weibull++ 6.0 - www.Weibull.com

Probability - Weibull

Time in Service (months)

Unr

elia

bili

ty,

F(t

)

80

99

124120

112941188910590807880576147564042172322917121510888892354

x 135

x 143

x 136

x 161

x 187

x 188x 2

00x 2

07x 1

93x 2

40x 2

14x 2

37x 2

42x 2

65x 2

45x 2

51x 2

71x 3

15x 3

26x 3

09x 3

30x 3

00x 2

67x 3

12x 3

36x 2

17x 3

24x 2

65x 3

09x 2

48x 2

46x 2

48x 1

94x 2

22x 1

53x 1

87x 2

26x 2

01x 1

73

Weibull16000K

W5 MLE - SRM MEDF=1752 / S=9223

133

128

118 5 711336364 52 2 3

x 2 x 2 x 4 x 6 x 6 x 4 x 6 x 5x 3x 2x 4x 4x 4x 3x 3x 3x 5x 3x 5x 8x 3x 6x 7x 10

x 3x 8x 13

x 7x 7x 7x 9x 8x 4x 6x 7x 21

16000L

W5 MLE - SRM MEDF=120 / S=209

223

280

427473

622665

731703

7947346957576455655694674594223753443023132732542361621841571151348610976525942158

x 996

x 905

x 831

x 933

x 904

x 939

x 109

6

x 117

9

x 120

3

x 167

2

x 172

7

x 197

6

x 258

2

x 247

8

x 279

8

x 233

5

x 229

1

x 241

0

x 267

0

x 257

1

x 217

2

x 249

5

x 208

2

x 243

1

x 328

1

x 238

9

x 356

4

x 309

2

x 299

9

x 360

4

x 307

3

x 303

4

x 240

8

x 255

1

x 247

3

x 329

2

x 321

9

x 307

3

x 372

6

16000M

W5 MLE - SRM MEDF=13527 / S=89454

8

17

1921

201512111617141413101179332623222

x 33

x 35

x 37

x 55

x 62

x 91x 7

9x 9

4x 9

9x 1

22x 2

25x 2

69x 3

83x 3

88x 5

09x 4

07x 4

80x 4

50x 4

08x 4

33x 3

59x 3

53x 3

70x 3

45x 3

86x 2

90x 3

89x 3

39x 3

93x 2

88x 2

01x 1

22x 8

6x 1

37x 8

1x 9

5x 8

6x 5

1x 7

6

16000N

W5 MLE - SRM MEDF=264 / S=9106

37

5050

42322844383334292637172311121915138677665443

x 121

x 157

x 217

x 266

x 292

x 341x 4

45x 5

92x 6

58

x 100

7

x 123

6

x 153

4

x 187

7

x 182

9

x 249

2

x 207

0

x 179

5

x 166

3

x 162

9

x 160

7

x 121

1

x 121

4

x 147

9

x 122

1

x 140

8

x 119

0

x 211

7

x 131

9

x 122

1

x 123

0

x 101

7

x 116

0

x 757

x 830

x 733

x 684

x 769

x 643

x 725

AllOthers

W5 MLE - SRM MEDF=648 / S=42756CB[FM]@90.00%1-Sided-U [T1]

Further analysis permitted us to isolate and identify subpopulations with distinctly different failure distributions

Page 49: 1 © Life Cycle Engineering 2008 Root Cause (Failure) Analysis Eugene T. Cottle Reliability Engineer

49© Life Cycle Engineering 2008

Mobile hydraulic pumps continued…Truck pumps leakage continued…

• Truck test results– 1. The highest acceleration

levels are always associated with rapid pressure drops, |dP/dt| about 1800 bar per second or greater.

– 2. Pressure drops (|dP/dt|) on Truck 2 were on average a little greater truck 1, but they never result in the impact signature.

– 3. |dP/dt| >= about 1800 bar per second ALWAYS results in an impact signature on truck 1

Page 50: 1 © Life Cycle Engineering 2008 Root Cause (Failure) Analysis Eugene T. Cottle Reliability Engineer

50© Life Cycle Engineering 2008

Mobile hydraulic pumps continued…Truck pumps leakage conclusions

• Have the data• Statistical tools• Resources

– About 1 year

Page 51: 1 © Life Cycle Engineering 2008 Root Cause (Failure) Analysis Eugene T. Cottle Reliability Engineer

51© Life Cycle Engineering 2008

The word 'politics' is derived from the word 'poly', meaning 'many', and the word 'ticks', meaning 'blood sucking parasites‘.”-- Larry Hardiman

Page 52: 1 © Life Cycle Engineering 2008 Root Cause (Failure) Analysis Eugene T. Cottle Reliability Engineer

52© Life Cycle Engineering 2008

Acme* Gearbox - Background

• 3-stage, 1800kW gearbox driving a rock crusher

• Late in the evening there was a vibration alarm

• Alarm was “not unusual”, they continued operating

• Early the next morning there was a loud noise, and shutdown for vibration

*Some details have been changed

Page 53: 1 © Life Cycle Engineering 2008 Root Cause (Failure) Analysis Eugene T. Cottle Reliability Engineer

53© Life Cycle Engineering 2008

Background continued…

• Over the next few days they replaced the gearbox with a spare

• Vendor was consulted. They “knew exactly what went wrong”

• Insurance company requested an independent Root Cause Failure Analysis

Page 54: 1 © Life Cycle Engineering 2008 Root Cause (Failure) Analysis Eugene T. Cottle Reliability Engineer

54© Life Cycle Engineering 2008

Background continued…

• Over the next few days they replaced the gearbox with a spare

• Vendor was consulted. They “knew exactly what went wrong”

• Insurance company requested an independent Root Cause Failure Analysis

Page 55: 1 © Life Cycle Engineering 2008 Root Cause (Failure) Analysis Eugene T. Cottle Reliability Engineer

55© Life Cycle Engineering 2008

Complications

• “Independent”– Implies limited cooperation between experts

• People who designed and built the equipment• People who maintained and operated equipment

– Don’t take everything at face value• Consider everyone’s motivations• There are vested interests in different possible conclusions

• Limited access to the hardware– Resources

Page 56: 1 © Life Cycle Engineering 2008 Root Cause (Failure) Analysis Eugene T. Cottle Reliability Engineer

56© Life Cycle Engineering 2008

Investigation

What did the people do?

Why did they do it?(systems, procedures, motivations)

This is where you usually find the “root” cause

Page 57: 1 © Life Cycle Engineering 2008 Root Cause (Failure) Analysis Eugene T. Cottle Reliability Engineer

57© Life Cycle Engineering 2008

Investigation

Induced• Application• Environment• “You broke it”

(vendor)

Inherent• Design• Materials• “It broke”

(user)

Answers the question “which humans?”

Page 58: 1 © Life Cycle Engineering 2008 Root Cause (Failure) Analysis Eugene T. Cottle Reliability Engineer

58© Life Cycle Engineering 2008

Data(“Describe the problem” from 8D form)

• Loading, both before the incident and historically

• Equipment design, ratings (what was it expected to do?)

• Maintenance history• Vibration analyses / reports

Page 59: 1 © Life Cycle Engineering 2008 Root Cause (Failure) Analysis Eugene T. Cottle Reliability Engineer

59© Life Cycle Engineering 2008

Contaminant report…

Page 60: 1 © Life Cycle Engineering 2008 Root Cause (Failure) Analysis Eugene T. Cottle Reliability Engineer

60© Life Cycle Engineering 2008

Vibration

• Requested source data, FFT parameters, etc.

(monthly checks… one year history)

(motor bearing)

Page 61: 1 © Life Cycle Engineering 2008 Root Cause (Failure) Analysis Eugene T. Cottle Reliability Engineer

61© Life Cycle Engineering 2008

Vibration (source data)

Page 62: 1 © Life Cycle Engineering 2008 Root Cause (Failure) Analysis Eugene T. Cottle Reliability Engineer

62© Life Cycle Engineering 2008

0:00 12:00 24:00 36:00 48:00 60:00 72:00 84:00 96:00

Loading

At time of failure

1 Year Earlier

2 Years Earlier

Page 63: 1 © Life Cycle Engineering 2008 Root Cause (Failure) Analysis Eugene T. Cottle Reliability Engineer

63© Life Cycle Engineering 2008

Power – 30 days leading up to failure

Page 64: 1 © Life Cycle Engineering 2008 Root Cause (Failure) Analysis Eugene T. Cottle Reliability Engineer

64© Life Cycle Engineering 2008

Power – 30 days 1 year earlier

Page 65: 1 © Life Cycle Engineering 2008 Root Cause (Failure) Analysis Eugene T. Cottle Reliability Engineer

65© Life Cycle Engineering 2008

Motor is replaced(-2 days)

Internal winding failure

Gearbox is rebuilt

Maintenance crew fixes an oil leak

Maintenance crew fixes an oil leak

Maintenance crew fixes an oil leak

Oil leaks repaired “many” times, most undocumented

due to vibration alarm

( date and time)

Plant shuts down Tripped due to vibration alarm

(-2 hours)

Supervisor decides to

continue running(-2 hours)

Maintenance crew fixes an oil leak

(-2 days)

Page 66: 1 © Life Cycle Engineering 2008 Root Cause (Failure) Analysis Eugene T. Cottle Reliability Engineer

66© Life Cycle Engineering 2008

Interviews – the picture that emerges

• 2 days prior – high speed shaft was not properly drawn up to engage the pinion– Crew did not have specs or

manuals– No one knew where they were

• Oil leaks had been repaired “many times” since rebuild

• Could have been improperly reinstalled any of those times

• Prior to failure, crews heard “Rumble” typical of loading too much material (common occurrence)

• Other crews described the proper procedure, “tribal knowledge”

• Maintenance records were incomplete

• Vendor reported no apparent problems when new motor was installed

• Control room vibration monitoring was not helpful

• Alarms occurred “all the time” with no action taken

• There were indications a failure was imminent

Page 67: 1 © Life Cycle Engineering 2008 Root Cause (Failure) Analysis Eugene T. Cottle Reliability Engineer

67© Life Cycle Engineering 2008

Page 68: 1 © Life Cycle Engineering 2008 Root Cause (Failure) Analysis Eugene T. Cottle Reliability Engineer

68© Life Cycle Engineering 2008

Remaining questions:•Was damage accumulating over time?•Were there material or design contributors?

Page 69: 1 © Life Cycle Engineering 2008 Root Cause (Failure) Analysis Eugene T. Cottle Reliability Engineer

69© Life Cycle Engineering 2008

Metallurgical report

• Two contact patterns…– “Frosting” below the pitch line, indicating a period of

normal wear– Obvious indications of wear near tooth tips

• Bearings indicated a severe misalignment• Nothing anomalous in material properties

(hardness, case depth, chemical and microstructure)

• Failure was due to low cycle fatigue prior to overload

Page 70: 1 © Life Cycle Engineering 2008 Root Cause (Failure) Analysis Eugene T. Cottle Reliability Engineer

70© Life Cycle Engineering 2008

Root cause conclusions

• Induced failure due to – improper maintenance,

resulting in low cycle fatigue then overload

– High loads due to material overloading were a likely contributor

• Latent factors: – Poor cooperation with

supplier(s)– Inadequate documentation

and equipment specific training

– Ineffective warning system and propensity to ignore warnings

• Proposed corrective actions– Acquire up-to-date

specifications, documentation and maintenance procedures for critical equipment

– Ensure equipment specific training for maintenance personnel

– Review adequacy of alarm system to ensure warnings are adequate and meaningful

– Define appropriate responses– Instill a culture that expects

response and action

Page 71: 1 © Life Cycle Engineering 2008 Root Cause (Failure) Analysis Eugene T. Cottle Reliability Engineer

71© Life Cycle Engineering 2008

Conclusions, or if you remember nothing else about Root Cause Analysis, remember this:

• Do it. RCA is the engine that drives continuous improvement.

• Have the data– Keep good records, not just of failures but of

• All maintenance actions• When did it begin service? … end?• Operating conditions• If you don’t have a good CMMS, get one.• If you do (or when you do), USE IT

• Resources. Have the right– People– Training, and – Tools.

Page 72: 1 © Life Cycle Engineering 2008 Root Cause (Failure) Analysis Eugene T. Cottle Reliability Engineer

72© Life Cycle Engineering 2008

The last word…

Problem Solving Flow Chart

Don’t Mess With I t!

YES NO

YES

YOU POOR FOOL!NO

Are You I n Hot Water?

NO

Throw Away The Evidence

Does Anyone Know? TOO BAD!

YESYES

NO

Hide I t Can You Blame Someone Else?

NO

NO PROBLEM!

YES

I s I t Working?

Did You Mess With I t?