8124 Study Notes 31 July

Study notes for the Monash University Master of Maintenance and Reliability Engineering subject GEG8124. A good overview of reliability engineering drawn from various sources, including O'Connor's "Practical Reliability Engineering".

GEG8124 Understanding Reliability

Study Notes 2006

Matthew McLeod #19303432

Contents

Contents.................................................................................................................................................1

STUDY GUIDE 1 – Introduction to Reliability...................................................................................8

Objectives...........................................................................................................................................8

Study Guide 1 Notes...........................................................................................................................8

Assurance Technologies..................................................................................................................8

Reliability Fundamentals...............................................................................................................10

System Effectiveness.....................................................................................................................10

Quality And Reliability.................................................................................................................10

Determination Of Cost Drivers......................................................................................................11

Introduction To Cost Effective Analysis.......................................................................................11

Profitability Of Reliability.............................................................................................................11

O’Connor Chapter 1 - Introduction to Reliability Engineering........................................................12

Why Do Engineering Items Fail?..................................................................................................12

Probabilistic Reliability.................................................................................................................13

Repairable and Non Repairable Items...........................................................................................13

Non-Repairable Items....................................................................................................................13

Repairable Items............................................................................................................................14

Development of Reliability Engineering.......................................................................................14

Reliability As An Effectiveness Parameter...................................................................................14

Reliability Programme Activities..................................................................................................14

Reliability Economics And Management......................................................................................15

Smith Chapter 1 - The history of reliability and safety technology.................................................15

Definitions.....................................................................................................................................15

Failure Data...................................................................................................................................16

Hazardous Failures........................................................................................................................16

Reliability and Risk Prediction......................................................................................................17

Achieving Reliability and Safety-Integrity....................................................................................17

RAMS Cycle..................................................................................................................................18

Contractual Pressures....................................................................................................................18

Last saved 11/1/2006 11:51:00 AM Last printed 5/9/2006 01:59:00 PM

Smith Chapter 2 - Understanding terms and jargon.........................................................................19

Defining Failure And Failure Modes.............................................................................................19

Failure Rate And MTBF................................................................................................................19

Interrelationship Of Terms............................................................................................................19

Bathtub Distribution......................................................................................................................19

Down Time And Repair Time.......................................................................................................19

Availability, Unavailability And Probability Of Failure On Demand...........................................20

Hazard And Risk Related Terms...................................................................................................20

Choosing The Appropriate Parameter...........................................................................................20

Smith Chapter 3 - A cost-effective approach to quality, reliability and safety................................21

Reliability And Cost......................................................................................................................21

Costs And Safety...........................................................................................................................21

Cost Of Quality..............................................................................................................................21

Study Guide 1 – Self Assessment Questions....................................................................................23

STUDY GUIDE 2 – Reliability in Management and Quality Control...............................................24

Objectives.........................................................................................................................................24

AS2561-1982 Guide to the determination and use of quality costs.................................................24

O’Connor Chapter 15 – Reliability Management.............................................................................25

Corporate policy for reliability......................................................................................................25

Integrated reliability programmes.................................................................................................25

Reliability and Costs......................................................................................................................25

Safety and Product Liability..........................................................................................................25

Standards for reliability, quality and safety...................................................................................25

Specifying Reliability....................................................................................................................26

Smith Chapter 18 - Project Management.........................................................................................28

Setting Objectives and Specifications...........................................................................................28

Planning, Feasibility and Allocation.............................................................................................28

Programme Activities....................................................................................................................28

Responsibilities..............................................................................................................................29

Standards and Guidance Documents.............................................................................................30

Smith Chapter 19 – Contract clauses and their pitfalls....................................................................31

Essential Areas..............................................................................................................................31

Other Areas....................................................................................................................................32

Pitfalls............................................................................................................................................33

Penalties.........................................................................................................................................34

Subcontracted Reliability Assessments.........................................................................................34

Smith Chapter 20 – Product liability and safety legislation.............................................................35

The general situation.....................................................................................................................35

Strict Liability................................................................................................................................35

Insurance and Product Recall........................................................................................................35

Smith Chapter 21 – Major Incident Legislation...............................................................................36

Problem Areas...............................................................................................................................36

Smith Chapter 22 – Integrity of safety-related systems...................................................................37

Safety-related or safety critical?....................................................................................................37

Study Guide 2 Self Assessment Questions.......................................................................................37

STUDY GUIDE 3 - Reliability in Design..........................................................................................39

Objectives.........................................................................................................................................39

AS2529-1982 Collection of Reliability, Availability and Maintainability Data for Electronics and Similar Engineering Use...................................................................................................................39

1 Scope..........................................................................................................................................39

2 Application and Purpose.............................................................................................................39

3 Data Required.............................................................................................................................39

4 Guidelines...................................................................................................................................39

5 Reports........................................................................................................................................39

6 Field Performance Reports.........................................................................................................40

AS2530-1982 Presentation of Reliability Data on Electronic and Similar Components.................40

1 Scope..........................................................................................................................................40

2 Identification of Components Tested.........................................................................................40

3 Test Conditions...........................................................................................................................41

5 Data on changes in Characteristics.............................................................................................41

AS3960-1990 Guide to Reliability and Maintainability Program Management..............................42

1 Scope and General......................................................................................................................42

2 Reliability and Maintainability Program....................................................................................42

3 Specification of Reliability and Maintainability.........................................................................42

4 Assessment and prediction of Reliability and Maintainability...................................................43

5 Production, Flow, Analysis and Interpretation of Reliability and Maintainability Data............44

O’Connor Chapter 6 – Reliability Prediction and Modelling..............................................................46

Introduction...................................................................................................................................46

Fundamental Limitations of Reliability Prediction.......................................................................46

Reliability Databases.....................................................................................................................48

The Practical Approach.................................................................................................................48

System Reliability Models.............................................................................................................48

Availability of Repairable Systems...............................................................................................49

Modular Design.............................................................................................................................50

Block Diagram Analysis................................................................................................................50

Fault Tree Analysis........................................................................................................................51

Petri Nets.......................................................................................................................................51

State Space Analysis (Markov Analysis)......................................................................................51

Monte Carlo Simulation................................................................................................................52

Reliability Apportionment.............................................................................................................52

Standard Methods for Reliability Prediction and Modelling.........................................................52

Conclusions...................................................................................................................................52

O’Connor Chapter 7 – Reliability in Design (not required).............................................................53

Introduction...................................................................................................................................53

Computer-Aided Engineering........................................................................................................53

Environments.................................................................................................................................53

Design Analysis Methods..............................................................................................................53

Quality Function Deployment.......................................................................................................53

Load Strength Analysis.................................................................................................................53

Failure Modes, Effects and Criticality Analysis (FMECA)..........................................................53

Reliability Predictions for FMECA...............................................................................................53

Hazard and Operability Study (HAZOPS)....................................................................................53

Parts, Materials and Processes Review (PMP)..............................................................................53

Non-Material Failure Modes.........................................................................................................53

Human Reliability..........................................................................................................................53

Design analysis for processes........................................................................................................53

Critical Items List..........................................................................................................................53

Summary........................................................................................................................................53

Management of Design Review....................................................................................................53

Configuration Control....................................................................................................................53

O’Connor Chapter 8 – Reliability of Mechanical Components and Systems..................................54

Introduction...................................................................................................................................54

Mechanical Stress, Strength and Fracture.....................................................................................54

Fatigue...........................................................................................................................................54

O’Connor Chapter 9 – Electronic System Reliability......................................................................55

Introduction...................................................................................................................................55

Reliability of Electronic Components.........................................................................................55

Component Types and Failure Mechanisms.................................................................................55

Summary of device failure modes.................................................................................................55

Circuit and System aspects............................................................................................................55

Electronic System Reliability Prediction.......................................................................................55

Reliability in electronic system design..........................................................................................55

Parameter variation and tolerances................................................................................................55

Design for production, test and maintenance................................................................................55

O’Connor Chapter 10 – Software Reliability...................................................................................56

Introduction...................................................................................................................................56

Software in engineering systems...................................................................................................56

Software Errors..............................................................................................................................56

Preventing Errors...........................................................................................................................56

Software Structure and Modularity...............................................................................................56

Programming Style........................................................................................................................56

Fault Tolerance..............................................................................................................................56

Redundancy/Diversity...................................................................................................................56

Languages......................................................................................................................................56

Data Reliability..............................................................................................................................56

Software Checking........................................................................................................................56

Software Design Analysis Methods..............................................................................................56

Software Testing............................................................................................................................56

Error Reporting..............................................................................................................................56

Software Reliability Prediction and Measurement........................................................................56

Hardware/Software Interfaces.......................................................................................................56

Conclusions...................................................................................................................................56

Smith Chapter 17 – Systematic Failures, especially software..........................................................58

Programmable Devices..................................................................................................................58

Software-related Failures...............................................................................................................58

Software Failure Modelling...........................................................................................................58

Software Quality Assurance..........................................................................................................58

Modern/Formal Methods...............................................................................................................58

Software Checklists.......................................................................................................................58

Study Guide 3 Self Assessment Questions.......................................................................................58

STUDY GUIDE 4 - Reliability, Maintainability and Availability.....................................................59

Objectives.........................................................................................................................................59

O’Connor Chapter 14 – Maintainability, Maintenance and Availability.........................................59

Introduction...................................................................................................................................59

Maintenance Time Distributions...................................................................................................59

Preventative Maintenance Strategy...............................................................................................60

FMECA and FTA in Maintenance Planning.................................................................................60

Maintenance Schedules.................................................................................................................60

Technology Aspects......................................................................................................................61

Calibration.....................................................................................................................................62

Maintainability Predictions............................................................................................................62

Maintainability Demonstrations....................................................................................................62

Design for Maintainability.............................................................................................................62

Integrated Logistic Support...........................................................................................................62

O’Connor pages xxv – xxvi..............................................................................................................63

MIL-HDBK-472...............................................................................................................................63

SAE J817..........................................................................................................................................63

[Reader] Collacott Chapter 13 “Fault Analysis Planning and System Availability”..........................63

[Reader] Patton Chapter 8 “Reliability, Availability and Maintainability”.....................................63

Study Guide 4 Self Assessment Questions.......................................................................................63

STUDY GUIDE 5 - Reliability Prediction and Modelling.................................................................64

Objectives.........................................................................................................................................64

O’Connor, Chapter 6 Conclusion (limitations for reliability modelling).........................................64

O’Connor Chapter 12 (pgs 341-346)................................................................................................64

Reliability Analysis of Repairable Systems..................................................................................64

Smith Chapter 8 – Methods of Modelling........................................................................................65

Block Diagrams and Repairable Systems......................................................................................65

Common Cause (Dependent) Failure (CCF).................................................................................66

Fault Tree Analysis........................................................................................................................66

Event Tree Diagrams.....................................................................................................................66

Smith Chapter 9................................................................................................................................67

Duane, J.T., Learning Curve Approach to Reliability Monitoring, IEEE Transactions on Aerospace, Volume 2, Number 2, April 1964..................................................................................67

Summary........................................................................................................................................67

The Learning Curve.......................................................................................................................67

Analysis.........................................................................................................................................67

Discussion......................................................................................................................................67

Crow, Larry, Evaluating the Reliability of Repairable Systems, Proceedings Annual Reliability and Maintainability Symposium, 1990.............................................................................................68

Abstract..........................................................................................................................................68

Introduction...................................................................................................................................68

http://www.weibull.com/RelGrowthWeb/Crow-AMSAA_(N.H.P.P.).htm.....................................69

Minitab Help File.............................................................................................................................69

Study Guide 5 Self Assessment Questions.......................................................................................69

STUDY GUIDE 6 - Reliability Testing..............................................................................................71

Objectives.........................................................................................................................................71

O’Connor Chapter 11 – Reliability Testing......................................................................................71

Introduction...................................................................................................................................71

Planning Reliability Testing..........................................................................................................72

Test Environments.........................................................................................................................72

Testing for Reliability and Durability: Accelerated Testing.........................................................72

Smith, Chapter 12 –..........................................................................................................................73

AS3960 Section 2.............................................................................................................................73

AS3960 Page 26...............................................................................................................................73

Self-Assessment Questions..............................................................................................................73

STUDY GUIDE 7 - Managing & Solving Reliability Problems........................................................75

O’Conner Chapter 12 – Analysing Reliability Data.........................................................................75

Introduction...................................................................................................................................75

Pareto Analysis..............................................................................................................................75

Accelerated Test Data Analysis.....................................................................................................75

Reliability Analysis of Repairable Systems..................................................................................75

CUSUM Charts..............................................................................................................................76

Exploratory Data Analysis and Proportional Hazards Modelling.................................................77

Reliability Demonstration..............................................................................................................77

Combining Results Using Bayesian Statistics...............................................................................77

Non-Parametric Methods...............................................................................................................77

Reliability Growth Modelling.......................................................................................................78

O’Conner, Cautionary Note, page 22...............................................................................................79

Smith, Chapter 3...............................................................................................................................79

Study Guide 7: Self Assessment Questions......................................................................................79


STUDY GUIDE 1 – Introduction to Reliability

Objectives

Start developing answers to the following key questions:

What is the practical significance of different Assurance Technologies?

How are reliability issues addressed at each stage in the life cycle of a product or project? Are there any shifts in focus? Are there nonetheless underlying principles that should remain unshakeable through the life cycle?

How might the primary cost drivers at the different stages in the life cycle be identified?

Can system effectiveness and the life cycle costs be optimised or at least rationalised?

At what stage in the life cycle can we most profitably focus our reliability efforts?

Study Guide 1 Notes

Assurance Technologies

We are interested in assuring ourselves that a product will perform within specified parameters during its life cycle.

(Life Cycle

The life cycle of a product typically includes the following phases:

Concept

Research & development

Full scale development

Production

Operation and Support, and

Disposal

At Caterpillar (Cat), this is called NPI (New Product Introduction).

In most cases, 50-80% of the total costs are incurred during the operation and support phases, which makes them an important focus for the control of costs and losses.

New Product Introduction

https://npi.cat.com/

The New Product Introduction process simply builds on the 6 Sigma product and process creation methodology, DMEDI, with which most Caterpillar employees are familiar. DMEDI methodology is embedded within the NPI process, so any NPI program that follows the NPI process meets 6 Sigma criteria. The NPI process is structured into phases, like DMEDI, but includes more phases.

First, there is the Strategy phase, in which the groundwork is laid for all future decision-making and all relevant strategies are aligned. Customers must be identified and segmented, and the program charter must be drafted. The NPI program is registered and officially launched, and the NPI team is commissioned at the very end of this phase.


Second, there is the Concept phase, in which the program elements outlined by the strategy team are refined and solidified. The Concept phase is divided into three sub-phases, each of which corresponds to a similarly named part of the DMEDI process: Define, in which the program charter is refined by the newly commissioned NPI team and the alignment of relevant strategies is reaffirmed; Measure, in which the market and customer research is completed and prioritized; and Explore, in which the market and customer research is made tangible through high-level conceptual designs of processes and product features.

Third, there is the Development phase, in which the high-level designs are further developed and verified. The Development phase, which corresponds with the DMEDI Develop phase, is divided into two sub-phases: Design, in which the detailed designs are constructed, and Verify, in which the detailed designs are confirmed to meet program requirements. These sub-phases are not sequential steps but continually cycling steps in which the designs are verified using the appropriate combination of virtual and physical processes into an ever improving product.

Fourth, there is the Pilot phase, in which the verified designs are further validated through pilot testing in actual customer situations, processes are validated and preparations for production are made.

Finally, there is the Production phase, in which the product is produced, delivered and supported worldwide.)

The key aim is to gain assurance about product performance. A family of processes generally known as assurance technologies can establish the assurance we require. The primary assurance technologies include:

Quality Assurance

This includes all those planned and systematic actions necessary to provide adequate confidence that a product or service will satisfy given requirements for quality.

Human Factors Engineering

Sometimes called Ergonomics. The goal is to optimise the man-machine interface. Humans are required to operate and maintain machines, and their ability to detect and respond to failure conditions must be taken into account in the design of the product.

System Analysis

Product Safety

The chances of safety-related incidents must be eliminated, for example those that might be caused by misuse or design oversights. We are trying to eliminate design-induced defects.

Logistic Engineering

Includes the support-related activities that deal with system design and development. It covers the support of the primary equipment and the support infrastructure.

Maintainability Analysis

Maintainability is the ease with which a machine can be repaired, i.e. how long it takes to repair. Analysis includes the assessment of accessibility, interchangeability, modularity, standardisation, operator and maintainer requirements, test and maintenance requirements, spares provisioning, and maintenance policy. Refer AS3969-1990 Para 2.2.3.(m)


Reliability Analysis

Reliability is how often a machine needs maintenance or repair. Reliability depends upon product design, component quality control, manufacturing processes and maintenance skills. Reliability analysis aims to lower the product failure rate over its life and reduce warranty costs. Early analysis of hardware or software requirements and a clear understanding of requirements

Reliability Fundamentals

Two terms that are often used in describing Reliability are Failure Rate and Hazard Rate. Others are MTTF and MTBF.

Reliability is the probability that the product will perform a specified function for a specified operating interval under a specified set of conditions. The important criteria are thus probability, function, interval and conditions. It is important that these four criteria are defined and quantified; otherwise reliability cannot be described.

Failure rate is the number of failures per unit time, and it can change over the life of the product.

Mean Time To Failure - MTTF is used to measure the average life of an item that is not usually repaired (e.g. light bulb, circuit board). Note that this is an average life and is often subject to wide variation.

Mean Time Between Failures - MTBF is used to measure the average life of an item that is usually repaired. When the failure rate is constant, MTBF is the reciprocal of the failure rate.

E.g. analysis of data shows 20 failures during 10000 operating hours.

Failure rate (lambda) = 20/10000 = 0.002 failures per hour

MTBF = 1/lambda = 1/0.002 = 500 hours between failures
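The arithmetic above can be sketched in a few lines of Python. The function names are illustrative only, and the reliability calculation assumes a constant failure rate (an exponential model), which is an added assumption and will not hold for every item.

```python
import math


def failure_rate(failures: int, operating_hours: float) -> float:
    """Observed failure rate (lambda) in failures per operating hour."""
    return failures / operating_hours


def mtbf(rate: float) -> float:
    """MTBF as the reciprocal of a constant failure rate."""
    return 1.0 / rate


def reliability(rate: float, hours: float) -> float:
    """R(t) = exp(-lambda * t); valid only under the constant-rate assumption."""
    return math.exp(-rate * hours)


lam = failure_rate(20, 10_000)   # data from the worked example above
print(f"lambda = {lam:.4f} failures per hour")
print(f"MTBF   = {mtbf(lam):.0f} hours")
print(f"R(100) = {reliability(lam, 100.0):.3f}")
```

Under these assumptions, the probability of surviving an interval equal to the MTBF is exp(-1), or about 37%, which illustrates why MTBF should not be read as a guaranteed life.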

System Effectiveness

When assessing a system, the fundamental principle is that the parts should be optimised as a composite set, not as individual parts.

System Effectiveness is the probability that a system can effectively meet an operational demand within a given time when operated under specific conditions. It is usually considered in terms of technical performance, capability, availability and dependability.

Capability is a measure of how well a product performs.

Availability is the probability that the product is ready for use when needed.

Dependability is the probability of successful performance.

Durability is the point at which the system's wear-out failures start to increase.

Reliability is an inherent characteristic of design and cannot be altered without modification to the design. Additional maintenance cannot make the system more reliable; it will simply make the system more dependable

Quality And Reliability

Quality can be described as a range of attributes inherent in a product, all of which influence its ability to satisfy stated or implied needs. Reliability is just one attribute that impacts on most aspects of the product throughout its life.


Determination Of Cost Drivers

There are many factors that impact on system cost. Reliability determines how often a part will need repair, and maintainability will determine how long the maintenance will take. It is these two factors that will determine the cost of maintenance.

Introduction To Cost Effective Analysis

There are three key requirements for developing cost effective analysis.

1. The systems being evaluated must meet the same objectives

2. At least two feasible solutions must exist

3. Sufficient reliable data about the systems must be available to perform the analysis.

The following steps form a standardised approach to analysis:

Clearly define all goals

Determine evaluation criteria

Select basis for selection (fixed cost or fixed effectiveness)

Prepare analysis report, with the following main headings:

Objectives

Assumptions

Evaluation Criteria

Analysis Techniques

Conclusion

Recommendations

Profitability Of Reliability

Profit = Revenue – Expense

Profit margins are sensitive to reliability-related costs, for example:

Maintenance costs

Inventory holding costs

Warranty

Product recall

Product reject

Down time

Product liability


O’Conner Chapter 1 - Introduction to Reliability Engineering

The simplest view of reliability is that in which a product is assessed against a specification or set of attributes, and when passed is delivered to the customer. The customer, having accepted the product, accepts that it might fail at some future time. Reliability is usually concerned with failures in the time domain, which leads to the need for a time-based concept of quality. This distinction marks the difference between quality control and reliability engineering. Whether a failure occurs or not, and its time to occurrence, can seldom be forecast accurately. Reliability is therefore an aspect of engineering uncertainty. Whether an item will work or not is usually answered as a probability. Thus the usual engineering definition of reliability is:

"The probability that an item will perform a required function without failure under stated conditions for a stated period of time".

"Durability" is a particular aspect of reliability, related to the ability to withstand the effects of time-dependent mechanisms such as fatigue, wear and corrosion. Durability is usually expressed as a minimum time before the occurrence of wear-out failures.

Mathematical and statistical methods can be used for quantifying and analysing reliability data despite significant uncertainty. In practice, the uncertainty is often in orders of magnitude, and appreciation of the uncertainty is important in order to minimise the chances of performing inappropriate analysis and generating misleading results.

Variability and chance play a vital role in determining the reliability of most products. Basic parameters like mass, dimensions, friction coefficients, strengths and stresses are never absolute, but are in practice subject to variability due to process and material variations, human factors and applications.

Understanding the laws of chance and the causes and effects of variability is therefore necessary for the creation of reliable products and for the solution of problems of unreliability.

Why Do Engineering Items Fail?

Knowing the potential causes of failure is fundamental to preventing them.

1. The design might be inherently incapable. It might be too weak, consume too much, suffer resonance etc.

2. The item might be overstressed in some way. If the applied stress exceeds the strength then failure will occur. Factors of safety and de-rating are two methods of providing some margin between the strength of the component and the applied stress.

3. Failures might be caused by variation. In the situations above, the values of strength and load are fixed and known. If, for example, the known load never exceeds the known strength, then failure will not occur. However, in most cases there is uncertainty about both. The actual strength of the population of components will vary, and the loads may be variable. If there is an overlap between the distributions of load and strength, then there is potential for failure to occur.

4. Failures can be caused by wear out. This includes any mechanism or process that causes an item to degrade or become weaker with age.

5. Failures can be caused by other time-dependent mechanisms, such as battery run down, creep caused by temperature and applied stress, and drift of electronic components.

6. “Sneaks” can cause failures. A "sneak" is a condition where the system does not work properly, even though every part does. Sneaks can occur in software designs.


7. Failures can be caused by errors, such as incorrect specifications, designs or software coding, faulty assembly or test, inadequate or incorrect maintenance, or incorrect use.

8. There are many others, such as noisy parts, leaks, incorrect instructions, electromagnetic interference etc.
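The overlap between load and strength distributions described in item 3 can be illustrated with a minimal sketch. It assumes, purely for illustration, that load and strength are independent and normally distributed; the function names and the numbers are hypothetical and are not taken from the text.

```python
import math


def normal_cdf(z: float) -> float:
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def interference_failure_probability(
    mean_strength: float, sd_strength: float,
    mean_load: float, sd_load: float,
) -> float:
    """P(load > strength) for independent, normally distributed load and strength.

    The margin (strength - load) is then normal with mean equal to the
    difference of the means and variance equal to the sum of the variances.
    """
    margin_mean = mean_strength - mean_load
    margin_sd = math.hypot(sd_strength, sd_load)
    return normal_cdf(-margin_mean / margin_sd)


# Illustrative values only (e.g. in MPa): the mean margin is three
# combined standard deviations, so a small overlap remains.
p_fail = interference_failure_probability(500.0, 40.0, 350.0, 30.0)
print(f"Probability of failure per load application: {p_fail:.5f}")
```

Even with a nominal safety factor well above 1, the probability of failure is not zero once variability is taken into account, and it rises quickly as either distribution widens.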

Probabilistic Reliability

The concept of reliability as a probability means any attempt to quantify it must involve statistical methods. Reliability statistics are concerned with reliability values that are very high or very low.

Quantifying such numbers brings increased uncertainty, since we need correspondingly more information. The application of statistics in reliability is less straightforward than in other areas. In reliability we are concerned with the behaviour in the extreme tails of distributions, where variation is hard to quantify and data is expensive. Further difficulties arise in application of statistics owing to the fact that variation is often a function of time (cycles, seasons, maintenance periods etc). Therefore, the reliability data from any past situation cannot be used to make credible forecasts of the future behaviours, without taking into account non-statistical factors such as design changes, maintainer training, or even production or service problems. The statistician working in reliability engineering needs to be aware of these realities.

Repairable and Non Repairable Items

It is important to distinguish between repairable and non-repairable items when predicting or measuring reliability.

For a non-repairable item such as a light bulb, reliability is the survival probability over the item's expected life or for a period during its life when only one failure can occur. During an item's life the instantaneous probability of the first and only failure is called the hazard rate. When a part fails in a non-repairable system, the system fails, and the system reliability is a function of the time to the first part failure.

For repairable items, reliability is the probability that failure will not occur in the period of interest, when more than one failure can occur. It can be expressed as the failure rate or rate of occurrence of failure (ROCOF). The failure rate expresses the instantaneous probability of failure per unit time when several failures can occur in a time continuum.

Repairable system reliability can be described by the Mean Time Between Failure (MTBF), but only under the particular condition of constant failure rate. We are also concerned with the availability of the repairable item, since repair takes time. Availability is affected by the ROCOF and by maintenance time. We therefore need to understand the relationship between reliability and maintenance, and how reliability and maintainability affect availability.
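The link between reliability, maintainability and availability can be sketched with the commonly used steady-state relationship A = MTBF / (MTBF + MTTR). MTTR (mean time to repair) is not defined in these notes and is introduced here as an assumption, as are the function name and the example figures.

```python
def steady_state_availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Long-run fraction of time the item is available for use.

    Assumes the item alternates between operating periods averaging
    mtbf_hours and repair periods averaging mttr_hours.
    """
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTBF must be positive and MTTR non-negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Illustrative figures only: the same MTBF with two different repair times.
for mttr in (10.0, 50.0):
    a = steady_state_availability(500.0, mttr)
    print(f"MTBF 500 h, MTTR {mttr:.0f} h -> availability {a:.3f}")
```

The sketch shows that availability can be improved either by failing less often (higher MTBF) or by repairing faster (lower MTTR).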

Non-Repairable Items

There are three ways the pattern of failures can change with time.

1. The hazard rate may be decreasing

2. The hazard rate may be constant

3. The hazard rate may be increasing

The hazard function $h(t)$ is a function such that the probability that an item which has survived to age $t$ fails in the small interval $t$ to $t + dt$ is $h(t)\,dt$. This is the function, known loosely as the “failure rate”, which is represented in the bathtub curve (BTC). So $h(t) = f(t)/R(t)$, where $f(t)$ is the probability density function and $R(t)$ is the reliability function.
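For reference, the hazard function can be related to the other reliability functions as follows. These are standard results rather than material from the notes; $F(t)$ denotes the cumulative distribution function and $\lambda$ a constant hazard rate, both introduced here for illustration.

```latex
\begin{align}
F(t) &= 1 - R(t), \qquad f(t) = \frac{dF(t)}{dt} = -\frac{dR(t)}{dt} \\
h(t) &= \frac{f(t)}{R(t)} = -\frac{d}{dt}\ln R(t) \\
R(t) &= \exp\left(-\int_0^t h(u)\,du\right) \\
h(t) &= \lambda \;\Rightarrow\; R(t) = e^{-\lambda t}
\end{align}
```

The last line shows that a constant hazard rate corresponds to the exponential reliability function.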

Constant hazard rate is characteristic of failures caused by loads in excess of the design strength, at a constant average rate. For example, overstress failures or maintenance-induced failures typically occur randomly and at a generally constant rate. Material fatigue brought about by strength deterioration due to cyclic loading is a failure mode which does not occur for a finite time, and then exhibits an increasing probability of occurrence. Decreasing hazard rates are observed in items that become less likely to fail as their survival time increases. This is often observed in electronic parts. The combined effect generates the so-called bathtub curve. This shows an initial decreasing hazard rate or infant mortality period, an intermediate useful life period, and a final wear-out period.
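The three hazard-rate patterns can be illustrated with a short sketch. It assumes a Weibull model, in which the shape parameter beta selects the pattern (beta < 1 decreasing, beta = 1 constant, beta > 1 increasing) and eta is the scale parameter; the function name and parameter values are illustrative only.

```python
def weibull_hazard(t: float, beta: float, eta: float) -> float:
    """Weibull hazard rate h(t) = (beta / eta) * (t / eta) ** (beta - 1), t > 0."""
    if t <= 0 or beta <= 0 or eta <= 0:
        raise ValueError("t, beta and eta must all be positive")
    return (beta / eta) * (t / eta) ** (beta - 1.0)


eta = 1000.0  # illustrative scale parameter, in hours
patterns = [("decreasing", 0.5), ("constant", 1.0), ("increasing", 3.0)]
for label, beta in patterns:
    rates = [weibull_hazard(t, beta, eta) for t in (100.0, 500.0, 2000.0)]
    print(f"{label:<10} beta={beta}: " + ", ".join(f"{r:.2e}" for r in rates))
```

A single Weibull distribution produces only one of the three patterns; the full bathtub shape arises from the combination of different failure modes over the life of the item.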

Repairable Items

ROCOF can also vary with time, and important implications can be derived from these trends.

A constant failure rate (CFR) is indicative of externally induced failures, as in the constant hazard rate situation for non-repairable items. A CFR is also typical of complex systems subject to repair and overhaul, where different parts exhibit different patterns of failure with time and parts have different ages due to repair or replacement.

Repairable systems can also show a decreasing failure rate (DFR) when reliability is improved by progressive repair, as defective parts, which fail relatively early, are replaced by good parts. An increasing failure rate (IFR) occurs in repairable systems when wear-out failure modes of parts begin to predominate.

Development of Reliability Engineering

Reliability engineering originated as a separate engineering discipline in the USA in the 1950s. The increasing complexity of military electronic systems was generating failure rates that resulted in greatly reduced availability and increased costs.

Against this background, the US DoD and the electronics industry jointly set up the Advisory Group on Reliability of Electronic Equipment (AGREE) in 1952. The AGREE report concluded that disciplines must be laid down as integral activities in the development cycle for electronic equipment. The report also recommended that formal demonstrations of reliability, in terms of statistical confidence of MTBF, be instituted as a condition for acceptance of equipment by the procuring agency. AGREE testing soon became an accepted practice.

It became evident that designers, often working at the fringes of advanced technology, could not produce highly reliable equipment without it being subjected to a test regime to show up its weaknesses. The US DoD released the AGREE report as MIL-STD 781 "Reliability Qualification and Production Approval Tests". Engineering reliability effort in the USA developed quickly, and the AGREE and reliability program concepts were adopted by NASA and other suppliers/purchasers of high tech equipment.

In 1965, the DoD issued MIL-STD 785 "Reliability Programs for Systems and Equipment". This document mandated a program of reliability engineering, as it was realised that potential problems would then be detected and eliminated at the earliest and therefore cheapest stage in the development cycle. The concept of life cycle cost (LCC) originated around this time. In the UK, Defence Standard 00-40 "The Management of Reliability and Maintainability" and BS5760 "Guide on Reliability of Systems, Equipment and Components" were issued. In the 1980s, the reliability of Japanese industrial and commercial goods took Western competitors by surprise. The Japanese "quality revolution" had been driven by lessons from American teachers Juran and Deming in the post-WWII recovery, and their teachings were based on those of Peter Drucker.

Reliability As An Effectiveness Parameter

Reliability Programme Activities

What actions can managers and engineers take to influence reliability?


Quality Control is essential, and often sufficient to ensure reliability of simple products, such as matches, or simple die-castings. Risks are low when safety margins can be made high, such as in structural engineering.

Formal reliability programs are necessary when risks are high. Risk normally increases in proportion to complexity or number of components in a system.

A reliability program must begin at the earliest (conceptual) phase of a project. It is at this stage that fundamental decisions are made that significantly affect reliability.

As development proceeds from initial to detailed design, risks are controlled by a formal documented approach to the review of design and imposition of design rules relating to components, materials, de-rating, tolerance etc.

The program continues through initial prototype manufacturing and test stages, by planning and executing tests to generate confidence in the design.

During production, QC ensures that the proven design is repeated.

Throughout the product life cycle, performance is fed back to generate corrective action and to provide data and guidelines for the future.

Reliability Economics And Management

Since less than perfect reliability is the result of failures, and all failures have causes, we should ask, "What is the cost of preventing or correcting the cause, compared with the cost of doing nothing?"

It is not easy to quantify the effects of a given reliability program. However, Deming (Out of the Crisis) showed that effort expended on a reliability program is an investment, as demonstrated by the success of the companies that adopted this teaching.

There are three kinds of engineering product:

1. Intrinsically reliable components (electronic components, mechanical non-moving components, software).

2. Intrinsically unreliable components (light bulbs, turbine blades, parts in contact like gears and bearings).

3. Systems (of many components and interfaces with many possibilities for failure to occur.)

The essential points are:

1. Failures are caused primarily by people (designers, suppliers, assemblers, users, maintainers). Therefore, the achievement of reliability is essentially a management task.

2. Reliability and quality are not separate functions that can by themselves ensure the prevention of failures.

3. There is no fundamental limit to the extent to which failures can be prevented. We can design and build for ever-increasing reliability.

Smith Chapter 1 - The history of reliability and safety technology

Since no human activity can enjoy zero risk, and no equipment a zero rate of failure, there has grown a safety technology for optimizing risk.

This attempts to balance the risk against the benefits of the activities and the cost of further risk reduction.


Definitions

Reliability Probability Density Function

The distribution of reliability values (for a PDF, the distribution of whatever value is in question). If we measure a large number of points and progressively reduce the measurement interval, the frequency histogram tends to a curve that describes the population probability density function (pg 32). The pdf is a plot such that the area under the curve between any two ages is equal to the probability that a new item fails in the given age interval. (This differs from the bathtub curve (BTC), in which the probability of failure is conditional on the item having survived to the current age.)

Reliability Function

Corresponds to the probability that an item will survive to any given age.

Distribution Function (= cumulative distribution function, cdf)

The probability of failure at or before age t.

ETA Value

Weibull scale parameter, also known as the characteristic life: the age by which 63.2% of the population has failed. For an exponential distribution, which is a model for random events, 63.2% is the proportion of the population that has failed by the mean life.

Suspended item

When a test is run and ceases before a given item fails, it is a suspended item.

Random Failure

Beta = 1 in a Weibull distribution. As the item’s age increases there is no increasing risk of failure, so the component should only be replaced on failure. This is typical of many electronic components, where the risk of failure is constant over their lifetime.

Hazard Function

The hazard function is the failure rate: more specifically, h(t) dt is the probability that an item that has survived to age t fails in the small interval t to t + dt. This is the function that is represented by the bathtub curve (BTC).

Life Cycle Cost

Sum of the acquisition, ownership and disposal costs for a product over its entire life cycle.
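Several of the definitions above can be checked numerically with a small sketch. It assumes a two-parameter Weibull distribution with shape beta and scale eta; the function names and values are illustrative and are not taken from the sources.

```python
import math


def weibull_reliability(t: float, beta: float, eta: float) -> float:
    """Reliability function R(t): probability of surviving to age t."""
    return math.exp(-((t / eta) ** beta))


def weibull_cdf(t: float, beta: float, eta: float) -> float:
    """Distribution function F(t): probability of failure at or before age t."""
    return 1.0 - weibull_reliability(t, beta, eta)


def weibull_pdf(t: float, beta: float, eta: float) -> float:
    """Probability density function f(t), for t > 0."""
    return (beta / eta) * (t / eta) ** (beta - 1.0) * weibull_reliability(t, beta, eta)


eta = 1000.0  # illustrative characteristic life, in hours
for beta in (0.5, 1.0, 3.0):
    failed_at_eta = weibull_cdf(eta, beta, eta)
    hazard_at_500 = weibull_pdf(500.0, beta, eta) / weibull_reliability(500.0, beta, eta)
    print(f"beta={beta}: F(eta)={failed_at_eta:.3f}, h(500)={hazard_at_500:.2e}")
```

Whatever the value of beta, the fraction failed at t = eta is 1 - exp(-1), or about 63.2%, which is why eta is called the characteristic life. The hazard is obtained here as f(t)/R(t), in line with the hazard function definition.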

Failure Data

Reliability growth (reliability improvement) arising as a natural consequence of the analysis of failure has long been a central feature of product development. "Test and correct" was practiced long before the development of formal processes for data collection and analysis, because failure is usually self-evident and inevitably leads to design modifications. Nineteenth- and early twentieth-century designs were less constrained by the cost and schedule pressures of today. In many cases, reliability was the result of over-design. Quantified reliability assessment was not needed, so failure rates were not required, and consequently there was little incentive for the formal collection of failure data. The advent of the electronic age, and the experience of poor field reliability of military equipment in the 1940s and 1950s, led to the need for more complex mass-produced component parts. This gave rise to the collection of failure information from the field and from the interpretation of test data. This activity was stimulated by the development of reliability prediction techniques that require component failure rates as inputs to the prediction equations.


Hazardous Failures

In the 1970s, process plants with large inventories of hazardous materials realised that the process of learning from mistakes was no longer acceptable. Methods were developed for identifying hazards and for quantifying the consequences of failure. These evolved largely to assist decision-making when developing or modifying plant. Pressure to identify and quantify risk came later. The techniques for quantifying the predicted frequency of failures had previously been applied mostly in the domain of availability, where the cost of failure was the prime concern. These techniques are now used in the field of hazard assessment.

Reliability and Risk Prediction

The subject of reliability prediction, based on the concept of validly repeatable component failure rates, has become controversial. First, the extremely wide variability of failure rates of allegedly identical components under supposedly identical environmental and operating conditions is now acknowledged. The apparent precision of the reliability prediction tool is thus not compatible with the accuracy of the failure rate parameter. As a result, it can be concluded that simple assessments of rates and the use of simple models suffice. In any case, more accurate predictions can be misleading and a waste of time and money.

The main benefit in reliability prediction of complex systems lies not in the absolute figure but in the ability to repeat the assessment for different repair times, redundancy arrangements in the design configuration and different values of component failure rates. Thus judgements can be made on the basis of relative predictions with more confidence than can be placed on the absolute values.

In practice, prediction addresses the component-based "design reliability", and it is necessary to take account of additional factors when assessing the integrity of the system.

"Design reliability" is likely to be the figure suggested by a prediction exercise; however, there will be many sources of failure in addition to the simple random hardware failures predicted in this way. Thus the "achieved reliability" of a new product or system is likely to be an order of magnitude or more lower than the "design reliability". Reliability growth is the improvement that takes place as modifications are made as a result of field failure information. A well-established design with tens of thousands of field hours might start to approach the "design reliability".

As a result, since systematic failures cannot necessarily be quantified, two separate approaches might be taken side by side:

1. Quantitative assessment - predict frequency of hardware failures.

2. Qualitative assessment - attempt to minimise the occurrence of systematic failures (e.g. software) by applying a variety of defences and design disciplines appropriate to the severity of the target.

Achieving Reliability and Safety-Integrity

If we try to identify the characteristics of design or construction which have secured longevity, then three factors emerge:

1. Complexity: the fewer the components and the fewer the types of materials involved, then the greater the likelihood of a reliable item.

2. Duplication/Replication: the use of additional, redundant parts, whereby a single failure does not cause overall system failure, is a frequent method of achieving reliability.

3. Excess strength: deliberate design to withstand stresses higher than anticipated will reduce failure rates. Modern commercial pressures lead to the optimisation of tolerance and stress margins, which just meet the functional spec.
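
The first two factors can be illustrated numerically. The sketch below assumes independent components whose reliabilities over some fixed mission are known; the function names and the values used are hypothetical. Excess strength is not modelled here.

```python
def series_reliability(component_reliabilities):
    """Every component must work: multiply the individual reliabilities."""
    result = 1.0
    for r in component_reliabilities:
        result *= r
    return result


def parallel_reliability(component_reliabilities):
    """At least one component must work: 1 minus the product of unreliabilities."""
    all_fail = 1.0
    for r in component_reliabilities:
        all_fail *= (1.0 - r)
    return 1.0 - all_fail


if __name__ == "__main__":
    # Complexity: more components in series lowers reliability.
    print(series_reliability([0.99] * 5), series_reliability([0.99] * 10))
    # Duplication: two redundant units, each 0.9 reliable.
    print(parallel_reliability([0.9, 0.9]))
```

With these assumed values, ten components in series are less reliable than five, while duplicating a 0.9-reliable unit raises the combined figure to about 0.99.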

The last two methods are costly, and the cost of reliability improvements needs to be paid for by a reduction in failure rates or a reduction in operating cost. We see therefore that reliability and safety are "built-in" features of a product.

Maintainability also contributes since it is the combination of failure rate and repair/down time. Achieving reliability, safety and maintainability results from activities in three main areas:

1. Design: complexity, duplication, stress, testing, feedback

2. Manufacture: materials, methods, standards

3. Field use: operation, maintenance, feedback

RAMS Cycle

The loops shown in Figure 1.2 represent RAMS activities as follows:

1. Review of the system RAMS feasibility calculations against the initial RAMS targets

2. Review of the conceptual design RAMS predictions against the RAMS target

3. Review of the detailed design against the RAMS target

4. Review of the RAMS test, at the end of design and development, against the requirements

5. Review of the acceptance demonstration against requirements

6. Review of the field RAMS performance against the targets

Contractual Pressures

It is now common for reliability parameters to be specified in tender and other contractual documents. There are problems arising from:

Ambiguity of definition

Hidden statistical risks

Inadequate coverage of the requirements

Unrealistic requirements

Unmeasurable requirements

Smith Chapter 2 - Understanding terms and jargon

Defining Failure And Failure Modes

Failure: Non-conformance to some defined performance criterion

Quality: Conformance to specification

Reliability: The probability that an item will perform a required function under stated conditions for a stated period of time. Reliability is the extension of quality into the time domain.

Maintainability: The probability that a failed item will be restored to operational effectiveness within a given period of time when the repair action is performed in accordance with the prescribed procedures.

Failure Rate And MTBF

The observed failure rate: For a stated period in the life of an item, the ratio of the total number of failures to the total cumulative observed time. Failure rate is only meaningful for situations where it is constant. Most failure rates are stated to two significant figures. It is seldom justified to exceed this level of accuracy.

The observed mean time between failures: For a stated period in the life of an item, the mean value of the length of time between consecutive failures, computed as a ratio of the cumulative observed time to the total number of failures.

The equality MTBF = 1/failure rate must be treated with caution since it is inappropriate to compute failure rate unless it is constant.

The observed mean time to fail: For a stated period in the life of an item the ratio of cumulative time to the total number of failures.

Mean Life: The mean of the times to failure where each item is allowed to fail over the entire life period.

Interrelationship Of Terms

Bathtub Distribution

1. Decreasing Failure Rate (infant mortality, burn-in, early failures): Related to manufacture (welds, joints, connections, dirt, impurities, cracks)

2. Constant Failure Rate (random failures, useful life, stress-related failures, stochastic failures): Assumed to be stress related, random fluctuations of stress exceeding component strength.

3. Increasing Failure Rate (wear out failures): Owing to corrosion, oxidation, breakdown of insulation, atomic migration, friction wear, shrinkage, fatigue.

Down Time And Repair Time

Elements of downtime and repair time:

a) Realisation Time

b) Access Time

c) Diagnosis Time

d) Spare Part Procurement

e) Replacement Time

f) Checkout Time

g) Alignment Time

h) Logistic Time

i) Administrative Time

Availability, Unavailability And Probability Of Failure On Demand

A = MTBF / (MTBF + MTTR) is known as the steady state availability.

Usually it is more convenient to use Unavailability or Probability of Failure on Demand (PFD):

PFD = 1 – A
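
The two relations above can be evaluated directly. The following is a minimal sketch; the function names and the MTBF and MTTR values are hypothetical and chosen only to show the arithmetic.

```python
def steady_state_availability(mtbf_hours, mttr_hours):
    """A = MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def probability_of_failure_on_demand(mtbf_hours, mttr_hours):
    """PFD = 1 - A, i.e. the steady-state unavailability."""
    return 1.0 - steady_state_availability(mtbf_hours, mttr_hours)


if __name__ == "__main__":
    # Hypothetical item: MTBF of 4000 h and MTTR of 8 h.
    a = steady_state_availability(4000.0, 8.0)
    pfd = probability_of_failure_on_demand(4000.0, 8.0)
    print(f"A = {a:.6f}, PFD = {pfd:.6f}")
```

For highly available items the PFD is the more convenient figure because it avoids writing long strings of nines.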

Hazard And Risk Related Terms

Hazard: potential for injury or fatality to occur

Risk: probability of an event occurring, in conjunction with the consequence (severity)

Hazard rate is the failure rate at any instant in time, that is, the rate at which items that have survived to that instant are failing. In practical terms, for a mature system, the hazard rate and constant failure rate are assumed to be equal. The difference is that failure rate is an average over a stated period, whereas hazard rate is instantaneous.

Choosing The Appropriate Parameter

Smith Chapter 3 - A cost-effective approach to quality, reliability and safety

Reliability And Cost

Total costs incurred over the period of ownership of equipment are often referred to as Life-Cycle Costs. These can be separated into:

Acquisition costs

Ownership Costs

Operating Costs

Administration Costs

They will be influenced by:

Reliability - determines the frequency of repair

Maintainability – affects training, equipment, downtime, and manpower

Safety Factors – affect operating efficiency and maintainability.

Costs of carrying out RAMS-cycle predictions will usually be small compared with the potential safety or life cycle cost savings.

The cost of carrying out RAMS prediction activities is of the order of 3-5% of total project cost. It is credible that the assessment procedure will lead to savings that exceed this outlay.

Costs And Safety

Once a hazardous event has been assessed, the costs of measures to reduce that risk are inevitably considered. If the risk is sufficiently low, then reduction in risk for a given expenditure can be examined to see if it can be justified. At this point, the concept of As Low As Reasonably Practicable (ALARP) becomes relevant.

Cost Of Quality

Attempts to set budget levels for various elements of quality costs are rare. Quality costs can be grouped under three headings:

1. Prevention Costs

- Design review

- Quality and reliability training

- Vendor quality planning

- Audits

- Installation prevention activities

- Product qualification

- Quality engineering

2. Appraisal Costs

- Test and inspect

- Maintenance & calibration

- Test equip depreciation

- Line quality engineering

- Installation testing

3. Failure Costs

- Design changes

- Vendor rejects

- Rework

- Scrap & material renovation

- Warranty

- Commissioning failures

- Fault finding in test

Study Guide 1 – Self Assessment Questions

1. What are the key elements in the reliability definition?

Reliability can be defined as: “The probability that a product will perform a specified function for a specified operating interval under a specified set of conditions.”

The key elements are thus probability, function, interval and conditions, and they need to be defined and quantified otherwise reliability cannot be adequately described.

2. Failure rate is defined as?

Failure rate is the number of failures per unit time and its subsequent change over the life of a product.

3. Maintenance data shows that a component has 25 failures during the last 100,000 system operating hours. The MTBF for the component is?

The Mean Time Between Failures can be calculated by simply dividing the total operating hours by the number of failures.

In this case, 100000 hours/25 failures = 4000 hours, or stated in plain English, on average, we will experience a failure of this component every 4000 system operating hours.
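
The same calculation can be written as a short Python sketch. The function names are hypothetical; the figures are those given in the question, and the failure rate line assumes a constant failure rate so that it is simply the reciprocal of the MTBF.

```python
def observed_mtbf(total_operating_hours, failures):
    """Observed MTBF: cumulative observed time divided by number of failures."""
    if failures <= 0:
        raise ValueError("at least one failure is needed to compute an observed MTBF")
    return total_operating_hours / failures


def observed_failure_rate(total_operating_hours, failures):
    """Observed failure rate: number of failures divided by cumulative observed time."""
    return failures / total_operating_hours


if __name__ == "__main__":
    print(observed_mtbf(100_000, 25))          # hours per failure
    print(observed_failure_rate(100_000, 25))  # failures per hour
```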

4. The failure rate of equipment will most likely vary in three distinct phases during its life. What are these phases?

1. Decreasing Failure Rate (Weibull shape factor <1) (also known as infant mortality, burn-in, early failures): related to manufacture (welds, joints, connections, dirt, impurities, cracks).

2. Constant Failure Rate (Weibull shape factor = 1) (also known as random failures, useful life, stress-related failures, stochastic failures): assumed to be stress related, random fluctuations of stress exceeding component strength.

3. Increasing Failure Rate (Weibull shape factor >1) (also known as wearout failures): corrosion, oxidation, breakdown of insulation, atomic migration, friction wear, shrinkage, fatigue.
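
The link between the Weibull shape factor and the three phases can be checked with the standard two-parameter Weibull hazard function, h(t) = (shape/scale) * (t/scale)^(shape - 1). The notes do not state this formula; it is supplied here as an assumption, and the function name, scale value and times are hypothetical.

```python
def weibull_hazard(t_hours, shape, scale_hours):
    """Two-parameter Weibull hazard rate at time t (t must be positive)."""
    if t_hours <= 0:
        raise ValueError("time must be positive")
    return (shape / scale_hours) * (t_hours / scale_hours) ** (shape - 1.0)


if __name__ == "__main__":
    scale = 1000.0  # hypothetical characteristic life in hours
    for shape in (0.5, 1.0, 3.0):
        early = weibull_hazard(100.0, shape, scale)
        late = weibull_hazard(1000.0, shape, scale)
        print(f"shape={shape}: h(100)={early:.2e}, h(1000)={late:.2e}")
```

A shape factor below 1 gives a hazard that falls with time, a shape factor of exactly 1 gives a constant hazard equal to 1/scale, and a shape factor above 1 gives a rising hazard.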

5. The ratio of tolerance to process variation is called? What is another name for this process variation?

This ratio is denoted Cp, and is called the process capability. If a product has a tolerance, and it is to be produced by a process which generates variation in the product, it is obviously important that the process variation be less than the tolerance.
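
A minimal sketch of the Cp calculation follows. It assumes the common convention that the process variation is taken as six sample standard deviations, which the notes do not state; the function name, limits and measurements are hypothetical.

```python
import statistics


def process_capability(upper_limit, lower_limit, measurements):
    """Cp = tolerance width / (6 * sample standard deviation)."""
    sigma = statistics.stdev(measurements)
    return (upper_limit - lower_limit) / (6.0 * sigma)


if __name__ == "__main__":
    # Hypothetical part: nominal 10.0 with tolerance limits 9.4 to 10.6.
    sample = [9.9, 10.0, 10.1]
    print(round(process_capability(10.6, 9.4, sample), 3))
```

A Cp above 1 indicates that the assumed process spread fits inside the tolerance; the sketch says nothing about whether the process is centred.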

STUDY GUIDE 2 – Reliability in Management and Quality Control

Objectives

Discuss:

1. Why management must view reliability as a key attribute of the inputs, processes and outputs of production

2. The close relationship between the methodologies of reliability and quality control

3. The costs and benefits of reliability and quality programs

AS2561-1982 Guide to the determination and use of quality costs

O’Connor Chapter 15 – Reliability Management

Corporate policy for reliability

A really effective reliability function can exist only in an organisation where the achievement of high reliability is recognised as part of the corporate strategy and is given top management attention. If not, reliability effort will be cut back whenever cost or time pressures arise.

Integrated reliability programmes

Reliability effort should be treated as an integral part of the product development, not a parallel activity unresponsive to the rest of the development program. This is the justification for putting reliability with the project manager.

Since production quality will be the final determinant of reliability, quality control is an integral part of the reliability program. Quality control cannot make up for design shortfalls, but poor quality can negate much of the reliability effort. QC can contribute effectively to reliability effort if:

1. QC procedures are related to factors that affect reliability

2. QC test and inspection data are integrated with other reliability data

3. QC personnel are trained to recognise the relevance of their work to reliability and trained and motivated to contribute

Reliability and Costs

Achieving high reliability is expensive, especially when the product is complex or contains untried technology. It requires trained engineers, management time, test equipment and products for testing, but there are practical limits on what can be spent. The earlier in the development program the failure mode is identified and corrected the cheaper it will be.

Obviously it is necessary to minimise the sum of quality and reliability costs over the longer term. Thus the immediate costs of prevention must be related to the anticipated effects on failure costs. Investment analysis related to Q & R is an uncertain business, because of the impossibility of accurately predicting and quantifying the results. Therefore the analysis should be performed using a range of assumptions to determine the sensitivity of the results to the assumed effects.
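
The idea of testing a range of assumptions can be sketched as follows. This is an illustration only: the simple cost model (prevention spend plus the failure cost remaining after an assumed fractional reduction), the function names and all the figures are hypothetical and are not taken from the source.

```python
def total_quality_cost(prevention_spend, baseline_failure_cost, assumed_reduction):
    """Prevention spend plus the failure cost left after the assumed reduction."""
    return prevention_spend + baseline_failure_cost * (1.0 - assumed_reduction)


def sensitivity(prevention_spend, baseline_failure_cost, reductions):
    """Total cost for each assumed fractional reduction in failure cost."""
    return {r: total_quality_cost(prevention_spend, baseline_failure_cost, r)
            for r in reductions}


if __name__ == "__main__":
    # Hypothetical: spend 50,000 on prevention against a 400,000 failure cost.
    for reduction, cost in sensitivity(50_000.0, 400_000.0, (0.10, 0.25, 0.50)).items():
        print(f"assumed reduction {reduction:.0%}: total cost {cost:,.0f}")
```

With these assumed figures the investment only pays for itself if the reduction in failure cost exceeds 12.5%, which is the kind of threshold a sensitivity analysis is meant to expose.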

Cost of unreliability

The cost of unreliability in service should be evaluated early in the development phase, so that the effort on reliability can be justified and requirements set, related to expected costs. There are other costs, such as goodwill and market share. These can be hard to quantify. In extreme cases unreliability can lead to litigation if damage or injury occurs.

Safety and Product Liability

Product liability was an outgrowth of the Ralph Nader campaigns in the USA, and it makes the manufacturer of a product liable for the consequences of failure of that product. A designer can now be held liable even if the product is old and the user did not maintain or operate it correctly. Claims can only be defended successfully when the producer demonstrates that all practical steps have been taken towards identifying and eliminating the risk.

Standards for reliability, quality and safety

US MIL-STD-785 – Reliability Programs for Systems and Equipments, Development and Production

Best known, covering all development programs in the US DoD.

UK Defence Standards 00-40 and 00-41

Cover reliability program management and methods for defence equipment.

BS5760

Published for commercial use.

ARMP-1

NATO standard on reliability and maintainability.

ISO-IEC60300 Dependability

Covers reliability, maintainability and safety (“dependability”). Describes management and methods related to product design and development. Covers reliability prediction, design analysis, reliability demonstration tests and mathematical/statistical techniques. Manufacturing is not included. The methods are inconsistent with modern best practice: in particular, the sections on reliability testing define rigid environmental and other conditions to be applied, and pass/fail criteria based on the statistical methods described and rejected in O’Connor Ch 2 and 12.

ISO9000 Quality Systems

Framework for assessing the “quality management system” which an organisation operates in relation to the goods or services produced. Developed from the US military standard MIL-Q-9858, which dates from the 1950s. Does not specifically address the quality of the products or services, nor prescribe how quality might be achieved. It describes, vaguely, the system that should be in place to assure quality. Registration cannot be taken as assurance of quality.

Specifying Reliability

How NOT to do it:

1. Do not write vague requirements, such as “reliable as possible”. Such statements do not provide assurance against reliability being compromised.

2. Do not write unrealistic requirements, such as “will not fail under the specified operating conditions”. An unrealistically high reliability requirement will not be accepted as a credible design parameter, and is therefore likely to be ignored.

The reliability specification must contain:

1. A definition of failure related to the product's function. The definition should cover all failure modes relevant to its function.

2. A full description of the environments in which the product will be stored, transported, operated and maintained.

3. A statement of the reliability requirement, and/or a statement of the failure modes and effects which are particularly critical and which must therefore have a very low probability of occurrence.

Definition of failure

Failure should always be related to a measurable parameter or a clear indication.

Environmental Specifications

The environment spec must cover aspects of the loads and other effects that can influence the product's strength or probability of failure.

Stating the reliability requirement

The reliability requirement should be specified in a way that can be verified and that makes sense in relation to the use of the product. The simplest requirement is that no failure will occur under the stated conditions. The requirement should not include statements on the s-confidence levels of the measured reliability. The requirement relates to the population; s-confidence levels apply to the results of tests or other limited sample data. S-confidence limits may be used for pass/fail decision-making and test planning, but should not be included with the requirement.

Reliability specifications based on life parameters must be framed in relation to the appropriate life distributions. Two common parameters are MTBF (when a constant failure rate is assumed), and B-life, related to Weibull life distributions. MTBF should not be specified if a constant failure rate assumption cannot be justified. This assumption can usually be made for complex, repairable systems. Otherwise a B-life should be specified.
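
A B-life can be computed from Weibull parameters as sketched below. The sketch assumes the usual definition of a B-life (the B10 life is the time by which 10% of the population is expected to have failed) and a two-parameter Weibull distribution; neither is spelled out in the notes, and the function names and parameter values are hypothetical.

```python
import math


def weibull_cdf(t_hours, shape, scale_hours):
    """Fraction of the population failed by time t: F(t) = 1 - exp(-(t/scale)^shape)."""
    return 1.0 - math.exp(-((t_hours / scale_hours) ** shape))


def b_life(percent_failed, shape, scale_hours):
    """Time by which the given percentage has failed (inverse of the Weibull CDF)."""
    if not 0.0 < percent_failed < 100.0:
        raise ValueError("percent_failed must be between 0 and 100")
    return scale_hours * (-math.log(1.0 - percent_failed / 100.0)) ** (1.0 / shape)


if __name__ == "__main__":
    # Hypothetical wearout item: shape 2.0, characteristic life 1000 h.
    b10 = b_life(10, 2.0, 1000.0)
    print(f"B10 life = {b10:.1f} h, check F(B10) = {weibull_cdf(b10, 2.0, 1000.0):.3f}")
```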

Specified life parameters must clearly state the life characteristic related to the duty cycle. The life parameter may be stated as some time-independent measure, eg miles travelled, or it may be stated as a time-dependent function with a stipulated operating cycle.

Smith Chapter 18 - Project Management

Setting Objectives and Specifications

Realistic reliability and maintainability objectives need to be set with due regard for customer requirements and cost constraints. Some discussion may be required to establish economic reliability values which meet requirements and are achievable with proposed technology at the costs allowed for.

When specifying MTBF it is a common mistake to state a confidence level; in fact the MTBF requirement stands alone. Addition of a confidence level implies a statistical demonstration would suffice.

Vague statements should be avoided at all costs as they are subjective and cannot be measured and thus cannot be demonstrated or proved.

Engineering requirements should include:

1. Functional description: speeds, functions, human interfaces and operating periods.

2. Environment: temperature, humidity, etc.

3. Design life: related to wearout and replacement policy.

4. Physical Parameters: size and weight restrictions, power supply limits.

5. Standards: BS, US MIL, Def Con, etc., standards for materials, components and tests.

6. Finishes: appearance and materials.

7. Ergonomics: human limitations and safety considerations.

8. Reliability, availability and maintainability: module reliability and MTTR objectives. Equipment R and M related to module levels.

9. Manufacturing quantity: Projected manufacturing levels – First off, Batch, Flow.

10. Maintenance philosophy: Type and frequency of preventive maintenance. Repair level, method of diagnosis, method of second-line repair.

Planning, Feasibility and Allocation

The design and assurance activities in this book simply will not take place unless there is real management understanding and commitment to a reliability and maintainability program with specific resources allocated.

There are three levels of RAM measurement:

Prediction: a modelling exercise which relies on the validity of applying historical failure rates to the design in question. This provides the lowest level of confidence.

Statistical Demonstration Test: provides sample failure information, normally from a test environment rather than the field. Provides more confidence than paper Prediction but still subject to statistical risk and limitation of the test environment.

Field Data: Except in the case of a very high reliability system, realistic numbers of failures are obtained and can be used in a reliability growth program as well as for comparison to the original target.

Programme Activities

The extent of activities will depend upon:

The severity of the requirement.

The complexity of the product.

Time and cost constraints.

Safety considerations.

The number of items to be produced.

A safety and reliability plan must be produced for each project/product. Without this there is nothing to audit progress against, and no formal measure of progress.

Activities might include:

Feasibility Study

Setting objectives

Contract requirements

Design reviews

o Electrical factors

o Software reliability

o Mechanical features

o Quality & reliability, testing, RAM predictions & demonstrations, FMEA, test equipment, procedures

o Maintenance philosophy, policy, MTTR prediction, resource forecasts, training and manuals

o Purchased items

o Manufacturing and installation, tolerancing, burn-in, packaging and transport, costs

o Other, patents, value engineering, safety, documentation standards and product reliability

RAM predictions

Design trade offs

Prototype tests

Parts selection and approval

Demonstrations

Spares Provisioning

Data Collection and Failure Analysis

Reliability growth

Training

Responsibilities

Reliability and maintainability are engineering parameters and the responsibility for their achievement is therefore primarily with the design team. Quality assurance techniques play a vital role in achieving the goals but cannot be used to ‘test in’ reliability to a design which has its own inherent level.

Standards and Guidance Documents

BS5760 Reliability of systems, equipment and components

Part 1 is Guide to Reliability Programme Management and outlines the reliability activities such as those above. Other parts deal with prediction, data, practices and so on.

UK Ministry of Defence 00-40 Reliability and Maintainability

Parts 1 and 2 are concerned with project requirements and the remainder with documents, training, procurement and so on.

US MIL-STD-785A Reliability Program for Systems and Equipment Development and Production

Specifies programme plans, reviews, predictions and so on.

US MIL-STD-470 Maintainability Programme Requirements

Program plan and activities for design criteria, design review, trade offs, data collection, predictions and status reporting.

Smith Chapter 19 – Contract clauses and their pitfalls

Essential Areas

Two types of pitfalls arise from contractual conditions:

1. Those due to the omission of essential conditions or definitions

2. Those due to inadequately worded conditions which present ambiguities, concealed risk, eventualities unforeseen by one or both parties etc

The following headings are essential if reliability or maintainability is to be specified.

Definitions

If MTTR is specified then the meaning of repair time must be defined in detail. MTTR is often used when mean down time is intended.

Failure itself must be thoroughly defined at system and module levels. It may be necessary to define more than one type of failure, or failures for different operating modes (eg in flight or on the ground) in order to describe all the requirements. MTBFs might then be ascribed to different failure types. MTBF and failure rates often require clarification of “failure” and “time”.

The bathtub curve depicts early, random and wearout failures. Reliability parameters usually refer to random failure unless stated to the contrary, it being assumed that burn-in failures are removed by screening and wearout is eliminated as far as possible by preventative replacement.

Parameters should not be used without due regard to their meaning and applicability. Failure rate, for example, has little meaning except when describing random failures. Availability, MTBF or reliability should be specified in preference.

Reliability and maintainability are often combined by specifying Availability. This can be defined in more than one way, and should thus be clearly specified. The usual form is Steady State Availability (MTBF/(MDT+MTBF)).

Environment

A common mistake is to fail to specify the environmental conditions under which the product is to work. The spec is often confined to temp range and humidity, which may not be sufficient. Other parameters include pressure, vibration and shock, chemical attack, power supply variations/interference, radiation, human factors and many others. The combination or cycling of parameters may have significant results.

Where equipment is used as standby or held as spares, the conditions will be different to those experienced by operating units. It is often assumed that because a unit is not powered or is stored, it will not fail. In fact this environment might be more conducive to failure. Transport environmental conditions and liabilities for component failures should also be considered.

Maintainability can also be influenced by environment. Conditions can influence repair times, since the use of particular protective clothing, remote handling devices and safety precautions increases the active elements of repair time.

Maintenance Support

The provision of spares, test equipment, personnel, transport and the maintenance of such is a responsibility that must be described in the contract, and the supplier must be conscious of the risks involved in the customer not meeting their side of the agreement.

Levels of skill and training should be specified.

Maintenance philosophy must be defined as it plays a part in defining reliability when under the client's control.

MTTR and automatic identification of faults need to be specified up front, as the cost and delay involved in designing the product to incorporate such features are likely to be considerable.

Demonstration and prediction

In the case of a maintainability demonstration, it is essential to define the tools and equipment, the maintenance instructions, test environment, technician level, task selection, spares, and level of repair.

In the case of a reliability demonstration, it is essential to define environmental conditions, allowable failures (eg maintenance induced), operating mode, preventive maintenance, burn-in, testing costs.

Statistical risks apply and the supplier needs to calculate the probability of failing the test with good equipment and the customer that of passing inadequate goods.

Consider that if 100 items of equipment meet their stated MTBF under random failure conditions, then after operating for a period equal to one MTBF, 63 of them, on average, will have failed.
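
The figure of 63 follows from the exponential distribution that applies under a constant failure rate, where the fraction failed by time t is 1 - exp(-t / MTBF). A minimal sketch, with a hypothetical function name and an arbitrary MTBF value:

```python
import math


def fraction_failed(time_hours, mtbf_hours):
    """Constant failure rate: F(t) = 1 - exp(-t / MTBF)."""
    return 1.0 - math.exp(-time_hours / mtbf_hours)


if __name__ == "__main__":
    mtbf = 4000.0  # any value gives the same result at t = MTBF
    print(f"{100 * fraction_failed(mtbf, mtbf):.1f} items out of 100 failed by t = MTBF")
```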

From a supplier's point of view, a warranty period is a form of reliability demonstration since, having calculated the expected number of failures during the warranty period, there is a probability that more will occur.
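
The warranty risk can be sketched with a Poisson model. This assumes a constant failure rate and that failed units are repaired or replaced, so the number of failures in the warranty period is Poisson distributed; the function names, fleet size, warranty period and MTBF are all hypothetical.

```python
import math


def expected_failures(units, warranty_hours, mtbf_hours):
    """Expected number of failures across the fleet during the warranty period."""
    return units * warranty_hours / mtbf_hours


def prob_more_than(k, mean):
    """Poisson probability of observing more than k failures."""
    cumulative = sum(math.exp(-mean) * mean ** i / math.factorial(i)
                     for i in range(k + 1))
    return 1.0 - cumulative


if __name__ == "__main__":
    # Hypothetical: 100 units, 2000 h warranty, 40,000 h MTBF.
    mean = expected_failures(100, 2000.0, 40_000.0)
    print(f"expected failures = {mean:.1f}")
    print(f"P(more than 5 failures) = {prob_more_than(5, mean):.3f}")
```

Even when the equipment exactly meets its MTBF, there is a substantial chance of exceeding the expected number of failures, which is the risk the supplier carries.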

Liability

The exact nature of a supplier’s liability must be spelt out, including the maximum penalty that can be incurred.

If part of the liability for failure or repair is to fall to some other subcontractor, then care must be taken in defining each party’s area.

Other Areas

Reliability and maintainability programme

Sometimes the R&M activities are specified in the contract. In a development contract this allows the customer to monitor activities against agreed milestones. Sometimes standard programs are used:

BS5760 Reliability of systems, equipment and components

Part 1 is Guide to Reliability Programme Management and outlines the reliability activities such as those above. Other parts deal with prediction, data, practices and so on

BS5760 Reliability of systems, equipment and components

Part 5 Reliability programs for equipment

US MIL-STD-785 Reliability Program for Systems and Equipment Development and Production

Specifies programme plans, reviews, predictions and so on.

US MIL-STD-470 Maintainability Programme Requirements

Program plan and activities for design criteria, design review, trade offs, data collection predictions and status reporting.

Reliability and maintainability analysis

The supplier may be required to offer a detailed reliability or maintainability prediction together with an explanation of the techniques and data used.

Storage

If equipment is stored for some time then the storage conditions and durations will have to be defined. The same applies to storage and transport of spares and test equipment.

Design standards

Specific standards are sometimes described or referenced. A problem exists that these standards are very detailed and most manufacturers have their own version. The fine detail can be overlooked until some formal acceptance inspection takes place, by which time retrospective action is difficult, time consuming and costly.

Pitfalls

The following lists those aspects of Reliability and Maintainability likely to be mentioned in an invitation to tender or in a contract.

Definitions

The most likely area of dispute is the definition of what constitutes a failure and whether or not a particular incident ranks as one. There are levels of failure, types of failure, causes of failure and effects of failure. Careful definition of failure types covered by the contract is therefore important.

Repair Time

Repair times can be grouped into active and passive elements. Broadly speaking, the active elements are dictated by system design and passive by maintenance and operating arrangements. For this reason, the supplier should never guarantee any part of the repair time that is influenced by the user.

Statistical risks

In both maintainability and reliability tests, producer and consumer risks apply.

True State | Conclusion Drawn: Accept Ho | Conclusion Drawn: Reject Ho
Ho True | Correct | Type I error (α-risk, producer risk)
Ho False | Type II error (β-risk, consumer risk) | Correct
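
The two risks can be quantified for a simple demonstration test. The sketch below assumes a fixed-duration test that is passed if no more than a set number of failures occur, with a constant failure rate so that the failure count is Poisson distributed. The test plan, the MTBF values and the function names are hypothetical and are not drawn from any published standard.

```python
import math


def poisson_cdf(k, mean):
    """Probability of k or fewer events for a Poisson distribution."""
    return sum(math.exp(-mean) * mean ** i / math.factorial(i)
               for i in range(k + 1))


def producer_risk(test_hours, allowed_failures, good_mtbf_hours):
    """Probability that equipment with an acceptable MTBF fails the test."""
    return 1.0 - poisson_cdf(allowed_failures, test_hours / good_mtbf_hours)


def consumer_risk(test_hours, allowed_failures, poor_mtbf_hours):
    """Probability that equipment with an unacceptable MTBF passes the test."""
    return poisson_cdf(allowed_failures, test_hours / poor_mtbf_hours)


if __name__ == "__main__":
    # Hypothetical plan: 2000 test hours, pass with 2 or fewer failures.
    print(f"producer risk = {producer_risk(2000.0, 2, 1000.0):.3f}")
    print(f"consumer risk = {consumer_risk(2000.0, 2, 500.0):.3f}")
```

Lengthening the test or changing the allowed number of failures trades one risk against the other, which is why both parties need to examine the plan before signing.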

Quoted specifications

Sometimes a reliability or maintainability program or test plan is specified by calling up a published standard. The danger is that not all the quoted terms are suitable and that the standard will not be studied in every detail.

Environment

If environmental factors are likely to be present in the field then they must be specifically allowed for in the design and price. It may not be desirable to specify every parameter possible since this leads to over-design.

Notes on the statistical risks table:

Type I (producer) risk: the risk of rejecting the null hypothesis and taking action when none was needed (“I’ve discovered something that really isn’t there”).

Type II (consumer) risk: the risk of accepting the null hypothesis when it should have been rejected (“I’ve missed a significant effect”).

Liability

When stating the supplier’s liability it is important to establish its limit in terms of both cost and time. Suppliers must ensure they know when they are finally free of liability.

In summary

The biggest pitfall is to assume that either party wins any advantage from ambiguity or looseness in the conditions of a contract. The effort expended in a dispute far outweighs any advantage that might have been secured. If every effort is made to cover all the areas as clearly and simply as possible then both parties will gain.

Penalties

Any cash penalty must be a genuine and reasonable pre-estimate of the damages thought to result from a system outage.

Apportionment of costs during guarantee

The customer should never be permitted to benefit from poor maintenance; therefore any arrangement by which the supplier pays for maintenance the customer undertakes should be avoided.

Payment according to down time

In summary

Subcontracted Reliability Assessments

Smith Chapter 20 – Product liability and safety legislation

Product liability is the liability of a supplier, designer or manufacturer to the customer for injury or loss resulting from a defect in that product.

The general situation

Legislation generally requires that goods are of merchantable quality and are reasonably fit for the purpose intended.

Common law relates to the Tort of Negligence, for which a claim for damages can be made. The onus is on the plaintiff to prove negligence, which requires proof:

1. That the product was defective

2. That the defect was the cause of the injury

3. That this was foreseeable and that the defendant failed in his or her duty of care.

The present situation involves a form of strict liability but:

Privity of Contract excludes third parties from contract claims

The onus is to prove negligence unless the loss results from a breach of contract

Exclusion clauses involving death and personal injury are void

Strict Liability

This concept hinges on the idea that liability exists for no other reason than the mere existence of a defect. No breach of contract or act of negligence is required in order to incur responsibility.

Insurance and Product Recall

The effects of Product Liability trends tend to:

Increase the number of claims

Increase premiums

Generate separate Product Liability Policies

Involve insurance companies in defining quality and reliability standards and procedures

Require the designer to insure the customer against genuine and frivolous consumer claims.

A design defect causing a potential hazard to life, health or safety may become evident when a number of products are already in use. It may then become necessary to recall a batch of items. The extent will be determined by the nature of the defect. A full evaluation of the hazard must be made and a report prepared.


Smith Chapter 21 – Major Incident Legislation

Problem Areas

Reports must be site specific and the use of generic procedures and justifications is to be discouraged. Adopting similar procedures is valid provided care is taken to ensure the end result is site specific.

The hazards from a dangerous substance may be various and it is necessary to consider secondary as well as primary hazards.

The events which could lead to the major accident scenario have to be identified fully. The fault tree approach (Ch 8) needs to identify all the initiators of the tree. This is an open ended problem in that it is a subjective judgement as to when they have all been listed. An obvious checklist would include (in addition to hardware failure):

Earthquake

Human error

Software

Vandalism, terrorism

External collision

Meteorology

Out of spec substances


Smith Chapter 22 – Integrity of safety-related systems

Safety-related or safety critical?

“Safety critical” has tended to be used where the hazard leads to a fatality, whereas “safety related” has been used in a broader context. There are many definitions, all of which vary slightly:

Some distinguish between multiple and single deaths

Some include injury, illness and incapacity without death

Some include effects on the environment

Some include system damage

Safety-related systems are those which, singly or together with other safety-related systems, achieve or maintain a safe state for equipment under their control

Safety critical systems are those which on their own achieve or maintain a safe state for the equipment under their control

Study Guide 2 Self Assessment Questions

1. Quality operations cost categories are:

These quality costs are the costs of activities directed at reliability and quality control and the cost of failure. Costs are usually considered in these three categories:

a. Prevention Costs

b. Appraisal Costs

c. Failure Costs

Prevention Costs are those related to activities that prevent failures from occurring, such as reliability activities, quality control of purchased goods and materials, training and management

Appraisal costs are those related to test and measurement, process control and quality audit.

Failure costs include internal and external costs. Internal costs include scrap and rework incurred during manufacture. External costs include warranty.

2. Product liability

Product liability is the liability of a supplier, designer or manufacturer to the customer for injury or loss resulting from a defect in that product. Product liability was an outgrowth of the Ralph Nader campaigns in the USA, and it makes the manufacturer of a product liable as a result of failure of his product. A designer can now be held liable even if the product is old and the user did not maintain or operate it correctly. Claims can only be defended successfully when the producer demonstrates he has taken all the practical steps towards identifying and eliminating the risk.

3. Reliability is the responsibility of the:

Reliability and maintainability are engineering parameters and the responsibility for their achievement is therefore primarily with the design team. Quality assurance techniques play a vital role in achieving the goals but cannot be used to ‘test in’ reliability to a design which has its own inherent level.

4. A reliability specification must contain:

The reliability specification must contain:

a. A definition of the failure related to the product’s function. The definition should cover all failure modes relevant to its function.

b. A full description of the environments in which the product will be stored, transported, operated and maintained.

c. A statement of the reliability requirement, and /or a statement of the failure modes and effects which are particularly critical and which must therefore have a very low probability of occurrence.

Engineering requirements should include:

a. Functional description: speeds, functions, human interfaces and operating periods.

b. Environment: temperature, humidity, etc.

c. Design life: related to wearout and replacement policy.

d. Physical Parameters: size and weight restrictions, power supply limits.

e. Standards: BS, US MIL, Def Con, etc., standards for materials, components and tests.

f. Finishes: appearance and materials.

g. Ergonomics: human limitations and safety considerations.

h. Reliability, availability and maintainability: module reliability and MTTR objectives. Equipment R and M related to module levels.

i. Manufacturing quantity: Projected manufacturing levels – First off, Batch, Flow.

j. Maintenance philosophy: Type and frequency of preventive maintenance. Repair level, method of diagnosis, method of second-line repair.

5. When and why should the costs of unreliability be evaluated?

The cost of unreliability in service should be evaluated early in the development phase, so that the effort on reliability can be justified and requirements set, related to expected costs. There are other costs, such as goodwill and market share. These can be hard to quantify. In extreme cases unreliability can lead to litigation if damage or injury occurs.


STUDY GUIDE 3 - Reliability in Design

Objectives

You should be able to:

Choose the appropriate Australian or international standard to use for reliability design

Recognise the specific techniques for each of the different design processes

Recognise the importance of design to cost goal setting

AS2529-1982 Collection of Reliability, Availability and Maintainability Data for Electronics and Similar Engineering Use

1 Scope

This standard provides guidance for the collection of reliability data relating to the field performance of electronic items but may also be applicable to other engineering items.

2 Application and Purpose

The specific objectives of the collection of reliability data are:

a. To provide for a survey of the actual reliability,

b. To provide data for improving reliability

c. To provide data for the organisation and management of any maintenance operation.

3 Data Required

Consideration of the foregoing objectives defines the need for a system which provides for the collection of documented data covering:

a. The total population under observation

b. Operational conditions

c. Failures of the items

d. Maintenance operations

4 Guidelines

It is the intention of this standard to provide guidelines for setting up data collection.

5 Reports

General Comments: The relative content of use and failure reports will vary markedly with the items considered and the type of operation.

Use Reporting: Data reporting should be supported by information on the use of the items

Failure Reporting: Failure reports should cover all the failures which have been observed. They should also contain sufficient information to identify misuse failures. Failures considered to be attributable to any maintenance action should be so noted.

Preventative Maintenance Reporting: Essentially, preventative maintenance is scheduled so as to forestall failure or eliminate failure entirely. When no replacements or repairs are made, the action can be classified as a “Use” report. When the preventative maintenance action results in a replacement or repair, the report may be treated as a “Failure” report even though the item has in fact not failed in operation.

6 Field Performance Reports

General: Sufficiently detailed reports permit the estimation of reliability not only of the items but of the devices of which they are component parts.

Content of Field Performance Reports

Number and date of report

Name and address of user, location of item

Nature of report (Use, failure, PM)

Item Identification

Number of items considered: it is possible to cover more than one item of basically identical design

History of Item

o Date of manufacture

o Original or modified state

o Date first placed into use

o Cumulative operating time

o Storage or transportation conditions and cumulative time prior to last use

o Nature and date of last maintenance task, operating time since this date.

o Cumulative time non-operational but believed serviceable

o Cumulative time non-operational but believed unserviceable.

o Cumulative time on standby

General operating conditions

Item Failure description

Item Failure analysis

Action taken

Assessment by field or maintenance engineer
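The report content listed above maps naturally onto a record structure. The following is a minimal sketch of such a record, assuming Python dataclasses; all class and field names are hypothetical and are not taken from AS2529, and units (hours) are an assumption.

```python
from dataclasses import dataclass, field
from datetime import date
from enum import Enum
from typing import Optional


class ReportNature(Enum):
    """Nature of report: use, failure or preventative maintenance (PM)."""
    USE = "use"
    FAILURE = "failure"
    PREVENTATIVE_MAINTENANCE = "pm"


@dataclass
class ItemHistory:
    """History of the item; times are assumed to be in hours."""
    date_of_manufacture: Optional[date] = None
    modified: bool = False                      # original (False) or modified state
    date_first_used: Optional[date] = None
    cumulative_operating_hours: float = 0.0
    storage_or_transport_conditions: str = ""
    last_maintenance_task: str = ""
    operating_hours_since_last_maintenance: float = 0.0
    hours_non_operational_serviceable: float = 0.0
    hours_non_operational_unserviceable: float = 0.0
    hours_on_standby: float = 0.0


@dataclass
class FieldPerformanceReport:
    """One field performance report (hypothetical layout)."""
    report_number: str
    report_date: date
    user: str
    location: str
    nature: ReportNature
    item_id: str
    items_covered: int = 1                      # identical items covered by this report
    history: ItemHistory = field(default_factory=ItemHistory)
    operating_conditions: str = ""
    failure_description: str = ""
    failure_analysis: str = ""
    action_taken: str = ""
    engineer_assessment: str = ""
```

A report with only the mandatory identification fields filled in still carries an empty history record, so later analysis code can rely on every field being present.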

AS2530-1982 Presentation of Reliability Data on Electronic and Similar Components

1 Scope

This standard is intended to provide guidance on the presentation of data necessary to distinguish the reliability characteristics of an electronic component, but may also be applicable to other engineering items.

2 Identification of Components Tested

Information identifying the components shall be in accordance with the relevant Australian Standards or other component specifications.

3 Test Conditions

The test conditions to be used should be those given in the relevant component standards.

4 Data on Failures

The following information shall be supplied:

a. The number of failures observed, categorised by test conditions and type of failure

b. The times at which the failures occurred or were verified

c. Particular incidents which occurred during testing which might have affected the results

d. Statement of failure mechanism

e. Discarded test data and the reasons why they were not used in the presentation of results

Additional requirements:

a. Failure criteria

b. Failure rate which can be assumed to be constant

c. Failure rate which cannot be assumed to be constant

d. Influence of stress

Presentation of data:

a. The failure rates of components failing in the sample tested shall preferably be supplied in terms of the test period, eg 4 × 10^-6 in 2000 hours, rather than the failure rate alone.

b. The upper confidence level (and the lower, where appropriate), shall be stated. Preferred confidence levels are 60% and 90%. It shall be stated whether the failure rate is observed, assessed or extrapolated.

5 Data on changes in Characteristics

The following information shall be supplied as a part of the test data:

a. Primary characteristics data

b. A graphical representation of changes in characteristics

c. A tabular representation of changes in characteristics

d. Particular incidents which occurred during testing which might have affected the results

e. Discarded test data and the reasons why they were not used in the presentation of the results, shall be separately stated.

Additional requirements:

a. If the changes can be satisfactorily approximated by a mathematical function, that function should be stated together with time duration for which it applies.

b. If the drift of the characteristic depends on type and magnitude of stress, these should be stated together with the data.

Presentation of the data:

The various forms of data presentation are

a. Primary methods

b. Graphical methods

c. Numerical methods


AS3960-1990 Guide to Reliability and Maintainability Program Management

1 Scope and General

This standard provides guidance on reliability and maintainability program management of manufactured and constructed products. In management terms, it is concerned with what has to be done and why, and when and how it is to be done.

2 Reliability and Maintainability Program

General

Life Cycle concept

Aim of a Reliability and Maintainability Program

General considerations on maintainability

Cost Considerations

Relative effectiveness of program activities

Training

Program Activities

Definition

Design and Development phase

Production phase

Installation and Commissioning phase

Operation-usage and maintenance phase

3 Specification of Reliability and Maintainability

General

Types of specification

Purpose of reliability and maintainability clauses

Qualitative versus quantitative approach to reliability and maintainability

Quantitative reliability clauses

Problems in applying the quantitative approach

Qualitative approach

Quantitative maintainability clauses

Quantitative maintainability requirements

Writing Reliability and Maintainability Clauses in a Specification

Necessary clauses

Function of an item

Criteria for failure

Choice of a reliability characteristic


Required value of the reliability characteristic

Choice of a maintainability characteristic

Required value of the maintainability characteristic

Operating conditions and regime

Reliability and Maintainability assurance

Specification of Reliability and Maintainability in Practice

4 Assessment and prediction of Reliability and Maintainability

General

Aims of reliability assessment

Reliability and Maintainability characteristics

Reliability Assessment

Reliability Prediction by Modelling

Provision of Reliability Data

Reliability Growth Testing

General

Preparation

Results of reliability growth testing

Factors governing reliability growth testing effectiveness

Reliability Demonstration and Testing

General

Aims of a test program

Choice of a test program

Evaluation of test data using Bayesian methods

Proof test

Suitability of statistical methods for analysis of test results

Maintainability Prediction

Maintainability prediction

Prediction advantages

Techniques

Basic Assumptions and Interpretations

Elements of maintainability prediction techniques

Maintainability Demonstration and Testing

General requirements

Maintainability testing program

Maintainability demonstration

Test Conditions

Maintenance task selection

Compliance illustration by means other than testing.

5 Production, Flow, Analysis and Interpretation of Reliability and Maintainability Data

General

Benefits

Organisation

Effectiveness of communication

Data Input

Reporting systems

Specification and description

Operating history

Failure history

Data Sources

Guidelines

Past Experience

Design and development

Production

Factory test

Guarantee or warranty reports – product liability test reporting

Supply of replacement parts

Material or component supply

Repair department

Field installation, demonstration or commissioning tests

User reporting system

Field surveys

Designing the Data Collection Form

Validity of Data

Product manufacturer

Materials or component supplier

Field data retrieval programs


Collection and Flow of Reliability Data

Analysis of Data

Quantitative data

Qualitative data

Requirement specifications

Failure Classification

Interpretation and Presentation of Data


O’Conner Chapter 6 – Reliability Prediction and Modelling

Introduction

Accurate prediction of the reliability of a new product before it is manufactured is obviously highly desirable. Advance knowledge of reliability can allow forecasts of support costs, spares requirements, warranty costs and marketability.

It can be argued that prediction acknowledges causes of failure that could then be eliminated. In fact a reliability prediction can rarely be made with high accuracy or confidence.

Nevertheless it can often provide the basis for forecasting of dependent factors such as life cycle costs, and be valuable as part of the study and design processes, for comparing options and highlighting critical reliability features of designs.

Eventually, in principle, it is necessary to consider the reliability contributions of individual parts. However, the lower the level of analysis, the greater is the uncertainty. The great majority of modern engineering components are sufficiently reliable that, for practical purposes, they generate no inherent quantifiable failure rate.

Since reliability is affected strongly by human-related factors such as training and motivation of design and test engineers, quality of production and maintenance skills, these factors must also be taken into account.

Fundamental Limitations of Reliability Prediction

In engineering and science we use mathematical models for prediction. These laws and models are valid within the appropriate domain (eg Ohm’s law does not hold at temperatures near absolute zero). However, for everyday, practical purposes such deterministic laws serve our purposes well, and we use them to make predictions, taking account of such practical aspects as measurement errors in initial conditions.

While most laws in physics can be considered deterministic, the underlying mechanisms can be stochastic. It is only at the level of individual or very few actions and interactions that physicists take into account the uncertainty due to underlying stochastic processes. For practical purposes, we ignore infinitesimal variations.

Of course some physical systems are comprised of very few components, actions and interactions with no underlying stochastic mechanism, and can be treated as deterministic because of the small numbers involved. If however, the numbers become larger, the computational problems become significant and begin to degrade the credibility of the predictions. Also, small errors and variations progressively accumulate, leading to increased uncertainty and divergent behaviour.

Physical laws are thus a useful predictor of system behaviour either when very small or very large numbers are involved. For systems involving moderately large numbers the predictive power in empirical and deterministic laws diminishes. Very fast computing is applied, and new empirical relationships are derived, and the results often fall short of our requirements. Prediction power is increased if we can simplify the problem to a few variables or complicate it to a very large number.

Prediction power is greater in time-invariant or cyclical systems. Systems such as weather, aerodynamics, and explosives are time-variant, and this leads to divergent behaviour. Thus predictions of the instantaneous state become progressively less credible as time proceeds.

It follows from the arguments presented that one or a few measurements of a physical quantity can usually provide us with sufficient information on which to make a prediction of the future state of a physical time-invariant or cyclical system. When the quantity is time-variant, more measurements might need to be taken before we can predict confidently from the data. It is only when the system is moderately complex that one or a few measurements do not enable us to predict the future state.

For a mathematical model to be accepted as a basis for scientific prediction, it must be based on a theory which explains the relationship. Finally, we expect the predictions made to be repeatable.

Predictions in Reliability

The concept of deriving mathematical models, which can be used to predict reliability, is intuitively appealing. Sometimes these models are as simple as a single fixed value for failure rate or reliability. However, some of the models derived are quite complex, taking account of many factors likely to affect reliability.

Like other predictive models in science and engineering, these have been based upon consideration of what might affect the parameter of interest, in this case reliability, or in other words, failure. Thus there have been attempts to create theories. However, this approach is of severely limited validity for predicting reliability.

Whilst an engineering component might have properties such as conductance and mass, it is very unlikely to have an intrinsic reliability that meets such criteria.

Failure or the absence of failure is heavily dependent upon human actions and perceptions. This is never true of laws of nature. This represents a fundamental limitation of the concept of reliability prediction using mathematical models.

Onset of failure is nearly always a discontinuous function, subject to predictive difficulties described for models of the behaviour of a system which contain moderately large number of factors and interactions, and whose progression to a failed state is time-variant.

We saw in Chapter 4 how reliability can vary by orders of magnitude with small changes in load and strength distributions, and the large amount of uncertainty inherent in estimating reliability from the load-strength model. These real uncertainties must be borne in mind when synthesizing the reliability of a system by considering the likely failure rates of its parts.

Another limitation arises from the fact reliability models are usually based upon statistical analysis of past data. Much more data is required to derive a statistical relationship, and even then there will be uncertainty because the sample can seldom be taken to represent the population. Sometimes we can say that the likelihood increases but we can very rarely predict the time of failure. A statistically derived relationship can never be proof of a causal connection – it must be supported by theory based on an understanding of the underlying cause and effect relationship.

It is never sensible to make a prediction based on past data unless we can be sure the underlying conditions that affect future behaviour will not change. However, since engineering is concerned with deliberate changes, predictions of reliability based on past data ignore the fact that changes might be made to improve reliability. The use of past data to predict the future can be very misleading and unduly pessimistic.

A reliability prediction for a system containing many parts is likely to be more accurate than for a small system. It is important to remember the variances in reliability at the part level can be orders of magnitude greater than the variances at system level.

Therefore any reliability prediction based on mathematical models or growth models must be treated with some scepticism.

A designer cannot design for an MTBF, unless he places as much faith in the reliability math model as he does in, say, Ohm’s law. The MTBF cannot be measured as can, say, power consumption, and there is no reason or logic to believe that all items will show the same MTBF or pattern of failure over a period of time.

Unfortunately the mathematical modelling approach to reliability prediction has been given undue and insufficiently critical attention in the literature and in reliability standards. The naïve presentation of reliability predictions has done much to undermine the credibility of reliability engineering, particularly since so often systems achieve reliability levels far higher than the predictions.

Reliability Databases

There are several published databases that give reliability information on engineering components and subsystems.

Best known would be MIL-HDBK-217 for electronic components.

The Practical Approach

It is possible to make credible reliability predictions for systems under certain conditions:

1. The system is similar to systems developed, built and used previously

2. The new system does not involve significant technological risk

3. The system will be manufactured in large quantities, or is very complex (ie many parts, or the parts are complex), or will be used over a long time, or a combination of these, ie there is an asymptotic property

4. There is a strong commitment to the achievement of the reliability predicted, as an overriding priority

Reliability prediction for new high tech products must be based upon identification of objectives and assessment of risks, in that order. This assessment can be aided by the educated use of appropriate models and data which help to quantify the risks.

Reliability predictions for systems should be made “top down” not synthesized from the parts level.

The purpose to which the prediction will be applied should also influence the methods used and the estimates provided.

The predictions should always take account of objectives and related management aspects, such as commitment and risk. If management does not drive the reliability effort, the prediction can become a meaningless exercise. As overriding considerations, it must be remembered there is no theoretical limit to the reliability that can be attained, and this does not necessarily entail higher costs.

System Reliability Models

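The basic series reliability model

In general for a series of n s-independent components:

$$R = \prod_{i=1}^{n} R_i$$

where $R_i$ is the reliability of the ith component. This is known as the product rule or series rule. If the components have constant hazard rates $\lambda_i$, then

$$\lambda = \sum_{i=1}^{n} \lambda_i$$

and

$$R = e^{-\lambda t}$$

This is the simplest basic model on which parts count reliability prediction is based.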

Active Redundancy

In this system, composed of two s-independent parts with reliabilities R1 and R2, satisfactory operation occurs if either one or both parts function. The system reliability is the probability that part 1 or part 2 works:

$$R = 1 - (1 - R_1)(1 - R_2) = R_1 + R_2 - R_1 R_2$$

The general expression for active parallel redundancy is:

$$R = 1 - \prod_{i=1}^{n} (1 - R_i)$$

m-out-of-n redundancy

In some active parallel redundant configurations, m out of n units may be required to be working for the system to function. The reliability of an m/n system with n s-independent components, in which all the unit reliabilities are equal to R, is the binomial reliability function:

$$R_{sys} = 1 - \sum_{i=0}^{m-1} \binom{n}{i} R^i (1 - R)^{n-i}$$
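The series, active parallel and m-out-of-n models can be checked numerically. The following is a minimal sketch in Python, assuming s-independent components; the function names and the example reliabilities are illustrative and do not come from O’Conner.

```python
from math import comb, prod


def series(reliabilities):
    """Product rule: the system works only if every component works."""
    return prod(reliabilities)


def active_parallel(reliabilities):
    """Active redundancy: the system fails only if every component fails."""
    return 1.0 - prod(1.0 - r for r in reliabilities)


def m_out_of_n(m, n, r):
    """Binomial reliability: at least m of n identical units (reliability r) work."""
    return sum(comb(n, k) * r**k * (1.0 - r) ** (n - k) for k in range(m, n + 1))


if __name__ == "__main__":
    print("series 0.99, 0.95:     ", series([0.99, 0.95]))
    print("parallel 0.9, 0.9:     ", active_parallel([0.9, 0.9]))
    print("2-out-of-3, unit 0.9:  ", m_out_of_n(2, 3, 0.9))
```

Note that `m_out_of_n(1, n, r)` reduces to active parallel redundancy of n equal units and `m_out_of_n(n, n, r)` reduces to the series rule, which is a quick sanity check on any implementation.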

Standby redundancy

Standby redundancy is achieved when one unit does not operate continuously but is switched on when the primary unit fails. The standby unit and the sensing and switching system may be considered to have a “one-shot” reliability of starting and maintaining system function until the primary component is repaired.

Further redundancy considerations

For systems where very high safety or reliability is required, more complex redundancy is frequently applied:

1. In aircraft, dual or triple active redundant hydraulic power systems are used.

2. Aircraft electronic flying controls typically utilize triple voting active redundancy. A sensing system automatically switches off one system if it transmits signals which do not match those transmitted by the other two.

3. Fire detection and suppression systems consist of detectors, which may be in parallel active redundant configuration

Availability of Repairable Systems

Availability is defined as the probability that an item will be available when required, or as the proportion of total time that the item is available for use. Therefore the availability of a repairable item is a function of its failure rate, λ, and its repair or replacement rate, μ. For a simple unit with a constant failure rate λ, and a constant repair rate, μ, where μ = 1/MTTR, the steady state availability is equal to:

$$A = \frac{\mu}{\lambda + \mu} = \frac{\text{MTBF}}{\text{MTBF} + \text{MTTR}}$$

The instantaneous availability (for a unit that is available at t = 0) is equal to:

$$A(t) = \frac{\mu}{\lambda + \mu} + \frac{\lambda}{\lambda + \mu} e^{-(\lambda + \mu)t}$$

Steady state unavailability = 1 – A

If scheduled maintenance is necessary and involves taking the system out of action, this must be included in the availability formula.
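As a worked illustration of availability for a single repairable unit, the sketch below uses the standard closed-form results for constant failure rate λ and constant repair rate μ: steady state A = μ/(λ + μ), and instantaneous A(t) = μ/(λ + μ) + λ/(λ + μ)·exp(−(λ + μ)t) for a unit that starts in the working state. The function names and the example MTBF and MTTR values are assumptions, and scheduled maintenance downtime is not included.

```python
from math import exp


def steady_state_availability(failure_rate, repair_rate):
    """Long-run fraction of time the unit is available."""
    return repair_rate / (failure_rate + repair_rate)


def instantaneous_availability(failure_rate, repair_rate, t):
    """Probability the unit is available at time t, given it works at t = 0."""
    total = failure_rate + repair_rate
    return repair_rate / total + (failure_rate / total) * exp(-total * t)


if __name__ == "__main__":
    mtbf_hours = 1000.0   # assumed example value
    mttr_hours = 10.0     # assumed example value
    lam = 1.0 / mtbf_hours
    mu = 1.0 / mttr_hours
    a = steady_state_availability(lam, mu)
    print("steady state availability:  ", a)
    print("steady state unavailability:", 1.0 - a)
    for t in (0.0, 10.0, 100.0):
        print("A(%g h) = %.6f" % (t, instantaneous_availability(lam, mu, t)))
```

The instantaneous value starts at 1 and decays towards the steady state value at a rate set by λ + μ, so with short repair times the steady state figure is reached quickly.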

Availability is an important consideration in relatively complex systems. In such systems, high reliability by itself is not sufficient to ensure that the system will be available when needed. It is also necessary to ensure that it can be repaired quickly and that essential scheduled maintenance can be performed quickly. Therefore maintainability is an important aspect of design for maximum availability.

Availability is also affected by redundancy. Large gains in reliability and steady-state availability can be provided by redundancy. However, these are relatively simple situations, particularly as constant failure rate is assumed.

Modular Design

Availability and the cost of maintaining the system can be influenced by the way the design is partitioned. Modular design is used in many complex products to ensure a failure can be corrected by a relatively easy replacement of the defective module, rather than by replacement of a complete unit.

Block Diagram Analysis

The failure logic of a system can be shown with a reliability block diagram (RBD), which shows the logical connections between components of the system.

Block diagram analysis consists of reducing the overall RBD to a simple system that can then be analysed using the formula for series and parallel arrangements. It is necessary to assume s-independence of block reliabilities.
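The reduction of an RBD to series and parallel groups can be sketched recursively. In the minimal Python example below, a diagram is written as a nested tuple `(kind, children)`, which is a representation chosen here for illustration only; blocks are assumed s-independent and the reliabilities are made-up values.

```python
from math import prod


def rbd_reliability(block):
    """Reduce a nested series/parallel RBD to a single reliability value.

    A block is either a number (component reliability) or a tuple
    ("series" | "parallel", [child blocks]).
    """
    if isinstance(block, (int, float)):
        return float(block)
    kind, children = block
    values = [rbd_reliability(child) for child in children]
    if kind == "series":
        return prod(values)
    if kind == "parallel":
        return 1.0 - prod(1.0 - v for v in values)
    raise ValueError("unknown block type: %r" % (kind,))


if __name__ == "__main__":
    # A controller in series with a redundant pair of pumps and a valve.
    diagram = ("series", [0.99, ("parallel", [0.9, 0.9]), 0.95])
    print(rbd_reliability(diagram))
```

Diagrams that cannot be decomposed into nested series and parallel groups, such as bridge networks, need a different method.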

Cut and Tie Sets

Complex RBDs can be analysed using the cut set or tie set methods. A cut set is produced by drawing a line through blocks in the system to show the minimum number of failed blocks which would lead to a system failure. Tie sets are produced by drawing lines through blocks which, if all were working, would allow the system to work.

Their use is appropriate for the analysis of large systems in which various configurations are possible, such as aircraft controls.
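A small example shows how tie sets and cut sets give the same answer. The sketch below evaluates a five-block bridge network by enumerating every combination of working and failed blocks, which is exact but only practical for small systems. The bridge layout, block names and reliabilities are assumptions for illustration: A and B form the upper path, C and D the lower path, and E links the two mid-points.

```python
from itertools import product


def _state_probabilities(reliability):
    """Yield (set of working blocks, probability) for every system state."""
    names = sorted(reliability)
    for states in product([True, False], repeat=len(names)):
        p = 1.0
        for name, up in zip(names, states):
            p *= reliability[name] if up else 1.0 - reliability[name]
        yield {n for n, up in zip(names, states) if up}, p


def reliability_from_tie_sets(reliability, tie_sets):
    """System works if every block of at least one tie set is working."""
    return sum(p for working, p in _state_probabilities(reliability)
               if any(ts <= working for ts in tie_sets))


def reliability_from_cut_sets(reliability, cut_sets):
    """System fails if every block of at least one cut set has failed."""
    all_blocks = set(reliability)
    unreliability = sum(p for working, p in _state_probabilities(reliability)
                        if any(cs <= all_blocks - working for cs in cut_sets))
    return 1.0 - unreliability


if __name__ == "__main__":
    r = {name: 0.9 for name in "ABCDE"}
    ties = [{"A", "B"}, {"C", "D"}, {"A", "E", "D"}, {"C", "E", "B"}]
    cuts = [{"A", "C"}, {"B", "D"}, {"A", "E", "D"}, {"C", "E", "B"}]
    print(reliability_from_tie_sets(r, ties))
    print(reliability_from_cut_sets(r, cuts))
```

For large systems full enumeration is not feasible, and approximations based on the minimal cut sets are normally used instead.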

Common Mode Failures

Examples of common mode failures are:

1. Changeover systems to activate standby redundant units

2. Sensor systems to detect failure of a path

3. Indicator systems to alert personnel to failure of a path

4. Power or fuel supplies which are common to different paths.

Enabling Events

An enabling event is one which, whilst not necessarily a failure or a direct cause of failure, will cause a higher level failure event when accompanied by a failure. Examples are:

1. Warning systems disabled for maintenance

2. Controls incorrectly set

3. Personnel following procedures incorrectly or not at all.

Practical Aspects

It is essential that practical engineering considerations are applied to system reliability analysis. Examples of situations in which practical and logical errors can occur are:

1. Two diodes connected in series. If either fails open circuit there will be no current flow, so they are in series from a reliability point of view. If either fails short circuit, the other will provide the required system function, so they are in parallel from a reliability point of view.

2. Common mode failures are often difficult to predict, but can dominate the real reliability or safety of a system.

3. Unexpected combinations of events can occur

4. System failures can be caused by events other than failure of components or subsystems.

These illustrate the need for reliability and safety analysis to be performed by engineers with practical knowledge and experience of the system design, manufacture, operation and maintenance.

Fault Tree Analysis

Fault Tree Analysis (FTA) is a reliability/safety design technique which starts from consideration of system failure effects, referred to as top events.

In addition to showing the logical connections between failure events in relation to defined top events, FTA can be used to quantify top event probabilities. A separate fault tree is needed for each defined top event that can be caused by different failure modes or different logical connections between failure events.

Petri Nets

A Petri net is a general-purpose graphical and mathematical tool for describing relations existing between conditions and events.

Owing to the variety of logical relations that can be represented with Petri nets, they are a powerful tool for modelling systems. Petri nets can be used not only for simulation, reliability analysis and failure monitoring, but also for dynamic behaviour observation.

State Space Analysis (Markov Analysis)

A system or component can be in one of two states (failed or non-failed). The probability of being in one or the other at a future time can be evaluated using state-space (or state-time) analysis. In reliability and availability analysis, failure rate and repair rate are the variables of interest.

The best known state-space analysis technique is Markov analysis, which can be applied under the following major constraints:

1. The probabilities of changing from one state to another must remain constant. Thus the method can only be used when a constant hazard or failure rate assumption can be justified.

2. Future states of the system are independent of all past states except the immediately preceding one. This is an important constraint in the analysis of repairable systems, since it implies the repair returns the system to an “as new” condition.

The tree diagram approach quickly becomes intractable if the system is much more complex than the one-component system described, and analysed over just a few increments. For more complex systems, matrix methods can be used, particularly as these can be readily solved using computer programs.
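A minimal sketch of the matrix method for the one-component case follows. The transition probabilities (0.1 of failing and 0.6 of being repaired in each time increment) are hypothetical values chosen for illustration. Each increment multiplies the state probability vector by the transition matrix, which replaces the branch-by-branch bookkeeping of the tree diagram.

```python
def step(state, matrix):
    """Advance the state probability vector by one time increment."""
    n = len(state)
    return [sum(state[i] * matrix[i][j] for i in range(n)) for j in range(n)]

def evolve(state, matrix, increments):
    """Apply the transition matrix repeatedly."""
    for _ in range(increments):
        state = step(state, matrix)
    return state

# Rows are the current state, columns the next state: [available, failed].
# Hypothetical values: P(fail) = 0.1 and P(repair) = 0.6 per increment.
P = [[0.9, 0.1],
     [0.6, 0.4]]

start = [1.0, 0.0]                  # the component starts in the available state
after_4 = evolve(start, P, 4)
after_50 = evolve(start, P, 50)     # effectively the steady state
print(f"availability after 4 increments:  {after_4[0]:.4f}")
print(f"availability after 50 increments: {after_50[0]:.4f}")
```

For this two-state case the long-run availability settles at P(repair) / (P(fail) + P(repair)), which the 50-increment result approaches.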

Continuous Markov Processes

So far we have considered discrete Markov processes. We can also use Markov analysis to evaluate the availability of systems in which the failure rate and the repair rate are assumed to be constant in a time continuum. Markov analysis can also be used for availability analysis, taking account of the holding and repair rate of spares.
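For the simplest continuous case, a single repairable component that starts in the available state, the standard result is shown below. The symbols are assumptions introduced here for illustration: $\lambda$ is the constant failure rate and $\mu$ the constant repair rate.

```latex
A(t) = \frac{\mu}{\lambda + \mu} + \frac{\lambda}{\lambda + \mu}\, e^{-(\lambda + \mu)t},
\qquad
A_{\text{steady state}} = \lim_{t \to \infty} A(t) = \frac{\mu}{\lambda + \mu}
```

The exponential term decays quickly when repairs are much faster than failures, so the steady-state value is usually the figure of interest.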

Limitations

The Markov analysis method suffers one major disadvantage. It is necessary to assume constant rates for both failures and repairs. It is also necessary to assume events are s-independent, which is hardly ever the case in the real world. The extent to which these assumptions might affect the situation should be carefully considered when evaluating a Markov analysis.

Monte Carlo Simulation

In a Monte Carlo simulation, a logical model of the system being analysed is repeatedly evaluated, each run using different values of the distributed parameters. The selection of parameters is made randomly, but with probabilities governed by the relevant distribution function.

Monte Carlo simulations can be used for reliability and availability modelling using computer programs. Since Monte Carlo simulation involves no complex mathematical analysis, it is an attractive alternative approach. There are no constraints regarding the nature of input assumptions on parameters such as failure and repair rates, so non-constant values can be used.

One problem is the expensive use of computer time. Also, since the simulation of probabilistic events generates variable results, in effect simulating real life, it is usually necessary to perform a number of runs in order to obtain estimates of the means and variances of the output parameters of interest, such as availability, number of repairs arising and facility utilization.
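The sketch below shows the idea for a single repairable item. Everything numerical in it is an assumption made for illustration: Weibull-distributed times to failure (shape 2, so the hazard rate is not constant), lognormal repair times, a 10 000 hour mission and 200 runs. It is not a model of any particular system.

```python
import random
import statistics

def simulate_once(rng, mission_hours, ttf_scale, ttf_shape, repair_mu, repair_sigma):
    """One run: alternate random up-times and repair times until the mission ends."""
    clock, uptime, repairs = 0.0, 0.0, 0
    while clock < mission_hours:
        # Weibull time to failure: a shape above 1 gives a rising hazard rate.
        time_to_failure = rng.weibullvariate(ttf_scale, ttf_shape)
        uptime += min(time_to_failure, mission_hours - clock)
        clock += time_to_failure
        if clock >= mission_hours:
            break
        # Lognormal repair time; a repair is counted when it starts.
        clock += rng.lognormvariate(repair_mu, repair_sigma)
        repairs += 1
    return uptime / mission_hours, repairs

def run_simulation(runs, seed=1):
    """Repeat the run many times and summarise the scatter of the results."""
    rng = random.Random(seed)
    results = [simulate_once(rng, 10_000.0, 500.0, 2.0, 1.5, 0.8) for _ in range(runs)]
    availabilities = [a for a, _ in results]
    return {
        "mean_availability": statistics.mean(availabilities),
        "stdev_availability": statistics.stdev(availabilities),
        "mean_repairs": statistics.mean(r for _, r in results),
    }

summary = run_simulation(runs=200)
print(summary)
```

The standard deviation across runs is the reason several runs are needed: a single run gives one sample of availability, not an estimate of its mean.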

Reliability Apportionment

Sometimes it is necessary to break an overall system reliability requirement down to individual subsystem reliabilities.

The starting point for apportionment is an RBD for the system drawn to show the appropriate system structure. It is important to take account of the uncertainty inherent in any early prediction.
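As a simple worked illustration, which is an assumption added here and not a method taken from these notes, consider equal apportionment across $n$ s-independent subsystems in series. The system requirement $R_s$ is shared so that every subsystem receives the same target $R_i$:

```latex
R_s = \prod_{i=1}^{n} R_i
\quad\Longrightarrow\quad
R_i = R_s^{1/n},
\qquad \text{e.g. } R_s = 0.95,\ n = 4 \ \Rightarrow\ R_i = 0.95^{1/4} \approx 0.9873
```

In practice the allocation would be weighted by factors such as complexity and risk rather than shared equally, and the uncertainty noted above still applies to every target.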

Standard Methods for Reliability Prediction and Modelling

MIL-STD-756

IEEE 1413

NASA-CR-1129

Conclusions

System reliability prediction and modelling can be a frustrating exercise, since even quite simple systems can lead to complex logic when redundancy, repair times, testing and monitoring are taken into account.

In real life, availability is often determined more by spares holding and administrative times than by predictable factors such as mean repair time.

Prediction and modelling are concepts which have generated much attention, literature and controversy in the reliability field.

Much of this work, however, is of only abstruse interest, since reliability is not a parameter that is inherently predictable on the basis of the laws of nature or of statistical extrapolation.

O’Conner Chapter 7 – Reliability in Design (not required)

Introduction

Computer-Aided Engineering

Environments

Design Analysis Methods

Quality Function Deployment

Load Strength Analysis

Failure Modes, Effects and Criticality Analysis (FMECA)

Reliability Predictions for FMECA

Hazard and Operability Study (HAZOPS)

Parts, Materials and Processes Review (PMP)

Non-Material Failure Modes

Human Reliability

Design analysis for processes

Critical Items List

Summary

Management of Design Review

Configuration Control

O’Conner Chapter 8 – Reliability of Mechanical Components and Systems

Introduction

Mechanical components can fail due to two main causes:

1. Overstress leading to fracture

2. Degradation of strength

Mechanical components and systems can also fail for many other reasons:

Backlash in controls, linkages and gears

Incorrect adjustments

Seizing of moving parts

Leaking of seals

Loose fasteners

Excessive vibration or noise

Designers must be aware of these and other potential causes of failure, and must design to prevent or minimise their occurrence.

Mechanical Stress, Strength and Fracture

Mechanical stress can be either tensile, compressive or shear. The amount of deformation is called the strain. The relationship between stress ($\sigma$) and strain ($\varepsilon$) is described by Hooke’s Law:

$\sigma = E\varepsilon$

where E is Young’s Modulus. A high value of E indicates the material is stiff. A low value means that the material is soft or ductile.

Another important material property is toughness. Toughness is the opposite of brittleness. It is the resistance to fracture, measured as the energy input per unit volume required to cause fracture.

Compressive strength is much more difficult to analyse and predict. It depends upon the mode of failure and the shape of the component.

Stress can also be applied in shear.

Fatigue

Fatigue damage within engineering materials is caused when a repeated mechanical stress is applied, the stress being above a limiting value called the fatigue limit. Fatigue damage is cumulative, so that repeated stress above the fatigue limit will eventually result in failure.

The initiation and growth rate of cracks vary depending upon the material properties and on surface and internal conditions. The material property that imparts resistance to fatigue damage is the toughness.

O’Conner Chapter 9 – Electronic System Reliability

Introduction

Reliability of Electronic Components

Component Types and Failure Mechanisms

Summary of device failure modes

Circuit and System aspects

Electronic System Reliability Prediction

Reliability in electronic system design

Parameter variation and tolerances

Design for production, test and maintenance

O’Conner Chapter 10 – Software Reliability

Introduction

Software in engineering systems

Software Errors

Preventing Errors

Software Structure and Modularity

Programming Style

Fault Tolerance

Redundancy/Diversity

Languages

Data Reliability

Software Checking

Software Design Analysis Methods

Software Testing

Error Reporting

Software Reliability Prediction and Measurement

Sneak Analysis

Sneak conditions are:

1. Sneak output – the wrong output is generated.

2. Sneak inhibit – undesired inhibit of an input or an output.

3. Sneak timing – the wrong output is generated because of its timing or incorrect input timing.

4. Sneak message – a program message incorrectly reports the state of the system.

Hardware/Software Interfaces

Conclusions

The versatility and economy offered by software control can lead to an under-estimation of the difficulty and cost of software generation. To ensure the program will operate satisfactorily under all conditions that might exist requires an effort greater than that required for the basic design and first-program preparation. The cost and effort of debugging a large, unstructured program containing many errors can be so high that it is cheaper to scrap the whole program and start again.

The essential elements of a software development program to ensure a reliable product are:

1. Specify the requirements completely and in detail

2. Make sure that all the project staff understand the requirements

3. Check the specifications thoroughly

4. Design a structured program and specify each module fully

5. Check the design and the module specifications against the system specifications

6. Check the written program for errors, line by line

7. Plan module and system tests to cover important input combinations, particularly at extreme values

8. Ensure full recording of all development notes, tests, checks, errors and program changes.

Smith Chapter 17 – Systematic Failures, especially software

Systematic failures are generally considered to be in addition to those we quantify by means of failure rate. Since they do not relate to past failure data, it follows that it is very difficult to justify their being predicted by conventional techniques. Qualitative measures have been developed in the hope they will minimise systematic failures. The following sections summarise these defences with particular reference to software-related failure.

Programmable Devices

Software-related Failures

Software Failure Modelling

Software Quality Assurance

Modern/Formal Methods

Software Checklists

Study Guide 3 Self Assessment Questions

1. Reliability analysis

2. A sensitivity analysis

3. A trade-off analysis

4. What are four software sneak circuit conditions?

Sneak conditions are:

1. Sneak Output – the wrong output is generated.

2. Sneak inhibit – undesired inhibit of an input or an output.

3. Sneak timing – the wrong output is generated because of its timing or incorrect input timing.

4. Sneak message – a program message incorrectly reports the state of the system.

5. What are two primary causes for mechanical failure?

Mechanical components can fail due to two causes:

1. Overstress leading to fracture

2. Degradation of strength

STUDY GUIDE 4 - Reliability, Maintainability and Availability

Objectives

You should be able to:

Compare your definitions of these concepts with those presented in the various resources, and point out any differences in emphasis

Outline the significance of reliability, maintainability and availability to equipment design

Discuss the trade-off between the cost of reliability, maintainability and performance

Describe the principles of fault tree analysis and failure mode effect and criticality analysis

O’Conner Chapter 14 – Maintainability, Maintenance and Availability

Introduction

Most engineering systems are maintained. The ease with which repair and other maintenance work can be carried out determines a system’s maintainability.

Maintained systems may be subject to corrective and preventative maintenance. Corrective maintenance includes all action to return a system from a failed to an operating or available state. The amount of corrective maintenance is therefore determined by reliability.

Corrective maintenance can be quantified as the mean time to repair (MTTR), and this time can be divided into three groups:

1. Preparation time

2. Active maintenance time

3. Delay, or Logistics time.

Corrective maintenance is also specified as mean active corrective maintenance time (MACMT) since it is only the active time that the equipment designer can influence.

Preventative maintenance seeks to retain the system in an operational or available state by preventing failures from occurring. Preventative maintenance affects reliability directly. It is planned and should be performed when we want it to be performed. Preventative maintenance is measured by the time taken to perform the specified maintenance tasks and their specified frequency.

Maintainability affects availability directly. The time taken to repair failures and to carry out routine preventative maintenance removes the system from the available state. There is thus a close relationship between reliability and maintainability, one affecting the other, and both affecting availability and costs.

The maintainability of a system is clearly governed by the design. The design determines features such as accessibility, ease of test, diagnosis and repair and requirements for calibration, lubrication and other preventative maintenance actions.

Maintenance Time Distributions

Maintenance times tend to be lognormally distributed. In addition to job-to-job time variability, leading typically to this lognormal distribution, there is also variability due to learning. However, both the mean time and the variance should reduce with experience and learning.

Preventative Maintenance Strategy

The effectiveness and economy of preventative maintenance can be maximised by taking account of the time-to-failure distributions of the maintained parts and of the failure rate trend of the system.

In general, if a part has a decreasing hazard rate, any replacement will increase the probability of failure. If the hazard rate is constant, replacement will make no difference to the probability of failure. If a part has an increasing hazard rate, then scheduled replacement at any time will, in theory, improve reliability of the system.

These are theoretical considerations. They assume that the replacement action does not introduce any other defects and that the time-to-failure distributions of parts are exactly defined. These assumptions must not be made without question. However, it is obviously of prime importance to take account of the time-to-failure distributions in planning a preventative maintenance strategy.
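The three cases can be checked numerically. The sketch below assumes, purely for illustration, that time to failure follows a Weibull distribution with characteristic life 1000 hours, and compares the probability of failing in the next 100 hours for a part that has already survived 800 hours against a newly fitted replacement. A shape parameter below 1 gives a decreasing hazard rate, 1 a constant hazard rate and above 1 an increasing hazard rate.

```python
from math import exp

def weibull_reliability(t, eta, beta):
    """Probability of surviving to age t (eta = characteristic life, beta = shape)."""
    return exp(-((t / eta) ** beta))

def conditional_failure_probability(age, interval, eta, beta):
    """Probability of failing within the next interval, given survival to age."""
    survive_now = weibull_reliability(age, eta, beta)
    survive_later = weibull_reliability(age + interval, eta, beta)
    return 1.0 - survive_later / survive_now

# Hypothetical figures: 1000 h characteristic life, part aged 800 h, next 100 h.
for beta in (0.5, 1.0, 3.0):
    keep_old = conditional_failure_probability(800.0, 100.0, 1000.0, beta)
    fit_new = conditional_failure_probability(0.0, 100.0, 1000.0, beta)
    print(f"beta = {beta}: keep old part {keep_old:.4f}, fit new part {fit_new:.4f}")
```

Like the text, this ignores any defects introduced by the replacement action itself.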

In addition to the effect of replacement on reliability as theoretically determined by considering the failure distributions of the replaced parts, we must also take account of the effects of maintenance action on reliability.

The effects of failures, both in terms of effects on the system and of costs of downtime and repair, must also be considered.

In order to optimise preventative replacement, it is therefore necessary to know the following for each part:

1. The time-to-failure distribution parameters for the main failure modes

2. The effect of all failure modes

3. The cost of failure

4. The cost of scheduled replacement

5. The likely effect of maintenance on reliability.

We have considered so far parts which do not give any warning of the onset of failure. If incipient failure can be detected, we must also consider:

6. The rate at which defects propagate to cause failure

7. The cost of inspection and test.

Note that from point 2, a Failure Modes, Effects and Criticality Analysis (FMECA) is therefore an essential input to maintenance planning. This systematic approach to maintenance planning, taking account of reliability aspects, is called reliability centred maintenance (RCM).

FMECA and FTA in Maintenance Planning

The FMECA is an important prerequisite to effective maintenance planning and maintainability analysis. The FMECA is also a very useful input for preparation of the diagnostic procedures and checklists, since the likely causes of the failure symptoms can be traced back using the FMECA results. When a fault tree analysis (FTA) has been performed it can also be used for this purpose.

Maintenance Schedules

When any maintenance activity is determined to be necessary, we must determine the most suitable intervals between its performance.

The most appropriate base for the interval is the one that best accounts for the equipment’s utilisation in terms of the causes of degradation (wear, fatigue, parameter change, etc.) and that is actually measured.

Technology Aspects

Mechanical

Monitoring methods are used to provide periodic or continuous indications of the condition of mechanical components and systems. These include:

Non-destructive test (NDT) for detection of cracks

Temperature and vibration monitoring of bearings, gears and other rotating machinery

Oil analysis, to detect signs of wear or break-up in lubrication and hydraulic systems

Electronic and Electrical

Electronic components and assemblies generally do not degrade in service, so long as they are protected from environmental effects such as corrosion. Therefore, apart from calibration for items like measuring instruments, scheduled tests are seldom appropriate.

“No Fault Found”

A large proportion of the reported failures of many electronic systems are not confirmed on later test. There are several causes of these, including:

Intermittent failures, such as components that fail under certain conditions.

Tolerance effects, which can cause a unit to operate correctly in one system or environment but not another.

Connector troubles

Built in test (BIT) systems which falsely indicate failures that have not occurred (see below)

Failures that have not been correctly diagnosed and repaired, so that the symptoms recur.

Inconsistent test criteria between the in-service test and the test applied during diagnosis elsewhere such as the repair depot.

Human error or inexperience

In some systems the diagnosis of which item failed might be ambiguous.

Software

As discussed in chapter 10, software does not fail in the ways hardware can, so there is no “maintenance”. If it is found to be necessary to change a program for any reason, this is really redesign of the program, not repair.

Built-In Test (BIT)

Complex electronic systems now frequently include built-in test (BIT) facilities. BIT consists of additional hardware (and often software) which is used for carrying out functional tests on the system.

BIT can be very effective in increasing system availability and user confidence in the system. However BIT inevitably adds complexity and cost and can therefore increase the probability of failure.

BIT can also adversely affect apparent reliability by falsely indicating that the system is in a failed condition.

It is important to optimise the design of BIT in relation to reliability, availability and cost.

Calibration

Calibration is the regular check or test of equipment for measuring physical parameters, by making comparisons with standard sources.

Whether an item needs to be calibrated or not depends primarily on its application, and also upon whether or not any inaccuracy would be apparent during normal use.

Maintainability Predictions

Maintainability prediction is the estimation of the maintenance workload which will be imposed by scheduled and unscheduled maintenance. A standard method used for this work is MIL-HDBK-472, which contains four methods for predicting Mean Time To Repair (MTTR) of a system. Method II is most frequently used and is based simply on summing the products of the expected repair times and failure rates of the individual failure modes and dividing by the sum of the individual failure rates, i.e.:

$$\mathrm{MTTR} = \frac{\sum_i \lambda_i t_{r,i}}{\sum_i \lambda_i}$$

where $\lambda_i$ is the failure rate and $t_{r,i}$ the expected repair time of failure mode $i$.

The same approach is used for predicting the mean preventative maintenance time, with $\lambda$ replaced by the frequency of occurrence of the preventative maintenance action.

MIL-HDBK-472 describes the methods to be used for predicting individual task times based upon design considerations such as accessibility, skills levels, etc.
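The Method II calculation described above is a failure-rate-weighted average of repair times, and can be sketched as follows. The three failure modes, their failure rates and their repair times are hypothetical numbers used only to show the arithmetic.

```python
def predicted_mttr(failure_modes):
    """Failure-rate-weighted mean repair time.

    failure_modes: iterable of (failure_rate_per_hour, repair_time_hours) pairs.
    """
    modes = list(failure_modes)
    total_rate = sum(rate for rate, _ in modes)
    weighted_time = sum(rate * hours for rate, hours in modes)
    return weighted_time / total_rate

# Hypothetical failure modes: (failure rate per hour, expected repair time in hours).
modes = [
    (200e-6, 1.0),   # frequent fault, quick module swap
    (50e-6, 4.0),    # less frequent fault, longer repair
    (10e-6, 8.0),    # rare fault, major strip-down
]
mttr = predicted_mttr(modes)
print(f"predicted MTTR = {mttr:.2f} h")
```

Because the weighting is by failure rate, a frequent but quickly repaired failure mode pulls the predicted MTTR down more than a rare, lengthy repair pushes it up.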

Maintainability Demonstrations

A standard approach to maintainability demonstration is MIL-HDBK-470. The technique is the same as maintainability prediction using method III of MIL-HDBK-472, except that the individual task times are measured rather than estimated from design.

Design for Maintainability

It is obviously important that maintained systems are designed so that maintenance tasks are easily performed, and that the skill levels required are not too high, considering the experience and training of likely maintenance personnel and users. As far as is practicable, the need for scheduled maintenance should be eliminated.

Design rules and checklists should include guidance to aid design for maintainability and to guide design review teams.

Design for maintainability is closely related to design for ease of production. If a product is easy to assemble and test, maintenance will usually be easier.

Interchangeability is another important aspect of design for ease of maintenance of repairable systems. Replaceable components and assemblies must be designed so that no adjustment or recalibration is necessary after replacement.

Integrated Logistic Support

Integrated logistic support (ILS) is a concept developed by the military, in which all aspects of design and of support and maintenance planning are brought together, to ensure that the design and the support system are optimised. The approach is described in MIL-HDBK-1388.

ILS requires inputs of reliability and maintainability data and forecasts, as well as data on costs, weights, special tools and test equipment, training requirements etc.

ILS outputs are thus very sensitive to the accuracy of the inputs. In particular, reliability forecasts can be highly uncertain (refer Chapter 6). Therefore, such analyses and the decisions based upon them, should take full account of these uncertainties.

O’Conner pages xxv – xxvi

MIL-HDBK-472

SAE J817

[Reader] Collcot Chapter 13 “Fault Analysis Planning and System Availability”

Overview of paper, understand concepts in paras 13.1 to 13.1.5

[Reader] Patton Chapter 8 “Reliability, Availability and Maintainability”, pages 76-101

Study Guide 4 Self Assessment Questions

1. Maintainability

Most engineering systems are maintained. The ease with which repair and other maintenance work can be carried out determines a system’s maintainability.

Maintained systems may be subject to corrective and preventative maintenance. Corrective maintenance includes all action to return a system from a failed to an operating or available state. The amount of corrective maintenance is therefore determined by reliability.

2. Integrated logistic support

Integrated logistic support (ILS) is a concept developed by the military, in which all aspects of design and of support and maintenance planning are brought together, to ensure that the design and the support system are optimised. The approach is described in MIL-HDBK-1388.

ILS requires inputs of reliability and maintainability data and forecasts, as well as data on costs, weights, special tools and test equipment, training requirements etc.

ILS outputs are thus very sensitive to the accuracy of the inputs. In particular, reliability forecasts can be highly uncertain. Therefore, such analyses and the decisions based upon them, should take full account of these uncertainties.

3. Availability

4. What are the three groups of activities used to quantify MTTR?

Corrective maintenance can be quantified as the mean time to repair (MTTR), and this time can be divided into three groups:

1. Preparation time

2. Active maintenance time

3. Delay, or Logistics time.

5. What is the main objective of a reliability and maintainability program?

STUDY GUIDE 5 - Reliability Prediction and Modelling

Objectives

You should be able to:

Identify the various modelling approaches

Identify the strengths and weaknesses of each of them

Determine if any of them would be relevant and useful to your work situation

O’Conner, Chapter 6 Conclusion (limitations for reliability modelling)

System reliability prediction and modelling can be a frustrating exercise, since even quite simple systems can lead to complex logic when redundancy, repair times, testing and monitoring are taken into account.

In real life, availability is often determined more by spares holding and administrative times than by predictable factors such as mean repair time.

Prediction and modelling are concepts that have generated much attention, literature and controversy in the reliability field.

Much of this work, however, is of only abstruse interest, since reliability is not a parameter that is inherently predictable on the basis of the laws of nature or of statistical extrapolation.

O’Conner Chapter 12 (pgs 341-346)

Reliability Analysis of Repairable Systems

Chapter 3 described methods for analysing data related to time to first failure. However, for repairable systems, which really represent the great majority of everyday reliability experience, the distribution of times to first failure is much less important than the failure rate or rate of occurrence of failure (ROCOF) of the system.

Any repairable system may be considered as an assembly of parts, the parts being replaced when they fail. If we ignore replacement (repair) times, which are usually small in comparison with standby or operating times, and if we assume that the time to failure of any part is independent of any repair actions, then we can use the methods of event series analysis (shown in chapter 2) to analyse system reliability.

If we do not perform a centroid test and assume the data were independently and identically distributed (IID), we might order the data in rank order and plot on probability paper.

This example shows how important it is for failure data to be analysed correctly, depending on whether we need to understand the reliability of a non-repairable part or of a repairable system. The presence of a trend when the data are ordered chronologically shows that times to failure are not IID, and ordering by magnitude, which implies IID, will therefore give misleading results.
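A sketch of the centroid (Laplace) test mentioned above is given below, in the form used for a fixed observation period. The failure times are hypothetical. The statistic compares the mean of the chronological failure times with the midpoint of the observation period; under the hypothesis of no trend it is approximately standard normal, so values well above zero suggest an increasing ROCOF and values well below zero a decreasing ROCOF.

```python
from math import sqrt

def centroid_test(failure_times, observation_period):
    """Laplace centroid statistic for cumulative failure times in one period."""
    n = len(failure_times)
    mean_time = sum(failure_times) / n
    return (mean_time - observation_period / 2.0) / (
        observation_period * sqrt(1.0 / (12.0 * n))
    )

# Hypothetical cumulative operating hours at each failure, observed for 1000 h.
bunching_late = [400, 650, 800, 880, 950, 990]    # failures arriving faster
bunching_early = [10, 50, 120, 200, 350, 600]     # failures arriving slower

u_late = centroid_test(bunching_late, 1000.0)
u_early = centroid_test(bunching_early, 1000.0)
print(f"late bunching:  u = {u_late:+.2f}")
print(f"early bunching: u = {u_early:+.2f}")
```

Both data sets would look unremarkable if sorted by the size of the intervals between failures, which is why the chronological order has to be examined first.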

We can derive system reliability over a period by plotting the cumulative times to failure in chronological order rather than in rank order.

If there are no perturbations, the failure rate will tend to a constant value after most parts have been replaced at least once, regardless of the failure trends of individual parts. This is one of the main reasons the Constant Failure Rate (CFR) assumption has become so widely used for systems, and why part hazard rates have been confused with failure rates.

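If part times to failure are independently and identically exponentially distributed (IID exponential), the system will have a CFR equal to the sum of the reciprocals of the part mean times to failure, i.e.:

$$\lambda_s = \sum_{i=1}^{n} \frac{1}{\bar{x}_i}$$

where $\bar{x}_i$ is the mean time to failure of part $i$.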

The assumption of IID exponential for part times to failure for a repairable system can be misleading for the following reasons:

1. The most important failure modes of a system are usually caused by parts which have failure probabilities which increase with time (wearout failures)

2. Failure and repair of one part may cause damage to others. Therefore times between failures are not necessarily independent.

3. Repairs often do not “renew” the system.

4. Repairs might be made by adjustment, lubrication, thus providing a new lease of life, but not “renewal”

5. Replacement parts can make subsequent failure initially more likely to occur

6. Repair personnel learn by experience, so diagnostic ability improves with time. Conversely, changes in personnel can lead to reduced diagnostic ability and therefore repeated failures.

7. Not all part failures will produce system failures

8. Factors (such as cycling) are often more important than operating times

9. Reported failures are nearly always subject to human bias and emotion

10. Failure probability is affected by scheduled maintenance or overhaul

11. Replacement parts are not necessarily drawn from the same population – they may be better or worse.

12. System failures might be caused by parts whose combined tolerances cause the system to fail.

13. Many reported failures are not caused by part failures at all, but by events such as intermittent connections, improper use, maintainers using opportunities to replace “suspect” parts etc.

14. Within a system, not all parts operate to the overall system cycle.

The factors above often predominate in systems to be modelled and in collected reliability data.

A CFR is often a practicable and measurable first-order assumption, particularly when data are not sufficient to allow more detailed analysis.

Smith Chapter 8 – Methods of Modelling

Block Diagrams and Repairable Systems

Reliability Block Diagrams

Steps to creating a reliability block diagram:

1. Establish failure criteria – define what constitutes a failure since this will determine which failure modes at the component level actually cause a system to fail.

2. Establish a reliability block diagram – it is necessary to describe the system as a number of functional blocks which are interconnected according to the effects of each block failure on the overall system reliability.

3. Failure Mode Analysis – complete an FMEA by examining individual component failure modes and failure rates

4. Calculation of system reliability – relating the block failure rates to the system reliability is a question of mathematical modelling

5. Reliability allocation – the block failure rates are taken as a measure of complexity, and improved, suitably weighted objectives are set.

Repairable Systems

It is now generally acknowledged that traditional Markov modelling does not correctly represent the normal repair activities for redundant repairable systems when calculating the probability of failure on demand (PFD). The Journal of The Safety and Reliability Society, Vol 22, No 2, 2002 published papers by Gulland and Simpson, both of which agree with those findings.

Common Cause (Dependent) Failure (CCF)

CCFs often dominate the unreliability of redundant systems by virtue of defeating the random co-incident failure features of the redundant protection.

Whereas simple models of redundancy assume that failures are both random and independent, common cause failure modelling takes account of failures which are linked, due to some dependency, and therefore occur simultaneously or, at least, within a sufficiently short interval as to be perceived as simultaneous.

Defences against CCF involve design and operating features.

Fault Tree Analysis

A fault tree is a graphical method of describing the combinations of events leading to a defined system failure. The system failure mode is known as the top event.

The fault tree involves three logical possibilities:

1. the OR gate whereby any input causes the output to occur

2. the AND gate whereby all inputs need to occur for the output to occur

3. the Voted gate, similar to the AND gate, whereby two or more inputs are required for the output to occur.
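
The three gates can be quantified as follows. This is a minimal sketch that assumes the input events are independent (which common cause failures would violate). The function names and the example probabilities are hypothetical.

```python
from itertools import combinations

def or_gate(probs):
    """Output occurs if any input occurs."""
    none_occur = 1.0
    for p in probs:
        none_occur *= 1.0 - p
    return 1.0 - none_occur

def and_gate(probs):
    """Output occurs only if all inputs occur."""
    out = 1.0
    for p in probs:
        out *= p
    return out

def voted_gate(probs, k):
    """Output occurs if at least k of the inputs occur."""
    n = len(probs)
    total = 0.0
    for m in range(k, n + 1):
        # Sum over every way of choosing exactly m inputs that occur.
        for chosen in combinations(range(n), m):
            term = 1.0
            for i in range(n):
                term *= probs[i] if i in chosen else 1.0 - probs[i]
            total += term
    return total

# Hypothetical top event: both pumps fail, OR the power supply fails.
pump_a, pump_b, power = 0.05, 0.05, 0.01
top_event = or_gate([and_gate([pump_a, pump_b]), power])
print(top_event)
```

With k equal to the number of inputs the voted gate reduces to the AND gate, and with k = 1 it reduces to the OR gate.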

Event Tree Diagrams

Whereas fault tree analysis is probably the most widely used technique for quantitative analysis, it is limited to combinations of AND/OR events which contribute to a single defined failure (the top event).

Event trees resemble decision trees which show the likely train of events between an initiating event and any number of outcomes. The main element in an ET is the decision box, which contains a question/condition with a YES/NO outcome.

The main difference between fault trees and event trees is that event trees model the order in which the elements fail.

Smith Chapter 9

Duane, J.T., Learning Curve Approach to Reliability Monitoring, IEEE Transactions on Aerospace, Volume 2, Number 2, April 1964

Summary

Several different and complex electromechanical and mechanical systems are shown to have remarkably similar rates of reliability improvement. These similarities provide the basis for a learning curve, which can be used to monitor development programs, predict growth patterns, and plan programs for reliability improvement.

The Learning Curve

In an effort to determine the manner in which reliability performance changes during development and design improvement activity, data was analysed for a total of five different products. A remarkably consistent pattern emerged when cumulative failure rate (defined as total malfunctions since program start, divided by total operating hours since start) was plotted on log-log paper as a function of cumulative operating hours.

Considering the wide range of equipment types and complexity represented by the data, a remarkable similarity in trends is evident. The fact that the curves are parallel indicates uniformity in rate of reliability improvement. Relative positions of the curves in the vertical direction are evidently a measure of inherent design reliability.

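Analysis

It can be seen that all the curves form reasonably straight lines that are similar in slope. In general, it appears that cumulative failure rate will vary in a manner directly proportional to some negative power of cumulative operating hours. This can be expressed mathematically as:

$$\lambda_{\Sigma} = \frac{F}{H} = K H^{-\alpha}$$

where:

$\lambda_{\Sigma}$ = Cumulative failure rate

F = Total malfunctions since program start

H = Total operating hours since program start

K = Constant

$\alpha$ = Exponent determined by slope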

This equation implies that reliability will continually increase as operating experience is gained. This may not be true as operating time reaches extremely high levels, but the evidence presented does indicate that the relationship is valid over a long period. This relationship probably applies as long as active programs are in place to improve equipment reliability.

Discussion

The techniques proposed here assume that a “normal” rate of reliability improvement exists. Such a normal growth rate can be useful as a standard against which to compare actual performance, but it must not be viewed as an absolute limit on the performance of a given product.

Since the proposed procedure is intended primarily for use in conjunction with development programs, it is important to note that test conditions have a major effect on data validity.

Extrapolation of failure rate data by straight-line extension of experimental curves provides an obvious way of predicting reliability at given points in the development cycle.
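
The straight-line fit and extrapolation can be sketched as a least-squares fit of ln(cumulative failure rate) against ln(cumulative operating hours). The data below are synthetic, generated from an assumed K = 0.5 and exponent 0.4, and are not Duane's data. The function names are hypothetical.

```python
import math

def fit_duane(cum_hours, cum_failures):
    """Least-squares fit of ln(F/H) = ln K - alpha * ln H.

    Returns (K, alpha) for the learning curve F/H = K * H**(-alpha).
    """
    xs = [math.log(h) for h in cum_hours]
    ys = [math.log(f / h) for f, h in zip(cum_failures, cum_hours)]
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return math.exp(intercept), -slope

def cumulative_failure_rate(K, alpha, hours):
    """Cumulative failure rate predicted by the fitted line."""
    return K * hours ** (-alpha)

# Synthetic cumulative data: failures = (F/H) * H = K * H**(1 - alpha).
hours = [100.0, 300.0, 1000.0, 3000.0]
failures = [0.5 * h ** 0.6 for h in hours]

K, alpha = fit_duane(hours, failures)
print(round(K, 3), round(alpha, 3))

# Straight-line extrapolation to 10,000 cumulative operating hours.
print(cumulative_failure_rate(K, alpha, 10000.0))
```

Real test data will scatter about the line, so the fitted slope should be treated as an estimate rather than a fixed property of the product.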

Crow, Larry, Evaluating the Reliability of Repairable Systems, Proceedings Annual Reliability and Maintainability Symposium, 1990.

Abstract

Most complex systems, for example automobiles, aircraft, communication systems, etc., are repaired and not replaced when they fail. This paper discusses the Weibull process, or power law non-homogeneous Poisson process model, for analysing the reliability of repairable systems. Estimation and other statistical procedures are given for this model which are appropriate when failure data are generated by multiple systems. It is shown that in a special case this repairable systems model reduces to a model for reliability growth.

Introduction

When a complex system with new technology is fielded or subjected to a customer use environment, there is often considerable interest in assessing its reliability and other related performance parameters such as availability. Although operating tests are conducted for many systems during development, it is generally recognised that in many cases these tests may not yield complete data representative of the actual use environment. Other interests in measuring the reliability of a fielded system may centre on, for example, logistics and maintenance policies, quality and manufacturing issues, burn-in, wear out, mission reliability or warranties.

Most complex systems are repaired, not replaced, when they fail. A number of books and papers in the literature have stressed that the usual non-repairable reliability methodologies, such as the Weibull distribution, are not appropriate for repairable system reliability analyses and have suggested the use of non-homogeneous Poisson process models.

The homogeneous Poisson process is equivalent to the widely used Poisson distribution and exponential times between system failures model, and is appropriate when the system’s failure intensity is not affected by the system’s age. However, to realistically consider burn-in, wearout, useful life, maintenance policies, warranties, mission reliability, etc. will often require an approach that recognises that the failure intensity of these systems may not be constant over the operating life of interest but may change with system age. A useful and generally practical extension of the homogeneous Poisson process is the non-homogeneous Poisson process, which allows the system failure intensity to change with system age.

Typically, the reliability analysis of a repairable system under customer use will involve data generated by multiple systems. The Weibull process, or power law non-homogeneous Poisson process, is appropriate for this type of analysis. This paper will discuss the specific application of these methods under several situations which are common in practice and will illustrate the numerical calculations by examples.

In this paper it is strongly recommended that the reliability characteristics for each system under study be analysed separately before the failure data are combined. The techniques described in this paper for combined failure data assume that each system is governed by the same Weibull process failure intensity model.

The Model

In this paper we assume that the failures for each system under study occur according to a non-homogeneous Poisson process with intensity function

$$u(t) = \lambda \beta t^{\beta - 1}$$

where λ > 0, β > 0 and t is the age of the system. This particular mathematical form for the intensity u(t) is the

http://www.weibull.com/RelGrowthWeb/Crow-AMSAA_(N.H.P.P.).htm

In "Reliability Analysis for Complex, Repairable Systems" (1974), Dr. Larry H. Crow noted that the Duane model could be stochastically represented as a Weibull process, allowing for statistical procedures to be used in the application of this model in reliability growth. This statistical extension became what is known as the Crow-AMSAA (N.H.P.P.) model. This method was first developed at the U.S. Army Materiel Systems Analysis Activity (AMSAA). It is frequently used on systems when usage is measured on a continuous scale. It can also be applied for high reliability, a large number of trials and one-shot items. Test programs are generally conducted on a phase by phase basis. The Crow-AMSAA model is designed for tracking the reliability within a test phase and not across test phases.

A development testing program may consist of several separate test phases. If corrective actions are introduced during a particular test phase then this type of testing and the associated data are appropriate for analysis by the Crow-AMSAA model. The model analyzes the reliability growth progress within each test phase and can aid in determining the following:

Reliability of the configuration currently on test

Reliability of the configuration on test at the end of the test phase

Expected reliability if the test time for the test phase is extended

Growth rate

Available confidence intervals

Applicable goodness-of-fit tests

The reliability growth pattern for the Crow-AMSAA model is exactly the same pattern as for the Duane postulate. That is, the cumulative number of failures is linear when plotted on ln-ln scale. Unlike the Duane postulate the Crow-AMSAA model is statistically based. Under the Duane postulate the failure rate is linear on ln-ln scale. However for the Crow-AMSAA model statistical structure, the failure intensity of the underlying non-homogeneous Poisson process (NHPP) is linear when plotted on ln-ln scale.

Minitab Help File

Power-law process

A non-homogeneous Poisson process with an intensity function that represents the rate of failures or repairs. The power-law process can model a system that is improving, deteriorating, or remaining stable.

With the default (maximum likelihood) estimation method, the power-law model is also known as the AMSAA model. With the least squares estimation method, the power-law process model is also known as the Duane model.
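
The maximum likelihood estimates for the power-law process can be sketched for the simplest case: a single system observed from time 0 to a fixed end time T (a time-terminated test), with intensity u(t) = λβt^(β-1). This is an illustration under those assumptions, not Minitab's implementation. The failure times are made up and the function names are hypothetical.

```python
import math

def crow_amsaa_mle(failure_times, end_time):
    """MLE of (lambda, beta) for one system observed over (0, end_time]."""
    n = len(failure_times)
    beta = n / sum(math.log(end_time / t) for t in failure_times)
    lam = n / end_time ** beta
    return lam, beta

def intensity(lam, beta, t):
    """Failure intensity u(t) = lambda * beta * t**(beta - 1)."""
    return lam * beta * t ** (beta - 1.0)

# Made-up cumulative failure times (hours) for a test ended at 1000 hours.
times = [10.0, 50.0, 200.0, 600.0]
lam, beta = crow_amsaa_mle(times, 1000.0)
print(round(beta, 3))

# Instantaneous MTBF at the end of the test is 1 / u(T).
print(1.0 / intensity(lam, beta, 1000.0))
```

A value of β below 1 indicates an improving system (decreasing intensity), β above 1 a deteriorating one, and β equal to 1 reduces to the homogeneous Poisson process with constant intensity λ.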

Study Guide 5 Self Assessment Questions

A basic reliability model includes what parameters:

Four methods of reliability modelling are:

Reliability block diagrams

Common Cause Failure (CCF) modelling

Fault Tree Analysis

Event Tree Analysis

Four methods of reliability prediction are:

Define a common failure mode

When developing a reliability block diagram, a general approach should include the following steps:

1. Establish failure criteria – define what constitutes a failure since this will determine which failure modes at the component level actually cause a system to fail.

2. Establish a reliability block diagram – it is necessary to describe the system as a number of functional blocks which are interconnected according to the effects of each block failure on the overall system reliability

3. Failure Mode Analysis – complete an FMEA by examining individual component failure modes and failure rates

4. Calculation of system reliability – relating the block failure rates to the system reliability is a question of mathematical modelling

5. Reliability allocation – the block failure rates are taken as a measure of the complexity and improved, suitably weighted objectives are set.

STUDY GUIDE 6 - Reliability Testing

Objectives

You should be able to:

Recognise the importance of reliability testing

Choose the approach that is best suited to your situation.

O’Conner Chapter 11 – Reliability Testing

Introduction

Testing is an essential part of any engineering development programme. Reliability testing is necessary because designs are seldom perfect and because designers cannot be aware of, or be able to analyse, all the likely causes of failures of their designs in service.

Reliability testing should be considered as part of an integrated test programme, which should include:

1. Functional Testing (to confirm the design meets the basic performance requirements)

2. Environmental Testing (to ensure the design is capable of operating under the expected range of environments)

3. Statistical Tests (as described in Chapter 5, to optimise the design of the product and production processes)

4. Reliability testing (to ensure the product will operate without failure during its expected life)

5. Safety Testing (when appropriate)

To provide the basis for a properly integrated development test programme, the design specifications should cover all criteria to be tested (function, environment, reliability, safety).

The development test programme should include:

1. Model allocations (components, sub-assemblies, system)

2. Requirements for facilities such as test equipment

3. A common test and failure reporting system

4. Test schedule

One person should be put in charge of the entire programme, with the responsibility and authority for ensuring that all specification criteria will be demonstrated.

There is one conflict inherent in reliability testing. To obtain information about reliability in a cost-effective way (ie quickly) it is necessary to generate failures. Only then can safety margins be ascertained. On the other hand, failures interfere with functional and environmental testing. The development test programme must address this dilemma.

The development test dilemma should be addressed by dividing tests into two main categories:

1. Tests in which failures are undesirable

2. Tests which deliberately generate failures

Statistical, functional and most environmental testing are in category 1. Most reliability testing is in category 2. There must be a common reporting system for test results and failures, and for action to be taken to analyse and correct failure modes.

The category 2 test should be started as soon as the hardware is available. Tests should be planned to show up failure modes as early as is practicable.

Planning Reliability Testing

Using design analysis data

The design analyses performed during the design phase should be used in preparing the reliability test plan. These should have highlighted the risks and uncertainties in the design, and the reliability test programme should specifically address these.

Considering Variability

We have seen (Chapters 4 and 5) how variability affects the probability of failure. A major source of variability is the range of production processes that convert designs into hardware. Therefore the reliability test programme must cover the effects of variability on the expected and unexpected failure modes.

Durability

The reliability test programme must take account of the pattern of the main failure modes with respect to time.

If the failure modes have increasing hazard rates, testing must be directed to assuring adequate reliability during the expected life.

Generally speaking, mechanical components and assemblies are subject to increasing hazard rates when wear, fatigue, corrosion or other deterioration processes can cause failure. Systems subject to repair and overhaul can also become less reliable with age, due to the effects of maintenance, so the appropriate maintenance actions must be included in the test plan.

Test Environments

The reliability test programme must cover the range of environmental conditions which the product is likely to have to endure. The main reliability-affecting environmental factors are:

Temperature

Vibration

Shock

Humidity

Power input and output

Dirt

People

In addition, electronic equipment might be subjected to:

Electromagnetic effects (EMI)

Voltage transients, including Electrostatic Discharge (ESD)

Testing for Reliability and Durability: Accelerated Testing

In Chapters 8 and 9 we reviewed how mechanical, electrical and other stresses can lead to failures, and in Chapter 4 how variations of strength, stresses and other conditions can influence the likelihood of failure or duration to failure. In this section we describe how tests should be designed and conducted to provide assurance that designs and products are reliable and durable in service.

For most engineering designs, we do not know what is the “uncertainty gap” between the theoretical and real capabilities of the design and of products made to it, for the whole population, over their operating lives and environments.

The conventional approach to this problem has been to treat reliability as a functional performance characteristic that can be measured, by testing items over a period of time whilst applying simulated or actual in-service conditions, then calculating the reliability achieved on the test. These methods are fundamentally inadequate for providing assurance of reliability. The main reason is that they are based on measuring the reliability achieved during the application of simulated or actual stresses that are within the specified service environments, in the expectation that the number of failures will be below some criterion for the test.

The correct approach is straightforward: we must test to cause failures, not test to demonstrate successful achievement. If the design is simple and there is an adequate margin between stress and strength, we might decide that no further testing is necessary. If, however, constraints such as weight force us to design with smaller margins, and if the component’s function is critical, we might well consider it prudent to test some quantity to failure.

When failures occur on test we should ask whether they could occur in use. The questions that must be asked are:

1. Could this failure occur in use (on other items, after longer times, at other stresses)?

2. Could we prevent it from happening in use?

The stresses that were applied are relevant only in so far as they were tools to provide the evidence that an opportunity exists to improve the design. We have obtained information on how to reduce the uncertainty gap.

For even simple and common failure situations like these, there is not just one distribution that is important, but a number of possible distributions and interactions. This reasoning leads to the main principle of development testing for reliability: we should increase the stresses so that we cause failures to occur, then use the information to improve reliability and durability.

The logic that justifies the use of very high “unrepresentative” stresses is based upon four aspects of engineering reality:

1. The causes of failures that will occur in the future are often very uncertain

2. The probabilities of and durations to failures are also highly uncertain.

3. Time spent testing is expensive, so the more quickly we can reduce the uncertainty gap the better

4. Finding causes of failures during development and preventing recurrence is far less expensive than finding new failure causes during use/service.

It cannot be emphasised too strongly: testing at “representative” stresses, in the hope that failures will not occur, is very expensive in time and money and is mostly a waste of resources.

Smith, Chapter 12 –

AS3960 Section 2

AS3960 Page 26

Self-Assessment Questions

1. List the five elements of reliability testing

Functional Testing (to confirm the design meets the basic performance requirements)

Environmental Testing (to ensure the design is capable of operating under the expected range of environments)

Statistical Tests (as described in Chapter 5, to optimise the design of the product and production processes)

Reliability testing (to ensure the product will operate without failure during its expected life)

Safety Testing (when appropriate)

2. The widest range of reliability-affecting environmental categories are:

3. This is not one of the main types of test program?

4. Name four categories of testing

Temperature

Vibration/shock

Electromagnetic

5. How many systems to be tested can be determined by considering what three issues?

STUDY GUIDE 7 - Managing & Solving Reliability Problems

O’Conner Chapter 12 – Analysing Reliability Data

Introduction

This chapter describes a number of techniques, further to the probability plotting methods described in chapter 3, which can be used to analyse reliability data derived from development tests or field units, with the objectives of monitoring trends, identifying causes of unreliability, and measuring or demonstrating reliability.

Pareto Analysis

As a first step in reliability data analysis we can use the Pareto principle of the ‘significant few and the insignificant many’. It is often found that a large proportion of failures in a product are due to a small number of causes. Therefore if we analyse the failure data, we can determine how to solve the largest proportion of the overall reliability problem with the most economical use of resources.
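
A Pareto analysis amounts to counting failures by cause, sorting in descending order and accumulating the percentages. The sketch below uses invented failure records, and the function name is hypothetical.

```python
from collections import Counter

def pareto_table(failure_causes):
    """Return (cause, count, cumulative %) rows, most frequent cause first."""
    counts = Counter(failure_causes)
    total = sum(counts.values())
    rows = []
    cumulative = 0
    for cause, count in counts.most_common():
        cumulative += count
        rows.append((cause, count, 100.0 * cumulative / total))
    return rows

# Invented failure reports, one entry per reported failure.
reports = ["connector"] * 6 + ["seal"] * 3 + ["software"] * 1

for cause, count, cum_pct in pareto_table(reports):
    print(cause, count, cum_pct)
```

Reading down the cumulative column shows how few causes need to be tackled to address most of the reported failures.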

Accelerated Test Data Analysis

Failure and life data from accelerated stress tests can be analysed using the methods described in Chapters 2-5. If the mechanism is well understood, for example material fatigue, then the model for the process can be applied to interpret the results and to derive reliability or life values at different stress levels.

Extrapolation of accelerated test results to expected in-service conditions can be misleading if the test stresses are much higher, since failure mechanisms different from those seen in service might be stimulated. It is important that the primary objective of the test is understood: whether it is to determine or confirm a life characteristic, or to help create designs that are inherently failure free.

These methods are not appropriate for failures of assemblies or systems, when several different failure modes might be present.

Probability plotting methods (Chapter 3) can also be used for analysing such data, when sufficient data are available.

Reliability Analysis of Repairable Systems

Chapter 3 described methods for analysing data related to time to first failure. However, for repairable systems, which really represent the great majority of everyday reliability experience, the distributions of times to first failures are much less important than is the failure rate or rate of occurrence of failure (ROCOF) of the system.

Any repairable system may be considered as an assembly of parts, the parts being replaced when they fail. If we ignore replacement (repair) times, which are usually small in comparison with standby or operating times, and if we assume that the time to failure of any part is independent of any repair actions, then we can use the methods of event series analysis (shown in chapter 2) to analyse system reliability.

If we do not perform a centroid test and assume the data were independently and identically distributed (IID), we might order the data in rank order and plot on probability paper.

This example shows how important it is for failure data to be analysed correctly, depending on whether we need to understand the reliability of a non-repairable part or of a repairable system. The presence of a trend when the data are ordered chronologically shows that times to failure are not IID, and ordering by magnitude, which implies IID, will therefore give misleading results.
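
The centroid test mentioned above (also known as the Laplace test) can be sketched as follows for a system observed from time 0 to a fixed time T. It compares the centroid of the chronological failure times with the midpoint of the observation interval. The failure times are invented and the function name is hypothetical.

```python
import math

def centroid_test(failure_times, observation_time):
    """Centroid (Laplace) trend statistic for a time-truncated observation.

    Under no trend the statistic is approximately standard normal.
    Positive values suggest an increasing failure rate (failures bunch
    late in the interval); negative values suggest a decreasing rate.
    """
    n = len(failure_times)
    centroid = sum(failure_times) / n
    return (centroid - observation_time / 2.0) / (
        observation_time * math.sqrt(1.0 / (12.0 * n)))

# Invented cumulative failure times (hours) over 1000 hours of operation.
evenly_spread = [100.0, 300.0, 500.0, 700.0, 900.0]
bunched_late = [600.0, 700.0, 800.0, 900.0, 950.0]

print(centroid_test(evenly_spread, 1000.0))
print(centroid_test(bunched_late, 1000.0))
```

An absolute value above about 1.96 indicates a trend at the 5% significance level (two-sided), in which case the times between failures should not be treated as IID.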

We can derive system reliability over a period by plotting the cumulative times to failure in chronological order rather than in rank order.

If there are no perturbations, the failure rate will tend to a constant value after most parts have been replaced at least once, regardless of the failure trends of individual parts. This is one of the main reasons the Constant Failure Rate (CFR) assumption has become so widely used for systems, and why part hazard rates have been confused with failure rate.

If part times to failure are independently and identically exponentially distributed (IID exponential) the system will have a CFR which will be the sum of the reciprocals of the part mean times to failure, e.g.:

$$\lambda_s = \frac{1}{\theta_1} + \frac{1}{\theta_2} + \cdots + \frac{1}{\theta_n}$$

where $\lambda_s$ is the system failure rate and $\theta_i$ is the mean time to failure of part i.

The assumption of IID exponential for part times to failure for a repairable system can be misleading for the following reasons:

1. The most important failure modes of a system are usually caused by parts which have failure probabilities which increase with time (wear out failures)

2. Failure and repair of one part may cause damage to others. Therefore times between failures are not necessarily independent.

3. Repairs often do not “renew” the system.

4. Repairs might be made by adjustment, lubrication, etc., thus providing a new lease of life, but not “renewal”

5. Replacement parts can make subsequent failure initially more likely to occur

6. Repair personnel learn by experience, so diagnostic ability improves with time. Conversely, changes in personnel can lead to reduced diagnostic ability and therefore repeated failures.

7. Not all part failures will produce system failures

8. Factors (such as cycling) are often more important than operating times

9. Reported failures are nearly always subject to human bias and emotion

10. Failure probability is affected by scheduled maintenance or overhaul

11. Replacement parts are not necessarily drawn from the same population – they may be better or worse.

12. System failures might be caused by parts whose combined tolerances cause the system to fail.

13. Many reported failures are not caused by part failures at all, but by events such as intermittent connections, improper use, maintainers using opportunities to replace “suspect” parts etc.

14. Within a system, not all parts operate to the overall system cycle.

The factors above often predominate in systems to be modelled and in collected reliability data.

A CFR is often a practicable and measurable first-order assumption, particularly when data are not sufficient to allow more detailed analysis.

CUSUM Charts

The ‘cumulative sum’, or CUSUM, chart is an effective graphical technique for monitoring trends in quality control and reliability. The principle is that, instead of monitoring the measured value of interest, we plot the divergence, plus or minus, from the target value. The method enables us to report progress simply and in a way that is very easily comprehended.

The CUSUM chart also provides a sensitive indication of trends and changes. Instead of indicating measured values against the sample number, the plot shows the CUSUM, and the slope provides a sensitive indicator of the trend, and of points at which the trend changes.
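
The CUSUM values are simply the running total of the deviations from target. The sketch below uses invented monthly failure counts and an assumed target of 4 failures per month; the function name is hypothetical.

```python
def cusum(values, target):
    """Running sum of (value - target) for each sample in order."""
    total = 0.0
    sums = []
    for value in values:
        total += value - target
        sums.append(total)
    return sums

# Invented monthly failure counts against a target of 4 per month.
monthly_failures = [4, 3, 5, 4, 6, 7, 6]
print(cusum(monthly_failures, 4))
```

The plot stays roughly level while results are on target and begins to climb from the fifth sample onwards, which is where the trend changes.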

Exploratory Data Analysis and Proportional Hazards Modelling

Exploratory data analysis (EDA) is a simple graphical technique for searching for connections between time series data and explanatory factors. In the reliability context, the failure data are plotted as a time series chart, along with other information.

This method of presenting data can be very useful for showing up causes of unreliability in systems such as vehicle fleets, process plants etc.

Proportional hazards modelling (PHM) is a mathematical extension of EDA. In the proportional hazards model, the covariates are assumed to have a multiplicative effect on the total hazard rate. In standard regression analysis or analysis of variance the effects are assumed to be additive. The proportional hazards approach can be applied to failure rate data from repairable and non-repairable systems.
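
The multiplicative assumption can be written in the form usually given for the proportional hazards model. The symbols here are chosen for illustration: $h_0(t)$ is a baseline hazard rate, $z_1, \dots, z_k$ are the covariates and $\beta_1, \dots, \beta_k$ are coefficients estimated from the data.

```latex
h(t; z_1, \dots, z_k) = h_0(t)\,\exp(\beta_1 z_1 + \beta_2 z_2 + \cdots + \beta_k z_k)
```

Each covariate therefore scales the baseline hazard by a factor $\exp(\beta_i z_i)$, in contrast to the additive effects assumed in standard regression.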

Reliability Demonstration

It is often necessary to measure the reliability of equipment and systems during development, production and use. Two basic forms of reliability measurement are used. A sample of equipment may be subjected to a formal reliability test, with the conditions specified in detail. Reliability may also be monitored during development and use, as test and utilization proceed, without tests being set up specifically for reliability measurement. This section describes standard methods of test and analysis which are used to demonstrate compliance with reliability requirements.

Probability ratio sequential test (PRST) (US MIL-HDBK-781)

MIL-HDBK-781 testing is based on probability ratio sequential testing (PRST). Testing continues until the ‘staircase’ plot of failures versus time crosses a decision line. The reject line indicates a boundary beyond which the equipment will have failed to meet the test criteria. Crossing the accept line denotes that the test criteria have been met. Test time is stated as multiples of the specified MTBF.

Combining Results Using Bayesian Statistics

It can be argued that the result of a reliability demonstration test is not the only information available on a product, but that information is available prior to the start of the test, from component and subassembly tests, previous tests on the product and even intuition based on experience. Why should this information not be used to supplement the formal test result? Bayes’ theorem enables us to combine such probabilities.

The Bayesian approach is very controversial in reliability engineering, particularly as it has been argued that it provides a justification for less reliability testing. Choosing a prior distribution based on subjective judgement or other test experience can also be very contentious. Combining subassembly test results in this way also ignores the possibility of interface problems. The Bayesian approach is not normally recommended and has not been approved in any formal national standards.

Non-Parametric Methods

Non-parametric statistical techniques (see page 65) can be applied to reliability measurement. They are arithmetically very simple and so can be useful as quick tests in advance of more detailed analysis, particularly when no assumption is made of the underlying failure distribution.

Reliability Growth Modelling

The Duane Method

It is common for new products to be less reliable during early development than later in a program, when improvements have been incorporated as a result of failures observed and corrected. This was first analysed by J.T. Duane, who derived an empirical relationship based upon observation of the MTBF improvement of a range of items used on aircraft. Duane observed that the cumulative MTBF plotted against total time on log-log paper gave a straight line.

The slope of the line gives an indication of the rate of MTBF growth and hence the effectiveness of the reliability programme in correcting failure modes. Duane observed that typical slopes ranged between 0.2 and 0.4, and that the value was correlated with the intensity of the effort on reliability improvement.

The Duane method is applicable to a population with a number of failure modes which are progressively corrected, and in which a number of items contribute different running times to the total time. Therefore it is not appropriate for monitoring early development testing. The method is also not consistent with the use of accelerated tests during development, since the objective of these is to force failures, not to generate reliability statistics.

After the end of a development programme, the anticipated MTBF of production items is measured assuming the development testing accurately simulated the expected in-use stresses of the production items and that the standard of items being tested at the end of the development program fully represents production items. The empirical Duane method provides a reasonable approach to monitoring and planning MTBF growth for complex systems.

The Duane method can also be used in principle to assess the amount of test time required to attain a target MTBF. If the MTBF is known at some early stage, the test time required can be estimated if a value is assumed for the growth slope.
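As a sketch of that estimate, assume the usual power-law form of the Duane model, cumulative MTBF = theta0 × (T / T0)^alpha, together with the standard result that the instantaneous MTBF equals the cumulative MTBF divided by (1 - alpha). Neither formula is written out in these notes, so both are assumptions here, as are the function name `required_test_time` and the input values.

```python
def required_test_time(theta0, t0, alpha, target_mtbf):
    """Total test time at which the instantaneous MTBF reaches target_mtbf.

    theta0: cumulative MTBF observed at cumulative test time t0
    alpha: assumed growth slope, which must lie between 0 and 1
    """
    if not 0 < alpha < 1:
        raise ValueError("alpha must lie between 0 and 1")
    # Cumulative MTBF that corresponds to the target instantaneous MTBF
    required_cumulative = target_mtbf * (1 - alpha)
    # Invert theta_c = theta0 * (T / t0) ** alpha for T
    return t0 * (required_cumulative / theta0) ** (1 / alpha)

# Illustrative inputs: 50 h cumulative MTBF after 100 h, target 500 h
hours = required_test_time(theta0=50, t0=100, alpha=0.5, target_mtbf=500)
print(f"test time required: {hours:.0f} h")
```

The result is very sensitive to the assumed slope, which is one reason the estimate should be treated as a planning figure only.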

The Duane method is criticised as being empirical and subject to wide variation. It is also argued that reliability improvement in development is not usually progressive but occurs in steps as modifications are made. However, the model is simple to use and it can provide a useful planning and monitoring method for reliability growth. As with any other failure data, trend tests as described in Chapter 2 should be performed to ascertain whether the assumption of a constant failure rate is valid.
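The notes do not restate here which trend test is meant. One commonly used choice is the Laplace (centroid) test, sketched below on that assumption for a time-truncated observation period. The function name `laplace_trend_statistic` is hypothetical.

```python
import math

def laplace_trend_statistic(failure_times, total_time):
    """Laplace (centroid) statistic for a time-truncated observation period.

    failure_times: cumulative times of each failure, all <= total_time
    Returns u, which is approximately standard normal when the failure
    rate is constant (no trend).
    """
    n = len(failure_times)
    if n == 0:
        raise ValueError("need at least one failure")
    centroid = sum(failure_times) / n
    return (centroid - total_time / 2) / (total_time * math.sqrt(1 / (12 * n)))

# Illustrative data: failures bunched early in a 100 h period
u = laplace_trend_statistic([10, 20, 30], total_time=100)
print(f"u = {u:.2f}")
```

A value of u well below zero means the failures are concentrated early in the period, which is consistent with reliability growth. A value well above zero suggests a deteriorating trend, and a value near zero is consistent with a constant failure rate.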

Statistical tests for MTBF or success rate changes can also be used to confirm reliability growth, as described in Chapter 2.

The M(t) Method

The M(t) method of plotting failure data is a simple and effective way of monitoring reliability changes over time. It is most suitable for analysing the reliability performance of equipment in service.

M(t) is the mean accumulated number of failures as a function of operating time. The slope of the line indicates the proportion failing per time unit, or the failure intensity. Reliability improvement will reduce the slope. A straight line indicates a constant (random) failure pattern. An increasing slope indicates an increasing failure pattern. Changes in slope indicate changing trends.
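The notes do not give the arithmetic for M(t). A minimal sketch of one common non-parametric estimate follows, in which each failure adds 1 divided by the number of items still under observation at that time. The function name `mean_cumulative_failures` and the fleet data are hypothetical.

```python
def mean_cumulative_failures(histories):
    """Estimate M(t) for a group of repairable items.

    histories: list of (failure_times, end_of_observation) per item.
    Returns a list of (t, M) points, one per failure time.
    Each failure adds 1 / (number of items still observed at time t).
    """
    events = sorted(t for failures, _ in histories for t in failures)
    points = []
    m = 0.0
    for t in events:
        at_risk = sum(1 for _, end in histories if end >= t)
        m += 1.0 / at_risk
        points.append((t, m))
    return points

# Illustrative fleet of three items with operating hours at each failure
fleet = [
    ([120, 400, 950], 1000),
    ([300, 800], 1000),
    ([150], 600),
]
for t, m in mean_cumulative_failures(fleet):
    print(f"{t:5d} h  M(t) = {m:.2f}")
```

Plotting these points against operating time gives the M(t) line, and its local slope is the failure intensity described above.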

The M(t) method can be used to monitor reliability trends such as the effectiveness of improvement actions.

The M(t) method can be useful for identifying and interpreting failure trends. It can also be used for evaluating logistics and warranty policies.


O’Conner, Cautionary Note, page 22

Whilst statistical methods can be very powerful, economic and effective in reliability engineering applications, they must be used in the knowledge that variation in engineering is in important ways different from variation in most natural processes or in repetitive engineering processes such as controlled machining or diffusion processes. Such processes are usually:

Constant in time, in terms of the nature (average, spread) of the variation

Distributed in a particular way, describable by a mathematical function known as the s-normal distribution

In fact, these conditions often do not apply in engineering. For example:

Past data cannot be used to forecast future reliability using purely statistical methods. A change in a process might affect reliability, and the change might be deliberate or accidental, known or unknown.

Components might be selected according to criteria such as dimension or other measured parameters. This can invalidate the s-normal distribution assumption on which many statistical methods are based. This might or might not be important in assessing results.

A process or parameter might vary in time, continuously or cyclically, so that statistics derived at one time might not be relevant at others.

Variation is often deterministic by nature, for example spring deflection as a function of force, and it would not always be appropriate to apply statistical techniques to this sort of situation.

Variation in engineering can arise from factors that defy mathematical treatment.

Variation can be catastrophic, not only continuous.

These points highlight the fact that variation in engineering is caused to a large extent by people, as designers, makers, operators and maintainers. Therefore the human element must always be considered and statistical analysis must not be relied on without appropriate allowances being made for the effects of factors such as motivation, training, management and the many other factors that can influence reliability.

In any application of statistical methods, it must be remembered that ultimately all cause and effect relationships have explanations. Statistical techniques can be very useful in helping us to understand and control engineering situations; however, they do not provide explanations on their own. We must seek to understand the causes of variation, since only then can we really be in control.

Smith, Chapter 3

Study Guide 7: Self Assessment Questions

1. The usual indices of quality costs include

Prevention Costs

Appraisal Costs

Failure Costs

2. Reliability growth modelling

3. Profit from quality is measured by the difference between

4. What are four components of life cycle cost?

I. Acquisition Costs

II. Ownership Cost

III. Operating Cost

IV. Administration Cost

5. How does reliability contribute to life cycle?

Determines the frequency of repair, fixes spares requirements, determines loss of revenue (with maintainability)
