
Human hazard analysis
A prototype method for human hazard analysis developed for the large commercial aircraft industry

Philip Lawrence and Simon Gill
Aerospace Research Centre, UWE Bristol, Bristol, UK

Disaster Prevention and Management, Vol. 16 No. 5, 2007, pp. 718-739. Emerald Group Publishing Limited, ISSN 0965-3562. DOI 10.1108/09653560710837028

Abstract

Purpose – This paper sets out to outline a human hazard analysis methodology as a tool for managing human error in aircraft design, maintenance, operations and production. The methodology developed has been used in a slightly modified form on Airbus aircraft programmes. Undertaking the research was motivated by the fact that aviation incidents and accidents still show a high percentage of human-factors events as key causal factors.

Design/methodology/approach – The methodology adopted takes traditional aspects of the aircraft design system safety process, particularly fault tree analysis, and couples them with a structured tabular notation called a human error modes and effects analysis (HEMEA). HEMEA provides data, obtained from domain knowledge, in-service experience and known error modes, about likely human-factors events that could cause critical failure modes identified in the fault tree analysis. In essence the fault tree identifies the failure modes, while the HEMEA shows what kind of human-factors events could trigger the relevant failure.

Findings – The authors found that the methodology works very effectively, but that it is very dependent on locating the relevant expert judgement and domain knowledge.

Research limitations/implications – The methodology is very dependent on locating the relevant expert judgement and domain knowledge. Using the method as a prototype, looking at aspects of a large aircraft fuel system, was very time-consuming and the industry partner was concerned about the resource implications of implementing this process. Regarding future work, the researchers would like to explore how a knowledge management exercise might capture some of the domain knowledge to reduce the requirement for discursive, seminar-type sessions with domain experts.

Practical implications – It was very clear that the sponsors and research partners in the aircraft industry were keen to use this method as part of the safety process. Airbus has used a modified form of the process on at least two programmes.

Originality/value – The authors are aware that the UK MOD uses fault tree analysis that includes human-factors events. However, the researchers believe that the creation of the human error modes and effects analysis is original. On the civil side of the aviation business this is the first time that human error issues have been included for systems other than the flightdeck. The research was clearly of major value to the UK Civil Aviation Authority and Airbus, who were the original sponsors.

Keywords Aircraft, Quality control, Air safety

Paper type Research paper

1. Human-centred design

1.1 Introduction
Because the technological elements in aircraft systems are becoming ever more reliable, there has been an inevitable sense in recent years that human factors issues would become the central focus for improving safety. This conclusion was reached by the UK Civil Aviation Authority (UK CAA) after an analysis of the results of a highly significant "Key Risks" study of 671 fatal aviation accidents (UK Civil Aviation Authority, 2000). In particular, "crew and human factors" and "design related" issues emerged as the two key causal factors in relation to fatal aviation accidents.

A second key factor that the UK CAA highlighted was the issue of maintenance error. Although maintenance human error had been topical throughout the 1990s, the focus had chiefly been on the errors of maintenance staff as well as the workplace and commercial constraints, which generated what James Reason had referred to as "organisational pathogens". In Reason's analysis these pathogens create latent conditions, which are in effect accidents waiting to happen (Reason, 1997).

However, the UK CAA's response to the "Key Risks" study also emphasised the design aspect of maintenance error. During the 1990s mandatory occurrence reports (MORs) in the UK had risen continually for maintenance-related issues. Although part of the problem was clearly the error of maintenance staff and also error inducers in the work environment, it was also clear that design could induce maintenance error. This aspect of design had long been a focus of attention for cockpit designers, but had not been systematically related to maintenance.

One psychological correlate for this realisation about design lies in the concept of affordance. According to Norman (1988) an affordance is an aspect of design that gives a strong cue as to how an artefact could/should be used. But with poor design, affordances can be counter-intuitive. The "affordance perception" can be wrong. Doors can be made to look as though they pull to open, when in fact they have to be pushed. Also, objects and machines can have hidden control logics, which override the obvious affordance. With more mundane artefacts users often come to accept quirky aspects of design, and Norman (1988) argues strongly that in our technologically oriented world people who find machines difficult to operate tend to blame their own inadequacies. But his clear message is that the fault often lies with the design, not the user or maintainer.

In the case of computerised control systems the possibility that humans will misunderstand the operational logic of the system clearly jeopardises safety. In aircraft systems a good example is that of highly automated flight management systems (FMS), where the logics that determine mode selection for flight control functions can remain opaque to some crews. Particularly when the automation does something unexpected, crews can respond in an inappropriate way because their mental model of what is occurring is discrepant from the actual system behaviour. This problem was in evidence at Nagoya in Japan in April 1994, when an Airbus A300-600 crashed with the loss of 264 lives. The crew were attempting to select a glide slope approach, unaware that the autopilot was in go-around mode. As the aircraft adopted a pitch-up attitude the crew responded with manual inputs to bring the nose down. In effect the crew were engaged in a tug of war with the autopilot, which ultimately resulted in them neutralising the aircraft's anti-stall protection. At 1,000 ft, still in a pitch-up attitude, the aircraft's speed decayed to 90 mph, causing a stall[1]. In this case the human operator's inability to understand the system's behaviour had catastrophic consequences.

1.2 P-NPA 25.310
In keeping with the insight of Norman outlined above, the UK CAA used their accident data and maintenance reporting information as the catalyst for an initiative on human-centred design (HCD) (UK Civil Aviation Authority, 2000). The aim was simply to build requirements into the aircraft design process that would reduce the incidence of operator and maintainer error. In 2000 one of the recommendations of this HCD initiative formed the basis for a Preliminary Notice of Proposed Amendment (P-NPA) to JAR 25 (UK Civil Aviation Authority, 2000).

One of the catalysts for the P-NPA was the fact that the JAR 25.1309 (now CS 25) safety codes show a clear intent to include aspects of human performance in the system safety requirements for large aircraft, but these were seen to be limited and incomplete. Thus the P-NPA could be seen as doubly appropriate. On the one hand everything pointed to the centrality of human factors issues as key causal agents in accidents; on the other the existing safety requirements regime seemed to lack detailed and systematic coverage of human factors issues.

Broadly speaking, P-NPA 25.310 had two objectives:

(1) to make designers more aware of the need for “human-centred design”; and

(2) to identify safety critical tasks performed by humans and, wherever possible, address them, preferably by design.

Where this cannot be achieved, the potential for safety critical failures arising fromhuman error should be minimised.

The key passage in the P-NPA is found in the following paragraph:

It must be shown by analysis, substantiated where necessary by test, that as far as reasonably practicable all design precautions have been taken to prevent human errors in production, maintenance and operation causing Hazardous or Catastrophic effect. Where the potential cannot realistically be eliminated, then the remaining safety critical tasks should be sufficiently understood and the potential for Human Error mitigated (UK Civil Aviation Authority, 2000).

1.3 The system safety process and human hazard analysis (HHA)
As can be seen, this instruction in the P-NPA involved a substantial addition to the then typical JAR 25.1309 safety process. The process is in accordance with CS 25.1309 and adopts the framework of Aerospace Recommended Practice (ARP) 4754 and ARP 4761. The framework is then developed to cover all phases of aircraft development, from the aircraft-level preliminary design phase to the in-service airworthiness monitoring after certification.

The system safety process identifies aircraft functions and the systems designed to perform these functions in order to determine that possible hazards have been adequately addressed. A description of the initial stages of the safety process is provided here to highlight what is required of the safety process in order to conduct an HHA. The phases of the HHA methodology are then delineated.

The process begins with the aircraft-level functional hazard assessment (FHA) to assess the significant failure conditions attached to given aircraft functions. After functions are assigned to specific aircraft systems, the FHA is repeated at the system level. From the point of view of HHA, what the FHA will have provided is the identification of functions and potential failure conditions which may trigger catastrophic or hazardous failure conditions. From the system-level FHA significant failure conditions are again determined and the HHA can begin to map and analyse the possible contribution of human actions to those failures.


The next stage is the preliminary system safety assessment (PSSA). This takes the failure conditions that have been identified and, using techniques such as fault tree analysis (FTA) coupled with data about component reliability, seeks to determine whether selected probability targets for given failure conditions have been met. After the PSSA has been issued, equipment suppliers are given specifications to meet for the reliability of equipment and they are mandated to produce their own safety plans using mean time between failures (MTBF) data and failure mode and effects analysis (FMEA).
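To illustrate how supplier reliability data of this kind relates to a probability target, the following sketch converts an MTBF value into a per-flight-hour failure probability under a constant-failure-rate (exponential) assumption. The component MTBF and the probability budget are illustrative assumptions, not values from any actual programme.

```python
import math

def failure_probability(mtbf_hours: float, exposure_hours: float) -> float:
    """Probability of at least one failure during the exposure time,
    assuming a constant failure rate (exponential model)."""
    return 1.0 - math.exp(-exposure_hours / mtbf_hours)

# Hypothetical supplier MTBF and probability budget from the PSSA (illustrative only).
mtbf = 200_000.0    # hours
budget = 1.0e-5     # allowed probability per flight hour

p_per_hour = failure_probability(mtbf, 1.0)
print(f"P(failure per flight hour) = {p_per_hour:.2e}",
      "- within budget" if p_per_hour <= budget else "- exceeds budget")
```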

The real impact of the HHA technique recommended here will materialise at the PSSA stage. The PSSA includes the following:

• examining the proposed system architecture to determine how it will meet FHA requirements and how failures can cause the functional hazards identified in the FHA;
• deriving safety/reliability (S/R) requirements from system-level FHA, for design of lower-level items (e.g. equipment, software, etc.);
• interacting with suppliers;
• validating S/R requirements from FHA;
• analysing through FTA showing the combinations of causal failure modes which can lead to the failure condition;
• determining development assurance levels (DALs) to be imposed on each system component; and
• determining a preliminary list of maintenance tasks[2].

As can be seen from the list above, the PSSA provides the right context to begin to assess systematically the role of human error in system failures, because it is only when there is a derived architecture that the exact role of humans in the system becomes clear. By using FTA the potential role of critical human error becomes clear, as do the combinations of failures and errors necessary to create hazardous or catastrophic system states. In addition, from a maintenance and production point of view, the PSSA provides information on requirements for equipment and components. Once the base level events of the FT are determined, suppliers must generate FMEAs to satisfy that failures within components cannot contribute to the listed base events. The key part of the HHA proposed here is to assess the potential impact of maintenance errors on failures of components with potentially hazardous or catastrophic effects. One of the major premises of the research is that safety-critical components may have a design aspect likely to lead to a potentially safety-critical maintenance or production error.
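As a minimal illustration of this point, the sketch below models an invented fragment of a fault tree in which the hazardous top event occurs either through a single component failure or through the combination of a second component failure with a latent maintenance error treated as a human-error base event. The event names and probabilities are assumptions made for this example only, not figures from any actual analysis.

```python
# Minimal fault-tree fragment: base events may be component failures or
# human-error events; gates combine them with AND/OR logic.
# All event names and probabilities are illustrative assumptions.
base_events = {
    "valve_jams_closed": 1.0e-6,          # component failure, per flight
    "seal_omitted_on_overhaul": 1.0e-4,   # latent maintenance error
    "pump_fails": 1.0e-5,                 # component failure, per flight
}

def p_and(*probabilities: float) -> float:
    """AND gate: all inputs must occur (assumed independent)."""
    result = 1.0
    for p in probabilities:
        result *= p
    return result

def p_or(*probabilities: float) -> float:
    """OR gate: any single input is enough (rare-event approximation)."""
    return min(1.0, sum(probabilities))

# Top event: loss of the function if the pump fails, OR the valve jams while a
# maintenance error has already (latently) disabled the backup path.
p_top = p_or(
    base_events["pump_fails"],
    p_and(base_events["valve_jams_closed"], base_events["seal_omitted_on_overhaul"]),
)
print(f"P(top event per flight) = {p_top:.2e}")
```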

The PSSA therefore assesses the proposed system architectures and derives safety requirements for subsequent levels. The next stage of the process, the system safety assessment (SSA), verifies that the design satisfies the safety objectives. The SSA integrates the results of the PSSA, results of detailed studies at the supplier level, and the results of any additional tests that may have been requested in the PSSA. The SSA is delivered as a certification document.

1.3.1 Fault trees in human error assessment. Fault trees have been used in the method developed here for the following reasons:


(1) Fault trees are widely used in the aerospace industry and can therefore easily be adapted to human error assessment (i.e. already existing fault trees for system assessment can be used).
(2) Fault trees are a recognised and proven tool for system safety assessment.
(3) Fault trees are systematic and typically consider all contributory failures in a logical fashion. They have high information density and are unambiguous schematics.
(4) Fault trees are adaptable to any system or scenario.
(5) Fault trees can determine common mode failures/errors within a system.
(6) Fault trees can provide the criticality of human actions, component failures, sub-system failures and events.

The limitations of fault trees include the following:

• Fault trees can only consider a component or a human action in a failed/not-failed state.
• They give no credit for system recovery or the human taking mitigating action following the failure/error.
• Fault trees cannot differentiate between various man-machine interfaces (i.e. fault trees cannot determine which man-machine interface is less likely to produce a human error).
• Only a probability of failure can be used, which may not be available or suitable.
• Fault trees are poor at depicting sequential events that would take place in assembly, disassembly or removal of a component.
• Fault trees cannot show the system state or system health (including that of operators and maintainers) as the scenario begins. In this case, fault trees cannot show if the operator or maintainer is working in poor conditions, is fatigued, etc., as the job commences. In other words, fault trees cannot illustrate context.

Because of the limitations listed above, it is argued here that fault tree analysis is mainly suitable as a tool for logically determining the critical components within a system. Further assessment of human error should take place in the tabular HEMEA explained below.

1.3.2 Other safety techniques. In conjunction with the steps outlined above, aircraft manufacturers undertake additional safety assessments to identify risks across systems, at the aircraft level. These are studies that investigate the possible corruption of independence. They include zonal safety analysis (ZSA), particular risk analysis (PRA) and common mode analysis (CMA), to identify common mode failures (CMFs). With regard to the model of HHA outlined here, the CMA is particularly relevant. Hollnagel (1993, p. 44) defines common mode failure (CMF) as "a failure which has the potential to fail more than one safety function and to possibly cause an initiating event or other abnormal event simultaneously". According to Hollnagel such events are rare in technical systems, but in human beings they are quite frequent. One of the pioneers of modern engineering psychology, Jens Rasmussen, noted how maintenance engineers he had observed in the nuclear industry were particularly prone to the omission of steps in procedures, particularly if the step was a functionally isolated action (Reason, 1990, p. 184). The repetition of an error across identical components is a classic cause of a CMF, which Rasmussen believes is typical of maintenance engineers (Reason, 1997, p. 24). For example, a Lockheed L1011 suffered engine failures due to a CMF when master chip detector assemblies on all three engines had been installed without O-ring seals[3].

ZSA is performed on zones of an aircraft to analyse the effects of interactions between equipment within zones and between zones. As part of the ZSA, human error is analysed in relation to the actions of maintenance personnel; for example, errors associated with the installation/removal of equipment. PRAs are analyses carried out for specific risks, normally based on analysis of an approved risk model followed by a study of the repercussions associated with the model. Any examples of human error that have been identified in the ZSA would be analysed in greater detail in the PRA, depending upon the specific risk. Examples could include uncontained engine rotor failure, wheel and tyre failure, emergency landing, etc. CMA analyses failure conditions in order to show independence between functions and any associated monitoring equipment. Human errors, whether by flight, ground or maintenance crews, are considered as potential common mode failures and are considered as part of the CMA process. In system safety both redundancy and dissimilarity are deployed to protect functions from CMFs, but some systems with redundancy, such as engines, cannot possibly exhibit dissimilarity in their major structural design components. Particularly in maintenance, CMFs remain a core threat.

2. Human factors issues

2.1 Human factors and safety
Human factors continue to play a major role in the causes of aviation accidents. Over the last 20 years a combined offensive on this problem has spawned significant changes, including:

• improvements in cockpit design;
• crew resource management initiatives to foster better communication on the flight deck and between the flight crew and cabin staff;
• elaborate safety management techniques to sharpen up organisational safety; and
• a number of initiatives to explore the nature, causes and effects of maintenance error.

These initiatives have had some success, but despite this the impact of human error in both operations and maintenance remains stubbornly persistent (NASA, 2001).

A key part of the human hazard analysis approach is to characterise human error in such a way that system designers know how to avoid design features that may be likely to provoke operational or maintenance errors. Designers need to understand human error and the role of the human within the system. Certain types of error need to be seen as predictable. Fortunately, while the occurrence of specific errors made by particular individuals may be unpredictable, certain generic patterns of human error are highly predictable and can be avoided. The safety challenge is to design in order to minimise the likelihood of error and to mitigate the effects of those errors that remain unavoidable.


2.2 Patterns of human error and human reliability
Human error can be defined as "a generic term to encompass all those occasions in which a planned sequence of mental or physical activities fails to achieve its intended outcome, and when these failures cannot be attributed to intervention of some chance agency" (Reason, 1990, p. 9).

It is now widely accepted that the typical types of errors made by humans relate to the fundamental processes of human cognition. In other words, error is a feature of normal cognitive performance: "Correct performance and systematic errors are two sides of the same coin" (Reason, 1990, p. 2). Error is chiefly a function of the interplay between information storage, information retrieval and the cognitive resources available for the planning and execution of tasks. Although the resource and capability of the human long-term memory (knowledge base) is vast, the short-term memory and the resources available for task execution are quite limited. In what Reason (1990, p. 2) refers to as the "conscious workspace", there is a fundamental competition between resources needed for task execution and resources needed for storing and retrieving information.

2.2.1 Typical human errors: frequency bias and capture errors. In order to explain the root causes of human error it is necessary to understand the consequences of the limitation of processing resources in the individual's conscious workspace – to put it simply, the limited amount and type of cognitive work that can be done when retrieval of information, planning and task execution are required simultaneously. The demands of complexity in the environment, coupled with limited cognitive resources, call forth an economic approach to human performance. This engenders a strong tendency in human processing towards simplification. The level of real detail and the specific and nuanced differences in the field of human perception and interpretation would paralyse the human processing capability, were it not that individuals simplify experience by categorising perceptions and events through various schemata, which standardise stored experiences and also provide a set of stereotypical responses to certain prompts or cues in the environment.

One such simplifier derives from a process (sometimes called a heuristic) which cognitive psychologists refer to as bias. Two such biases that play a key role in human error are "frequency bias" and "similarity bias" (NASA, 2001, p. 2). Frequency bias is a form of cognitive gambling, and fortunately it normally pays off. The bias affects both perception and action routines. In human perception, data or experience which appears similar to previous and well-understood data is interpreted through existing models or schemata. In aviation, standard operating procedures reflect the fact that operations rarely throw up novel and unique events. However, when this does happen crews may show an inability to understand the facts. During the well-known Sioux City United Airlines DC-10 incident, the flight crew were looking for support from a team of engineers at a ground station. For a period the engineering support team simply could not believe that the engine rotor burst had destroyed all three hydraulic systems. This unwelcome fact did not sit readily with their experience and their mental model of the systems' operation.

For system designers a key message that comes from cognitive frequency and similarity bias is that operators and maintainers have a mental working model of systems and the environment which will be heavily dependent on stored and previously deployed schemata, but this may distort reality. Previous experience and the way that events are habitually interpreted override the fact of novel or unique events. Here negative transfer is a powerful agent in error causation. Negative transfer occurs when knowledge and procedures from a previous and well-understood task are wrongly applied to a new and novel task. In aircraft maintenance subtle differences in procedures between Airbus and Boeing aircraft are often belied by apparent similarity. One Airbus type requires the vertical fin to be rigged two degrees off from the perpendicular, an unusual divergence from normal practice. But for a maintainer, years of working on fins that are rigged conventionally may prefigure an error. Cognitively, the well-rehearsed and understood procedure is the easiest one to deploy.

In environments where routinised tasks are regularly repeated, frequency bias is strongly associated with habit capture errors, which have been systematically theorised and extensively observed in a number of industrial operating environments. Habit capture errors occur when a novel or new routine is abandoned in favour of a more established pattern of behaviour. Again, the principle of cognitive economy is at work, as the more familiar routine is accomplished with less cognitive effort. Such habit capture can easily be validated through personal introspection.

2.2.2 Double capture slips. The effects of the inherent biases in human perception and interpretation are amplified by the problem of attention capture. When put together, attention capture and frequency bias can trigger habit capture errors. When habit capture errors occur the subject is normally undertaking a routinised set of actions, but with an additional step, or perhaps a change in one of the normal steps in the action sequence. Because of this capture of the individual's attention the new plan is unconsciously abandoned and the more strongly established routine asserts itself as the action sequence which is actually undertaken.

In a maintenance hangar the potential for attention capture is strong. Maintenance procedures which are subtly different from previous or more familiar ones are particularly prone to the risk of habit capture error when maintainers are distracted by events in the work environment. However, experiments show that when subjects undertake a cognitive rehearsal of the new or different task the likelihood of error is reduced.

2.3 Categorising errors in relation to human performance
The types of cognitive processes that are at work when individuals undertake different kinds of tasks have been referred to by engineering psychologists as levels of performance. These levels of performance are also linked to types of human error. The performance level is also significant in relation to the problem of maintenance error, as humans are more error-prone in some levels of performance than others. In maintenance, the task and associated procedure will largely determine the performance level in play when the task is performed. In addition, the concept of performance level is important, as it can give the system designer a deeper understanding of the broad issue of human performance and human error.

In the analysis of capture errors outlined in sections 2.2.1 and 2.2.2 it was shown how a stage in a planned series of actions is omitted when a more frequently performed sequence of actions captures the individual's behaviour. This type of error is categorised by Reason (1990, p. 53) as a slip, which is essentially an error of task execution. Three other error types are skill-based lapses, rule-based mistakes and knowledge-based mistakes. These are listed in Table I and related to the relevant form of human performance. This categorisation of performance levels comes from the work of Rasmussen (1986), who created and defined the categories after observing engineers engaged in trouble-shooting electronic faults. However, it should be noted that these categories came from observation and inference and, although they are widely accepted in the safety engineering community, they are not universally accepted in academic psychology.

2.3.1 Skill-based performance. Skill-based performance refers to unconscious and automated actions, or as Rasmussen (1986, p. 100) puts it, "smooth, automated and highly integrated patterns of behaviour". This kind of behaviour is largely unconscious, as the repertoire of necessary skills and the task actions are embedded in schemata in the long-term memory. The performance requires little conscious attention, other than some monitoring or checking. But when the monitoring/checking process fails or is absent a slip can occur. A lapse is also a failure at the skill-based level of performance, but a lapse is more of an information storage failure. Sudden memory failures of this type are hard to predict, whereas the execution slip is generically speaking a highly predictable pattern of error. In aviation accidents it is sometimes found that crews have forgotten a stage in a procedure with which they are very familiar. Such lapses are problematic, as current models of cognition do not easily explain why humans have these unpredictable lapses of memory.

2.3.2 Rule-based performance. Rule-based performance is enacted in a more conscious mode than skill-based performance. Rule-based performance calls up a repertoire of stored rules or refers to written rules in order to execute a planned series of actions, such as cooking by following a recipe or programming a timer on a video recorder. Rule-based performance is also critical to problem solving when rectifying a perceived failure or difficulty. Here error is categorised as a mistake, caused by the selection of bad rules or the misapplication of good rules. Rule-based mistakes are significant in maintenance as the wrong overhaul or repair procedures may be followed. As aircraft systems have become more complicated, intuitive procedures based on engineering judgement are often incorrect. As discussed in section 2.2.1, subtle differences between aircraft types may be lost on the engineer. Thus there is a clear imperative now to undertake tasks strictly according to the details in the manual. Moreover, the modularity of modern aircraft systems creates a mental wall between the maintenance engineer and the technology, such that intuitive solutions to problems are inappropriate.

2.3.3 Knowledge-based performance. Knowledge-based performance only normally ensues when a repertoire of rule-based procedures has been exhausted or is absent, for example when a situation is utterly novel. In the now famous case referred to in section 2.2.1, a United Airlines DC-10 lost all three hydraulic systems after an uncontrolled rotor failure in the number two engine. The aircraft was steered to a runway at Sioux City by the crew varying thrust in the number one and three engines. Afterwards the crew admitted that there were no procedures from which to select potential solutions to the crisis. In effect the plane was unflyable, but a mixture of teamwork and excellent airmanship and judgement by the captain provided an extraordinary knowledge-based improvisation.

Table I. Error types and human performance

Performance level       | Error type
Skill-based level       | Slips and lapses
Rule-based level        | Rule-based mistakes
Knowledge-based level   | Knowledge-based mistakes

In general, knowledge-based performance is highly inappropriate in a safety-critical environment. It is time-consuming and prone to error. The operator or maintainer is reduced to inductive, feedback-based problem solving. In percentage terms, humans make a far greater proportion of errors in this performance mode.

2.4 Maintenance errors and error types
The maintenance shop or hangar is a high-pressure work environment. Limited resources mean that illness and holidays can lead to staff shortages, or there may be unfilled vacancies. Lighting and other environmental conditions may also not be perfect. In other words there are many performance-shaping factors (PSFs) in the workplace that will affect the likelihood of error. But our knowledge of maintenance error types also shows how the cognitive basis of error is highly relevant to the incidence of maintenance error. In this paper the point is amplified that the design community needs to establish good working knowledge of the typical pattern of maintenance error, as both tasks and procedures can be re-engineered to reduce or mitigate error. Frequency bias and habit capture have been discussed to show how known limitations in human performance provoke error.

2.4.1 Maintenance errors. In aviation maintenance (and other sectors) the omission of a step in a procedure, or even the omission of a whole procedure, has been found to be the predominant type of error. A Boeing study of a major US carrier found omissions to account for more than 50 percent of the observed errors[4]. Similarly, in Rasmussen's study of the nuclear industry the omission of functionally isolated steps accounted for 34 percent of all recorded errors (Rasmussen, 1980). Both in academic analysis and through interviews with maintenance crews it has been found that the most commonly omitted step is one that is functionally isolated from the rest of a procedure. The isolation could be physical, i.e. a different location, or temporal. Such functional detachment increases the risk of a slip, as environmental and precursor cues may be absent.

There are also a significant number of other error-provocative tasks which are typical in maintenance, which relate to our understanding of error. As was mentioned in section 2.2.1, one key problem is the issue of negative transfer. Here a step or procedure which looks similar to a more familiar procedure brings forth the more established behaviour. Again, the frequency bias heuristic comes into play, but other more complex features of cognition are relevant. It is now widely understood that humans are not rational decision-makers. A confirmation bias (heuristic) disposes individuals to seek evidence which confirms previous experience; moreover, when humans perceive data or experiential events that are novel, they are inclined to match these with previously stored schemata. According to Wickens and Flach (1988, p. 127), individuals also interpret what has happened as what should have happened, as there exists a bias towards overconfidence. Also, human beings "are furious pattern matchers" (Reason, 1990, p. 66). This fact impacts strongly at the rule-based level as our cognitive predisposition is always to seek an established package of rules. Thus, established schemata of ordered and systematised experience engender negative transfer in novel situations, which cue or call forward the well-learned rule set. From a design point of view regarding new aircraft, the clear implication is that design aspects which call for maintenance procedures that are similar to established procedures on current aircraft, but have subtle, safety-critical differences, will be highly error provocative.

Table II illustrates some of the more obvious causes of error in maintenance. It is neither complete nor strictly verifiable. Its purpose is to illustrate the type of factors that amplify the likelihood of error. Violations have been excluded, which would have considerably expanded the list of issues under the management heading, as factors such as morale and supervision are critical to deliberate rule infringement. Violations also imply a conscious intention to break rules and procedures. They are strongly related to the issue of safety culture and the organisational aspects of the workplace.

What Table II shows very clearly is that certain types of maintenance error risk are inherent in the task. Many errors occur in the re-assembly of components and sub-systems. But re-assembly will remain a key part of aircraft maintenance until or unless a world of total modularity is reached. In routinised actions, slips and lapses will continue to happen as attention is captured, either by distraction or internal preoccupation. Sometimes a necessary action is simply forgotten. In air accidents such slips and lapses still cause puzzlement, as experienced crews fail to implement a procedure they have successfully undertaken many thousands of times. Such errors are the result of failures of cognitive executive control, where failure is evident in fetching items from memory, reading text or problem solving (NASA, 2001).

Table II. Sources of maintenance error

Inherent in task                          | Procedural issues                              | Management of workplace                 | Cognitive complexity
Re-assembly error                         | Documentary discrepancy with actual procedure | Actions which are frequently disrupted  | Installation of multiple items
Omission                                  | Inadequate documentation                      | Personnel change during task            | Step requires memory of detailed instructions
Functionally isolated step                | Ambiguous documentation                       | Skill deficit                           | Numerical complexity
Step is a new procedure                   | Procedure is unnecessarily complex            | Poor environment                        | Cues for action are opaque
Routinised action                         | Procedure is counter-intuitive                | Documentation missing                   | Subtle differences in similar tasks or steps
Action after main task goal is completed  | Procedure is subject to change                | Inadequate supervision                  | Multiple permutations for re-assembly
Action is near the end of the task        | Procedure has been modified locally           | Deficient quality assurance             | Task requires non-routinised problem solving/diagnosis
Step not required in similar procedures   | Parallel procedures in place                  | Insufficient resources                  | Multi-tasking required
Action sometimes omitted                  | Procedure reliant on memory                   | Poor safety culture                     | Task is counter-intuitive


Because maintenance will continue to exhibit error, it is important to re-think design from an error mitigation point of view. On the basis of Table II a human factors specialist in a design team could pose the following questions to engineers working on the maintenance aspects of system design:

(1) Could a low-cost redesign eliminate the safety-critical maintenance task?
(2) Could the frequency for a safety-critical task be reduced?
(3) Could the functionally isolated step in a procedure be integrated into the procedure?
(4) Could a procedure which is subtly different from a more common procedure be made identical or more transparently different?
(5) Could a task which is inherently complex be simplified?
(6) Could documentation for a task be better related to the actual known reality of task performance?
(7) Could better cues be established to guide the correct progress of re-assembly tasks?
(8) Could steps which are sometimes omitted in procedures be always included?
(9) Could components which are regularly inspected be repositioned to limit the impact on nearby systems?
(10) Could design differences for similar systems and subsystems be eliminated?
(11) Could permissive action links (PALs) be established to prevent systems on which an erroneous task has been undertaken being activated[5]?

Doubtless other questions could be posed, but the point here is to illustrate the relevance of knowledge of human error to the design process.

3. The human hazard analysis methodology

3.1 The safety process
The system safety process evaluates aircraft functions and the systems designed to perform these functions in order to determine that identified hazards have been adequately addressed. This was described in section 1.3 and it is not intended to duplicate that exegesis in detail here. However, it is important to stress that the HHA derives from the aircraft-level functional hazard assessment (FHA), which identifies and classifies the failure conditions associated with the defined aircraft functions, ranking them in order of severity according to CAA criteria, i.e.:

• catastrophic failure conditions;
• hazardous failure conditions;
• major failure conditions;
• minor failure conditions; and
• no safety effect failure conditions.

3.1.1 Determine functional failures at a system level. These failure conditions are then applied at the system level, to identify what functional failures could lead to the specified failure condition, for example:


Catastrophic 1
Failure condition – Fuel starvation
Functional failures – Insufficient fuel, cross-feed failure, etc.
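One way such an output might be recorded, carrying the fuel-starvation example forward, is as a simple mapping from the classified failure condition to its candidate functional failures. The structure below is an illustrative sketch only, not part of the formal safety documentation.

```python
# Illustrative record of a system-level FHA output, following the paper's
# fuel-starvation example; the data structure is an assumption for this sketch.
fha_output = {
    "Fuel starvation": {
        "severity": "Catastrophic",
        "functional_failures": ["Insufficient fuel", "Cross-feed failure"],
    },
}

for condition, detail in fha_output.items():
    print(f"{detail['severity']}: {condition} <- {detail['functional_failures']}")
```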

3.1.2 Evaluate system architecture to determine failure modes. The preliminary system safety assessment (PSSA) analyses the system architecture to evaluate all functional failures that could result in the failure conditions. Fault trees (FT) are the main tool used in the PSSA. The fault tree analysis (FTA) determines the probability of the failure conditions, thus allowing reliability requirements to be placed on the equipment suppliers and determining which systems must be kept completely independent. Suppliers also generate FMEAs to satisfy that technical failures within components cannot contribute to the base events. (It should be noted that safety personnel and suppliers already conduct these activities.)

3.1.3 Establish cutsets. The logic of the fault tree reveals the numbers of system failures and events that have to occur in combination for the top hazardous event to happen. This enables the safety process to isolate the critical from non-critical components; i.e. if just one component failure in combination with one event leads to a catastrophic event, then this component is more critical than one which requires six component failures in combination with three events to lead to a catastrophic event. A cutset from the fault tree determines those combinations that lead to the top event – i.e. the number of failures or events that lead to the failure. Cutsets are used to determine critical components in the FT, and hence the areas of the FT that will be studied in further HHA.

It is recommended here that detailed human hazard analysis should be carried out on those systems or components where two base events (a system failure or an event) can, in combination, result in a catastrophic or hazardous failure. The rationale behind this is as follows. When a system architecture is under consideration, it is acknowledged that no single failure can be allowed to have the potential to result in a catastrophic or hazardous failure condition. When the catastrophic or hazardous failure condition can occur due to the combination of two base events (a system failure or an event such as an engine fire), it is important to design a system to be tolerant of the occurrence of one event and rely on the second event or failure not occurring. However, if the first failure is latent (undetected), then, although the aircraft is tolerant of the failure, no warning would be received and the aircraft may dispatch in a condition where a single failure could result in a catastrophic or hazardous failure condition. Maintenance human errors will frequently be latent and as a result it is crucial to consider the potential for human error in maintenance in these cases of potential dual failure. In effect, the type of latent maintenance error being considered can shoot a hole straight through the defences provided by the system architecture, if a failure remains undetected.
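A minimal sketch of how this selection criterion might be applied is given below: starting from minimal cutsets already extracted from the fault tree, every base event appearing in a cutset of order two is flagged for detailed HHA, with a note where the failure could remain latent. The cutset contents and latency labels are invented for illustration and are not taken from any actual fuel-system analysis.

```python
# Illustrative minimal cutsets (combinations of base events whose joint
# occurrence produces the hazardous or catastrophic top event).
cutsets = [
    {"lp_valve_fails_to_close", "engine_fire"},                # order 2
    {"pump_1_fails", "pump_2_fails", "gravity_feed_blocked"},  # order 3
    {"crossfeed_valve_jammed", "tank_transfer_failure"},       # order 2
]

# Base events whose failure could remain undetected until needed (assumed).
latent_events = {"lp_valve_fails_to_close", "crossfeed_valve_jammed"}

hha_candidates = set()
for cutset in cutsets:
    if len(cutset) == 2:  # dual-failure criterion discussed in the text
        hha_candidates |= cutset

for event in sorted(hha_candidates):
    note = " (potentially latent)" if event in latent_events else ""
    print(f"Detailed HHA required for: {event}{note}")
```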

Because of the possible scenario outlined above, it is argued that the safety process should require that human hazard analysis is initiated by stipulating those components that require a detailed HHA, using the causal logic developed in the fault tree. The method should then follow the steps outlined below.

3.1.3.1 Identify the component top-level functional failures. Identify the ways in which the critical components could fail (e.g. LP valve fails to open, LP valve fails to close). This analysis is already conducted within the safety process, but this information can also be added to a HHA document.


3.1.3.2 Differentiate critical from non-critical functional failures. Analyse these functional failures to distinguish critical from non-critical functional failures. The critical functional failures are those that have the potential to lead to the hazardous or catastrophic event (e.g. LP valve fails to close is critical to an uncontrolled in-flight fire, whereas LP valve fails to open is not). This analysis is already conducted within the safety process, but this information can also be added to a HHA document.

3.1.3.3 Determine the critical component states. At an early stage in an aircraft's development, the critical component states (e.g. drive shaft shears) that could result in the identified critical functional failure must be determined. This process will involve discussions with the supplier and the use of component failure mode and effect analyses (FMEAs) from similar, certificated aircraft. The information from this should also be added to a HHA document.

3.1.3.4 The HHA report and the safety assessment. The HHA method presented here stops the fault tree at an appropriate point to conduct structured analysis by the original equipment manufacturer (OEM) and the supplier. It is proposed that safety personnel would produce a document detailing the following points:

• the critical components on the fault tree whose failure has the potential to lead to a hazardous or catastrophic failure;
• the critical ways in which these components can fail that would lead to a hazardous or catastrophic failure in combination with other events or failures; and
• critical component states.

3.2 Human hazard analysis
After the three critical tasks outlined in sections 3.1.3.1 to 3.1.3.3 have been undertaken, the HHA team would then take up the document from safety personnel to conduct the detail of the HHA. This team would then discuss each critical component (and the associated information) to assess it in three ways, depending on the detail of the functional failure:

(1) potential human error in production;

(2) potential human error in line/base maintenance; and

(3) potential human error in component assembly/maintenance.

These separate levels of analysis are required because some functional failures would be impossible to induce in production or through line or base maintenance (i.e. the failure involves a human error on a sub-component within the component) and hence could only occur due to detailed component assembly or maintenance. The personnel required in the HHA team depend on the route of analysis and will be detailed below.

3.2.1 Potential human error in production. The HHA team in this analysis should consist of a human factors specialist and a number of manufacturing engineers (ideally at least two) who have knowledge of the system under discussion and experiential knowledge of typical manufacturing error. The manufacturing engineers would then be asked to identify production errors that could lead to particular functional failures. This is conducted first through simple brainstorming, then through an analysis of relevant errors seen in the current manufacturing environment. Finally, a "human error" checklist based on Table II would be used to support the discussion to determine the potential production errors.


Detailed analysis in the form of a human error mode and effect analysis (HEMEA) should then be carried out on all the potential production errors highlighted. This is shown in Figure 1. The HEMEA determines the potential error, the immediate consequence of the error (i.e. is it latent?), the most severe effects of the error (in combination with other events and failures) and the ease of detection of the error.
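The HEMEA figures themselves are not reproduced here, but the columns described above can be sketched as a simple record structure. The field names and the single worked entry below are illustrative assumptions rather than extracts from the Airbus analyses.

```python
from dataclasses import dataclass

@dataclass
class HemeaEntry:
    """One row of a human error mode and effect analysis (HEMEA),
    following the columns described in the text."""
    potential_error: str
    immediate_consequence: str  # e.g. whether the error is latent
    most_severe_effect: str     # in combination with other events and failures
    ease_of_detection: str      # e.g. found at functional test, not detectable
    proposed_mitigation: str = ""

# Illustrative production-error entry, invented for this sketch.
example = HemeaEntry(
    potential_error="Seal omitted during component assembly",
    immediate_consequence="Latent - no immediate functional symptom",
    most_severe_effect="Loss of the associated function on all similar components",
    ease_of_detection="Not detectable until an in-service failure",
    proposed_mitigation="Design the part so assembly is impossible without the seal",
)
print(example)
```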

3.2.2 Potential line/base maintenance errors. The HHA team proposed for this analysis would consist of a human factors specialist and a number of maintainability engineers with knowledge of the system under discussion and knowledge of maintainability and typical maintenance error. The method has been trialled with maintenance engineers (i.e. those who actually conduct maintenance activities as part of their job). However, it was found that although they have a good current knowledge of potential maintenance error, the most important knowledge required is that of the system under development. As a result, it is argued here that maintainability engineers are the best personnel to utilise for the method. These engineers would then be asked to identify maintenance errors which could lead to that particular functional failure according to the ways in which maintenance errors can occur (see Figure 2). This is conducted first through simple brainstorming and then through an analysis of relevant errors seen in the current maintenance environment. Finally, a human error checklist, based on Figure 2, would be used to support the discussion to determine the potential maintenance errors.

Detailed analysis in the form of a human error mode and effect analysis (HEMEA) would be carried out on all the potential maintenance errors highlighted, for example the maintenance errors that could be made in the remove/replace task, any scheduled maintenance, or in inspections that would result in the LP valve failing to close. The HEMEA determines the potential error, the immediate consequence of the error, the most severe effects of the error (in combination with other events and failures) and the ease of detection of the error. This is shown in Figure 3.

3.2.3 Potential human error in component assembly/maintenance. At a supplier level, it would be the responsibility of the OEM to specify to the supplier that it is necessary to conduct a detailed analysis of the potential human errors that could result in the identified critical functional failure. The OEM would then list such functional failures. The methodology described for line/base maintenance can be expanded by the supplier to identify potential component assembly/maintenance errors, and a customised human error mode and effect analysis (HEMEA) could be developed to investigate all the potential errors highlighted. An example of such a HEMEA is shown in Figure 4.

3.3 Mitigate the potential human errors
The decision on whether or not to mitigate a potential error, and if so, the manner by which it should be done, must be based on structured and traceable argumentation. Ideally, all potential errors would be mitigated, eliminating their potential by design. The discussion is structured around the mitigation strategy diagram (see Figure 5) and decisions should be based on the immediate consequence of the error (i.e. is it latent?), the most severe effects of the error (in combination with other events and failures) and the ease of detection of the error.


Figure 1. Human error mode and effect analysis for production errors


Figure 2. Maintenance task flow


Figure 3. Human error mode and effect analysis for line/base maintenance errors


Figure 4. An example human error mode and effect analysis for component assembly/maintenance errors


3.4 Examples of potential mitigation actions

(1) Eliminate error potential by design:
• dissimilar parts;
• parts impossible to assemble incorrectly; and
• validation of maintenance task and data.

(2) Reduce the severity of the effect by design:
• system logic to inhibit or challenge incorrect actions.

(3) Reduce the likelihood of the occurrence of the error by design:
• switches that are locked or guarded;
• safety notices; and
• improve reliability (to extend the maintenance interval).

(4) Improve the detectability (and correctability) of the error by design:
• functional testing;
• built-in test equipment (BITE) checks; and
• reduce the inspection/scheduled maintenance interval.

(5) Mitigation by procedural change:
• secondary sources of information; and
• safety warnings.

(6) Mitigation by training change:
• educate on error potential.

(7) Mitigation by operational change:
• duplicate inspection;
• pre-flight checks;
• safety tags;
• checklists; and
• briefings.

Figure 5. Mitigation strategy diagram
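As a rough illustration of how the mitigation strategy of section 3.3 might draw on the options above, the sketch below walks the list in its stated order of preference, starting with elimination by design. The selection rule is an assumption made purely for illustration; the actual decision would be taken by the HHA team using the mitigation strategy diagram (Figure 5).

```python
# Mitigation options from section 3.4, in the order listed (design first).
MITIGATION_ORDER = [
    "Eliminate error potential by design",
    "Reduce the severity of the effect by design",
    "Reduce the likelihood of the occurrence of the error by design",
    "Improve the detectability of the error by design",
    "Mitigation by procedural change",
    "Mitigation by training change",
    "Mitigation by operational change",
]

def choose_mitigation(is_latent: bool, severity: str, easily_detected: bool) -> str:
    """Pick the highest-preference option judged feasible for a potential error
    (an illustrative rule, not the paper's actual decision diagram)."""
    if severity in ("Catastrophic", "Hazardous") and is_latent:
        return MITIGATION_ORDER[0]   # aim to design the error out entirely
    if not easily_detected:
        return MITIGATION_ORDER[3]   # make the error detectable by design
    return MITIGATION_ORDER[4]       # otherwise fall back to procedural change

print(choose_mitigation(is_latent=True, severity="Catastrophic", easily_detected=False))
```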

4. Conclusion

4.1 Results
This paper has shown clearly that the occurrence of a maintenance or production human error can be due to a human error made in the design of the component or system. The predisposition of the human to err is demonstrated, and typical modes of failure exhibited by humans are explained. In essence, design errors or weaknesses feed forward into maintenance and production, with potentially hazardous or catastrophic effects down the line for operations.

The major challenge of the requirement to identify and mitigate foreseeable human error is in the choice of where to focus attention so as to gain maximum benefit from the investment of time and effort. The methodology presented here uses the top-down fault tree analysis of the safety process to highlight the most important areas to consider – i.e. the critical system components, the failure of which would lead to hazardous or catastrophic system states. A simple output from the safety process, which requires little extra effort from safety personnel, identifies the critical components, the critical ways in which these components can fail and their resultant critical component states. Potential errors are then identified by consideration of production, line/base maintenance and component assembly/maintenance tasks. Suggestions are made on organisational issues (i.e. the personnel required) and how the analysis can be conducted in order to focus on specific potential errors. Once identified, the required contextual information of each potential error is determined using a human error mode and effects analysis (HEMEA), which follows the classic bottom-up logic of the failure modes and effects analysis (FMEA). A mitigation strategy is suggested to determine the best way to manage each potential error.

These design weaknesses and their potential to trigger maintenance error can only be explored satisfactorily by the experts who design, produce and maintain the components concerned. Here it is advocated that this can be done through the HEMEA technique. Other techniques, such as SWIFT and HAZOP, also guide a multi-disciplinary team of experts through a process of hazard identification and analysis. But the checklists and guidewords of these techniques are designed to explore all manner of potential hazards, with human error being just one. The HEMEA deployed here was developed to analyse human errors in particular.


Notes

1. Aviation Week and Space Technology (January 30, 1995, p. 57).

2. Airbus UK Safety Process AP2616.

3. CAP718 (previously ICAO Digest No. 12).

4. UK CAA, November 2001.

5. PALs are failsafe devices used in the military to prevent the unintentional or illicit use of nuclear weapons. PAL refers to the fact that a pre-determined sequence of actions must occur for the system to work.

References

Hollnagel, E. (1993), Human Reliability Analysis, Academic Press, New York, NY.

NASA (2001), “Cognitive models and metrics”, available at: http://olias.artc.nasa.gov/cognition/att/cmmindex.html

Norman, D. (1988), The Psychology of Everyday Things, Basic Books, New York, NY.

Rasmussen, J. (1980), "What can be learned from human error reports?", in Duncan, K., Gruneberg, M. and Wallis, D. (Eds), Changes in Working Life, Wiley, London.

Rasmussen, J. (1986), Information Processing and Human-Machine Interaction, North Holland, Amsterdam, p. 100.

Reason, J. (1990), Human Error, Cambridge University Press, Cambridge, p. 184.

Reason, J. (1997), Managing the Risk of Organisational Accidents, Ashgate, Aldershot.

UK Civil Aviation Authority (2000), P-NPA 25 310, December, UK Civil Aviation Authority, Gatwick.

Wickens, C. and Flach, J.M. (1988), "Information processing", in Weiner, E. and Nagel, D. (Eds), Aviation Human Factors, Academic Press, New York, NY.

Corresponding author
Simon Gill can be contacted at: [email protected]
