
Quantifying the impact of partial stroke valve testing of safety instrumented systems


Paul Gruhn a,*, Joe Pittman b, Susan Wiley b, Tom LeBlanc c

a Moore Products Co., 8924 Kirby Drive, Houston, TX 77054, USA
b ARCO, Channelview, TX, USA
c Keystone Deer Park, TX, USA

Abstract

The ISA S84 and IEC 1508/1511 standards, along with the AIChE CCPS Guidelines on safety instrumented (interlock) systems, are performance oriented, not prescriptive. They do not tell people what logic system to use, what field device configuration to use, or how often to test a system. They merely list the performance requirements for the system. In other words, the greater the level of risk of the process, the greater the performance needed of the safety system. These standards, along with OSHA PSM requirements, state companies need to "determine and document that equipment is designed, maintained, inspected, tested and operating in a safe manner". Engineering tools are currently available which can model (i.e. determine and document) the performance of different types of systems. When one quantifies the performance of the overall system, from sensor to final element, one quickly comes to the conclusion that the valves represent the "weak link" in most of today's systems. The typical failure mode of discrete shutoff valves is being stuck. The only way to test for such a condition is to stroke the valve, but closing the valve completely and stopping production is not desirable. One does not need, however, to fully stroke the valve in order to test its functionality. If one were able to partially stroke the valve (in a simple, reliable, and secure manner), on-line, without stopping production, a dramatic improvement in safety can result. When one quantifies the safety impact of this test method, the result is typically an improvement by one order of magnitude. © 1998 Elsevier Science Ltd. All rights reserved.

Keywords: Safety instrumented system (SIS); Programmable electronic system (PES); Safety integrity level (SIL)

1. Background, and "the problem"

Facilities are now operating for extended periods between scheduled shutdowns to maximize reliability and profits. For some operating companies, this means waiting up to 6 years for the chance to test a shutdown valve off-line. This is not a viable solution for those interlocks which require SIL 3 safety availability, so the issue of on-line testing of the shutdown valves arises. However, the very thought of on-line testing of interlocks, a.k.a. shutdown valves, strikes terror into the hearts of plant managers and operations supervisors who are struggling to produce every pound of product possible. To these people, who for many years have heard "do more with less" and "maximize efficiency and reliability", the thought of a spurious trip and lost production due to a bad sensor is considered unacceptable, and the idea of deliberately interrupting the process in order to test a valve that may never be called on to act is insane.

ISA Transactions 37 (1998) 87–94

0968-0896/98/$19.00 © 1998 Elsevier Science Ltd. All rights reserved

PII: S0019-0578(98)00009-3

* Corresponding author. Tel.: 001-713-666-7686; fax: 001-713-666-8421; e-mail: [email protected]

However, given the new playing rules dealt out by OSHA [1] and specifically the ISA S84 [2] standard, operating personnel are starting to realize the way the game will have to be played in the future. The options are either to shut down on a regular basis to test the valves, or design the system such that no shutdown is required in order to test the valves. The first option, while possible, is simply not feasible for large scale commodity product units: the economics and problems inherent in a shutdown of such plants are prohibitive. Therefore, the second option of interlock design is being explored in greater detail.

2. What some do not want you to know, and others do not want to admit

A system is made up of individual components. A chain is only as strong as the weakest link. What good is a $5000 CD player and receiver if all you have are $50 speakers? What good is a $4000 graphics art computer if all you have is a 14 inch monitor with a lousy pixel size? If you sell top of the line cameras and lenses, you do not focus on telling people how important the film is.

One needs to consider how and why certain information is disseminated, and whose interests are being served. For example, certain high end dual and triple redundant logic systems are capable of meeting the highest performance goals defined in recent industry standards. However, one may implement such logic systems in a manner with particular field devices and configurations so the overall system only meets the lowest safety performance goals.

Certain safety system vendors have spent the last ten years educating and showing people that their logic box is orders of magnitude "better" and "safer" than general purpose hardware. It has only been within the last few years, however, that more people are starting to ask, "Wait a minute, what about the field devices?!"

3. Understanding the metrics

In order to evaluate systems and make comparisons of design options, one first needs to understand the failure modes of safety related systems and understand the terms used to define system performance. One would like to think that this is a universally understood and agreed upon area. Unfortunately, it is not.

People involved in control systems all use the term availability. Unfortunately, everyone's usage of the term is slightly different, and the term is not that applicable for safety systems anyway. The problem stems from the simple fact that safety related systems can suffer from two failure modes, not just one. So if you use just one term, such as "availability", which failure mode are you referring to?

Safety systems can shut the process down when nothing is actually wrong. Such failures are typically called nuisance trips (or spurious, safe, overt, revealed, or initiating failures). What might an availability of 99.9% mean here? Does this mean the system (and process) is down 45 min once a month, or 9 h once a year, or about 3.7 days once every 10 years? They all have the exact same availability! Most users know how long their process will be down when production stops; they just want to know, on average, how often such an event might occur. Saying a nuisance trip might occur once every month, or once every year, or once every 10 years, or once every 100 years is much easier to relate to. This paper uses the term "Nuisance Trip Rate", measured in years (although it might be more appropriate to consider it the mean time between nuisance trips).
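The ambiguity of a single availability figure can be checked with a few lines of arithmetic. The sketch below is illustrative only (the function name and the 8760 h year are assumptions, not part of the original analysis): it converts a 99.9% availability into the length of one outage for several different trip intervals.

```python
HOURS_PER_YEAR = 8760.0  # assumption: non-leap year, continuous operation


def downtime_hours(availability: float, interval_years: float) -> float:
    """Length of a single outage if all the unavailability implied by
    `availability` is concentrated in one trip per `interval_years`."""
    return (1.0 - availability) * interval_years * HOURS_PER_YEAR


for label, years in [("month", 1.0 / 12.0), ("year", 1.0), ("10 years", 10.0)]:
    hours = downtime_hours(0.999, years)
    print(f"one trip per {label}: {hours:.2f} h ({hours / 24.0:.2f} days) per trip")
```

The three cases come out at well under an hour, just under nine hours, and a few days respectively, yet all describe the same availability, which is why a trip interval in years is the easier figure to relate to.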

Safety systems may also fail to respond to an actual demand. Such failures have historically been called dangerous, covert, or inhibiting failures. Commonly used terms here are safety availability, probability of failure on demand (pfd), and risk reduction factor (RRF). The problem here is that the range of numbers typically used is very difficult for most people to relate to. Table 1 shows a comparison between the numbers and the safety integrity levels defined in industry standards [2–4]. The difference between a safety availability of 99% and 99.99% does not sound significant (it is less than 1%!). The difference between a risk reduction factor of 100 and 10,000, however, is a bit more obvious. The point is, both ranges of numbers differ by two orders of magnitude. This paper uses the term Risk Reduction Factor.
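The three dangerous-failure metrics are simple transformations of one another (pfd = 1 - safety availability, RRF = 1/pfd). A minimal sketch, with hypothetical function names chosen for illustration, shows why the same two-orders-of-magnitude spread looks small in one metric and large in another:

```python
def pfd_from_availability(safety_availability_pct: float) -> float:
    """Probability of failure on demand = 1 - safety availability."""
    return 1.0 - safety_availability_pct / 100.0


def rrf_from_pfd(pfd: float) -> float:
    """Risk reduction factor is the reciprocal of the pfd."""
    return 1.0 / pfd


for pct in (90.0, 99.0, 99.9, 99.99):
    pfd = pfd_from_availability(pct)
    print(f"availability {pct:6.2f}%  pfd {pfd:.4f}  RRF {rrf_from_pfd(pfd):,.0f}")
```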


4. A logic box does not a system make

Control system vendors have been giving out performance numbers of their systems for years. Unfortunately, many users do not realize that such numbers are only for what is within the control system vendor's scope of supply: the logic box only. The performance numbers listed in Table 1, however, are intended to be for the entire system, not the logic box in isolation.

There are quite a number of specialized dual and triplicated logic boxes independently approved and certified for use in SIL 3 applications. Without causing undue concern and arguments amongst vendors, let us simply assume a generic nuisance trip rate of 200 years and a risk reduction factor of 10,000 for such a specialized logic box. In other words, the system is both safe and fault-tolerant. These numbers are both realistic and typical for the vendors specializing in these applications [5,6].

5. The impact of field devices

Now let us consider the impact of field devices. The majority of dual and triplicated systems are installed with simplex field devices. What impact, if any, does this have on overall system performance?

There are several methods presented in ISA dTR84.02 for modeling the overall performance of a safety instrumented system. Let us consider a small interlock system with eight sensors (switches for the time being) and two valves. Let us assume an MTBF (mean time between failure) of 100 years for each device in each failure mode. In other words, one out of 100 devices causes a nuisance trip in 1 year, and after testing 100 devices after 1 year, only one was found to be "stuck". Note that MTBF is not the same as life.

In the nuisance trip mode we need to include all field devices in the model (assuming that any device failing safely will cause a nuisance trip). If there are ten devices, and each has an MTBF of 100 years, then the MTBF due to all 10 devices is 10 years. In other words, there will be a nuisance trip, on average, every 10 years due to the field devices alone. (Remember, the number for the logic box was 200 years...)
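The nuisance trip figure follows from adding failure rates: when any one device can trip the process, the rates (1/MTBF) sum and the combined MTBF is the reciprocal of that sum. The sketch below is an illustration under that constant-failure-rate assumption; the function name is hypothetical.

```python
def combined_mtbf_years(mtbf_years: list[float]) -> float:
    """Series combination: any single failure trips the process,
    so the individual failure rates (1/MTBF) add."""
    return 1.0 / sum(1.0 / m for m in mtbf_years)


field_devices = [100.0] * 10   # eight sensors + two valves, 100 years each
logic_box = 200.0              # generic logic box figure assumed in Section 4

print(f"field devices only:       {combined_mtbf_years(field_devices):.1f} years")
print(f"field devices + logic box: {combined_mtbf_years(field_devices + [logic_box]):.1f} years")
```

Adding the logic box barely moves the result, which is the point of the parenthetical reminder: the field devices dominate the nuisance trip rate.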

Things are even simpler for the dangerous (RRF) calculation, because one does not (indeed, should not) include all 10 field devices. We can assume for this example that both valves must function, therefore both should be included in the model. We have eight sensors, however. Should all eight be included? No! The system will only fail if there is a demand placed on the sensor that failed. In other words, if a pressure sensor fails, but the shutdown demand comes in on a temperature sensor, the system functions properly. Therefore, one should only include one input in the fail-to-function model.

So, we have got one sensor, and two valves. Let us assume an 8 h repair time, and a 1 year manual test interval.

The probability of failure on demand (pfd) for a simplex system may be calculated as follows [5–7]:

$$\mathrm{pfd} = \lambda \, (\mathrm{TI}/2 + \mathrm{MTTR})$$

where λ = 1/MTBF, TI = Test Interval, MTTR = Mean Time To Repair.

For a sensor with an MTBF of 100 years, a 1 year test interval, and an 8 h repair time (working in hours, so λ = 1/(100 × 8760 h) ≈ 1.14 × 10⁻⁶ per hour):

$$\mathrm{pfd} = 1.14 \times 10^{-6} \times (8760/2 + 8) = 0.005$$

$$\mathrm{RRF} = 1/0.005 = 200$$

Table 1
Safety integrity levels and performance requirements (for the entire system, including field devices)

| ISA S84 safety integrity level (SIL) | Safety availability (%) | Probability of failure on demand (pfd) (1 − safety availability) | Risk reduction factor (RRF) (1/pfd) |
|---|---|---|---|
| 3 | 99.9–99.99 | 0.001–0.0001 | 1000–10,000 |
| 2 | 99–99.9 | 0.01–0.001 | 100–1000 |
| 1 | 90–99 | 0.1–0.01 | 10–100 |
| 0 | Process control, not applicable | | |

For two valves with the same MTBF, TI and MTTR:

$$\mathrm{pfd} = 2 \times (1.14 \times 10^{-6}) \times (8760/2 + 8) = 0.010$$

$$\mathrm{RRF} = 1/0.01 = 100$$

Calculating the RRF for the system, consisting of the sensor, logic box, and two valves:

$$\mathrm{RRF} = 1/\big((1/200) + (1/10{,}000) + (1/100)\big) = 66$$

The pfd from these three devices alone is 0.015, which equates to a risk reduction factor of 66. (Remember, the number for the logic box alone was 10,000...) If the RRF for the field devices is 66, the number for the system is 66. So the overall system only meets SIL 1 requirements, not SIL 3. In other words, the logic box represents less than 1% of the overall problem!
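The whole calculation can be reproduced in a few lines. The sketch below applies the simplex pfd formula to one sensor and two valves, adds the logic box, and looks up the resulting band in Table 1. The function names are illustrative, and treating the band edges as inclusive lower bounds is an assumption made here for simplicity.

```python
HOURS_PER_YEAR = 8760.0


def simplex_pfd(mtbf_years: float, test_interval_h: float, mttr_h: float) -> float:
    """pfd = lambda * (TI/2 + MTTR), with lambda = 1/MTBF expressed per hour."""
    failure_rate_per_h = 1.0 / (mtbf_years * HOURS_PER_YEAR)
    return failure_rate_per_h * (test_interval_h / 2.0 + mttr_h)


def sil_band(rrf: float) -> int:
    """Table 1 lookup. Assumption: lower band edges are inclusive, and
    anything above 10,000 is still reported as SIL 3."""
    if rrf >= 1000.0:
        return 3
    if rrf >= 100.0:
        return 2
    if rrf >= 10.0:
        return 1
    return 0


sensor_pfd = simplex_pfd(100.0, 8760.0, 8.0)       # one switch
valves_pfd = 2 * simplex_pfd(100.0, 8760.0, 8.0)   # both valves must work
logic_pfd = 1.0 / 10_000.0                         # generic logic box, RRF 10,000

system_pfd = sensor_pfd + logic_pfd + valves_pfd
system_rrf = 1.0 / system_pfd

print(f"sensor RRF {1 / sensor_pfd:.0f}, valves RRF {1 / valves_pfd:.0f}")
print(f"system RRF {system_rrf:.0f} -> SIL {sil_band(system_rrf)}")
print(f"logic box share of system pfd: {100 * logic_pfd / system_pfd:.2f}%")
```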

6. Transmitters versus switches

Over the last decade, an increasing number of companies are using analog transmitters in lieu of discrete switches. This has obvious benefits. For example, the dynamic nature of a transmitter signal means one can more easily tell if the device is working properly. When one uses multiple devices, comparisons can then be made to increase the potential diagnostic levels even further. One can quantify the impact of such design options and see they can easily increase sensor performance an order of magnitude. However, if the valve design remains unchanged, the overall system performance may not improve at all.

For example, let us assume that the RRF due to using a single analog transmitter versus a discrete switch can increase an order of magnitude from its earlier value of 200, up to 2000. The RRF of just the two valves (as before) is still 100. If the sensor RRF is 2000, the logic box is 10,000, and the two valves are 100, the overall system has a RRF of 94 (merely add the reciprocals: 1/((1/2000) + (1/10,000) + (1/100))). In other words, we are still just at the high end of SIL 1!
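The "add the reciprocals" rule is easy to wrap in a helper, which makes the weak-link effect visible when the switch and transmitter cases are compared side by side. This is a sketch only; the helper name is hypothetical.

```python
def system_rrf(*component_rrfs: float) -> float:
    """Combine components in series: pfd values (1/RRF) add,
    and the system RRF is the reciprocal of the total."""
    return 1.0 / sum(1.0 / rrf for rrf in component_rrfs)


LOGIC_BOX = 10_000.0
TWO_VALVES = 100.0

print(f"switch      (sensor RRF  200): system RRF {system_rrf(200.0, LOGIC_BOX, TWO_VALVES):.0f}")
print(f"transmitter (sensor RRF 2000): system RRF {system_rrf(2000.0, LOGIC_BOX, TWO_VALVES):.0f}")
```

A tenfold improvement in the sensor moves the system figure by less than a factor of 1.5, because the valves' pfd of 0.01 still dominates the sum.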

7. What about dual valves?

Accepting that a chain is only as strong as its weakest link, and realizing that an analog transmitter provides a level of diagnostics, and that the subject logic box is both fault-tolerant and has extensive diagnostics, but a single "dumb" valve has no diagnostics whatsoever, it is obvious where the weak link is. The traditional "fix" would be to use dual valves in series, possibly with a bleed valve between them. Such an arrangement can be referred to as 1oo2 (one out of two), meaning that either of the two valves is capable of performing a shutdown. While this is safer than just one valve, the nuisance trip rate suffers (since either valve can stop production). Using the same numbers as before, but assuming each of the two output valves are now configured as dual redundant, and accounting for a small amount of common cause, the nuisance trip rate degrades to 8 years (down from 10), and the RRF increases to 900 (up from 94).
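The nuisance trip side of this trade-off can be approximated by counting devices. The sketch below assumes that doubling both output valves raises the field device count from 10 to 12, that every device keeps its 100 year MTBF, and that common cause can be neglected for the spurious-trip mode; these are simplifying assumptions made here, not the authors' stated model. The RRF of 900 is not reproduced because the common cause fraction used for it is not given.

```python
def nuisance_trip_years(n_field_devices: int,
                        field_mtbf_years: float = 100.0,
                        logic_mtbf_years: float = 200.0) -> float:
    """Mean time between nuisance trips when any field device or the
    logic box can trip the process (failure rates add)."""
    total_rate = n_field_devices / field_mtbf_years + 1.0 / logic_mtbf_years
    return 1.0 / total_rate


simplex = nuisance_trip_years(8 + 2)   # eight sensors, two single valves
dual = nuisance_trip_years(8 + 4)      # eight sensors, two 1oo2 valve pairs

print(f"simplex valves: {simplex:.1f} years between nuisance trips")
print(f"dual valves:    {dual:.1f} years between nuisance trips")
```

Under these assumptions the results land near the 10 and 8 year figures quoted above.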

So dual redundant valves can increase the overall safety one order of magnitude. This should not come as a shock. This is what people have traditionally done with valves for high risk applications. The obvious drawback, however, is cost. Not only is the capital equipment cost higher for twice as many valves, but maintenance test labor increases as well, as there are now twice as many valves that need to be periodically tested.

8. Can a simplex (non-redundant) valve be "safe"?

Might it be possible to design a system with simplex valves, yet still meet SIL 2 performance requirements? The simple answer is, yes! The key is simple, reliable, limited movement testing. If one were able to partially stroke a valve and claim an 80% diagnostic coverage factor as a result (the Pareto principle), perform the test automatically once a day, and still fully test the valves once a year, the Nuisance Trip Rate returns to the 10 years for simplex field devices, yet the RRF increases to 800. So a simplex valve can meet SIL 2 performance requirements.

There is one underlying assumption worth pointing out. If valves are never stroked, one can pretty much guarantee they will not work when needed. Periodically stroking the valve actually increases the "reliability" of the valve (increases the MTBF). The RRF of 800 assumes the valve MTBF increased from the original 100 years to 300 years, merely due to the periodic testing.
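The formula behind the figure of 800 is not given in the text. One common simplification, assumed here purely for illustration, splits each valve's dangerous failure rate into a fraction C that the daily partial stroke detects and a remainder (1 − C) that is only found at the yearly full test. The sketch also assumes the analog transmitter (RRF 2000) and generic logic box (RRF 10,000) from the earlier sections; all names are hypothetical.

```python
HOURS_PER_YEAR = 8760.0


def partial_stroke_pfd(mtbf_years: float, coverage: float,
                       partial_interval_h: float, full_interval_h: float,
                       mttr_h: float) -> float:
    """Assumed model: covered failures are exposed for half the partial
    test interval, uncovered ones for half the full test interval."""
    failure_rate_per_h = 1.0 / (mtbf_years * HOURS_PER_YEAR)
    exposure_h = (coverage * partial_interval_h / 2.0
                  + (1.0 - coverage) * full_interval_h / 2.0
                  + mttr_h)
    return failure_rate_per_h * exposure_h


# Two simplex valves: 300 year MTBF, 80% coverage, daily partial stroke,
# yearly full test, 8 h repair time.
valves_pfd = 2 * partial_stroke_pfd(300.0, 0.80, 24.0, 8760.0, 8.0)
sensor_pfd = 1.0 / 2000.0      # assumption: analog transmitter case
logic_pfd = 1.0 / 10_000.0     # generic logic box

system_rrf = 1.0 / (valves_pfd + sensor_pfd + logic_pfd)
print(f"system RRF with partial stroke testing: {system_rrf:.0f}")
```

With these inputs the result falls a little under 800, inside the SIL 2 band of Table 1, which is consistent with the order of magnitude claimed even though the exact model may differ.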

9. Caveat!

It is easy to fall into the trap of getting overly "involved" in reliability models and putting too much faith in their answers. For example, hardware interlocks were left off the Therac 25 radiation machine because a quantitative reliability analysis of the software showed it to be so good that the designers felt the hardware interlocks were not needed. Six patients were massively overdosed, and several died, as a result of software errors [8].

So it is worth mentioning two caveats. First, reliability models are like captured foreign spies: if you torture them long enough they'll tell you anything. Second, if you automate the modeling process and do it all on a computer, do not forget that computers are known for their speed, not their intelligence. All programs are designed by people, so carefully check the assumptions, simplifications, and hidden agendas that may be "buried". In addition, no one particular modeling technique is more "correct" than another; they are all merely approximations. Different assumptions and simplifications can be made using any of the techniques, which can change the answers by orders of magnitude.

10. Methods of valve testing

10.1. Bypasses

Many methods and designs have been proposed to handle on-line testing of shutdown valves. The time-honored favorite to date for operating/manufacturing personnel is to install a bypass valve around every interlock valve. While some minor process upsets may be encountered (e.g. balancing flows), the process remains intact and running. One additional upside for bypass valves is on the maintenance side. Should a valve be found to be failed, a bypass allows for change-out on line. This seems a simple solution to the problem from the operating side, but a number of drawbacks can be noted from other perspectives.

Providing a bypass valve for each interlock valve is difficult to justify for several reasons. Economics, obviously, are a large factor. The cost of the additional piping, full size bypass valves and instrumentation required for a large operating facility is astronomical. Aside from the hardware costs of adding bypasses, the labor costs to cut pipe, build access platforms for valves, etc. are a major portion of any interlock retrofit project. Another often overlooked issue is that of real estate. In many existing installations the space available for adding piping and valves for bypasses simply does not exist.

Adding the hardware for bypasses requires space, space at or very near the valve in question. When the unit has not been designed for this, the ergonomic assessment can get very interesting. Adding a bypass valve that cannot be readily accessed or that is in the way of routine operations is not a practical solution.

Another issue with the use of bypass valves is the consequence of the valve being left in the wrong position after testing. This is an important consideration when designing interlocks, as this effectively takes the interlock out of service. While many operating personnel maintain that interlock bypass valves can be car-sealed in the appropriate position between testing cycles and thus be administratively monitored, anyone who has ever dealt with large facilities, some with car-seal lists hundreds of items long, may be inclined to disagree. As mentioned earlier, operating facilities are ever being pushed to decrease costs, so the feasibility of adding X number of valves to a car-seal list which must be regularly inspected should be taken into account when designing the interlock for long term maintainability and reliability.


One method used to prevent this scenario without increasing the inspection requirements is the installation of limit switches on the bypass valves, and alarming when the valves are not in the "safe" position. This increases I/O counts and wiring requirements, and may add confusion for operators due to the additional alarms.

Fig. 1 shows a typical installation sketch for two SIS valve scenarios. One utilizes a device for partial stroke testing, the other utilizes a traditional bypass arrangement. Limit switches are provided on both designs to ensure the valves are in the proper position both during normal plant operation and during an interlock trip. The switches can also be used for testing documentation if the SIS has an associated sequence of event recorder. This documented testing is required by OSHA 1910 [1] to prove that the required performance is being maintained.

On-line partial stroke testing involves stroking the valve through approximately 20% of its travel. This proves that the valve is not stuck in position due to corrosion in the actuator or product buildup inside the valve. It obviously does not test whether the valve will fully close and/or seal completely.

10.2. Alternatives

Another method is to utilize limit switches and actually measure valve movement, or else time the signal to the valve, thus only moving the valve partially. Limit switches are naturally prone to failure, and problems occur if they are out of adjustment. Timing of valve stroking has also proven problematic for users willing to admit having tried it.

Some are using analog control valves for safety applications. While analog valves are certainly capable of limited movement testing, they are an expensive alternative (compared to typical discrete valves) and also are more likely to leak (which is obviously not desirable).

10.3. One method for on-line testing

After the above discussion, the benefits of a simple, dependable on-line testing device for such valves are easily seen. From the manufacturing perspective, the ability to test on-line without a process disruption is key. On the economic side, the cost savings are great, as the use of such a device reduces hardware, piping and labor costs. Another benefit associated with some of these devices is the ability to be installed without pulling the valve from the line. This saves time during a retrofit turnaround by not needing to clear the line for entry, and saves the time and labor for removal and reinstallation of valves.

In addition to testing the valve, operation of the logic solver output point, wiring and fusing for energize-to-trip circuits, and operation of the solenoid valve is verified. When looked at as the total output system, testing with this type of device verifies approximately 85% of the output side of the interlock. This 85% estimate can then be used in interlock availability calculations as "% credit" taken during testing to ensure that the interlock still meets its required performance.

Fig. 1. Which would you rather implement: a single testable valve, or one standard valve with three bypasses?

Partial stroke testing does not ensure the valve will shut off completely. However, neither does closing the valve completely while the valve remains in the line. Only removing the valve from the line and pressure testing it will verify complete closure and shut-off class.

A method used by one valve/actuator manufacturer to incorporate on-line partial stroke testing is very simple. An interlocking device is mounted between the actuator and valve body (see Photo 1). This device is generic and can be mounted on any quarter turn valve. This interlocking unit can be actuated manually or remotely. When placed in the test mode, the actuator may be stroked approximately 20° in order to verify movement of the valve. After testing, the actuator may then be moved back to its normal position. The manually operated device is provided with an interlocking unit and proprietary key that ensures only authorized users of the device access the system. The remote interlock device is remotely controlled for enhanced safety system integration. The remote device is supplied with integral limit/proximity switches to provide positive open/close indication of the interlock device itself.

11. Summary of performance benefits

Fig. 2(a) and (b) summarizes the nuisance trip rate and risk reduction factor for the logic box and various field device configurations.

Fig. 2. (a) Nuisance trip rate chart. (b) Risk reduction factor chart.

11.1. Additional benefits of an integrated solution

Just as all of the systems on an aircraft must function in unison, so must the portions of an integrated safety system in the process industry. While there are obvious performance and safety benefits to be gained through limited movement testing of valves, there are additional benefits to be gained if the testing, logging, and reporting could be automated, centralized, and simplified.

Currently, maintenance personnel must be utilized to both perform and record testing of field devices. Records must then be maintained showing the tests have actually been performed.

Consider the benefits, however, of having the logic box send a self-test signal to the valve, read back that the test was successfully performed, log (and store) the test date, time, and results, and print the report when requested. This would not only lower manual testing requirements (and associated costs), but also greatly simplify the documentation requirements set forth within the process safety management legislation.

References

[1] 29 CFR Part 1910.119, Process Safety Management of Highly Hazardous Chemicals, US Federal Register, 24 February 1992.
[2] Application of Safety Instrumented Systems for the Process Industries, ISA standard 84, 1996.
[3] Guidelines for Safe Automation of Chemical Processes, AIChE, CCPS, 1993.
[4] Functional Safety: Safety Related Systems, IEC draft standard 1508, 1997.
[5] P. Gruhn, The evaluation of safety instrumented systems: tools to peer past the hype, ISA Transactions 35 (1996) 25–32.
[6] CaSSPack (Control and Safety System Modeling Package), L&M Engineering.
[7] D.J. Smith, Reliability, Maintainability and Risk, Butterworth Heinemann, 1993.
[8] N.G. Leveson, Safeware: System Safety and Computers, Addison-Wesley, 1995.