Interrater reliability and the assessment of pressure-sore risk using an adapted Waterlow Scale


M. Cook, C. Hale and B. Watson

Objective: the purpose of this paper is to highlight the issue and the complexity of the assessment of interrater reliability in the assessment of pressure sore risk.

Design: an empirical study undertaken to assess the interrater reliability of an adapted Waterlow Scale is described. The scores obtained from 15 patients on two wards are presented. Each patient was assessed daily by two different nurses over a period of 7 days. A total of 210 assessments were obtained. Interrater reliability was assessed using two statistical techniques: percentage of agreement and correlation.

Setting: the elderly care unit of a district general hospital in the North of England.

Participants: twenty-eight ward staff and the research coordinator.

Results: the statistical tests showed a weak-to-moderate degree of interrater reliability. A comparison of the reliability coefficients obtained from these tests with a visual examination of the actual scores obtained by the raters is made and discussed in relation to the clinical acceptability of reliability coefficients.

Conclusions:

1. The relationship between statistical acceptability and clinical acceptability needs to be further explored, particularly in relation to the assessment of pressure sore risk.

2. Further investigation into the reliability of risk assessment tools is needed to identify if and under what circumstances they are effective, and what training in the use of the tool is needed to improve and maintain reliability and thereby be of any real practical use.

Margaret Cook RN, MSc, BSc, Staff Nurse, Neurological Rehabilitation Unit, Sunderland City Hospitals NHS Trust, Sunderland

Claire Hale RN, RNT, PhD, BA, Dame Kathleen Raven Professor of Clinical Nursing, School of Healthcare Studies, University of Leeds, General Infirmary at Leeds, Great George St, Leeds LS1 3EX, UK

Bill Watson RN, MSc, Senior Lecturer, Nursing Research and Development Unit, Faculty of Health, Social Work and Education, University of Northumbria, Coach Lane Campus, Newcastle upon Tyne NES, UK

Correspondence to: Claire Hale

© Harcourt Publishers Ltd

Keywords: decubitus ulcer, pressure sore, Waterlow Scale, risk assessment, nursing practice

INTRODUCTION AND BACKGROUND

For many years pressure sores have been seen as a major problem in the National Health Service (NHS), not only because of the costs (estimated at … million per year [Smith et al. 1995]), but also because of the suffering they cause. The presence of pressure sores has been in the past, and to some extent still is, seen as an indicator of poor nursing care, and pressure sore prevalence is often used as a quality indicator. As a result, considerable effort has been put into preventing the occurrence of pressure sores by identifying those who are most at risk and concentrating preventative measures on this group of patients. Prevention requires a reliable and valid assessment tool which accurately discriminates between patients at no risk, low risk, high risk and very high risk, in order to allocate expensive prevention resources efficiently.

Since the development of the Norton Scale in the early 1960s (Norton et al. 1962) a number of scales have been developed to assess the vulnerability of patients to pressure damage. Clark (1993) claimed there were 18 risk-assessment tools in existence, most of which attempted to overcome the apparent failures of the Norton Scale. However, the majority of these risk assessment scales have not been subjected to rigorous scrutiny in terms of their reliability and validity.

Clinical Effectiveness in Nursing (1999) 3, 66-74 © 1999 Harcourt Publishers Ltd

Table 1 The original Waterlow Scale

Ring scores in the table and calculate the total (several scores per category can be used).
Score: 10+ At Risk; 15+ High Risk; 20+ Very High Risk

Build/weight for height: Average 0; Above average 1; Obese 2; Below average 3
Continence: Completely catheterized 0; Occasionally incontinent 1; Catheterized/incontinent of faeces 2; Doubly incontinent 3
Skin type (visual risk area): Healthy 0; Tissue paper 1; Dry 1; Oedematous 1; Clammy (temp ↑) 1; Discoloured 2; Broken/spot 3
Mobility: Fully 0; Restless/fidgety 1; Apathetic 2; Restricted 3; Inert/traction 4; Chairbound 5
Sex/age: Male 1; Female 2; 14-49 1; 50-64 2; 65-74 3; 75-80 4; 81+ 5
Appetite: Average 0; Poor 1; NG tube/fluids only 2; NBM/anorexic 3
Tissue malnutrition: Terminal cachexia 8; Cardiac failure 5; Peripheral vascular disease 5; Anaemia 2; Smoking 1
Neurological deficit: Diabetes, MS, CVA, motor/sensory, paraplegia 4-6
Major surgery/trauma: Orthopaedic - below waist, spinal; On table over 2 hours
Medication: Cytotoxics, high doses of steroids, anti-inflammatory

The reliability of a measuring instrument is a major standard for assessing its quality and adequacy. It refers to the consistency of the scale no matter who uses it. Gibbon (1995) states that reliability is an absolute prerequisite to validity. The validity of a tool is determined by its ability to measure what it claims to measure. Polit and Hungler (1991) claim that a measuring tool that is not reliable cannot validly be measuring the characteristic of concern. The greater the number of ways a scale is shown to be reliable and valid, the more confidence one can have in its utilization.

In the UK one of the most widely used pressure-risk assessment scales is the Waterlow Scale (1985). It was developed after a survey revealed that 6.6% of all hospital inpatients developed pressure damage whilst in hospital. A concern arising from the survey was that a number of patients identified by the Norton Scale as 'not at risk' developed pressure damage. Waterlow reviewed the risk factors associated with pressure damage and concluded that many were overlooked by the Norton Scale. The Waterlow Scale consists of ten categories, each containing a number of subscales. Each subscale is allocated a 'risk score' ranging from 0 (most favourable) to 6/8 (least favourable). To achieve the total risk score, first the category scores are obtained and then these are totalled. A patient is deemed 'at risk' if the total score is between 10 and 14, at 'high risk' if the total score is between 15 and 19 and at 'very high risk' if the total score is 20 or above. Waterlow is nationally recognized in the UK. However, despite its popularity, there has been little assessment of its reliability and validity (Bridel 1993, Smith et al. 1995). Table 1 shows the original Waterlow Scale.
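The scoring rule described above can be sketched in a few lines of Python. This is only an illustrative sketch: the function names and the example patient are hypothetical, the label 'low risk' for totals under 10 is borrowed from the risk bands used later in the paper, and a total of exactly 20 is treated as 'very high risk', following the '20+' band in the scale's scoring key.

```python
def total_score(category_scores):
    """Sum the category scores to obtain the total risk score."""
    return sum(category_scores.values())


def waterlow_risk_category(total):
    """Map a total score to a risk band (10+, 15+, 20+ thresholds)."""
    if total < 0:
        raise ValueError("total score cannot be negative")
    if total >= 20:
        return "very high risk"
    if total >= 15:
        return "high risk"
    if total >= 10:
        return "at risk"
    # Totals below 10 fall outside the three named bands.
    return "low risk"


# Hypothetical patient, for illustration only (not study data).
example = {"build": 1, "continence": 2, "skin": 1, "mobility": 3,
           "sex": 2, "age": 4, "appetite": 1}
print(total_score(example), waterlow_risk_category(total_score(example)))
```

Because the bands are defined by fixed cut-offs, two raters whose totals differ by a single point either side of a threshold (for example 14 and 15) will place the same patient in different risk categories.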

In early 1993, the Elderly Services Directorate of a district general hospital in the North East of England began to review its pressure sore prevention policy. The Norton Score, which was then used in the directorate, was failing to identify those who subsequently developed pressure sores. A pressure care interest group was instigated and all the current pressure-sore risk-assessment tools were reviewed.

Table 2 The adapted Waterlow Scale

Ring scores in the table and calculate the total (several scores per category can be used).
Score: 10+ At Risk; 15+ High Risk; 20+ Very High Risk

Build/weight for height (1): Average 0; Above average 1; Obese 2; Below average 3
Continence (2): Completely catheterized 0; Occasional incontinence 1; Catheterized/incontinent of faeces 2; Incontinence of urine 2
Skin type: visual risk area (3): Healthy 0; Tissue paper 1; Dry 1; Oedematous 1; Clammy 1; Discoloured 2; Broken/spot 3
Mobility (4): Fully 0; Restless/fidgety 1; Apathetic 2; Restricted/inert 3; Traction 4; Chairbound 5
Sex/age (5): Male 1; Female 2; 14-49 1; 50-64 2; 65-74 3; 75-80 4; 81+ 5
Appetite (6): Average; Poor; NG tube/fluids only; NBM/anorexic
Special risks: tissue malnutrition (7): Terminal cachexia; Cardiac failure; Peripheral vascular disease; Anaemia; Smoking
Neurological deficit/general medical condition (8): Diabetes, MS, CVA, motor/sensory, paraplegia, severe rheumatoid arthritis
Major surgery/trauma (9): Orthopaedic - below waist, spinal; On table over 2 hours
Medication (10): Steroids, cytotoxics, high doses of anti-inflammatory, sedation

The Waterlow Scale was already being used in other areas within the hospital and it was decided to trial this tool on the elderly care ward. The trial suggested that there were limitations to using the Waterlow Scale in the elderly care area, particularly with patients who were continuously incontinent of urine but who were continent of faeces. As a result, the pressure care interest group decided to make some minor changes to the original Waterlow Scale. The tool subsequently produced was known within the hospital as the adapted Waterlow Scale. Details of the process by which this tool was developed can be found in Cook (1994). Table 2 shows the adapted version of the Waterlow Scale, which is the subject of this paper.

The format of the adapted version of the Waterlow Scale was similar to the original, but contained the following changes:

• In the 'continence' subscale of the Waterlow Scale, there is no means of scoring a patient who is incontinent of urine but continent of faeces. Waterlow appears to assume that if a patient is incontinent of urine they will be catheterized. However, it was not the policy of the hospital involved in this study to catheterize all incontinent patients; therefore, 'doubly incontinent' was replaced by 'incontinent of urine' and rated a score of 2.

• In the Waterlow Scale the 'neurological deficit' subscale rates a score of between 4 and 6, allowing for individual interpretation by each assessor. In the adapted scale, the 'neurological deficit' subscale has been changed to 'neurological deficit/general medical condition' and rated a standard score of 6. Rheumatoid arthritis was added to the list of conditions which would fall into this category.

• The 'medication' subscale in the adapted scale has been extended to include high doses of sedation over a 24 hour period. This alteration was made following an in-depth discussion with the head of the pharmacy department, who maintained that such doses could not only affect the patient's mobility and state of mind, but also that the 'cocktail' of drugs often prescribed to the elderly could result in a slowing down of the healing process.

In April 1994, the tool was introduced into the elderly care area of the hospital for a trial period and an evaluation was carried out once the staff had gained some practice using the new tool. A full report of the study is available elsewhere (Cook 1996). This paper discusses an assessment of the tool's interrater reliability.

INTERRATER RELIABILITY

Interrater reliability is an estimate of the degree to which two or more independent raters, observers, scorers, judges or interviewers are consistent in their judgments (Goodwin & Prescott 1981). The assessment of interrater reliability is particularly important in the development of standard measuring instruments which will be used by a variety of raters in a variety of situations. In these circumstances interrater reliability is assessed statistically and a reliability coefficient is obtained. Goodwin and Prescott (1981) explain that most of the statistical approaches to reliability assessment are based on classical test theory, which assumes that a person's score on a measure is the sum of two parts: a true score and an error component. Error may be caused by many factors, e.g. the subjects themselves, poor technique by the raters, insufficient time to carry out the assessment, and poor wording of items on the instrument. The aim when developing measuring instruments of any kind is to minimize the error component so that the scores subsequently obtained are as near as possible to the true score. However, in all statistical assessments of interrater reliability it is assumed that obtained scores are never totally free of error. The reliability coefficient obtained from the application of a statistical technique to a data set expresses the amount of variance in observed scores that can be considered true-score variance, as opposed to error variance, and will range from the theoretical values of 0 to 1. An unreliable instrument will have a reliability coefficient approaching 0 (e.g. 0.02), which indicates that the measure is producing unstable or inconsistent scores. A reliable instrument will have a reliability coefficient approaching the maximum theoretical value of 1 (e.g. 0.95). When assessing the reliability (interrater or other) of a particular instrument, it is important to remember that reliability is not a property of an instrument in isolation (Goodwin & Prescott 1981). The reliability coefficient obtained in a particular study is specific to that group of subjects in those particular circumstances. The reliability coefficient obtained in one study cannot automatically be assumed to apply to other subjects in other circumstances.
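The classical test theory model described above is often written as follows. The notation is an assumption introduced here for illustration and does not appear in the paper: X is the observed score, T the true score, E the error component, and the true score and error are taken to be uncorrelated.

```latex
X = T + E, \qquad
\sigma_X^2 = \sigma_T^2 + \sigma_E^2, \qquad
r_{XX} = \frac{\sigma_T^2}{\sigma_X^2} = \frac{\sigma_T^2}{\sigma_T^2 + \sigma_E^2}
```

Under these assumptions the reliability coefficient r_XX equals 1 when there is no error variance and approaches 0 as error variance comes to dominate the observed scores, which matches the 0 to 1 range described in the text.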

To add to the complexity of assessing interrater reliability, there are also several different approaches to assessment. One of the most frequently used is Cohen's κ (Burns & Grove 1993). Two other popular approaches are 'percentage of agreement' and 'correlation technique' (Goodwin & Prescott 1981). Goodwin and Prescott (1981), using the same data set, also demonstrated how four different approaches to estimating interrater reliability yielded substantially different results and interpretations. They concluded that increased attention needs to be paid to reliability estimation in nursing research when measuring instruments are being developed. In particular, there is a need to consider the ways in which reliability estimates are calculated and the appropriateness of the corresponding interpretations. This requires that more information be given in a study report about the type of reliability coefficient calculated and the subjects involved, as well as other relevant features of the reliability estimation.

Unfortunately, it appears that nurse researchers have paid little attention to the advice of Goodwin and Prescott. A recent citation search of the article provided only 20 citations, only one of which was from an article about the assessment of pressure sore risk (Bergstrom et al. 1987). This paper attempts to redress that balance.

RESEARCH DESIGN

Setting

The study was carried out in two wards within the Elderly Services Directorate of a district general hospital in the North East of England:

• Ward 1: a stroke rehabilitation unit. Admission to this ward depended upon the patient's potential for rehabilitation.

• Ward 2: an acute medical/rehabilitation ward. Patients were admitted directly from home or via the Acute Medical Emergency Unit.

Approval for the study to take place was given by the Local Research Ethics Committee.

RESEARCH PARTICIPANTS

Patients

The ward manager of each ward was requested to compile a list of the patients suitable for inclusion in the research. The rationale for this was to avoid any patient being included whose medical condition was regarded as unstable. The stability of the patient's condition was deemed necessary as an inclusion criterion in order to exclude (as far as possible) changes in a patient's condition as a reason for variation in the nurse assessment scores. It was also believed that the ward manager was in the best position to safeguard any patient who was 'confused' or 'mentally impaired' from unwittingly giving their permission to be involved. Written and witnessed consent was obtained from all patients who participated in the study. In total, 15 patients participated in the study: nine from ward 1 and six from ward 2.


Fig. 1 Assessment scores for each patient on ward 1 (x-axis: patient identification number; y-axis: assessment score, 0-30)

Fig. 2 Assessment scores for patients on ward 2 (x-axis: patient identification number; y-axis: assessment score, 0-30)

Nurses

Twenty-six registered nurses and two final-year nursing students took part in the study. The only criteria for the nurses' inclusion were their consent and their daily involvement with the selected patients. Prior to commencing the study, each nurse working on the study wards was given an information sheet and the opportunity to ask any questions. All the nurses involved in the study had prior experience of using the 'adapted' Waterlow Scale. No additional training was given in the use of the tool because the study required that the nurses use their existing knowledge both of the scale and of their patients in order to give an accurate account of how the scoring scale is normally interpreted when assessing patients. Both first- and second-level registered nurses were included in the study. No differentiation was made between the level and grade of nurse because all second-level nurses had been qualified for over 5 years and were considered as competent in assessing a patient's potential risk of developing pressure damage as were their first-level counterparts. The final-year students were also considered to be competent in this assessment technique and two were included in the 'nurse' sample.

METHODS OF DATA COLLECTION

The same protocol was used on both participating wards. Every day for a period of 7 days, two nurses assessed each participating patient. Two different nurses carried out the assessment each day so that each nurse assessed each participating patient once only during the study period. The nurses assessed the patient's risk status as they did in everyday practice. Each nurse was unaware ('blind') of the other assessment scores. Assessment scores were recorded on purpose-designed sheets and, once completed, these were placed in a sealed envelope for collection by the researcher. Altogether, a total of 210 assessments were obtained (approximately 14 from each patient), 126 from ward 1 and 84 from ward 2. For the purpose of analysis the assumption was made that the condition of the patient did not change over the period of the assessment week (no independent clinical evidence was obtained to suggest that any change had taken place). Using the Statistical Package for the Social Sciences (SPSS Inc., Chicago), a coding frame was developed using the patient as the unit of analysis. The separate scale category scores from each nurse were entered for each patient, along with the total score and the final 'risk category' (e.g. low risk, at risk, high risk, very high risk).

FINDINGS

Figures 1 and 2 show the total scores which were obtained from wards 1 and 2 for all the patients in the study during the data-collection period. The patient identification numbers are plotted along the X-axis and the Waterlow scores along the Y-axis. Each small cross represents a risk score given by one of the nurse assessors for the respective patient. Thus, for patient 1 (Fig. 1) it can be seen that four assessors gave that patient a score of 15. A visual examination of these raw data clearly shows the wide variation in the scores obtained for each patient over the period of 7 days. A specific inclusion criterion for patients was that, in the judgement of the nursing staff, the clinical condition of the patient was expected to remain stable for the next 7 days. Verification with the ward nursing staff on completion of the data collection was carried out and no evidence was obtained which suggested that variations in the assessment scores were due to changes in the patient's condition. No other independent assessment was carried out to ascertain if the patient's condition remained stable.

In Figures 3 and 4 the floating vertical bars represent the range of raw scores (already shown in Figs 1 and 2) obtained for each patient. When these raw scores are plotted against the background of the risk category parameters (<10 = low risk, 10-14 = at risk, 15-19 = high risk, 20+ = very high risk), which form the horizontal bands in Figures 3 and 4, it can be seen that no patient received scores that placed them in a single risk category. Five patients received scores which placed them in two different risk categories. Nine out of the fifteen patients received scores from different nurse raters spanning three of the 'risk categories'. One patient received scores spanning all four categories.

Fig. 3 Distribution of assessment scores from ward 1 across risk categories of the adapted Waterlow Scale (x-axis: patient identification number; n = number of assessments per patient; bands: low risk, at risk, high risk, very high risk)

Fig. 4 Distribution of assessment scores from ward 2 across risk categories of the adapted Waterlow Scale (x-axis: patient identification number; n = number of assessments per patient; bands: low risk, at risk, high risk, very high risk)
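The count of risk categories spanned by each patient's scores can be reproduced with a short Python sketch. The function names and the scores below are hypothetical and chosen only for illustration (the study data are not reproduced here); the band boundaries are those quoted in the text.

```python
CATEGORY_ORDER = ["low risk", "at risk", "high risk", "very high risk"]


def risk_category(total):
    """Band boundaries as quoted in the text: <10, 10-14, 15-19, 20+."""
    if total >= 20:
        return "very high risk"
    if total >= 15:
        return "high risk"
    if total >= 10:
        return "at risk"
    return "low risk"


def categories_spanned(scores):
    """Distinct risk categories covered by one patient's scores, lowest first."""
    found = {risk_category(s) for s in scores}
    return [c for c in CATEGORY_ORDER if c in found]


# Hypothetical scores from several raters for two patients.
scores_by_patient = {
    "A": [11, 12, 13, 14, 16, 17],
    "B": [8, 12, 15, 18, 21],
}
for patient_id, scores in scores_by_patient.items():
    spanned = categories_spanned(scores)
    print(patient_id, len(spanned), spanned)
```

A patient whose scores all fall in one band would return a list of length one; in the study no patient met that condition.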

The variability in the scores may have been due to differing levels of knowledge regarding the patients being assessed. However, all the nurses involved in the study were part of the nursing team caring for the patients. Each nurse shared the same database (patient's nursing and medical notes) to access the relevant information required to complete the assessment. Unfortunately, not all nurses used this information and this had implications for the risk category into which the patient was placed. For example, patient 15 (ward 2), who received the outlier score of 5 from one nurse assessor, was placed in three 'at risk' categories. After a close inspection of all the assessment scores relating to this patient, it was discovered that in that one particular instance the nurse rater did not classify him as suffering from a 'stroke'. All the other 13 raters did so. Had that nurse rated him the same as the others, then his score would have been 11 and not 5. This would have placed him in two 'at risk' categories, not three.

ANALYSIS OF THE SUBSCALES

The adapted Waterlow Scale, like the original Waterlow Scale, is composed of a number of subscales, each of which has its own score. In an attempt to ascertain if the variation in the overall nurse rater scores was caused by difficulties in a particular section of the pressure sore assessment scale, the subscale scores for each patient were analysed. Figure 5 shows the median range of scores assigned to each patient for each scale category. These scores were calculated by tabulating, for each patient, the minimum and maximum score in each scale category. The median minimum and maximum scores were then calculated across all cases and these scores were plotted as floating bars. The wider the bar, the greater the variation in scores given by the raters. So, for example, in the categories of 'sex', 'age' and 'major surgery' the bars are narrow, indicating a high degree of consensus in scores. In other categories the bars were wider, indicating more variability. The nurse raters varied most in their assessment of skin type and mobility status. Clearly, if variability of scores in these subscales alone could be reduced then there is likely to be a corresponding decrease in the variability of the patients' overall pressure sore risk scores and an improvement in interrater reliability.

Fig. 5 Median range of scores assigned to each patient for each scale category (categories: major surgery, mobility, age, build, sex, tissue malnutrition, medication, neurological deficit, appetite, continence, skin type)
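The median-range calculation behind Figure 5 can be sketched as follows. This is a minimal illustration under stated assumptions: the data layout (one dictionary of category scores per rater, grouped by patient), the function name and the example values are all hypothetical, and only two categories are shown.

```python
from statistics import median


def median_range_by_category(ratings):
    """ratings maps patient id -> list of {category: score} dicts, one per rater.

    Returns {category: (median of per-patient minima, median of per-patient maxima)}.
    """
    minima, maxima = {}, {}
    for rater_sheets in ratings.values():
        for category in rater_sheets[0]:
            values = [sheet[category] for sheet in rater_sheets]
            minima.setdefault(category, []).append(min(values))
            maxima.setdefault(category, []).append(max(values))
    return {c: (median(minima[c]), median(maxima[c])) for c in minima}


# Hypothetical ratings: three patients, three raters each.
ratings = {
    "P1": [{"mobility": 1, "sex": 2}, {"mobility": 3, "sex": 2}, {"mobility": 2, "sex": 2}],
    "P2": [{"mobility": 0, "sex": 1}, {"mobility": 3, "sex": 1}, {"mobility": 1, "sex": 1}],
    "P3": [{"mobility": 2, "sex": 2}, {"mobility": 5, "sex": 2}, {"mobility": 3, "sex": 2}],
}
result = median_range_by_category(ratings)
for category, (low, high) in result.items():
    print(category, low, high, "bar width:", high - low)
```

In this made-up example the 'sex' bar has zero width (complete consensus) while the 'mobility' bar is two points wide, which is the kind of contrast the floating bars are meant to reveal.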

STATISTICAL ASSESSMENT OF INTERRATER RELIABILITY

As stated above, interrater reliability is an important concept in any kind of assessment. Basically, it is the extent to which two or more people obtain the same scores when measuring a given phenomenon. Demonstrating good interrater reliability is an essential prerequisite to claiming the validity of any measurement tool, and statistical methods are often used to demonstrate interrater reliability. On a visual examination, the interrater reliability scores of the risk-assessment tool described here are poor from a clinical perspective. However (as explained previously), within measurement theory there is always an expectation that there will be some variability between raters. The question then becomes: how much variation is acceptable? Working on the proposition that the variation within the scale scores demonstrated in this study was clinically unacceptable, the authors decided to apply some commonly used statistical methods to the data to ascertain the statistical level of interrater reliability of the data set. The purpose of this exercise was to benchmark the visual data against an interrater reliability statistic.

Fig. 6 Percentage of agreement - ward 1 (x-axis: patient identification number; y-axis: percentage of agreement)

Fig. 7 Percentage of agreement - ward 2 (x-axis: patient identification number; y-axis: percentage of agreement)

As discussed previously, several methods have been developed to measure interrater reliability. One of the most frequently used is Cohen's κ, a statistic which expresses the agreement between raters after allowing for the agreement expected by chance alone (Burns & Grove 1993). However, the aims of this study were to assess interrater reliability whilst taking account of the everyday working environment of clinical nurses rather than in a controlled environment. In practice, a single nurse may assess several patients and this study was designed to reflect this. Therefore, the assessment data (the scores) were distributed over a high number of patients relative to the number of raters (the study was designed so that every nurse assessed approximately ten patients). As such, Cohen's κ was considered to be inappropriate and instead two methods outlined by Goodwin and Prescott (1981) were followed. These two methods were: percentage of agreement and correlation technique.
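For readers unfamiliar with Cohen's κ, the following Python sketch shows the standard two-rater calculation. It is included only to illustrate the chance correction mentioned above; it was not part of the study's analysis, and the function name and the ratings are hypothetical.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters who each categorised the same subjects."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must rate the same, non-empty set of subjects")
    n = len(rater_a)
    # Observed proportion of subjects on which the raters agree.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance, from each rater's category frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement equals 1")
    return (observed - expected) / (1.0 - expected)


# Hypothetical risk categories assigned by two raters to ten patients.
a = ["at"] * 5 + ["high"] * 5
b = ["at"] * 4 + ["high"] * 5 + ["at"]
kappa = cohen_kappa(a, b)
print(round(kappa, 2))  # observed 0.8, chance 0.5, so kappa is about 0.6
```

The calculation requires both raters to have categorised the same set of subjects, a condition that a design in which different nurses assess the patient each day does not readily satisfy.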

PERCENTAGE OF AGREEMENT

Goodwin and Prescott (1981) claim that percentage of agreement is the most frequently used approach to interrater reliability because it expresses reliability in terms of the number of times the raters agree relative to the total number of assessments made. They state that percentage of agreement is best used when dealing with data with few distinct categories, because the larger the numbers of choices open to raters, the higher the probability that exact agreement will not occur. For this study the overall risk category scores (low risk, at risk, high risk, very high risk) were used for the calculation of percentage of agreement. Using percentage of agreement as a measure of interrater reliability is appropriate in this study, as it could be argued that the risk category into which the patient is placed is the most important of all the scores because it is this category which usually determines the allocation of pressure relieving and preventative resources.

To calculate percentage of agreement for each patient in this study, the following steps were undertaken:

1. The risk category most frequently assigned (the mode) for each patient was recorded. For example, for patient 1 (Fig. 1) the most frequently assigned risk score was 15 (four nurses gave this score). A risk assessment score of 15 puts the patient at 'high risk'; therefore, for this patient the modal risk category was 'high risk'.

2. The percentage of scores allocated to this category was calculated. For example, for patient 1 the percentage of agreement with the modal score was 64.2%.
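The two steps above can be sketched in Python as follows. The function names and the 14 scores are hypothetical; the scores were chosen so that 9 of 14 fall in the modal category, giving roughly 64.3%, which is in line with the figure quoted for patient 1. The actual study scores are not reproduced here.

```python
from collections import Counter


def risk_category(total):
    """Band boundaries used in the paper: <10, 10-14, 15-19, 20+."""
    if total >= 20:
        return "very high risk"
    if total >= 15:
        return "high risk"
    if total >= 10:
        return "at risk"
    return "low risk"


def percentage_agreement(scores):
    """Return (modal risk category, percentage of assessments falling in it)."""
    categories = [risk_category(s) for s in scores]
    # most_common(1) returns the first-encountered category in the event of a tie.
    modal_category, count = Counter(categories).most_common(1)[0]
    return modal_category, 100.0 * count / len(categories)


# Hypothetical scores from 14 assessments of one patient.
scores = [15, 15, 15, 15, 16, 17, 18, 16, 19, 12, 13, 14, 20, 22]
modal, pct = percentage_agreement(scores)
print(modal, round(pct, 1))
```

Note that agreement is measured on the risk category rather than on the raw score, so raters who give 15 and 19 count as agreeing while raters who give 14 and 15 do not.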

Figures 6 and 7 show the percentage of agreement scores for all the patients on each ward. Each column represents the percentage of agreement for each patient. The actual percentage figure is given at the top of each column. It is clear that the nurse assessors were more in agreement about some patients than others.

The percentage of agreement for each ward was calculated by taking the mean of all the percentage of agreement scores across each ward, giving a score of 55.5% for ward 1 and 72.6% for ward 2. The higher level of agreement on ward 2 over ward 1 can be clearly seen from the height of the columns in Figures 6 and 7. From a statistical perspective, these percentage of agreement scores show a moderate degree of interrater reliability among the raters.

CORRELATION TECHNIQUE

Correlation technique expresses reliability in terms of the correlation between the sets of total scores of two raters, i.e. the extent to which scores are ranked similarly by the different raters. In this study the median correlation coefficient was calculated for the staff on each ward by first obtaining the sets of total risk scores calculated by each nurse. Then Kendall's tau was calculated for each possible rater pair. This was done for each ward and resulted in a matrix of correlation coefficients (Figs 8 and 9). The median of all the coefficients for each ward was then obtained by using Fisher's Z score transformation. Interrater reliability estimates of 0.36 for ward 1 and 0.5 for ward 2 were obtained. These results suggest a weak-to-moderate degree of interrater reliability.
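One way to read the procedure described above is sketched below in Python. Several points are assumptions rather than details given in the paper: the tau-b variant of Kendall's coefficient is used (it adjusts for tied scores), each pair of raters is assumed to have scored the same patients in the same order, coefficients of exactly ±1 are clipped slightly so that the Fisher transformation stays finite, and the scores are hypothetical.

```python
import math
from itertools import combinations
from statistics import median


def kendall_tau_b(x, y):
    """Kendall's tau-b for two equal-length lists of scores (adjusts for ties)."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length lists with at least two scores")
    concordant = discordant = ties_x = ties_y = 0
    for i, j in combinations(range(len(x)), 2):
        dx, dy = x[i] - x[j], y[i] - y[j]
        if dx == 0 and dy == 0:
            continue  # tied on both raters: contributes to neither term
        if dx == 0:
            ties_x += 1
        elif dy == 0:
            ties_y += 1
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    denom = math.sqrt((concordant + discordant + ties_x) *
                      (concordant + discordant + ties_y))
    if denom == 0:
        raise ValueError("tau is undefined when a rater gives constant scores")
    return (concordant - discordant) / denom


def median_pairwise_tau(scores_by_rater):
    """Median of all rater-pair taus, taken on Fisher's z scale and transformed back."""
    z_values = []
    for a, b in combinations(scores_by_rater, 2):
        tau = kendall_tau_b(scores_by_rater[a], scores_by_rater[b])
        tau = max(min(tau, 0.999999), -0.999999)  # keep atanh finite
        z_values.append(math.atanh(tau))
    return math.tanh(median(z_values))


# Hypothetical total scores given by three raters to the same five patients.
scores_by_rater = {
    "nurse_1": [12, 15, 18, 21, 9],
    "nurse_2": [11, 16, 17, 22, 10],
    "nurse_3": [14, 13, 19, 20, 8],
}
ward_estimate = median_pairwise_tau(scores_by_rater)
print(round(ward_estimate, 2))
```

With an odd number of rater pairs the Fisher transformation does not change the median, since it preserves order; with an even number the two middle values are averaged on the z scale, which can differ slightly from averaging the raw coefficients.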

COMPARISON OF THE VISUAL ASSESSMENT AND STATISTICAL ASSESSMENT OF INTERRATER RELIABILITY

Visual examination of the range of risk scores obtained for each patient (Figs 1-4) suggests that the interrater reliability of the nurse raters is poor. However, the statistical estimation of interrater reliability which was carried out using the two different techniques suggests a moderate degree of interrater reliability when using percentage of agreement and weak-to-moderate reliability when using Kendall's tau. Both techniques suggested that the raters on ward 2 had a higher degree of interrater reliability than the raters on ward 1, which at least shows consistency! However, this distinction was not evident from a visual examination of the data as presented in Figures 1-4. This gives rise to two issues:

• were the statistical tests used inappropriate?

and/or

• in the assessment of interrater reliability for instruments to be used clinically, should a higher level of statistical interrater reliability be seen as necessary for clinical acceptability?

These are discussed in more detail below.

DISCUSSION

Screening tools, such as the pressure risk assessment scale, are useful where there is an important health problem that can be prevented or improved by early detection using a simple, reliable and valid tool. Pressure-risk assessment tools have been introduced into the nursing repertoire to enhance systematic assessment and allow effective targeting of expensive resources. However, if these tools are not valid and reliable then not only are they failing to identify correctly those patients at risk, but they could also be directing scarce and expensive resources haphazardly. This study raised a number of issues, namely:

• the variation within the scores obtained by the nurse raters

• identification of potential areas for reducing the variability of the score

• the implications of this study for the interrater reliability of the Waterlow Scale

• the difference between clinical acceptability and statistical acceptability

• the design of interrater reliability studies.

There is an assumption within nursing that pressure-sore risk assessment tools are self-explanatory and that minimal (if any) training is needed to use them. Yet pressure-sore risk-assessment tools are frequently quite complex devices. The Waterlow Scale, from which the scale used in this study was adapted, consists of 10 categories, of which four require the user to rate within them and six require the user to know certain facts about the patient. Moreover, when a large number of choices are available to the raters, there is a higher probability that exact agreement will not occur (Goodwin & Prescott 1981). Thus, a combination of poor training in the use of an assessment tool and an assessment tool which allows a large number of choices is likely to lead to poor interrater reliability scores. In this study it was noticed that two particular categories were responsible for most of the variability in the total score. These were 'mobility' and 'skin type'. An education programme which focused on developing criteria for scoring within these two categories alone would do much to improve the interrater reliability scores.

Most pressure-sore risk-assessment tools are designed to produce an overall 'risk score' which forms the basis of the categorization of patients on a continuum from 'no/low risk' to 'high risk'. Decisions regarding which patients receive expensive pressure-relieving equipment rest on the assessed degree of risk they face. Therefore, even if individual nurses differ over the assessment of subscale scores, it is not unreasonable to expect a high degree of unanimity over the final risk score, at least to the extent that all the raters' scores fall into one or, at worst, two categories (where the scores cluster around a risk category boundary). In this study the interrater reliability obtained from the percentage of agreement technique (which used the overall risk category into which individual patients were placed as a result of their actual score) was higher than that obtained from correlating the actual scores that the nurses gave to the patients. This finding is not unexpected because the categorization process serves to iron out some of the individual variations. What is more worrying is the extent to which the scores of some patients ranged across three or four risk categories. Certainly any nurse manager looking at the results presented in Figures 3 and 4 would have serious concerns about the effectiveness of this tool in a clinical situation because he or she would have no way of knowing if the risk score assessed by one nurse was anywhere near the 'true risk score' for that patient.
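The difference between the two techniques can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the scores are invented, the cut-off points are loosely modelled on the published Waterlow bands rather than taken from the adapted scale used in this study, and a Pearson coefficient is assumed as the correlation measure.

```python
from statistics import mean

# Hypothetical cut-offs (lower bound, label), loosely modelled on the
# published Waterlow bands; the adapted scale in the study may differ.
BANDS = [(20, "very high risk"), (15, "high risk"),
         (10, "at risk"), (0, "no/low risk")]


def risk_category(score):
    """Map a total risk score to its (assumed) risk category."""
    for lower_bound, label in BANDS:
        if score >= lower_bound:
            return label
    raise ValueError("score must be non-negative")


def percentage_agreement(scores_a, scores_b):
    """Percentage of patients placed in the same category by both raters."""
    matches = sum(1 for a, b in zip(scores_a, scores_b)
                  if risk_category(a) == risk_category(b))
    return 100.0 * matches / len(scores_a)


def pearson_r(scores_a, scores_b):
    """Pearson correlation between the raw scores of two raters."""
    mean_a, mean_b = mean(scores_a), mean(scores_b)
    cov = sum((a - mean_a) * (b - mean_b)
              for a, b in zip(scores_a, scores_b))
    var_a = sum((a - mean_a) ** 2 for a in scores_a)
    var_b = sum((b - mean_b) ** 2 for b in scores_b)
    return cov / (var_a * var_b) ** 0.5


def categories_spanned(scores):
    """Number of distinct risk categories covered by one patient's scores."""
    return len({risk_category(s) for s in scores})


# Invented scores for six patients, each assessed by two raters.
rater_a = [11, 16, 21, 12, 17, 22]
rater_b = [14, 19, 24, 10, 15, 20]

print(percentage_agreement(rater_a, rater_b))   # 100.0
print(round(pearson_r(rater_a, rater_b), 2))    # 0.84
print(categories_spanned([9, 14, 18, 21]))      # 4
```

In this invented example the two raters differ by up to three points on every patient, so the correlation of raw scores is below 1, yet every pair of scores falls in the same band and category agreement is complete. The last line shows the opposite, more worrying case: four assessments of one patient that land in four different categories.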

The pressure sore assessment tool used in this study was developed from the Waterlow risk-assessment tool which itself is widely used in the UK. While the authors would not presume to suggest that the findings from the data obtained within this study can be generalized to the Waterlow Scale, similar studies should be carried out on the Waterlow Scale itself because the principles of assessment and scoring are the same across both tools and there has, in fact, been very little assessment of the reliability and validity of the Waterlow Scale (Bridel 1993, Smith et al. 1995).

In medical research the concept of clinical significance is as important as statistical significance. This concept can be applied in field studies of the type described here. It would be difficult to sustain an argument that the raw risk assessment scores obtained in this study are acceptable for 'clinical' purposes; however, the statistical tests carried out suggested a higher degree of interrater reliability. In this study, because the data set was comparatively small, we examined the data visually to see what moderate reliability and weak-to-moderate reliability really meant. Further work should be undertaken with larger data sets to correlate 'clinical' acceptability and statistical acceptability.

The design of interrater reliability studies also needs some attention. In the past, much work has been concentrated on the predictive validity of pressure-sore scales to assess the relationship between the score and future pressure-sore development (Edwards 1994) and this work continues. To avoid the problems of interrater reliability, usually only one or two raters are involved in this type of study. However, to be a useful tool clinically, the reliability of the scores obtained by many different raters has to be assured. If it is not, then the predictive validity of the tool (in practice) and the effectiveness of the allocation of resources according to the scores obtained are questionable. This particular study was designed to have as large a number of raters as possible and practical, assessing a group of patients. This, combined with the structure of the assessment tool, made the assessment of interrater reliability complex, and we would recommend that future studies of this kind be carried out in collaboration with statisticians. Goodwin and Prescott (1981) themselves recognized the limitations of the techniques used in this study and suggested that a useful way forward would be to use generalizability theory techniques (Cronbach et al. 1972), and this could be the focus of further work in this area.
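As an indication of what a generalizability theory analysis might involve, the following is a minimal sketch of the simplest case: a one-facet design in which every patient is scored by every rater. This is an assumption made for illustration and does not match the design of the present study, in which different nurses assessed each patient; the scores are invented and the function names are hypothetical.

```python
def variance_components(scores):
    """Estimate variance components for a fully crossed
    patients x raters design with one score per cell.

    scores[p][r] is the score given to patient p by rater r.
    Returns (var_patients, var_raters, var_residual).
    """
    n_p, n_r = len(scores), len(scores[0])
    grand = sum(sum(row) for row in scores) / (n_p * n_r)
    patient_means = [sum(row) / n_r for row in scores]
    rater_means = [sum(row[r] for row in scores) / n_p for r in range(n_r)]

    # Sums of squares from a two-way ANOVA without replication.
    ss_p = n_r * sum((m - grand) ** 2 for m in patient_means)
    ss_r = n_p * sum((m - grand) ** 2 for m in rater_means)
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_res = ss_total - ss_p - ss_r

    ms_p = ss_p / (n_p - 1)
    ms_r = ss_r / (n_r - 1)
    ms_res = ss_res / ((n_p - 1) * (n_r - 1))

    # Expected mean squares solved for the components;
    # negative estimates are truncated at zero.
    var_res = ms_res
    var_p = max((ms_p - ms_res) / n_r, 0.0)
    var_r = max((ms_r - ms_res) / n_p, 0.0)
    return var_p, var_r, var_res


def g_relative(var_p, var_res, n_raters):
    """Generalizability coefficient for relative (rank-order) decisions."""
    return var_p / (var_p + var_res / n_raters)


def phi_absolute(var_p, var_r, var_res, n_raters):
    """Dependability coefficient for absolute decisions, e.g. cut-off scores."""
    return var_p / (var_p + (var_r + var_res) / n_raters)


# Invented data: three patients, two raters; rater 2 always scores 2 higher.
example_scores = [[10, 12], [14, 16], [18, 20]]
var_p, var_r, var_res = variance_components(example_scores)
print(var_p, var_r, var_res)
print(g_relative(var_p, var_res, 1))
print(phi_absolute(var_p, var_r, var_res, 1))
```

In this invented example the raters rank the patients identically, so the coefficient for relative decisions is 1, but one rater is consistently more severe, which lowers the coefficient for absolute decisions. Where equipment is allocated against fixed cut-off scores, the absolute coefficient would seem to be the more relevant of the two.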

CONCLUSION

If pressure-sore prevention resources are allocated on the basis of risk score, then the variability in scores demonstrated above clearly has financial implications for resource use as well as the assessment of the effectiveness of the interventions themselves.

Further investigation into the reliability of risk assessment tools is needed to identify if and under what circumstances they are effective, and what training in the use of the tool is needed to improve and maintain reliability so that the tools are of real practical use. Some of these investigative studies should explore the use of generalizability theory techniques.

REFERENCES

Bridel J 1993 Assessing the risk of pressure sores. Nursing Standard 7 (25): 32-35

Clark M 1993 Understanding pressure sores; an awareness of research methodology. Wound Management 4 (2): 41-45

Cook MJ 1994 The process of change: the implementation of an 'adapted' Waterlow pressure damage prevention policy, throughout the 'Elderly Services Directorate' of South Tyneside Health Care Trust. Unpublished dissertation University of Northumbria at Newcastle

Cook M 1996 Estimating the inter-rater reliability of the adapted Waterlow Pressure Sore Risk Assessment Scale. Unpublished MSc thesis. University of Northumbria at Newcastle

Edwards M 1994 The rationale for the use of risk calculators in pressure sore prevention and the evidence of the reliability and validity of published scales. Journal of Advanced Nursing 20:288-296

Gibbon B 1995 Validity and reliability of assessment tools. Nurse Researcher 2 (4): 48-55

Goodwin LD, Prescott PA 1981 Issues and approaches to estimating inter-rater reliability in nursing research. Research in Nursing and Health 4:323-337

Norton D, McLaren R, Exton-Smith A 1962 An investigation of geriatric nursing problems in hospital. National Corporation for the Care of Old People, London

Polit DF, Hungler BP 1991 Nursing research: principles and methods, 4th edn. Lippincott, Philadelphia

Smith LN, Booth N et al. 1995 A critique of 'at risk' pressure sore assessment tools. Journal of Clinical Nursing 4: 153-159

Waterlow J 1985 A risk assessment card. Nursing Times 81 (48): 49-55
