
Evaluating Correlation and Interrater Reliability for Four Performance Scales in the Palliative Care Setting


Original Article

Evaluating Correlation and Interrater Reliability for Four Performance Scales in the Palliative Care Setting

Jeff Myers, MD, CCFP, MSEd, Kate Gardiner, BSc (C), Kristin Harris, BSc (C), Tammy Lilien, BA, Margaret Bennett, MD, CCFP, Edward Chow, MBBS, PhD, FRCPC, Debbie Selby, MD, FRCPC, and Liying Zhang, PhD

Palliative Care Consult Team (K.G., T.L., D.S., L.Z., J.M.), Rapid Response Radiotherapy (K.H., E.C.), and Aging and Veteran's Program (M.B.), Sunnybrook Health Sciences Centre, Toronto, Ontario, Canada

Abstract

Performance scales are used by clinicians to objectively represent a patient's level of function and have been shown to be important predictors of response to therapy and survival. Four different scales are commonly used in the palliative care setting, two of which were specifically developed to more accurately represent this population. It remains unclear which scale is best suited for this setting. The objectives of this study were to determine the correlations among the four scales and concurrently compare interrater reliability for each. Patients were each assessed at the same point in time by three different health care professionals, and all four scales were used to rate each patient. Spearman correlation coefficient values and both weighted and unweighted kappa values were calculated to determine correlation and interrater reliability. The results confirmed highly significant linear correlation among and between all four scales. Whether using a reliability measure that incorporates the concept of "partial credit" for "near misses" or a measure reflecting exact rater agreement, no one scale emerged as having a significantly higher likelihood of agreement among raters. We propose that what may be more important than clinical experience or rater profession is the level of training an individual health care professional rater receives on the administration of any particular performance scale. In addition, given that low levels of exact rater agreement could have substantial clinical implications for patients, we suggest that this parameter be considered in the design of future comparative studies. J Pain Symptom Manage 2010;39:250–258. © 2010 U.S. Cancer Pain Relief Committee. Published by Elsevier Inc. All rights reserved.

Key Words

Performance scales, interrater reliability, correlation, palliative care

Address correspondence to: Jeff Myers, MD, CCFP, MSEd, Palliative Care Consult Team, Sunnybrook Health Sciences Centre, 2075 Bayview Avenue, Toronto, Ontario M4N 3M5, Canada. E-mail: [email protected]

Accepted for publication: July 13, 2009.

© 2010 U.S. Cancer Pain Relief Committee. Published by Elsevier Inc. All rights reserved.

Introduction

Widely used in palliative care settings, performance scales (PS) are key to the development of individualized treatment plans for patients/families, efficacy assessment of such plans, and the tracking of disease progression.

0885-3924/10/$-see front matter. doi:10.1016/j.jpainsymman.2009.06.013


In addition, it has been consistently demonstrated that, with accurate ratings, PS can strongly inform the prediction of patient survival.1–7

Two widely used reliable and valid PS are the Karnofsky Performance Status (KPS) and the Eastern Cooperative Oncology Group Performance Status (ECOG). KPS was originally developed to facilitate the objective assessment of performance in patients with cancer.8 Consisting of fewer categories, ECOG is an abridged version of KPS, developed to simplify performance assessment.9 The specific parameters determining an overall rating for both scales include the patient's level of ambulation, level of activity, and ability to perform self-care. Previously identified limitations of both KPS and ECOG are of particular relevance for patients in the palliative care setting. Both scales have been criticized for poor sensitivity at lower ends, suggesting possible inaccuracies for patients with advanced disease.10–12

The Palliative Performance Scale (PPS) and the Australia-modified KPS (AKPS) are both KPS adaptations and were each developed with the intent to improve the evaluation of patients with lower performance status.10,13

PPS ratings are assigned through an evaluation of five objective parameters: ambulation, evidence of disease, self-care, oral intake, and level of consciousness.13 AKPS differs from the original KPS for the ratings of 40 and below.10 The corresponding descriptions for these ratings on AKPS are intended to clarify for raters a level of function that can be observed and quantified, that is, the AKPS rating of 40 is used to describe a patient who "is in bed more than 50% of the time" vs. the KPS rating of 40 used to describe a patient who is "disabled and requiring special care and assistance."10

Given the key role PS should play in overall patient care, many centers have mandated PS integration into their clinical assessment and documentation processes. Successful clinical practice or institutional integration of any PS is highly dependent on strong evidence of clinical utility, accuracy, ease of use, and interrater reliability.5 As previously mentioned, both KPS and ECOG have been found to be valid and reliable tools.4,14 Several studies have addressed validation of PPS,15–17 but reliability has only recently been investigated.18,19 Although more formal validation and evaluation of reliability is required, face validity has been established for AKPS.10

Despite their widespread use in the palliative care setting, little direct evidence exists to guide clinicians as to which PS is best suited for this population. To our knowledge, no previous work has addressed the linear correlations, both across the four PS themselves and between different raters for each scale. In addition, interrater reliability has not been concurrently evaluated. Designs of previous reliability studies for PS have incorporated various statistical measures to determine the likelihood different raters will agree in the same general direction; however, exact agreement is rarely addressed. Given both the interprofessional nature of the field of palliative care and the fact that different centers will assign the task of PS assessment to different health care professionals (HCPs), concurrent evaluation of correlation and interrater reliability of all four scales would provide a foundation of necessary evidence for clinicians. The two main objectives and corresponding hypotheses for this study were:

1. To determine the correlations across all four PS and between different HCP raters within each individual PS. Given that three of the scales are modifications of the KPS, it is expected very high levels of correlation will be identified both within and between each of the scales.

2. To compare interrater reliability between three different HCP raters for each of the four scales for the same study population. Given its substantially fewer categories, it is hypothesized that ECOG will have the highest level of agreement.

Sample and Methods

Between March 2007 and May 2007, all patients referred for palliative care assessment at Sunnybrook Health Sciences Centre and the Odette Cancer Centre in Toronto, Ontario, were eligible for inclusion in this prospective quantitative study. Patients were accrued from three different palliative care delivery sites: an outpatient palliative care clinic within an ambulatory regional cancer center, an inpatient oncology unit within a tertiary academic acute care institution (patients referred for palliative care consultation), and an inpatient palliative care unit (patients admitted primarily for end-of-life care).

Ethics approval to conduct the research was obtained from the Sunnybrook Health Sciences Centre Research Ethics Board. For this study, only the rater assessment was captured, requiring neither additional patient contact nor questioning beyond standard care. Consent was, therefore, not required, and there were no exclusion criteria. Patient demographic and disease information was collected from patient charts and information gathered during consultation.

All patients were individually rated using all four scales by a palliative care research assistant (RA), a primary oncology/palliative care nurse (RN), and a specialist palliative care physician (MD). For each patient, assessments from all three raters were made within 24–48 hours of each other. A single individual served as the RA for all three sites. Both the MDs and RNs varied between and within sites. Site of palliative care delivery was not accounted for in the analysis, as relative homogeneity was assumed among all three sites.

Power Analysis

Sample size calculation was conducted using Power Analysis and Sample Size, version 2005 for Windows (NCSS, Kaysville, UT). The unweighted kappa was assumed to be increased by 40% from the kappa of 0.30 for KPS among all raters. Using a two-sided binomial hypothesis test with a target significance level of 0.05 (Type I error), 127 patients were required to detect a difference between the null hypothesis that the proportion was 0.30 and the alternative hypothesis that the proportion was 0.42 (0.30 × 140%). The actual power was 81%, and the actual significance was 0.0416. Patients could be withdrawn for any reason; therefore, a dropout rate of 5% was considered. A total of 134 (127/0.95) patients were required for the study.
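The published calculation was performed in PASS; as an illustrative cross-check only, the sketch below (Python with SciPy assumed, not part of the original study) approximates the same exact binomial power calculation using an alpha/2-per-tail rejection region. Under these assumptions it arrives at roughly the reported figures (n near 127, power near 81%, attained significance near 0.04), although the exact convention PASS applies may differ slightly.

```python
from math import ceil
from scipy.stats import binom

def exact_binomial_power(n, p0=0.30, p1=0.42, alpha=0.05):
    """Two-sided exact binomial test of H0: p = p0.
    Returns (power under p1, attained significance level)."""
    # Smallest k whose upper-tail probability under p0 is <= alpha/2
    k_hi = next(k for k in range(n + 1) if binom.sf(k - 1, n, p0) <= alpha / 2)
    # Largest k whose lower-tail probability under p0 is <= alpha/2 (if any)
    lows = [k for k in range(n + 1) if binom.cdf(k, n, p0) <= alpha / 2]
    k_lo = lows[-1] if lows else None

    def tail_prob(p):
        prob = binom.sf(k_hi - 1, n, p)
        if k_lo is not None:
            prob += binom.cdf(k_lo, n, p)
        return prob

    return tail_prob(p1), tail_prob(p0)

# Smallest n giving at least 80% power to detect a kappa of 0.42 vs. 0.30
n = 50
while exact_binomial_power(n)[0] < 0.80:
    n += 1
power, attained_alpha = exact_binomial_power(n)
print(n, round(power, 3), round(attained_alpha, 4))

# Inflate for the assumed 5% dropout rate
print(ceil(n / 0.95))
```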

Statistics

Inferential and descriptive statistics were calculated using Statistical Analysis Software (version 9.1; SAS Institute, Cary, NC) for Windows. Results were expressed as the median (range) for quantitative variables and as proportions for categorical findings.

Correlation. To establish correlation among different raters for each individual scale, Spearman correlation coefficients were calculated using an alpha of 0.05 for all rater pairings (MD/RA, RN/RA, and RN/MD). To establish the correlations between the scales themselves, Spearman correlation coefficients were calculated among each uniprofessional group of raters (MDs, RNs, and RAs) for all possible scale pairings. The range in values is from −1.0 (indicating perfect inverse correlation) to 1.0 (indicating perfect correlation).
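For readers wishing to reproduce this step, a minimal sketch of the Spearman calculation is shown below (hypothetical paired ratings, SciPy assumed; these values are illustrative and are not study data).

```python
from scipy.stats import spearmanr

# Hypothetical KPS and ECOG ratings assigned by one rater to eight patients.
# A negative coefficient is expected because ECOG is inversely scaled.
kps_md  = [90, 80, 70, 60, 50, 40, 30, 20]
ecog_md = [ 0,  1,  1,  2,  3,  3,  4,  4]

rho, p_value = spearmanr(kps_md, ecog_md)
print(round(rho, 3), p_value)
```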

Interrater Reliability. Given that individual patients were assessed by multiple raters, kappa analyses were chosen over intraclass correlation coefficients to examine reliability.20 Both weighted and unweighted values were calculated. Unweighted or simple kappa values represent the measure of agreement beyond chance.21 When calculating unweighted kappa values, zero weight is given to all disagreement, regardless of discrepancy size, thus focusing on agreement that is exact. Weighted kappa calculations in addition assign "partial" credit for agreement that is "near."22 In theory, one would expect weighted kappa values to be higher than unweighted values when assessing agreement between raters using a tool with multiple rating categories or levels. For both weighted and unweighted kappa statistics, a value of 0.2 or less represents poor agreement, 0.21–0.4 indicates fair agreement, 0.41–0.6 indicates moderate agreement, 0.61–0.8 indicates good agreement, and 0.81–1.0 signifies very good agreement.23 If agreement was found to be poor (kappa value: 0.2 or less), subsequent analysis using the Mann-Whitney U nonparametric test identified group tendencies to rate more positively or negatively. The percentage of exact agreement was also documented for all rater pairings. Comparative results were considered significant at the 5% critical level (two-sided P < 0.05).
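A minimal sketch of how these three quantities can be computed for one rater pairing follows (hypothetical ratings, scikit-learn assumed; linear weights are used as one common implementation of the Cicchetti-Allison style of partial credit, though the exact weighting scheme used in the study is not specified beyond the cited reference).

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

# Hypothetical PPS ratings (10-point categories) for 10 patients from two raters
rater_md = np.array([70, 60, 50, 40, 40, 30, 30, 20, 60, 50])
rater_rn = np.array([70, 70, 50, 50, 40, 30, 20, 20, 60, 60])

# Unweighted kappa: agreement beyond chance, counting exact matches only
unweighted = cohen_kappa_score(rater_md, rater_rn)
# Weighted kappa: partial credit for near misses (linear weights)
weighted = cohen_kappa_score(rater_md, rater_rn, weights="linear")
# Percentage of exact agreement between the two raters
exact_pct = np.mean(rater_md == rater_rn) * 100

print(round(unweighted, 2), round(weighted, 2), round(exact_pct, 1))
```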

Results

Descriptive information for the 134 patients is summarized in Table 1.


Table 1. Descriptive Information of Sample (n = 134)

Gender
  Female: 70 (52%)
  Male: 64 (48%)

Age
  Median: 67
  Range: 21–102

Palliative care delivery site
  Outpatient (palliative care clinic): 59 (44%)
  Inpatient (acute care): 40 (30%)
  Palliative care unit: 35 (26%)

Diagnosis, malignant (most common shown); total = 125
  Lung: 20
  Breast: 15
  Colorectal: 7
  Lymphoma: 8
  Pancreas: 7

Diagnosis, nonmalignant; total = 9
  Chronic heart failure: 3
  Chronic renal failure: 2
  Other nonmalignant (a): 4

(a) Diagnoses include pneumonia, CVA, thrombocytopenia, and respiratory failure (one patient each). CVA = cerebral vascular accident.


Gender was evenly distributed (52% female), and the overall average age was 67 years (range 21–102 years). Of those with a cancer diagnosis (93% of the total), 78% had either metastatic or locally advanced disease. The most common malignancies included those of the lung (15%), breast (10.5%), and colon (7%). Median ratings of performance status differed substantially based on site (Table 2).

For each rater group, all PPS, KPS, and AKPS pairings were found to be highly significantly correlated (MD ratings in Table 3). Given its inverse configuration, appropriately negative but equally highly significant correlations were found for ECOG and each of the other three scales. For each individual PS, ratings for all rater pairings were found to be highly significantly correlated (Table 4).

Using weighted kappa values, interrater reliability for each of the PS, namely, KPS, PPS, AKPS, and ECOG, was almost universally found to be "good" (Tables 5 and 6). Using unweighted kappa values, interrater reliability was found to be "poor" between the RN/MD pairings for both KPS and AKPS. Using nonparametric testing, further analysis of these data indicated the tendency for RN ratings to be significantly higher than those of MDs for both KPS and AKPS. The range of exact agreement between raters was found to be from 38% to 61% (Tables 5 and 6) but less than 40% for the RN/RA and RN/MD pairings for the KPS alone.

Discussion

Accurately rating a patient's performance level is of critical importance, as in many health care settings, more and more clinical decisions are determined solely by a specific PS rating. Clinician investigators are often required to define a specific level of performance that would determine a patient's eligibility for participation in a clinical trial. In addition, a specific PS rating is often included in the criteria for participation in research studies in general. In many settings, health care resources are allocated based solely on a specific PS rating. An example of this is found within the algorithm used by many home care agencies to determine the amount of nursing and personal support hours for which a patient may qualify. Factoring in the prognostic utility of PS, strong support exists for the integration of PS use into the routine care of patients.

Given the possible clinical implications of an inaccurate PS assessment, this study was designed to evaluate interrater reliability and confirm correlation both within and between each of the four commonly used PS in the palliative care setting. These two psychometric properties contribute substantially to the statistical and clinical evidence HCPs appropriately require to support and guide individual, group practice, and/or institutional PS integration.

As originally hypothesized, our study confirms that, among three different rater groups, all four PS are highly significantly correlated. This information is meaningful, as it confirms for the PS a general level of association, that is, raters used each scale according to its original intended use. Clarifying further, our findings confirm that when a rater assigns a particular rating for a patient using one of the four PS, it is likely that, subsequently using one of the other three PS, the rater would assign for the same patient a rating in the same general direction. In addition, our study confirms that when using any of the four PS, two different raters will assign for the same patient a rating in the same general direction.


Table 2. Median PS Rating (and Range) by Site

                        ECOG     KPS          AKPS         PPS
Outpatient              1 (0–4)  70 (10–100)  70 (20–100)  80 (20–100)
Inpatient               3 (1–4)  50 (10–90)   40 (10–80)   50 (10–80)
Palliative care unit    4 (1–4)  30 (10–60)   30 (10–60)   30 (10–60)


A key point to emphasize is that, as the earlier description suggests, high correlation levels do not automatically confirm specific agreement between raters.

Interrater variability has been proposed to increase with a higher number of rater options (i.e., 10 categorical options for KPS, AKPS, and PPS, and four for ECOG).12 For our study, then, it was hypothesized ECOG would have the highest level of agreement among raters. Consistent with several other comparative reports involving PS,1,24–27 our results did not fully confirm this hypothesis. However, it is noteworthy that, of the four, ECOG was the only PS with greater than 50% exact agreement for all three rater pairings.

Reviewing the PS literature, a wide range of terminology and statistical measures have been used to represent the likelihood of agreement between raters. An example is found in the original KPS reliability studies. Using Pearson correlation coefficient values, Yates et al. first reported "moderately high" levels of KPS interrater reliability for patients with advanced cancer.27 Also choosing Pearson correlation coefficient values to evaluate agreement (reported as "very high"), Schag et al. included kappa statistics in their study design, the resulting values indicating "fair to moderate" agreement.28 The authors concluded that KPS had overall "very good" interrater reliability. Choosing intraclass correlation coefficients and subsequent Cronbach's coefficient alpha values, Mor et al. also reported "very high" KPS interrater reliability.2

To examine the reliability of ECOG, Sorensen et al. chose weighted kappa statistics and demonstrated a "moderate" level of agreement between raters.14 Interrater reliability of PPS has only recently been evaluated. Using a web-based case scenario design, Ho et al. examined agreement among administrators and senior clinicians of palliative care institutions.18 Intraclass correlation coefficients and weighted kappa values indicated a "good" level of agreement, which led the authors to confirm the reliability of PPS. The only study examining agreement for PPS in the clinical setting was recently reported by Campos et al.19 Spearman rank correlation coefficients and Cronbach coefficient alpha values were calculated, leading the authors to conclude "excellent" correlation and agreement in ratings between an RA, a radiation therapist, and a radiation oncologist.19

Table 3. Correlations (r) Among the Four PS (a) for MD Raters

        KPS       ECOG      PPS       AKPS
KPS     1         −0.9139   0.9628    0.9836
ECOG    −0.9139   1         −0.9097   −0.9262
PPS     0.9628    −0.9097   1         0.9662
AKPS    0.9836    −0.9262   0.9662    1

(a) Spearman correlation coefficients, P < 0.0001 in all cases, df = 132.

Few studies have concurrently compared interrater reliability of different PS using the same study population. With two clinical oncologists as raters, Roila et al. used weighted kappa statistics and reported interrater reliability of both KPS and ECOG to be "very high."26 Taylor et al. examined interrater reliability of KPS and ECOG using raters from different professions (clinical oncologist, medicine resident, and nurse), and Spearman rank correlation coefficients were chosen to evaluate agreement.4 The authors found each PS to have "good" interrater reliability despite slightly higher values for ECOG. As mentioned previously, this led to the suggestion that "either scale could be used with good interrater reliability but the simpler format of the ECOG would minimize potential differences in ratings."6

Table 4. Correlations (r) Between Different Raters for Each of the Four PS (a)

         KPS      ECOG     PPS       AKPS
RA/MD    0.8702   0.8251   0.8895    0.8932
RA/RN    0.8656   0.8344   0.8704    0.8273
RN/MD    0.8490   0.7921   0.89616   0.8582

(a) Spearman correlation coefficients, P < 0.0001 in all cases.


Table 5. Interrater Reliability (a), Strength of Agreement, and Percent Exact Agreement for KPS and ECOG

KPS
HCP Dyad   Weighted Kappa   Strength of Agreement   Unweighted Kappa   Strength of Agreement   % Exact Agreement
RA/MD      0.71             Good                    0.38               Fair                    50
RA/RN      0.62             Good                    0.27               Fair                    38
RN/MD      0.58             Moderate                0.19               Poor                    38

ECOG
HCP Dyad   Weighted Kappa   Strength of Agreement   Unweighted Kappa   Strength of Agreement   % Exact Agreement
RA/MD      0.65             Good                    0.46               Moderate                59
RA/RN      0.68             Good                    0.48               Moderate                61
RN/MD      0.61             Good                    0.38               Fair                    53

(a) Kappa: weighted and unweighted.


Examining general principles of kappa statistics, Landis and Koch suggested weighted kappa values of 0.61 or greater are considered to represent "substantial" agreement; however, a minimum acceptable level has not been universally agreed upon.29 Applying Landis and Koch's assertion to the current study, interrater reliability would be considered substantial for all four PS evaluated. It has been further suggested that incorporating unweighted kappa statistics in a study's design would be "inappropriate," as size of rating discrepancy is considered a key factor when evaluating agreement.26 This suggestion was made by authors evaluating KPS interrater reliability, who went on to report that exact agreement between ratings of different physicians occurred 61.7% of the time, a value not too dissimilar from that reported in our study (Tables 5 and 6). Attempting to clarify increasingly difficult-to-interpret statistical measures, epidemiologists have demonstrated that weighted kappa values can be very high despite a very low percentage of exact agreement.30 This has led many to advocate that weighted kappa statistics be regarded primarily as a measure of association rather than agreement.30,31 Similar principles led to the same suggestion when interpreting Pearson correlation coefficient values.32 Given the clinical implications of an inaccurately assessed performance level and the lack of universal agreement on both ideal statistical measures and the strength of agreement considered adequate, we would propose that both unweighted kappa values and exact agreement be included as parameters on which the reliability of a PS could be judged. With that in mind for the present study, interrater reliability of KPS and its use in the palliative care setting present a concern, as a review of our data demonstrates that this PS had the only two rater pairings with less than 40% exact agreement and, using unweighted kappa values, had one of only two rater pairings demonstrating "poor" agreement.
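The divergence described above is easy to demonstrate with synthetic data. The sketch below (hypothetical ratings, scikit-learn assumed, not study data) constructs two raters who never agree exactly but always differ by a single 10-point category; weighted kappa still falls in the "good" band while unweighted kappa is near zero and exact agreement is 0%.

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

# Rater B is always exactly one 10-point category higher than rater A
rater_a = np.repeat(np.arange(10, 100, 10), 5)   # 10, 20, ..., 90 (five patients each)
rater_b = rater_a + 10                           # 20, 30, ..., 100

weighted = cohen_kappa_score(rater_a, rater_b, weights="linear")   # about 0.67 ("good")
unweighted = cohen_kappa_score(rater_a, rater_b)                   # slightly negative ("poor")
exact_pct = np.mean(rater_a == rater_b) * 100                      # 0.0

print(round(weighted, 2), round(unweighted, 2), exact_pct)
```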

Historically, most studies evaluating interrater reliability have used a homogeneous or uniprofessional group of raters.


Table 6. Interrater Reliability (a), Strength of Agreement, and Percent Exact Agreement for PPS and AKPS

PPS
HCP Dyad   Weighted Kappa   Strength of Agreement   Unweighted Kappa   Strength of Agreement   % Exact Agreement
RA/MD      0.72             Good                    0.39               Fair                    50
RA/RN      0.64             Good                    0.23               Fair                    47
RN/MD      0.63             Good                    0.23               Fair                    47

AKPS
HCP Dyad   Weighted Kappa   Strength of Agreement   Unweighted Kappa   Strength of Agreement   % Exact Agreement
RA/MD      0.71             Good                    0.37               Fair                    50
RA/RN      0.63             Good                    0.24               Fair                    43
RN/MD      0.61             Good                    0.18               Poor                    50

(a) Kappa: weighted and unweighted.


More recently, study designs have been multiprofessional in nature, using combinations of physicians, nurses, social workers, radiation therapists, and RAs as the raters for the studied population.1,10,24–27,33 In our study, one consistent RA was used, contrasting with the groups of RNs and MDs who provided ratings. Based on previous work involving KPS, ECOG, and AKPS, we assumed adequate agreement within each HCP group of raters.1,10,24,33 One limitation of our study is that this assumption was not tested through the assessment of intraprofessional agreement within the MD and RN groups. In addition, regardless of clinical role, a lack of rater knowledge on the use of the PS could greatly influence the accuracy of a rating. One previous PS agreement study used non-HCP raters who received a two-hour training session on the use of the KPS. The strength of agreement demonstrated in this study was found to be "excellent" (weighted kappa = 0.97).14 For our study, the RA received informal training on the use of PS, but it was assumed that this was not necessary for the RN and MD groups. To minimize bias, future studies would ideally investigate interrater reliability by including only one HCP from each profession and ensuring adequate training for each rater on the use of PS.

Several other limitations exist with this study. Despite the intended setting of use for PPS and AKPS (i.e., the palliative care setting regardless of diagnosis), only 7% of our study population had a noncancer diagnosis. Future studies should be designed to ensure adequate representation of patients with greater diversity in diagnosis. One final limitation is that sample size prevented separate analysis of outpatients and acute care and palliative care unit inpatients.

Conclusions

Because of wide variation in the terminology and statistical measures used to represent the likelihood of PS rater agreement, cumulative evidence from previous work examining PS interrater reliability is difficult to summarize and to draw conclusions from. Paired with a lack of consensus on the strength of agreement considered adequate, limited evidence exists to guide clinicians regarding the most appropriate PS for the palliative care setting. In our study, linear correlation between and among raters for all four PS was confirmed. Interrater reliability was evaluated, and no one PS emerged as having a higher likelihood of agreement among raters from different professions.


In the current clinical climate, the use of a PS found to have low levels of exact rater agreement could have substantial clinical implications for patients. We propose that, in addition to correlation coefficients, both weighted and unweighted kappa values as well as exact agreement be included as parameters on which the reliability of a PS could be judged. In addition, what may be more important than clinical experience or rater profession is the level of training an individual HCP rater receives on the administration of any particular PS. To maximize clinical efficiencies and ensure accuracy of assessments, it is likely to be of great benefit to have multiple HCPs knowledgeable in the use of PS. Adequate training should be included in the design of future comparative studies. Although all patients in this study were receiving palliative care services, to further clarify which PS is best suited for patients in the palliative care setting, follow-up studies should include both the stratification of data into groups representing high, middle, and low patient performance, and rater perception on ease of PS use.

References

1. Buccheri G, Ferrigno D, Tamburini M. Karnofsky and ECOG performance status scoring in lung cancer: a prospective, longitudinal study of 536 patients from a single institution. Eur J Cancer 2002;32A:1135–1148.

2. Mor V, Laliberte L, Morris JN, Wiemann M. The Karnofsky Performance Status Scale. An examination of its reliability and validity in a research setting. Cancer 1984;53:2002–2007.

3. Morita T, Tsunoda J, Inoue S, Chihara S. Improved accuracy of physicians' survival prediction for terminally ill cancer patients using the Palliative Prognostic Index. Palliat Med 2001;15:419–424.

4. Taylor AE, Olver IN, Sivanthan T, Chi M, Purnell C. Observer error in grading performance status in cancer patients. Support Care Cancer 1999;7:332–335.

5. Orr ST, Aisner J. Performance status assessment among oncology patients: a review. Cancer Treat Res 1986;70:1423–1429.

6. Harrold J, Rickerson E, Carroll J, et al. Is the Palliative Performance Scale a useful predictor of mortality in a heterogeneous hospice population? J Palliat Med 2005;8:503–509.

7. Head B, Ritchie C, Smoot T. Prognostication in hospice care: can the Palliative Performance Scale help? J Palliat Med 2005;8:492–501.

8. Karnofsky DA, Abelmann WH, Craver LF, Burchenal JH. The use of nitrogen mustards in the palliative treatment of cancer. Cancer 1948;1:634–656.

9. Zubrod CG, Schneiderman M, Frei E. Appraisal of methods for the study of chemotherapy of cancer in man: comparative therapeutic trial of nitrogen mustard and triethylene thiophosphoramide. J Chron Dis 1960;11:7–33.

10. Abernethy AP, Shelby-James T, Fazekas BS, Woods D, Currow DC. The Australia-modified Karnofsky Performance Status (AKPS) scale: a revised scale for contemporary palliative care clinical practice. BMC Palliat Care 2005;4:7.

11. Kassa T, Loomis J, Gillis K, Bruera E, Hanson J. The Edmonton Functional Assessment Tool: preliminary development and evaluation for use in palliative care. J Pain Symptom Manage 1997;13:10–19.

12. Verger E, Salamero V, Conill C. Can Karnofsky Performance Status be transformed to the Eastern Cooperative Oncology Group scoring scale and vice versa? Eur J Cancer 1992;8:1328–1330.

13. Anderson F, Downing GM, Hill J, Casorso L, Lerch N. Palliative Performance Scale (PPS): a new tool. J Palliat Care 1996;12:5–11.

14. Sorensen JB, Klee M, Palshof T, Hansen HH. Performance status assessment in cancer patients. An inter-observer variability study. Br J Cancer 1993;67:773–775.

15. Virik K, Glare P. Validation of the Palliative Performance Scale for inpatients admitted to a palliative care unit in Sydney, Australia. J Pain Symptom Manage 2002;23:455–457.

16. Morita T, Tsunoda J, Inoue S, Chihara S. Survival prediction of terminally ill cancer patients by clinical symptoms: development of a simple indicator. Jpn J Clin Oncol 1998;29:156–159.

17. Morita T, Tsunoda J, Inoue S, Chihara S. Validation of the Palliative Performance Scale from a survival perspective. J Pain Symptom Manage 1999;18:2–3.

18. Ho F, Lau F, Downing MG, Lesperance M. A reliability and validity study of the Palliative Performance Scale. BMC Palliat Care 2008;7:10.

19. Campos S, Zhang L, Sinclair E, et al. The Palliative Performance Scale: examining its inter-rater reliability in an outpatient palliative radiation oncology clinic. Support Care Cancer 2009;17(6):685–690.

20. Fleiss JL. Measuring nominal scale agreement among many raters. Psychol Bull 1971;76:378–382.

21. Cohen J. A coefficient of agreement for nominal scales. Educ Psychol Meas 1960;20:37–46.


22. Cicchetti DV, Allison T. A new procedure for assessing reliability of scoring EEG sleep recordings. Am J EEG Tech 1971;11:101–109.

23. Brennan P, Silman A. Statistical methods for assessing observer variability in clinical measures. BMJ 1992;304:1491–1494.

24. Conill C, Verger E, Salamero M. Performance status assessment in cancer patients. Cancer 1990;65:1864–1866.

25. de Borja M, Chow E, Bovett G, Davis L, Gilles C. The correlation among patients and health care professionals in assessing functional status using the Karnofsky and Eastern Cooperative Oncology Group Performance Status scales. Support Cancer Ther 2004;2:1–5.

26. Roila F, Lupatelli M, Sassi M, et al. Intra- and interobserver variability in cancer patients' performance status assessed according to Karnofsky and ECOG scales. Ann Oncol 1991;2:437–439.

27. Yates JW, Chalmer B, McKegney FP. Evaluation of patients with advanced cancer using the Karnofsky Performance Status. Cancer 1980;45:2220–2224.

28. Schag C, Heinrich R, Ganz P. Karnofsky Performance Status revisited: reliability, validity and guidelines. J Clin Oncol 1984;2:187–193.

29. Landis JR, Koch GG. The measurement of observer agreement for categorical data. Biometrics 1977;33:159–174.

30. Graham P, Jackson R. The analysis of ordinal agreement data: beyond weighted kappa. J Clin Epidemiol 1993;46:1055–1062.

31. Bloch DA, Kraemer HC. 2 × 2 kappa coefficients: measures of agreement or association. Biometrics 1989;45:269–287.

32. Altman DG, Bland JM. Measurement in medicine: the analysis of method comparison studies. Statistician 1983;32:307–317.

33. Nikoletti S, Porock D, Kristjanson LJ, et al. Performance status assessment in home hospice patients using a modified form of the Karnofsky Performance Status scale. J Palliat Med 2000;3:301–311.