
REVIEW

Reporting on reliability and validity of outcome measures in medical rehabilitation research

M. P. J. M. DIJKERS*, G. C. KROPP, R. M. ESPER, G. YAVUZER, N. CULLEN and Y. BAKDALIEH

Department of Rehabilitation Medicine, Mount Sinai School of Medicine, New York, NY, USA
Brookfield Clinics, Garden City, MI, USA
Centre for Molecular Medicine and Genetics, Wayne State University, Detroit, MI, USA
Department of Physical Medicine and Rehabilitation, Ankara University, Ankara, Turkey
Toronto Rehabilitation Institute, Toronto, Canada
Department of Physical Medicine and Rehabilitation, University of Kansas Medical Centre, Kansas City, KS, USA

* Author for correspondence; e-mail: [email protected]

Accepted for publication: March 2002

Abstract

Purpose: To evaluate the degree to which published medical rehabilitation research offers evidence of reliability, validity and other clinimetric qualities of the data reported.
Method: Descriptive study of intervention research papers published in six US medical rehabilitation journals in 1997 and 1998. Selected characteristics of the papers and the outcome measures used were abstracted by one or two raters.
Results: The 171 papers identified included 651 outcome measures. Some type of data reliability information was provided for 20.1% of these measures; for validity, this was 6.9%. However, this information was based on data collected for the sample studied for only 7.7% (reliability) and 0.6% (validity).
Conclusions: Most rehabilitation research falls short of standards, including the Standards promulgated by an American Congress of Rehabilitation Medicine Advisory Group. Authors, peer reviewers and editors need to change their practices to improve this situation.

Introduction

Medical rehabilitation aspires to scientific maturity, and training in research methodology and statistics is becoming more prominent in training programmes for clinicians. The textbooks published for use in their research methods and statistics courses devote one or more chapters to measurement,1, 2 and to the understanding of psychometric (clinimetric,3 biometric) qualities of data produced by the various types of clinical and research instruments that are commonly used. Some of the major textbooks of medical rehabilitation now also include information on research methods generally, and measurement specifically.4 Professionals with doctoral-level clinical or research preparation, e.g. in psychology or speech/language pathology, typically obtain even more extensive training in the principles and methods of measurement.

The message of this instruction can be summarized as follows: 'All observations are subject to random and systematic error. It therefore behooves the researcher to utilize measures (instruments, scales, etc.) that produce data with minimal error, and to evaluate reliability and validity of the data produced by these measures, for each new population and/or each new type of use.' Methods of estimating and interpreting various aspects of validity and reliability are at the core of training in measurement, because measurement is at the core of the scientific method:

Measurement of variables of interest is linked to all steps of the scientific process: conceptualization of the study, analysis of the data, and interpretation of the results. One's measures are only as good as one's conceptualization of the concept, one's data are only as good as one's measures, and one's results are only as good as one's data.5 (p. 15)

The better textbooks and methods instructors emphasize that reliability and validity are not inherent in a measure, but in the data collected using a particular instrument.6

DISABILITY AND REHABILITATION, 2002; VOL. 24, NO. 16, 819-827


An instrument that produces data of high reliability and validity in one population may fail to do so when used with another group. Similarly, the particular application and other aspects of data collection may have a major impact on the quality of the data collected. An extreme position would hold that this implies that every time an instrument is used in new circumstances and/or for a new population, evidence supporting reliability and validity of the data needs to be collected de novo. A more lenient position would hold that the burden is on investigators to show that their use and sample are similar to those of the studies on which published reports of reliability and validity are based, sufficiently so that the assumption of similar data quality is justified. At a minimum, researchers should use only measures that produced data of known and acceptable reliability and validity in prior applications, and summarize clinimetric information from the literature in their reports. Extenuating circumstances for using 'unproven' measures may be offered in some situations, such as pilot studies or analyses of existing data sets.

However, training does not necessarily translate into application. The literature indicates that existing research in the human services fields is hampered by poor quality measures, or at least that investigators do not cite the available evidence on the reliability and validity of their measures, let alone produce their own evidence. Reviews of the literature in counselling and education have shown that researchers in these fields have not followed the advice of their textbooks, or of their measurement 'bible', the 'Standards for educational and psychological testing'.7 Meier and Davis compared papers published in the Journal of Counselling Psychology in 1967, 1977 and 1987.8 While there was some improvement over time, score reliability estimates from the literature were cited for only 40% of all previously published scales (instruments) used in 1987, and validity estimates were provided for only 5% of these scales. Reliability information based on data for the sample studied was provided for 23% of the measures, and validity information derived from the sample for 1%. Slightly better results were reported for the papers in the 1996 volume of the Journal of Counselling and Development.9 On a per-article basis, 33% reported reliability estimates for their own sample, and 88% estimates from the literature. When Whittington surveyed 22 'selective' academic education journals, he found that of 220 published papers, 54% cited reliability evidence, but only 36% of papers (25% of measures) reported reliability based on the study sample or a similar sample from another source.10 Validity reports on the same basis were provided by 18% of articles (14% of measures). Ellis et al.,11 reviewing the counselling and psychotherapy supervision literature, noted that for 80% of studies, 'unreliability of dependent or independent variable measures' was 'definitely a threat' to study validity. Results were similar for surveys of studies in nursing12, 13 and health education.14

Rehabilitation medicine has its own measurement 'bible'. In 1992, the American Congress of Rehabilitation Medicine published 'Measurement standards for interdisciplinary medical rehabilitation', written by Johnston et al.15 with input from the Advisory Group on Measurement Standards, an impressive assembly of rehabilitation clinicians, administrators, researchers and educators representing all medical rehabilitation disciplines. The first standard for reliability (Standard 2.1) reads as follows: 'reports of the use of measures in rehabilitation should be accompanied by numeric estimates of reliability, the population(s) used, and the method of determining reliability of scores'.15 The first standard for validity (1.1) states: 'a measure should have evidence of the appropriate type of validity in line with its intended uses'.15 While the latter standard does not require that reports on the uses of measures be accompanied by information relevant to validity, the parallel with the reliability standard suggests that this was the intention of the authors, as suggested by the comment provided: 'a measure should have evidence of content, criterion and/or construct validity, and all three should at least be addressed'.15

To date, no survey of the degree to which rehabilitation researchers adhere to these standards has been published. The goal of this study was to determine to what degree rehabilitation researchers, who are or can be expected to be familiar with the ACRM standards, use outcome measures with known clinimetric characteristics, and to what degree this differs between types of research and types of measures. The focus was on papers published in 1997-1998 in US journals because it was assumed that the majority of the publications in those years were the result of studies designed after the publication of the standards, and thus would have had the opportunity to benefit from their guidance.

Methods

All peer-reviewed papers published during 1997 and 1998 in six US medical rehabilitation journals were reviewed. The journals were Archives of Physical Medicine and Rehabilitation (APM&R); American Journal of Physical Medicine and Rehabilitation (AJPM&R); American Journal of Occupational Therapy (AJOT); Physical Therapy (PT); Rehabilitation Nursing (RN); and Rehabilitation Psychology (RP).


For APM&R, which publishes the largest number of papers by far, only articles from 1997 and the first six months of 1998 were included, so as to limit the workload. The data reported here were collected as part of a project that focused on the quality of published reports of rehabilitation interventional research, and papers describing quantitative interventional research focusing on patients or patient-substitutes ('normals') were selected for study. Excluded were discussions of professional issues, didactic papers, qualitative or quantitative literature reviews, case studies, qualitative research, non-interventional descriptive research (including measure-development studies), and interventional studies focused on (student) personnel or non-human subjects, such as cadavers, animals, wheelchairs and other inanimate objects. (See figure 1 for an overview of the percentages excluded for various reasons.) The studies included were either group designs (clinical trials and similar studies), single-subject designs, or correlational studies where the intent of the authors clearly was to show how an intervention (e.g. an inpatient rehabilitation stay) affected selected outcomes (e.g. functional status). Four major types of studies were found:

• Retrospective analyses of rehabilitation programs (either single specialty or integrated multiple specialty), relating outcomes to length of stay or other resource input measures.

• Prospective studies of patients, relating outcomes to a specific (experimental) treatment.

• Prospective studies of prevention of disability in non-disabled, at-risk persons, or of prevention of secondary conditions in persons with a disability.

• Prospective research investigating the effect of a non-curative intervention, for instance, determining the optimal time and frequency of static stretching to increase flexibility of the hamstring muscles.16

The papers selected were coded independently by two out of six raters (the six authors) with respect to a number of characteristics, based on an extensive syllabus providing definitions and examples. This syllabus had been developed iteratively in coding a small subset of the papers. (The subset papers were coded a second time after all other articles had been processed, by judges blind to the original codes.) If the two raters disagreed in their codes, they tried to resolve their difference through discussion; if this was not possible, a third person cast the deciding vote. Because all (rather than a sample only) of the ratings were made by two independent raters, who came to agreement in the case of discrepancy, and in order to keep down the costs of data entry and processing, no interrater reliability data were calculated for our ratings. Some items were added to the file at a later stage, and were coded by a single rater only (MD): content category of outcome measures, method of data acquisition, and source of clinimetric information (literature vs 'local' data). (A list of the papers included in the sample, and a copy of the syllabus used in coding, are available from the first author.)
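Although no interrater reliability statistics were computed for the ratings in this study, readers unfamiliar with what such 'local evidence' of coder agreement looks like may find a small illustration useful. The Python sketch below is not part of the original study; the rater codes and category labels are invented, and it simply shows how percent agreement and Cohen's kappa for two raters assigning categorical codes could be computed.

```python
# Illustrative only: quantifying interrater agreement between two coders of
# categorical paper characteristics.  The codes below are invented.
from collections import Counter

def percent_agreement(rater_a, rater_b):
    """Proportion of items on which the two raters assigned the same code."""
    assert len(rater_a) == len(rater_b)
    return sum(a == b for a, b in zip(rater_a, rater_b)) / len(rater_a)

def cohens_kappa(rater_a, rater_b):
    """Chance-corrected agreement: (p_o - p_e) / (1 - p_e)."""
    n = len(rater_a)
    p_o = percent_agreement(rater_a, rater_b)
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    categories = set(freq_a) | set(freq_b)
    # Agreement expected if both raters coded independently at their observed rates.
    p_e = sum((freq_a[c] / n) * (freq_b[c] / n) for c in categories)
    return (p_o - p_e) / (1 - p_e)

# Hypothetical study-design codes assigned by two raters to ten papers.
rater_1 = ["pre-post", "trial", "trial", "correlational", "pre-post",
           "historical", "trial", "pre-post", "trial", "correlational"]
rater_2 = ["pre-post", "trial", "pre-post", "correlational", "pre-post",
           "historical", "trial", "pre-post", "trial", "trial"]

print(f"Percent agreement: {percent_agreement(rater_1, rater_2):.2f}")
print(f"Cohen's kappa:     {cohens_kappa(rater_1, rater_2):.2f}")
```

Kappa corrects the observed agreement for the agreement expected by chance, which is why it is generally preferred over raw percent agreement when raters assign categorical codes.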

All information was entered on abstracting forms, from which it was entered into a computer file. Analysis was performed using SPSS software. No hypothesis testing was used in the analysis, because this is a descriptive study without a priori hypotheses. There is no claim that the papers selected are a random sample, either in space or time, of all medical rehabilitation papers ever published.

Results

A total of 1039 peer-reviewed papers were identified. Quantitative original research (646 papers) constituted 62.2% of the total (figure 1). Within the category of quantitative investigations, the 171 interventional studies involving patients constituted only a minority (26.5% of the 646). More common were purely descriptive studies (45.2%). Clinimetric research constituted 15.6%, and studies of staff, institutions, animals and equipment 13.3%; three-quarters of the latter involved personnel.

The nature of the studies selected for review is described in Table 1.

Figure 1  Types of articles published in six US medical rehabilitation journals, 1997-1998 volumes.


The majority were published in APM&R, in spite of the fact that only a year and a half were reviewed for this journal; both the large number of papers published in APM&R and the high percentage of these that are intervention studies explain this. The majority of the studies were not well controlled, using a correlational or pre-post design, or historical controls. Data collection was mostly prospective; for 6% of the papers, the method of data collection could not be ascertained. Most studies used only one type of subject: either rehabilitation in- or outpatients, or non-patients with or without a disability. The intervention studied was treatment for a disorder in 69% of the studies, and prevention of secondary conditions or disabilities in 10%. These two categories used inpatients or outpatients as subjects, and sometimes community-living persons with a disability. Studies focusing on primary prevention or on evaluating an intervention that was neither curative nor preventive typically used non-patients, although they might use (out)patients for the sake of convenience.

The typical paper used three outcome measures (median), with a range from 1 to 14. The 651 measures identified in these 171 papers are characterized in Table 1 in terms of the nature of the study they were used in. The ratio of outcome measures to papers is 3.8:1 overall; a significantly different ratio in any category of papers indicates that those papers use more or fewer outcome measures than average. APM&R papers tended to have the most outcome measures, and RN papers the fewest. Studies using historical controls, which often were retrospective studies, tended to use few outcomes. Investigations using prospective data collection tended to use more outcomes than the other types. The total number of estimated words per paper ranged from 1700 to 17,080, with a median of 6564. The mean was 6900, with a standard deviation of 2400.

The outcome measures description on average made up 6% of the paper (median), with a range from less than 1% to 35%. The number of words dedicated to the description of the outcome measure(s) ranged from 10 to 1980, with a median of 367. This estimate includes only the lines describing what the measures or instruments were, how the data were collected, and what evidence there was for their reliability, validity and other clinimetric properties, and excludes the description of the actual (substantive) study findings with respect to the outcome measures. The mean was 468, with a standard deviation of 390. Not surprisingly, the number of outcome measures used tended to increase with paper length and (to a lesser degree) with the length of the section describing the outcomes.

Table 2 offers additional information at the level of the individual outcome measure. Subject report, observation with or without equipment, and especially automated equipment were the most commonly used methods of obtaining data. In terms of content, functional assessment (including single-item and multiple-item ADLs and IADLs) was the most common class of outcomes; physiologic outcomes were more common than psychological and neuropsychological ones.

In the great majority of cases (76.7% of 651 outcome measures), authors did not offer any information as to the reliability or validity of the data they used (table 3). Validity was less often addressed (6.9% of measures) than reliability (20.1% of measures, for any type of reliability measurement), and other aspects of clinimetrics were even more often disregarded. As reliability and validity depend on the population studied and the circumstances of data collection, it was also noted whether the authors of papers only cited the literature to suggest the quality of the outcome data they used, or whether they collected their own data to demonstrate this, either in a pilot study or as part of the investigation reported. Such 'local evidence' was produced in about half of all instances in which test-retest, interrater or internal consistency reliability was mentioned. However, these instances involved less than 5% of all measures. 'Local evidence' of validity was offered for only four measures (0.6%). In three cases, this was for a newly developed or a modified pre-existing scale. The fourth instance was use of 'goal attainment scaling', a technique in which the scale content is unique to each subject.
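To make concrete what minimal 'local evidence' of test-retest reliability involves, the following sketch is offered; it is purely illustrative, the scores are invented, and the variable names are arbitrary. It correlates two administrations of a hypothetical functional-status measure in the same subjects under stable conditions.

```python
# Illustrative only: minimal "local" test-retest reliability evidence.
# The scores are invented, not data from any study reviewed.
import numpy as np

# Hypothetical functional-status scores for 8 subjects measured twice,
# one week apart, under stable clinical conditions.
test   = np.array([34, 56, 45, 62, 50, 41, 58, 47], dtype=float)
retest = np.array([36, 54, 47, 60, 52, 40, 57, 49], dtype=float)

# Pearson correlation between administrations: a common, though limited,
# test-retest estimate.  It ignores systematic shifts between occasions,
# which is why an intraclass correlation coefficient is usually preferred.
r = np.corrcoef(test, retest)[0, 1]
print(f"Test-retest correlation (n={len(test)}): r = {r:.2f}")
```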

Lastly, information on the clinimetric characteristics of outcome data, whether based on the literature or on separate analyses conducted as part of the study, is more common in certain types of papers. Prospective studies are superior to retrospective ones, except that the latter more often report validity of any type (15% vs 7%). Longer papers tend to do better than shorter ones, but it is only the longest papers (over 11,000 estimated words) that really stand out. When fewer than 400 words are used to describe the outcome measure(s), clinimetric information is less common, but there are no systematic differences between the 400-599, 600-799 and 800+ categories. When the number of outcome measures in a paper is large (nine or more), evidence of clinimetric qualities tends to be lacking. That may be because papers with a medical focus tend to have that many outcome measures (e.g. many blood, respiratory or EMG parameters), and also are often deficient in attention to biometrics.


Outcomes based on (neuro)psychological testing tend not to be justified in terms of their reliability or validity (only 7% with any clinimetric information), and 'locally collected' evidence is never offered for this type of measure. The measures based on subject report and observation without equipment have the best documentation of clinimetric strength.

Discussion

The professional standards of rehabilitation medicine call on researchers to use measures that produce data which satisfy minimum standards of reliability and validity, and to provide information on these as part of their reporting.15 Our data indicate that scientific papers published in 1997 and 1998, five years after publication of these standards, do not satisfy these principles. For only a quarter of the outcome measures used, reliability estimates from the literature are cited, and the corresponding figure for validity is one out of 14. Evidence based on data derived from the sample studied ('local evidence') is almost non-existent. These numbers are lower than those cited in reviews of the educational and counselling,8-11 health education14 and nursing12, 13 literature.

Table 1  Characteristics of the papers reviewed and their outcome measures.

Characteristic                                                Papers (A)   Outcome measures (B)   Ratio (B:A)
                                                               #     %        #     %

Journal
  Archives of Physical Medicine and Rehabilitation             91    53      386    59      4.2
  American Journal of Physical Medicine and Rehabilitation     33    19      126    19      3.8
  Physical Therapy                                             18    11       60     9      3.3
  American Journal of Occupational Therapy                     17    10       50     8      2.9
  Rehabilitation Psychology                                     4     2       11     2      2.7
  Rehabilitation Nursing                                        8     5       18     3      2.2

Study design
  Correlational                                                 8     5       22     3      2.8
  Clinical trial–group–simple                                  36    21      145    22      4.0
  Clinical trial–group–crossover                               44    26      137    21      3.1
  Clinical trial–single subject                                11     6       51     8      4.6
  Pre-post study                                               50    29      223    34      4.5
  Historical or other controls                                 16     9       48     7      3.0
  Other and combinations                                        6     4       25     4      4.2

Nature of data collection
  Prospective                                                 144    84      568    87      3.9
  Retrospective                                                10     6       27     4      2.7
  Both                                                          6     4       17     3      2.8
  Unknown                                                      11     6       39     6      3.5

Type of intervention studied
  Treatment–multidisciplinary rehabilitation                   19    11       61     9      3.2
  Treatment–single discipline                                  17    10       70    11      4.1
  Treatment–single narrow intervention                         82    48      343    53      4.2
  Primary prevention of disability                              7     4       30     5      4.3
  Prevention of secondary disability                           10     6       44     7      4.4
  Non-treatment/non-prevention changes in structure/function   35    21      102    16      2.9
  Other                                                         1     1        1    <1      1.0

Number of outcome measures used in study
  One–two                                                      70    41       98    15      1.4
  Three–four                                                   44    26      151    23      3.4
  Five–six                                                     27    16      145    22      5.4
  Seven–eight                                                  17    10      127    20      7.5
  Nine or more                                                 13     8      130    20     10.0
  Mean and standard deviation                                         3.8 ± 2.7

Total                                                         171   100      651   100      3.8


It may be presumed that most of the studies published in these years were conceived, and certainly implemented, after the publication of the standards. While there was no groundswell of support for the standards after their publication, there also were no publications disagreeing with them. The latter is not surprising: the standards as published codify what is prescribed in introductory research textbooks and taught in research courses, and they translate to interdisciplinary rehabilitation usage what has been the accepted standard in, for example, education and psychology7 and physical therapy.17

Because empirical research papers were selected describing intervention research with patients and patient substitutes, the sample of studies surveyed has certain limitations. Two major categories were excluded: measure-development research (97 papers in the years and journals reviewed) and descriptive quantitative studies (292 papers) involving patients or patient substitutes. Without doubt, measure-development papers would score better in terms of mention of reliability, validity and other clinimetric characteristics: that is their focal point. Descriptive studies concentrate on the natural course of disablement, the relationships between patient characteristics and outcomes, and similar issues. They tend to use the same types of variables as the ones used as outcomes in our sample of intervention studies. It would be expected that the authors of these papers are no more careful in providing information on the clinimetric strengths of the measures and data than are their colleagues who report on intervention research.

Table 2  Mode of data collection and content of outcome measures.

Characteristic                               #      %

Mode of data collection
  Administrative records                     4      1
  Subject report                           148     23
  (Neuro)psychological testing              44      7
  Observation without equipment            107     16
  Observation with equipment               105     16
  Automated equipment                      238     37
  Unknown                                    5      1

Content
  Functional assessment                     75     12
  Motion or balance                         55      8
  Energy use                                41      6
  Strength, motor function                  60      9
  Range of motion                           29      4
  Electromyography parameters               21      3
  Blood parameters                          30      5
  Respiratory parameters                    51      8
  Other physiology/structure                34      5
  Bowel or bladder function                 19      3
  Spasticity/muscle tone                    13      2
  Other clinical complaints/diagnoses       35      5
  Psychological distress                    40      6
  Cognitive or communicative functioning    40      6
  Self-esteem/other attitudes               15      2
  Quality of life                           10      2
  Other socio-psychological                 18      3
  All other                                 65     10

Total                                      651    100

Table 3  Presence and nature of evidence for eight clinimetric qualities.

Clinimetric evidence                    No report      Any report     Report of            Report of
                                                                      references only      local evidence*
                                         #     %        #     %        #     %              #     %

Test-retest reliability                 591    91       60     9       34    57             26    43
Interrater reliability                  612    94       39     6       16    41             23    59
Internal consistency reliability        641    98       10     2        5    50              5    50
Unknown type of reliability             594    91       57     9       56    98              1     2
Any type of reliability                 520    80      131    20       81    62             50    38
Validity                                606    93       45     7       41    91              4     9
Sensitivity                             645    99        6     1        6   100              0     0
Practicality                            649   100        2    <1        2   100              0     0
Other clinimetric characteristics       644    99        7     1        7   100              0     0
Any clinimetric characteristic          499    77      152    23      101    66             51    34

* Note: The papers that offered local evidence (i.e. evidence based on data for the sample studied) may or may not have provided literature references in addition.



On the other hand, all of the studies included in our sample from the rehabilitation literature have one or more 'independent' variables characterizing the treatment or other intervention whose effects were studied, if nothing more than the hours of therapy or some other indicator of extensiveness. There were limited efforts to measure the degree to which the intervention called for in the research protocol was actually delivered.18 Furthermore, none of these studies considered the fact that measures of independent variables have their own reliability and validity problems. Thus, had the independent variables been included in our review, the findings would have been even bleaker.

A second limitation is that the journals reviewed were published (primarily) for US audiences, and a few years in the past. Generalizing to other journals and more recent publication years may not be warranted. As a check on this, the same methodology was applied to the most recent 20 papers (describing interventional and non-interventional research) published in the 2001 volume (issues 8-23) of Disability and Rehabilitation and classified by the Editor as 'Research papers'. Information was coded by one author only (MD). Three papers were excluded because they focused on measure development or were reports of qualitative research. The remaining 17 papers had 267 outcome measures (an average of 15.7 per paper). For 208 of these (78%), no clinimetric information whatsoever was found. Reliability information was provided for 49 (18%); six of these (2%) had local evidence, while for 43 (16%) the information was based on the literature only. Information on validity was provided for 46 (17%); none had local evidence. A total of nine measures (3%) had information on responsiveness or other clinimetric characteristics. Thus, it is fair to say that the defects described in this paper are not a thing of the past, nor do they typify US journals only.

Explanations for the observed lack of adherence to the principles of good research practice should distinguish the two components implicit in the Standards:15 (1) the need to use tests and measures that produce reliable and valid scores; and (2) the need to publish, as part of any research report, information on the reliability and validity of the data used.

There can be no doubt that the need to base research on measures that satisfy the highest standards of psychometrics, biometrics and clinimetrics is shared by all scientists. There may be ongoing debate as to the merits of particular approaches to quantifying random and systematic error in measurement scores (classical test theory vs generalizability theory, for instance), and sometimes costs or other feasibility issues may force the use of measures with less than the best metric characteristics, but there is consensus that good science requires good measurement. That is why scientists in all fields devote time, energy, creativity and funding to the development and fine-tuning of new measures, and to assessing the usefulness of existing ones for new populations and applications. Medical rehabilitation is no exception: our classification of the 1997-1998 journal articles (figure 1) indicated that 9.3% of the 1039 published peer-reviewed papers (15.0% of all original quantitative research papers) had as their only or primary goal describing the (further) development of a measure.

If it is assumed that all authors of quantitative research publications in the professional and scientific journals included in our survey know about the need to use valid and reliable data, the question remains: why are they not doing so? Each outcome measure used was not reviewed individually to determine whether there are studies demonstrating that it produces data with acceptable levels of reliability and validity, for particular purposes, for specific populations and when administered in certain ways. Nor were the authors contacted to find out whether they had such evidence from the literature, and/or whether they had performed reliability and validity studies in preparation for, or as part of, their particular applications. However, based on our familiarity with the literature and with procedures common in rehabilitation research, the authors feel confident in saying that for many of the measures used, there is little or no evidence of metric qualities. (The fact that measures are often used for which there is evidence that they have poor validity or reliability is not addressed here; e.g. internal consistency below the often-recommended cut-off of a Cronbach's alpha of 0.80.)

If no literature reports of adequate clinimetric characteristics are available, or if a radical shift in population or application of the measure is made, the onus of collecting evidence for data reliability and validity falls on the researcher. Generating validity and especially reliability information for one's sample sometimes is easy: calculating the internal consistency of an attitude scale, for instance. In other situations, producing such information is quite difficult, and may involve collecting additional information (e.g. a validity gold-standard measure) or data for additional time points. The cursory use of measures without evidence of adequate metric qualities may be explained in this way in some instances.
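As an illustration of the 'easy' case just mentioned, the sketch below (not from the original paper; the item responses are invented) computes Cronbach's alpha for a short attitude scale administered to one's own sample. A value below the often-recommended cut-off of 0.80 would flag inadequate internal consistency for the data at hand.

```python
# Illustrative only: Cronbach's alpha as "local" internal-consistency evidence.
# The item responses below are invented.
import numpy as np

def cronbach_alpha(items):
    """items: 2-D array, rows = respondents, columns = scale items."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]                         # number of items
    item_vars = items.var(axis=0, ddof=1)      # variance of each item
    total_var = items.sum(axis=1).var(ddof=1)  # variance of summed scale scores
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Hypothetical responses of 6 subjects to a 4-item Likert-type attitude scale.
responses = [
    [4, 5, 4, 4],
    [2, 2, 3, 2],
    [5, 4, 5, 5],
    [3, 3, 3, 4],
    [1, 2, 1, 2],
    [4, 4, 5, 4],
]

print(f"Cronbach's alpha = {cronbach_alpha(responses):.2f}")
```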

In those instances in which measures were used that do have existing, published evidence of adequate metric properties in one or more applications, the issue is why the papers reviewed did not mention this information.


The authors may have argued that well-known measures such as the Functional Independence Measure (FIM) have been demonstrated to produce data of high reliability and validity,19 and that it would be wasteful of the reader's time to mention such information. However, not all of the instruments included in the studies reviewed were well known. In addition, because instruments have different metric qualities in various applications, it behooves researchers to demonstrate (or at least argue) that in their application, reliability and validity are on a par with published levels, or at least sufficiently high for the purpose at hand.6 Few papers were found that offered explicit arguments to that effect. Evidence for (any type of) reliability based on newly collected data was offered for only 7.7% of measures (Table 3), and for validity the corresponding figure was a paltry 0.6%.

Lastly, lack of space may be an explanation for the paucity of mention of metric qualities of the instruments used in research publications. Our data indicate that there is a (weak) relationship between paper length and completeness of reporting of clinimetric information. Editors want to publish as many papers as space permits, and therefore pressure authors to cut, reduce and compress. Peer review comments written or received include 'this could be said in much less space' and 'the information in the text duplicates the data in the tables'. Authors may feel that this pressure is a justification to omit anything but the core information on their study. However, with most investigators utilizing four or fewer outcome measures, a few paragraphs are sufficient to provide adequate information on their metric properties. If information relevant to metric qualities is in a manual or the published literature, reference to these sources can be made, and supplemental information on reliability and validity in the application at hand ('local evidence') summarized. If a measure is new, more space may be needed, but it seems that most authors are not hesitant to detail instruments they have developed. In both instances, a footnote can be added stating that further information on reliability and validity findings is available from the authors.

It is recommended that medical rehabilitation researchers follow the guidelines in their textbooks and the standards on measurement put forward by their professional organizations. They should select measures that are likely to produce data with adequate validity and reliability, use them as they were intended to be used, collect (if at all possible) information on the measures' metric properties in their own application, and provide this information in any publication on the research. It is recommended that editors and peer reviewers ensure that researchers have done all of these things, and require that they provide a minimum of information relevant to measurement. Editors should make the extra space available in order to make this possible. Lastly, it is recommended that readers not skip over the paragraphs describing metric qualities of the data used in a study, but instead utilize this information in judging the quality and applicability of the study results to their own patient or subject populations.

Acknowledgements

At the time data abstracting for this project was performed, all authors were associated with the Rehabilitation Institute of Michigan, Detroit, MI. Funding for this project was made available by the Del Harder Rehabilitation Fund and United Way of Southeastern Michigan.

References

1 Payton OD. Research: The Validation of Clinical Practice, third edition. Philadelphia: FA Davis Co, 1994.

2 Hicks CM. Research Methods for Clinical Therapists: Applied Project Design and Analysis, third edition. Philadelphia: Churchill Livingstone, 1999.

3 Feinstein AR. Clinimetrics. New Haven, CT: Yale University Press, 1987.

4 Hinderer SR, Hinderer KA. Principles and applications of measurement methods. In: JA DeLisa, BM Gans (eds) Rehabilitation Medicine: Principles and Practice, third edition. Philadelphia: Lippincott-Raven Publishers, 1998; 109-136.

5 Frytak J. Measurement. Journal of Rehabilitation Outcomes Measurement 2000; 4: 15-31.

6 Thompson B, Vacha-Haase T. Psychometrics is datametrics: the test is not reliable. Educational and Psychological Measurement 2000; 60: 174-195.

7 American Educational Research Association, American Psychological Association, National Council on Measurement in Education. Standards for Educational and Psychological Testing. Washington, DC: American Psychological Association, 1985 (reprinted 1990).

8 Meier ST, Davis SR. Trends in reporting psychometric properties of scales used in counseling psychology research. Journal of Counselling Psychology 1990; 37: 113-115.

9 Thompson B, Snyder PA. Statistical significance and reliability analyses in recent Journal of Counselling & Development research articles. Journal of Counselling and Development 1998; 76: 436-441.

10 Whittington D. How well do researchers report their measures? An evaluation of measurement in published educational research. Educational and Psychological Measurement 1998; 58: 21-37.

11 Ellis MV, Ladany N, Krengel M, Schult D. Clinical supervision research from 1981 to 1993: a methodological critique. Journal of Counselling Psychology 1996; 43: 35-50.

12 Selby-Harrington ML, Mehta SM, Jutsum V, Riportella-Muller R, Quade D. Reporting of instrument validity and reliability in selected clinical nursing journals, 1989. Journal of Professional Nursing 1994; 10: 47-56.

13 Lynn MR. Instrument reliability and validity: how much needs to be published? Heart and Lung 1989; 18: 421-423.

14 Lamp E, Price JH, Desmond SM. Instrument validity and reliability in three health education journals, 1980-1987. Journal of School Health 1989; 59: 105-108.


15 Johnston MV, Keith RA, Hinderer SR. Measurement standards for interdisciplinary medical rehabilitation. Archives of Physical Medicine and Rehabilitation 1992; 73: S3-S23.

16 Bandy WD, Irion JM, Briggler M. The effect of time and frequency of static stretching on flexibility of the hamstring muscles. Physical Therapy 1997; 77: 1090-1096.

17 Task Force on Standards for Measurement in Physical Therapy. Standards for tests and measurements in physical therapy practice. Physical Therapy 1991; 71: 589-622.

18 Dijkers MPJM, Kropp G. Treatment integrity in medical rehabilitation research: toward better research and better research reporting. Submitted.

19 Ottenbacher KJ, Hsu Y, Granger CV, Fiedler RC. The reliability of the functional independence measure: a quantitative review. Archives of Physical Medicine and Rehabilitation 1996; 77: 1226-1232.
