Disagreement in Interpretation: A Method for the Development of Benchmarks for Quality Assurance in Imaging

David J. Soffa, MD, MPA (a); Rebecca S. Lewis, MPH (b); Jonathan H. Sunshine, PhD (b,c); Mythreyi Bhargavan, PhD (b)

(a) American Imaging Management, Northbrook, Illinois.
(b) Research Department, American College of Radiology, Reston, Virginia.
(c) Department of Diagnostic Radiology, Yale University, New Haven, Connecticut.

Corresponding author and reprints: David J. Soffa, MD, American Imaging Management, 40 Skokie Boulevard, Northbrook, IL 60062; e-mail: [email protected].

Purpose: To calculate disagreement rates by radiologist and modality to develop a benchmark for use in the quality assessment of imaging interpretation.

Methods: Data were obtained from double readings of 2% of daily cases performed for quality assurance (QA) between 1997 and 2001 by radiologists at a group practice in Dallas, Texas. Differences across radiologists in disagreement rates, with adjustments for case mix, were examined for statistical significance using simple comparisons of means and multivariate logistic regression.

Results: In 6703 cases read by 26 radiologists, the authors found an overall disagreement rate of 3.48%, with disagreement rates of 3.03% for general radiology, 3.61% for diagnostic mammography, 5.79% for screening mammography, and 4.07% for ultrasound. Disagreement rates for the 10 radiologists with at least 20 cases ranged from 2.04% to 6.90%. Multivariate analysis found that, controlling for other factors, both differences among radiologists and differences across modalities contributed statistically significantly to differences in disagreement rates.

Conclusion: Disagreement rates varied by modality and by radiologist. Double-reading studies such as this one are a useful tool for rating the quality of imaging interpretation and for establishing benchmarks for QA.

Key Words: Quality assurance, observer performance, disagreement rate, interpretation, imaging, self-referral

J Am Coll Radiol 2004;1:212-217. Copyright © 2004 American College of Radiology

INTRODUCTION

Radiologists do not currently have an objective benchmark for an acceptable level of missed diagnoses to meet hospital accreditation and proctoring requirements [1]. Also, there is no accepted measure by which to judge other, nonradiologist physicians' imaging interpretations. Government regulations, accreditation requirements, and the movement toward consumerism in the marketplace are placing ever-increasing demands on the need to demonstrate quality. Residency training programs in radiology expose residents to studies in optical physics regarding perceptual errors. Although residents also are familiarized with existing studies on rates of disagreement between multiple reads of the same film, there is a paucity of recent literature on the subject that is applicable to actual practice situations and that can serve as a credible benchmark. This study seeks to address that gap.

METHODS

Sources of Data

The International Radiology Group (IRG) is a radiology practice in Dallas, Texas, currently reading between 1200 and 1500 cases per day. It operates a highly automated, streamlined reading center. Since its inception, IRG has maintained a quality assurance (QA) program. Its cases come largely from outpatient settings in 26 states and include studies from nighttime emergency and off-hours coverage. Over the past several years, the center has developed its digital and film transmission capabilities. There is not yet sufficient volume for the inclusion of these readings in the current study, nor was there an adequate number of magnetic resonance imaging (MRI) or computed tomography (CT) film interpretations. During the five-year period studied (1997-2001), staff radiologists participated in a QA program consisting of the daily double reading of 2% of over 300,000 cases. The program was initiated with the start of the reading center, so there was a smaller number of cases over the first few years, and case volume increased over time.


Each test case was read by 2 radiologists. The test films were mixed in with the other films each radiologist was reading for the first time. Thus, the first radiologist did not know that a particular case was going to be reviewed, and the second radiologist did not know that it had already been interpreted. A total of 26 board-certified radiologists participated in the program. The study retrospectively analyzed 6812 cases (from over 30 sites) double read as part of the QA program.

QA Procedure

The films arrive from the imaging facility and are sorted and prepared for reading. After initial readings are dictated, a 2% sample of each modality is randomly pulled from the reading room. (Cases are not numerically randomly selected; rather, every other case from each imaging facility is taken until a 2% sample of each modality is achieved.) The cases are then mixed in with new cases for a second radiologist to interpret. He or she does not know whether the case has been read before and dictates it as a new case. The dictated case pairs are entered into the QA database. Under the supervision of the medical director, a designated, specially trained employee compares the results of the two reports of the QA test case. Any report pair that does not match up is marked suspect. The suspect cases are logged on a quality control sheet. All suspect cases are presented to the medical director or another designated radiologist to validate the discrepancy. Insignificant differences, such as lung scarring, granulomas, or vascular calcifications, are ignored. The medical director then assigns a significance rating as follows:

1. interpretation expected and acceptable: reviewer comfortable with interpretation;
2. interpretation varies slightly, but not totally unexpected; reviewer still comfortable with interpretation;
3. interpretation varies moderately; reviewer uncomfortable with interpretation, which might adversely affect patient condition; and
4. interpretation varies significantly; reviewer very uncomfortable with interpretation, which probably would adversely affect patient condition.

If possible, the radiologists meet and review the cases of disagreement. On a limited number of cases for which the two radiologists still disagree, or if a timely meeting is not possible, the third radiologist, the medical director, determines whether the QA radiologist's assessment was correct. Either by agreement or by the responsible medical director, the most accurate report is identified and sent to the client. Cases with designations of 3 or 4 are digitally archived for later review. These are the cases that are considered disagreements in the context of this study. Individual physician performance and interreader reliability are tracked and reported to individuals and the QA committee. The final resolution of a serious disagreement may include the presentation of the case to the QA committee or the identification of remedial interventions.
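To make the sampling and rating workflow above concrete, the following sketch outlines it in Python. It is illustrative only: the record layout (facility, modality, rating fields) and the way the 2% target is computed are assumptions based on the description here, not the practice's actual QA software.

from collections import defaultdict

def select_qa_sample(cases, fraction=0.02):
    """Approximate the selection rule described above: walk each facility's cases
    in order, taking every other case, until roughly `fraction` of each modality's
    daily volume has been pulled. Each case is a dict with 'facility' and
    'modality' keys (a hypothetical layout)."""
    targets = defaultdict(int)
    for case in cases:
        targets[case["modality"]] += 1
    targets = {m: max(1, round(fraction * n)) for m, n in targets.items()}

    by_facility = defaultdict(list)
    for case in cases:
        by_facility[case["facility"]].append(case)

    sample, picked = [], defaultdict(int)
    for facility_cases in by_facility.values():
        for i, case in enumerate(facility_cases):
            if i % 2 == 0 and picked[case["modality"]] < targets[case["modality"]]:
                sample.append(case)
                picked[case["modality"]] += 1
    return sample

def is_study_disagreement(rating):
    """Only significance ratings 3 and 4 count as disagreements in this study;
    ratings 1 and 2 (reviewer still comfortable) do not."""
    return rating in (3, 4)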

Statistical Analyses

Each QA case record included identification numbers for the reading and QA radiologists on that case and a field that recorded whether the two radiologists agreed. For analysis, we called the original, or first, reader the reading radiologist. We called the second reader the QA radiologist. Nineteen readers participated as reading radiologists, and 20 participated as QA radiologists, with an overlap of 13 radiologists who participated as both reading radiologists and QA radiologists. The procedure that was performed is identified by the American Medical Association's Current Procedural Terminology (CPT) code [2].

Because some of the staff members were part-time, some of them may not have read films earlier in the day and so may not have had any of their cases chosen for review. Alternatively, some radiologists may not have read films later in the day at this location and therefore may not have had the opportunity to be reviewers. (There were only four to six radiologists on duty on any given day. In addition, because this was a double-blind study, there was a small number of cases [97] for which the QA radiologist reviewed his or her own first interpretation. These were deleted from the data set that we analyzed.) It is possible that a radiologist's behavior is affected by the knowledge that he or she is likely to be reviewed. To reduce bias on this account, we restricted our comparisons across radiologists to only those who read as both reading and QA radiologists. (We found that radiologists who were dropped from the data because they had served in only one role, as reading radiologists only or as QA radiologists only, did not have disagreement rates that were statistically significantly different from the overall average.)
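As a rough illustration of the data handling just described, the sketch below uses pandas with hypothetical column names (reading_radiologist, qa_radiologist, and a boolean agree flag); the actual QA database layout is not described at that level of detail in the text.

import pandas as pd

def prepare_qa_cases(qa: pd.DataFrame) -> pd.DataFrame:
    """qa has one row per double-read case, with assumed columns
    'reading_radiologist', 'qa_radiologist', and a boolean 'agree' field."""
    # Drop cases in which the QA radiologist happened to review his or her own
    # first interpretation (97 such cases were deleted in the study).
    return qa[qa["reading_radiologist"] != qa["qa_radiologist"]].copy()

def radiologists_in_both_roles(qa: pd.DataFrame) -> set:
    # Across-radiologist comparisons were restricted to readers who served as
    # both reading and QA radiologists (13 of the 26 participants).
    return set(qa["reading_radiologist"]) & set(qa["qa_radiologist"])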

Statistical analyses of the data were performed using SAS software, version 8.1 (SAS Institute Inc., Cary, North Carolina). The CPT codes were classified into seven categories (CT, MRI, diagnostic mammography, screening mammography, ultrasound, nuclear medicine, and general radiology or plain film) to analyze agreement rates by modality. Mammography cases were split into diagnostic and screening mammography because the volumes and patterns of interpretation tend to differ between them, possibly leading to differences in rates of disagreement. We excluded cases with missing data and modalities with fewer than 20 cases. As a result, we omitted CT, MRI, and nuclear medicine and had 6703 cases remaining.

Given the double-blinded nature of the QA process, whether a radiologist was the first or second reader would not be expected to affect his or her reading of a case and therefore the disagreement rate with the others in the study. Therefore, except in regression analyses, we calculated and tabulated disagreement rates for each individual radiologist, overall and by modality, while ignoring whether the radiologist was a reading or QA radiologist on a particular case.

Except where otherwise noted, cases read by reading or QA radiologists who read fewer than a total of 20 cases in all were deleted from the analysis. This process left us with 10 radiologists. The reason for deleting radiologists with fewer than 20 cases is that results based on small samples are not likely to be truly representative of a radiologist's usual behavior because of sample variability. With very few cases, the standard error of estimated disagreement rates would be too high to permit meaningful statistical comparisons. Although this may introduce some unintentional bias, the modality-specific disagreement rates did not change significantly after dropping the low-volume radiologists.
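A minimal sketch of the tabulation steps described above follows, again with assumed column names; the CPT-to-modality lookup is a placeholder, since the actual code groupings are not listed in the text.

import numpy as np
import pandas as pd

CPT_TO_MODALITY = {}  # placeholder: would be populated from the practice's CPT code tables

def disagreement_by_modality(qa: pd.DataFrame, min_cases: int = 20) -> pd.DataFrame:
    """Disagreement rate s/N per modality, dropping modalities with fewer than
    `min_cases` cases (CT, MRI, and nuclear medicine fell below this threshold)."""
    qa = qa.assign(modality=qa["cpt_code"].map(CPT_TO_MODALITY)).dropna(subset=["modality"])
    out = qa.groupby("modality")["agree"].agg(N="size", s=lambda a: int((~a).sum()))
    out = out[out["N"] >= min_cases]
    out["rate"] = out["s"] / out["N"]
    return out

def rate_standard_error(p: float, n: int) -> float:
    """Binomial standard error of an estimated rate. At the overall rate of about
    0.035 and n = 20, the SE is roughly 0.04, larger than the rate itself, which is
    the stated rationale for excluding radiologists with fewer than 20 cases."""
    return float(np.sqrt(p * (1 - p) / n))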

We calculated disagreement rates for each radiologist and for each modality. These calculations of disagreement rates facilitated comparisons across one factor at a time, for example, across radiologists or across modalities. However, differences across radiologists could simply mean that some of them read cases in modalities that were more prone to disagreement than others. A very important question in comparing radiologists is whether there were systematic differences across radiologists' disagreement rates that would persist even if they had comparable mixes of modalities among their cases. Although we did tabulate disagreement rates for each radiologist by modality, the retrospective nature of the study meant that we could not ensure an adequate number of responses per radiologist per modality. The number of observations in each category was often too small to permit statistical comparisons. Therefore, we conducted a comparison of radiologists' disagreement rates independent of modality using the common epidemiologic technique of direct standardization. Specifically, we calculated the expected disagreement rate each radiologist would have if his or her disagreement rate for each modality were the same as the overall average disagreement rate for that modality but his or her mix of cases by modality was that which he or she actually had. We then compared this expected disagreement rate to the radiologist's actual disagreement rate and examined whether the difference was statistically significant. For the purposes of these comparisons, we defined P values of .01 and less as indicative of statistical significance, and P values between .01 and .10 as indicative of marginal statistical significance.

As another way to compare disagreement across more than one factor, we used multivariate regression analysis. This enabled us to investigate the effect of each factor on disagreement, namely modality, radiologist, and whether a radiologist was a reading or a QA radiologist on a case, while statistically controlling for the effects of other factors that may also affect disagreement. We used logistic regression because the dependent variable of interest, namely agreement, was binary (i.e., the radiologists either agreed or disagreed). (The dependent variable was defined as taking on a value of 1 when the two radiologists agreed on a case and a value of 0 when they disagreed.) In general, the logistic regression measures the impact of each value of a variable on the probability of agreement (or disagreement) relative to a reference value of the variable. For example, general radiology was used as the reference value for modality. (To facilitate statistical analysis, we chose the most frequently occurring value as the reference one.) The logistic regression thus indicated the significance level for the difference in probability of agreement between each modality, for example, ultrasound, and the probability of agreement for general radiology. When reporting results of the multivariate regression, we reported findings only for categorical variables whose coefficients were jointly statistically significant at a P value of less than .01. For categorical variables, a larger P value on any single value of the variable translates into very weak implications for joint significance of all values of the variable as a whole. Therefore, we consider larger P values too weak to report and do not report marginal significance.
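The direct standardization step lends itself to a short worked sketch. The helper below is a simplified illustration of the expected-rate calculation and of the paper's significance conventions; the binomial test mentioned in the comment is one plausible way to compare actual and expected rates, not necessarily the authors' exact procedure, and the case mix shown is hypothetical.

def expected_disagreement_rate(case_mix: dict, overall_rates: dict) -> float:
    """Directly standardized rate for one radiologist: the rate he or she would
    have if, within each modality, disagreement occurred at the overall average
    rate for that modality, weighted by his or her own mix of cases.
    case_mix: modality -> number of cases this radiologist read
    overall_rates: modality -> overall disagreement rate for that modality"""
    total = sum(case_mix.values())
    return sum(n * overall_rates[m] for m, n in case_mix.items()) / total

# Example with the overall modality rates reported in Table 1 and an assumed mix:
overall = {"general": 0.0303, "diag_mammo": 0.0361, "screen_mammo": 0.0579, "ultrasound": 0.0407}
mix = {"general": 120, "diag_mammo": 40, "screen_mammo": 10, "ultrasound": 30}
expected = expected_disagreement_rate(mix, overall)  # about 0.034

# A binomial test could then compare the radiologist's observed disagreements with
# this expected rate, e.g. scipy.stats.binomtest(k=observed, n=sum(mix.values()), p=expected).

def significance_label(p_value: float) -> str:
    """The paper's convention: p <= .01 is significant; .01 < p < .10 is marginal."""
    if p_value <= 0.01:
        return "significant"
    if p_value < 0.10:
        return "marginally significant"
    return "not significant"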

RESULTS

Table 1 shows the overall disagreement rate, 3.5%, and disagreement rates by modality. Screening mammography had the highest disagreement rate, 5.8%, of all the modalities.

Table 1. Disagreement by modality*

Modality                    Disagreement Rate (s/N)
General radiology           3.03% (114/3,763)
Diagnostic mammography      3.61% (68/1,885)
Screening mammography       5.79% (27/466)
Ultrasound                  4.07% (24/589)
Overall                     3.48% (233/6,703)

*Low-volume modalities (CT, MRI, and nuclear radiology) and missing observations were excluded from the analysis. s = number of cases with disagreement; N = total number of cases read in the modality.

Table 2 shows the disagreement rate by radiologist by modality, disregarding whether the radiologist was the first or second reader on the case. It also shows actual versus expected overall disagreement rates for each radiologist. Comparing actual and expected disagreement rates overall across all modalities, radiologist I had a significantly lower than expected rate, and radiologist D had a marginally significantly higher than expected rate. Of the 10 radiologists, 3 radiologists in screening mammography and 2 in ultrasound had disagreement rates significantly different from the average for that modality, and at least 1 in each modality had a disagreement rate that was marginally significantly different from the average rate for that modality.

The results of a logistic regression that considered only the relationship between modality and agreement were that screening mammography and ultrasound had significantly negative coefficients (Table 3A). This means that radiologists were significantly less likely to agree when reading screening mammography or ultrasound than when reading the reference modality, general radiology.

In a logistic regression analysis (Table 3B) that considered modality, radiologist, and whether first or second reader, the variables defining radiologist and modality were, as a group, highly significant. In other words, the probability of agreement was affected by both factors: the radiologist who interpreted a case and the modality of which the case consisted. Considering individual radiologists and individual modalities, the logistic regression showed that radiologist D had a significantly higher disagreement rate than the reference radiologist (radiologist B, who was chosen as the reference because he or she had the greatest number of cases), and screening mammography had a statistically significant negative coefficient, implying a higher probability of disagreement than general radiology. No significant effect of whether the radiologist was a first or second reader was found.

Table 2. Disagreement rates by radiologist by modality

                                                  Actual disagreement rate by modality
              Actual        Expected        General      Diagnostic     Screening
Radiologist   (overall)     (overall)       radiology    mammography    mammography    Ultrasound
Average                                     3.03%        3.61%          5.79%          4.07%
A             4.65%         3.60%           4.55%        8.33%          0.00%†         3.26%
B             2.84%         3.41%           2.33%        3.09%          3.31%          4.90%
C             6.90%         4.00%           0.00%        0.00%          0.00%†         9.52%
D             6.00%*        3.51%           7.20%        4.69%          38.46%*        2.11%
E             3.91%         3.38%           3.40%        3.58%          10.62%         5.95%
F             3.65%         3.39%           3.66%        2.38%          10.64%         4.24%
G             5.13%         3.46%           0.00%        15.00%*        0.00%†         0.00%†
H             2.71%         3.03%           2.71%
I             2.04%†        3.33%           1.49%        3.28%          3.39%          1.39%*
J             2.51%         3.17%           1.63%        7.14%          14.29%         0.00%†

The "Average" row gives the overall disagreement rate for each modality.
*Marginally significantly different from the expected rate, 0.01 < p < 0.10.
†Significantly different from the expected rate, p < 0.01.

Table 3. Results of multivariate analyses

A. Logistic regression of modalities on agreement

Variable                  Direction and significance
Screening mammography     -S
Ultrasound                -S

B. Logistic regression on modality, radiologist, and whether 1st or 2nd reader

Categorical variables    Significance*    Individual values         Direction and significance
Radiologist              S                Radiologist D             -S
Modality                 S                Screening mammography     -S

*For categorical variables, this indicates whether, as a group, the variables have a significant effect on probability, irrespective of the direction of effect of any individual value.
S = significant (p < 0.01); - = decreases probability of agreement; + = increases probability of agreement.
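For readers who want to reproduce an analysis of this kind, the following is a sketch of the Table 3B model using statsmodels in Python; the original analysis was run in SAS, and the data frame and column names here (agree, modality, radiologist, reader_role) are assumptions rather than the authors' actual variable names.

import statsmodels.formula.api as smf

def fit_agreement_model(qa_long):
    """Logistic regression of agreement (1/0) on modality, radiologist, and whether
    the reader was the first (reading) or second (QA) radiologist on the case.
    General radiology and radiologist B serve as the reference levels, mirroring
    the choices described in the Methods section."""
    formula = (
        "agree ~ C(modality, Treatment(reference='general radiology'))"
        " + C(radiologist, Treatment(reference='B'))"
        " + C(reader_role)"
    )
    return smf.logit(formula, data=qa_long).fit()

# Negative, significant coefficients (e.g., for screening mammography or for
# radiologist D) correspond to a lower probability of agreement, matching the
# "-S" entries in Table 3.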

DISCUSSION

Conclusions

This study reports the results of a QA program involving a double reading of 2% of more than 300,000 cases that produced statistically valid results. In this study, a disagreement rate of 5% or less was recorded. Another study of similar magnitude [3] also reported disagreement rates in a similar range. We believe that our study provides important input toward building a credible benchmark that could be used to measure accuracy of interpretation of plain film, mammography, and ultrasound.

Comparison with Other Studies

Fleckenstein and Puig [4] reported a 5% disagreement rate for the first year of our data. Overall, our five-year study found an average disagreement rate of approximately 3.5%. Siegle et al. [3], using data from six community hospitals selecting a 3% sample of 26 radiologists' readings, reviewed more than 11,000 cases covering multiple modalities and found a disagreement rate of 4.4%. Rhea et al. [5] reported 5% significant disagreements between radiology residents and attending physicians for emergency room studies, with "significant" here meaning having a potentially serious effect on patient management. Garland [6] is famous for having reported a 30% rate of disagreement in his studies in the 1950s. However, a close reading of his work indicates that all the studies were known positives [3]. Thus, he was really reporting a number that was 100% minus sensitivity. He opined that in actual clinical practice, with its limited percentage of positives, approximate error rates were in the range of 5% [3].

Most studies [7-9] on plain films with radiologists and nonradiologists reading cases found much higher disagreement rates, albeit in different practice settings. Halvorsen and Kunian [7] found a 12.5% disagreement rate between family practitioners in 14 community practices and backup radiologists reading their radiographs. However, this was for the relatively small fraction of cases referred on to radiologists and did not include false negatives. McLain and Kirkwood [8] found a disagreement rate of 9.2% in radiographs read by primary care physicians in rural practice and their backup radiologists. Hopper et al. [9] found a potentially serious disagreement rate of 7.4% between one expert reviewing radiologist and seven nonradiologists interpreting chest radiographs, compared with no such disagreement for 30 chest radiographs interpreted by radiologists. Such studies suggest that nonradiologists make more errors in interpretation.

Study Limitations

Our study has limitations. Most obviously, although it included data collected from multiple sites and states, it was composed of only one practice, and there could be practice-specific factors that influence disagreement rates. This limits its generalizability to other practices.

Not every radiologist in the practice participated as a reading and QA radiologist, because some radiologists worked part-time and may not have been in during the part of the day when films were picked for review or when they were being reviewed. In addition, it is possible, for instance, that radiologists who know they are not likely to be reviewed may behave differently from those who know they are likely to be reviewed. To reduce bias on this account, we restricted our comparisons across radiologists to only those who read as both reading and QA radiologists. (We found that radiologists who served as reading radiologists only or QA radiologists only did not have disagreement rates that were statistically significantly different from the overall average.)

Also, it was not possible to stratify the data by setting. Thus, results cannot be generalized to a specific setting (such as hospital or office).

None of the advanced modalities, such as CT or MRI, is included, which limits generalizability to the modalities that were included in the analysis.

Two percent of the daily cases of each modality were selected. There may have been some unintentional bias in the case selection because the cases were not selected by numerical randomization. However, because we are interested in comparisons of averages across radiologists and modalities and not the absolute rates, we believe that our results are likely to be valid despite any such bias.

Disagreement was determined by one radiologist, and it is possible that a different radiologist might have arrived at a different decision as to whether the two readers agreed. This study defines disagreement as cases for which the radiologist comparing the interpretations of the reading and QA radiologists thought that the disagreement might have adverse implications for patient condition. It would be useful to carry QA further and use expert committees instead of individual radiologists to judge disagreements.

In general, the smaller the number of cases in any subgroup (based on radiologist or modality), the higher were the disagreement rates we found. Small numbers of cases should make variance high but not increase the mean, so we count this peculiarity as a weakness.

Although we account for variations across readers and modalities, we do not have information on disease prevalence or clinical factors, both of which could further influence disagreement rates and/or introduce bias. In general, addressing some of the limitations we note would enhance the value of our QA system.

Import of This Study

We have shown that it is feasible to measure the disagreement rate and to do so not only overall but also by radiologist and modality. The analysis shows that disagreement rates differ by modality, controlling for which radiologist is reading, and by radiologist, controlling for modality. Thus, we have shown that it is important to include both radiologist and modality in appraising performance and making comparisons. In our study, disagreements occurred over a relatively narrow range, much smaller than the reported differences between radiologists and nonradiologists. Our study does not in and of itself establish a benchmark. However, it does provide important input toward building a credible review program, and it demonstrates a system that can be used to measure accuracy of interpretation.

Implications for QA and Self-Referral

Our study shows that a blinded, second-reading-based QA system can work in daily radiology practice. The analyses indicate that modality matters and therefore should be taken into account and that there are differences among radiologists. In short, we have demonstrated a working and productive model for a peer-review system and provided important benchmark data for disagreement rates for some modalities.

In addition to use for radiologist peer review, a credible benchmark would be useful in evaluating the quality of imaging performed by nonradiologists. This in turn would enable an objective evaluation of quality in self-referred cases. Patients, payers, and policy makers could use this information to establish standards or accreditation criteria for facilities. Studies finding higher rates of disagreement between radiologists and nonradiologists, such as those of Hopper et al. [9] and Halvorsen and Kunian [7], make a case for establishing benchmarks for interpretive ability for all individuals who interpret imaging examinations.

Our study revealed a consistent disagreement rate of 5% or below between two board-certified radiologists reading a sample of daily cases for QA. Other studies in the literature have reported similar rates when a double reading was performed. The results of the IRG QA program provide important input toward building a credible review program and demonstrate a system that can be used to measure accuracy of interpretation. Putting a system like this into use could facilitate comparisons of disagreements between radiologists and nonradiologists in actual clinical settings. It could provide much useful information for the discussion of how important training in radiology is in physicians' ability to read and interpret cases.

ACKNOWLEDGMENT

We would like to acknowledge the contributions of Brian Hall, vice president of operations at IRG, Dallas, Texas.

This study received assistance in statistical analysis from the ACR's Technology Assessment Studies Assistance Program, which provided the professional services of Jonathan H. Sunshine, PhD, Mythreyi Bhargavan, PhD, and Rebecca Lewis, MPH.

REFERENCES

1. Cascade PN. Quality improvement in diagnostic radiology. Am J Roentgenol 1990;154:1117-20.
2. American Medical Association. Physicians' Current Procedural Terminology. Chicago: American Medical Association; 2001.
3. Siegle RL, Baram EM, Reuter SR, Clarke EA, Lancaster JL, McMahan CA. Rates of disagreement in imaging interpretation in a group of community hospitals. Acad Radiol 1998;5:148-54.
4. Fleckenstein JL, Puig WA. Quality control in radiology: variance analysis. RSNA scientific session #1183, abstract, 1999.
5. Rhea JT, Potsaid MS, DeLuca SA. Errors of interpretation as elicited by a quality audit of an emergency radiology facility. Radiology 1979;132:277-80.
6. Garland LH. Studies on the accuracy of diagnostic procedures. Am J Roentgenol 1959;82:25-33.
7. Halvorsen JG, Kunian A. Radiology in family practice: a prospective study of 14 community practices. Fam Med 1990;22:112-7.
8. McLain PL, Kirkwood CR. The quality of emergency room radiograph interpretations. J Fam Pract 1985;20:443-8.
9. Hopper KD, Rosetti GF, Edmiston RB, et al. Diagnostic radiology peer review: a method inclusive of all interpreters of radiographic examinations regardless of specialty. Radiology 1991;180:557-61.