
JOURNAL OF CONSUMER PSYCHOLOGY, 10(1&2), 71-73 Copyright © 2001, Lawrence Erlbaum Associates, Inc.

Interrater Reliability

IV. INTERRATER RELIABILITY ASSESSMENT IN CONTENT ANALYSIS

What is the best way to assess reliability in content analysis? Is percentage agreement between judges best (NO!)?

Or, stated in a slightly different manner from another researcher: There are several tests that give indexes of rater agreement for nominal data and some other tests or coefficients that give indexes of interrater reliability for metric scale data. For my data based on metric scales, I have established rater reliability using the intraclass correlation coefficient, but I also want to look at interrater agreement (for two raters). What appropriate test is there for this? I have hunted around but cannot find anything. I have thought that a simple percentage of agreement (i.e., a 1-point difference using a 10-point scale is 10% disagreement) adjusted for the amount of variance for each question may be suitable.

Professor Kent Grayson London Business School

Kolbe and Burnett (1991) offered a nice, and pretty damning, critique of the quality of content analysis in consumer research. They highlighted a number of criticisms, one of which is this concern about percentage agreement as a basis for judging the quality of content analysis.

The basic concern is that percentages do not take into account the likelihood of chance agreement between raters. Chance is likely to inflate agreement percentages in all cases, but especially with two coders and low degrees of freedom on each coding choice (i.e., few coding categories). That is, if Coder A and Coder B have to decide yes-no whether a coding unit has property X, then mere chance will have them agreeing at least 50% of the time (i.e., in a 2 × 2 table with codes randomly distributed, 25% in each cell, 50% of the scores would already fall along the diagonal, representing spurious apparent agreement).
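To make the chance-inflation point concrete, here is a minimal simulation sketch (Python, added for illustration; the number of coding units is arbitrary): two coders who assign a yes-no code completely at random still agree on roughly half of the units.

```python
import random

random.seed(1)
n_units = 10_000  # arbitrary number of coding units for the simulation

# Two coders with no shared coding standard: each flips a fair coin per unit.
coder_a = [random.choice(["yes", "no"]) for _ in range(n_units)]
coder_b = [random.choice(["yes", "no"]) for _ in range(n_units)]

agreement = sum(a == b for a, b in zip(coder_a, coder_b)) / n_units
print(f"Agreement by pure chance: {agreement:.1%}")  # close to 50%
```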

Several scholars have offered statistics that try to correct for chance agreement. The one that I have been using lately is "Krippendorff's alpha" (Krippendorff, 1980), which he described in his chapter on reliability. I use it because the math seems intuitive: it seems to be roughly based on the observed and expected logic that underlies chi-square.

Alternatively, Hughes and Garrett (1990) outlined a number of different options (including Krippendorff's, 1980) and then offered their own solution based on a generalizability theory approach. Rust and Cooil (1994) took a "proportional reduction in loss" approach and provided a general framework for reliability indexes for quantitative and qualitative data.

Professor Roland Rust Vanderbilt University

Recent work in psychometrics (Cooil & Rust, 1994, 1995; Rust & Cooil, 1994) has set forth the concept of proportional reduction of loss (PRL) as a general criterion of reliability that subsumes both the categorical case and the metric case. This criterion considers the expected loss to a researcher from wrong decisions, and it turns out to include some popularly used methods (e.g., Cronbach's alpha: Cronbach, 1951; generalizability theory: Cronbach, Gleser, Nanda, & Rajaratnam, 1972; Perreault and Leigh's, 1989, measure) as special cases.

Simply using percentage of agreement between judges is not so good because some agreement is sure to occur, if only by chance, and the fewer the number of categories, the more random agreement is likely to occur, thus making the reliability appear better than it really is. This random agreement was the conceptual basis of Cohen's kappa (Cohen, 1960), a popular measure that is not a PRL measure and has other bad properties (e.g., overconservatism and, under some conditions, inability to reach one, even if there is perfect agreement).

Editor: For the discussion that follows, imagine a concrete example. Perhaps two experts have been asked to code whether 60 print advertisements are "emotional imagery," "rational informative," or "mixed ambiguous" in appeal. The ratings table that follows depicts the coding scheme with the three categories; let us call the number of coding categories "c." The rows and columns represent the codes assigned to the advertisements by the two independent judges. For example, n21 represents one kind of disagreement: the number of ads that Rater 1 judged as rational but Rater 2 thought emotional. (The notation n1+ represents the sum for the first row, having aggregated over the columns.) If the raters agreed completely, all ads would fall into the n11, n22, or n33 cells, with zeros off the main diagonal.

                          Rater 2
Rater 1       emotional   rational   mixed    row sums
emotional     n11         n12        n13      n1+
rational      n21         n22        n23      n2+
mixed         n31         n32        n33      n3+
column sums   n+1         n+2        n+3      n++ (e.g., 60)

Cohen (1960) proposed K, kappa, the "coefficient of agreement," drawing the analogy (pp. 37-38) between interrater agreement and item reliability as pursuits of evaluating the quality of data. His index was intended as an improvement on the simple (he says "primitive") computation of percentage agreement. Percentage agreement is computed as the sum of the diagonal (agreed-on) ratings divided by the number of units being coded: (n11 + n22 + n33)/n++. Cohen stated, "It takes relatively little in the way of sophistication to appreciate the inadequacy of this solution" (p. 38). The problem to which he speaks is that this index does not correct for the fact that there will be some agreement simply due to chance. Hughes and Garrett (1990) and Kolbe and Burnett (1991), respectively, reported 65% and 32% of the articles they reviewed as relying on percentage agreement as the primary index of interrater reliability. Thus, although Cohen's criticism was clear 40 years ago, these reviews, only 10 years ago, suggest that the issues and solutions still have not permeated the social sciences.
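As a sketch of the percentage-agreement computation itself (Python; the 3 × 3 cell counts below are invented for illustration and are not taken from the article):

```python
import numpy as np

# Hypothetical codes for 60 ads: rows = Rater 1, columns = Rater 2
# (categories: emotional, rational, mixed).
counts = np.array([[15,  3,  2],
                   [ 4, 18,  1],
                   [ 2,  3, 12]])

n_total = counts.sum()                          # n++ = 60
percent_agreement = np.trace(counts) / n_total  # (n11 + n22 + n33) / n++
print(f"Percentage agreement: {percent_agreement:.1%}")  # 45/60 = 75%
```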

Cohen (1960, pp. 38-39) also criticized the use of the chi-square test of association in this application, because the requirement of agreement is more stringent. That is, agreement requires all non-zero frequencies to be along the diagonal; association could have all frequencies concentrated in off-diagonal cells. Hence, Cohen (1960) proposed kappa:

K = (pa - pc) / (1 - pc)

where pa is the proportion of agreed-on judgments (in our example, pa = (n11 + n22 + n33)/n++). The term pc is the proportion of agreements one would expect by chance; pc = (e11 + e22 + e33)/n++, where eii = (ni+/n++)(n+i/n++)(n++); for example, the expected number of agreements in the (2, 2) (rational code) cell would be e22 = (n2+/n++)(n+2/n++)(n++) (just as you would compute expected frequencies in a chi-square test of independence in a two-way table; it is just that here, we care only about the diagonal entries in the matrix). Cohen also provided his equation in terms of frequencies rather than proportions and an equation for an approximate standard error for the index.
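A minimal sketch of the kappa computation just described (Python, using the same invented counts as in the earlier sketch):

```python
import numpy as np

# Hypothetical codes for 60 ads: rows = Rater 1, columns = Rater 2.
counts = np.array([[15,  3,  2],
                   [ 4, 18,  1],
                   [ 2,  3, 12]])

n_total = counts.sum()                  # n++
p_a = np.trace(counts) / n_total        # observed proportion of agreement
row_p = counts.sum(axis=1) / n_total    # n_i+ / n++
col_p = counts.sum(axis=0) / n_total    # n_+i / n++
p_c = np.sum(row_p * col_p)             # chance agreement along the diagonal
kappa = (p_a - p_c) / (1 - p_c)
print(f"p_a = {p_a:.3f}, p_c = {p_c:.3f}, kappa = {kappa:.3f}")
```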

Researchers have criticized the kappa index for some of its properties and proposed extensions (e.g., Brennan & Prediger, 1981; Fleiss, 1971; Hubert, 1977; Kaye, 1980; Kraemer, 1980; Tanner & Young, 1985). To be fair, Cohen (1960, p. 42) anticipated some of these qualities (e.g., that the upper bound for kappa can be less than 1.0, depending on the marginal distributions), and so he provided an equation to determine the maximum kappa one might achieve.

If Cohen's (1960) kappa has some problems, what might serve as a superior index? Perreault and Leigh (1989) reasoned through expected levels of chance agreement in a way that did not depend on the marginal frequencies. They defined an "index of reliability," Ir, as follows (p. 141):

Ir = {[pa - (1/c)][c / (c - 1)]}^(1/2)

when pa > (1/c). If pa < (1/c), Ir is set to zero. (Recall that c is the number of coding categories as defined previously.)
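A sketch of this index in code (Python, added for illustration; pa = 0.75 and c = 3 carry over from the hypothetical 60-ad table used above):

```python
import math

def perreault_leigh_index(p_a: float, c: int) -> float:
    """Perreault and Leigh's (1989) reliability index I_r for c coding categories."""
    if p_a <= 1.0 / c:
        return 0.0  # at or below chance-level agreement, the index is set to zero
    return math.sqrt((p_a - 1.0 / c) * (c / (c - 1.0)))

p_a = 0.75  # observed agreement from the hypothetical 60-ad table
c = 3       # number of coding categories
print(f"I_r = {perreault_leigh_index(p_a, c):.3f}")
```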

They also provide an estimated standard error (p. 143):

sI = {[Ir(1 - Ir)] / n++}^(1/2)

which is important because, when the condition Ir × n++ > 5 holds, these two statistics may be used in conjunction to form a confidence interval, in essence a test of the significance of the reliability index:

Ir ± (1.96)sI.
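Continuing the sketch with the same invented example (Ir ≈ 0.79, n++ = 60), the standard error and approximate 95% interval would be computed as follows (Python):

```python
import math

I_r = 0.79     # Perreault-Leigh index from the hypothetical example (rounded)
n_total = 60   # n++, the number of coded units

s_I = math.sqrt(I_r * (1.0 - I_r) / n_total)       # estimated standard error
lower, upper = I_r - 1.96 * s_I, I_r + 1.96 * s_I  # approximate 95% interval

# The interval is appropriate here because I_r * n++ = 47.4, comfortably above 5.
print(f"s_I = {s_I:.3f}, 95% CI = ({lower:.3f}, {upper:.3f})")
```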

Rust and Cooil (1994) is another level of achievement, extending Perreault and Leigh's (1989) index to the situation of three or more raters and creating a framework that subsumes quantitative and qualitative indexes of reliability (e.g., coefficient alpha for rating scales and interrater agreement for categorical coding). Hughes and Garrett (1990) used generalizability theory, which is based on a random-effects, analysis-of-variance-like modeling approach to apportion variance due to raters, stimuli, coding conditions, and so on. (Hughes & Garrett also criticized the use of intraclass correlation coefficients as insensitive to differences between coders due to mean or variance; p. 187.) Ubersax (1988) attempted to simultaneously estimate reliability and validity from coding judgments using a latent class approach, which is prevalent in marketing.

In conclusion, perhaps we can at least agree to finally banish the simple percentage agreement as an acceptable index of interrater reliability. In terms of an index suited for general endorsement, Perreault and Leigh's (1989) index (discussed earlier) would seem to fit many research circumstances (e.g., two raters). Furthermore, it appears sufficiently straightforward that one could compute the index without a mathematically induced coronary.

REFERENCES

Brennan, Robert L., & Prediger, Dale J. (1981). Coefficient kappa: Some uses, misuses, and alternatives. Educational and Psychological Measurement, 41, 687-699.

Cohen, Jacob. (1960). A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20, 37-46.

Cooil, Bruce, & Rust, Roland T. (1994). Reliability and expected loss: A unifying principle. Psychometrika, 59, 203-216.

Cooil, Bruce, & Rust, Roland T. (1995). General estimators for the reliability of qualitative data. Psychometrika, 60, 199-220.

Cronbach, Lee J. (1951). Coefficient alpha and the internal structure of tests. Psychometrika, 16, 297-334.

Cronbach, Lee J., Gleser, Goldine C., Nanda, Harinder, & Rajaratnam, Nageswari. (1972). The dependability of behavioral measurements: Theory of generalizability for scores and profiles. New York: Wiley.

Fleiss, Joseph L. (1971). Measuring nominal scale agreement among many raters. Psychological Bulletin, 76, 378-382.

Hubert, Lawrence. (1977). Kappa revisited. Psychological Bulletin, 84, 289-297.

Hughes, Marie Adele, & Garrett, Dennis E. (1990). Intercoder reliability estimation approaches in marketing: A generalizability theory framework for quantitative data. Journal of Marketing Research, 27, 185-195.

Kaye, Kenneth. (1980). Estimating false alarms and missed events from interobserver agreement: A rationale. Psychological Bulletin, 88, 458-468.

Kolbe, Richard H., & Burnett, Melissa S. (1991). Content-analysis research: An examination of applications with directives for improving research reliability and objectivity. Journal of Consumer Research, 18, 243-250.

Kraemer, Helena Chmura. (1980). Extension of the kappa coefficient. Biometrics, 36, 207-216.

Krippendorff, Klaus. (1980). Content analysis: An introduction to its methodology. Newbury Park, CA: Sage.

Perreault, William D., Jr., & Leigh, Laurence E. (1989). Reliability of nominal data based on qualitative judgments. Journal of Marketing Research, 26, 135-148.

Rust, Roland T., & Cooil, Bruce. (1994). Reliability measures for qualitative data: Theory and implications. Journal of Marketing Research, 31, 1-14.

Tanner, Martin A., & Young, Michael A. (1985). Modeling agreement among raters. Journal of the American Statistical Association, 80(389), 175-180.

Ubersax, John S. (1988). Validity inferences from inter-observer agreement. Psychological Bulletin, 104, 405-416.