IRT ITEM DISCRIMINATION INTERPRETATION IN CRITERION-REFERENCED LANGUAGE TESTING
THOM HUDSON
University of Hawai'i
This study investigates relationships among the IRT one-parameter fit statistics, the two-parameter slope parameter and traditional biserial correlations in terms of the role these indices play in criterion-referenced language test construction. It discusses the assumptions of the two models and how these assumptions can affect criterion-referenced test construction and interpretation. The study then specifically examines how the indices operate in terms of item discrimination when items are designed to determine whether or not an examinee is at a specific level of ability. Examinees in Mexico, Saudi Arabia and Japan were administered one of two forms of a functional test (Form A: n=430, k=94; Form B: n=400, k=95). The data were analyzed using the two IRT models and the results were compared. The results indicate strong relationships between biserial correlation, two-parameter slope, and one-parameter infit and outfit. The implications of these strong relationships for interpreting the indices are discussed.
INTRODUCTION
CRITERION-REFERENCED language tests (CRTs) are generally designed to
distinguish between two or more defined levels of language ability rather than
to determine the status of individuals relative to other individuals who have
taken a test. For example, CRTs are used to determine mastery or non-mastery,
placement into categories such as in need of remediation, readiness or exemption from instruction, or ability levels such as low, middle or high. As a consequence of this goal to discriminate between those examinees who know a
particular body of knowledge or who have particular defined abilities and
those who do not, the tests must contain items which are tailored both in terms
of content and in terms of their psychometric properties to the criteria which have been defined.
The development of such tests presents particular concerns in the area of
University of Hawai'i Working Papers in ESL, Vol. 9, No.2, December 1990, pp. 99-128
item analysis. Two of these concerns involve item fit and item discrimination. First, the items should be maximally discriminating within the defined level of ability at which they are designated, without being so narrowly discriminating as to be of little utility. That is, items will be designated to measure a skill which is assumed to be of interest at some particular ability level. Not only should the difficulty of the item be located within the band of ability with which it is associated, but most of its discriminability should be within that band as well. Second, the items should fit the trait which is being measured (Henning, 1984, 1987, 1988; Wright & Stone, 1979; Mislevy & Bock, 1986, 1990). That is, the response patterns of examinees must be such that no items appear to be measuring a construct other than the primary trait of interest. For example, it should be the case that examinees with low ability on the test trait are not indicated as being of high ability on several of the items on the test. This would indicate that the items measure a different trait or that the primary trait is misspecified. As will be discussed in detail, the determination of violations of fit is closely linked to the test evaluation model which is applied.
Item response theory (IRT) is increasingly being applied to language test evaluation, and ensuring the two requirements mentioned above is one consideration in determining which IRT model and which item statistics to employ (cf. Lord, 1980, Chapter 10 for a lengthy discussion of item selection). While the one-parameter logistic model, often referred to as the Rasch model, has frequently been criticized for failing to account for differences in item discriminability (Hambleton and Swaminathan, 1985), the two-parameter logistic model is criticized both for the potential lack of stability in estimation of the slope parameter with small numbers of examinees and for allowing items with divergent discriminability to remain on a test and hence possibly introduce multidimensionality (Hulin, Lissak and Drasgow, 1982; Lord, 1983; Henning, 1988). However, there are few studies which compare results from the two models in order to examine the effect of these potential problems on real test data, language or otherwise (Dinero & Haertel, 1977; Hambleton, 1979). This study examines the relationships among two item fit indices available with the one-parameter logistic model (infit and outfit), the slope as determined in the two-parameter logistic model, and traditional measures of item discrimination such as the biserial correlation. It examines the role these indices play in item selection for criterion-referenced tests. That is, it is concerned with how these indices affect selection of items which are designed
to assess performance of a well-defined domain (Popham, 1978; Hudson & Lynch, 1984; Brown, 1988).
Background

The one-parameter and two-parameter logistic models take the following forms respectively in estimation of item parameters:
(1)  Pi(θ) = 1 / (1 + exp[-D(θ - bi)])

(2)  Pi(θ) = 1 / (1 + exp[-Dai(θ - bi)])
In these two formulas, Pi(θ) is the probability that an examinee with ability θ answers item i correctly. The parameter bi is the item difficulty of item i and D is a constant which is a scaling factor, usually set at D = 1.7. Also, exp indicates powers of e = 2.71828, the base of natural (Naperian) logarithms. The two-parameter model in Formula (2) includes ai, the item discrimination of item i. This item discrimination parameter, or slope, ai is defined as theoretically having a scale ranging from minus infinity to plus infinity. High values of ai (e.g., 0.9 and above) indicate slopes that are very steep and low values (e.g., 0.3 and below) indicate slopes that increase more gradually as a function of ability (Hambleton and Cook, 1977). Thus, items which have a slope with a high value discriminate over a narrow range of the θ scale, while those items with a lower value discriminate over a broader θ band. Thus, ai can be interpreted as providing information conceptually similar to that of traditional item discrimination indices such as the biserial correlation. Lord (1980) has shown a strong relationship between ai and the biserial correlation. For example, he has shown the following correspondences between ai and the biserial: .20 to .2; .44 to .4; .75 to .6; .98 to .70; 1.33 to .80; and 2.06 to .90.
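The two formulas, and the slope-biserial correspondence that Lord reports, can be sketched numerically. The following is a minimal illustration in Python; the function names are my own, D = 1.7 as in the formulas, and the relation a = ρ / √(1 - ρ²) is the standard normal-ogive approximation that reproduces the paired values cited above.

```python
import math

D = 1.7  # scaling constant from Formulas (1) and (2)

def p_correct(theta, b, a=1.0):
    """Two-parameter logistic probability of a correct response.

    With the slope a fixed at 1.0 for every item, this reduces to the
    one-parameter (Rasch-type) form in Formula (1).
    """
    return 1.0 / (1.0 + math.exp(-D * a * (theta - b)))

def slope_from_biserial(r_bis):
    """Approximate 2PL slope corresponding to a biserial correlation."""
    return r_bis / math.sqrt(1.0 - r_bis ** 2)

# Reproduces the correspondences cited in the text:
# a biserial of .60 pairs with a slope of about .75, and .80 with 1.33.
print(round(slope_from_biserial(0.60), 2))  # 0.75
print(round(slope_from_biserial(0.80), 2))  # 1.33
```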
This issue of item discriminability has implications for test construction in settings in which criterion-referenced testing is appropriate (Hudson & Lynch, 1984; Brown, 1988; Bachman, 1989; Hudson, 1989). In order to place the discussion in context, assume that analysis of a test indicates the set of three item characteristic curves in Figure 1. Additionally, assume that if θ < 0.0 then the examinee should be classed as being a Level A student while if θ > 0.0 then the examinee is a Level B student. The purpose of the test is to place the
student into one of the two ability levels.
Figure 1
Hypothetical test item slopes
[Figure not reproduced: three item characteristic curves plotting probability correct (.00 to 1.00) against ability (-4.00 to +4.00), with Level A covering abilities below 0.0 and Level B those above 0.0.]
Item #2 has a low slope across the entire test. Although its point of inflection is approximately 0.0 on the ability scale, it supplies little discrimination between Level A and Level B. That is, examinees with an ability level of -4.00 have a 27% chance of getting the item correct while examinees with high ability level of +4.00 only have a probability of approximately 63% of getting the item correct. Thus, while the item does discriminate maximally at the 0.0 level, it discriminates very little. Item #3 has a very high slope within Level B, but supplies information over a very narrow range within Level B. Items with discriminations this high have very restricted utility in discriminating between Level A and Level B. However, Item #1 provides a great deal of information near the 0.0 point, the point which is supposed to divide Level A from Level B. Items #2 and #3 are not satisfactory while Item #1 is a good item for
discrimination and placement. It is most discriminating near the point of interest and discriminates over the range of abilities of interest. The importance of the slope in item selection does not have to do with whether one item has a slope that is similar to the slope of other items. Rather, it is an issue of whether the items which are selected have slopes like that of Item #1 in Figure 1.
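The contrast among the three hypothetical items can be sketched with the two-parameter formula. The slope and difficulty values below are illustrative stand-ins chosen to mimic the shapes described for Figure 1; they are not taken from the figure itself.

```python
import math

D = 1.7

def icc(theta, a, b):
    """Two-parameter item characteristic curve: P(correct | theta)."""
    return 1.0 / (1.0 + math.exp(-D * a * (theta - b)))

# (slope a, difficulty b) for three illustrative items:
items = {
    "flat (Item #2-like)":     (0.15, 0.0),  # low slope at the cut point
    "moderate (Item #1-like)": (0.70, 0.0),  # useful slope at the cut point
    "steep (Item #3-like)":    (2.50, 1.5),  # very steep, but inside Level B
}

# How sharply each item separates examinees just below and just above
# the Level A / Level B cut point at theta = 0.0:
spreads = {}
for name, (a, b) in items.items():
    p_lo, p_hi = icc(-1.0, a, b), icc(1.0, a, b)
    spreads[name] = p_hi - p_lo
    print(f"{name}: P(-1)={p_lo:.2f}, P(+1)={p_hi:.2f}, spread={spreads[name]:.2f}")
```

Only the moderate item yields a large probability spread across the cut point; the flat item barely separates the levels anywhere, and the steep item does its separating well inside Level B.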
As noted above, the two models treat slope differently. By not accounting for the slope parameter, the one-parameter model assumes that differences in discriminability of items are of no concern in the estimation of item parameters. That is, it differs from the two-parameter model in that in the process of estimating the item characteristic curve (ICC) all items are assumed to have equal discriminating power and vary only as to item difficulty. In terms of item fit, it is argued for the one-parameter model that items with very different discriminabilities necessarily do not fit the model and should be eliminated. The one-parameter model provides a variety of indices to determine item fit. Henning (1984, 1988, & personal communication) considers the Rasch model fit statistics to be tests of unidimensionality. He is of the opinion that including items with divergent discriminabilities and guessing levels can contribute to multidimensionality. Two of these fit statistics are infit and outfit (Wright and Linacre, 1984). Outfit is the standardized weighted mean square residual of unlikely item responses while infit is a weighted derivation of the outfit statistic which is focused on the area where responses deliver the most information (Henning, 1988). For the infit statistic, the deviations used in calculations of the outfit are weighted by the amount of information provided by each response. In this way, infit is not as sensitive to individual outliers which are distant from the item as the outfit statistic. Wright and Linacre (1984) indicate that items outside a band of -2.0 to 2.0 should be examined as possibly misfitting. According to this view, the Rasch fit tests all attempt to improve on unidimensionality by eliminating as misfitting those items that have highly divergent discriminabilities as well as those which have aberrant response patterns. Only those within a range of homogeneity of discriminability are judged to fit the model while others are rejected. 
Indeed, items that show extreme departure from the assumption are removed from the test through the tests of fit.
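A rough sketch of how outfit and infit can be computed from Rasch residuals follows. This shows the mean-square forms (the standardized values reported later in Table 3 are a further transformation of these); the response vector, ability estimates, and difficulty are invented for illustration.

```python
import math

def rasch_p(theta, b):
    """One-parameter (Rasch) probability of a correct response."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def fit_mean_squares(responses, thetas, b):
    """Outfit and infit mean squares for a single item.

    responses: 0/1 answers to the item; thetas: examinee ability
    estimates; b: the item's difficulty.
    """
    z2, info = [], []
    for x, th in zip(responses, thetas):
        p = rasch_p(th, b)
        w = p * (1.0 - p)               # information carried by this response
        z2.append((x - p) ** 2 / w)     # squared standardized residual
        info.append(w)
    outfit = sum(z2) / len(z2)          # unweighted mean square
    # Infit weights each residual by its information, so responses far
    # from the item's difficulty (outliers) count for less:
    infit = sum(z * w for z, w in zip(z2, info)) / sum(info)
    return infit, outfit

# A lucky correct answer by a very low-ability examinee inflates outfit
# much more than infit, as the text describes:
thetas = [-3.0, -1.0, 0.0, 1.0, 3.0]
responses = [1, 0, 0, 1, 1]   # the theta = -3.0 examinee answers correctly
infit, outfit = fit_mean_squares(responses, thetas, b=0.0)
print(infit, outfit)
```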
There are two potential difficulties with interpretation of the Rasch model fit statistics. Both problems relate to the fact that a single number represents
violations of fit regardless of the cause of lack of fit. That is, although the Rasch model assumes that the discriminations of all items are identical, and consequently indicates which items have extreme departure from that assumption, the fit statistics also theoretically identify items which have very unlikely examinee response patterns. The first problem of interpretation of an unacceptable fit value, then, is determining whether the item is unacceptable because of an anomaly in the response pattern or because the item has divergent discriminability. The second difficulty is that the statistic will not indicate which items have slopes that are either extremely flat or steep and hence not productive for level determination. It is possible that if all items on a test had an equally extremely steep or flat slope the fit statistics would not indicate that this was the case because the items would not have highly divergent discriminabilities. They would have homogeneous low or high discriminabilities. Further, it is not clear how divergent an item discrimination must be in order to be considered highly divergent.
An additional issue in determination of which model to use for analysis concerns the contributions that inclusion of the slope can have in test design. When there is a large enough sample size (e.g., 200 or above) and a sufficient number of items (e.g., 60 or above) to allow use of the two-parameter model, the inclusion of ai is useful in determining how effective an item may be. That is, just as bi determines the point at which the item is providing the most information, ai indicates just how much discriminating power the item will have between examinees at differing θ levels. Hambleton (1979) has criticized the one-parameter model for rejecting too many items as misfitting in eliminating items with extreme discriminations. The two-parameter model in effect allows items at extreme ends by including the slope and then relying on χ² goodness-of-fit statistics to identify misfitting items (Mislevy and Bock, 1986, 1990). However, it is clear that it is not satisfactory merely to include ai in calibrations and then rely on a χ² goodness-of-fit statistic to eliminate items with unexpected response patterns. Items at the extremes may simply not be desirable in certain contexts and should be eliminated. That is, there may be a rationale for eliminating items with extremely low or extremely high slopes, independent of any reference to the one-parameter model assumptions.
This last point is an important issue. That is, the items should be eliminated because they are not productive, not because of a hypothesis that they violate the assumption of unidimensionality, and they should not be
retained merely because the slope can be taken into account in determining other item parameters. The items should be eliminated or retained because of their utility. It is not clear that items which have different discriminations are necessarily violating the assumption of unidimensionality. Items have different item difficulties but no one claims that this violates unidimensionality. Any psychometric model adopted must allow for the fact that language skills vary both in their difficulty and in over what levels of ability they remain difficult. That one item measures a language skill which is learned quickly and another item measures a language skill which is learned more slowly should not be taken to indicate that the test is not a unidimensional language test. Thus, items with extreme discriminations may be eliminated because they do not function well given the purpose of the instrument of which they are a part. They should not necessarily be eliminated because they have discriminations different from the average discriminations of the remaining items on the test. Analysis of item discrimination, whether from the one-parameter or the two-parameter perspective, requires that a less purist view be taken. In practice not all items will have exactly the same discriminability and these differences in discriminability should be accounted for when possible. The differences between items with a slope of 0.40 and those with slopes of 0.90 are important differences to take into account when determining item difficulty. If it is possible to employ the two-parameter model then these differences can be incorporated. They are additional sources of information. However, it is also desirable to take the slope into account when determining which items to retain.
In short, there appear to be at least five sources of information which may help ensure that test items are maximally useful within the level for which they have been designed as well as congruent with the trait they are supposed to be measuring. These are the infit, outfit, slope, biserial and a χ² goodness-of-fit statistic for expected item responses. The remainder of this study will examine the relationships among these indices. For this discussion, suspect items will be those which have a slope below .40 or above .90. These values have been selected because items with a slope of less than .40 have very little discrimination at all and items with a slope above .90 discriminate over a very restricted range of interest. Additionally, items with an infit or outfit greater than 2.0 or less than -2.0 are labeled as suspect in order to examine the values
suggested by Wright and Linacre (1984).
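The screening rules above can be expressed as a simple filter. The thresholds are taken from the text; the item records and function name are illustrative.

```python
def is_suspect(slope, infit, outfit,
               slope_lo=0.40, slope_hi=0.90, fit_band=2.0):
    """Flag an item as suspect under the criteria used in this study:
    a slope below .40 or above .90, or a standardized infit/outfit
    outside the -2.0 to +2.0 band suggested by Wright and Linacre."""
    return (slope < slope_lo or slope > slope_hi
            or abs(infit) > fit_band or abs(outfit) > fit_band)

# Illustrative item records: (slope, infit, outfit)
items = {
    "flat":  (0.20,  5.8,  5.1),   # too little discrimination
    "good":  (0.55,  0.8, -0.1),   # within all bands
    "steep": (1.20, -3.8, -3.2),   # too narrow a range of discrimination
}
flags = {name: is_suspect(*vals) for name, vals in items.items()}
print(flags)  # only "good" passes
```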
PROCEDURE

Instruments
The test instruments were two pilot versions of the General Tests of English Language Proficiency (GTELP) developed by the National Education Corporation. Form A had 94 items and Form B had 95 items. The test consists of three subtests (reading, listening and grammar), which are used to place examinees into one of three functional ability levels.
Subjects

The subjects were 523 examinees in Japan, 105 in Saudi Arabia and 356 in Mexico, from a variety of academic and occupational backgrounds. Each examinee took one form of the test. Of these examinees, only the 830 with complete data sets who also scored 25% or higher on the total test were retained for analysis.
Descriptive information

Basic descriptive information for the tests, subtests and subtests by level of items for the two forms is presented in Table 1 and Table 2 below. The two forms of the test have very similar mean percent scores and percent score standard deviations on subtests, levels and total score. Additionally, it should be noted that the percent scores are low. The mean percent score on Form A is 52.39 and on Form B is 55.66. The tests as a whole appear to be difficult for the subjects in this test sample. Each form and each subtest has acceptable α reliabilities.
Table 1
Descriptive information Form A (n=430)

            Mean     SD      Mean %    % SD     Items    α
                             correct
Total       49.45    17.29   52.39     18.39    94       .951
Grammar     18.83     7.14   62.77     23.79    30       .907
Listening   15.39     6.54   43.98     18.69    35       .853
Reading     15.22     5.62   52.50     19.38    29       .854
Table 2
Descriptive information Form B (n=400)

            Mean     SD      Mean %    % SD     Items    α
                             correct
Total       52.88    17.46   55.66     18.38    95       .946
Grammar     19.45     6.95   64.83     23.17    30       .886
Listening   17.03     6.37   48.64     18.19    35       .837
Reading     16.40     5.98   54.67     19.92    30       .848
Item Analysis

Item analysis involves IRT examination of the two-parameter slope, χ² statistic and biserials from PC-Bilog (Mislevy and Bock, 1986), as well as an examination of the one-parameter infit and outfit statistics computed with the Microscale program (Wright & Linacre, 1984). Item difficulties are computed using the two-parameter model, and the two forms were linked on 10 items from a common multiple-choice cloze test given to all examinees.
The item information for each item is presented in Table 3. This information includes the slope, infit, outfit, biserial correlation, χ² goodness-of-fit statistic, and the two-parameter item difficulty. Items are rank ordered from the lowest slope to the highest slope. The item identification at the left contains two characters indicating which subtest the item tested (RD=reading, LS=listening, GR=grammar), followed by a character designating which form of the test contained the item (A or B), and then the sequential item number on its subtest. Suspect items (a slope below .40 or above .90, an infit or outfit less than -2.0 or greater than +2.0, and a χ² probability below .05) are indicated with an asterisk in the appropriate column of values.
Table 3
Item information sorted by slope

ITEM     SLOPE     INFIT      OUTFIT     BISER    χ²        DIFFICULTY
RDA10   *0.083    *13.451    *11.935     0.260   *0.0000     0.070
LSA32   *0.151     *7.528     *7.851     0.007    0.2934     2.452
GRB27   *0.168     *6.197     *6.581     0.044    0.2471     2.403
LSB35   *0.176     *6.237     *6.337     0.051    0.1100     2.007
LSA35   *0.187     *6.528     *6.186     0.088    0.3652     1.929
LSB29   *0.187     *6.733     *6.544     0.084    0.5695     1.254
RDA28   *0.199     *5.786     *5.093     0.115    0.5223     2.645
RDB20   *0.208     *6.193     *4.377     0.106    0.1082    -1.505
RDB26   *0.211     *4.628     *5.116     0.119    0.4388     2.532
LSB06   *0.215     *6.007     *6.039     0.162    0.5785     0.936
GRA22   *0.225     *6.602     *6.777     0.185    0.0510     0.088
LSA31   *0.235     *4.296     *5.948     0.157   *0.0014     2.218
RDB27   *0.239     *3.830     *4.289     0.154    0.3149     2.535
RDB28   *0.246     *4.284     *4.237     0.186    0.1179     1.945
LSB31   *0.267     *2.751     *4.263     0.174    0.1900     2.693
RDB29   *0.269     *5.194     *4.751     0.193   *0.0066     0.287
RDB23   *0.290      1.822     *2.546     0.196    0.2051    -2.989
RDA16   *0.293     *4.085     *3.840     0.235    0.9317    -1.371
LSB07   *0.312     *3.588     *3.668     0.293    0.8312     0.957
RDA21   *0.313      0.940      1.496     0.196   *0.0104    -4.003
LSA05   *0.316     *4.738     *2.556     0.267    0.2842    -0.514
LSB11   *0.317     *3.857     *3.583     0.286    0.2078    -0.270
LSB30   *0.317     *3.524     *3.866     0.284    0.7884     1.104
LSA15   *0.319      1.955     *3.003     0.253    0.3147     3.200
LSA01   *0.320     *4.775     *3.401     0.271    0.1013    -0.022
LSB32   *0.324     *2.135     *4.647     0.256   *0.0019     1.865
LSB28   *0.326     *3.786     *4.078     0.274    0.5511     0.397
RDA24   *0.345     *2.233     *3.387     0.277   *0.0076     2.254
RDB30   *0.352      1.509     *2.513     0.296   *0.0467     2.698
LSA16   *0.365     *2.479     *3.056     0.321    0.8258     t.m
Table 3 (continued)
Item information sorted by slope

ITEM     SLOPE     INFIT     OUTFIT    BISER    χ²        DIFFICULTY
RDB14   *0.369    *2.866     0.991     0.322   *0.0152    -1.162
LSA33   *0.374     1.690    *2.715     0.316   *0.0228     2.100
RDA19   *0.374    *2.604     1.042     0.312   *0.0160    -1.514
GRA03   *0.376    *3.326    *3.092     0.348    0.6860     0.664
LSB15   *0.380    *2.574    *2.866     0.335    0.3415     0.808
RDA25   *0.381    *3.162    *3.355     0.349    0.9705     0.743
LSA34   *0.385    *2.802    *3.204     0.359    0.3473     0.873
LSA21   *0.387    *2.549    *3.098     0.346    0.7088     1.097
LSA29   *0.396     0.926    *3.243     0.334   *0.0424     2.196
LSB26    0.407    *2.403    *2.097     0.364    0.1178    -0.561
LSB09    0.408    *2.211     1.970     0.369    0.2451    -0.584
LSB12    0.417    *2.452    *2.531     0.360    0.4755    -0.239
LSA25    0.426     0.918    *2.637     0.372   *0.0233     2.014
GRA13    0.428    *2.730    *2.141     0.361    0.1209    -0.221
LSA27    0.428     0.504    *2.671     0.381   *0.0041     2.002
LSA11    0.435    *2.370     1.414     0.391    0.2137    -0.168
RDA11    0.443    *2.355     0.993     0.395    0.2963    -0.462
LSA14    0.446     1.201     0.861     0.357    0.4888    -1.478
LSA23    0.447     1.815    *2.179     0.403    0.8206     0.645
RDA27    0.447     1.776     1.864     0.426    0.5951     0.750
RDA20    0.448     0.842    -0.183     0.322   *0.0185    -2.576
LSA08    0.459    *2.043    *2.457     0.416    0.8050     0.049
RDB09    0.460     1.727     1.652     0.415    0.4262     0.314
GRA15    0.461     1.823     1.612     0.417    0.7093     0.215
RDB13    0.463     1.393     1.261     0.409    0.3147    -0.771
RDB24    0.463     1.032    -0.021     0.381    0.2636    -1.653
GRB01    0.464     0.411     0.236     0.326    0.4096    -2.646
LSB16    0.469     0.561     1.593     0.413    0.1227     1.337
GRB04    0.470     1.699     0.586     0.407    0.4740    -0.282
LSB18    0.471     1.195     1.678     0.405    0.9678     0.868
Table 3 (continued)
Item information sorted by slope

ITEM     SLOPE     INFIT     OUTFIT    BISER    χ²        DIFFICULTY
GRA21    0.488     1.680     1.691     0.442    0.6198     0.002
LSA19    0.491     1.737     0.835     0.441    0.8749    -0.266
RDB16    0.491     1.800     0.288     0.409   *0.0078    -0.842
LSB19    0.493     0.929     1.511     0.431    0.7525     0.967
RDB08    0.494     0.576    *2.587     0.438    0.4440    -0.918
LSA26    0.496     1.269     1.209     0.460    0.9504     0.391
LSA09    0.518     0.930     1.336     0.470    0.4858     0.002
RDB10    0.526     0.692    -0.261     0.455    0.6379    -0.326
LSA06    0.529     0.927     0.365     0.462    o.9m      -0.211
LSB10    0.530     0.564     0.468     0.481    0.9292     0.333
LSB34    0.530    -0.134     1.801     0.445    0.0573     1.517
GRA10    0.531     0.788     0.962     0.457    0.3767    -0.477
LSA28    0.531     0.794     0.400     0.475    0.4554    -0.090
RDB25    0.532     0.578     0.422     0.488    0.8902     0.575
GRB12    0.533     0.688     0.426     0.486    0.4072     0.380
LSB21    0.551     0.821    -0.040     0.448    0.5146    -0.627
GRB14    0.559     0.257     0.626     0.490    0.8024     0.097
LSB20    0.559    -0.277     0.646     0.495    0.0608     0.927
LSA24    0.560     0.387    -0.114     0.463    0.7128    -0.779
RDA18    0.563     0.608    -0.389     0.473    0.8732    -0.421
RDB21    0.565    -0.022    -0.263     0.512    0.8299    -0.004
RDA29    0.566     0.344     0.259     0.536    0.6042     0.196
RDB22    0.584     0.149    -0.186     0.459    0.1978    -1.596
LSB25    0.589    -0.216     0.085     0.510    0.3445     0.306
RDB18    0.590    -0.349    -0.148     0.498   *0.0213     0.165
LSB13    0.597    -0.455    -0.215     0.521    0.6401     0.268
GRA07    0.599    -0.324     0.258     0.449    0.1575    -1.373
LSA02    0.601     0.199    -0.680     0.440   *0.0171    -1.535
GRB03    0.613    -0.108    -0.923     0.505    0.3465    -0.363
LSB08    0.613    -0.397    -0.750     0.507    0.3624    -0.817
Table 3 (continued)
Item information sorted by slope

ITEM     SLOPE     INFIT     OUTFIT    BISER    χ²        DIFFICULTY
LSA20    0.618    -0.367    -0.553     0.528    0.7265    -0.275
GRA14    0.619    -0.476     0.422     0.542    0.6469     1.158
LSA22    0.621    -0.182    -0.249     0.540    0.7788     1.617
LSB27    0.623    -0.542    -0.204     0.529    0.6700     0.300
RDA03    0.634    -0.546    -0.468     0.537    0.2381    -0.326
GRB18    0.634    -0.317    -0.118     0.530    0.3721    -0.368
LSA10    0.638    -0.266    -0.292     0.554    0.6051     0.528
GRB26    0.641    -0.386     1.307     0.532    0.4257    -0.328
GRB09    0.645    -1.189     0.645     0.554    0.1722    -0.312
LSA03    0.647    -0.855    -0.176     0.519    0.7212    -0.843
LSB14    0.651    -0.176     1.417     0.408    0.1485    -2.604
LSB22    0.658    -1.057    -1.034     0.566    0.8676     0.534
RDB19    0.674    -1.564    -0.996     0.582    0.1735     0.537
LSA30    0.676    -1.361    -0.856     0.591    0.5386     0.275
GRB25    0.676    -0.811    -1.100     0.548    0.8221    -0.424
RDA04    0.678    -0.234    -1.040     0.456    0.6234    -1.608
RDB17    0.682    -0.823    -1.707     0.546    0.1002    -0.667
LSB33    0.685    -1.298    -1.357     0.582    0.4397     0.346
GRA04    0.686    -1.395     0.750     0.521   *0.0114    -1.019
GRB13    0.690    -0.596    -0.927     0.508    0.1370    -1.155
LSB17    0.697    -1.203    -1.375     0.560    0.7305    -0.502
RDB12    0.697    -0.801    -1.086     0.538    0.4107    -1.028
RDA06    0.702    -0.808    -1.608     0.544    0.8819    -0.457
GRB05    0.705    -0.526    -0.655     0.527    0.5541    -0.947
LSB23    0.706    -1.187    -1.126     0.573    0.8354    -0.616
LSA13    0.717    -1.658    -1.374     0.610    0.5635     0.306
RDB02    0.718    -1.775    -0.983     0.585    0.6380    -0.234
RDA26    0.731    -1.234    -1.227     0.642    0.9946     1.083
RDB07    0.731    -0.734    -0.766     0.525    0.3757    -1.532
RDA15    0.734    -1.905    -0.555     0.585    0.5619    -0.437
Table 3 (continued)
Item information sorted by slope

ITEM     SLOPE     INFIT     OUTFIT    BISER    χ²        DIFFICULTY
GRB21    0.736    -1.639    -0.467     0.576    0.5872    -0.856
GRB23    0.740    -1.323    -0.562     0.575    0.4594    -0.580
RDB15    0.741    -1.524    -1.760     0.587    0.2035    -0.544
GRA24    0.743    -1.228    -1.230     0.555    0.9092    -0.818
GRB06    0.744    -1.170    -1.840     0.568    0.5399    -0.835
RDA17    0.747    -1.558    -1.372     0.563    0.7787    -0.477
RDA02    0.751    -1.681    -1.130     0.568    0.4976    -0.693
RDA30    0.755    -1.750    -1.502     0.658    0.8308     0.399
RDA08    0.760    -0.497    -1.478     0.509    0.6145    -1.582
RDA01    0.765    -1.265    -1.792     0.560    0.8135    -1.114
RDA14    0.765    -1.985    -1.879     0.605    0.5166    -0.387
RDA09    0.771   *-2.345    -1.445     0.604   *0.0146    -0.422
RDA12    0.773    -1.207    -1.724     0.561    0.5871    -1.104
GRB02    0.781    -1.777    -1.089     0.600    0.5184    -0.611
GRB28    0.789    -2.266    -0.794     0.623    0.2882    -0.325
GRA02    0.806    -1.935    -1.925     0.580    0.5870    -0.427
GRB19    0.806    -1.319    -1.064     0.586    0.8341    -0.966
GRA27    0.808   *-2.341   *-2.522     0.623    0.7722    -0.066
LSA18    0.809   *-2.244   *-2.446     0.619   *0.0096    -0.163
RDA13    0.811   *-2.798   *-2.010     0.663    0.1615     0.762
RDA05    0.816   *-2.041    -1.638     0.593    0.5269    -0.788
GRB30    0.816   *-2.058    -1.703     0.624    0.0829    -0.485
GRB08    0.824    -1.303    -1.106     0.578    0.8942    -1.127
GRB17    0.833    -1.422    -1.309     0.603    0.7806    -1.082
GRA30    0.835   *-3.022   *-2.550     0.655    0.0949    -0.001
GRB07    0.836   *-2.516   *-2.761     0.646    0.8490     0.047
GRA06    0.839   *-3.175   *-2.199     0.657    0.3578     0.027
LSA17    0.849   *-2.953   *-2.981     0.672    0.2343     0.121
LSA12    0.852   *-3.119   *-2.368     0.683   *0.0288     0.513
LSB01    0.855   *-3.043   *-3.106     0.652   *0.0111    -0.172
Table 3 (continued)
Item information sorted by slope

ITEM     SLOPE     INFIT     OUTFIT    BISER    χ²        DIFFICULTY
GRA16    0.877   *-2.936    -1.523     0.624    0.3699    -0.635
GRB10    0.889    -1.058    -1.797     0.574    0.2876    -1.525
RDB06    0.892   *-3.166    -1.505     0.659    0.8276    -0.462
GRB20   *0.908   *-2.424   *-2.617     0.653    0.9787    -0.627
GRA17   *0.911   *-3.177   *-3.148     0.653    0.9811    -0.254
LSB05   *0.912   *-2.371   *-2.685     0.639    0.9185    -0.612
GRA28   *0.917   *-3.551   *-2.060     0.668    0.2184    -0.391
RDA23   *0.920   *-3.564   *-3.239     0.691    0.1659    -0.133
GRA08   *0.944   *-2.958   *-2.070     0.644    0.1906    -0.651
LSB24   *0.955    -1.294    -1.785     0.609    0.9117    -1.338
RDA22   *0.973   *-4.267   *-3.542     0.724    0.3044     0.001
GRB16   *1.003    -1.211    -1.925     0.607    0.0606    -1.390
GRA23   *1.007   *-3.159    -1.795     0.639    0.3263    -0.781
RDB05   *1.008   *-4.449   *-4.066     0.705    0.7869    -0.139
LSB03   *1.009   *-2.165    -1.588     0.659    0.7123    -1.070
LSA04   *1.011    -1.710    -1.245     0.614    0.1021    -1.148
LSA07   *1.025   *-4.244   *-3.856     0.707    0.1518    -0.402
RDB01   *1.029   *-4.788   *-4.044     0.722    0.2577    -0.236
RDB11   *1.048   *-4.101   *-3.721     0.714    0.4153    -0.544
GRA09   *1.050   *-4.459   *-2.666     0.693    0.1634    -0.423
LSB02   *1.071   *-3.815   *-3.343     0.718    0.2017    -0.567
GRA19   *1.076    -1.881    -1.934     0.619    0.3809    -1.141
GRB15   *1.091   *-4.286   *-3.847     0.726    0.8575    -0.364
GRA20   *1.116   *-4.184   *-3.461     0.713    0.9712    -0.546
LSB04   *1.127    -0.673   *-2.026     0.598    0.0541    -1.687
GRB29   *1.173   *-2.596   *-3.018     0.738    0.8882    -0.932
GRA29   *1.177    -1.678   *-2.449     0.636    0.1938    -1.148
GRA01   *1.203   *-5.356   *-3.192     0.730    0.1774    -0.511
GRA12   *1.218    -1.686    -1.759     0.623    0.2219    -1.290
GRB24   *1.223   *-2.218   *-2.546     0.703    0.7527    -1.118
Table 3 (continued)
Item information sorted by slope

ITEM     SLOPE     INFIT     OUTFIT    BISER    χ²        DIFFICULTY
RDB04   *1.224   *-3.765   *-3.202     0.731    0.8026    -0.845
RDB03   *1.424   *-5.745   *-4.299     0.804    0.2764    -0.538
GRB22   *1.493    -1.536   *-2.612     0.727    0.4524    -1.332
GRA11   *1.509    -0.932   *-2.153     0.617    0.7110    -1.479
GRA18   *1.545   *-5.017   *-2.951     0.776    0.4014    -0.781
GRA26   *1.582   *-5.087   *-4.301     0.773    0.9025    -0.678
GRA05   *1.642   *-2.084   *-2.477     0.701    0.8246    -1.179
GRA25   *1.864   *-5.119   *-4.463     0.801    0.2318    -0.763
GRB11   *2.126   *-4.067   *-4.102     0.848    0.5185    -0.970
In examining the patterns in Table 3, several points seem clear. First, there is a close relationship between slope, infit, outfit and the biserial. In general, low slope is associated with high infit and outfit values and low biserial correlations. Additionally, high slope is associated with low infit and outfit values and high biserial correlations. Correlations among the different indices and difficulty are presented in Table 4.
Table 4
Correlations of indices

          INFIT     OUTFIT    SLOPE     BISER
OUTFIT     0.954
SLOPE     -0.831    -0.841
BISER     -0.960    -0.958     0.872
DIFF       0.394     0.519     0.437     0.405
Since infit is a function of outfit the strong correlation (0.954) is to be expected. That is, they are not independent values. However, the strong overall correlation of the slope with infit and outfit (-0.831 and -0.841) is important. The steeper the slope, and hence the narrower its range of discrimination, the lower the fit values. Thus, at the extreme ends of the slope continuum the fit indices demonstrate extreme values. This is plotted in Figure 2 and Figure 3 with the regression line of predicted values indicated.
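The direction of these correlations can be checked directly from item-level values. Here is a sketch with a handful of illustrative rows that echo the pattern of Table 3 (low slopes paired with high positive infit, high slopes with negative infit); the values are not the actual data.

```python
import math

def pearson(x, y):
    """Pearson product-moment correlation of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Illustrative item-level values following the direction of Table 3:
slope = [0.08, 0.21, 0.45, 0.70, 0.97, 1.50]
infit = [13.5, 4.6, 2.4, -0.8, -4.3, -5.0]
r = pearson(slope, infit)
print(round(r, 2))  # strongly negative, as in Table 4
```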
Figure 2
A scatterplot of the two-parameter slope to the one-parameter infit estimates. "." on diagonal represents predicted values of regression line. A = 1 observation, B = 2 observations, etc.
[Scatterplot not reproduced: slope (0.0 to 2.2) on the vertical axis, infit (-6 to 14) on the horizontal axis.]
Figure 3
A scatterplot of the two-parameter slope to the one-parameter outfit estimates. "." on diagonal represents predicted values of regression line. A = 1 observation, B = 2 observations, etc.
[Scatterplot not reproduced: slope (0.0 to 2.2) on the vertical axis, outfit (-6 to 12) on the horizontal axis.]
Figures 2 and 3 show that items with a slope above 0.9 or below 0.3 are generally also at extreme ends of the fit values. However, Table 4 indicates that the strongest correlation each of the three indices has is with the biserial. That is, if the item does not show a strong relationship to overall test outcome, then both the dispersion around the slope will be great and the residuals will be affected. This is shown in Figures 4, 5 and 6 below.
Figure 4
A scatterplot of biserial to the two-parameter slope. "." on diagonal represents predicted values of regression line. A = 1 observation, B = 2 observations, etc.
[Scatterplot not reproduced: biserial (-0.25 to 1.00) on the vertical axis, slope (0.0 to 2.5) on the horizontal axis.]
Figure 5. A scatterplot of the biserial against the one-parameter infit estimates. "." on the diagonal represents predicted values of the regression line; A = 1 observation, B = 2 observations, etc. [Character plot not reproduced: biserial runs from -0.25 to 1.00 on the vertical axis, infit from -6 to 14 on the horizontal axis.]
Figure 6. A scatterplot of the biserial against the one-parameter outfit estimates. "." on the diagonal represents predicted values of the regression line; A = 1 observation, B = 2 observations, etc. [Character plot not reproduced: biserial runs from -0.25 to 1.00 on the vertical axis, outfit from -6 to 12 on the horizontal axis.]
The plots in Figures 4, 5 and 6 show two patterns in the relationship of the biserial to the fit statistic values. First, items with high biserials increasingly show extreme negative fit values, while those with low biserials show increasingly high fit values. Second, this relationship is less strong between the slope and the biserial than between the two fit statistics and the biserial.
Finally, item difficulty has the weakest correlational relationship to all of the indices in Table 4. An examination of Figures 7, 8, 9 and 10 indicates that the relationship truly is weak, and that this is not merely an issue of curvilinearity masking a relationship.
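The curvilinearity question can be checked directly: if a U-shaped relation between difficulty and an index were being masked by a near-zero linear correlation, a quadratic fit would explain markedly more variance than a linear one. A sketch with illustrative data (the study's actual item estimates are not reproduced here):

```python
import numpy as np

def r_squared(y, y_hat):
    """Proportion of variance in y explained by the fitted values y_hat."""
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

def curvilinearity_check(diff, index):
    """Compare linear vs. quadratic polynomial fits of an item index
    (e.g., slope or biserial) on difficulty.  A quadratic R^2 far above
    the linear R^2 would signal a curvilinear relation that a simple
    correlation coefficient would miss."""
    lin = np.polyval(np.polyfit(diff, index, 1), diff)
    quad = np.polyval(np.polyfit(diff, index, 2), diff)
    return r_squared(index, lin), r_squared(index, quad)
```

Because the quadratic model nests the linear one, its R^2 can only be equal or higher; the diagnostic is whether the gap is large.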
Figure 7. A scatterplot of the two-parameter slope against the two-parameter difficulty estimate. "." on the diagonal represents predicted values of the regression line; A = 1 observation, B = 2 observations, etc. [Character plot not reproduced: slope runs from 0.0 to 2.2 on the vertical axis, difficulty from -4 to 4 on the horizontal axis.]
Figure 8. A scatterplot of the biserial correlation against the two-parameter difficulty estimate. "." on the diagonal represents predicted values of the regression line; A = 1 observation, B = 2 observations, etc. [Character plot not reproduced: biserial runs from -0.25 to 1.00 on the vertical axis, difficulty from -4 to 4 on the horizontal axis.]
Figure 9. A scatterplot of the one-parameter infit estimates against the two-parameter difficulty estimate. "." on the diagonal represents predicted values of the regression line; A = 1 observation, B = 2 observations, etc. [Character plot not reproduced: infit runs from -6 to 14 on the vertical axis, difficulty from -4 to 4 on the horizontal axis.]
Figure 10. A scatterplot of the one-parameter outfit estimates against the two-parameter difficulty estimate. "." on the diagonal represents predicted values of the regression line; A = 1 observation, B = 2 observations, etc. [Character plot not reproduced: outfit runs from -5 to 14 on the vertical axis, difficulty from -4 to 4 on the horizontal axis.]
There is a general indication that items at the extremes of difficulty or ease likewise have low slope and biserial values. However, this is not a strong pattern, given that items at these extremes fall across a span of slope and biserial values. As for the relationship between difficulty and the fit statistics, Figures 9 and 10 appear to confirm a weak association.
In terms of the interactions between the IRT indices, it should be noted that a close relationship exists between the two-parameter slope information and the one-parameter infit and outfit residual analyses. As noted above, the correlations between the two, -0.831 and -0.841, are strong. Table 3 shows that no items with a slope between 0.456 and 0.765 were flagged as misfitting according to infit, and only one item was identified as misfitting according to outfit. Infit and outfit systematically reject items below a slope of approximately 0.450 and above 0.800. However, Table 3 also indicates that five items with slopes between 0.460 and 0.800 were flagged as misfitting by the chi-square goodness-of-fit analysis. These items were accepted by the infit and outfit indices. This indicates that the two fit tests are less sensitive to deviations from expected response patterns when the slope is not extremely high or extremely low.
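The chi-square goodness-of-fit analysis referred to here compares observed proportions correct with model-expected proportions within ability groupings. The exact grouping and weighting used by PC-BILOG are not reproduced in this paper; the following is a generic Pearson-style sketch of the idea:

```python
import numpy as np

def chi_square_item_fit(x, theta, p_model, n_groups=5):
    """Pearson-style chi-square item fit: bin examinees by ability and
    compare observed vs. model-expected proportions correct per bin.

    x       : (n,) 0/1 responses to a single item
    theta   : (n,) ability estimates
    p_model : callable mapping theta -> P(correct) under the fitted model
    """
    edges = np.quantile(theta, np.linspace(0, 1, n_groups + 1))
    chi2 = 0.0
    for g in range(n_groups):
        if g < n_groups - 1:
            mask = (theta >= edges[g]) & (theta < edges[g + 1])
        else:
            mask = (theta >= edges[g]) & (theta <= edges[g + 1])  # include top edge
        n_g = mask.sum()
        if n_g == 0:
            continue
        obs = x[mask].mean()                 # observed proportion correct in the bin
        exp = p_model(theta[mask]).mean()    # model-expected proportion in the bin
        chi2 += n_g * (obs - exp) ** 2 / (exp * (1.0 - exp))
    return chi2  # refer to a chi-square distribution, df = n_groups minus parameters estimated
```

Unlike the residual-based infit and outfit, this statistic flags any systematic observed-expected discrepancy across the ability range, which is consistent with its catching items the residual indices accepted.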
DISCUSSION
THIS STUDY was designed to address the issues surrounding item discrimination in CRTs. Specifically, it examined the relative contributions of the one-parameter and two-parameter models in constructing tests whose items both discriminate between ability levels and have response patterns consistent with the items fitting the test. Identifying the relationships that exist among the indices can help in test construction in several ways.
In the present study, the finding of high correlations between the biserial and each of the discrimination indices under consideration suggests general relationships of importance for test development. It indicates that items with high biserials should perhaps not be selected automatically, as they would be in conventional test development. Choosing items with the highest biserials may produce a test whose items have extremely high slopes, thus lessening the productivity of the items. That is, items with very high biserials may discriminate very well, but not discriminate what we want discriminated. They may be so discriminating that the actual trait we wish to measure is not addressed. Thus, the conventional wisdom of traditional test analysis, that items with high biserials are precisely the types of items which are desirable, should be reconsidered.
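For reference, the biserial discussed throughout is conventionally obtained from the point-biserial by the normal-ogive correction r_bis = r_pb * sqrt(p*q) / y, where p is the item's proportion correct and y is the standard-normal ordinate at the z-score cutting off proportion p. A sketch of that textbook conversion (not necessarily the exact computation used in the study's analyses):

```python
import numpy as np
from statistics import NormalDist

def point_biserial(item, total):
    """Pearson correlation between a 0/1 item vector and the total score."""
    return np.corrcoef(item, total)[0, 1]

def biserial(item, total):
    """Biserial correlation via the standard conversion from the
    point-biserial: r_bis = r_pb * sqrt(p * q) / y."""
    p = item.mean()
    q = 1.0 - p
    r_pb = point_biserial(item, total)
    # Standard-normal ordinate at the z-score corresponding to the item split
    y = NormalDist().pdf(NormalDist().inv_cdf(p))
    return r_pb * np.sqrt(p * q) / y
```

Since sqrt(p*q)/y is at least about 1.25, the biserial always exceeds the point-biserial in magnitude, and when the underlying score distribution departs from normality it can even exceed 1.0, one reason very high biserials deserve scrutiny rather than automatic selection.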
Additionally, the results indicate a strong relationship between the residual fit statistics and the slope. This is encouraging in that it suggests the one-parameter model fit statistics allow item selection such that the assumption of a common item discrimination parameter may essentially be met. That is, they allow the identification and possible exclusion of items with extreme discriminations. This is potentially very useful in those testing situations which preclude the use of the two-parameter model as a result of small sample sizes or too few items. In general, most settings in which CRTs are needed are precisely those situations which have few examinees and a small number of items associated with each criterion of interest.
There are, however, at least five concerns regarding reliance on the one-parameter model and the associated fit statistics, as well as on the traditional biserial correlation. First, in constructing the test information curve, the two-parameter ai is a useful source of additional information. It is clear that a test constructed by eliminating items deemed unacceptable according to the infit and outfit statistics above would still contain items with some differences in discrimination. That is, the test would still contain items with ai as low as 0.290 and 0.446 and items with ai as high as 1.076 and 1.218. These differences should be accounted for in examinee ability estimates and item difficulty estimates. Thus, the two-parameter model is to be preferred as generally supplying more information. Second, it remains to be seen whether the fit statistics demonstrate the same instability as the slope estimate does with small sample sizes. This will need to be examined in further research. Third, infit and outfit appear to be too heavily a function of the biserial and/or slope, and are consequently not as sensitive to other types of problems in response patterns as would be desirable. That is, the chi-square analysis indicated six items within the range of plus or minus 2.00 infit and outfit as having unacceptable response patterns. It may be that residual analysis alone is not sufficient for the one-parameter model, just as the chi-square analysis is not sufficient for the two-parameter model. For example, the chi-square statistic does not indicate problems with items which have extremely high discrimination values. However, it is clear that we should not confuse item discrimination with fit. The infit and outfit statistics frequently appear to do so, while the chi-square statistic is moot on this point. Fourth, it remains to be seen how the infit and outfit values are affected by the extremely narrow ability distributions associated with many criterion-referenced tests. That is, with the current test, high values of infit and outfit generally reflect low slope and low biserial correlations; with truncated distributions this same relationship might not hold. Fifth, while high values of infit and outfit reflect poor discrimination, and consequently lead to item decisions fairly consistent with those based on the two-parameter slope and the traditional biserial correlation, extremely low values of infit and outfit are more problematic to interpret. For example, in settings which indeed require extremely high slope values in order to discriminate at a very narrow threshold, infit and outfit values below -2.00 should not be used to discard items.
Finally, it must be stressed that none of these statistics alone addresses the content of the items. It is important to link any acceptance or rejection of items to a third source of information: content analysis. Items may be flagged as undesirable on the basis of fit or discrimination, yet have content that is representative of the trait to be measured. In focusing on the relationships among the item statistics used in determining the acceptability or unacceptability of test items, this study does not mean to suggest that these indices alone are sufficient for thorough test development.