IRT ITEM DISCRIMINATION INTERPRETATION IN CRITERION-REFERENCED LANGUAGE TESTING
THOM HUDSON
University of Hawai'i
This study investigates relationships among the IRT one-parameter fit statistics, the two-parameter slope parameter and traditional biserial correlations in terms of the role these indices play in criterion-referenced language test construction. It discusses the assumptions of the two models and how these assumptions can affect criterion-referenced test construction and interpretation. The study then specifically examines how the indices operate in terms of item discrimination when items are designed to determine whether or not an examinee is at a specific level of ability. Examinees in Mexico, Saudi Arabia and Japan were administered one of two forms of a functional test (Form A: n=430, k=94; Form B: n=400, k=95). The data were analyzed using the two IRT models and the results were compared. The results indicate strong relationships between biserial correlation, two-parameter slope, and one-parameter infit and outfit. The implications of these strong relationships for interpreting the indices are discussed.
INTRODUCTION
CRITERION-REFERENCED language tests (CRTs) are generally designed to
distinguish between two or more defined levels of language ability rather than
to determine the status of individuals relative to other individuals who have
taken a test. For example, CRTs are used to determine mastery or non-mastery,
placement into categories such as in need of remediation, readiness or exemption from instruction, or ability levels such as low, middle or high. As a consequence of this goal to discriminate between those examinees who know a
particular body of knowledge or who have particular defined abilities and
those who do not, the tests must contain items which are tailored both in terms
of content and in terms of their psychometric properties to the criteria which have been defined.
The development of such tests presents particular concerns in the area of
University of Hawai'i Working Papers in ESL, Vol. 9, No.2, December 1990, pp. 99-128
item analysis. Two of these concerns involve item fit and item discrimination. First, the items should be maximally discriminating within the defined level of ability at which they are designated, without being so narrowly discriminating as to be of little utility. That is, items will be designated to measure a skill which is assumed to be of interest at some particular ability level. Not only should the difficulty of the item be located within the band of ability with which it is associated, but most of its discriminability should be within that band as well. Second, the items should fit the trait which is being measured (Henning, 1984, 1987, 1988; Wright & Stone, 1979; Mislevy & Bock, 1986, 1990). That is, the response patterns of examinees must be such that no items appear to be measuring a construct other than the primary trait of interest. For example, it should be the case that examinees with low ability on the test trait are not indicated as being of high ability on several of the items on the test. This would indicate that the items measure a different trait or that the primary trait is misspecified. As will be discussed in detail, the determination of violations of fit is closely linked to the test evaluation model which is applied.
Item response theory (IRT) is increasingly being applied to language test evaluation, and ensuring the two requirements mentioned above is one consideration in determining which IRT model and which item statistics to employ (cf. Lord, 1980, Chapter 10 for a lengthy discussion of item selection). While the one-parameter logistic model, often referred to as the Rasch model, has frequently been criticized for failing to account for differences in item discriminability (Hambleton and Swaminathan, 1985), the two-parameter logistic model is criticized both for the potential lack of stability in estimation of the slope parameter with small numbers of examinees and for allowing items with divergent discriminability to remain on a test and hence possibly introduce multidimensionality (Hulin, Lissak and Drasgow, 1982; Lord, 1983; Henning, 1988). However, there are few studies which compare results from the two models in order to examine the effect of these potential problems on real test data, language or otherwise (Dinero & Haertel, 1977; Hambleton, 1979). This study examines the relationships among two item fit indices available with the one-parameter logistic model (infit and outfit), the slope as determined in the two-parameter logistic model, and traditional measures of item discrimination such as the biserial correlation. It examines the role these indices play in item selection for criterion-referenced tests. That is, it is concerned with how these indices affect selection of items which are designed
to assess performance of a well-defined domain (Popham, 1978; Hudson & Lynch, 1984; Brown, 1988).
Background

The one-parameter and two-parameter logistic models take the following forms respectively in estimation of item parameters:
(1)  Pi(θ) = 1 / (1 + exp[-D(θ - bi)])

(2)  Pi(θ) = 1 / (1 + exp[-Dai(θ - bi)])
In these two formulas, Pi(θ) is the probability that an examinee with ability θ answers item i correctly. The parameter bi is the item difficulty of item i and D is a constant which is a scaling factor, usually set at D = 1.7. Also, exp indicates powers of e = 2.71828, the base of natural (Naperian) logarithms. The two-parameter model in Formula (2) includes ai, the item discrimination of item i. This item discrimination parameter, or slope, ai is defined as theoretically having a scale ranging from minus infinity to plus infinity. High values of ai (e.g., 0.9 and above) indicate slopes that are very steep and low values (e.g., 0.3 and below) indicate slopes that increase more gradually as a function of ability (Hambleton and Cook, 1977). Thus, items which have a slope with a high value discriminate over a narrow range of the θ scale, while those items with a lower value discriminate over a broader θ band. Thus, ai can be interpreted as providing information conceptually similar to that of traditional item discrimination indices such as the biserial correlation. Lord (1980) has shown a strong relationship between ai and the biserial correlation. For example, he has shown the following correspondences between ai and the biserial: .20 to .2; .44 to .4; .75 to .6; .98 to .70; 1.33 to .80; and 2.06 to .90.
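The two formulas, and the slope-biserial correspondence that Lord reports, can be sketched numerically. The following is a minimal illustration in Python; the function names are my own, D = 1.7 as in the formulas, and the relation a = ρ / √(1 - ρ²) is the standard normal-ogive approximation that reproduces the paired values cited above.

```python
import math

D = 1.7  # scaling constant from Formulas (1) and (2)

def p_correct(theta, b, a=1.0):
    """Two-parameter logistic probability of a correct response.

    With the slope a fixed at 1.0 for every item, this reduces to the
    one-parameter (Rasch-type) form in Formula (1).
    """
    return 1.0 / (1.0 + math.exp(-D * a * (theta - b)))

def slope_from_biserial(r_bis):
    """Approximate 2PL slope corresponding to a biserial correlation."""
    return r_bis / math.sqrt(1.0 - r_bis ** 2)

# Reproduces the correspondences cited in the text:
# a biserial of .60 pairs with a slope of about .75, and .80 with 1.33.
print(round(slope_from_biserial(0.60), 2))  # 0.75
print(round(slope_from_biserial(0.80), 2))  # 1.33
```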
This issue of item discriminability has implications for test construction in settings in which criterion-referenced testing is appropriate (Hudson & Lynch, 1984; Brown, 1988; Bachman, 1989; Hudson, 1989). In order to place the discussion in context, assume that analysis of a test indicates the set of three item characteristic curves in Figure 1. Additionally, assume that if θ < 0.0 then the examinee should be classed as being a Level A student while if θ > 0.0 then the examinee is a Level B student. The purpose of the test is to place the
student into one of the two ability levels.
Figure 1
Hypothetical test item slopes
[Figure not reproduced: three item characteristic curves plotting probability correct (.00 to 1.00) against ability (-4.00 to +4.00), with Level A covering abilities below 0.0 and Level B those above 0.0.]
Item #2 has a low slope across the entire test. Although its point of inflection is approximately 0.0 on the ability scale, it supplies little discrimination between Level A and Level B. That is, examinees with an ability level of -4.00 have a 27% chance of getting the item correct while examinees with high ability level of +4.00 only have a probability of approximately 63% of getting the item correct. Thus, while the item does discriminate maximally at the 0.0 level, it discriminates very little. Item #3 has a very high slope within Level B, but supplies information over a very narrow range within Level B. Items with discriminations this high have very restricted utility in discriminating between Level A and Level B. However, Item #1 provides a great deal of information near the 0.0 point, the point which is supposed to divide Level A from Level B. Items #2 and #3 are not satisfactory while Item #1 is a good item for
discrimination and placement. It is most discriminating near the point of interest and discriminates over the range of abilities of interest. The importance of the slope in item selection does not have to do with whether one item has a slope that is similar to the slope of other items. Rather, it is an issue of whether the items which are selected have slopes like that of Item #1 in Figure 1.
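The contrast among the three hypothetical items can be sketched with the two-parameter formula. The slope and difficulty values below are illustrative stand-ins chosen to mimic the shapes described for Figure 1; they are not taken from the figure itself.

```python
import math

D = 1.7

def icc(theta, a, b):
    """Two-parameter item characteristic curve: P(correct | theta)."""
    return 1.0 / (1.0 + math.exp(-D * a * (theta - b)))

# (slope a, difficulty b) for three illustrative items:
items = {
    "flat (Item #2-like)":     (0.15, 0.0),  # low slope at the cut point
    "moderate (Item #1-like)": (0.70, 0.0),  # useful slope at the cut point
    "steep (Item #3-like)":    (2.50, 1.5),  # very steep, but inside Level B
}

# How sharply each item separates examinees just below and just above
# the Level A / Level B cut point at theta = 0.0:
spreads = {}
for name, (a, b) in items.items():
    p_lo, p_hi = icc(-1.0, a, b), icc(1.0, a, b)
    spreads[name] = p_hi - p_lo
    print(f"{name}: P(-1)={p_lo:.2f}, P(+1)={p_hi:.2f}, spread={spreads[name]:.2f}")
```

Only the moderate item yields a large probability spread across the cut point; the flat item barely separates the levels anywhere, and the steep item does its separating well inside Level B.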
As noted above, the two models treat slope differently. By not accounting for the slope parameter, the one-parameter model assumes that differences in discriminability of items are of no concern in the estimation of item parameters. That is, it differs from the two-parameter model in that in the process of estimating the item characteristic curve (ICC) all items are assumed to have equal discriminating power and vary only as to item difficulty. In terms of item fit, it is argued for the one-parameter model that items with very different discriminabilities necessarily do not fit the model and should be eliminated. The one-parameter model provides a variety of indices to determine item fit. Henning (1984, 1988, & personal communication) considers the Rasch model fit statistics to be tests of unidimensionality. He is of the opinion that including items with divergent discriminabilities and guessing levels can contribute to multidimensionality. Two of these fit statistics are infit and outfit (Wright and Linacre, 1984). Outfit is the standardized weighted mean square residual of unlikely item responses while infit is a weighted derivation of the outfit statistic which is focused on the area where responses deliver the most information (Henning, 1988). For the infit statistic, the deviations used in calculations of the outfit are weighted by the amount of information provided by each response. In this way, infit is not as sensitive to individual outliers which are distant from the item as the outfit statistic. Wright and Linacre (1984) indicate that items outside a band of -2.0 to 2.0 should be examined as possibly misfitting. According to this view, the Rasch fit tests all attempt to improve on unidimensionality by eliminating as misfitting those items that have highly divergent discriminabilities as well as those which have aberrant response patterns. Only those within a range of homogeneity of discriminability are judged to fit the model while others are rejected. 
Indeed, items that show extreme departure from the assumption are removed from the test through the tests of fit.
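A rough sketch of how outfit and infit can be computed from Rasch residuals follows. This shows the mean-square forms (the standardized values reported later in Table 3 are a further transformation of these); the response vector, ability estimates, and difficulty are invented for illustration.

```python
import math

def rasch_p(theta, b):
    """One-parameter (Rasch) probability of a correct response."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def fit_mean_squares(responses, thetas, b):
    """Outfit and infit mean squares for a single item.

    responses: 0/1 answers to the item; thetas: examinee ability
    estimates; b: the item's difficulty.
    """
    z2, info = [], []
    for x, th in zip(responses, thetas):
        p = rasch_p(th, b)
        w = p * (1.0 - p)               # information carried by this response
        z2.append((x - p) ** 2 / w)     # squared standardized residual
        info.append(w)
    outfit = sum(z2) / len(z2)          # unweighted mean square
    # Infit weights each residual by its information, so responses far
    # from the item's difficulty (outliers) count for less:
    infit = sum(z * w for z, w in zip(z2, info)) / sum(info)
    return infit, outfit

# A lucky correct answer by a very low-ability examinee inflates outfit
# much more than infit, as the text describes:
thetas = [-3.0, -1.0, 0.0, 1.0, 3.0]
responses = [1, 0, 0, 1, 1]   # the theta = -3.0 examinee answers correctly
infit, outfit = fit_mean_squares(responses, thetas, b=0.0)
print(infit, outfit)
```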
There are two potential difficulties with interpretation of the Rasch model fit statistics. Both problems relate to the fact that a single number represents
violations of fit regardless of the cause of lack of fit. That is, although the Rasch model assumes that the discriminations of all items are identical, and consequently indicates which items have extreme departure from that assumption, the fit statistics also theoretically identify items which have very unlikely examinee response patterns. The first problem of interpretation of an unacceptable fit value, then, is determining whether the item is unacceptable because of an anomaly in the response pattern or because the item has divergent discriminability. The second difficulty is that the statistic will not indicate which items have slopes that are either extremely flat or steep and hence not productive for level determination. It is possible that if all items on a test had an equally extremely steep or flat slope the fit statistics would not indicate that this was the case because the items would not have highly divergent discriminabilities. They would have homogeneous low or high discriminabilities. Further, it is not clear how divergent an item discrimination must be in order to be considered highly divergent.
An additional issue in determination of which model to use for analysis concerns the contributions that inclusion of the slope can have in test design. When there is a large enough sample size (e.g., 200 or above) and a sufficient number of items (e.g., 60 or above) to allow use of the two-parameter model, the inclusion of ai is useful in determining how effective an item may be. That is, just as bi determines the point at which the item is providing the most information, ai indicates just how much discriminating power the item will have between examinees at differing θ levels. Hambleton (1979) has criticized the one-parameter model for rejecting too many items as misfitting in eliminating items with extreme discriminations. The two-parameter model in effect allows items at extreme ends by including the slope and then relying on χ² goodness-of-fit statistics to identify misfitting items (Mislevy and Bock, 1986, 1990). However, it is clear that it is not satisfactory merely to include ai in calibrations and then rely on a χ² goodness-of-fit statistic to eliminate items with unexpected response patterns. Items at the extremes may simply not be desirable in certain contexts and should be eliminated. That is, there may be a rationale for eliminating items with extremely low or extremely high slopes, independent of any reference to the one-parameter model assumptions.
This last point is an important issue. That is, the items should be eliminated because they are not productive, not because of a hypothesis that they violate the assumption of unidimensionality, and they should not be
retained merely because the slope can be taken into account in determining other item parameters. The items should be eliminated or retained because of their utility. It is not clear that items which have different discriminations are necessarily violating the assumption of unidimensionality. Items have different item difficulties but no one claims that this violates unidimensionality. Any psychometric model adopted must allow for the fact that language skills vary both in their difficulty and in over what levels of ability they remain difficult. That one item measures a language skill which is learned quickly and another item measures a language skill which is learned more slowly should not be taken to indicate that the test is not a unidimensional language test. Thus, items with extreme discriminations may be eliminated because they do not function well given the purpose of the instrument of which they are a part. They should not necessarily be eliminated because they have discriminations different from the average discriminations of the remaining items on the test. Analysis of item discrimination, whether from the one-parameter or the two-parameter perspective, requires that a less purist view be taken. In practice not all items will have exactly the same discriminability and these differences in discriminability should be accounted for when possible. The differences between items with a slope of 0.40 and those with slopes of 0.90 are important differences to take into account when determining item difficulty. If it is possible to employ the two-parameter model then these differences can be incorporated. They are additional sources of information. However, it is also desirable to take the slope into account when determining which items to retain.
In short, there appear to be at least five sources of information which may help ensure that test items are maximally useful within the level for which they have been designed as well as congruent with the trait they are supposed to be measuring. These are the infit, outfit, slope, biserial and a χ² goodness-of-fit statistic for expected item responses. The remainder of this study will examine the relationships among these indices. For this discussion, suspect items will be those which have a slope below .40 or above .90. These values have been selected because items with a slope of less than .40 have very little discrimination at all and items with a slope above .90 discriminate over a very restricted range of interest. Additionally, items with an infit or outfit greater than 2.0 or less than -2.0 are labeled as suspect in order to examine the values
suggested by Wright and Linacre (1984).
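The screening rules above can be expressed as a simple filter. The thresholds are taken from the text; the item records and function name are illustrative.

```python
def is_suspect(slope, infit, outfit,
               slope_lo=0.40, slope_hi=0.90, fit_band=2.0):
    """Flag an item as suspect under the criteria used in this study:
    a slope below .40 or above .90, or a standardized infit/outfit
    outside the -2.0 to +2.0 band suggested by Wright and Linacre."""
    return (slope < slope_lo or slope > slope_hi
            or abs(infit) > fit_band or abs(outfit) > fit_band)

# Illustrative item records: (slope, infit, outfit)
items = {
    "flat":  (0.20,  5.8,  5.1),   # too little discrimination
    "good":  (0.55,  0.8, -0.1),   # within all bands
    "steep": (1.20, -3.8, -3.2),   # too narrow a range of discrimination
}
flags = {name: is_suspect(*vals) for name, vals in items.items()}
print(flags)  # only "good" passes
```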
PROCEDURE

Instruments
The test instruments were two pilot versions of the General Tests of English Language Proficiency (GTELP) developed by the National Education Corporation. Form A had 94 items and Form B had 95 items. The test consists of three subtests (reading, listening and grammar), which are used to place examinees into one of three functional ability levels.
Subjects

The subjects were 523 examinees in Japan, 105 in Saudi Arabia and 356 in Mexico, from a variety of academic and occupational backgrounds. Each examinee took one form of the test. Of these examinees, only the 830 with complete data sets who also scored 25% or higher on the total test were retained for analysis.
Descriptive information

Basic descriptive information for the tests, subtests and subtests by level of items for the two forms is presented in Table 1 and Table 2 below. The two forms of the test have very similar mean percent scores and percent score standard deviations on subtests, levels and total score. Additionally, it should be noted that the percent scores are low. The mean percent score on Form A is 52.39 and on Form B is 55.66. The tests as a whole appear to be difficult for the subjects in this test sample. Each form and each subtest has acceptable α reliabilities.
Table 1
Descriptive information Form A (n=430)

            Mean     SD      Mean %    % SD     Items    α
                             correct
Total       49.45    17.29   52.39     18.39    94       .951
Grammar     18.83     7.14   62.77     23.79    30       .907
Listening   15.39     6.54   43.98     18.69    35       .853
Reading     15.22     5.62   52.50     19.38    29       .854
Table 2
Descriptive information Form B (n=400)

            Mean     SD      Mean %    % SD     Items    α
                             correct
Total       52.88    17.46   55.66     18.38    95       .946
Grammar     19.45     6.95   64.83     23.17    30       .886
Listening   17.03     6.37   48.64     18.19    35       .837
Reading     16.40     5.98   54.67     19.92    30       .848
Item Analysis

Item analysis involves IRT examination of the two-parameter slope, χ² statistic and biserials from PC-Bilog (Mislevy and Bock, 1986), as well as an examination of the one-parameter infit and outfit statistics computed with the Microscale program (Wright & Linacre, 1984). Item difficulties are computed using the two-parameter model, and the two forms were linked on 10 items from a common multiple-choice cloze test given to all examinees.
The item information for each item is presented in Table 3. This information includes the slope, infit, outfit, biserial correlation, χ² goodness-of-fit statistic, and the two-parameter item difficulty. Items are rank ordered from the lowest slope to the highest slope. The item identification at the left contains two characters indicating which subtest the item tested (RD=reading, LS=listening, GR=grammar), followed by a character designating which form of the test contained the item (A or B), and then the sequential item number on its subtest. Suspect items (a slope below .40 or above .90, an infit or outfit less than -2.0 or greater than +2.0, and a χ² probability below .05) are indicated with an asterisk in the appropriate column of values.
Table 3
Item information sorted by slope

ITEM     SLOPE     INFIT      OUTFIT     BISER    χ²        DIFFICULTY
RDA10   *0.083    *13.451    *11.935     0.260   *0.0000     0.070
LSA32   *0.151     *7.528     *7.851     0.007    0.2934     2.452
GRB27   *0.168     *6.197     *6.581     0.044    0.2471     2.403
LSB35   *0.176     *6.237     *6.337     0.051    0.1100     2.007
LSA35   *0.187     *6.528     *6.186     0.088    0.3652     1.929
LSB29   *0.187     *6.733     *6.544     0.084    0.5695     1.254
RDA28   *0.199     *5.786     *5.093     0.115    0.5223     2.645
RDB20   *0.208     *6.193     *4.377     0.106    0.1082    -1.505
RDB26   *0.211     *4.628     *5.116     0.119    0.4388     2.532
LSB06   *0.215     *6.007     *6.039     0.162    0.5785     0.936
GRA22   *0.225     *6.602     *6.777     0.185    0.0510     0.088
LSA31   *0.235     *4.296     *5.948     0.157   *0.0014     2.218
RDB27   *0.239     *3.830     *4.289     0.154    0.3149     2.535
RDB28   *0.246     *4.284     *4.237     0.186    0.1179     1.945
LSB31   *0.267     *2.751     *4.263     0.174    0.1900     2.693
RDB29   *0.269     *5.194     *4.751     0.193   *0.0066     0.287
RDB23   *0.290      1.822     *2.546     0.196    0.2051    -2.989
RDA16   *0.293     *4.085     *3.840     0.235    0.9317    -1.371
LSB07   *0.312     *3.588     *3.668     0.293    0.8312     0.957
RDA21   *0.313      0.940      1.496     0.196   *0.0104    -4.003
LSA05   *0.316     *4.738     *2.556     0.267    0.2842    -0.514
LSB11   *0.317     *3.857     *3.583     0.286    0.2078    -0.270
LSB30   *0.317     *3.524     *3.866     0.284    0.7884     1.104
LSA15   *0.319      1.955     *3.003     0.253    0.3147     3.200
LSA01   *0.320     *4.775     *3.401     0.271    0.1013    -0.022
LSB32   *0.324     *2.135     *4.647     0.256   *0.0019     1.865
LSB28   *0.326     *3.786     *4.078     0.274    0.5511     0.397
RDA24   *0.345     *2.233     *3.387     0.277   *0.0076     2.254
RDB30   *0.352      1.509     *2.513     0.296   *0.0467     2.698
LSA16   *0.365     *2.479     *3.056     0.321    0.8258     t.m
Table 3 (continued)
Item information sorted by slope

ITEM     SLOPE     INFIT     OUTFIT    BISER    χ²        DIFFICULTY
RDB14   *0.369    *2.866     0.991     0.322   *0.0152    -1.162
LSA33   *0.374     1.690    *2.715     0.316   *0.0228     2.100
RDA19   *0.374    *2.604     1.042     0.312   *0.0160    -1.514
GRA03   *0.376    *3.326    *3.092     0.348    0.6860     0.664
LSB15   *0.380    *2.574    *2.866     0.335    0.3415     0.808
RDA25   *0.381    *3.162    *3.355     0.349    0.9705     0.743
LSA34   *0.385    *2.802    *3.204     0.359    0.3473     0.873
LSA21   *0.387    *2.549    *3.098     0.346    0.7088     1.097
LSA29   *0.396     0.926    *3.243     0.334   *0.0424     2.196
LSB26    0.407    *2.403    *2.097     0.364    0.1178    -0.561
LSB09    0.408    *2.211     1.970     0.369    0.2451    -0.584
LSB12    0.417    *2.452    *2.531     0.360    0.4755    -0.239
LSA25    0.426     0.918    *2.637     0.372   *0.0233     2.014
GRA13    0.428    *2.730    *2.141     0.361    0.1209    -0.221
LSA27    0.428     0.504    *2.671     0.381   *0.0041     2.002
LSA11    0.435    *2.370     1.414     0.391    0.2137    -0.168
RDA11    0.443    *2.355     0.993     0.395    0.2963    -0.462
LSA14    0.446     1.201     0.861     0.357    0.4888    -1.478
LSA23    0.447     1.815    *2.179     0.403    0.8206     0.645
RDA27    0.447     1.776     1.864     0.426    0.5951     0.750
RDA20    0.448     0.842    -0.183     0.322   *0.0185    -2.576
LSA08    0.459    *2.043    *2.457     0.416    0.8050     0.049
RDB09    0.460     1.727     1.652     0.415    0.4262     0.314
GRA15    0.461     1.823     1.612     0.417    0.7093     0.215
RDB13    0.463     1.393     1.261     0.409    0.3147    -0.771
RDB24    0.463     1.032    -0.021     0.381    0.2636    -1.653
GRB01    0.464     0.411     0.236     0.326    0.4096    -2.646
LSB16    0.469     0.561     1.593     0.413    0.1227     1.337
GRB04    0.470     1.699     0.586     0.407    0.4740    -0.282
LSB18    0.471     1.195     1.678     0.405    0.9678     0.868
Table 3 (continued)
Item information sorted by slope

ITEM     SLOPE     INFIT     OUTFIT    BISER    χ²        DIFFICULTY
GRA21    0.488     1.680     1.691     0.442    0.6198     0.002
LSA19    0.491     1.737     0.835     0.441    0.8749    -0.266
RDB16    0.491     1.800     0.288     0.409   *0.0078    -0.842
LSB19    0.493     0.929     1.511     0.431    0.7525     0.967
RDB08    0.494     0.576    *2.587     0.438    0.4440    -0.918
LSA26    0.496     1.269     1.209     0.460    0.9504     0.391
LSA09    0.518     0.930     1.336     0.470    0.4858     0.002
RDB10    0.526     0.692    -0.261     0.455    0.6379    -0.326
LSA06    0.529     0.927     0.365     0.462    o.9m      -0.211
LSB10    0.530     0.564     0.468     0.481    0.9292     0.333
LSB34    0.530    -0.134     1.801     0.445    0.0573     1.517
GRA10    0.531     0.788     0.962     0.457    0.3767    -0.477
LSA28    0.531     0.794     0.400     0.475    0.4554    -0.090
RDB25    0.532     0.578     0.422     0.488    0.8902     0.575
GRB12    0.533     0.688     0.426     0.486    0.4072     0.380
LSB21    0.551     0.821    -0.040     0.448    0.5146    -0.627
GRB14    0.559     0.257     0.626     0.490    0.8024     0.097
LSB20    0.559    -0.277     0.646     0.495    0.0608     0.927
LSA24    0.560     0.387    -0.114     0.463    0.7128    -0.779
RDA18    0.563     0.608    -0.389     0.473    0.8732    -0.421
RDB21    0.565    -0.022    -0.263     0.512    0.8299    -0.004
RDA29    0.566     0.344     0.259     0.536    0.6042     0.196
RDB22    0.584     0.149    -0.186     0.459    0.1978    -1.596
LSB25    0.589    -0.216     0.085     0.510    0.3445     0.306
RDB18    0.590    -0.349    -0.148     0.498   *0.0213     0.165
LSB13    0.597    -0.455    -0.215     0.521    0.6401     0.268
GRA07    0.599    -0.324     0.258     0.449    0.1575    -1.373
LSA02    0.601     0.199    -0.680     0.440   *0.0171    -1.535
GRB03    0.613    -0.108    -0.923     0.505    0.3465    -0.363
LSB08    0.613    -0.397    -0.750     0.507    0.3624    -0.817
Table 3 (continued)
Item information sorted by slope

ITEM     SLOPE     INFIT     OUTFIT    BISER    χ²        DIFFICULTY
LSA20    0.618    -0.367    -0.553     0.528    0.7265    -0.275
GRA14    0.619    -0.476     0.422     0.542    0.6469     1.158
LSA22    0.621    -0.182    -0.249     0.540    0.7788     1.617
LSB27    0.623    -0.542    -0.204     0.529    0.6700     0.300
RDA03    0.634    -0.546    -0.468     0.537    0.2381    -0.326
GRB18    0.634    -0.317    -0.118     0.530    0.3721    -0.368
LSA10    0.638    -0.266    -0.292     0.554    0.6051     0.528
GRB26    0.641    -0.386     1.307     0.532    0.4257    -0.328
GRB09    0.645    -1.189     0.645     0.554    0.1722    -0.312
LSA03    0.647    -0.855    -0.176     0.519    0.7212    -0.843
LSB14    0.651    -0.176     1.417     0.408    0.1485    -2.604
LSB22    0.658    -1.057    -1.034     0.566    0.8676     0.534
RDB19    0.674    -1.564    -0.996     0.582    0.1735     0.537
LSA30    0.676    -1.361    -0.856     0.591    0.5386     0.275
GRB25    0.676    -0.811    -1.100     0.548    0.8221    -0.424
RDA04    0.678    -0.234    -1.040     0.456    0.6234    -1.608
RDB17    0.682    -0.823    -1.707     0.546    0.1002    -0.667
LSB33    0.685    -1.298    -1.357     0.582    0.4397     0.346
GRA04    0.686    -1.395     0.750     0.521   *0.0114    -1.019
GRB13    0.690    -0.596    -0.927     0.508    0.1370    -1.155
LSB17    0.697    -1.203    -1.375     0.560    0.7305    -0.502
RDB12    0.697    -0.801    -1.086     0.538    0.4107    -1.028
RDA06    0.702    -0.808    -1.608     0.544    0.8819    -0.457
GRB05    0.705    -0.526    -0.655     0.527    0.5541    -0.947
LSB23    0.706    -1.187    -1.126     0.573    0.8354    -0.616
LSA13    0.717    -1.658    -1.374     0.610    0.5635     0.306
RDB02    0.718    -1.775    -0.983     0.585    0.6380    -0.234
RDA26    0.731    -1.234    -1.227     0.642    0.9946     1.083
RDB07    0.731    -0.734    -0.766     0.525    0.3757    -1.532
RDA15    0.734    -1.905    -0.555     0.585    0.5619    -0.437
Table 3 (continued)
Item information sorted by slope

ITEM     SLOPE     INFIT     OUTFIT    BISER    χ²        DIFFICULTY
GRB21    0.736    -1.639    -0.467     0.576    0.5872    -0.856
GRB23    0.740    -1.323    -0.562     0.575    0.4594    -0.580
RDB15    0.741    -1.524    -1.760     0.587    0.2035    -0.544
GRA24    0.743    -1.228    -1.230     0.555    0.9092    -0.818
GRB06    0.744    -1.170    -1.840     0.568    0.5399    -0.835
RDA17    0.747    -1.558    -1.372     0.563    0.7787    -0.477
RDA02    0.751    -1.681    -1.130     0.568    0.4976    -0.693
RDA30    0.755    -1.750    -1.502     0.658    0.8308     0.399
RDA08    0.760    -0.497    -1.478     0.509    0.6145    -1.582
RDA01    0.765    -1.265    -1.792     0.560    0.8135    -1.114
RDA14    0.765    -1.985    -1.879     0.605    0.5166    -0.387
RDA09    0.771   *-2.345    -1.445     0.604   *0.0146    -0.422
RDA12    0.773    -1.207    -1.724     0.561    0.5871    -1.104
GRB02    0.781    -1.777    -1.089     0.600    0.5184    -0.611
GRB28    0.789    -2.266    -0.794     0.623    0.2882    -0.325
GRA02    0.806    -1.935    -1.925     0.580    0.5870    -0.427
GRB19    0.806    -1.319    -1.064     0.586    0.8341    -0.966
GRA27    0.808   *-2.341   *-2.522     0.623    0.7722    -0.066
LSA18    0.809   *-2.244   *-2.446     0.619   *0.0096    -0.163
RDA13    0.811   *-2.798   *-2.010     0.663    0.1615     0.762
RDA05    0.816   *-2.041    -1.638     0.593    0.5269    -0.788
GRB30    0.816   *-2.058    -1.703     0.624    0.0829    -0.485
GRB08    0.824    -1.303    -1.106     0.578    0.8942    -1.127
GRB17    0.833    -1.422    -1.309     0.603    0.7806    -1.082
GRA30    0.835   *-3.022   *-2.550     0.655    0.0949    -0.001
GRB07    0.836   *-2.516   *-2.761     0.646    0.8490     0.047
GRA06    0.839   *-3.175   *-2.199     0.657    0.3578     0.027
LSA17    0.849   *-2.953   *-2.981     0.672    0.2343     0.121
LSA12    0.852   *-3.119   *-2.368     0.683   *0.0288     0.513
LSB01    0.855   *-3.043   *-3.106     0.652   *0.0111    -0.172
Table 3 (continued)
Item information sorted by slope

ITEM     SLOPE     INFIT     OUTFIT    BISER    χ²        DIFFICULTY
GRA16    0.877   *-2.936    -1.523     0.624    0.3699    -0.635
GRB10    0.889    -1.058    -1.797     0.574    0.2876    -1.525
RDB06    0.892   *-3.166    -1.505     0.659    0.8276    -0.462
GRB20   *0.908   *-2.424   *-2.617     0.653    0.9787    -0.627
GRA17   *0.911   *-3.177   *-3.148     0.653    0.9811    -0.254
LSB05   *0.912   *-2.371   *-2.685     0.639    0.9185    -0.612
GRA28   *0.917   *-3.551   *-2.060     0.668    0.2184    -0.391
RDA23   *0.920   *-3.564   *-3.239     0.691    0.1659    -0.133
GRA08   *0.944   *-2.958   *-2.070     0.644    0.1906    -0.651
LSB24   *0.955    -1.294    -1.785     0.609    0.9117    -1.338
RDA22   *0.973   *-4.267   *-3.542     0.724    0.3044     0.001
GRB16   *1.003    -1.211    -1.925     0.607    0.0606    -1.390
GRA23   *1.007   *-3.159    -1.795     0.639    0.3263    -0.781
RDB05   *1.008   *-4.449   *-4.066     0.705    0.7869    -0.139
LSB03   *1.009   *-2.165    -1.588     0.659    0.7123    -1.070
LSA04   *1.011    -1.710    -1.245     0.614    0.1021    -1.148
LSA07   *1.025   *-4.244   *-3.856     0.707    0.1518    -0.402
RDB01   *1.029   *-4.788   *-4.044     0.722    0.2577    -0.236
RDB11   *1.048   *-4.101   *-3.721     0.714    0.4153    -0.544
GRA09   *1.050   *-4.459   *-2.666     0.693    0.1634    -0.423
LSB02   *1.071   *-3.815   *-3.343     0.718    0.2017    -0.567
GRA19   *1.076    -1.881    -1.934     0.619    0.3809    -1.141
GRB15   *1.091   *-4.286   *-3.847     0.726    0.8575    -0.364
GRA20   *1.116   *-4.184   *-3.461     0.713    0.9712    -0.546
LSB04   *1.127    -0.673   *-2.026     0.598    0.0541    -1.687
GRB29   *1.173   *-2.596   *-3.018     0.738    0.8882    -0.932
GRA29   *1.177    -1.678   *-2.449     0.636    0.1938    -1.148
GRA01   *1.203   *-5.356   *-3.192     0.730    0.1774    -0.511
GRA12   *1.218    -1.686    -1.759     0.623    0.2219    -1.290
GRB24   *1.223   *-2.218   *-2.546     0.703    0.7527    -1.118
Table 3 (continued)
Item information sorted by slope

ITEM     SLOPE     INFIT     OUTFIT    BISER    χ²        DIFFICULTY
RDB04   *1.224   *-3.765   *-3.202     0.731    0.8026    -0.845
RDB03   *1.424   *-5.745   *-4.299     0.804    0.2764    -0.538
GRB22   *1.493    -1.536   *-2.612     0.727    0.4524    -1.332
GRA11   *1.509    -0.932   *-2.153     0.617    0.7110    -1.479
GRA18   *1.545   *-5.017   *-2.951     0.776    0.4014    -0.781
GRA26   *1.582   *-5.087   *-4.301     0.773    0.9025    -0.678
GRA05   *1.642   *-2.084   *-2.477     0.701    0.8246    -1.179
GRA25   *1.864   *-5.119   *-4.463     0.801    0.2318    -0.763
GRB11   *2.126   *-4.067   *-4.102     0.848    0.5185    -0.970
In examining the patterns in Table 3, several points seem clear. First, there is a close relationship between slope, infit, outfit and the biserial. In general, low slope is associated with high infit and outfit values and low biserial correlations. Additionally, high slope is associated with low infit and outfit values and high biserial correlations. Correlations among the different indices and difficulty are presented in Table 4.
Table 4
Correlations of indices

          INFIT     OUTFIT    SLOPE     BISER
OUTFIT     0.954
SLOPE     -0.831    -0.841
BISER     -0.960    -0.958     0.872
DIFF       0.394     0.519     0.437     0.405
Since infit is a function of outfit the strong correlation (0.954) is to be expected. That is, they are not independent values. However, the strong overall correlation of the slope with infit and outfit (-0.831 and -0.841) is important. The steeper the slope, and hence the narrower its range of discrimination, the lower the fit values. Thus, at the extreme ends of the slope continuum the fit indices demonstrate extreme values. This is plotted in Figure 2 and Figure 3 with the regression line of predicted values indicated.
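The direction of these correlations can be checked directly from item-level values. Here is a sketch with a handful of illustrative rows that echo the pattern of Table 3 (low slopes paired with high positive infit, high slopes with negative infit); the values are not the actual data.

```python
import math

def pearson(x, y):
    """Pearson product-moment correlation of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Illustrative item-level values following the direction of Table 3:
slope = [0.08, 0.21, 0.45, 0.70, 0.97, 1.50]
infit = [13.5, 4.6, 2.4, -0.8, -4.3, -5.0]
r = pearson(slope, infit)
print(round(r, 2))  # strongly negative, as in Table 4
```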
Figure 2
A scatterplot of the two-parameter slope to the one-parameter infit estimates. "." on diagonal represents predicted values of regression line. A = 1 observation, B = 2 observations, etc.
[Scatterplot not reproduced: slope (0.0 to 2.2) on the vertical axis, infit (-6 to 14) on the horizontal axis.]
Figure 3
A scatterplot of the two-parameter slope to the one-parameter outfit estimates. "." on diagonal represents predicted values of regression line. A = 1 observation, B = 2 observations, etc.
[Scatterplot not reproduced: slope (0.0 to 2.2) on the vertical axis, outfit (-6 to 12) on the horizontal axis.]
Figures 2 and 3 show that items with a slope above 0.9 or below 0.3 are generally also at extreme ends of the fit values. However, Table 4 indicates that the strongest correlation each of the three indices has is with the biserial. That is, if the item does not show a strong relationship to overall test outcome, then both the dispersion around the slope will be great and the residuals will be affected. This is shown in Figures 4, 5 and 6 below.
Figure 4
A scatterplot of biserial to the two-parameter slope. "." on diagonal represents predicted values of regression line. A = 1 observation, B = 2 observations, etc.
[Scatterplot not reproduced: biserial (-0.25 to 1.00) on the vertical axis, slope (0.0 to 2.5) on the horizontal axis.]
Figure 5. A scatterplot of the biserial against the one-parameter infit estimates. "." on the diagonal represents predicted values of the regression line; A = 1 observation, B = 2 observations, etc. [Character plot not reproduced: biserial runs from -0.25 to 1.00 on the vertical axis, infit from -6 to 14 on the horizontal axis.]
Figure 6. A scatterplot of the biserial against the one-parameter outfit estimates. "." on the diagonal represents predicted values of the regression line; A = 1 observation, B = 2 observations, etc. [Character plot not reproduced: biserial runs from -0.25 to 1.00 on the vertical axis, outfit from -6 to 12 on the horizontal axis.]
The plots in Figures 4, 5 and 6 show two patterns in the relationship of the biserial to the fit statistic values. First, items with high biserials increasingly show extreme negative fit values, while those with low biserials show increasingly high fit values. Second, this relationship is less strong between the slope and the biserial than between the two fit statistics and the biserial.
Finally, item difficulty has the weakest correlational relationship to all of the indices in Table 4. An examination of Figures 7, 8, 9 and 10 indicates that the relationship truly is weak, and that this is not merely an issue of curvilinearity masking a relationship.
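The curvilinearity question can be checked directly: if a U-shaped relation between difficulty and an index were being masked by a near-zero linear correlation, a quadratic fit would explain markedly more variance than a linear one. A sketch with illustrative data (the study's actual item estimates are not reproduced here):

```python
import numpy as np

def r_squared(y, y_hat):
    """Proportion of variance in y explained by the fitted values y_hat."""
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

def curvilinearity_check(diff, index):
    """Compare linear vs. quadratic polynomial fits of an item index
    (e.g., slope or biserial) on difficulty.  A quadratic R^2 far above
    the linear R^2 would signal a curvilinear relation that a simple
    correlation coefficient would miss."""
    lin = np.polyval(np.polyfit(diff, index, 1), diff)
    quad = np.polyval(np.polyfit(diff, index, 2), diff)
    return r_squared(index, lin), r_squared(index, quad)
```

Because the quadratic model nests the linear one, its R^2 can only be equal or higher; the diagnostic is whether the gap is large.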
Figure 7. A scatterplot of the two-parameter slope against the two-parameter difficulty estimate. "." on the diagonal represents predicted values of the regression line; A = 1 observation, B = 2 observations, etc. [Character plot not reproduced: slope runs from 0.0 to 2.2 on the vertical axis, difficulty from -4 to 4 on the horizontal axis.]
Figure 8. A scatterplot of the biserial correlation against the two-parameter difficulty estimate. "." on the diagonal represents predicted values of the regression line; A = 1 observation, B = 2 observations, etc. [Character plot not reproduced: biserial runs from -0.25 to 1.00 on the vertical axis, difficulty from -4 to 4 on the horizontal axis.]
Figure 9. A scatterplot of the one-parameter infit estimates against the two-parameter difficulty estimate. "." on the diagonal represents predicted values of the regression line; A = 1 observation, B = 2 observations, etc. [Character plot not reproduced: infit runs from -6 to 14 on the vertical axis, difficulty from -4 to 4 on the horizontal axis.]
Figure 10. A scatterplot of the one-parameter outfit estimates against the two-parameter difficulty estimate. "." on the diagonal represents predicted values of the regression line; A = 1 observation, B = 2 observations, etc. [Character plot not reproduced: outfit runs from -5 to 14 on the vertical axis, difficulty from -4 to 4 on the horizontal axis.]
There is a general indication that items at the extremes of difficulty or ease likewise have low slope and biserial values. However, this is not a strong pattern, given that items at these extremes fall across a span of slope and biserial values. As for the relationship between difficulty and the fit statistics, Figures 9 and 10 appear to confirm a weak association.
In terms of the interactions between the IRT indices, it should be noted that a close relationship exists between the two-parameter slope information and the one-parameter infit and outfit residual analyses. As noted above, the correlations between the two, -0.831 and -0.841, are strong. Table 3 shows that no items with a slope between 0.456 and 0.765 were flagged as misfitting according to infit, and only one item was identified as misfitting according to outfit. Infit and outfit systematically reject items below a slope of approximately 0.450 and above 0.800. However, Table 3 also indicates that five items with slopes between 0.460 and 0.800 were flagged as misfitting by the chi-square goodness-of-fit analysis. These items were accepted by the infit and outfit indices. This indicates that the two fit tests are less sensitive to deviations from expected response patterns when the slope is not extremely high or extremely low.
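The chi-square goodness-of-fit analysis referred to here compares observed proportions correct with model-expected proportions within ability groupings. The exact grouping and weighting used by PC-BILOG are not reproduced in this paper; the following is a generic Pearson-style sketch of the idea:

```python
import numpy as np

def chi_square_item_fit(x, theta, p_model, n_groups=5):
    """Pearson-style chi-square item fit: bin examinees by ability and
    compare observed vs. model-expected proportions correct per bin.

    x       : (n,) 0/1 responses to a single item
    theta   : (n,) ability estimates
    p_model : callable mapping theta -> P(correct) under the fitted model
    """
    edges = np.quantile(theta, np.linspace(0, 1, n_groups + 1))
    chi2 = 0.0
    for g in range(n_groups):
        if g < n_groups - 1:
            mask = (theta >= edges[g]) & (theta < edges[g + 1])
        else:
            mask = (theta >= edges[g]) & (theta <= edges[g + 1])  # include top edge
        n_g = mask.sum()
        if n_g == 0:
            continue
        obs = x[mask].mean()                 # observed proportion correct in the bin
        exp = p_model(theta[mask]).mean()    # model-expected proportion in the bin
        chi2 += n_g * (obs - exp) ** 2 / (exp * (1.0 - exp))
    return chi2  # refer to a chi-square distribution, df = n_groups minus parameters estimated
```

Unlike the residual-based infit and outfit, this statistic flags any systematic observed-expected discrepancy across the ability range, which is consistent with its catching items the residual indices accepted.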
DISCUSSION
THIS STUDY was designed to address the issues surrounding item discrimination in CRTs. Specifically, it examined the relative contributions of the one-parameter and two-parameter models in constructing tests whose items both discriminate between ability levels and have response patterns consistent with the items fitting the test. Identifying the relationships that exist among the indices can help in test construction in several ways.
In the present study, the finding of high correlations between the biserial and each of the discrimination indices under consideration suggests general relationships of importance for test development. It indicates that items with high biserials should perhaps not be selected automatically, as they would be in conventional test development. Choosing items with the highest biserials may produce a test whose items have extremely high slopes, thus lessening the productivity of the items. That is, items with very high biserials may discriminate very well, but not discriminate what we want discriminated. They may be so discriminating that the actual trait we wish to measure is not addressed. Thus, the conventional wisdom of traditional test analysis, that items with high biserials are precisely the types of items which are desirable, should be reconsidered.
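For reference, the biserial discussed throughout is conventionally obtained from the point-biserial by the normal-ogive correction r_bis = r_pb * sqrt(p*q) / y, where p is the item's proportion correct and y is the standard-normal ordinate at the z-score cutting off proportion p. A sketch of that textbook conversion (not necessarily the exact computation used in the study's analyses):

```python
import numpy as np
from statistics import NormalDist

def point_biserial(item, total):
    """Pearson correlation between a 0/1 item vector and the total score."""
    return np.corrcoef(item, total)[0, 1]

def biserial(item, total):
    """Biserial correlation via the standard conversion from the
    point-biserial: r_bis = r_pb * sqrt(p * q) / y."""
    p = item.mean()
    q = 1.0 - p
    r_pb = point_biserial(item, total)
    # Standard-normal ordinate at the z-score corresponding to the item split
    y = NormalDist().pdf(NormalDist().inv_cdf(p))
    return r_pb * np.sqrt(p * q) / y
```

Since sqrt(p*q)/y is at least about 1.25, the biserial always exceeds the point-biserial in magnitude, and when the underlying score distribution departs from normality it can even exceed 1.0, one reason very high biserials deserve scrutiny rather than automatic selection.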
Additionally, the results indicate a strong relationship between the residual fit statistics and the slope. This is encouraging in that it suggests the one-parameter model fit statistics allow item selection such that the assumption of a common item discrimination parameter may essentially be met. That is, they allow the identification and possible exclusion of items with extreme discriminations. This is potentially very useful in those testing situations which preclude the use of the two-parameter model as a result of small sample sizes or too few items. In general, most settings in which CRTs are needed are precisely those situations which have few examinees and a small number of items associated with each criterion of interest.
There are, however, at least five concerns regarding reliance on the one-parameter model and the associated fit statistics, as well as on the traditional biserial correlation. First, in constructing the test information curve, the two-parameter ai is a useful source of additional information. It is clear that a test constructed by eliminating items deemed unacceptable according to the infit and outfit statistics above would still contain items with some differences in discrimination. That is, the test would still contain items with ai as low as 0.290 and 0.446 and items with ai as high as 1.076 and 1.218. These differences should be accounted for in examinee ability estimates and item difficulty estimates. Thus, the two-parameter model is to be preferred as generally supplying more information. Second, it remains to be seen whether the fit statistics demonstrate the same instability as the slope estimate does with small sample sizes. This will need to be examined in further research. Third, infit and outfit appear to be too heavily a function of the biserial and/or slope, and are consequently not as sensitive to other types of problems in response patterns as would be desirable. That is, the chi-square analysis indicated six items within the range of plus or minus 2.00 infit and outfit as having unacceptable response patterns. It may be that residual analysis alone is not sufficient for the one-parameter model, just as the chi-square analysis is not sufficient for the two-parameter model. For example, the chi-square statistic does not indicate problems with items which have extremely high discrimination values. However, it is clear that we should not confuse item discrimination with fit. The infit and outfit statistics frequently appear to do so, while the chi-square statistic is moot on this point. Fourth, it remains to be seen how the infit and outfit values are affected by the extremely narrow ability distributions associated with many criterion-referenced tests. That is, with the current test, high values of infit and outfit generally reflect low slope and low biserial correlations; with truncated distributions this same relationship might not hold. Fifth, while high values of infit and outfit reflect poor discrimination, and consequently lead to item decisions fairly consistent with those based on the two-parameter slope and the traditional biserial correlation, extremely low values of infit and outfit are more problematic to interpret. For example, in settings which indeed require extremely high slope values in order to discriminate at a very narrow threshold, infit and outfit values below -2.00 should not be used to discard items.
Finally, it must be stressed that none of these statistics alone addresses the content of the items. It is important to link any acceptance or rejection of items to a third source of information: content analysis. Items may be flagged as undesirable on the basis of fit or discrimination, yet have content that is representative of the trait to be measured. In focusing on the relationships among the item statistics used in determining the acceptability or unacceptability of test items, this study does not mean to suggest that these indices alone are sufficient for thorough test development.