
-

ED 072'117

AUTHORTITLE

INSTITUTION

SPONS AGENCYREPORT NOPUB DATEGRANTNOTE

AVAILABLE FROM

EDRS PRICEDESCRIPTORS

ABSTRACTIn this paper, an attempt has leen made to synthesize

some of the current thinking in the area of criterion-referencedtesting as well as to provide the beginning of an integration oftheory and method for such testing. Since criterion-referencedtesting is viewed from a decision-theoretic point of view, approachesto reliability and validity estimation consistent with thisphilosophy are suggested. Also, to improve the decision-makingaccuracy of criterion-referenced tests, a Bayesian procedure forestimating true mastery scores has been proposed. This Bayesianprocedure uses information about other members of a student's group(collateral information), but the resulting is stillcriterion-referenced rather than norm-referenced in that the studentis compared to a standard rather than to Other students. In' theory,the Bayesian procedure increases the *effective length* of the testby imprOving the reliability, the validity, and more importantly, thedecision-making accuracy of the criterion-referenced test scores..(Author)

DOCUMENT RESUME

TM 002 359

Hambleton, Ronald K.; Novick, Melvin R.Toward an Integration of Theory and Method forCriterion-Referenced Tests.American Coll. Testing Program, Iowa City, Iowa.Research and Development Div.Office of Education (DHEW), Washington, D.C.ACT-RR-53Sep 72OEG-0-72-071115p.; Paper presented at the annual meeting of theNational Council on Measurement in Education,Chicago, Illinois 1972Publication and Information Services Division, TheAmerican College Testing Program, P. O. Box 168, IowaCity, Iowa 52240 ($1.00)

MF-$0.65 HC-S3.29*Criterion Referenced Tests; Internal Scaling;Measurement Instruments; Measurement Techniques;*Scoring; Scoring Formulas; Test Construction;*Testing; Test Interpretation, *Test Reliability;*Test Validity

J


TOWARD AN INTEGRATION OF THEORY AND METHOD FOR CRITERION-REFERENCED TESTS

U.S. DEPARTMENT OF HEALTH, EDUCATION & WELFARE
OFFICE OF EDUCATION

THIS DOCUMENT HAS BEEN REPRODUCED EXACTLY AS RECEIVED FROM THE PERSON OR ORGANIZATION ORIGINATING IT. POINTS OF VIEW OR OPINIONS STATED DO NOT NECESSARILY REPRESENT OFFICIAL OFFICE OF EDUCATION POSITION OR POLICY.

AMERICAN COLLEGE TESTING PROGRAM


TOWARD AN INTEGRATION OF THEORY AND METHOD

FOR CRITERION-REFERENCED TESTS



Prepared by the Research and Development Division
The American College Testing Program

All rights reserved. Printed in the United States of America

For additional copies write:
Publication and Information Services Division
The American College Testing Program
P.O. Box 168, Iowa City, Iowa 52240

(Check or money order must accompany request.) Price: $1.00


ABSTRACT

In this paper, an attempt has been made to synthesize some of the current thinking in the area of criterion-referenced testing as well as to provide the beginning of an integration of theory and method for such testing. Since criterion-referenced testing is viewed from a decision-theoretic point of view, approaches to reliability and validity estimation consistent with this philosophy are suggested. Also, to improve the decision-making accuracy of criterion-referenced tests, a Bayesian procedure for estimating true mastery scores has been proposed. This Bayesian procedure uses information about other members of a student's group (collateral information), but the resulting estimation is still criterion-referenced rather than norm-referenced in that the student is compared to a standard rather than to other students. In theory, the Bayesian procedure increases the "effective length" of the test by improving the reliability, the validity, and more importantly, the decision-making accuracy of the criterion-referenced test scores.



TOWARD AN INTEGRATION OF THEORY AND METHOD FOR CRITERION-REFERENCED TESTS 1, 2

Ronald K. Hambleton
Melvin R. Novick

Over the years, standard procedures for constructing, administering, and analyzing tests, and interpreting scores in the context of standard instructional models and methods have become well known to educators. With these models, tests have been used primarily and most successfully to estimate each examinee's ability level and to permit comparative statements (e.g., ranking) across examinees. Recently, however, there have been numerous suggestions for, and demonstrations of, instructional models and methods in the schools where the well-known classical mental test models for test construction and test score interpretation appear to be less useful. Examples of these instructional models include: Computer-Assisted Instruction (Atkinson, 1968; Suppes, 1966), Individually Prescribed Instruction (Glaser, 1968), Project PLAN (Flanagan, 1967, 1969), and A Model of School Learning (Carroll, 1963, 1970; Bloom, 1968; Block, 1971). Common to most of these instructional models as well as to several others are such features as the specification of the curriculum in terms of behavioral objectives, detailed diagnosis of beginning students, the availability of multiple instructional modes, individual pacing and sequencing of material, and the careful monitoring of student progress.

While not all educators agree on the usefulness of these instructional models in the schools, the position taken in this paper is that these models are useful, and that their usefulness will be enhanced by developing testing methods and decision procedures specifically designed for use within the context of these models. The purpose of this paper is to outline some appropriate statistical methods that may prove of use in making instructional decisions for students.

It appears that much of the discussion in this area (for example, see Block, 1971; Carver, 1970; Ebel, 1971; and Glaser & Nitko, 1971) stems from different understandings as to the basic purpose of testing in these instructional models. It would seem to us that in most cases the pertinent question is whether or not the individual examinee has attained some prescribed degree of competence on an instructional performance task (see, for example, Harris, 1972b). Questions of precise achievement levels and comparisons among individuals on these levels seem to be largely irrelevant. In many of the new instructional models, tests are used to determine on which instructional objectives an examinee has met the acceptable performance level standard set by the model designer. This test information is usually used immediately to evaluate the student's mastery of the instructional objectives covered in the test, so as to appropriately locate him for his next instruction (Glaser & Nitko, 1971). Tests especially designed for this particular purpose have come to be known as criterion-referenced tests. Criterion-referenced tests are specifically designed to meet the measurement needs of the new instructional models. In contrast, the better known norm-referenced tests are principally designed to produce test scores suitable for ranking individuals on the ability measured by the test. Sometimes this occurs with the understanding that some cut-off score will be introduced to reject some percentage of students for the next level of instruction.

1 This paper was begun while the first author was on a Summer Postdoctoral Fellowship at The American College Testing Program and completed with support from the Office of Education. The research reported herein was performed pursuant to Grant No. OEG-0-72-0711 from the Office of Education, U.S. Department of Health, Education, and Welfare. Contractors undertaking such projects under government sponsorship are encouraged to express freely their professional judgment in the conduct of the project. Points of view or opinions stated do not, therefore, necessarily represent official Office of Education position or policy. The authors are indebted to Roy Williams and Thomas Hutchinson for helpful comments.

2 Paper presented at the annual meeting of the National Council on Measurement in Education, Chicago, 1972.


Criterion-Referenced Tests: Definitions and Selected Issues

A "criterion- referenced test" has been defined ina multitude of ways in the literature. (See, forexample, Glaser & Nitko, 1971; Harris & Stewart,1971; Ivens, 1970; Kriewall, 1969. and Livingston,1972a). The definitions are sufficiently differentthat a test may be classified as norm-referencedaccording to one definition, criterion-referencedaccording to another, or more typically, exhibitcharacteristics of each to a greater or lesser extentdepending on the definition. The intentionallymost restrictive definition of a criterion-referencedtest was proposed by Harris and Stewart (1971):"A pure criterion-referenced test is one consistingof a sample of production tasks drawn from awell-defined population tg4erformance, a samplethat, may be used to estimate the proportion ofperfr'rmances in that population at which thestudent can succeed." On the other hand, possiblythe least restrictive definition is that by Ivens(1970) who defined a criterion-referenced test as"one made up of items keyed to a set of behavioralobjectives." A very flexible definition has beenproposed by Glaser and Nitko (1971): "Acriterion-referenced test is one that is deliberatelyconstructed so as to yield measurements that aredirectly interpretable in terms of specified perkformance standards." According to Glaser andNitko, "The performance standards are usuallyspecified by defining some domain of tasks thatt he student should perform. Representativesamples of tasks from this domain are organizedinto a test, Measurements are taken and are used tomake a statement about the performance of eachindividual relative to that domain." This definitionis less restrictive than Harris and Stewart's in that itdoes not limit consideration to a single instruc-tional objective. A common thread runningthrough the various approaches to criterion-referenced tests is that the definition of a well-specified content domain and the development ofprocedures for generating appropriate samples oftest items are important. (For more on this, see,Bormuth, 1970; Glaser & Nitko, 1971; and Hively,Patterson, & Page, 1968.)

It should be noted that these are also concerns of those interested in constructing norm-referenced tests; however, not to the same extent. Less often is there an interest in making inferences about which particular skills an individual has or does not have from his performance on a norm-referenced test. Thus, norm-referenced testing is seldom diagnostic. Primary examples would be the Scholastic Aptitude Test (SAT) and, to a lesser extent, the ACT Assessment. Exceptions would be tests such as the Iowa Tests of Basic Skills which have important features of both norm- and criterion-referenced tests. Such tests are norm-referenced because they are geared to reporting how well a student compares with others in certain well-defined populations (e.g., through percentile scores). Yet, they are criterion-referenced in that they are keyed to specific instructional objectives, are multiscaled, and diagnostic. However, they do not involve a priori judgment as to acceptable performance levels and a consequent judgment as to whether or not an individual student attains this performance level. Further distinctions between norm-referenced tests and criterion-referenced tests have been presented by Block (1971), Ebel (1971), Glaser (1963), Glaser and Nitko (1971), Hambleton and Gorth (1971), Hieronymus (1972), and Popham and Husek (1969).

If one accepts the Glaser and Nitko definition of a criterion-referenced test, it is apparent that the test may often be multidimensional while made up of unidimensional subscales. That is, the items from a criterion-referenced test are organized into distinct and different subscales of homogeneous items measuring common skills. (The possibility of a single-item subscale is not ruled out.) An instructional decision for each individual is then often made on the basis of his performance on each subscale. Major interest may, thus, rest on the reliability and validity of subscale scores.

One of the problems yet to be reckoned with for criterion-referenced tests is an instance of the bandwidth-fidelity issue (Cronbach & Gleser, 1965). When the total testing time is fixed and there is interest in measuring many competencies, one may be faced with the problem of whether to obtain very precise information about a small number of competencies or less precise information about many more competencies. Time allocation algorithms (analytical procedures for deciding how many items on a test should measure each objective) of a rather different kind than those presented by Woodbury and Novick (1968) and Jackson and Novick (1970) will be required. They will be closer in spirit, but not identical, to those given by Cronbach and Gleser (1965). The problem of how to fix the length of each subscale so as to maximize the percentage of correct decisions or some similar measure of overall decision-making accuracy on the basis of test results has yet to be resolved or, indeed, to be formulated satisfactorily.

Distinction among Testing Instruments, Measurement, and Decisions

Some clarification concerning appropriate measurement models for these new instructional programs can be obtained by properly distinguishing between testing instruments and measurement. With the availability of a test theory for norm-referenced measurement (e.g., see Lord & Novick, 1968), we have procedures for constructing appropriate measuring instruments, i.e., norm-referenced tests. Then, the pertinent question seems to be whether or not the instructional models which require different kinds of measurements (i.e., criterion-referenced measurement) also require new kinds of tests, or whether the usual tests with alternate procedures for interpreting test scores can be used. We subscribe to the belief that different tests are needed, constructed to meet quite different specifications than those typically set for norm-referenced tests (Glaser, 1963). We do not propose, however, to explicate a developed theory of criterion-referenced measurement in this paper nor to prescribe a technology for criterion-referenced test development. Such explication should be based both on a well-developed instructional theory and on a decision-theoretic formulation of the measurement problem. Only the latter is even touched on here. The test development technology would be concerned primarily with methods of obtaining a representative sample of behaviors from a specified domain.

It should be noted that a norm-referenced test can be used for criterion-referenced measurement, albeit with some difficulty, since the selection of items is such that many objectives will very likely not be covered on the test or, at best, will be covered with only a few items. A criterion-referenced test constructed by procedures especially designed to facilitate criterion-referenced measurement can be, and sometimes is, used to make norm-referenced measurements. However, a criterion-referenced test is not constructed specifically to maximize the variability of test scores (whereas a norm-referenced test is). Thus, since the distribution of scores on a criterion-referenced test will tend to be homogeneous, it is obvious that such a test will be less useful for ordering individuals on the measured ability. In summary, then, a norm-referenced test can be used to make criterion-referenced measurements, and a criterion-referenced test can be used to make norm-referenced measurements, but neither usage will be particularly satisfactory.

Thus it may be misleading to talk about tests as either norm-referenced or criterion-referenced, since measurements obtained from either testing instrument can be explained with a norm-referenced interpretation, a criterion-referenced interpretation, or both. The important distinction, we believe, is between norm-referenced measurement and criterion-referenced measurement. This distinction was made by Glaser (1963) but seems to have been ignored by several subsequent writers.

Decision-Theoretic Approach to Criterion-Referenced Measurement

Our own conceptual framework for criterion-referenced measurement goes this way. Like Cronbach and Gleser (1965), we see testing as a decision-theoretic process. One of the main differences between norm-referenced tests and criterion-referenced tests is in terms of the kinds of decisions they are specifically designed to make. Norm-referenced measurement is particularly useful in situations where one is interested in "fixed-quota" selection or ranking of individuals on some ability continuum. Criterion-referenced measurement involves what Cronbach and Gleser (1965) would call a "quota-free" selection problem. That is, there is no quota on the number of individuals who can exceed the cut-off score or threshold on a criterion-referenced test. A cut-off score is set for each subscale of a criterion-referenced test to separate examinees into two mutually exclusive groups. One group is made up of examinees with high enough test scores (≥ the cut-off score) to infer they have mastered the material to a desired level of proficiency. The second group is made up of examinees who did not achieve the minimum proficiency standard. At this stage of the development of a theory of criterion-referenced measurement, the establishment of cut-off scores is primarily a value judgment. Much research might usefully be undertaken to provide guidelines for this judgment. The educational goal is, of course, to have everyone achieve the standards. This is attempted by means such as individualizing instruction to the point of providing multiple instructional modes (Cronbach, 1967), individual pacing and sequencing, as well as providing various remedial programs.

The primary problem in the new instructional models, such as individually prescribed instruction, is one of determining if the student's mastery level $\pi_i$ is greater than a specified standard $\pi_0$. Here, $\pi_i$ is the "true" score for an individual $i$ in some particularly well-specified content domain. It may represent the proportion of items in the domain he could answer successfully. Since we cannot administer all items in the domain, we sample some small number to obtain an estimate of $\pi_i$, represented as $\hat{\pi}_i$. The value of $\pi_0$ is the somewhat arbitrary threshold score used to divide individuals into the two categories described earlier, i.e., Masters and Nonmasters.

Basically then, the examiner's problem is to locate each examinee in the correct category. There are two kinds of errors that occur in this classification problem: false positives and false negatives. A false-positive error occurs when the examiner estimates an examinee's ability to be above the cutting score when, in fact, it is not. A false-negative error occurs when the examiner estimates an examinee's ability to be below the cutting score when the reverse is true. The seriousness of making a false-positive error depends to some extent on the structure of the instructional objectives. It would seem that this kind of error has the most serious effect on program efficiency when the instructional objectives are hierarchical in nature. On the other hand, the seriousness of making a false-negative error would seem to depend on the length of time a student would be assigned to a remedial program because of his low test performance. (Other factors would be the cost of materials, teacher time, facilities, etc.) The minimization of expected loss would then depend, in the usual way, on the specified losses and the probabilities of incorrect classification. This is then a straightforward exercise in the minimization of what we would call threshold loss.

In an attempt to view the above discussion in a more formal manner, suppose we take some cutting score, $\pi_0$, and define a parameter $\omega$ such that

$\omega = 1$ if $\pi \geq \pi_0$, and

$\omega = 0$ if $\pi < \pi_0$.

Now, if we obtain an estimate $\hat{\pi}$ of $\pi$, then an estimate of $\omega$ can be obtained in the following way:

$\hat{\omega} = 1$ if $\hat{\pi} \geq \pi_0$, and

$\hat{\omega} = 0$ if $\hat{\pi} < \pi_0$.

Defining our error of estimation as $(\hat{\omega} - \omega)$, it is clear that the error takes on one of three values, $+1$, $-1$, $0$, corresponding to whether we make a false-positive error, a false-negative error, or a correct classification. Also, note that the squares of the errors and their absolute values are identical. Thus, any procedure that minimizes squared-error loss (SEL) in the $\omega$ metric also minimizes absolute-error loss (AEL) in that metric. Furthermore, the minimization of SEL and AEL in the $\omega$ metric is equivalent to the minimization of threshold loss for $\pi$ in the special case where the losses associated with false positives and false negatives are equal. The criterion-referenced measurement problem is, thus, one of determining an estimator $\hat{\omega}$ of $\omega$ by determining an estimator $\hat{\pi}$ of $\pi$ with a threshold loss function and converting this to an estimate of $\omega$. We shall exemplify this process shortly. Note that with threshold loss, the estimate $\hat{\pi}$ of $\pi$ is not a single number but one of two intervals, $[0, \pi_0)$ or $[\pi_0, 1]$. It might well be argued that what we describe here is not "measurement" at all; and, in fact, it might be useful to avoid use of the term measurement in the above context.

The following example will illustrate an application of threshold loss. To estimate a person's $\pi$ value under threshold loss, first write down the losses associated with the two kinds of incorrect decisions. Thus, we take

$Q(e) = 0$ if $e = 0$,

$Q(e) = a > 0$ if $e = +1$,

$Q(e) = b > 0$ if $e = -1$.

The expected loss if we set $\hat{\omega} = 1$ is

$a[\mathrm{Prob}(\pi < \pi_0 \mid \mathrm{data})]$   (1)

and if we set $\hat{\omega} = 0$ it is

$b[\mathrm{Prob}(\pi \geq \pi_0 \mid \mathrm{data})]$ .   (2)

Thus, we set $\hat{\omega} = 1$ or $0$ depending upon whether expression (1) or expression (2) is the smaller. This decision corresponds to estimating with threshold loss whether $\pi \geq \pi_0$ or $\pi < \pi_0$. Note, however, that we may decide that $\omega = 0$ ($\pi < \pi_0$), i.e., take $\hat{\omega} = 0$, not because $\mathrm{Prob}(\pi < \pi_0 \mid \mathrm{data}) > \mathrm{Prob}(\pi \geq \pi_0 \mid \mathrm{data})$, but because $a$ is very much larger than $b$; the loss associated with a false positive is very much greater than that associated with a false negative.

Suppose we judge the loss associated with a false positive to be $a = 8$ "units" and the loss associated with a false negative to be $b = 1$ unit. Further, suppose that given the data

$\mathrm{Prob}(\pi \geq \pi_0 \mid \mathrm{data}) = .85$ and, hence, $\mathrm{Prob}(\pi < \pi_0 \mid \mathrm{data}) = .15$;

then, the value of (1) is

$a[\mathrm{Prob}(\pi < \pi_0 \mid \mathrm{data})] = (8)(.15) = 1.2$,

and the value of (2) is

$b[\mathrm{Prob}(\pi \geq \pi_0 \mid \mathrm{data})] = (1)(.85) = .85$.

Hence, we take $\hat{\omega} = 0$ and classify the student as a nonmaster. Now, notice that the comparison of (1) and (2) is equivalent to the comparison of the ratio $a/b$ to the ratio

$[\mathrm{Prob}(\pi \geq \pi_0 \mid \mathrm{data})] / [1 - \mathrm{Prob}(\pi \geq \pi_0 \mid \mathrm{data})]$ .

This spotlights the fact that the educator need not stipulate $a$ and $b$ in any absolute value. He need only stipulate the ratio $a/b$. In this example, since $\mathrm{Prob}(\pi \geq \pi_0 \mid \mathrm{data}) = .85$, the student will be classified as a nonmaster unless the ratio $a/b < 5.67$. Generally, with $a$ and $b$ as given, a student will be classified as a master only if $\mathrm{Prob}(\pi \geq \pi_0 \mid \mathrm{data}) > .89$, approximately.
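To make the decision rule concrete, the following sketch (ours, not part of the original report; written in Python with hypothetical names) computes the two expected losses and the equivalent ratio test, reproducing the numbers in the example above.

```python
def classify_master(prob_ge_cutoff, a, b):
    """Threshold-loss classification.

    prob_ge_cutoff: Prob(pi >= pi_0 | data), the probability of mastery given the data
    a: loss attached to a false positive (calling a nonmaster a master)
    b: loss attached to a false negative (calling a master a nonmaster)
    Returns omega_hat = 1 (master) or 0 (nonmaster).
    """
    expected_loss_master = a * (1.0 - prob_ge_cutoff)     # expression (1)
    expected_loss_nonmaster = b * prob_ge_cutoff          # expression (2)
    return 1 if expected_loss_master < expected_loss_nonmaster else 0

# Worked example from the text: a = 8, b = 1, Prob(pi >= pi_0 | data) = .85
print(classify_master(0.85, a=8, b=1))        # 0 -> nonmaster, since 1.2 > .85

# Equivalent ratio form: classify as a master only if a/b < P / (1 - P)
p = 0.85
print(8 / 1 < p / (1 - p))                    # False, since .85/.15 = 5.67 < 8
```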

It should be noted that the above approach generalizes quite easily to situations where there are possibly several different treatments, several relevant levels of mastery on each skill, and several different prerequisite skills. Details of such situations will be given elsewhere.

Bayesian Estimation of Mastery Scores

In order to determine if an examinee has mastered a particular skill (i.e., instructional objective), we analyze his responses to items on a criterion-referenced test designed to measure that skill. These items plus the items designed to measure achievement of other skills are organized together to form a criterion-referenced test.


Each student is assumed to have some mastery score, $\pi_i$, which may be the proportion of items in the domain he can answer correctly. The measurement problem is to estimate $\pi_i$ from some usually small number of test items. Typically, a student's mastery score is estimated to be his proportion-correct score. Mastery scores are estimated for the purpose of decision making: if $\hat{\pi}_i \geq \pi_0$, the student is sent on to new work; otherwise, with $\hat{\pi}_i < \pi_0$, he is assigned some remedial work. Before presenting a Bayesian solution to the mastery assessment problem, let us consider the problem of estimating a single student's true score $\pi_i$.

Generally, the method of using the proportion correct as an estimate of $\pi_i$ is not entirely satisfactory when the number of items on which the proportion is based is few and when there are many students. In situations where one is interested in estimating many parameters, some, by chance, will be substantially overestimated and others underestimated. The implication of this is that many errors of classification will be made. In estimation or in making mastery decisions on the basis of small amounts of information, we run the risk of making many errors. What is the solution? Because of the extensive amount of testing taking place, it is usually impractical to consider lengthening the test. However, a Bayesian estimation procedure proposed by Novick, Lewis, and Jackson (1972) provides, at least theoretically, a way of obtaining more information about each examinee without requiring the administration of any additional test items. According to Novick et al. (1972), this can be done by using not only the direct information provided by a student's (subscale) score, but also the collateral information contained in the test data of other students. (Another possibility, and one worthy of further research, is that of using the student's other subscale scores and previous history as collateral information.)
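The sketch below is only a schematic illustration of how collateral group information can enter a mastery-score estimate; it uses a simple beta-binomial (empirical Bayes) prior and is not the Novick, Lewis, and Jackson (1972) procedure itself. The prior strength and the scores are hypothetical.

```python
def shrunken_mastery_estimates(scores, n_items, prior_strength=8.0):
    """Posterior-mean estimates of pi_i that blend each student's own
    proportion correct with the group's performance (collateral information).

    scores: number-correct scores on one subscale, one per student
    n_items: number of items in the subscale
    prior_strength: assumed 'effective' number of prior items (a tuning choice)
    """
    group_mean = sum(scores) / (len(scores) * n_items)    # collateral information
    alpha = prior_strength * group_mean
    beta = prior_strength * (1.0 - group_mean)
    # A Beta(alpha, beta) prior combined with binomial data gives this posterior mean
    return [(alpha + x) / (alpha + beta + n_items) for x in scores]

# Ten-item subscale: low scorers are pulled up and high scorers pulled down
# toward the group mean, which is the intended effect of using collateral data.
print(shrunken_mastery_estimates([3, 5, 7, 9, 10], n_items=10))
```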

A familiar example of how this can be done comes from the application of classical test theory (Lord & Novick, 1968) to norm-referenced measurement. Within the classical test theory model, each examinee's observed score $x$ on a test may be used as an estimate of his true score $\tau$. The standard deviation of error scores across examinees in the population (standard error of measurement) will be $\sigma_x\sqrt{1 - \rho_{xx'}}$, where $\sigma_x$ is the standard deviation of observed scores, and $\rho_{xx'}$ is the reliability of the test. This formula provides a measure of the inaccuracy, on the average, of using the observed score as an estimate of true score. An alternative method of estimating true score is to use a regression estimate $\hat{\tau} = x\rho_{xx'} + \mu_x(1 - \rho_{xx'})$, where $\mu_x$ is the mean observed score in the population of examinees. It can be shown that the average error in the population obtained by using $\hat{\tau}$ as an estimator of $\tau$ is $\sigma_x\sqrt{\rho_{xx'}}\sqrt{1 - \rho_{xx'}}$. This is called the standard error of estimation. By comparing formulas, it is easily seen that the standard error of estimation is smaller than the standard error of measurement and is substantially smaller than the latter when $\rho_{xx'}$ is low. This is because, in effect, we are using information about the group of which the individual is a member to provide "prior" information for the Bayesian estimation of each person's true mastery score. With this approach, under common circumstances, the Bayesian method can effect an increase of precision equivalent to that which would be obtained by adding between 6 and 12 items to the test (see Novick, Lewis, & Jackson, 1972). Thus, the Bayesian method has something substantial to offer in the context of norm-referenced measurement problems, and similarly, it would seem that the same potential exists with criterion-referenced testing problems.
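As an illustration of the classical quantities just described, the sketch below (with hypothetical numbers) computes the regression estimate of true score and compares the standard error of measurement with the standard error of estimation.

```python
from math import sqrt

def regression_estimate(x, mean_x, rel):
    """Regression (Kelley) estimate of true score: rel*x + (1 - rel)*mean_x."""
    return rel * x + (1.0 - rel) * mean_x

def standard_error_of_measurement(sd_x, rel):
    """sigma_x * sqrt(1 - rho_xx')."""
    return sd_x * sqrt(1.0 - rel)

def standard_error_of_estimation(sd_x, rel):
    """sigma_x * sqrt(rho_xx') * sqrt(1 - rho_xx'); never larger than the SEM."""
    return sd_x * sqrt(rel) * sqrt(1.0 - rel)

# Hypothetical short subscale: mean .60, standard deviation .15, reliability .50
print(regression_estimate(0.80, mean_x=0.60, rel=0.50))   # 0.70, pulled toward the mean
print(standard_error_of_measurement(0.15, 0.50))          # ~0.106
print(standard_error_of_estimation(0.15, 0.50))           # ~0.075, smaller than the SEM
```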

However, it should be noted that our previous discussion has stressed that threshold-loss estimates will be required. The estimates obtained by Novick, Lewis, and Jackson (1972) were based on a zero-one loss function, and thus, a modification of the Novick, Lewis, and Jackson method would be desirable. At present, cumbersome numerical methods would be required to obtain a solution.

One example that rather dramatically illustrates the effect of the Bayesian estimation procedure is the following. Suppose we administer a criterion-referenced test to a group of examinees before and after instruction. Let us limit ourselves to the problem of estimating mastery scores on a particular objective for the group of examinees on the two test occasions. Suppose that the tests are short, and hence, probably have only moderate reliability. Suppose further that the mean pretest and posttest scores are .4 and .8, respectively, and the threshold score is .65. Now a student with a proportion-correct score of .7 on the pretest would under the usual procedure be allowed to skip that particular unit of instruction. However, chances are that this student's mastery score is overestimated. The Bayesian analysis might well decide that he was a nonmaster. Speaking loosely and with respect to a squared-error loss method, the Bayesian analysis might regress his estimated score toward the group mean, past the cutting score, and, thus, assign him to take instruction on the skill.

Consider now a student with a proportion-correct score of .6 on the posttest. Here the Bayesian analysis could well be that his "estimated score," in effect, exceeds $\pi_0$; so, instead of assigning him to some remedial program, he will be allowed to go on to new work. However, if his posttest group had a mean performance of .68, he would probably be estimated to be a nonmaster.
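Speaking just as loosely, the sketch below reproduces the pretest/posttest example with a simple regression-toward-the-group-mean estimate; the reliability of .5 is our assumption, not a figure from the report.

```python
def regressed_score(x, group_mean, rel):
    """Squared-error-loss style estimate: shrink the observed proportion correct
    toward the mean of the student's group (the collateral information)."""
    return rel * x + (1.0 - rel) * group_mean

CUTOFF = 0.65
REL = 0.50   # assumed moderate reliability of the short subscale

pre = regressed_score(0.7, group_mean=0.4, rel=REL)         # 0.55 -> nonmaster
post = regressed_score(0.6, group_mean=0.8, rel=REL)        # 0.70 -> master
post_low = regressed_score(0.6, group_mean=0.68, rel=REL)   # 0.64 -> nonmaster

for estimate in (pre, post, post_low):
    print(round(estimate, 2), "master" if estimate >= CUTOFF else "nonmaster")
```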

Approaches to Reliability and Validity Estimation

In practical applications of criterion-referenced testing, it would seem that in order to evaluate the test, it would be necessary to know something about the consistency of decision making across parallel forms of the criterion-referenced test or across repeated measurements (i.e., reliability). Another aspect of the measurement procedure that should seemingly be considered is the accuracy of decision making (i.e., validity). The problem of reliability and validity estimation for criterion-referenced tests is considered next.

Because the designer of a criterion-referenced test has little interest in discriminating among examinees, no attempt is made to select items to produce a test of maximum test score variability, and thus, that variance will typically be small. Also, criterion-referenced tests are usually administered either immediately before or after small units of instruction. Thus, it is not surprising that we frequently observe homogeneous distributions of test scores on the pre- and posttests, but centered at the low and high ends of the achievement scales, respectively. It is well known from the study of classical test theory (Lord & Novick, 1968) that when the variance of test scores is restricted, correlational estimates of reliability and validity will be low. Thus, it seems clear that the classical approaches to reliability and validity estimation will need to be interpreted more cautiously (or discarded) in the analysis of criterion-referenced tests. Perhaps an even more serious reservation concerning the classical approach to reliability and validity estimation for criterion-referenced tests, if one looks at these psychometric concepts in decision-theoretic terms, is that the correlational method represents an inappropriate choice of a loss function (squared-error loss in the $\pi$ metric) with which to evaluate a test. This point will be expanded upon later.


However, before considering a decision-theoretic approach to reliability and validity estimation, let us review some alternate approaches proposed by other writers. Carver (1970) argues that the reliability of any test depends upon replicability, but replicability is not dependent upon test score variance. If a group of examinees all obtain similar scores (to other members of the group) on parallel forms of some criterion-referenced test, near perfect replicability exists even though test reliability, estimated using classical correlational methods, would be close to zero. This rather extreme example points out the shortcoming of the correlational approach to reliability estimation. Carver (1970) proposed two statistics to assess criterion-referenced test reliability. First, he says, "The reliability of a single form of a criterion-referenced device could be estimated by administering it to two comparable groups. The percentage that met the criterion in one group could be compared to the percentage that met the criterion in the other group [p. 56]." The more comparable the statistics, the more reliable the test could be said to be. Secondly, Carver suggested that the reliability of a criterion-referenced test should be assessed by comparing the percentage of examinees achieving the criterion on parallel tests.
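Both of Carver's suggestions reduce to comparing percentages of examinees who reach the criterion; a minimal sketch (with made-up scores and a made-up cutting score) is given below.

```python
def percent_meeting_criterion(scores, cutoff):
    """Percentage of examinees whose proportion-correct score reaches the cutoff."""
    return 100.0 * sum(s >= cutoff for s in scores) / len(scores)

# First suggestion: one form administered to two comparable groups
group_1 = [0.5, 0.7, 0.8, 0.9]
group_2 = [0.6, 0.7, 0.7, 0.9]
print(percent_meeting_criterion(group_1, 0.65), percent_meeting_criterion(group_2, 0.65))

# Second suggestion: parallel forms administered to the same group
form_a = [0.6, 0.7, 0.9, 0.5]
form_b = [0.7, 0.6, 0.8, 0.6]
print(percent_meeting_criterion(form_a, 0.65), percent_meeting_criterion(form_b, 0.65))
```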

Cox and Graham (1966) report the use of the coefficient of reproducibility as an alternative to the classical approach to reliability estimation for one special type of criterion-referenced test. They calculate the coefficient for a sequentially scaled criterion-referenced test designed for use in a unit of instruction where objectives can be identified as being sequential in nature. Tests are said to be scalable if, for a particular ordering of items, individuals are able to answer all questions up to a point and none beyond. The coefficient of reproducibility is a measure of the extent to which group performance satisfies this condition. As Cox (1970) suggests, the problems of using the coefficient of reproducibility as a reliability estimate have yet to be determined.
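One common way of computing a Guttman-type coefficient of reproducibility for a fixed sequential item ordering is sketched below; Cox and Graham's exact counting convention may differ, so treat this only as an illustration.

```python
def coefficient_of_reproducibility(response_matrix):
    """1 - (response errors / total responses) for a fixed item ordering.

    response_matrix: per-person 0/1 response vectors, items already ordered by
    their place in the instructional sequence. An 'error' is any response that
    deviates from the ideal pattern in which a person with total score t
    passes exactly the first t items.
    """
    errors = 0
    total = 0
    for responses in response_matrix:
        t = sum(responses)
        ideal = [1] * t + [0] * (len(responses) - t)
        errors += sum(r != i for r, i in zip(responses, ideal))
        total += len(responses)
    return 1.0 - errors / total

# Three examinees, four sequential objectives; the third pattern contains
# two responses that deviate from its ideal pattern [1, 1, 0, 0].
data = [[1, 1, 1, 0],
        [1, 1, 0, 0],
        [1, 0, 1, 0]]
print(coefficient_of_reproducibility(data))   # 1 - 2/12 = 0.833...
```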

Another interesting suggestion for reliability estimation comes from the work of Livingston (1972a, 1972b). He proposes a reliability coefficient which is based on squared deviations of scores from the performance standard (or cutting score) rather than the mean, as is done in the derivation of reliability for norm-referenced tests in classical test theory. The result is a reliability coefficient which has several of the important properties of a classical estimate of reliability. In fact, it can be easily shown that the classical reliability is simply a special case of the new reliability coefficient. However, several psychometricians (e.g., Harris, 1972a) have expressed doubts concerning the usefulness of Livingston's reliability estimate.
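A coefficient with the structure just described, replacing deviations from the mean with deviations from the cutting score C, is sketched below; the exact expression should be checked against Livingston (1972a). It reduces to the classical coefficient when C equals the group mean, which is the special-case property noted above.

```python
def criterion_referenced_reliability(classical_rel, var_x, mean_x, cutting_score):
    """A Livingston-type coefficient: squared deviations are taken from the
    cutting score C rather than from the mean. When C equals the mean, the
    added term vanishes and the classical reliability is recovered."""
    d2 = (mean_x - cutting_score) ** 2
    return (classical_rel * var_x + d2) / (var_x + d2)

# Hypothetical subscale: classical reliability .50, SD .15, mean .80, cutoff .65
print(criterion_referenced_reliability(0.50, var_x=0.15 ** 2, mean_x=0.80, cutting_score=0.65))  # 0.75
print(criterion_referenced_reliability(0.50, var_x=0.15 ** 2, mean_x=0.65, cutting_score=0.65))  # 0.50
```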

Our own feeling is that Livingston misses the point for much of criterion-referenced testing. It is not, as he suggests, "to know how far a student's score deviates from a fixed standard." Rather, the problem is one of deciding whether a student's true performance level is above or below some cutting score. In fact, in most practical applications of criterion-referenced tests, the test score is used to dichotomize individuals into either a "mastery" or a "nonmastery" category. Thus, from our conceptualization of the measurement problem with criterion-referenced measurement, Livingston's choice of a loss function with which to evaluate the reliability of a criterion-referenced test is wrong. Specifically, we suggest that squared-error loss in the $\pi$ metric is not appropriate and that threshold loss is appropriate.

It may be the case that a measurement situation will arise with the new instructional models in which a squared-error or absolute-error loss function is appropriate, but in such a situation, it is unlikely that there would simultaneously be a great concern with a threshold score.


While there has been little work done on the problem of assessing reliability, even less work has been reported to date on establishing the validity of criterion-referenced test scores. Above all else, a criterion-referenced test must have content validity. According to Popham and Husek (1969), content validity is determined by "a carefully made judgment, based on the test's apparent relevance to the behaviors legitimately inferable from those delimited by the criterion." If techniques such as those advocated by Hively, Patterson, and Page (1968) or Bormuth (1970) for defining content domains and item generation rules are followed, content validity follows. If other procedures are used, the task of determining content validity becomes more difficult.

While we would suggest that the traditional concepts of reliability and validity could be replaced by a complete decision-theoretic formulation, it will nevertheless be useful to point out a relationship between these approaches. Suppose we are given two criterion-referenced tests which, in a specified population and for a specified qualifying score $\pi_0$, are parallel (in the classical sense; see Lord & Novick, 1968) in the $\omega$ metric. Denote the estimates of $\omega$ for person $i$ on the two tests by the observed scores $\omega_{1i}$ and $\omega_{2i}$, and define the reliability of the test as the correlation over persons of $\omega_{1i}$ and $\omega_{2i}$. This is, of course, classical reliability theory in the $\omega$ metric. It is not particularly satisfactory for the usual reasons that product-moment correlations are unsatisfactory measures of association or agreement for binary (zero-one) variables. A more satisfactory measure of reliability might simply be the proportion of times that the same decision would be made with the two parallel instruments.

Validity theory would take the same form, except, of course, that a new test Y would serve as criterion, and the qualifying score on the second test need not correspond with the qualifying score on the predictor criterion-referenced test. The criterion "test" might well be derived from performance on the next unit of instruction, or it could be a job-related performance criterion.
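The decision-consistency idea is simple to compute; a minimal sketch with hypothetical mastery decisions follows. For validity, the second vector would simply be replaced by 0/1 decisions from an external criterion measure, such as success on the next unit of instruction.

```python
def decision_consistency(decisions_form_1, decisions_form_2):
    """Proportion of examinees given the same mastery decision (omega = 0 or 1)
    by two parallel criterion-referenced forms."""
    agreements = sum(a == b for a, b in zip(decisions_form_1, decisions_form_2))
    return agreements / len(decisions_form_1)

# Hypothetical decisions (1 = master, 0 = nonmaster) for eight examinees
form_1 = [1, 1, 0, 1, 0, 0, 1, 1]
form_2 = [1, 0, 0, 1, 0, 1, 1, 1]
print(decision_consistency(form_1, form_2))   # 0.75
```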



REFERENCES

Atkinson, R. C. Computer-based instruction in initial reading. In Proceedings of the 1967 Invitational Conference on Testing Problems. Princeton: Educational Testing Service, 1968.

Block, J. H. Criterion-referenced measurements: Potential. School Review, 1971, 69, 289-298.

Block, J. H. (Ed.) Mastery learning: Theory and practice. New York: Holt, Rinehart, and Winston, Inc., 1971.

Bloom, B. S. Learning for mastery. Evaluation Comment, 1968, 1, No. 1.

Bormuth, J. R. On the theory of achievement test items. Chicago: University of Chicago Press, 1970.

Carroll, J. B. A model of school learning. Teachers College Record, 1963, 64, 723-733.

Carroll, J. B. Problems of measurement related to the concept of learning for mastery. Educational Horizons, 1970, 48, 71-80.

Carver, R. P. Special problems in measuring change with psychometric devices. In Evaluative Research: Strategies and Methods. Pittsburgh: American Institutes for Research, 1970.

Cox, R. C. Evaluative aspects of criterion-referenced measurement. Paper presented at the annual meeting of the American Educational Research Association, Minneapolis, 1970. (ERIC, ED 038 679)

Cox, R. C., & Graham, G. T. The development of a sequentially scaled achievement test. Journal of Educational Measurement, 1966, 3, 147-150.

Cronbach, L. J. How can instruction be adapted to individual differences? In R. M. Gagne (Ed.), Learning and Individual Differences. Columbus, Ohio: Charles E. Merrill, 1967.

Ebel, R. L. Criterion-referenced measurements: Limitations. School Review, 1971, 69, 282-288.

Flanagan, J. C. Functional education for the seventies. Phi Delta Kappan, 1967, 49, 27-32.

Flanagan, J. C. Program for learning in accordance with needs. Psychology in the Schools, 1969, 6, 133-136.

Glaser, R. Instructional technology and the measurement of learning outcomes. American Psychologist, 1963, 18, 519-521.

Glaser, R. Adapting the elementary school curriculum to individual performance. In Proceedings of the 1967 Invitational Conference on Testing Problems. Princeton: Educational Testing Service, 1968.

Glaser, R., & Nitko, A. J. Measurement in learning and instruction. In R. L. Thorndike (Ed.), Educational Measurement. Washington: American Council on Education, 1971, 625-670.

Hambleton, R. K., & Gorth, W. P. Criterion-referenced testing: Issues and applications. Center for Educational Research Technical Report No. 13, School of Education, University of Massachusetts, Amherst, 1971.

Harris, C. W. An interpretation of Livingston's reliability coefficient for criterion-referenced tests. Journal of Educational Measurement, 1972, 9, 27-29. (a)

Harris, C. W. An index of efficiency for fixed-length mastery tests. Paper presented at the annual meeting of the American Educational Research Association, Chicago, 1972. (b)

Harris, M. L., & Stewart, D. M. Application of classical strategies to criterion-referenced test construction. Paper presented at the annual meeting of the American Educational Research Association, New York, 1971.

Hieronymus, A. N. Today's testing: What do we know how to do? In Proceedings of the 1971 Invitational Conference on Testing Problems. Princeton: Educational Testing Service, 1972.

Hively, W., Patterson, H. L., & Page, S. A "universe-defined" system of arithmetic achievement tests. Journal of Educational Measurement, 1968, 5, 275-290.

Ivens, S. H. An investigation of item analysis, reliability and validity in relation to criterion-referenced tests. Unpublished doctoral dissertation, Florida State University, 1970.

Jackson, P. H., & Novick, M. R. Maximizing the validity of a unit-weight composite as a function of relative component lengths with a fixed total testing time. Psychometrika, 1970, 35, 333-347.

Kriewall, T. E. Applications of information theory and acceptance sampling principles to the management of mathematics instruction. Unpublished doctoral dissertation, University of Wisconsin, 1969.

Livingston, S. A. Criterion-referenced applications of classical test theory. Journal of Educational Measurement, 1972, 9, 13-26. (a)

Livingston, S. A. A reply to Harris' "An interpretation of Livingston's reliability coefficient for criterion-referenced tests." Journal of Educational Measurement, 1972, 9, 31. (b)

Lord, F. M., & Novick, M. R. Statistical theories of mental test scores. Reading, Mass.: Addison-Wesley, 1968.

Novick, M. R., Lewis, C., & Jackson, P. H. The estimation of proportions in m groups. Psychometrika, 1972, 37, in press.

Popham, W. J., & Husek, T. R. Implications of criterion-referenced measurement. Journal of Educational Measurement, 1969, 6, 1-9.

Suppes, P. The uses of computers in education. Scientific American, 1966, 215, -221.

Woodbury, M. A., & Novick, M. R. Maximizing the validity of a test battery as a function of relative test lengths for a fixed total testing time. Journal of Mathematical Psychology, 1968, 5, 242-259.


ACT Research Reports

This report is Number 53 in a series published by the Research and Development Division of The American College Testing Program. The first 26 research reports have been deposited with the American Documentation Institute, ADI Auxiliary Publications Project, Photoduplication Service, Library of Congress, Washington, D.C. 20540. Photocopies and 35 mm. microfilms are available at cost from ADI; order by ADI Document number. Advance payment is required. Make checks or money orders payable to: Chief, Photoduplication Service, Library of Congress. Beginning with Research Report No. 27, the reports have been deposited with the National Auxiliary Publications Service of the American Society for Information Science (NAPS), c/o CCM Information Sciences, Inc., 22 West 34th Street, New York, New York 10001. Photocopies and 35 mm. microfilms are available at cost from NAPS. Order by NAPS Document number. Advance payment is required. Printed copies ($1.00) may be obtained, if available, from the Publication and Information Services Division, The American College Testing Program, P.O. Box 168, Iowa City, Iowa 52240. A check or money order must accompany the request.
