
RESEARCH MEMORANDUM NUMBER 19

JUNE

BEST TEST DESIGN AND SELF-TAILORED TESTING

BENJAMIN D. WRIGHT AND GRAHAM A. DOUGLAS

Statistical Laboratory / Department of Education

The University of Chicago


BEST TEST DESIGN AND SELF-TAILORED TESTING

Benjamin D. Wright

Graham A. Douglas

University of Chicago

University of Western Australia

ABSTRACT

Shows how the Rasch model and a few natural decisions about the nature of tests and their targets lead to simple practical procedures for best test design and self-tailored testing. The error minimization necessary for best test design is developed and applied. Tables for converting scores observed on self-chosen segments of uniform tests into test-free measures and their standard errors are provided. Robustness with respect to design errors in target location and item calibration is evaluated, and a convenient unit of measurement for psychological and educational variables suggested.

INTRODUCTION

When an examiner wishes to measure a person he must obtain an appropriate measuring instrument. He may do this by selecting from available standard forms or he may draw upon a pool of calibrated items and compose a test tailored to the requirements of his measurement target. He may be a psychometrician constructing an integrated series of standard forms from a national pool of well-calibrated items or a teacher making a test for a pupil from items in his files. In each case the examiner requires a clear way of thinking about tests and targets which allows him to deduce from the nature of the target he wishes to measure the characteristics of the best possible test for the job. We will develop and explain such a way of thinking and derive its practical consequences for how to design, construct and administer best tests.

A simple practical procedure is especially urgent if the examiner wishes to bring test and target into the best possible relationship during the process of testing. Tailored testing (Lord, 1971) tries to do this by various stepwise procedures. Unfortunately most schemes are either expensive, requiring computer assistance, or complicated, placing heavy demands on test-taking ingenuity.

The procedure we will develop lends itself to a simple, practical and inexpensive scheme for self-tailored testing. It makes it possible to present a target person with a standard booklet of items sequenced in increasing difficulty and to invite him to choose any starting point in the booklet which suits him. Having chosen, he can work up into harder items until he reaches items so difficult he no longer feels he can do his best and/or down into easier items until he reaches items so easy they no longer challenge him.

If the booklet is properly constructed, the examiner can use the target person's performance on such a variable segment of self-chosen items to estimate his ability. If the items in the booklet are more or less equally spaced on a log-odds scale and have been constructed to more or less fit the Rasch model, then three easy statistics: the sequence number of the easiest item reached, the sequence number of the hardest item reached and (through a simple self-scoring sheet) the number of intervening items succeeded on, are sufficient for reading the target person's estimated measure and its standard error directly from a small family of easy-to-use tables. No need for computer assistance, nor machine scoring, nor lengthy calculations. No need for pretests to identify the right individualized test. The process is self-tailoring. As the target person takes the test he finds for himself the items in the test booklet of difficulty best for him.
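The three segment statistics above can be tallied mechanically. The following Python sketch is ours, not part of the memorandum's procedure; it assumes the examinee's marks are recorded as a mapping from item sequence number to right/wrong for the items attempted:

```python
def segment_statistics(responses):
    """Summarize a self-chosen segment of a difficulty-ordered booklet.

    `responses` maps item sequence number -> True (right) / False (wrong)
    for every item the person attempted.
    """
    attempted = sorted(responses)
    easiest = attempted[0]            # sequence number of easiest item reached
    hardest = attempted[-1]           # sequence number of hardest item reached
    right = sum(responses.values())   # number of intervening items succeeded on
    return easiest, hardest, right

# A person who started at item 12 and worked outward:
marks = {10: True, 11: True, 12: True, 13: True, 14: False, 15: True, 16: False}
print(segment_statistics(marks))  # -> (10, 16, 5)
```

These three numbers are what the examiner would carry into the memorandum's conversion tables.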

BASIC CONCEPTS OF TEST DESIGN

In order to develop such a convenient system of measurement, we must decide what it is reasonable to imagine happens when a person to be tested responds to an item used to make measurements. What could a response we might observe tell us about a person we might wish to measure? What part does the item play? How do item and person interact to show us something about the person's position on an interesting but difficult to observe variable? How shall we think about this "latent" variable along which item and person supposedly interact?

The basic concepts we need to establish and clarify are: latent variable, response model, measurement target and test design.

The Latent Variable

A fundamental problem in the systematic development of our knowledge about the world is how to weave a useful connection between experience and idea. Careful observations are the core of our experience. Our motive for bothering to make careful observations is our idea about how we suppose things to be. We know our specific observations to be limited and incomplete yet we try to see in them indications of general ideas which we plan to use comprehensively at other places and times not yet experienced. The concept of a "latent variable" is intended to keep this difference between passing experience and persisting idea clearly before us and so to help us weave a useful connection between them.

The distinction between a latent or underlying variable and a manifest or observable one is analogous to the distinction between the parameter of a model and a statistic intended to estimate it. The parameter represents an idealized conception of all that we wish to know. The statistic represents a particular realization of what little we can observe. Our interest is focused on the latent variable and, whilst it can only be known through its observable manifestations, it is the latent variable which is the motive and meaning of our observations.

In mental measurement the "intelligence," "ability" or "achievement" variables along which both items and persons are supposed to be positioned are latent variables. It is through the calibration of items along these latent variables that we transform a person's observable responses into his "unobservable" measurements. Our interest in his actual responses is transient. Once we have used our knowledge of item calibration to extract from his item responses an optimal indication of his measure we have no further interest in the responses themselves and turn instead to our motive for observing these responses in the first place, namely our wish to measure the person on the fundamental but latent variable. In this discussion we will use the term "ability" to refer to person measure and speak of persons as "dumb" and "smart" and of items as "hard" and "easy." But the latent variable could as well be "achievement" or "intelligence" or some "attitude" or "inclination" or any other attribute for which we constructed relevant observations.

But if a variable is latent and we cannot observe it directly, how can we know what it is? How can we arrive at its operational definition? The answer lies in the items we use to make measurements. To be useful these items must be calibrated along the latent variable. Each item must have its own location, the position at which its difficulty matches the ability of a person for whom that item is just right.1 When the pool of items from which we select the elements for a best possible test has been calibrated on a latent variable, then these items and their locations on the latent variable provide its operational definition. A measurement of a person on the variable will place him among items with difficulties near his estimated ability. The meaning of his position on the variable will be defined by these nearby items.

Origin and Scale

In general the origin and scale of a latent variable are arbitrary. We must make some specific choice in order to proceed with measurement. If we ask, "What is the distance to New York?", we find that a useful answer depends on settling two prior questions: "From where?" and "In what units?" These are requests for an origin and scale. Thus even for the familiar variable "distance" an origin and scale must be defined before we can make useful measurements.

The origin is the place from which distance along the latent variable will be "measured." In general there is no unique zero place. We have to decide upon a choice convenient and useful for the measurements we want to make.

1 We are not ready to derive more exactly what "just right" might mean. For that we need to specify a model for what happens when a person responds to an item. But we can anticipate our final definition: "just right" will mean that the probability of a correct response is exactly one-half, that is "even odds for a correct response," or "a log odds difference between person ability and item difficulty of zero."

The same is true for scale. We may be motivated to make a choice of origin which frees us from fiddling with negative numbers. We may be motivated to choose a scale which frees us from decimals or uses the decimal point in an informative way. When no more useful choice is in sight we may even nominate a sample of person measurements or item calibrations as our standard and take their mean value as our origin and their standard deviation as our scale. But whatever our pleasure we will be forced to make some choice. In the discussion that follows we will use either the center of our target or the center of our test as our origin and the log odds for a correct response as our scale. At the end we will recommend a transformation of origin and scale with pleasing properties.

The concept of a latent variable is indispensable for acquiring a grasp of what testing is all about and essential for guiding our attempts to measure. A latent variable is operationally defined by the pool of calibrated items which provides the elements for measuring it. How to identify, select and calibrate items along a latent variable is described in Rasch (1960), Wright and Panchapakesan (1969), Andersen (1972a, 1972b, 1973) and Wright and Douglas (1975).

The Response Model

What do we observe when a person of unknown ability tries a calibrated item of some estimated difficulty? Usually just the binary response "right" or "wrong."2 In order to use this binary observation as a basis for estimating a person's ability and hence for measuring him we must formulate a plausible relationship between his unknown position, or "ability," on the latent variable, the calibrated "difficulty" of the item and the observable "response." We will call this relationship the "response model." By what criteria shall we formulate it?

2 More complex observations have been proposed and modelled (Andersen, 1972a, Samejima, 1968 and Bock, 1972) but so far they have found few uses. Since our aim is to formulate and develop procedures consistent with common practice, we will not adventure into the possibilities of more complex observations in this article.

The response model must be reasonable. The way persons and items enter into it must be consistent with what we already know. We also want the simplest model which will do the job. Beyond reasonableness and simplicity we want the measurements resulting from the model to be free from irrelevant influences such as how the items happen to have been calibrated and on whom (Rasch, 1960). In particular we want to be able to adjust our interpretation of the implications of each observed response for the difficulty of the item provoking it so that we can reach an "item-free" estimate of the person's ability (Wright, 1967). Finally since making measurements will amount to using a collection of well-designed observations to estimate a person's ability we want our response model to have parameters which can be estimated satisfactorily.

What kind of relationship can we expect between the unobservable characteristics of person and item and the observable response their interaction is supposed to produce? No matter how smart a person may be we cannot be sure he will succeed on a particular item. No matter how dumb, we cannot be sure he will fail. Therefore our only reasonable course is to formulate a probabilistic relationship between person ability, item difficulty and the response they are supposed to influence.

How shall ability and difficulty combine to determine this probability for a successful response? We expect a person's chances for success to increase with his ability but to decrease with item difficulty. Either the difference or the ratio of these two parameters will express these expectations. Since differences are simpler to work with, we will use them.

It only remains to formulate these differences in such a way that, no matter what their values, they will define a probability between zero and one. The simplest realization of these reasonable requirements is the response model for binary observations proposed by Georg Rasch (1960).3

3 For further comments on this kind of model see Birnbaum, 1968, 431-34 and 445-46.


This particular response model meets all our requirements. Not only is the relationship between person, item and the probability of a correct response reasonable and simple, but, as the work of Rasch and many others has so amply shown, the calibration of items can be person-free and the measurement of persons can be item-free.

What about the estimation of these person and item parameters? It has always been popular to take the number of right answers on a test as an indication of the ability of a person. That popular practice is just what the Rasch model calls for. In fact, if we take the widespread belief in the fundamental meaningfulness of an unweighted test score as our point of departure and ask what response model this belief implies, then we can deduce that the Rasch model is the one and only reasonable model consistent with that belief and its practice. The very same statistic which, according to the Rasch model, leads to an unbiased, consistent and sufficient estimator of person ability is that same unweighted test score which nearly everyone already uses to make their measurements.
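The sufficiency of the unweighted score can be seen computationally: the maximum likelihood ability estimate solves an equation that involves only the score, never which particular items were right. The following sketch (our illustration, assuming a Newton-Raphson solver; the score must lie strictly between 0 and the test length for a finite estimate) makes that concrete:

```python
import math

def rasch_ml_ability(score, difficulties, tol=1e-8):
    """Solve sum_i P_i(b) = score for ability b by Newton-Raphson.

    Only the raw score enters the estimating equation, so any two
    response patterns with the same score give the same estimate.
    """
    b = 0.0
    for _ in range(100):
        p = [1 / (1 + math.exp(d - b)) for d in difficulties]
        resid = sum(p) - score               # estimating equation residual
        info = sum(q * (1 - q) for q in p)   # Fisher information (its slope)
        step = resid / info
        b -= step
        if abs(step) < tol:
            break
    return b

d = [-1.5, -0.5, 0.5, 1.5]
# A score of 2 on this symmetric test, however earned, estimates b = 0:
print(round(rasch_ml_ability(2, d), 3))  # -> 0.0
```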

The Rasch model for what happens when person v responds to item i with binary response x_vi = 1 (for a correct response) and x_vi = 0 (otherwise) can be expressed as

    P{x_vi | β_v, δ_i} = e^[x_vi(β_v - δ_i)] / [1 + e^(β_v - δ_i)]     (1)

where

    β_v = the ability of person v, and
    δ_i = the difficulty of item i.
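For a correct response (x_vi = 1) the model reduces to a logistic function of the ability-difficulty difference. A minimal computational sketch (ours, not the memorandum's):

```python
import math

def rasch_probability(ability, difficulty):
    """P(correct) under the Rasch model: e^(b - d) / (1 + e^(b - d))."""
    return 1.0 / (1.0 + math.exp(difficulty - ability))

# An item "just right" for a person (b = d) gives even odds:
print(rasch_probability(1.0, 1.0))             # -> 0.5
# Success becomes more likely as ability exceeds difficulty:
print(round(rasch_probability(2.0, 1.0), 3))   # -> 0.731
```

The first line also illustrates footnote 1: "just right" means a log odds difference of zero and a success probability of exactly one-half.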

In order to emphasize that item difficulty in this model is a persisting property of the item and not of the person, we point out the difference between the colloquial use of "difficulty," as in the complaint "That item is easy for you but hard for me," and the technical use in which the item has its own fixed difficulty and the complaint becomes "You are smart but I am dumb." One value of the response model is that it formulates the relationship between these two kinds of difficulty explicitly. Colloquial difficulty is person-bound. It is the probability of a person with a particular ability succeeding on the item. But technical difficulty is person-free. It is the fixed item parameter which joins with person ability in the response model to produce the probability of a correct response.

Practical discussions of how to estimate the parameters of this model are given in Rasch (1960), Wright and Panchapakesan (1969), Wright and Douglas (1975) and Wright and Mead (1975).

The Target

When an examiner plans a measurement there must be a target person or group of persons about whom he wants to know more than he already knows. If he cares about the quality of his proposed measurements then he will want to choose or construct his measuring instrument with the specifics of his target in mind. In order to do this systematically he must begin by setting out as clearly as he can what he expects of his target. Where does he suppose it is located on the latent variable? How uncertain is he of that approximate location? What is the lowest ability he imagines the target could have? What is the highest? How are other possible values distributed in between?

Sometimes an examiner has explicit prior knowledge about his target. He or someone else has measured it before and so he can suggest its probable location and dispersion directly in terms of these prior measures on the latent variable and their standard errors. Sometimes an examiner has sample items calibrated along the latent variable, some of which he or his client believe are probably just right for the target, some of which are nearly too hard and some of which are nearly too easy. Then he can take from the difficulties of these sample items rough indications of the probable center and boundaries of his target.

One way or another the examiner must assemble and clarify his suppositions about his target as well as he can so that he can derive from them the test design which has the best chance of most increasing his knowledge. If he knew enough about his target, he would not measure it. But no matter how little he knows, he always has some idea of where his target is. Being as clear as possible about that prior knowledge is essential for the design of the best possible test.


A target specification is a statement about where on the latent variable we suppose the target to be. We express our best guess by specifying the target's supposed center, its supposed dispersion and its supposed shape or distribution.

If we let

    M = our best guess as to target location,
    S = our best guess as to target dispersion,
    D = our best guess as to target distribution,

then we can describe a target G by the expression G(M,S,D) and we can summarize our prior knowledge and hence our measurement requirements for any target we wish to measure by guessing as well as we can values for the three target parameters, M, S, and D.

A picture of a target is given in the upper half of Figure 1. Guessing the supposed location M of a target seems straightforward. But guessing the dispersion S and the distribution D forces us to think through the difference between a target which is a single person and one which is a group. For the single person, S can describe the extent of our uncertainty about where that person is located. The larger our uncertainty, the larger S.

If we can specify boundaries within which we feel fairly sure the person will be found we can set S so that M ± kS defines these boundaries (where k = 2 or 3). Then even if we have no clear idea at all about the distribution D of our uncertainty we can nevertheless expect (thanks to Tchebycheff's inequality) that at least (1 - 1/k²) of the possible measures will fall within M ± kS.

If we go further and expect that the measures we think possible for the person will pile up near M then we may even be willing to take a normal distribution as a useful way to describe the shape of our uncertainty. In that case we can expect .95 of the possible measures to fall within M ± 2S and virtually all of them to fall within M ± 3S.
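Both coverage figures are easy to verify numerically. The sketch below (our check, using the error function for the normal integral) compares the worst-case Tchebycheff bound with the normal coverage for k = 2 and k = 3:

```python
import math

def normal_coverage(k):
    """P(|X - M| <= k*S) for a normal target, via the error function."""
    return math.erf(k / math.sqrt(2))

def tchebycheff_bound(k):
    """Worst-case coverage of M +/- k*S for a target of unknown shape."""
    return 1 - 1 / k**2

for k in (2, 3):
    print(k, round(tchebycheff_bound(k), 2), round(normal_coverage(k), 3))
# -> 2 0.75 0.954
# -> 3 0.89 0.997
```

These are the ≥.75/≥.89 (interval) and .95/.99+ (normal) coverages cited for M ± 2S and M ± 3S.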

We will refer to these two target distributions as the Tchebycheff interval and the normal. We might consider other target distributions. But these two cover an examiner's state of mind with respect to the shape of his target rather well. If he feels unhappy about thinking of his target as approximately normal then it is unlikely that he will have any definite alternative clearly in mind. Thus the most likely alternative to a normal target is one of unknown distribution, captured by a Tchebycheff interval. It is this simplification of possible target shapes to just two reasonable alternatives which makes a unique solution to the problem of best test design possible and practical.

[FIG. 1. The Distribution of a Target and the Operation of a Test. The upper panel plots the relative frequency of the target G(M,S,D) against the ability parameter b, with coverage by shape D: an interval target covers at least .75 within M ± 2S and at least .89 within M ± 3S; a normal target covers .95 and .99. The lower panel plots the operating curve of the test T(H,W,L), relative score f = r/L against the ability statistic b, over the range H - W/2 to H + W/2, with the best test design rules H = M, W = 4S, L = C/SEM², LOD = Δb = (∂b/∂f)Δf = C/L, 4 < C < 9.]

If the target is a group rather than an individual then we may take S and D to be our best guess as to the standard deviation and distribution of that group. If we think the group has a more or less normal distribution then we will take that as our best guess for D. Otherwise we can fall back on the Tchebycheff interval.

Finally the examiner must be explicit about how precise he wants his measurement to be. This is his motive for measuring. It is just because his present knowledge is too approximate to suit him that he wants to know more precisely where his target is and, if it is a group rather than an individual, more precisely about its dispersion. But whether the target is an individual or a group our decision about the standard error of measurement SEM will be made in terms of individuals, for that is what we measure.

In the case of a one person target we want SEM to be enough smaller than S to reward our measurement efforts with a useful increase in the precision of our knowledge about where the target person is located. In the case of a group target we want to achieve an improved estimate not only of M, the center of the group, but also of S, its dispersion. The observable variance of measures over the group estimates not only the underlying variance in ability S² but also the measurement error variance SEM². Our ability to see the dispersion of our target against the background error of measurement depends on our ability to distinguish between these two components of variance. Since they enter into the observable variance of estimated measures equally, the smaller SEM is with respect to S, the more clearly can we identify and estimate S², the component due to target dispersion.

Thus for all targets we seek SEM << S.
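The variance decomposition just described can be sketched directly: since the observed variance of measures mixes true dispersion and measurement error, S² is estimated by subtracting SEM² from the observed variance. This is our illustration, not a procedure from the memorandum, and the data are invented:

```python
import statistics

def target_sd(measures, sem):
    """Estimate target dispersion S from observed measures.

    The observed variance mixes true dispersion and measurement error:
    var(observed) = S**2 + SEM**2, so S**2 = var(observed) - SEM**2.
    Clamped at zero in case sampling noise makes the difference negative.
    """
    observed_var = statistics.pvariance(measures)
    return max(observed_var - sem**2, 0.0) ** 0.5

measures = [-1.2, -0.4, 0.1, 0.6, 1.3]
print(round(target_sd(measures, sem=0.3), 3))  # -> 0.797
```

The smaller the SEM relative to S, the less this correction matters, which is another way of seeing why we seek SEM << S.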

The Test

A test is a set of suitable items chosen to go together to form a measuring instrument. The complete specification of a test is the set of all parameters which characterize these items. But when we examine a picture of how a test works to transform observed scores into estimated measures, we see that the operating curve is rather simple and lends itself to specification through just a few test parameters. If we use the simplicity and clarity of the Rasch model as our guide, we can deduce that the only parameters which influence the operating characteristics of a test are the difficulties of its items. When we impose a reasonable fixed distribution on these difficulties, then no matter how many items we use, we can reduce the number of statistics sufficient to characterize the test to three.

In the lower half of Figure 1 we can see from the shape of the test operating curve that its outstanding features are its position along the latent variable, which we will call test height, and the range of abilities over which the test can measure more or less accurately, a characteristic caused primarily by the dispersion of the item difficulties, which we will call test width.

But height and width do not complete the characterization of a test. When we look more closely to see how precisely the test can measure on the latent variable we discover a discontinuity in what we can observe and hence measure. Both its least observable difference LOD and its least believable difference SEM are strongly related to the number of items in the test, that is its length. From this we see that any test design can be defined more or less completely just by specifying those three characteristics: height, width and length. If we let

    H = the height of the test on the latent variable (i.e., the average difficulty of its designed items),

    W = the width of the test in item difficulties (i.e., the range of its designed item difficulties), and

    L = the length of the test in number of items,

then we can describe a test design T by the convenient expression T(H,W,L).

In the practical application of best test design we will have to approximate our best design T for a target G from a finite pool of existing items. In order to discriminate in our thinking between the test design T(H,W,L) and its approximate realization in practice, we will describe an actual test as t(h,w,L), where

    h = the average difficulty of its actual items, and
    w = an estimate of their difficulty range.


PRINCIPLES OF BEST TEST DESIGN

What is a best test?4 One which measures best in the region within which measurements are expected to occur. Measuring best means most precisely. So a best test design T(H,W,L) is one with the smallest error of measurement SEM over the target G(M,S,D) for given length L (or, what is equivalent, with smallest L given SEM). But "over the target" implies the minimization of a distribution of possible SEM's. Thus a position with respect to the most likely target distribution must be taken before the minimization of SEM can proceed.

We have brought the profusion of possible target shapes under control by focusing our practice on their two extremes, interval and normal. How shall minimization be specified in each case? For a normal target it seems reasonable to maximize average precision, that is to minimize average SEM, over the whole target. To decide what to do for an interval target we need to know how the SEM varies over possible test scores.

When we derive an exact form for the precision of measurement, we see that for ordinary tests with less than three log odds units between adjacent items, precision is a maximum for measurements made at the center of the test and decreases as test and target are increasingly off-center with respect to one another. For tests centered on their targets this means that maximizing precision at the boundaries of an interval target is the only way to maximize precision over the target interval. So for interval targets we will maximize precision at the target boundaries.

When we derive the SEM from our response model we will discover that it is an inverse function of the information about ability supplied by each item response averaged over the test. Since the most informative items are those nearest the ability being measured and the least informative those farthest away, the precision over the target will depend not only on the distribution of the target but also on the shape of the test. Thus the question of what is a best test also depends on our taking a position with respect to the best distribution of test item difficulties.

4 Attempts to answer this question have been made by Birnbaum (1968, p. 465-71) and although many of our ideas are consistent with his efforts we believe that we have taken their implications to a logical and practical conclusion.

What are the reasonable possibilities? If we want to measure a normal target, then a test made up of normally distributed item difficulties might produce the best maximization of precision over the target. This is the conclusion implied in Birnbaum's analysis of information maximization (Birnbaum, 1968, p. 467).

However, normal tests are clumsy to compose. Normal order statistics can be used to define a unique set of item difficulties but this is tedious. More problematic is the odd conception of measuring implied by an instrument composed of normally distributed measuring elements. A normal test would be like a yardstick with rulings bunched in the middle and spread at the ends. Measuring lengths with such an irregularly ruled yardstick would be awkward. In the long run, even for normal targets, our interest becomes spread out evenly over all the lengths which might be measured. Equally spaced rulings are the test shape which serves that interest best. That is the way we construct yardsticks. The test design corresponding to an evenly ruled yardstick is the uniform test in which items are evenly spaced from easiest to hardest (Birnbaum, 1968, p. 466).

Two target distributions, normal and interval, and two test shapes, normal and uniform, produce four possible combinations of target and test. We investigated all four.

Now we must turn to our response model and derive from it the formulation of SEM so that we can become explicit about which designs maximize precision by minimizing SEM. We want to know what aspects of test design T(H,W,L) and test shape influence SEM and how we can vary these characteristics in response to a target specification G(M,S,D) in order to minimize SEM over that target.


The response model specifies

    P_fi = e^(b_f - d_i) / [1 + e^(b_f - d_i)]     (2)

where

    P_fi = the probability of a correct response at f and i,
    b_f = the ability estimate at relative score f = r/L, and
    d_i = the calibrated difficulty of item i.

According to maximum likelihood estimation (Birnbaum, 1968, p. 455-69; Wright and Panchapakesan, 1969, p. 41-44) b_f is estimated from a test of length L and items {d_i}, i = 1, L through the equations

    f = Σ(i=1..L) P_fi / L     for f = 1/L, ..., (L-1)/L     (3)

with asymptotic variance

    S²(b_f) = 1 / Σ(i=1..L) P_fi(1 - P_fi) = SEM_f²     (4)

which is the square of the standard error of measurement at relative score f.

In (4) we see that SEM depends on the sum of P_fi(1 - P_fi) over i. This expression is a function of b_f and all the d_i. However, fluctuations in P(1 - P) are rather mild for P between .2 and .8. To expedite insight into the make-up of SEM, we will reformulate it so that the average value of P_fi(1 - P_fi) over i is one component and test length is the other:

SEM_f = 1 / { L [ (1/L) Σ_{i=1}^{L} P_fi (1 - P_fi) ] }^(1/2)    (5)

Thus we factor length out of SEM in order to find a length-free error coefficient.

Resuming our study of the operating curve of a test in Figure 1, we see that the formulation for the least observable difference in ability LOD is Δb = (∂b/∂f)Δf. Since the least observable increment in relative score Δf is 1/L, all we need to complete the formulation of LOD is the derivative of b with respect to f, which from (2) and (3) above is

∂b/∂f = L / Σ_{i=1}^{L} P_fi (1 - P_fi)    (6)

But this is the inverse of the average value of P_fi(1 - P_fi) over i which we isolated in (5). We will identify this as the error coefficient C_f and note that it is not only a function of f but also of the distribution of item difficulties. Then

LOD_f = C_f / L    (7)

and

SEM_f = (C_f / L)^(1/2)    (8)

With SEM in this form we note that as far as test shape is concerned it is C_f which requires minimization. This will be true whether we use C_min to minimize SEM given L or to minimize L given SEM.

The Error Coefficient

What is the nature of this error coefficient C_f? The expression P_fi(1 - P_fi) is the information on b_f contained in a response to item i (Birnbaum, 1968, pp. 460-63). So its average over i is the average information on b_f per item in the test. Thus C_f is the inverse of this average test information. The greater the information obtained, the smaller C_f, and hence the smaller SEM_f and so the greater the precision.

What values can we expect of C_f? We can approach this question in two ways: in terms of the influence of reasonable values of (b_f - d_i) on P_fi or, if we are willing to focus attention on uniform tests, in terms of test width W and the boundary P's, P_f1 for i = 1, the easiest item, and P_fL for i = L, the hardest item.

Beginning with reasonable values of (b_f - d_i), we see that when b_f = d_i and their difference is zero, then P_fi = 1/2 and P_fi(1 - P_fi) = 1/4, but when b_f - d_i = -2, then P_fi = 1/8 and P_fi(1 - P_fi) = 1/9. Since an average value can never be greater than the maximum nor less than the minimum value, we can use these figures as bounds for C_f. When

-2 < b_f - d_i < 2,    1/8 < P_fi < 7/8

and

4 < C_f < 9    (9)

Turning to the bounds we can derive for C_f from test width W and the boundary probabilities P_f1 and P_fL of a uniform test, we can use an expression (Birnbaum, 1968, p. 466)

C_fW ≈ W / (P_f1 - P_fL)    (10)

where

W = the item difficulty width of a uniform test,
P_f1 = the probability of a correct response by b_f on the easiest item on the test, and
P_fL = the probability of a correct response by b_f on the hardest item on the test.

When b_f is contained within the difficulty boundaries of the test, that is d_1 < b_f < d_L, and W > 4, then 1/2 < P_f1 - P_fL < 1 and C_fW must fall between W and 2W, that is

W < C_fW < 2W    for W > 4, d_1 < b_f < d_L    (11)

It follows from considerations (9) and (11) that SEM = (C/L)^(1/2) is bounded by 2/√L < SEM < 3/√L for any test when -2 < (b_f - d_i) < 2, and by (W/L)^(1/2) < SEM < (2W/L)^(1/2) for uniform tests when W > 4 and d_1 < b_f < d_L.

SIMPLEST RULES

Best test design depends on relating the characteristics of test design T(H,W,L) to the characteristics of target G(M,S,D) so that the SEM is minimized in the region of the latent variable where the measurements are expected to take place. The relationship between test and target visible in Figure 1 makes the general principles of best test design obvious. To match test to target we aim the height of the test at the center of the target, spread the width of the test to cover the dispersion of the target, and lengthen the test until it provides the precision we require.

In simplest terms the rules for best test design can be specified as:

(i) Center T on G by H = M.

(ii) Spread T over G by W = 4S.

(iii) Lengthen T to reach SEM by L = C/SEM²,

where C is the error coefficient with values between 4 and 2W, producing a test for which the operating characteristics will be a least observable difference

LOD < 2W/L

and a standard error of measurement

SEM < (2W/L)^(1/2).
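These rules are mechanical enough to sketch in code. The helper below is an illustrative Python rendering, not part of the memorandum; the function name and the fallback choice C = max(4, 2W), taken from the bounds in (9) and (11), are ours:

```python
import math

def best_test_design(M, S, sem_wanted, C=None):
    """Rules (i)-(iii): center, spread, and lengthen a test T(H,W,L)
    for a target G(M,S)."""
    H = M                                  # (i)   center T on G
    W = 4.0 * S                            # (ii)  spread T over G
    if C is None:
        C = max(4.0, 2.0 * W)              # conservative bound from (9) and (11)
    L = math.ceil(C / sem_wanted ** 2)     # (iii) lengthen T to reach SEM
    return H, W, L

# A target centered at 0 with dispersion S = 1, measured to SEM = 0.5:
H, W, L = best_test_design(0.0, 1.0, 0.5)
```

With C at its conservative bound the length is generous; a shape-specific C from the sections that follow shortens the test.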


MATHEMATICS FOR ERROR MINIMIZATION

Even though we have been explicit about the nature of the target G and the test design T, the direct minimization of C is impeded because it is an implicit function of its unknowns. We can see this by deriving the asymptotic variance of the ability estimates and hence the C coefficients.

When the item parameters are considered known, the likelihood to be maximized for the estimation of ability parameter β_v is the product over items of subject v's probabilities of success on each item:

Λ = Π_{i=1}^{L} e^(x_vi(β_v - δ_i)) / [1 + e^(β_v - δ_i)] = e^(rβ_v) e^(-Σ_{i=1}^{L} x_vi δ_i) / Π_{i=1}^{L} [1 + e^(β_v - δ_i)]    (12)

The expression on the right demonstrates that the test score r is sufficient for estimating the parameter β_v, since the likelihood may be factored into two components, only one of which,

e^(rβ_v) / Π_{i=1}^{L} [1 + e^(β_v - δ_i)],

is dependent on β_v, and this component depends on the observations only through r (Birnbaum, 1968, pp. 425-29). Taking logarithms we have:

λ_v = log Λ_v = rβ_v - Σ_{i=1}^{L} x_vi δ_i - Σ_{i=1}^{L} log[1 + e^(β_v - δ_i)]    (13)

Differentiating this log likelihood with respect to β_v gives

∂λ_v/∂β_v = r - Σ_{i=1}^{L} e^(β_v - δ_i) / [1 + e^(β_v - δ_i)] = r - Σ_{i=1}^{L} P_vi    (14)

with second derivative

∂²λ_v/∂β_v² = - Σ_{i=1}^{L} e^(β_v - δ_i) / [1 + e^(β_v - δ_i)]² < 0    (15)

Since r, or f = r/L, is sufficient for estimating β_v, we may rewrite (14) and (15) in terms of the L-1 estimates b_f:

∂λ_f/∂b_f = Lf - Σ_{i=1}^{L} e^(b_f - δ_i) / [1 + e^(b_f - δ_i)] = Lf - Σ_{i=1}^{L} P_fi    (16)

and

∂²λ_f/∂b_f² = - Σ_{i=1}^{L} e^(b_f - δ_i) / [1 + e^(b_f - δ_i)]² = - Σ_{i=1}^{L} P_fi (1 - P_fi)    (17)

where P_fi = e^(b_f - δ_i) / [1 + e^(b_f - δ_i)].

When we set (16) equal to zero we obtain an implicit solution for b_f:

Lf = Σ_{i=1}^{L} P_fi    (18)

In maximum likelihood estimation the negative inverse of the second derivative is the asymptotic error variance of the estimate:

SEM_f² = 1 / Σ_{i=1}^{L} P_fi (1 - P_fi) = C_f / L    (19)

Hence

C_f = L (SEM_f)²    (20)


Iterative techniques (such as Newton-Raphson) may be employed to obtain solutions for b_f and C_f (Birnbaum, 1968, pp. 455-59). However, if we are willing to specify the distribution of {δ_i} and to approximate the discrete set of δ_i with a continuum, we can derive explicit solutions for both b_f and C_f which are excellent and useful approximations to the maximum likelihood estimates.
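The Newton-Raphson route can be sketched in a few lines. This illustrative Python function (the naming is ours) solves (18) for b_f, using (17) as the derivative, and reports C_f and SEM_f from (19):

```python
import math

def ml_ability(f, items, tol=1e-8):
    """Newton-Raphson solution of Lf = sum_i P_fi (equation 18).

    f     : relative score r/L, with 0 < f < 1
    items : calibrated item difficulties d_i
    """
    L = len(items)
    b = math.log(f / (1.0 - f))                      # starting value
    for _ in range(100):
        P = [math.exp(b - d) / (1.0 + math.exp(b - d)) for d in items]
        info = sum(p * (1.0 - p) for p in P)         # minus (17)
        step = (L * f - sum(P)) / info               # (16) divided by minus (17)
        b += step
        if abs(step) < tol:
            break
    C = L / info                                     # error coefficient, (20)
    return b, C, math.sqrt(C / L)                    # (19): SEM_f = (C_f/L)^(1/2)

b, C, sem = ml_ability(0.5, [-1.0, -0.5, 0.0, 0.5, 1.0])
```

For a central score on a test symmetric about zero the iteration converges immediately to b_f = 0.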

Uniform Items

A continuous uniform density function for δ with mean δ. and range W is given by

g(δ) = 1/W    for δ. - W/2 ≤ δ ≤ δ. + W/2
g(δ) = 0    otherwise    (21)

Our response model, written as a function of the variable δ, is

P_f(δ) = e^(b_f - δ) / [1 + e^(b_f - δ)]    (22)

By equating the relative score f with its expected value we obtain

f = ∫_{δ.-W/2}^{δ.+W/2} P_f(δ) g(δ) dδ = (1/W) ∫_{δ.-W/2}^{δ.+W/2} e^(b_f - δ) / [1 + e^(b_f - δ)] dδ    (23)

which is of the standard form ∫ [h'(x)/h(x)] dx, with solution

f = (1/W) log{ [1 + e^(b_f - δ. + W/2)] / [1 + e^(b_f - δ. - W/2)] }    (24)


Solving for e^(fW) leads to

e^(b_f - δ.) = e^(W(f - 1/2)) [1 - e^(-fW)] / [1 - e^(-(1-f)W)]    (25)

from which, taking logarithms, an explicit approximation for b_f results:

b_f = δ. + W(f - 1/2) + log{ [1 - e^(-fW)] / [1 - e^(-(1-f)W)] }    (26)

where

b_f = δ. + log[f/(1 - f)]    when W = 0

and

b_.5 = δ.    when f = .5.

Turning to a similar approximation for the error coefficient C_f, we approximate the average over the L discrete d_i by the integral over g(δ) to arrive at

(1/L) Σ_{i=1}^{L} P_fi (1 - P_fi) ≈ ∫_{δ.-W/2}^{δ.+W/2} P_f(δ) [1 - P_f(δ)] g(δ) dδ    (27)

This integral can be evaluated by noting that P_f(δ)[1 - P_f(δ)] is the logistic density function ψ(δ), whose integral is the logistic cumulative function Ψ(δ). Whence

C_fW = W / { Ψ(b_f - δ. + W/2) - Ψ(b_f - δ. - W/2) }    (28)

which is approximately the ratio of test width W to the difference between the probabilities of success on the easiest item at δ_1 and the hardest item at δ_L:

C_fW ≈ W / (P_f1 - P_fL)    (29)


This error coefficient may be expressed in terms of b_f as

C_bW = W [1 + e^(b_f - δ. + W/2)] [1 + e^(b_f - δ. - W/2)] / [e^(b_f - δ. + W/2) - e^(b_f - δ. - W/2)]    (30)

and, by replacing b_f by f and W as given by (26), as

C_fW = W (1 - e^(-W)) / { [1 - e^(-fW)] [1 - e^(-(1-f)W)] }    (31)

where C_f0 = 1/[f(1 - f)] when W = 0, and C_f0 = 4 when W = 0 and f = .5.
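Approximations (26) and (31) are explicit, so the ability estimate and error coefficient of a uniform test can be computed directly. A minimal Python sketch (the names are ours):

```python
import math

def uniform_b(f, delta_bar, W):
    """Ability estimate for relative score f on a uniform test, equation (26)."""
    if W == 0.0:
        return delta_bar + math.log(f / (1.0 - f))   # the W = 0 special case
    return (delta_bar + W * (f - 0.5)
            + math.log((1.0 - math.exp(-f * W))
                       / (1.0 - math.exp(-(1.0 - f) * W))))

def uniform_C(f, W):
    """Error coefficient of a uniform test of width W, equation (31)."""
    if W == 0.0:
        return 1.0 / (f * (1.0 - f))                 # the W = 0 special case
    return (W * (1.0 - math.exp(-W))
            / ((1.0 - math.exp(-f * W)) * (1.0 - math.exp(-(1.0 - f) * W))))

b = uniform_b(0.5, 0.0, 4.0)   # a central score on a centered test of width 4
C = uniform_C(0.5, 4.0)
```

Note that as W shrinks to zero the error coefficient falls back to 1/[f(1 - f)], the equivalent-items value.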

Our final approximation for uniform items is for the average value of C_bW over the latent variable when the target distribution of b is normal: D(b) ~ N[M, S²]. We proceed by taking the expected value of C_bW over the range of the normal distribution in the metric of b:

C̄_W = ∫_{-∞}^{∞} C_bW dD(b) = (1/(√(2π) S)) ∫_{-∞}^{∞} { W [1 + e^(b - δ. + W/2)] [1 + e^(b - δ. - W/2)] / [e^(b - δ. + W/2) - e^(b - δ. - W/2)] } e^(-(b - M)²/2S²) db    (32)

Although the algebra is tedious there is no obstruction to evaluating this integral:

C̄_W = [W / (e^(W/2) - e^(-W/2))] [ (e^(W/2) + e^(-W/2)) + e^(M - δ. + S²/2) + e^(δ. - M + S²/2) ]    (33)

where C̄_0 = 4 when W = 0, M = δ. and S = 0.

Normal Items

Our approximation for a test with normally distributed items is based on the similarity in shape of the normal and logistic distributions. If a scaling factor of 1.7 is used for the logistic, the difference between their cdf's is

|Φ(x) - Ψ(1.7x)| < .01    for all x    (34)

In terms of the corresponding pdf's, we obtain by differentiation,

|φ(x/1.7)/1.7 - ψ(x)| < .015    for all x.⁵    (35)

Our procedures for obtaining the approximations for normal items are identical to those for uniform items except that the above inequalities connecting the normal and logistic distributions are required to solve the integrals.

For our first approximation, with g(δ) ~ N[δ., Z²],

f = ∫_{-∞}^{∞} Ψ[b_f - δ] g(δ) dδ    (36)

⁵ Results (34) and (35) were verified numerically by comparing the respective functions for values of x ranging from -4.5 to 4.5. In the case of (34) the difference at x = 0 is zero and the maximum difference of .01 occurs at ±2.0. In the case of (35) the difference at x = 0 is .015, the maximum, whereas the minimum difference of .003 occurs at ±3.5.

and after replacing the cdf of the logistic by its normal approximation we find that

f = Φ[ (b_f - δ.) / (2.9 + Z²)^(1/2) ]    (37)

We use inequality (34) again to obtain

b_f = δ. + [1 + (Z/1.7)²]^(1/2) log[f/(1 - f)]    (38)

The expression under the radical acts to expand the initial estimates log[f/(1 - f)]. The larger Z, the greater this expansion. Setting Z = 0 and then f = .5 produce

b_f = δ. + log[f/(1 - f)]    and    b_.5 = δ.

as in (26).

Similarly for C_fZ, inequality (34) produces:

C_fZ = [2π(2.9 + Z²)]^(1/2) e^( {log[f/(1 - f)]}² / 5.8 )    (39)

or, in the metric of b_f,

C_bZ = [2π(2.9 + Z²)]^(1/2) e^( (b_f - δ.)² / [2(2.9 + Z²)] )    (40)

where C_b0 = 1.7 √(2π) = 4.26 rather than 4, when b_f = δ. and Z = 0.

The bias in this approximation is a result of the application of the inequalities (34) and (35). In terms of standard errors of measurement, SEM = (C/L)^(1/2), it is extremely small, since √4.26 = 2.06 ≈ 2.0.⁶

Finally, for an average C̄_Z over a normal target we find

C̄_Z = √(2π) (2.9 + Z²) / (2.9 + Z² - S²)^(1/2) · e^( (M - δ.)² / [2(2.9 + Z² - S²)] )    (41)

subject to the restriction that 2.9 + Z² - S² > 0. Setting M = δ., S = 0 and Z = 0 shows the same bias in C̄_Z as in (39).

We have investigated the accuracy of each of these six approximations (26, 30, 33, 38, 40, 41) for a wide variety of test designs. The uniform approximations are exceptionally accurate, resting as they do on the replacement of a discrete uniform by a continuous uniform distribution. The normal approximations are less accurate but produce ability estimates which are within .1 log odds units of the maximum likelihood estimates for a range of cases wider than those likely to be encountered in practice (Douglas, 1975).
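As a concrete check, the normal-item approximations (38) and (39) can be computed directly. This Python sketch (the names are ours) also reproduces the small bias noted above, about 4.26 rather than the exact 4 at b_f = δ. and Z = 0:

```python
import math

def normal_b(f, delta_bar, Z):
    """Ability estimate for normally distributed items, equation (38)."""
    return delta_bar + math.sqrt(1.0 + (Z / 1.7) ** 2) * math.log(f / (1.0 - f))

def normal_C(f, Z):
    """Error coefficient for normally distributed items, equation (39)."""
    logit = math.log(f / (1.0 - f))
    return math.sqrt(2.0 * math.pi * (2.9 + Z ** 2)) * math.exp(logit ** 2 / 5.8)

b = normal_b(0.5, 1.0, 2.0)    # a central score estimates the mean difficulty
C0 = normal_C(0.5, 0.0)        # about 4.26, the biased stand-in for the exact 4
```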

MINIMIZATION OF C_f AND C̄ FOR NORMAL AND UNIFORM ITEMS

We will deal with normal items first because those approximations for C_f and C̄ lend themselves to differentiation and thus enable us to minimize Z algebraically. To minimize C_fZ at the interval target boundary of M + 2S we place (b - δ.)² = 4S² in (40) and obtain

C_Z = [2π(2.9 + Z²)]^(1/2) e^( 2S² / (2.9 + Z²) )    (42)

⁶ The reason that the approximation for b_f in the special case is exact while the approximation for C_fZ is biased is that only the first inequality (34) is employed in obtaining b_f, whereas both are required for C_fZ. The difference in (34) at x = 0 is zero but the difference in (35) at zero is the maximum .015.

By setting the first derivative of C_Z with respect to Z equal to zero, and noting that the second derivative is negative when 4S² - 2.9 < 0, we see that

Z = 0 gives C_Z,min = √(2π) (1.7) e^(2S²/2.9) = 4.26 e^(2S²/2.9)    when S ≤ 1.7/2 = .85

and

Z² = 4S² - 2.9 gives C_Z,min = 2 √(2π) e^(1/2) S = 8.27 S    when S > 1.7/2 = .85.

For the average C̄_Z we find by a similar process that

Z = 0 gives C̄_Z,min = 1.7 √(2π) / [1 - (S/1.7)²]^(1/2) = 4.26 / [1 - (S/1.7)²]^(1/2)    when S ≤ 1.7/√2 = 1.2

and

Z² = 2S² - 2.9 gives C̄_Z,min = 2 √(2π) S = 5S    when S > 1.7/√2 = 1.2.

These explicit minima allow for easy evaluation of C and C̄ over various values of target standard deviation S (Douglas, 1975).
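These minima are easy to tabulate. A Python sketch (the names are ours) of C_Z,min and C̄_Z,min as functions of the target standard deviation S:

```python
import math

def CZ_min(S):
    """Minimum over Z of C_Z, equation (42), at an interval target boundary."""
    if S <= math.sqrt(2.9) / 2.0:                      # minimum sits at Z = 0
        return math.sqrt(2.0 * math.pi * 2.9) * math.exp(2.0 * S ** 2 / 2.9)
    # interior minimum at Z^2 = 4S^2 - 2.9
    return 2.0 * math.sqrt(2.0 * math.pi) * math.exp(0.5) * S

def CZbar_min(S):
    """Minimum over Z of the average C_Z, equation (41), normal target."""
    if S <= 1.7 / math.sqrt(2.0):                      # minimum sits at Z = 0
        return math.sqrt(2.0 * math.pi) * 2.9 / math.sqrt(2.9 - S ** 2)
    # interior minimum at Z^2 = 2S^2 - 2.9
    return 2.0 * math.sqrt(2.0 * math.pi) * S
```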

In the case of uniform items we were forced to determine the values of W which minimize C_fW and C̄_W numerically because we could not solve expressions (30) and (33) after differentiation. Therefore we computed C_fW and C̄_W for a wide variety of combinations of f and W and obtained the minimizing W's and their related C_min and C̄_min by inspection (Douglas, 1975).
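The numerical search can be reproduced with a coarse grid over W. The Python sketch below (the names are ours) evaluates the average error coefficient (33) for a test aimed at the target center and picks the minimizing width by inspection of the grid:

```python
import math

def Cbar_W(W, S, offset=0.0):
    """Average error coefficient over a normal target, equation (33).
    offset = M - delta_bar, the distance between target and test centers."""
    half = W / 2.0
    ratio = 1.0 if W == 0.0 else W / (math.exp(half) - math.exp(-half))
    return ratio * (math.exp(half) + math.exp(-half)
                    + math.exp(offset + S ** 2 / 2.0)
                    + math.exp(-offset + S ** 2 / 2.0))

def best_width(S):
    """Grid search for the width W minimizing the average error coefficient."""
    grid = [i * 0.1 for i in range(0, 121)]            # W from 0 to 12
    return min(grid, key=lambda W: Cbar_W(W, S))

W_opt = best_width(1.5)    # close to 4 for a normal target with S = 1.5
```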

Table 1 lists the values of the minimum error coefficients C_min and C̄_min for both uniform and normal items and the corresponding W in the case of uniform items. The uniform approximation is exact for all practical purposes. But the normal approximation contains a bias due to the interchange of logistic and normal pdf's during the development of the approximation. As a result the C_min and C̄_min values for the normal approximation are too large above the dotted lines.


[Table 1. Minimum error coefficients C_min and C̄_min for uniform and normal items over a range of targets, with the corresponding minimizing width W for uniform items; the body of the table is not recoverable in this copy.]

MEASUREMENT PROCEDURES

When we compare the normal and uniform tests on interval targets there is no doubt that the uniform test is more precise. This is especially true when S exceeds 1.

When we compare the normal and uniform tests on normal targets they differ so little as to seem equivalent in their measuring precision. Since uniform tests are better for interval targets and equally good for normal targets, there is no motivation for considering normal tests any further. Therefore we will focus the rest of our best test design strategy on uniform tests. Table 2 gives optimal uniform test widths for a variety of normal and interval targets.

For best test design on either interval or normal targets we select a set of equivalent items (where W = 0) or a set of uniform items with W as indicated by Table 2. For example, if the target is thought to be approximately normal with presumed standard deviation S = 1.5, the optimum test width W is 4. Note that for any value of S a smaller W is indicated when a normal target shape is expected.

Table 2 also shows the efficiency of a simple rule for relating test width W to target dispersion S. The rule W = 4S comes close to the optimal W for narrow interval targets and for wide normal targets. When we are vague about where our target is, we are also vague about its boundaries. That is just the situation where we would be willing to use a normal distribution as the shape of our target uncertainty. When our target is narrow, however, that is the time when we are rather sure of our target boundaries but, perhaps, not so willing to specify our expectations as to its precise distribution within these narrow boundaries. To the extent that interval shapes are natural for narrow targets while normal shapes are inevitable for wide targets, W = 4S is a useful simple rule.

The efficiency of this simple rule for normal and interval targets is given in the final columns of Table 2. There we see that it is hardly ever less than 90 per cent efficient; if we cross over from an interval target to a normal target as our expected target dispersion exceeds 1.4, then the efficiency is never less than 95 per cent.


[Table 2. Optimal uniform test width W for normal and interval targets over a range of target standard deviations S, with the per cent efficiency of the simple rule W = 4S; the body of the table is not recoverable in this copy.]


Our investigations have shown that given a target M, S and D there exists an optimal test design H and W from which may be generated a unique set of L uniformly distributed item parameters {δ_i}. However, this design is an idealization and cannot be perfected in practice. Real item pools are finite and each item difficulty is only an estimate of its corresponding parameter and hence inevitably subject to calibration error. No examiner will ever be able to select the exact items stipulated by his best test design {δ_i}. Instead he must attempt to select, from among the items he has available, a real set of {d_i} which come as close as possible to his ideal design.

Thus parallel to the design specification T(H,W,L) we must write the test description t(h,w,L) characterizing the actual test {d_i} which can be constructed in practice. This raises the problem of estimating h and w from the set of item difficulties {d_i}. We will take their observed mean d. as h. As for their width W, we investigated a number of alternatives: the range w_s = √12 S_d of a uniform distribution in which S_d² was the variance of the d_i, the observed "first" range w_1 = (d_L - d_1)[L/(L - 1)], and the observed "second" range w_2 = [(d_L + d_{L-1} - d_2 - d_1)/2][L/(L - 2)]. There were few differences among the results we obtained from these alternatives for w. Our preference for w_2 is based on the observation that it is a slightly more stable estimate of W than w_1 and even w_s for short tests.
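The height h and the three width estimates can be computed from any observed item set. An illustrative Python helper (the names are ours):

```python
import math

def describe_test(difficulties):
    """Estimate test height h and width w from observed item difficulties."""
    d = sorted(difficulties)
    L = len(d)
    h = sum(d) / L                                     # height: mean difficulty
    var = sum((x - h) ** 2 for x in d) / L             # variance S_d^2 of the d_i
    w_s = math.sqrt(12.0 * var)                        # range of a matched uniform
    w_1 = (d[-1] - d[0]) * L / (L - 1)                 # observed "first" range
    w_2 = ((d[-1] + d[-2] - d[1] - d[0]) / 2.0) * L / (L - 2)  # "second" range
    return h, w_s, w_1, w_2

# An exactly uniform test of width 4 recovers w_1 = w_2 = 4:
h, w_s, w_1, w_2 = describe_test([-1.6, -0.8, 0.0, 0.8, 1.6])
```

On an exactly uniform item set w_1 and w_2 recover W precisely, while w_s falls slightly short because the variance of L discrete points differs from that of the continuum.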

The approximations derived in order to minimize C may be used to set up two tables which allow an examiner to read off the ability estimate and its standard error for any relative score f which is observed on any test of width w and length L.


Table 3 gives us the position of the estimated measure b_f, relative to test height h, for any observed relative score f on a test of width w. Since this test t(h,w,L) will not in general be centered at zero, we must then add the test height h, estimated by d., to the tabled values in order to arrive at the final estimated measure b_f. If we identify the entries in Table 3 as x_fw, then the final ability estimate b_f is:

b_f = d. + x_fw    when f > .5
b_f = d. - x_(1-f)w    when f < .5.

For example, if the test characteristics are t(1,5,30), then a person who obtains a score of 12, or 40 per cent of the items correct, would have an ability estimate of

b_.4 = 1.0 - x_.6,5 = 1.0 - 0.6 = 0.4.

Table 4 gives us the square root of the error coefficient, √C_fw, in preparation for obtaining an estimate of the standard error of measurement.

We are now in a position to give explicit, objective and systematic rules for the design and use of a best possible test.

COMPLETE RULES FOR BEST TEST DESIGN AND MEASUREMENT

For design T(H,W,L) on target G(M,S,D):

1. From our hypothesis about M we derive H.

2. From our hypothesis about S we derive an optimum W, either by consulting Table 2 or by using the simple rule W = 4S.

3. From our requirements for measurement precision, namely the value of SEM we seek, we derive L = C/SEM², where C either comes from the value minimized by W given in Table 4 or is approximated by the simple rule C = 5S.


[Table 3. Entries x_fw, the distance of the ability estimate b_f from test height h, by relative score f and test width w; the body of the table is not recoverable in this copy.]


[Table 4. Entries √C_fw, the square root of the error coefficient, by relative score f and test width w; the body of the table is not recoverable in this copy.]


4. From these H, W and L we generate the design set {δ_i} according to the formula

δ_i = H + (W/2L)(2i - L - 1),    i = 1, L.

For test t(h,w,L) from design T(H,W,L):

5. Commencing with δ_1, we select items d_i from the item pool such that they best approximate the set {δ_i}, i.e., by minimizing the discrepancies e_i = d_i - δ_i.

6. We calculate h = d. = Σ_{i=1}^{L} d_i / L and w = [(d_L + d_{L-1} - d_2 - d_1)/2][L/(L - 2)].

7. We administer the set of {d_i} as the test t(h,w,L), obtain b_f for f and w as given in Table 3, and SEM_f = (C_fw/L)^(1/2) with C_fw for f and w as given in Table 4.
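Rules 4 and 5 can be sketched directly. The Python below (the names are ours) generates the uniformly spaced design set and picks the nearest available item from a pool:

```python
def design_items(H, W, L):
    """Rule 4: the uniformly spaced design difficulties delta_i."""
    return [H + (W / (2.0 * L)) * (2 * i - L - 1) for i in range(1, L + 1)]

def select_items(design, pool):
    """Rule 5: for each design value, the pool item minimizing |d_i - delta_i|."""
    return [min(pool, key=lambda d, t=t: abs(d - t)) for t in design]

deltas = design_items(0.0, 4.0, 5)                 # -1.6, -0.8, 0.0, 0.8, 1.6
chosen = select_items(deltas, [-2.0, -1.0, 0.1, 1.0, 2.0])
```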

BEST TEST PERFORMANCE

Confidence in the use of these techniques depends on a knowledge of their functioning over a variety of typical testing situations. We investigated their performance with a simulation study designed to check on two major threats to the successful functioning of this testing strategy.

(i) By simulating a person at a fixed point on the target (say β = 0) and subjecting him to a series of tests with centers increasingly removed from his ability level, we evaluated the effectiveness of this technique as tests are increasingly "off-center" with respect to their targets.

(ii) By introducing increasing magnitudes of random disturbance into the designed item calibrations {δ_i} we simulated and then evaluated the practical situation where the actual test realized, t(h,w,L), departs increasingly from its generating design T(H,W,L).


steps :The simulation was set up according to the following

1 . Set person ability at S = 0 .

2 . Consider values of test height H = 0,1,2, testwidth W = 2,5,8 and test length L = 10,30,50 .

3 . Determine the item spacing parameter V = W/Land set the disturbance standard deviation atQ = V/2, 2V so that the magnitude of discrepancyis keyed to multiples of the spacing parameter .

4 . Generate {d .} from H, W, and L and {d, = 6,+e,}from

ei 'v N[O,Q2 ] .5 . Calculate h and w from {d,} .16 . Use a random number generator and the response model

with

= 0 and

{d,} to obtain a raw score r forieach simulated person .

7 . Obtain values of b f and SEM from Tables 3 and 4 .

8 . Repeat steps (6) and (7) 100 times to simulate 100independent replications of the application of thistest to this person .

9. Summarize these 100 replications in terms of:

Mean ability estimate, or bias, since β = 0: BIAS
Standard deviation of ability estimates: SA
Standard error of bias: SE = SA/10
Significance of the bias: BIAS/SE
Mean standard error of measurement: ME
Accuracy of SEM as an estimator of SA: SA/ME
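The nine steps above can be sketched in Python. This is our illustration, not the authors' program: it replaces the table look-ups of step 7 with a direct Newton-Raphson maximum likelihood estimate, and the names `simulate_bias`, `ml_ability` and `noise_mult` are ours.

```python
import math
import random

def ml_ability(r, ds, iters=50):
    """Newton-Raphson ML ability estimate for raw score r on items ds."""
    b = math.log(r / (len(ds) - r))          # log odds of the raw score
    for _ in range(iters):
        ps = [1 / (1 + math.exp(d - b)) for d in ds]
        f = sum(ps) - r                      # expected minus observed score
        fp = sum(p * (1 - p) for p in ps)    # information (derivative)
        b -= f / fp
        b = max(-10.0, min(10.0, b))         # keep Newton steps bounded
    return b

def simulate_bias(beta=0.0, H=0.0, W=2.0, L=10, noise_mult=0.5, reps=100, seed=1):
    """Uniform design T(H, W, L), disturbed calibrations d_i = delta_i + e_i,
    Rasch responses, and the bias of the resulting ability estimates."""
    rng = random.Random(seed)
    V = W / L                                # item spacing parameter
    sigma = noise_mult * V                   # disturbance standard deviation
    deltas = [H - W / 2 + V * (i + 0.5) for i in range(L)]   # designed items
    ds = [d + rng.gauss(0.0, sigma) for d in deltas]         # realized items
    estimates = []
    for _ in range(reps):
        r = sum(rng.random() < 1 / (1 + math.exp(d - beta)) for d in ds)
        if 0 < r < L:                        # zero/perfect scores: no finite estimate
            estimates.append(ml_ability(r, ds))
    bias = sum(estimates) / len(estimates) - beta
    return bias, estimates
```

Discarding zero and perfect scores mirrors the truncation the text later notes for short off-target tests.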

From the results of this simulation study we have derived a set of general rules regarding the degree to which an examiner may depart in practice from a uniform spacing of item difficulties before these measurement procedures produce unacceptably disturbed ability estimates. These rules outline the area of possible test designs (combinations of H, W and L) within which the bias in measurement is limited to less than .1 log odds ability units. They allow the test designer to incur item discrepancies e_i = d_i - δ_i, that is, item calibration errors, as large as 1.0. This may appear to be unnecessarily generous, since it permits use of an item difficulty of 2.0, say, when the design calls for 1.0, but it is offered as an upper limit because we found a large area of the test design domain to be exceptionally robust with respect to independent item discrepancies.


ROBUST REGION OF TEST DESIGN DOMAIN

When no item deviates more than 1 log odds unit from its designed value and

(|H-β| < 2, W < 8, L > 10), then BIAS < .2 and BIAS/SEM < .4,

or

(|H-β| < 1, W < 8, L > 30), then BIAS < .1 and BIAS/SEM < .3.

When no item deviates more than 0.5 log odds unit from its designed value and

(|H-β| < 2, W < 8, L > 30), then BIAS < .1 and BIAS/SEM < .2,

or

(|H-β| < 1, W < 8, L > 10), then BIAS < .1 and BIAS/SEM < .2.

As test length increases above 30 items, virtually no reasonable testing situation risks a measurement bias large enough to notice. Tests in the neighborhood of 30 items or more, of almost any reasonable width (W < 8), which come within 1 log odds unit of their targets (|H-β| < 1), are for all practical purposes free from bias caused by random deviations in actual item calibrations of magnitude less than 1 log odds unit. Only when tests are as short as 10 items, wider than 2 log odds units and more than 2 log odds units off-target does the measurement bias caused by item calibration error exceed .2.

Table 5 lists summary statistics for some tests with L = 10. There we see that on the short, narrow test 2 log odds units above its target, the bias is .25 units with BIAS/SE = 4.2, indicating that abilities are overestimated. With only 10 items and a person with β = 0, 2 log odds units below the test height H = 2, the expected raw score is less than 2 items. In order to obtain finite ability estimates we are forced to discard zero raw scores, thus truncating our distribution of estimates. Hence our simulations in this instance produce estimates which are too high and standard errors which are too small, leading to an SA/ME ratio well below the expected value of 1.0.

On the short, wide, off-target test the bias of -.29 log odds units is due to the large discrepancy (maximum .8) between design and test. This discrepancy also threatens the short, wide test even when it is on target. There the


TABLE 5

The Bias in a Short Ten-Item Test Caused by Design Discrepancies

                         Narrow Test            Wide Test
                         w = 2                  w = 8
                         Max. Discrepancy       Max. Discrepancy
                         e = .2                 e = .8

On Target     BIAS:       -.03                    .18
H = β         SA:          .77                   1.09
              SE:          .08                    .11
              BIAS/SE:    -0.38                   1.64
              ME:          .70                    .95
              SA/ME:      1.09                    1.15

Off Target    BIAS:        .25                   -.29
H = β + 2     SA:          .58*                   .91
              SE:          .06                    .09
              BIAS/SE:     4.16                  -3.22
              ME:          .90                   1.01
              SA/ME:       .64                    .90

Note: Max. Discrepancy e = d - δ is the largest discrepancy between the designed item difficulty δ and the realized item difficulty d.
BIAS = mean of 100 ability estimates.
SA = standard deviation of 100 ability estimates.
SE = SA/10 = standard error of BIAS.
ME = mean SEM of 100 ability estimates.
* Caused by severe truncation of the ability estimate distribution due to many zero scores.


bias is .18 log odds units. But BIAS/SE is only 1.6, and a bias of .18 log odds units is only .18/.95 = .19 SEM units, so this short, wide test may function well enough when it is more or less on target, since its maximum bias is lost in its SEM.

The short, narrow, on-target test serves as a comparison for the above. Here the bias of -.03 is small, and the ratio of the ability estimate standard deviation SA to the average standard error of those estimates ME is 1.08, virtually one, so the measurement technique is working well.

EXAMPLES

Example 1

This client has very little knowledge about his target population other than the fact that the persons are, in his best guess, more or less centered on the item pool. Thus his target center is at M = 0, and we need to come to his aid in selecting his target boundaries. From the appropriate pool of items we select a small number of items whose difficulties range in value from about 2.0 to 4.0. We then ask the client to select the item (or items) which he believes will be too difficult for all but 5 or 10 per cent of his target. The calibrated difficulty of that item on the latent variable then gives us an approximate upper bound for his target. The lower bound can be obtained by symmetry or by repeating the same steps with easy items.

Let us assume that the suggested upper bound is 4.0 and that the client would like his SEM to be less than 0.5 log odds units. The following sequence of steps would be carried out.

1. M = 0 implies H = 0.

2. Since the upper bound is 2S = 4, the recommended W from Table 2 is 11, if the criterion is the minimization of standard errors at the boundary of the interval target (or 8, if the criterion is the minimization of SEM over a normally distributed target).

3. Using either Table 4 or the simple rule C = 8S provides an upper bound for C of 16. This implies, for an SEM of 0.5, that L < C/SEM² = 16/.25 = 64, or a length of about 60 items.

4. These 60 items are selected from the pool such that they best approximate a uniform distribution with its center at 0.0 and a range of 11. Let us assume that the three easiest items available are -4.0, -4.2 and -4.2, and that the three most difficult are 4.5, 4.3 and 4.2. These six values imply an actual test width of w = 8.4. Finally, let us assume that the actual mean of the 60 item difficulties is d̄ = 0.4 = h. Thus, while our design was T(0,11,60), our best available test turned out to be t(0.4,8.4,60).

5. This test is administered and we obtain the set of raw scores r. These are converted to relative scores f = r/L. If we approximate w = 8.4 by w = 8 we may use that column of Table 3 and h = 0.4 to estimate ability for any f. A person with 42 items correct has f = .7. The corresponding entry in column w = 8 of Table 3 is 1.7, and his corresponding ability estimate is b.7 = 0.4 + 1.7 = 2.1.

6. In order to determine the SEM associated with this ability estimate of 2.1 we go to column w = 8 of Table 4 and find that √C = 3.0, so that SEM = √C/√L = 3/√60 = 0.4.
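The arithmetic of Example 1 can be laid out as a short script. This is our sketch: the table look-ups are supplied as inputs, with the values 11, 1.7 and 3.0 being the Table 2, Table 3 and Table 4 entries quoted in the text.

```python
import math

# Design phase (steps 1-3): target center M = 0, upper bound 2S = 4.
S = 2.0                       # target standard deviation
W = 11                        # recommended width from Table 2 (interval criterion)
SEM_wanted = 0.5
C_upper = 8 * S               # simple rule C = 8S  ->  16
L = C_upper / SEM_wanted**2   # L < C / SEM^2 = 16/.25 = 64, about 60 items

# Scoring phase (steps 5-6) on the realized test t(0.4, 8.4, 60).
h, L_actual = 0.4, 60
f = 42 / L_actual             # relative score for 42 correct: .7
x_f = 1.7                     # Table 3 entry for f = .7 in column w = 8
b = h + x_f                   # ability estimate: 0.4 + 1.7 = 2.1
root_C = 3.0                  # Table 4 entry for column w = 8
SEM = root_C / math.sqrt(L_actual)   # 3/sqrt(60), about 0.4
```

The computed SEM of about 0.39 rounds to the 0.4 reported in step 6, comfortably under the client's 0.5 requirement.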

Example 2: Compound and Multiple Targets

We have talked about single targets, whether they represent an individual or a group, in order to come to grips with the conceptual and practical problem of best test design. In practice, however, we may be confronted with several targets spaced out along the latent variable on which we are measuring. Sometimes we want to design a single test which is best for such a series of targets. In that case the solution is straightforward and simple. We assume that the series of targets is in fact one large target and from the combined features of the series extract the specifications of that single compound target. The best test design is then the one which maximizes precision or minimizes test length for that single compound target.

However, when two or more targets are far enough apart on the latent variable, we will discover that the precision of measurement accessible through a single test designed for all targets combined is unsatisfactory. This will motivate us to see what we can do to increase precision. The only course of action which can accomplish this is to design several tests tailored more exactly to the individual components of the compound target, that is, to design a best test for each of several sub-targets. Of course, this can lead to subtests with no items in common. It might seem that this would place us in a situation where we could no longer compare the results of one


subtest on one sub-target with those from another. But if we remember that our items are calibrated on a common latent variable, and that the measures we extract from our various tests will also be on this latent variable, then we see that we can compare the positions of different sub-targets by estimating their location and dispersion on the common variable from measurements made with the individualized tests tailored to each of those targets.

If it should become important to estimate how well persons in one target might have done on a test suitable for another target, we can use the ability estimates of that target and the difficulties of the items on the other test, which they have not taken, to estimate the proportion of those untaken items which they would probably have gotten correct had they taken them.

Consider the typical research in which investigators study the differences between a control group A and an experimental group B after the latter has received an experimental treatment. Let us assume that there is evidence to suggest that the treatment has a dramatic effect, with the consequence that group B is expected to end up 3.5 units above group A. To be specific we will allocate the following working specifications for the two targets:

GA(-1.5, 1.5, D) and GB(2, 1, D).

Not only is GA expected to be at a different location on the latent variable, but its dispersion is expected to be greater.

Ideally we should construct two separate tests for these two distinct targets. That this would be the preferable strategy can be demonstrated by considering what happens when we compare a compound target strategy with a multiple target strategy. When the two targets are compounded we arrive at GAB(-.25, 2.125, D) by calculating the lower boundary -1.5 - 2(1.5) = -4.5 for GA and the upper boundary 2 + 2(1) = 4 for GB, then using their mean -.25 as MAB and one-fourth their difference 8.5/4 = 2.125 as SAB. An S of about 2 implies an optimum W of 11 for minimization of SEM at the boundaries. To avoid unnecessary complications in this example we will assume that the optimum test is always available (i.e., that W = w and H = h) and that L = 40. Under the multiple target strategy the two optimum tests would then be TA(-1.5, 8, 40) and TB(2, 5, 40), while the one compound test would be TAB(-.25, 11, 40).


In Table 6 we have listed the expected results for ten persons ranging in ability from -4.5 to 4.0. The first five may be considered to belong to GA and the second five to GB. For every one of these 10 persons, the SEM under the multiple strategy is smaller than under the compound strategy. We can convert these differences into the number of items which could be saved by using the multiple strategy. The middle section of Table 6 shows that if we used only 30 items for TA and 25 items for TB we would obtain approximately the same SEM's as achieved on the 40-item compound test. The right section of Table 6 shows the raw scores and SEM's which we would predict if the GA persons had taken the TB test and vice versa.

If we were interested in knowing the difference between the two groups in terms of the proportion of items from the pool which each group had correct, then the right section of Table 6 shows that if persons at the center of GB (e.g., person number 8) had taken TA, their expected score would have been r = 35, whereas those at the center of GA earned score r = 20. Thus we can estimate an expected 15-item advantage in score for persons at the center of group B when compared with persons at the center of group A.

SELF-TAILORED TESTING

When an examiner wishes to measure a person efficiently and objectively he needs not only insight into how the variable in which he is interested is manifested in responses to test items, but also an unambiguous and practical statistical technique for identifying which items are qualified to serve as a basis for measurement and for calibrating these qualified items on the variable so that responses to them can be used to estimate measurements.

The plausible and statistically sound Rasch response model for selecting and calibrating useful test items, and subsequently for measuring persons, leads not only to an operational definition of the variable but also to a coherent and practical system for optimal measurement and hence best test design. Within such a system self-tailored testing becomes an easily available by-product.

The person to be measured can be handed a booklet of test items more or less equally spaced in increasing difficulty from easiest to hardest and invited to choose any starting place in the booklet with which he feels comfortable. From that self-chosen starting point the examinee can work at his own will and speed in either direction, forward into harder items or backward into easier ones, until he reaches his own performance limits or runs out of time. Whatever the


TABLE 6

Expected Scores and SEM's for Ten Persons Under the Compound and Multiple Target Strategies

[table values not legibly recovered]


level and length of this self-chosen segment, all that are needed to obtain an objective, item-free person measure and its standard error are the serial numbers of the easiest and hardest items tried and the number of successes in between.

These three observations are sufficient to look up, in a simple series of tables, the person's estimated measure and the standard error of that estimate. That each examinee may work a different segment of items, and so perform on his own uniquely self-tailored test, does not interfere in any way with the objective comparison of examinees. All measures based on responses to calibrated items are on the single common scale defined by that calibration. Comparisons between persons are made on this common scale and are quite independent of which items were used to make the measurement or why. Of course, the suitability of test items for this kind of use must be planned for in their writing and carefully developed in their editing. The items must be examined statistically and their qualifications as instruments of measurement demonstrated.

But the ubiquitous use of unweighted test scores as sensible and useful measures depends, at bottom, on satisfying just these conditions. If items do not qualify for the self-tailored testing described here, then they do not qualify for inclusion in any unweighted test score.
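The three observations from a self-chosen segment determine the working test's height, width, length and relative score, which are what the measurement tables are entered with. A minimal sketch, assuming the booklet's calibrated difficulties are known in serial order (`segment_summary` is our name, not the memorandum's):

```python
def segment_summary(difficulties, easiest, hardest, successes):
    """Given a booklet of calibrated difficulties (in serial order) and the
    three observations from a self-chosen segment -- serial numbers of the
    easiest and hardest items tried, and the successes in between -- return
    (h, w, L, f): segment height, width, length and relative score."""
    seg = difficulties[easiest - 1:hardest]   # serial numbers are 1-based
    L = len(seg)                              # segment length
    h = sum(seg) / L                          # segment height (mean difficulty)
    w = seg[-1] - seg[0]                      # segment width
    f = successes / L                         # relative score
    return h, w, L, f
```

With h, w, L and f in hand, the ability estimate and SEM follow from the same tables used for any uniform test.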


BIBLIOGRAPHY

Andersen, E. B. The numerical solution of a set of conditional estimation equations. Journal of the Royal Statistical Society, Series B, 1972, 34, 42-54.

Andersen, E. B. A computer program for solving a set of conditional maximum likelihood equations arising in the Rasch model for questionnaires. Research Memorandum 72-6. Princeton, N.J.: Educational Testing Service, 1972.

Andersen, E. B. Conditional Inference and Models for Measuring. Copenhagen: Mentalhygiejnisk Forlag, 1973.

Birnbaum, A. Some latent trait models and their use in inferring an examinee's ability. In F. Lord and M. Novick (Eds.), Statistical Theories of Mental Test Scores. Reading, Mass.: Addison-Wesley, 1968.

Bock, R. D. Estimating item parameters and latent ability when responses are scored in two or more nominal categories. Psychometrika, 1972, 37, 29-51.

Douglas, G. A. Test design strategies for the Rasch psychometric model. Unpublished doctoral dissertation, Department of Education, University of Chicago, 1975.

Lord, F. M. Some test theory for tailored testing. In W. H. Holtzman (Ed.), Computer-Assisted Instruction. New York: Harper and Row, 1971.

Rasch, G. A mathematical theory of objectivity and its consequences for model construction. Report from the European Meeting on Statistics, Econometrics and Management Sciences, Amsterdam, 1968.

Samejima, F. Estimating latent ability using a pattern of graded scores. Psychometrika Monograph Supplement, No. 17. Richmond, Va.: The William Byrd Press, 1969.

Wright, B. D. Sample-free test calibration and person measurement. In Proceedings of the 1967 Invitational Conference on Testing Problems. Princeton, N.J.: Educational Testing Service, 1967.

Wright, B. D. and Panchapakesan, N. A procedure for sample-free item analysis. Educational and Psychological Measurement, 1969, 29, 23-37.

Wright, B. D. and Mead, R. J. CALFIT: Sample-free item calibration with a Rasch measurement model. Research Memorandum No. 18. Statistical Laboratory, Department of Education, University of Chicago, 1975.


Wright, B. D. and Douglas, G. A. Better procedures for sample-free item analysis. Research Memorandum No. 20. Statistical Laboratory, Department of Education, University of Chicago, 1975.


APPENDIX A

BRYTES: A CONVENIENT METRIC FOR MEASUREMENT

The units of measurement we have been using so far flow directly from the simple logistic response model. Since the estimated probability of a correct response is

    P_fi = exp(b_f - d_i) / [1 + exp(b_f - d_i)],

the odds for a correct response are

    P_fi / (1 - P_fi) = exp(b_f - d_i),

and the natural log odds for a correct response are

    log[P_fi / (1 - P_fi)] = b_f - d_i.

Thus we have been working in differences between person abilities and item difficulties expressed in natural log odds units.
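The probability and log-odds forms above can be written directly as functions (the names `p_correct` and `log_odds` are ours):

```python
import math

def p_correct(b, d):
    """Rasch probability of success for ability b and item difficulty d,
    both in natural log odds units."""
    return math.exp(b - d) / (1 + math.exp(b - d))

def log_odds(b, d):
    """Natural log odds of success: log(P/(1-P)) = b - d."""
    p = p_correct(b, d)
    return math.log(p / (1 - p))
```

Recovering b - d from the log odds is what makes differences of abilities and difficulties the natural working unit.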

But our origin and scale are quite arbitrary. We may consider whether there is a translation and scaling of these units which would make our measurement metric especially convenient.

The units must be centered somewhere. Initially we center them locally, either on our tests or on our targets. But in the development of a substantial pool of items marking an important latent variable we would want to establish a standard reference point with some useful meaning in the sphere of measurement. We might pick the easiest possible item (or the least conceivable ability) as the origin and calibrate all


harder items (and greater difficulties) with respect to that zero point. Or we might pick a standard group of items (or persons) and use their center as the origin. If we do not like negative numbers we can shift our whole scale upward so that all calibrations and measurements which can occur fall above zero.

Unless we can relate our measurements to some units with meaning outside the statistical characteristics of our measurement system, we may as well use the least observable difference in ability, LOD, and/or the least believable difference, SEM, as our smallest unit. Our aim would be to construct our scale so that it conveys increments in SEM and LOD clearly and conveniently. To do this we might try to arrange things so that the minimum SEM we expect to encounter is about 1 or a bit larger, while the corresponding minimum LOD is a bit less than 1 but larger than .1. If we did this, we could round our work to the nearest tenth and use the decimal point to remind us of the difference between LOD and SEM.

The derivation of LOD and SEM from our response model gives us, in natural log odds units,

    LOD = C/L, where 4 < C < 9 when -2 < (b - d) < 2,

and

    SEM = √LOD > LOD, as long as L > C.

In addition, we will often want to compare two more or less independent measurements with one another to see if they are significantly different. For that it will be useful to define a least significant difference, LSD, as two standard errors of the difference between similarly precise but independent measures,

    LSD = 2√2 SEM.

In order to discover a scaling factor which will separate SEM from LOD at the decimal point we will consider two boundary situations. First, we seldom measure with tests shorter than 20 items. If we consider the worst precision of such a "short" test, which is at its edge, say 2 log odds units off-center, then we will have a maximum error coefficient C of about 9 and a maximum LOD = 9/20 = .45. Second, we can seldom afford to administer more than 100 items, and even at the center of such a "long" test the error coefficient C cannot be less than 4. This produces a minimum LOD = 4/100 = .04.


We have combined these boundary values in the top half of Table 1A and calculated from them the related boundary values of SEM and LSD. We can see that if we want to use the decimal point to mark the difference between LOD's < 1 and SEM's > 1, then the scaling factor we need is in the neighborhood of 5.

We are working with relationships between log odds and probabilities of a correct response. It might be convenient to have equal increments in log odds attached to easily remembered probabilities. Even odds imply P = .5. Three to one implies P = .75. Nine to one implies P = .90. We see that the powers of 3 form a nice system. The probability of a correct response moves from .10 to .25 to .50 to .75 to .90 in equal increments of log 3. Since the scale of log 3 in log odds units is 1/log 3 = .91, we can obtain a scaling factor near 5 by multiplying .91 by 5 to produce the factor 4.55. This scaling factor sets the standard increments at five units each, as Table 2A shows.

Our notation for ability measures in natural log odds units has been b. Now we will use B = 4.55b as our notation for our scaled ability measures. Instead of referring to these two metrics as large and small "b," we will call them "log odds units" and "brytes."

The boundary values of LOD, SEM and LSD in brytes are given in the bottom half of Table 1A. There we can see that we would not expect an LOD below .2 brytes, so that one decimal place in our calibrations and measurements will be quite accurate enough. We also see that SEM ranges between 1 and 3 brytes, so that our least believable differences will run just greater than one bryte. Thus the decimal point, in the bryte scale, nicely separates SEM from LOD. Finally, we see that differences between independent measures must get up to around 4 or 5 brytes before we can begin to take them seriously.

The essential tables for best test design and use in brytes are given in Tables 3A through 5A.


TABLE 1A

Boundary Values of LOD, SEM and LSD in Log Odds Units and in Brytes

[table values not legibly recovered]


TABLE 2A

Relation Between Relative Brytes and the Probability of Success

    P_BD = exp[(B-D)/4.55] / {1 + exp[(B-D)/4.55]},

where one log odds unit = 4.55 brytes and 4.55 = 5/log 3.

    Person Ability Relative       Probability of a
    to Item Difficulty            Correct Response
          B - D                        P_BD

           -20                         .01
           -15                         .04
           -10                         .10
           - 5                         .25
             0                         .50
             5                         .75
            10                         .90
            15                         .96
            20                         .99


TABLE 3A

Values of W for Best Uniform Tests
With Normal and Interval Targets in Brytes

    Target         Normal Target     Interval Target     Simple Rule
    Std. Dev. S    W over N(M,S²)    W at (M ± 2S)       W = 4S

    2.0                 0                  0                  8
    2.5                 0                  0                 10
    3.0                 0                  5                 12
    3.5                 0                 10                 14
    4.0                 0                 15                 16
    4.5                 0                 20                 18
    5.0                 0                 25                 20
    5.5                 5                 30                 22
    6.0                10                 30                 24
    6.5                15                 35                 26
    7.0                20                 40                 28
    8.0                25                 45                 32
    9.0                35                 50                 36


TABLE 4A

[table values not legibly recovered]


TABLE 5A

[table values not legibly recovered]