From Classical Test Theory to Item Response Theory: An Introduction to a Desirable Transition – Nenty.

________________________________________________________________

33

FROM CLASSICAL TEST THEORY (CTT) TO ITEM RESPONSE THEORY (IRT): AN INTRODUCTION TO A DESIRABLE TRANSITION

H. Johnson Nenty

________________________________________________________

Introduction

In spite of Thorndike's (1918) contention that "whatever exists at all exists in some amount; to know it thoroughly involves knowing its quantity as well as quality," objective measurement has long been viewed as an exercise that can take place only in the physical realm, that is, in the measurement of physical characteristics, as if non-physical characteristics do not also exist. This is because, unlike physical concepts that can be felt, seen, heard, tasted, perceived through smelling, etc., most human traits are latent. Being latent, such characteristics cannot be measured by bringing about some form of physical or direct contact with the measurement device during the process of measurement. Latent traits act out as behaviours that can be observed. To measure these latent characteristics, therefore, we have to provoke them to act out and try to capture the intensity of their presence by challenging them with related graded tasks.

___________________________
Nenty, H. J. (2004). From classical test theory (CTT) to item response theory (IRT): An introduction to a desirable transition. In O. A. Afemikhe and J. G. Adewale (Eds.), Issues in educational measurement and evaluation in Nigeria (in honour of 'Wole Falayajo) (Chapter 33, pp. 371 – 384). Ibadan, Nigeria: Institute of Education, University of Ibadan.

Such tasks are called items; they must be such as would elicit, or provoke into action or observable behaviour, the exact latent trait under consideration. Items enable us to document the intensity of a provoked latent trait through measurement.

For measurement to be objective, the result of measuring the amount of a given trait, be it physical or psychological, possessed by a body must not depend on the specific measurement instrument used, nor should the meaning of that which results from the measurement depend on the group along with which the body was measured. Similarly, in psychological measurement, the amount of what is being measured inherent in an item should not depend on the ability of the group to which the item is administered, or on the set of items along with which it forms a measurement instrument (Wright, 1967; Nenty, 1985). In other words, every attempt at measuring the same trait should give the same result no matter the specific instrument used, the person doing the measurement, or the persons along with whom the individual is measured.

By its very nature, measurement is an objective process. If anything other than what is being measured influences the result of measurement, then that measurement is subjective, and cannot give a valid value of what is being measured. Measurement could be seen as the systematic process of locating the amount or quality of a given characteristic possessed by a body (an individual or an object) along a metered line that represents or defines such characteristic or variable. The point on the metered scale where the individual is located depends on the amount or type of the characteristic under measurement possessed by the body. For physical measurement, most of the metered lines or instruments are ready-made and available at an appropriate store. This is very rare in education, and hence educational practitioners have to construct almost all the instruments necessary for their everyday measurement in the classroom or for their research. So, unlike our friends in the physical sciences, a behavioural scientist or an educationalist has not only to learn how to use, but also how to construct, measurement instruments. In education, measurement therefore involves three stages: (1) the construction of a valid metered line represented by a measurement instrument; (2) the application of the instrument in the actual process of measurement; and (3) reading the scale, and analysing and interpreting the observed result validly. The construction of a precise or reliable and valid metered line is technically a very tasking job that calls for a lot of know-how.

When one hears the word 'measurement' one quickly associates it with the application of one kind of instrument or the other to gauge the quantity, rarely the quality, of something possessed by the body being measured. Hence, it is actually the quantity or quality of 'something' possessed by the body that is being measured, not the body. Therefore, to a varying degree, that which is under measurement exists in the body, or the body possesses that which is under measurement. So during the measurement process, the body brings that 'thing' along (Warm, 1978) so that its quantity or quality could be determined. If the thing to be measured is visible, touchable, 'tasteable', 'feelable', etc., and could be measured by having the measurement instrument make some form of contact with it, then the measurement is said to be physical. Most physical measurements are direct, but some are more direct than others. For example, while the measurement of height is more direct than the measurement of the speed of wind, the measurement of the ability to multiply decimal numbers, which is psychological, is completely indirect.

In the quantification of height, for example, each mark on a measurement stick, a metre rule, poses a challenge to the height under measurement. If the height under measurement has what is required to surpass a particular mark on the metre rule, it will overcome that mark. This contest between the height under measurement and the marks on the measurement stick goes on until a mark on the measurement stick is reached which the height under measurement just matches but cannot surpass; the height under measurement then takes on the magnitude of this mark.

According to Nenty (1998), because of its indirect nature, the measurement of behavioural characteristics is error prone. It is indirect, and hence basically inferential; that is, we use what we observe during the indirect measurement process to predict or estimate what we were looking for. Hence we need theories, functions, models or principles to provide us the guide and the basis for predicting or estimating what we were looking for from that which we observe. In educational measurement therefore, unlike in physical measurement, we need a theory of measurement to provide some guide and direction during our attempt to measure and, based on its results, estimate a given trait level, for example, an ability level possessed by a learner. Since the scores that result from our measurement efforts are, to some varying degree, not errorless in representing the trait levels of the individuals being measured, we have to be interested in something other than such scores in order for us to estimate these trait levels more validly (Lord & Novick, 1968). This means going beyond what we observe to predict the amount of that which we expected. We cannot do this without the guidance of an operationalizable theory or model (Nenty, 1998). There are currently two such theories with their accompanying models: the classical test theory (CTT), and the item response theory (IRT).

A Brief Summary of Classical Test Theory

The classical test theory holds that that which we expect during the measurement of any behavioural characteristic is the true score (X), while that which results from our measurement is the observed score (Xo), and that this observed score would have had the same value as the true score except for the error (Xe) inherent in our measurement. Hence the fundamental formula of CTT is:

Xo = X + Xe

This links the observable test score Xo to two unobservable latent variables, X and Xe. The error score (Xe) is that part of the observed score that has nothing to do with the ability or trait level under measurement; hence the two do not relate. It may increase, decrease or have no effect on the observed score, and hence makes it bigger than, or smaller than, the true score. It is therefore a random and normally distributed variable that is "a disturbance…due to a compose of a multiple…factors not controlled in the measurement process" (Lord & Novick, 1968). The ability of the examinee is reflected by the true score (X), which CTT defines as the expected value of the observed score; that is, the average of all the scores an examinee would make during infinitely many repeated and independent measurements of the same ability or trait level. During such averaging, the infinite number of error scores, being a random variable, cancels out and leaves the average of the observed scores equal to the true score. Hence, the expected value of the error score is zero, and as a consequence of this, the expected value of the observed score is the true score.
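The expectation argument can be checked with a short simulation. This is a sketch only: the normal distributions and the specific true score and error spread below are invented for illustration, not taken from the chapter.

```python
import random

random.seed(42)

N = 100_000         # simulated repeated, independent measurements
TRUE_SCORE = 50.0   # X: the (unobservable) true score
ERROR_SD = 4.0      # spread of the random error Xe, assumed mean 0

# Xo = X + Xe for each repeated measurement of the same examinee.
observed = [TRUE_SCORE + random.gauss(0.0, ERROR_SD) for _ in range(N)]

mean_error = sum(x - TRUE_SCORE for x in observed) / N
mean_observed = sum(observed) / N

# Because E(Xe) = 0, the average observed score converges to the true score.
print(round(mean_error, 2), round(mean_observed, 1))  # values near 0 and 50
```

With enough repetitions the average error washes out, which is exactly the sense in which CTT defines the true score as the expected value of the observed score.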

CTT makes several assumptions, some of which are implied in the above presentation. For example, it assumes that more than one version of the same test could be constructed. In other words, strictly parallel tests could be constructed, such that scores from the two tests have the same standard deviation and the same correlation with the true scores, and the variance on each test which is not explainable by the true score is due purely to random error (Lord, 1980; Nunnally, 1978). The error scores on the two parallel tests do not relate and hence have a correlation of zero; likewise, the error scores on one of the parallel tests and the true scores on the other test do not relate. We saw earlier that error and true scores have a correlation of zero; hence the variance of the observed score is equal to the sum of the variances of the true and of the error scores.


Any test is good to the extent to which the variability in, or the variance of, the true scores explains the variance of the observed scores from it. If it explains all of it, then the test is perfect: the observed and the true scores are the same, and the observed scores estimate the true scores perfectly. If it explains none, then the test is totally useless and the two variances are independent; the scores based on the test are all error scores and share nothing with, and hence cannot estimate, the true score. With one-shot measurement, as in everyday practice, the accuracy with which our observed score estimates the true score depends on the extent to which error is absent in our measurement. This brings in the important idea of reliability, which is a central issue in educational measurement. Fundamentally, reliability is a measure of the amount of observed score variance explained by true score variance, that is, the proportion of the true score variance to the observed score variance.
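The variance-ratio definition can be sketched numerically. In this toy simulation the true and error scores are generated directly (something real data never allow), and all distributions and numbers are invented for illustration:

```python
import random

random.seed(0)

N = 200_000
TRUE_SD, ERROR_SD = 10.0, 5.0   # hypothetical spreads of X and Xe

true_scores = [random.gauss(50.0, TRUE_SD) for _ in range(N)]
observed = [t + random.gauss(0.0, ERROR_SD) for t in true_scores]

def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

# Reliability: proportion of observed-score variance due to true scores.
reliability = variance(true_scores) / variance(observed)

# Population value here is 10**2 / (10**2 + 5**2) = 0.8.
print(round(reliability, 2))
```

The simulated ratio lands near the theoretical .8, showing how error variance dilutes the observed-score variance that the true scores can explain.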

While reliability is concerned with the error involved when a test is measuring what it is measuring, another important concern of CTT, the validity of a test, is concerned with the error involved in a test measuring what it was designed to measure. A test can be measuring what it is measuring very well without measuring what it was designed to measure. According to Nenty (1985a, 1998), reliability represents the degree of consistency with which a test repeats itself, or is stable across time, across its different (parallel or alternate) forms, across parts or portions of itself, or across its items; that is, the consistency with which a test measures what it is measuring. Hence reliability is an index intrinsic or internal to the test and indicates how well items in the test hang together to measure whatever the test is measuring. Since an infinite number of measurements, as implied in CTT, cannot be made, precision in measurement is generally enhanced by an increase in the number of measurements taken or, in a one-shot measurement as is often the case in practice, by an increase in the number of items used. Validity asks the question: if this is, or this is a semblance of, what the test was supposed to measure, how well does it measure it? This 'this', which is external to the test, is often called a criterion. Hence, unlike reliability, validity is an index extrinsic to the test.
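The classical tool for quantifying how precision grows with the number of items is the Spearman-Brown prophecy formula (a standard CTT result, not quoted in the chapter): lengthening a test by a factor k raises reliability r to kr / (1 + (k − 1)r).

```python
def spearman_brown(r: float, k: float) -> float:
    """Predicted reliability when the test length is multiplied by k."""
    return k * r / (1.0 + (k - 1.0) * r)

# Doubling a test whose reliability is .60:
print(round(spearman_brown(0.60, 2.0), 2))   # 0.75
```

The formula assumes the added items are parallel to the existing ones, which is exactly the strictly-parallel-tests assumption discussed above.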

Problems with CTT

We have for years been carrying out our measurement in education based on the CTT, so we know most practices based on it. Some of these measurement practices are problematic, for example, our general inability to construct good test items. Though this paper, for lack of space, has not gone into these practices, this section presents the theory-based problems associated with using the principles of CTT in our measurement. The most important problem of CTT is that its characterization of the examinee is test-dependent, and its characterization of the items or test is examinee-dependent. For example, with CTT, the difficulty of an item is not an inherent property of the item but is relative to the group on which the item is administered. For a 'dumb' group it may be, say, .32, while for a 'smart' group it might be, say, .70. It is as if the weight of a piece of stone, which is an inherent physical property of the piece of stone, changed with the ability of the person lifting it. Conversely, one's ability, which is an inherent property of the person, changes with the test used to measure it. For example, one's ability to multiply decimal numbers depends on which test of that ability he takes. One test can show him to be a high achiever given this ability, while another can show him to be a very low achiever given the same ability. That is to say, two tests designed to measure the same trait level for the same person each constitute significantly different scales for measuring the same trait.

Furthermore, with CTT, the meaning of the score one makes depends on the group along with which one was measured. A very 'smart' child among a 'dumb' group becomes a 'dumb' child in a 'smart' group. What a measure! For the life of our educational processes in Africa, we have unquestioningly accepted this, and have applied the results from measurement based on the CTT without much caution. Given any human trait, every individual has only one trait level (Ɵ); therefore a theory that allows for more than one, sometimes significantly different, estimates of this trait level is fundamentally flawed. The same can also be said for the trait level of any item. Another problem is that CTT assumes all items are equal; differences in difficulty, discrimination and vulnerability to guessing play no direct role in generating the raw score. Hence the ability or trait level of individual examinees is determined only by the quantity and not the quality of items. Performance on any set of tasks that may be given may not provide an objective estimate of an examinee's latent trait level, neither can we determine how effective each task is for measuring at each trait level (Lord, 1980).
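The sample dependence of the classical difficulty index can be illustrated with a toy computation. Assume, purely for illustration, that responses follow one fixed underlying response curve (a logistic curve with invented parameters); the CTT p-value computed for that single unchanged item still shifts with the ability of the group taking it:

```python
import math

def p_correct(theta, a=1.0, b=0.0):
    """Assumed underlying response curve; the item itself never changes."""
    return 1.0 / (1.0 + math.exp(-1.7 * a * (theta - b)))

def ctt_difficulty(group):
    """CTT item 'difficulty': expected proportion correct in the group."""
    return sum(p_correct(theta) for theta in group) / len(group)

weak_group = [-1.5, -1.0, -0.5]   # hypothetical low-ability trait levels
smart_group = [0.5, 1.0, 1.5]     # hypothetical high-ability trait levels

print(round(ctt_difficulty(weak_group), 2))   # low p-value for this group
print(round(ctt_difficulty(smart_group), 2))  # much higher p-value, same item
```

One item, one fixed curve, two very different 'difficulties': the CTT index describes the encounter between item and group, not the item itself.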

A Brief Summary of Item Response Theory

During the testing process, there is an encounter between an individual and an item. Item response theory is a set of models which, by relating the likelihood of a particular reaction by an individual with a given trait level to the characteristics of the item designed to elicit the level to which the individual possesses that trait, attempts to estimate the parameters involved, explain the process, and predict the results of such an encounter. Such an item (task, question, statement, etc.) may elicit the exhibition of the appropriate cognitive, affective or psychomotor trait or attribute. While there is only one parameter ascribed to the trait level of the individual, the task or item, depending on the model, is often characterised by at most three parameters. The individual trait level is often designated by theta (Ɵ), which represents the amount of ability, trait or attribute possessed by an individual. The three parameters associated with the item are: the item discrimination power (a); the difficulty parameter (b); and the guessing parameter (c). In a cognitive task, the a-parameter indicates the degree to which examinees' responses to an item vary with, or relate to, their trait level or ability. What is often termed the item difficulty, b, is in fact the amount of trait inherent in the item. Operationally, the b-parameter represents the cognitive resistance (Nenty, 2000) of the item or task, which is the amount of the trait under measurement just necessary to overcome the task or item, while the c-parameter represents the probability that a person who is completely lacking in the trait will overcome or answer the item correctly, or give a desirable reaction to the task. The c-parameter could thus be said to be the vulnerability-to-guessing index.

Basic Assumptions of IRT

The basic assumptions of item response theory follow from our naïve understanding of physical measurement. For objective and valid measurement, instruments must be designed for, and used to measure, one and only one characteristic or variable. This is the idea behind IRT's unidimensionality assumption. It implies that all items in a test or a measurement instrument must be developed to measure or call on one, and only one, trait or attribute. (This might be a composite of many related traits that maintains equal weightings across all individual responses; that is, if the response data are factor analysed, this composite will come out as a single factor.) A more technical assumption related to this is that of local independence.


The local independence assumption demands that performance across items in the same instrument should not be related except as a result of the influence of the trait level that they are designed to measure. In other words, if the influence of the common trait level is controlled for, then performance on pairs of items in the test should not be related in any way. Technically, this means that no systematic source of variation, other than the trait under measurement, should underlie responses to test items. These assumptions are basic to the theory's ability to enable the development of a trait line on which the trait levels of both the individuals and the items could be objectively mapped. Besides these two general assumptions, each IRT model has its own peculiar assumptions. Generally, the models differ in the number of parameters they ascribe to the item.
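A toy calculation makes local independence concrete. Two hypothetical items share one invented response curve; conditional on Ɵ the joint probability of success is just the product of the item probabilities, yet marginally, mixing over a two-point ability distribution, the items look associated, because Ɵ is their only link:

```python
import math

def p(theta):
    """Hypothetical response curve shared by two items (invented)."""
    return 1.0 / (1.0 + math.exp(-theta))

thetas = [-1.0, 1.0]   # two equally likely ability levels

# Local independence: P(both correct | theta) = p(theta) * p(theta).
# Marginally, mixing over theta, the shared trait induces association:
marginal_p = sum(p(t) for t in thetas) / 2             # P(one item correct)
marginal_joint = sum(p(t) * p(t) for t in thetas) / 2  # P(both correct)

print(marginal_joint > marginal_p ** 2)   # True: association via theta alone
```

This is why the assumption is stated conditionally: items may well correlate in the whole sample, but once the common trait level is held fixed, nothing else should tie them together.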

Briefs on IRT Models

The 3-parameter model

This model, as the name implies, ascribes all three parameters mentioned earlier to the item. It assumes that the three parameters are necessary for the estimation of a valid relationship between the probability of a correct response to an item and the trait level (ability) of an individual. Hence the model is often represented by the logistic function:

Pi(Ɵ) = ci + (1 – ci)/[1 + e^(–1.7ai(Ɵ – bi))]

This gives the probability of an individual with ability Ɵ responding desirably or correctly to an item i with a difficulty of bi, a discrimination index of ai, and a vulnerability-to-guessing index of ci. The letter e that appears in the formula is the base of the natural logarithm and is approximately equal to 2.718, while the constant 1.7 that appears in the formula is a scaling factor, applied to scale the logistic function [L(0, 1.6679)] to approximate the normal function [N(0, 1.0)] as closely as possible (see Nenty, 1998).
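The formula translates directly into code; the parameter values in the example call are invented for illustration only.

```python
import math

def p_3pl(theta, a, b, c):
    """3PL: probability of a correct response, with the 1.7 scaling factor."""
    return c + (1.0 - c) / (1.0 + math.exp(-1.7 * a * (theta - b)))

# When the trait level equals the item difficulty (theta == b), the exponent
# is zero, so P = c + (1 - c)/2: halfway between the guessing floor c and 1.
print(round(p_3pl(theta=0.0, a=1.2, b=0.0, c=0.2), 2))   # 0.6
```

Note that under guessing (c > 0) the probability at Ɵ = b is above one half, which is why b is interpreted relative to the guessing floor rather than as the point of 50% success.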

The 2-parameter Model

This model assumes a zero vulnerability-to-guessing parameter for items; hence the parameter ci is not necessary for the estimation of a valid relationship between the probability of a correct response to an item and the trait level (ability) of an individual. In that case, the logistic function is:

Pi(Ɵ) = 1/[1 + e^(–1.7ai(Ɵ – bi))]

This gives the probability of an individual with ability Ɵ responding correctly to an item i with a difficulty of bi and a discrimination index of ai. This implies that, to use this model, items must be constructed to ensure that they are not significantly vulnerable to guessing.

The 1-Parameter or Rasch Model

The 1-parameter model, more popularly called the Rasch model, ascribes to the item only the difficulty (bi) parameter, or the trait level required for one to be more likely to answer the question correctly than to answer it wrongly (Palmieri, n.d.). In terms of assumptions, therefore, this is the most stringent of the three IRT models. It places more demands on the test construction effort than the other two models. Items that fit the model must differ only in difficulty; they must not be significantly different in item discrimination, and must have a vulnerability-to-guessing index that is not significantly different from zero. Hence the parameters ai and ci are not necessary for the estimation of a valid relationship between the probability of a correct response to an item and the trait level (ability) of an individual.

Pi(Ɵ) = 1/[1 + e^(–(Ɵ – bi))]

This gives the probability of an individual with ability Ɵ responding correctly to an item i with a difficulty index of bi. This implies that, to use this model, items must be constructed to ensure that they do not differ significantly in their discrimination and are not significantly vulnerable to guessing.
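The three models nest: setting c = 0 in the 3PL gives the 2PL, and further fixing the discrimination at one common value gives the Rasch form (in the Rasch formula above the 1.7 scaling constant is omitted, so the common discrimination absorbs it). A quick check with invented values:

```python
import math

def p_3pl(theta, a, b, c):
    return c + (1.0 - c) / (1.0 + math.exp(-1.7 * a * (theta - b)))

def p_2pl(theta, a, b):
    return 1.0 / (1.0 + math.exp(-1.7 * a * (theta - b)))

def p_rasch(theta, b):
    return 1.0 / (1.0 + math.exp(-(theta - b)))

theta, b = 0.8, 0.2

# 2PL is the 3PL with the guessing floor removed (c = 0):
assert abs(p_3pl(theta, 1.1, b, 0.0) - p_2pl(theta, 1.1, b)) < 1e-12

# Rasch is the 2PL with the discrimination fixed for all items
# (a = 1/1.7 here, because this chapter's Rasch formula has no 1.7 factor):
assert abs(p_rasch(theta, b) - p_2pl(theta, 1.0 / 1.7, b)) < 1e-9

print("1PL and 2PL are special cases of the 3PL")
```

This nesting is what underlies the statement that the Rasch model is the most stringent: each simpler model is the richer one with extra constraints imposed on the items.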

Besides the three fundamental models presented here, a variety of IRT models have been developed for item responses that are discrete or continuous and dichotomously or polytomously scored. In each case, the probability of responding correctly or choosing a particular response option can be represented graphically by an item response function (IRF) or option response function (ORF). These functions represent the non-linear regression of a response probability on a latent trait, such as conscientiousness or verbal ability (Hulin, Drasgow, & Parsons, 1983).

A Comparison between IRT and CTT and the Application of IRT


As the names imply, CTT is test centred while IRT is item centred in their fundamental estimation of person ability. For example, two examinees who are assigned the same total number of items correct by CTT on the same test may not be assigned the same ability estimate by IRT. Except for familiarity and ease of operation, CTT has no advantage over IRT. The lack of ease of operation stems from the fact that the assumptions underlying the use of IRT models are more stringent than those required of CTT. IRT models also tend to be mathematically more complex and their outputs more difficult to understand. It is always held as a disadvantage that IRT models require large sample sizes to obtain accurate and stable parameter estimates, but the Rasch or 1-parameter model can do with a moderate sample size, and, by the way, psychometric theory is generally a large-sample theory (Nunnally, 1978). Parameter estimates become more stable as the numbers of testees and items increase.

On the other hand, IRT, compared to CTT, offers several distinct benefits to the science of educational measurement. IRT provides several advantages over CTT for constructing and assessing measurement instruments, for estimating valid and invariant person and item parameters, and in the flexibility of using measurement instruments for solving several measurement problems. The most important advantage of IRT is its ability to estimate the parameters of an item independent of the characteristics both of the examinees to which the item is exposed and of the other items in the test, as well as to estimate the ability or trait level of an individual independent of the characteristics of the particular set of items to which the individual is exposed and of the other examinees along with whom the test was taken. Hence the scientifically desirable person-free item calibration and item-free person measurement are achievable with IRT (see Nenty, 1985b). With CTT, every test, even of the same ability or achievement, constitutes a different measurement scale, but with IRT the thrust is to develop fair and equitable scale- and sample-free metrics for objective measurement (Fisher, 2001). IRT enables the construction of a measurement scale or a metered trait line on which both item and person trait levels could be objectively mapped, and along it a point at which maximum information about an individual's ability or trait level could be located. Around such a point, more suitable items could be selected to extract maximum information about, and hence derive a more precise measurement of, such ability or trait level. Given this setting, a set of items could be developed, or picked from a bank of calibrated items, to accurately target or match an individual's ability or trait level. Hence we can develop tests with targeted characteristics, for example a test that can target an individual's ability with great precision. This goes a long way to enhance the solution of many measurement problems through its application in test equating, tailored testing, adaptive testing, flexilevel testing, testing for mastery, differential item functioning analysis, an objective basis for setting cut-off scores, etc. (Warm, 1978; Lord, 1980; Hambleton & Jones, 1993; Nenty, 1998). For the items too, the ability to map several items with varying parameter values or trait levels on the same scale or metered trait line enables the construction or selection of items that meet head-on specific instructional objectives, given the type of cognitive, affective or psychomotor behaviours that the teacher or the curriculum intends to develop among learners. It also engenders the construction of parallel test forms as well as of tests with which changes or growth in cognitive, affective and psychomotor behaviour across time could be more validly established.

Instead of CTT's reliability, which is sample-dependent, IRT plots an item information function (IIF) to describe how well, or how precisely, an item measures at each trait level being measured by a given test, and determines the amount of useful information the item adds to the test. According to Hulin, Drasgow and Parsons (1983), the invariant nature of IRT item and person parameters makes it possible to examine the contribution of each item as it is added to or removed from a test to meet pre-determined needs and specifications. The 'sum' of the information functions for all the items in a test gives the test information function (TIF), and based on this, conditional standard errors of measurement can be calculated, rather than one standard error of measurement for the entire test score.

Along the same metered line, an observation of differences in performance on an item by individuals with the same trait level but from different groups could be detected, and this signals differential functioning for that item. This might be observed for members of different sex, race, ethnic, location and handicapped groups, and it calls for the revision or removal of such items from the test. Hence, IRT can distinguish item bias from true differences in trait level, whereas CTT cannot (Kim, Cohen & Park, 1995).
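The IIF/TIF machinery has a simple closed form in the 2PL case: Ii(Ɵ) = (1.7ai)²Pi(Ɵ)(1 − Pi(Ɵ)), a standard result not derived in the chapter. The TIF is the sum over items, and the conditional standard error of measurement at Ɵ is 1/√TIF. The item parameters below are invented for illustration:

```python
import math

def p_2pl(theta, a, b):
    return 1.0 / (1.0 + math.exp(-1.7 * a * (theta - b)))

def item_information(theta, a, b):
    """2PL item information: (1.7a)^2 * P * (1 - P)."""
    p = p_2pl(theta, a, b)
    return (1.7 * a) ** 2 * p * (1.0 - p)

items = [(1.0, -1.0), (1.2, 0.0), (0.8, 1.0)]   # invented (a, b) pairs

def total_information(theta):
    """TIF: the sum of the item information functions."""
    return sum(item_information(theta, a, b) for a, b in items)

theta = 0.0
tif = total_information(theta)
sem = 1.0 / math.sqrt(tif)   # conditional standard error of measurement at theta

print(round(tif, 2), round(sem, 2))
```

Because information peaks where Ɵ is near an item's b, the TIF, and hence the precision of measurement, varies along the trait line, which is exactly why IRT can report a standard error conditional on trait level instead of one figure for the whole test.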

Conclusions

According to Nenty (1998), though the CTT has sustained educational measurement for almost a century now, all along it has been measurement with an elastic ruler that shrinks and stretches under pressure from known and unknown extraneous forces, and hence produces results that are at best meaningful only in extremely limited circumstances. But we have assigned these results invalid meanings and used them to sustain wide-reaching, important and vital decisions on the future of learners and others. IRT has arrived in time to take us into the next century of educational measurement with bright hopes and an exciting future. If we can at least keep an open mind and give it a little trial, we would be contributing towards the realisation of the excitement of a scientific measurement process in education. With the promises of parameter invariance, a common trait scale for measurement, and others, and their several ramifications, we are well on the way to achieving objectivity in educational measurement as it is in physical measurement, which, as Wright (1967) rightly put it, "is a matter of life and death to the science of mental measurement." According to Hays, Morales and Reise (2000), IRT has a number of potential advantages over CTT in assessing learning, in developing better measures and in assessing change over time. Its models yield invariant item and latent trait estimates (within a linear transformation), standard errors conditional on trait level, and trait estimates anchored to item content. It also facilitates the evaluation of differential item functioning, the inclusion of items with different response formats in the same scale, and the assessment of person fit, and is ideally suited for implementing computer adaptive testing. In the West, most examination bodies have realised and utilised the strength of IRT to produce valid and defensible examination results and for the solution of several hitherto seemingly insurmountable testing problems.

References

Fisher, W. P. (2001). Invariant thinking vs. invariant instrument. Rasch Measurement Transactions, 14(4), 771-774.

Hambleton, R. K., & Jones, R. W. (1993). Comparison of classical test theory and item response theory and their applications to test development. An NCME Instructional Module 16, Fall 1993. Retrieved on 08/07/04 from http://www.ncme.org/pubs/items/24.pdf


Hays, R. D., Morales, L. S., & Reise, S. P. (2000). Item response theory and health outcomes measurement in the 21st century. Medical Care, 38(9 Suppl), 1128-1142.

Hulin, C. L., Drasgow, F., & Parsons, C. K. (1983). Item response theory. Homewood, IL: Dow Jones-Irwin.

Kim, S.-H., Cohen, A. S., & Park, T.-H. (1995). Detection of differential item functioning in multiple groups. Journal of Educational Measurement, 32, 261-276.

Lord, F. M. (1980). Applications of item response theory to practical testing problems. Hillsdale, NJ: Lawrence Erlbaum Associates.

Lord, F. M., & Novick, M. R. (1968). Statistical theories of mental test scores. Reading, MA: Addison-Wesley.

Nenty, H. J. (1985a). Fundamentals of measurement and evaluation in education. Unpublished mimeograph, Faculty of Education, University of Calabar, Nigeria.

Nenty, H. J. (1985b). An empirical verification of the Rasch model. The West African Journal of Educational and Vocational Measurement, 6, 39-48.

Nenty, H. J. (1998). Introduction to item response theory. Global Journal of Pure and Applied Sciences, 4(1), 93-100. [http://www.inasp.org.uk/ajol/journals/gipas/index.html]

Nenty, H. J. (2000). Some factors that influence students' pattern of responses to mathematics examination items. BOLESWA Educational Research Journal, 17, 47-58.

Nunnally, J. C. (1978). Psychometric theory (2nd ed.). New York: McGraw-Hill.

Palmieri, P. A. (n.d.). Item response theory method and application gaining support as assessment instrument. University of Illinois at Urbana-Champaign. Retrieved on 05/07/04 from http://www.istss.org/publications/TS/Summer02/Item_Response.htm

Thorndike, E. L. (1918). The nature, purposes and general methods of measurements of educational products. In The measurement of educational products (The Seventeenth Yearbook of the National Society for the Study of Education, Part II). Bloomington, IL: Public School Publishing.

Warm, T. A. (1978). A primer of item response theory. Oklahoma City, OK: U.S. Coast Guard Institute, National Technical Information Service, Department of Commerce.

Wright, B. D. (1967). Sample-free test calibration and person measurement. In Proceedings of the 1967 Invitational Conference on Testing Problems. Princeton, NJ: Educational Testing Service.