
A Summary of Methods of Item Analysis


Bluepaper Number 2

2002

by Mhairi McAlpine

Robert Clark Centre for Technological Education,

University of Glasgow

edited by CAA Centre

ISBN 1-904020-02-X


CAA Centre

THE CAA CENTRE TLTP PROJECT

A Summary of Methods of Item Analysis is published by members of the Implementation and Evaluation of Computer-assisted Assessment consortium, a project funded by the HEFCE and DENI under phase three of the Teaching and Learning Technology Programme (TLTP). The project was led by the University of Luton and includes Glasgow, Loughborough and Oxford Brookes Universities.

Copyright of A Summary of Methods of Item Analysis rests with the CAA Centre. However, members of UK higher education institutions can copy parts, in print only, of A Summary of Methods of Item Analysis for training and staff development purposes with permission and as long as appropriate acknowledgements of authorship are included with all reproduced materials. The storage or distribution of all or part of A Summary of Methods of Item Analysis in electronic format is prohibited. Anyone wishing to use all or part of A Summary of Methods of Item Analysis for training and staff development, in UK higher education or elsewhere, should contact the CAA Centre.

ISBN 1-904020-02-X

TABLE OF CONTENTS

INTRODUCTION
Classical Test Theory (CTT)
Item Response Theory (IRT)
Rasch Measurement

Classical Test Theory
Item Facility
Item Discrimination

OTHER INDICATORS OF QUESTION PERFORMANCE
Standard Deviation
Reliability (internal consistency)
Standard Error of Measurement
Implications of Question Choice
Choice Index
Mean Ability Index
Modified Facility Index
Discrimination
Problems with Classical Test Theory
Latent trait models

ITEM RESPONSE THEORY
Types of model
Estimation of Parameters
Model Data Fit
Allocation of Ability to Candidates

CHARACTERISTICS OF ITEMS AND TESTS
Standard Error of Estimation

Rasch Measurement
Allocation of Ability to Candidates
The Rasch Model
Advantages of the Rasch Model
Parameter Estimation
Model data fit
Criticisms of the Rasch Model

CONCLUSION

REFERENCES

APPENDIX


A SUMMARY OF METHODS OF ITEM ANALYSIS

INTRODUCTION

Examinations fulfil a variety of functions (Butterfield, 1995), including the measurement of attainment, accountability (of institutions, staff initiatives etc.), curricular definition and student motivation. Depending on the circumstances of the examination, some of these purposes are more important than others. When attempting to gauge the quality of the examination, the uses to which the results will be put must be borne in mind.

Item Analysis is a method of gauging the quality of an examination by looking at its constituent parts (items). It seeks to give some idea of how well the examination has performed relative to its purposes. The primary purpose of most examinations in higher education is that of a measurement tool, for assessing the achievements of the examination candidates and thus how future learning can be supported and directed. This paper details the methods of item analysis for this purpose, and also considers how they might be used for the wider functions given above. It is important for academic staff to have an understanding of item analysis (its methods, assumptions, uses and limitations) in order that examinations can be assessed and improved.

There are three major types of item analysis: Classical Test Theory, Item Response Theory and Rasch Measurement. Item Response Theory and Rasch Measurement are both forms of Latent Trait Theory.

Classical Test Theory (CTT)

CTT is the most widely used in Britain. The basic underlying assumptions come from psychology and were developed around the turn of the 20th century. These have, however, been much refined since, particularly to take into account aspects of modern educational testing, such as optional questions.

Item Response Theory (IRT)

IRT was originally developed in the 1940s in Scotland. It was further developed in the US in the late 1960s/early 1970s and is widely used in American testing organisations. Its most notable advocate is the Educational Testing Service (ETS).

Rasch Measurement (Rasch)

Rasch was developed in the late 1960s in Denmark specifically for educational tests, although it is also used to analyse psychological tests. The model gained wide popularity in the US and in Britain in the 1970s, although it is now much less popular there than the other forms of item analysis; it is still used extensively in Australia.


British public examinations tend to have more extended questions, worth varying amounts of marks, than those of the US, where single mark multiple choice testing is the norm. In British higher education, single mark multiple choice questions, together with extended mark essays, feature heavily, although there is a gradual move toward more complex marking systems. As Classical Test Theory is mainly used in Britain whereas Item Response Theory is more heavily used in the US, the theory of analysing multiple mark questions is more developed in Classical Test Theory than in Item Response Theory. A model (called the Graded Response model) has, however, been developed to allow multiple mark data to be analysed. The Rasch Model was also initially developed for the analysis of single mark questions; however, a “partial credit” model has been developed.

Classical Test Theory

Classical test theory concentrates on two main statistics: item facility and item discrimination. Item facility is calculated by dividing the mean mark by the maximum mark; item discrimination is given by the correlation of the item with another indicator of performance, usually the total mark on the test. Other indicators of question performance are the standard deviation and the test reliability.

Where choice is allowed, although the main statistics to be considered are still item facility and discrimination, it may be desirable to modify the calculations of these so that meaningful comparisons can be drawn. The question facility index may be modified to reflect the population attempting the question, while the discrimination index may more usefully be calculated by excluding the item under consideration from the test total. (Note that this can also give a less biased estimate of differentiation for compulsory questions, although it is more important where there is no choice.) Additional question performance indicators for optional questions are the questions' popularities and the abilities of the groups of candidates attempting each of the questions.

Item Facility

This is essentially a measure of the difficulty of an item, with a high facility indicating an easy item and a low facility indicating a difficult item. It is given by the formula

$$Fac(X) = \frac{\bar{X}}{X_{max}}$$

where Fac(X) = the facility value of question X
X̄ = the mean mark obtained by all candidates attempting question X
X_max = the maximum mark available on the question

On questions that carry a high proportion of the marks of the test (e.g. a 25 mark essay question on a paper worth 40 marks), it is desirable for the facility value to be close to 0.5, to promote maximal differentiation.
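As an illustration of the calculation, the short Python sketch below computes a facility value from a list of candidates' marks on a single question. The function name and sample data are invented for illustration and are not part of the original paper.

```python
def facility(marks, max_mark):
    """Facility value: mean mark obtained divided by the maximum mark available."""
    return sum(marks) / len(marks) / max_mark

# Illustrative data: ten candidates' marks on a question worth 5 marks.
marks = [2, 4, 2, 4, 0, 3, 3, 5, 1, 1]
print(facility(marks, 5))  # 0.5, i.e. a question of middling difficulty
```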


Where an individual question is worth a lower proportion of the test marks (e.g. a two mark question on the above paper), it is quite acceptable for it to have a higher or lower facility value, although a high facility value on one question should be balanced by a lower facility value on another, ensuring that the paper as a whole has a mean mark of around 50%.

Where a test is comprised of many questions, each worth a low proportion of the total marks available overall, it is desirable to have questions which vary in their difficulty, so that candidates at all points of the ability stratum may be fully tested. It is, however, undesirable for questions to have facility values above 0.85 or below 0.15. Although on occasion this can be justified for reasons of curricular coverage or criterion referencing, at this level they are contributing little to overall measurement. The closer the questions come to having a facility value of 0.5, the more they are contributing to the measurement of the candidates.

For example, imagine 50 candidates taking a 40 mark multiple choice test where the questions are arranged in difficulty order. If the facility value of all of the items was 0.5, you might expect the 25 strongest candidates to get 40 while the 25 weakest candidates get 0 (assuming high discrimination). Where there is a range of facility values across the items, you are more likely to get a range of marks, as candidates fail to achieve on questions which are too difficult for them. An example of this can be seen in Test 1 in Figure 1. The thinking behind this interpretation can be related to the concept of 'information' in IRT.

Figure 1

Imagine three tests, each with ten questions. The questions in each of the tests have facility values as follows.

         Q1    Q2    Q3    Q4    Q5    Q6    Q7    Q8    Q9    Q10
Test 1   .50   .50   .50   .50   .50   .50   .50   .50   .50   .50
Test 2   .89   .75   .10   .02   .98   .15   .88   .22   .23   1.00
Test 3   .20   .45   .51   .72   .39   .44   .56   .61   .55   .48

The aim of any test is to separate out the candidates as much as possible so that decisions can be made on as reliable evidence as possible.

Test 1 would not be a terribly good test. There is no way for the most able students to distinguish themselves, as you would expect someone with little more than average ability to score very highly; neither is there any way to identify students who are struggling, as those who are just below average will score very little. Although each question may separate out able and weak students effectively, the test as a whole would not discriminate very well through the whole candidate range.

Test 2 would also not be an effective test. Only 2/100 students would be expected to answer Q4 correctly, while 98/100 would correctly answer Q5 and all would correctly answer Q10. Clearly this is not providing very much information. In a class of 30 students it would be reasonable to expect that no student would answer Q4 correctly, or Q5 and Q10 wrongly, in which case the test is relying on data from only 7 questions to separate out the candidates, giving a question range of 2 to 9 rather than 0 to 10. Most candidates would score 5 marks, answering questions 1, 2, 5, 7 and 10 correctly.

Test 3 would work well. Although some of the questions are a bit hard (e.g. Q1) or easy (e.g. Q4), most of them are close to 0.5, but still allow students at the ends of the ability range to distinguish themselves. In general, however, it may be better to have easier questions at the beginning (to build confidence) with more difficult ones toward the end.


Item Discrimination

Item discrimination is a measure of how the candidates perform on this question as opposed to another measure of performance. There are several methods used to calculate the discrimination of items, the most common being the Pearson product-moment correlation between the item and the total test score. Another common measure is the correlation between the item and the total test score minus that item. This is particularly useful where the item concerned carries a heavy weighting (e.g. a 25 mark item in a 40 mark test). These measures assume unidimensionality, where all questions are testing a single content area or skill. Where this is not the case and the test is designed to examine more than one content area or skill, it may be better to use an alternative measure of discrimination.

Where the test is in separate sections, and it might not be expected that students would perform equally on all, the correlation between the item and the other items in the same section might be more appropriate. Finally, where the test is made up of items of quite different characters, or content, it may be more appropriate to consider the correlation between the items and external measures of performance, such as correlating an examination essay with the candidates' essay marks over the year, and their examination practical marks with their performance in non-examination practicals.

Being essentially a correlation, item discrimination can vary from +1 (where there is a perfect relationship between those who score high marks on the item and those who score high marks on the test) to -1 (where there is a perfect inverse relationship between those scoring high marks on the item and on the test).

In general, item discrimination should always be positive, unless there is good reason to suppose that the assumption of unidimensionality has been violated. In such a case item discrimination should be positive within the sub-domain that the item tests, or (if it is the only item representing the sub-domain) with another, more representative, indicator of performance.

Negative item discriminations with a valid criterion should always be regarded as suspect. There is no upper limit for this statistic: the higher the correlation, the better the item discrimination and the better the item. Massey (1995) suggests that values below 0.2 are weak, and values above 0.4 are desirable. He also points out the effect of extreme facility values on item discrimination, where the reduced variance of these questions lowers the ceiling values for item discrimination.


The statistic for item discrimination is given by the formula

$$r_{xy} = \frac{\sum xy}{N S_x S_y}$$

where r_xy = the correlation between the item (x) and the test total (y)
Σxy = the sum of the products of the deviations of the items and the totals
N = the number of observations
S_x = the standard deviation of the item
S_y = the standard deviation of the total marks

The example in Figure 2 shows that the test total, being the most common measure, has been used. However, where another measure has been substituted (test total minus item; section total; external mark), these can be used in the equation in exactly the same way.

It should also be noted that questions with lower maximum marks have less potential for variance than those with higher maximum marks, and hence are likely to have lower item discriminations.

Figure 2

Look at the example below of a test of eight questions taken by ten candidates.

          q1     q2     q3     q4     q5     q6     q7     q8    total
—          1      1      2      3      5      0     10     10      32
Ann        1      1      4      3      0      0      9     10      28
Bill       1      1      2      3      0      1      8     10      26
Colin      0      1      4      3      0      0      7     10      25
David      0      0      0      3      0      1      6     10      20
Edna       1      1      3      2      0      2      5      2      16
Fred       1      1      3      2      0      0      4      4      15
Grant      0      1      5      2      0      0      3      2      13
Helen      0      1      1      2      0      0      2      6      12
Iain       0      1      1      2      0      2      0      6      12

Max        1      1      5      5      5      5     10     10      42
fac       0.50   0.90   0.50   0.50   0.10   0.12   0.54   0.70
disc1     0.50   0.00   0.12   0.90   0.57  -0.31   0.96   0.79
disc2     0.45  -0.04  -0.09   0.89   0.41  -0.41   0.88   0.46

disc1 = r(item, total); disc2 = r(item, total - item).

It can be seen that items worth large numbers of marks tend to correlate better with the total mark; however, once the item is subtracted (e.g. disc2, Q8), the correlation can fall substantially. Questions with facility values close to 0.5 tend to correlate better than those with extreme facility values.
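For readers who want to check these figures, the following Python sketch re-keys the Figure 2 data (the name of the first candidate is missing in the source) and should reproduce the fac, disc1 and disc2 rows to within rounding, using numpy's corrcoef for the Pearson correlations.

```python
import numpy as np

# Rows = candidates in Figure 2 order, columns = q1..q8.
scores = np.array([
    [1, 1, 2, 3, 5, 0, 10, 10],   # unnamed top-scoring candidate
    [1, 1, 4, 3, 0, 0,  9, 10],   # Ann
    [1, 1, 2, 3, 0, 1,  8, 10],   # Bill
    [0, 1, 4, 3, 0, 0,  7, 10],   # Colin
    [0, 0, 0, 3, 0, 1,  6, 10],   # David
    [1, 1, 3, 2, 0, 2,  5,  2],   # Edna
    [1, 1, 3, 2, 0, 0,  4,  4],   # Fred
    [0, 1, 5, 2, 0, 0,  3,  2],   # Grant
    [0, 1, 1, 2, 0, 0,  2,  6],   # Helen
    [0, 1, 1, 2, 0, 2,  0,  6],   # Iain
])
maxes = np.array([1, 1, 5, 5, 5, 5, 10, 10])
totals = scores.sum(axis=1)

for i in range(scores.shape[1]):
    item = scores[:, i]
    fac = item.mean() / maxes[i]
    disc1 = np.corrcoef(item, totals)[0, 1]          # r(item, total)
    disc2 = np.corrcoef(item, totals - item)[0, 1]   # r(item, total - item)
    print(f"q{i+1}: fac={fac:.2f}  disc1={disc1:.2f}  disc2={disc2:.2f}")
```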


OTHER INDICATORS OF QUESTION PERFORMANCE

Standard deviation

The standard deviation is a way of looking at question spread, and is given by the formula:

$$sd = \sqrt{\frac{\sum x^2}{N}}$$

where sd = the standard deviation of the item
x = the deviation of an observation from the item mean
N = the number of candidates

As it is conditional on maximum marks, the sd can be expressed as a percentage of the maximum. This eases comparison when questions have differing maximum marks. A high sd suggests that the question is spreading candidates out more effectively, but where item discrimination is poor this may not be desirable.

Uneven item variances (sd²) imply uneven achieved weightings for the questions and may not be altogether desirable. It should be noted, however, that high variance does encourage high item discrimination.

Reliability (internal consistency)

Reliability is the extent to which the measurements that we obtain are consistent. Any measurement (not just in testing) is made up of a true measure and an error. The reliability can thus be thought of as the proportion of the variation in the measure which is legitimate rather than error. The higher the reliability, the lower the amount of error variance in the test (as error variance is the proportion of test variance which is not legitimate). Thus the higher the reliability, the better the test as a whole has performed, and consequently the items within it.

There are three ways to estimate the reliability of a test. The first, test-retest (where the reliability is the correlation between the candidates' first score and their second score), assesses the stability of the examination. Another method of estimating the reliability is that of parallel forms (where the reliability is the correlation between the scores on one test and the other), which assesses both the stability and the equivalence of the examination. Obviously both these procedures are expensive and time consuming, thus ways of estimating the internal consistency (correlations between the items) of the test were developed. These can be thought of as the correlation between the test and all other possible tests which might be constructed from the hypothetical universe of questions measuring the same trait (Massey, 1995).



The desirable level of reliability is dependent on the type of examination being considered. In general, the more distinct items within an examination, the higher the internal consistency will be¹. For a multiple choice test, an internal consistency measure of over 0.90 is achievable and desirable. For multiple mark, short answer questions, measures in the range 0.65 - 0.80 are to be expected. In longer essay type examinations and more practical examinations, reliability may be as low as 0.40 without concern being raised.

Although reliability is traditionally thought to provide an upper limit for validity (Cronbach, 1971), this view has been rather subsumed by modifications (Moss, 1995) and challenges (Messick, 1989) which suggest that validity is a product of the uses to which the test is put, rather than an inherent quality. Taking this view, it is more appropriate to aim to provide valid, high quality assessment, with a reliability within the range usual for that type of examination, than to choose an examination type merely to achieve high internal consistency.

There are two formulas generally used for calculating reliability (internal consistency) coefficients: Cronbach's alpha², a generalised form of the Kuder-Richardson 20 formula (Cronbach, 1951), and Backhouse's P, a specific form of the alpha coefficient designed to cope with optional questions (Backhouse, 1972a, 1972b). Note that where n = n_j for all j, Backhouse's formula simplifies to that of Cronbach.

Cronbach's alpha (for a test with compulsory questions):

$$r_\alpha = \frac{k}{k-1}\left(1 - \frac{\sum_{i=1}^{k} s_i^2}{S^2}\right)$$

Backhouse's P (for a test with optional questions):

$$r_\alpha = (\lambda + 1)\left(1 - \frac{\sum_{j=1}^{k} n_j s_j^2}{n s_x^2}\right) - \lambda\,\frac{\sum_{j,t=1;\,j\neq t}^{k} n_{j,t}\, m_{j.t}\, m_{t.j} - nM^2}{n s_x^2}
\qquad\text{where}\qquad
\lambda = \frac{\sum_{j=1}^{k} n_j}{\sum_{j,t=1;\,j\neq t}^{k} n_{j,t}}$$

k = number of items
n = number of people taking the test
n_j = number of people attempting question j
n_{j,t} = number of people attempting both questions j and t
s_i = standard deviation of item i
S = standard deviation of the test
M = mean of test score
m_j = mean of scores on question j
m_{j.t} = mean of score on question j of those who also answered question t

¹ There is an interesting paradox inherent in the use of internal consistency estimates for reliability. Reliability increases with test length (Traub, 1994); however, the internal consistency of a test can be improved by removing the least discriminating items to make a shorter test!

² It has been suggested (Traub, 1994) that this should be replaced by L2 (Guttman, 1945), which provides a larger lower bound to reliability than coefficient α where the assumption of essential τ-equivalence does not hold. (Where this does hold, both statistics will be equal.)


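As a sketch of how the compulsory-question (Cronbach) formula might be computed from a candidate-by-item score matrix, assuming population (N-denominator) variances as in the formula above; the score matrix is invented for illustration.

```python
import numpy as np

def cronbach_alpha(scores):
    """Cronbach's alpha: (k/(k-1)) * (1 - sum of item variances / variance of totals)."""
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]                       # number of items
    item_vars = scores.var(axis=0)            # s_i^2 for each item
    total_var = scores.sum(axis=1).var()      # S^2, variance of the test totals
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Illustrative data: five candidates, four items.
scores = [[1, 2, 3, 2],
          [0, 1, 2, 1],
          [1, 2, 2, 2],
          [0, 0, 1, 1],
          [1, 1, 3, 2]]
print(round(cronbach_alpha(scores), 2))
```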

Standard Error of Measurement

We can also obtain a measure of how closely related the candidates' observed scores on the test are to their true scores by computing the standard error of measurement (SEmeas) using the formula below:

$$SE_{meas} = sd\sqrt{1 - r}$$

where sd = the standard deviation of the test
r = the reliability of the test

95% of candidates have a true score falling within ±2 SEmeas of their observed scores. With the latent trait models described later in this paper, although there is no way of directly assessing the reliability of the examination, a reliability estimate can be calculated from the standard errors of the estimates by essentially reversing the procedures described above.
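A minimal sketch of the SEmeas calculation, with invented figures:

```python
import math

def se_measurement(sd, reliability):
    """Standard error of measurement: sd * sqrt(1 - r)."""
    return sd * math.sqrt(1 - reliability)

# e.g. a test with sd = 12 marks and reliability 0.91
se = se_measurement(12, 0.91)
print(round(se, 1))      # 3.6
print(round(2 * se, 1))  # ~95% of candidates' true scores lie within +/- 7.2 marks of their observed scores
```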

Implications of Question Choice

Where candidate choice of questions is allowed in a test, problems are caused for the traditional statistics of item facility and discrimination. There are also additional considerations to be taken into account, such as the number and abilities of the candidates choosing each of the questions.

Choice Index

One of the considerations that must be taken into account when analysing choice type questions is how effective they are at attracting candidates. This can be defined as the proportion of the total population that attempted the question. Due to marker unfamiliarity, unpopular questions may be marked less consistently than popular questions, as examiners may find it difficult to internalise the standard of a less common question.



A choice index for each question may be calculated by the following method:

$$C = \frac{N_i}{N}$$

where N_i is the number of candidates attempting question i
N is the total number of candidates

Massey (no date) noted that examiners, in general, considered less popular questions to be intrinsically more difficult than more popular questions, and stated that they marked leniently to compensate for this; however, the marks candidates obtained on these questions still indicated that there had not been sufficient compensation. He suggests that if more popular questions were seen by candidates and teachers to be 'easier', this could lead to an undesired emphasis on certain frequently tested topics. It could also contribute to curriculum shrinkage, as McAlpine and Massey (1998) noted in the Core paper of a History syllabus. Thus, optional questions should ideally attract roughly equal numbers of candidates.

Mean Ability Index

A question may attract candidates within a particular range of ability. The mean ability index is a measure of the ability of the group of candidates attempting a question, and is given by the mean of their total percentage marks. The results of this index may have implications for curriculum coverage.

For example, should all of the weaker candidates opt for questions on a particular topic, it may imply that curricular differentiation is occurring. This may or may not be desirable, but question setters should be aware of it when setting questions.



Modified Facility Index

Where the ability indexes of the candidates attempting different questions are uneven, the facility index (as calculated above) may be biased. We might expect an able sub-group of candidates to fare better on a question than the population as a whole does; thus a question attempted by a group of candidates who are more able than average will tend to have a higher facility value, implying that it is less difficult than a question attempted by a group of less able candidates. Morrison (1972) suggests modifying the facility value as below, to make it easier to directly compare the difficulty of the questions by taking into account the abilities of the candidates that attempt each of the optional questions.

$$Mfac = 50 + M_Q - M_T$$

where Mfac = the modified facility value
M_Q = the mean question mark
M_T = the mean ability index

This is an attempt to get round the limitations of classical item analysis in terms of the sample dependence of items. Although it does correct to some extent for the abilities of the candidates taking each of the optional questions, it should be noted that these are still dependent on the population of test takers.
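The choice index, mean ability index and modified facility index can be computed together from a record of who attempted which question. The sketch below is hypothetical (the function name, data structures and sample data are all invented) and assumes that question and total marks are expressed as percentages, so that the Mfac = 50 + MQ - MT convention applies.

```python
def optional_question_indices(attempts, question_pct_marks, total_pct_marks):
    """attempts           : dict mapping candidate -> set of question ids attempted
       question_pct_marks : dict mapping (candidate, question) -> percentage mark on that question
       total_pct_marks    : dict mapping candidate -> total percentage mark on the test
       Returns, per question: choice index C, mean ability index MT and modified facility Mfac."""
    n_candidates = len(attempts)
    questions = {q for qs in attempts.values() for q in qs}
    results = {}
    for q in sorted(questions):
        takers = [c for c, qs in attempts.items() if q in qs]
        C = len(takers) / n_candidates                                       # choice index
        MQ = sum(question_pct_marks[(c, q)] for c in takers) / len(takers)   # mean question mark (%)
        MT = sum(total_pct_marks[c] for c in takers) / len(takers)           # mean ability index (%)
        results[q] = {"choice": C, "mean_ability": MT, "modified_facility": 50 + MQ - MT}
    return results

# Hypothetical example: two optional questions, four candidates.
attempts = {"c1": {"Q1"}, "c2": {"Q1"}, "c3": {"Q2"}, "c4": {"Q2"}}
qmarks = {("c1", "Q1"): 60, ("c2", "Q1"): 80, ("c3", "Q2"): 40, ("c4", "Q2"): 50}
totals = {"c1": 65, "c2": 75, "c3": 45, "c4": 55}
print(optional_question_indices(attempts, qmarks, totals))
```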

Discrimination

Where there are optional questions, it is best to calculate the discrimination of items by correlating the candidates' marks on the item with their total marks less their mark on that item, to give a less biased measure of discrimination. This increases in importance the larger the percentage of the total marks each optional question occupies.

Problems with Classical Test Theory

Some of the problems that have been noted with the use of classical test theory are:

• The perceived ability of a candidate is determined by the difficulty of the test taken. Where the test is difficult, the candidate will look as if they are of low ability, but where the test is easy, the candidate looks to be of high ability. It is thus difficult to compare the relative abilities of candidates taking two different tests.

• The attributes of each item are determined by the group of candidates who attempt them. Where the candidates taking the test are of high ability, facility values will be much greater than where the candidates taking the test are of low ability. It is thus difficult to construct tests for a population of candidates that is more or less able than the one used in the pre-test. The more homogeneous the candidates, the lower the discrimination.

• Test scores are not acknowledged as being differently reliable for different candidates. Although the SEmeas is generally assumed to be the same at all points of the distribution, this is not the case; at the extremes, the estimate of candidate ability is less reliable than in the middle of the distribution (Lord, 1984). This makes it difficult to adequately compare candidates' relative abilities.


Latent Trait Models (Item Response Theory and Rasch Modelling)

Item Response Theory and Rasch Modelling are both types of latent trait models. A latent trait model says that there is a relationship between the observable test performance of a candidate and the unobservable traits or abilities which underlie that performance (Hambleton and Cook, 1977). Latent Trait Theory was developed to get round the first two problems described earlier with Classical Test Theory. The item characteristics that these models produce are designed to be independent of the group of candidates that they were obtained from, so that the measurement of the candidates can be adequately equated across test forms that are not parallel. This is called invariance of item and ability parameters.

Invariance is a key aspect of latent trait models and supports item banking, adaptive testing and investigations of item bias in a way that classical test theory does not. Obtaining difficulty estimates for questions that are independent of the sample tested means that questions trialled on different groups can be ranked by difficulty (although in IRT2 and IRT3, this ranking may vary depending on the ability of the candidates to be tested - see Appendix). The possibilities for adaptive testing are increased, as one can obtain an independent measure of the ability of a candidate and match that with the difficulty of a question set. The assumption that the only characteristic of an examinee that will influence item performance is their ability on the underlying trait can be tested, and bias detected as a result.

Some assumptions underlie latent trait theory:

• the test is unidimensional: there is only one underlying trait or ability being tested

• there is local independence of items:
  Strong form - candidates' responses to questions are all independent of one another
  Weak form - pairs of test items are uncorrelated for candidates of the same ability

In effect these are equivalent, as unidimensionality of the latent space leads to candidates' responses to each item being independent of one another (the strong version of the second assumption); however, Lord (1968) suggests that the unidimensionality assumption is not satisfied for most tests.

Because the assumptions are equivalent, both can be tested for simultaneously using factor analytic techniques. Indeed, it has been suggested (McDonald, 1981) that the principle of local independence should be used to determine the unidimensionality of a test, and that where the co-variance between the items of a set is 0, the test comprised of that sub-set of items can be considered unidimensional.

A key notion of latent trait theory is that of item characteristic curves (ICCs), which map the probability of success on an item to the ability measured by the test; the ICC is the non-linear regression function of item scores on the latent trait which is measured by the test (Hambleton and Cook, 1977).


ITEM RESPONSE THEORY

Types of Model

There are three basic models in item response theory, of one, two and three parameters. In the one-parameter model, items can only vary in their difficulty; in the two parameter model, items can vary in both difficulty and discrimination; and in the three parameter model, in addition to varying in discrimination and difficulty, questions can also have a minimal probability of success greater than zero.

A four parameter model has also been developed to account for items which are consistently answered incorrectly by high ability candidates who have more advanced knowledge than the test constructor assumes. In this model the maximal probability of a correct answer may be less than one. This model is not widely used.

In item response theory, each ICC is a member of a family of curves, given by the general equation

$$P_g(\theta) = c_g + (1 - c_g)\,\frac{e^{D a_g(\theta - b_g)}}{1 + e^{D a_g(\theta - b_g)}}$$

where θ = the ability level of the candidate
P_g(θ) = the probability that any candidate of ability θ answers question g correctly¹
a_g = the gradient of the ICC at the point θ = b_g (item discrimination)
b_g = the ability level at which a_g is maximised (item difficulty)
c_g = the probability of minimal ability candidates getting question g correct (pseudo-chance level)*
D = a constant scaling factor, usually 1.7, which makes the distribution as similar to a normal ogive as possible

(*Note that this is not the same as the level expected by chance. It would normally be lower, due to candidates being distracted by the other options.)

In the one and two parameter models c_g is set at zero, and in the one parameter model D a_g is set to 1, so that all items are assumed to have the same discrimination. In multiple choice tests, where there is a substantial possibility of getting the questions correct by chance, use of the three parameter model can increase the fit between the data and the model.

¹ Hambleton (1989) points out the difficulty of interpreting P_g(θ) as the probability of a particular candidate of ability θ answering the question correctly: where two examinees of equal ability differ in their knowledge such that one will answer question g1 correctly but g2 wrongly, and the other will answer question g2 correctly but g1 wrongly, such a definition leads to a contradiction. Instead, it is better to interpret P_g(θ) as the probability of a randomly selected candidate of ability level θ answering the question correctly.

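The general equation translates directly into code. The sketch below evaluates the three-parameter ICC with D = 1.7; setting c_g = 0 gives the two-parameter curve, and additionally fixing D·a_g = 1 gives the one-parameter curve. The item values used are invented.

```python
import math

def icc_3pl(theta, a, b, c, D=1.7):
    """Probability that a randomly chosen candidate of ability theta answers the item correctly."""
    z = D * a * (theta - b)
    return c + (1 - c) * math.exp(z) / (1 + math.exp(z))

# Illustrative item: moderate discrimination, average difficulty, a guessing floor of 0.2.
for theta in (-2, -1, 0, 1, 2):
    print(theta, round(icc_3pl(theta, a=1.0, b=0.0, c=0.2), 3))
```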


Figure 3: example item characteristic curves, where a = discrimination, b = difficulty and c = chance.

Choosing different models may give different difficulty estimates, based on the parameterisation of the elements. Not only the absolute estimates, but also the rank ordering, may be changed by changing the number of calculated parameters.

The most common programs for calculating IRT parameters are MULTILOG and XCALIBRE. These will give an estimation of candidates' abilities, as well as identifying the requested parameters for each of the items. An extract from an XCALIBRE analysis is given below for both 2 and 3 parameter estimation. The sample test was a four item multiple choice test given to 400 candidates.

2 Parameter model

Final Parameter Summary Information:
             Mean     SD
   Theta     0.00    1.00
   a         0.79    0.15
   b        -1.99    0.86
   c         0.00    0.00

FINAL ITEM PARAMETER ESTIMATES
Item  Lnk  Flg    a      b      c    Resid    PC    PBs    PBt     N   Item name
----  ---  ---  -----  -----  -----  -----  -----  -----  -----  ----  ---------
   1             0.61  -1.70   0.00   0.65   0.79   0.36   0.35   400
   2             0.97  -0.75   0.00   0.30   0.69   0.60   0.60   400
   3             1.17  -2.32   0.00   0.92   0.94   0.51   0.42   400
   4             0.75  -2.45   0.00   0.45   0.91   0.35   0.31   400
   5        P    0.70  -3.00   0.00   1.19   0.97   0.11   0.09   400

3 Parameter Model

Final Parameter Summary Information:
             Mean     SD
   Theta     0.00    1.00
   a         0.82    0.13
   b        -1.64    1.00
   c         0.25    0.00

FINAL ITEM PARAMETER ESTIMATES
Item  Lnk  Flg    a      b      c    Resid    PC    PBs    PBt     N   Item name
----  ---  ---  -----  -----  -----  -----  -----  -----  -----  ----  ---------
   1             0.67  -1.09   0.25   0.49   0.79   0.36   0.36   400
   2             0.98  -0.43   0.23   0.54   0.69   0.60   0.59   400
   3             1.11  -2.14   0.25   0.86   0.94   0.51   0.45   400
   4             0.75  -2.05   0.25   0.24   0.91   0.35   0.33   400
   5        P    0.71  -3.00   0.25   1.06   0.97   0.11   0.10   400

The first table shows the mean and standard deviation of all relevant parameters. Theta, the estimated ability of each candidate, is standardised to have a mean of 0 and a standard deviation of 1 (see parameter estimation for further explanation).

The second table shows the item parameters (difficulty, discrimination and chance) for each question. The residual tells you how close the item is to the model; values over 2 are a cause for concern. The PC, PBs and PBt columns compare the model with classical item analysis. N is the number of candidates in the sample.

The higher the a parameter, the more the item discriminates. In IRT, the higher the discrimination the better the item; there is no upper limit, but values below 0.3 should be considered suspect. The higher the b parameter, the more difficult the item; values in the range -3 to 3 are to be expected, with the easier items toward the beginning of the test. Ideally the average b (table 1) should be 0, indicating that the candidates and the questions are of roughly equal ability/difficulty. The higher the c parameter, the more guessing is influencing the test result; in a multiple choice test you would expect the c parameter to be around 1/k, where k is the number of options.


Most IRT use is with single mark multiple choice questions; however, a graded response model has been developed (Samejima, 1969), as an adaptation of IRT2, to cope with multiple mark questions. It should be noted that the third parameter, the pseudo-chance parameter, is not really relevant in this case, as one would not imagine that candidates of minimal ability would score marks by chance. This model is given by the formula:

$$P_{x_i}(\theta) = \frac{e^{D a_i(\theta - b_{x_i})}}{1 + e^{D a_i(\theta - b_{x_i})}}$$

where the variables are as above, but
x_i = the score category of item i
b_{x_i} = the difficulty level of obtaining score x on item i
e = the base of natural logarithms (approximately 2.718)

This generates a large number of parameter values: one (difficulty) for each change in score level (e.g. a change from 2 marks to 3 marks) for each item of the test, plus another parameter for the item as a whole (discrimination). It should be noted that each change in score level does not have a discrimination; only the item as a whole does.

Estimation of the Parameters

Estimation of the ability and item parameters is the most important step in applying item response theory. This can be likened to the estimation of the coefficients of a regression model; however, θ (the regressor variable) is unknown. The parameters are estimated using a maximum likelihood procedure.

In item response models, the probability that a candidate will produce a correct response to a question is dependent on the ability of the candidate and the difficulty, discrimination (for IRT2 and IRT3) and pseudo-chance level (for IRT3) of the item. None of these parameters are known to start with, and they are all interlinked. Thus, all parameters must be estimated simultaneously. For a test of n items with N candidates, 3n+N parameters must be estimated for IRT3, 2n+N parameters for IRT2 and n+N parameters for IRT1.

Estimation of the parameters is usually done using the Joint Maximum Likelihood Procedure (JMLP).

Step 1

For this procedure to work, the dataset must be pre-edited to exclude candidates who have answered all items correctly or all items wrongly, and items which all candidates have answered correctly or all candidates have answered wrongly. Assuming that neither the ability estimates of the candidates nor the parameters of the items are known, initial estimates of the parameters must be made in order to start the procedure.



Step 2

Ability parameters are initially set to zero or to the log of the ratio of correct responses to incorrect responses, standardised to a mean of 0 and a standard deviation of 1. Once ability parameters are initially estimated using the above procedure, item parameters for each item may be estimated using the Newton-Raphson multivariate procedure. This will give estimates of the parameters for each item.

Step 3

Item difficulty parameter estimates are set as the log of the ratio of the number of candidates answering the question correctly to those answering wrongly, similarly standardised, or as zero. In an IRT3 model the pseudo-chance level is set to zero, and in IRT2 and IRT3 the discrimination index is set to one. Ability parameters are then estimated for each candidate using the Newton-Raphson procedure.

Step 4

Item parameter estimates obtained in Step 2 should now be used to obtain revised ability parameters for each candidate.

Step 5

Ability parameter estimates obtained in Step 3 should now be used to obtain revised item parameters (for difficulty, discrimination and pseudo-chance level where appropriate) for each item.

Steps 4 and 5 should be repeated until the values of the estimates do not change duringsuccessive estimation stages.
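To make the alternating logic of Steps 2-5 concrete, here is a deliberately simplified joint maximum-likelihood sketch for a one-parameter (Rasch-type) model with dichotomous items, using Newton-Raphson updates. It is not the algorithm of any particular package, and it omits the safeguards a real program needs (bounds on estimates, convergence checks, and handling of the discrimination and pseudo-chance parameters).

```python
import numpy as np

def jml_1pl(responses, n_iter=50):
    """Alternating Newton-Raphson estimation of abilities (theta) and difficulties (b)
    for a 0/1 response matrix (rows = candidates, columns = items).
    Assumes zero/perfect-scoring candidates and all-right/all-wrong items have
    already been removed, as in Step 1."""
    X = np.asarray(responses, dtype=float)
    # Starting values: persons from log(correct/incorrect); items from
    # log(incorrect/correct), so that harder items get larger difficulties.
    p_person = X.mean(axis=1)
    theta = np.log(p_person / (1 - p_person))
    p_item = X.mean(axis=0)
    b = np.log((1 - p_item) / p_item)
    b -= b.mean()                                   # fix the origin of the scale
    for _ in range(n_iter):
        P = 1 / (1 + np.exp(-(theta[:, None] - b[None, :])))   # model probabilities
        W = P * (1 - P)
        # Step 4: revise abilities given the current item difficulties.
        theta += (X - P).sum(axis=1) / W.sum(axis=1)
        P = 1 / (1 + np.exp(-(theta[:, None] - b[None, :])))
        W = P * (1 - P)
        # Step 5: revise difficulties given the current abilities.
        b += (P - X).sum(axis=0) / W.sum(axis=0)
        b -= b.mean()                               # re-centre after each pass
    return theta, b
```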

The deletion of candidates who score perfect marks or zero, and of items which all candidates answer correctly or all answer wrongly, disconcerts some IRT practitioners. Bayesian estimation, where prior distributions are superimposed onto the item and ability parameters, can eliminate these problems (Swaminathan and Gifford, 1986). However, others would argue that the ability of a candidate who scores zero or full marks cannot be estimated, as it is outwith the measuring potential of the items; similarly, the difficulty of an item that everyone or no-one answers correctly cannot be estimated, as it is outwith the capabilities of the candidates in our sample. Thus, they would argue, it is perfectly good measurement practice to discard such data, rather than this being a weakness of the model.

Where there is a discrimination parameter to be estimated, each iteration will increase the a_i (item discrimination) of the most discriminating items and underestimate the discrimination of the weakest, as each time the discriminatory power is associated with its weight, and hence its correlation with total score will increase, leading to a further increase in its discriminatory power. Obviously this becomes a little circular and, if left for long enough, one item will provide all of the discrimination in the test (Wright, 1977). To prevent this occurring, estimated bounds may be placed on the data.


Estimates for c_g (the pseudo-guessing parameter) in both the JMLP and the MMLP (Marginal Maximum Likelihood Procedure) are sometimes weak. Poor estimation of c_g can degrade the estimates obtained for the other parameters in the model, and unless limits are put on the item and ability parameter values, these procedures can fail. In the case of IRT3, Bayesian estimation can avoid the above problem.

Model Data Fit

Unlike classical test theory, where descriptive statistics are used, the usefulness of an IRT model is dependent on the fit of the model to the data. Hambleton and Swaminathan (1985) have suggested that the measure of fit should be based on three types of evidence.

1) Validity of the assumptions of the model for the data set
   • unidimensionality
   • the test is not speeded
   • guessing is minimal (for IRT1 and IRT2)
   • all items are of equal discrimination (for IRT1)

2) Extent to which the expected properties are obtained
   • invariance of item parameter estimates
   • invariance of ability parameter estimates

3) Accuracy of model predictions

Unidimensionality can be checked using the principle of local independence, by examining the co-variance of the items of the test (McDonald, 1981). Where the ratio of the variances of the number of items omitted and the number of items answered incorrectly is close to zero, the assumption that the test is not speeded is met (Gulliksen, 1950). If the performance level of the lowest ability candidates on the most difficult questions is close to zero, there is likely to be minimal guessing (assumed for IRT1 and IRT2), and to ensure the final assumption of the model for IRT1, the correlations between the items and the test scores should be relatively homogeneous.

To test for the invariance of ability parameter estimates, the ability estimates obtained using different sub-sets of test items should be reviewed. These may be items differing in difficulty or in content area within the domain. Where these do not vary beyond the measurement errors associated with the estimates, ability parameter invariance can be assumed (Wright, 1968). To test for the invariance of item parameter estimates, comparisons should be made of the item parameter estimates obtained for two random sub-groups of the test-taking population. When these estimates are plotted against one another, invariance of item parameters can be assumed if the plot is linear with gradient 1 and intercept 0, with minimal scatter (Shepard et al., 1984).

Checking the accuracy of the predictions of the model is usually done by the analysis of item residuals. Once ability and item parameters have been estimated, predictions about the performance of sub-groups of examinees are calculated (assuming that the model is valid).


Sub-groups are normally formed on the basis of ability, with intervals wide enough that the samples are not too small, yet narrow enough to ensure that the candidates' abilities are similar. To obtain the expected values, the midpoint of the ability category is usually used as an approximation for θ.

The residual, the difference between the predicted item performance and the observed performance for the sub-group, is then computed by the formula below:

$$r_{ij} = P_{ij} - E(P_{ij})$$

where i = the item
j = the ability category
r_ij = the residual for item i in ability category j
P_ij = the observed proportion of correct responses on item i for candidates of ability j
E(P_ij) = the expected proportion of correct responses on item i for candidates of ability j

This should be standardised by the formula below to take into account the sampling error of the expected proportion correct:

$$Z_{ij} = \frac{P_{ij} - E(P_{ij})}{\sqrt{E(P_{ij})\,[1 - E(P_{ij})]/N_j}}$$

where the variables are as above and
Z_ij = the standardised residual
N_j = the number of candidates in ability category j

If the data fit the model, the Z_ij statistics should be distributed more or less randomly, with a mean of 0 and a standard deviation of 1. The fit of the model to the data can be tested using the statistic Q1:

$$Q_1 = \sum_{j=1}^{m} Z_{ij}^2$$

where the variables are as above and m is the number of ability intervals.

The statistic Q1 is distributed as a χ². Where the Q1 statistic exceeds the critical value of a χ² with m - k degrees of freedom (where k is the number of parameters in the model) at a significance level of 0.05, the model does not fit the data and another model should be found.

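A rough sketch of the residual analysis for a single item, assuming the expected proportions E(P_ij) have already been obtained from the fitted ICC evaluated at each ability-group midpoint; the observed and expected values below are invented.

```python
import numpy as np

def q1_statistic(observed, expected, group_sizes):
    """Standardised residuals and Q1 for one item across m ability categories.
    observed, expected : proportions correct per ability category
    group_sizes        : number of candidates in each category"""
    observed = np.asarray(observed, dtype=float)
    expected = np.asarray(expected, dtype=float)
    n = np.asarray(group_sizes, dtype=float)
    z = (observed - expected) / np.sqrt(expected * (1 - expected) / n)
    return z, (z ** 2).sum()

# Hypothetical item, five ability categories of 40 candidates each.
obs = [0.22, 0.35, 0.58, 0.74, 0.90]
exp = [0.25, 0.40, 0.55, 0.70, 0.85]
z, q1 = q1_statistic(obs, exp, [40] * 5)
print(np.round(z, 2), round(q1, 2))  # compare q1 with the chi-squared critical value on m - k df
```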


Allocation of Ability to Candidates

One of the advantages of the IRT models is in tests where choice is allowed. As the rank ordering of candidates is generated on the underlying trait rather than on the test scores, candidates should still be placed in the correct rank order irrespective of which questions they chose to answer. In classical test theory, each mark is 'worth' exactly the same amount, regardless of whether it is an 'easy' mark or a 'hard' mark. In IRT, however, the iteration procedure described above takes this into account in calculating the ability of candidates, whichever mark points they chose to attempt.

For reporting purposes, however, the ability measure generated by IRT is rather inconvenient, as θ can vary between ±∞, and it is conceptually difficult to relate this to a test score. Thus a linear transformation is normally applied to θ which gives a value between 0 and the maximum test mark. This makes the reporting of test scores more comprehensible and eliminates negative scores. In the US, where IRT is most popular, this transformation generally intends to give a reported score centred around 500 with a high standard deviation, although the exact form varies from examination to examination.


CHARACTERISTICS OF ITEMS AND TESTS

Item Information

The information provided by an item is the contribution that the item gives to the estimation of ability at any given point along the ability continuum. It is given by the formula

$$I_i(\theta) = \frac{2.89\,a_i^2\,(1 - c_i)}{\left[c_i + e^{1.7 a_i(\theta - b_i)}\right]\left[1 + e^{-1.7 a_i(\theta - b_i)}\right]^2}$$

where I_i(θ) = the information provided by item i at the point θ on the ability scale
a_i = the discrimination of the item
b_i = the difficulty of the item
c_i = the pseudo-chance level of the item
e = the base of natural logarithms (approximately 2.718)

From this it can be seen that the information provided by an item is maximised when the difficulty level (b_i) is close to the ability level (θ), and that the information of an item increases as the discrimination (a_i) increases and as the pseudo-chance level (c_i) approaches 0.

Where the c_i parameter is zero, the point at which the item provides its maximum information is where θ = b_i. Where c_i is greater than zero, it provides its maximum information where θ is slightly larger than b_i (Birnbaum, 1968). The higher the discrimination of an item, the more information it provides. The higher the c_i parameter, the less information it provides.

Standard Error of EstimationStandard Error of EstimationStandard Error of EstimationStandard Error of EstimationStandard Error of Estimation

Test information at a given ability is the sum of the informations of the items in the test at that ability. As test information increases, the standard error of estimation (at θ) decreases. The standard error of estimation can be calculated by

$$SE(\hat{\theta}) = \frac{1}{\sqrt{\sum_{i=1}^{n} I_i(\theta)}}$$

where
SE(θ̂) = the standard error of estimation (SEest) at ability level θ
Σ Ii(θ) = the sum of the item informations at ability level θ for all n items in the test

This can be thought of as akin to the SEmeas in classical test theory; however, while the SEmeas tells us how confident we can be in the test's measurement as a whole, the SEest tells us how confident we can be at each ability level.

The SEest for each person tested can be averaged to produce an accurate variance, and hence an accurate measure of reliability for the sample tested (Wright and Stone, 1988). The reliability of the test over any subsequent group of candidates can then be calculated (Wright and Stone, 1979).
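The following sketch illustrates the relationship described above, summing item informations for a small set of invented items to obtain the test information and hence the standard error of estimation at several ability levels. It is an illustration of the formula only, not output from any assessment package.

```python
# Sketch of test information and the standard error of estimation described
# above: SE(theta) = 1 / sqrt(sum of item informations at theta). The item
# parameters (a, b, c per item) are invented for illustration.
import math

def item_information(theta, a, b, c):
    e_pos = math.exp(1.7 * a * (theta - b))
    e_neg = math.exp(-1.7 * a * (theta - b))
    return (2.89 * a**2 * (1 - c)) / ((c + e_pos) * (1 + e_neg) ** 2)

items = [(1.0, -1.0, 0.2), (1.3, 0.0, 0.25), (0.8, 0.5, 0.2), (1.5, 1.2, 0.15)]

for theta in (-2, -1, 0, 1, 2):
    test_info = sum(item_information(theta, a, b, c) for a, b, c in items)
    se = 1 / math.sqrt(test_info)     # standard error is ability dependent
    print(f"theta = {theta:+d}: test information = {test_info:.2f}, SE = {se:.2f}")
```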



Rasch Measurement

Rasch measurement is another form of latent trait theory. Like IRT, it seeks to explain the performance of candidates in terms of their ability on the trait underlying the test. Each item in Rasch has only one parameter, difficulty (δ). There is also a person parameter, ability (β).

Allocation of Ability to Candidates

Similar problems apply to the reporting of candidates' abilities in Rasch as in IRT. Rather than using a transformation to a true score scale for reporting purposes, Rasch usually reports the candidates' abilities in 'logits'.

Suppose a candidate of ability β1 has odds O1 of success on a question of difficulty δ and a candidate of ability β2 has odds O2 of success on the same question. As their odds of success are their ability divided by the difficulty of the question, the ratio of their odds is the same as the ratio of their abilities, so a candidate with twice the ability of another has twice the odds of success.

That is to say, by manipulation of the above relationship¹, it can be shown that the log of the ratio of their odds is the difference between the candidates' abilities. By similar reasoning it can be shown that the log of the ratio of the odds of success for a candidate on two different questions is the difference in the difficulties of the questions. By reporting the log of the odds of success, ability and difficulty can be reported on a linear, equal-interval scale. Most difficulty and ability scores are then confined to the region ±3.

In addition, as the origin and scale of the logits are arbitrary, the scale can be pre-set. It is usual to set the scale so that the mean item difficulty is reported as 0.

¹ Writing each candidate's ability in logit form, so that the abilities are e^β1 and e^β2, the odds ratio becomes

$$\frac{O_1}{O_2} = \frac{e^{\beta_1}}{e^{\beta_2}} = e^{\beta_1 - \beta_2} \;\Rightarrow\; \ln\frac{O_1}{O_2} = \beta_1 - \beta_2$$
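The relationship in footnote 1 can be checked numerically. In the sketch below the abilities and the item difficulty are arbitrary values in logits, chosen only for illustration.

```python
# Numerical sketch of the logit relationship above: for two candidates
# attempting the same item, the log of the ratio of their odds of success
# equals the difference between their abilities in logits. Values are invented.
import math

beta1, beta2, delta = 1.5, 0.5, 0.2     # abilities and item difficulty (logits)

odds1 = math.exp(beta1 - delta)         # odds of success for candidate 1
odds2 = math.exp(beta2 - delta)         # odds of success for candidate 2

print(math.log(odds1 / odds2))          # 1.0
print(beta1 - beta2)                    # 1.0 -- the same quantity
```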


The Rasch Model

The model is based on the interaction of each person v, with ability β, with each item i, with difficulty δ. The log of the odds of person v getting item i correct is β−δ; however, as this can range over ±∞, it is turned into a probability by taking e^(β−δ) and dividing by 1+e^(β−δ), to give a range between 0 and 1. Thus the probability of person v succeeding on item i is given by the formula below.

This is essentially the same equation as IRT1, with θ set to βv and bi set to δi. As Rasch modelling does not allow items to vary in their discrimination, items can always be put into a rank order of difficulty, unlike the IRT2 and IRT3 models, where the items can be differently ranked at different points in the ability range (see Appendix).

Figure 4

$$P_{i,v} = \frac{e^{(\beta_v - \delta_i)}}{1 + e^{(\beta_v - \delta_i)}}$$
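The sketch below implements the probability in Figure 4 and, purely for illustration, a toy maximum-likelihood routine that allocates an ability to one candidate from a set of already-calibrated item difficulties. It is not the estimation procedure used by WINSTEPS, RUMM or ConQUEST, and the difficulties and responses are invented.

```python
# Minimal sketch of the Rasch probability in Figure 4, plus a toy
# Newton-Raphson maximum-likelihood ability estimate for one candidate with
# known item difficulties. Illustrative only; not any package's algorithm.
import math

def p_correct(beta, delta):
    return math.exp(beta - delta) / (1 + math.exp(beta - delta))

def estimate_ability(responses, difficulties, beta=0.0):
    # responses: 1/0 scores; difficulties: known item difficulties (logits)
    for _ in range(20):
        probs = [p_correct(beta, d) for d in difficulties]
        gradient = sum(r - p for r, p in zip(responses, probs))
        curvature = -sum(p * (1 - p) for p in probs)
        beta -= gradient / curvature          # Newton-Raphson step
    return beta

difficulties = [-1.0, -0.4, 0.0, 0.6, 1.2]    # invented calibrated difficulties
responses    = [1, 1, 1, 0, 0]                # one candidate's scores
print(round(estimate_ability(responses, difficulties), 2))
```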

There are a number of specialist programs which conduct Rasch analysis, including WINSTEPS, RUMM and ConQUEST. The following analysis is an extract from ConQUEST for a dichotomously scored multiple choice test.

VARIABLES                          UNWGHTED FIT      WGHTED FIT
item          ESTIMATE   ERROR     MNSQ      T       MNSQ      T
 1 BSMMA01      0.363    0.050     0.87    -3.0      0.91    -2.5
 2 BSMMA02     -0.178    0.052     0.98    -0.3      0.97    -0.7
 3 BSMMA03     -0.025    0.051     0.96    -0.8      0.99    -0.3
 4 BSMMA04      0.836    0.049     0.96    -1.0      0.96    -1.0
 5 BSMMA05      1.179    0.049     1.09     1.9      1.10     2.5
 6 BSMMA06     -0.312    0.052     1.04     0.9      1.06     1.6
 7 BSMSA07     -0.389    0.053     0.97    -0.7      0.96    -1.0
 8 BSMSA08     -0.324    0.053     1.05     1.1      1.05     1.3
 9 BSMSA09     -0.966    0.056     0.97    -0.6      0.99    -0.2
10 BSMSA10     -0.391    0.053     1.00     0.1      1.01     0.3

The estimate term is a measure of the difficulty, which generally varies from -3 to 3. It can be seen that the easiest question in this test is item 9, while the most difficult is item 5. Negative difficulty estimates indicate questions which are easier than the average ability of the candidates.

The error refers to the standard error of calibration, measured in logits, which indicates how accurately the difficulty of the question has been estimated. Error statistics of below 0.25 may be acceptable in certain cases, but measures below 0.12 are desirable (Wright, 1977).

The fit statistics (see section below) are an indication of how well the data fit the Rasch model. Any standardised fit statistic over 2 (or mean fit + 2 standard deviations of the fit) is considered to suggest that the item is misfitting.
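Applying the rule of thumb stated above to the weighted fit t-values transcribed from the extract gives a simple, explicit check; the code below is only a convenience for scanning the table.

```python
# Flag items whose standardised (weighted) fit t-value exceeds 2, using the
# rule of thumb given above and the values transcribed from the extract.
weighted_t = {1: -2.5, 2: -0.7, 3: -0.3, 4: -1.0, 5: 2.5,
              6: 1.6, 7: -1.0, 8: 1.3, 9: -0.2, 10: 0.3}

misfitting = [item for item, t in weighted_t.items() if t > 2]
print(misfitting)   # [5] -- item 5 is flagged as misfitting
```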


This model has been expanded (Wright and Masters, 1982) to cope with multiple mark questions, as below (the partial credit model).

$$P_{vix} = \frac{\exp\left(\sum_{k=0}^{x} (\beta_v - \delta_{ik})\right)}{\sum_{j=0}^{m_i} \exp\left(\sum_{k=0}^{j} (\beta_v - \delta_{ik})\right)}$$

With variables as above and where
mi = the maximum number of mark points of item i
j = the mark points available
k = 0, 1, 2, ... mi (the k = 0 term of each sum is defined as zero)
x = the score of the candidate on the item¹
e = the base of natural logarithms (approximately 2.718)

In the partial credit model, there is a difficulty level associated with each score threshold. For example, for a 3-mark question there are three difficulty levels: the difficulty associated with scoring the first mark, the difficulty associated with scoring the second mark once the first has been achieved, and the difficulty of scoring the third mark once the first two have been gained.
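The sketch below evaluates the partial credit model above for a single 3-mark item; the threshold difficulties and the candidate ability are invented for illustration.

```python
# Sketch of the partial credit model above for one 3-mark item. The threshold
# difficulties (for the 1st, 2nd and 3rd marks) and the ability are invented.
import math

def pcm_probabilities(beta, thresholds):
    # Numerator for score x: exp of the sum of (beta - delta_k) for the marks
    # obtained (an empty sum, i.e. 1, for score 0); denominator: the sum of
    # all possible numerators.
    numerators = [1.0]                       # score 0
    running = 0.0
    for delta in thresholds:
        running += beta - delta
        numerators.append(math.exp(running))
    total = sum(numerators)
    return [n / total for n in numerators]

thresholds = [-0.5, 0.3, 1.1]                # difficulty of 1st, 2nd, 3rd mark
for score, p in enumerate(pcm_probabilities(beta=0.4, thresholds=thresholds)):
    print(f"P(score {score}) = {p:.2f}")
```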

Advantages of the Rasch Model

In addition to the advantages of the IRT models over the classical models, the Rasch model yields consistent difficulty rankings for items at different points on the ability continuum; this is not necessarily the case in IRT2 and IRT3. Rasch also simplifies the model by only having one parameter to estimate for each item, eradicating some of the problems noted with parameter estimation in IRT2 and IRT3.

Parameter estimation

The procedure for estimating the parameters in a Rasch model is similar to that of IRT1. Bayesian estimation is not encouraged in Rasch measurement, where it is considered more desirable to exclude data that does not fit the model as unrepresentative, rather than changing the model to accommodate it (Wright, 1979).

In IRT, Bayesian estimation is usually used in IRT2 and IRT3, to prevent the item with the largest initial discrimination overpowering the other items and to obtain better estimates of ci. As the discrimination is pre-set and there is presumed to be minimal guessing, the use of Bayesian estimation is not so necessary in Rasch.

¹ Essentially the numerator is the exponential of the sum of the (β-δ)s for all mark points up to and including the score obtained (note that to score 2, you must also have scored 1). The denominator is the sum of all possible numerators.



Model data fit

The fit of the data to the model can be estimated using the same procedures as used for IRT. In IRT, however, should the model not fit the data, practitioners would recommend that the model be exchanged for one which fitted it better, perhaps one with a different number of parameters. In Rasch, the above is more a test of how well the data fit the model rather than how well the model fits the data¹. Thus candidates and items who do not fit the model may be excluded from the estimation of parameters and the estimation procedure performed again (Wright, 1979).

For each item, the squared residuals, each divided by its variance (that is, the squared standardised residuals), are summed over candidates and divided by the number of candidates to give a mean-square measure of the fit of the item to the model; these can be standardised to give a weighted fit statistic. If the model fitted the data perfectly, these standardised statistics would have a standard deviation of 1 and a mean of 0. Any standardised fit statistic over 2 (or mean fit + 2 standard deviations of the fit) is considered to suggest that the item is misfitting. If these are very large, the parameter estimation procedure may be rerun excluding these items; difficulty estimates for the excluded items can then be obtained by substituting the previously acquired ability estimates at the final iteration.
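The following sketch shows one textbook-style way of computing an unweighted (outfit) and a weighted (infit) mean-square fit statistic for a single item under the Rasch model. It is illustrative only and is not the algorithm used by ConQUEST or any other package; the abilities, difficulty and responses are invented.

```python
# Illustrative unweighted (outfit) and weighted (infit) mean-square fit
# statistics for one item under the Rasch model, along the lines described
# above. Abilities, difficulty and responses are invented.
import math

def p_correct(beta, delta):
    return math.exp(beta - delta) / (1 + math.exp(beta - delta))

delta     = 0.3                                   # item difficulty (logits)
abilities = [-1.5, -0.7, -0.2, 0.4, 0.9, 1.6]     # candidate abilities (logits)
responses = [0, 0, 1, 1, 0, 1]                    # scores on this item

p = [p_correct(b, delta) for b in abilities]
variance = [pi * (1 - pi) for pi in p]            # variance of each response
z_sq = [(r - pi) ** 2 / v for r, pi, v in zip(responses, p, variance)]

outfit = sum(z_sq) / len(z_sq)                    # unweighted mean square
infit = sum((r - pi) ** 2 for r, pi in zip(responses, p)) / sum(variance)  # weighted
print(f"outfit MNSQ = {outfit:.2f}, infit MNSQ = {infit:.2f}")
```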

Criticisms of the Rasch Model

The Rasch model became very popular in Britain in the 1970s, mainly through its use in the Assessment of Performance Unit (APU). Concerns were raised about the assumption of unidimensionality, suggesting that tests are not, and should not be, designed to test only one domain, but should be a heterogeneous mix of questions (McLean and Ragsdale, 1983). Other criticisms stemmed from its use in trying to monitor standards over time (Rust and Golombok, 1989).

One claim of the Rasch model is that it produces item statistics which are free of the candidates they were calibrated on. This has obvious implications for the monitoring of standards. Unfortunately, using Rasch to monitor standards is not as unproblematic as it would first appear. The monitoring of standards through Rasch can be a very powerful conservative influence on the curriculum.

The candidates that attempt the items initially bring certain abilities (both curricular and general) with them to the test. These abilities are socially determined (and thus temporally bound). As society changes, the abilities of candidates will change. The items that were banked when monitoring started would be based on the educational assumptions of the time.

¹ Rasch (1960) argued that when items showed up as misfitting, it was a sign of discord between the model and the data; it was then an open question whether the problem lay with an inappropriate model or with unrepresentative data. He argued, in contrast to IRT practitioners, that the model should be taken over the data.


Ten years later, the education that candidates receive would be different, through changes in the curriculum, in educational practice and in society at large. This would cause the candidates to have a different knowledge and skills profile from their predecessors. As only the domain that was considered important at the time the item bank was constructed would feature in the items, achievements in new curricular innovations would not be valued as highly as achievements in the traditional areas.

Furthermore, Rasch is an inverted type of statistical modelling in many senses. One of the key notions is that the data should fit the model rather than the model fitting the data. Where Rasch is used to monitor standards through the construction of item banks (as used by the Assessment of Performance Unit), only items which fit the model will be incorporated. This could well lead to a worrying washback effect, where methods of learning which maximise the chance of success on 'Rasch compatible items' are more highly valued than other types of learning (Goldstein and Blinkhorn, 1977).

It must also be appreciated that examinations are used for purposes other than the measurement of candidates' abilities. Although in certain high stakes examinations maximal accuracy of measurement may be a priority, this is not the case in most Western post-industrial countries. The use of examinations for accountability purposes is much stronger than their use for selection purposes. This is not to say that they are not used for selection, only that the wider purposes of their use have overtaken the kind of high stakes measurement seen in Sri Lanka or in the post-war tripartite system. The feedback from assessments using Rasch measurement is much less than from those using traditional statistics, although Lawton (1980) suggests that it was for this purpose that the APU was conceived, and decided to use Rasch measurement.

Comparison of school facility values with overall facility values gives an important indicator to schools about their areas of relative weakness and strength (Massey et al., 1996). Such a technique could be adapted to fulfil the needs of higher education institutions, not only identifying areas of strength and weakness but also ensuring that marking was equivalent across different institutions. Such a comparison would be much more difficult to interpret meaningfully using Rasch.


Conclusion

Where tests are used only once, classical test theory would appear adequate for evaluating the quality of the examination. The item/candidate interdependence is not really an issue, as the test is only taken once by a single group of candidates. Where choice is allowed, the traditional statistics used must be adapted to cope with this.

Latent trait models can be useful where tests and/or items are used with more than one candidate group, as is more common in higher education, as they overcome the problems of sample dependence. This leads to an item banking approach, where questions are added to a general bank and extracted as needed, although monitoring of the item bank itself must be undertaken to ensure that the items are representative of that which is being taught. Latent trait models also have the advantage that they give a more precise interpretation to the standard error, by defining it as an ability dependent measure.

Problems can be encountered in IRT2 and IRT3 when trying to estimate the parameters. These can be overcome by pre-specifying the expected item and ability distributions. There can also be problems associated with trying to rank questions in difficulty order, as the introduction of a discrimination parameter changes the rankings at different points in the ability scale (see Appendix). Essentially, in IRT2 and IRT3, question difficulty rankings are ability dependent, somewhat violating the claim that the parameters are sample independent. IRT1 and Rasch do not suffer from these problems, yielding consistent difficulty estimates which are relatively easy to extract.

Future developments in assessment, particularly the importance of the standards debate and the development of computer adaptive testing (CAT), may well lead to further innovations in item analysis. The importance of maintaining standards over time and across institutions does suggest that an item banking approach, either for the examination itself or, more likely, for calibrating other items - perhaps including some banked items into a "disposable" test - may become commonplace. Obviously the caveats given above must be given careful consideration, but a latent trait approach would be more suitable for these items.

Bearing in mind that IRT2 and IRT3 do not yield consistent estimates of item difficulty across the ability range (see Appendix), either IRT1 or Rasch would seem more suitable. Computer adaptive testing, where a large bank of questions can be developed and used over and over again in different combinations, would also seem to favour IRT1 or Rasch as a method of generating results.


REFERENCES

Birnbaum (1968) Some Latent Trait Models and their use in inferring an examinee's ability, Chapter 17 in Lord and Novick, Statistical Theories of Mental Test Scores, Addison-Wesley, Reading

Backhouse (1972a) Reliability of GCE Examinations: A theoretical and empirical approach, Chapter 7 in Nuttall and Willmott, British Examinations - Techniques of Analysis, NFER, Windsor

Backhouse (1972b) Mathematical Derivations of Formulas P, Q and S, Appendix 2 in Nuttall and Willmott, British Examinations - Techniques of Analysis, NFER, Windsor

Butterfield (1995) Educational Objectives and National Assessment, Open University Press

Cronbach (1951) Coefficient alpha and the internal structure of tests, Psychometrika, Vol 16, pp 297-334

Cronbach (1971) Test validation, in R. L. Thorndike (Ed.), Educational Measurement (2nd Ed.), American Council on Education, Washington, D. C.

Goldstein and Blinkhorn (1977) Monitoring Educational Standards - An Inappropriate Model, Bulletin of the British Psychological Society, Vol 30, pp 309-311

Gulliksen (1950) Theory of Mental Tests, John Wiley, New York

Guttman (1945) A Basis for Analysing Test-Retest Reliability, Psychometrika, Vol 10, pp 255-282

Hambleton (1989) Principles and Selected Applications of Item Response Theory, Chapter 4 in Linn, Educational Measurement (third edition), Oryx Press, Phoenix

Hambleton and Cook (1977) Latent Trait Models and their use in the Analysis of Educational Test Data, Journal of Educational Measurement, Summer 1977

Hambleton and Swaminathan (1985) Item Response Theory: Principles and Applications, Kluwer, Boston

Hambleton, Swaminathan and Rogers (1991) Fundamentals of Item Response Theory, Sage, London

Lord (1968) An analysis of the Verbal Scholastic Aptitude Test using Birnbaum's three parameter logistic model, Educational and Psychological Measurement, Vol 28, pp 989-1020


Lord (1984) Standard Errors of Measurement at Different Ability Levels, Journal of Educational Measurement, Vol 21, pp 239-243

McDonald (1981) The Dimensionality of Tests and Items, British Journal of Mathematical and Statistical Psychology, Vol 34, pp 100-117

McLean and Ragsdale (1983) The Rasch Model for Achievement Tests - Inappropriate in the Past, Inappropriate Today, Inappropriate Tomorrow, Canadian Journal of Education, Vol 8, Issue 1

Massey (no date) The relationship between the popularity of questions and their difficulty level in examinations which allow a choice of question, Occasional Publication of the Test Development and Research Unit, Cambridge

Massey (1995) Evaluation and analysis of examination data: Some guidelines for reporting and interpretation, UCLES internal report, Cambridge

Massey, McAlpine and Pollitt (1996) Schools' Reactions to Detailed Feedback from a Public Examination, UCLES internal report, Cambridge

McAlpine and Massey (1998) MEG GCSE History (Syllabus 1607) June 1997: An evaluation of the Measurement Characteristics and Quality of the Examination, UCLES internal report, Cambridge

Messick (1989) Validity, in R. L. Linn (Ed.), Educational Measurement (3rd ed., pp 13-103), New York

Morrison (1972) A method for analysis of choice type question papers, Chapter 5 in Nuttall and Willmott, British Examinations - Techniques of Analysis, NFER, Windsor

Moss, P. A. (1994) Can there be validity without reliability? Educational Researcher, Vol 23, pp 5-12

Rasch (1960) Probabilistic Models for Some Intelligence and Attainment Tests, Danish Institute for Educational Research, Copenhagen

Rust and Golombok (1989) Modern Psychometrics, Routledge, London

Samejima (1969) Estimation of latent ability using a response pattern of graded scores, Psychometric Monograph No 18, Psychometric Society, Iowa City

Shepard, Camilli and Williams (1984) Accounting for Statistical Artefacts in Item Bias Research, Journal of Educational Statistics, Vol 9, pp 93-128

Swaminathan and Gifford (1986) Bayesian Estimation in the three parameter model, Psychometrika, Vol 51, pp 589-601


Traub (1994) Reliability for the Social Sciences: Theory and Applications, Sage, London

Wright (1968) Sample free test calibration and person measurement, Proceedings of the 1967 Invitational Conference on Testing Problems, Educational Testing Service, Princeton

Wright (1977) Solving Measurement Problems with the Rasch Model, Journal of Educational Measurement, Summer 1977

Wright (1988) Reliability in Rasch Measurement: Research Memorandum No 53, Alfred Adler Institute and MESA Psychometric Laboratory

Wright and Masters (1982) Rating Scale Analysis: Rasch Measurement, MESA Press, Chicago

Wright and Stone (1979) Best Test Design: Rasch Measurement, MESA Press, Chicago


APPENDIX

As the discrimination of items in IRT2 and IRT3 is allowed to differ, these models do not give consistent rankings of item difficulty across the range of candidates. Figure 5 illustrates the Item Characteristic Curve for items 1 and 2 using an IRT1/Rasch model, while Figure 6 illustrates the ICC for the same items using an IRT2 model.

In Figure 5 it can clearly be seen that item 1 is consistently easier than item 2. In Figure 6, however, the picture is less clear: although item 1 is easier than item 2 for most candidates, for the highest ability candidates item 2 appears easier than item 1.

This causes difficulty in interpreting the bi parameter and challenges the claim that the parameters in all item response models are sample independent. These models suggest that the ability of the candidate determines the relative difficulties of the questions.
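The effect can be reproduced with a short calculation. In the sketch below the two sets of item parameters are invented so that the more discriminating item is the harder of the two for most candidates but becomes the easier item at high ability, as in Figure 6.

```python
# Sketch of the point made above: when discriminations differ (IRT2), two
# items can swap their difficulty ordering at different ability levels.
# The item parameters are invented for illustration.
import math

def p_2pl(theta, a, b):
    return 1 / (1 + math.exp(-1.7 * a * (theta - b)))

item1 = dict(a=0.6, b=-0.2)    # less discriminating, lower difficulty
item2 = dict(a=1.6, b=0.4)     # more discriminating, higher difficulty

for theta in (-2.0, 0.0, 2.0):
    p1, p2 = p_2pl(theta, **item1), p_2pl(theta, **item2)
    easier = "item 1" if p1 > p2 else "item 2"
    print(f"theta = {theta:+.1f}: P1 = {p1:.2f}, P2 = {p2:.2f} -> {easier} is easier")
```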

Figure 5: Item Characteristic Curve; IRT1/Rasch Analysis

Figure 6: Item Characteristic Curve; IRT2/IRT3

[Figures 5 and 6 each plot the probability of answering the question correctly (up to 100%) against the ability of the candidate (0 to ∞).]


Published for the CAA Centre by
Learning and Teaching Development
Loughborough University
Loughborough
Leicestershire
LE11 3TU

Telephone: +44 (0) 1509 222893
Fax: +44 (0) 1509 223927

ISBN: 1-904020-02-X

CAA Centre