
Page 1: NATIONAL CONFERENCE ON STUDENT ASSESSMENT JUNE 22, 2011 ORLANDO, FL

NATIONAL CONFERENCE ON STUDENT ASSESSMENT JUNE 22, 2011, ORLANDO, FL

Page 2: NATIONAL CONFERENCE ON STUDENT ASSESSMENT JUNE 22, 2011 ORLANDO, FL


What is different about the adaptive context?

How do you conceptualize adaptive assessments?

How do you make the transition from fixed form thinking?

How can you evaluate the quality of these tests?

Page 3: NATIONAL CONFERENCE ON STUDENT ASSESSMENT JUNE 22, 2011 ORLANDO, FL

In the fixed form world…
• Test Blueprint + items = Test Form = Student Test Event
• Percent correct is an indicator of difficulty
• Commonly accepted criteria for acceptance

Page 4: NATIONAL CONFERENCE ON STUDENT ASSESSMENT JUNE 22, 2011 ORLANDO, FL

In the adaptive context…
• The Test Blueprint is a design for the student test event
• Item pool + test structure + algorithm determine each test event
• Variable linking block (all items)
• P-values close to .5
• Metrics not as well established

Page 5: NATIONAL CONFERENCE ON STUDENT ASSESSMENT JUNE 22, 2011 ORLANDO, FL

Everything supports the test event


Page 6: NATIONAL CONFERENCE ON STUDENT ASSESSMENT JUNE 22, 2011 ORLANDO, FL

What’s going on here?
• You are moving from the concept of a population responding to a form into the realm of a person responding to an individual item.
• Indicators based on sets of people responding to sets of items may be uninformative.
• The scale representing the latent trait assumes greater importance.

Page 7: NATIONAL CONFERENCE ON STUDENT ASSESSMENT JUNE 22, 2011 ORLANDO, FL

Move from population-based thinking to responses to items
• Forms are not linked to one another. The pool consists of items linked to the scale. Scores from non-parallel tests are expressed and interpreted on the scale.
• Percent correct is not important in assessing ability. The test event establishes the difficulty of the items a student is getting right about half the time.
• The goal of the test session is to solve for theta (use the IRT equation with your favorite number of parameters; see the sketch below).
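A minimal sketch of what “solve for theta” looks like in code, assuming a 2PL model and hypothetical item parameters (a Rasch or 3PL model would slot in the same way):

```python
# Minimal sketch: maximum-likelihood theta under a 2PL model, by grid search.
# The item parameters and responses below are hypothetical.
import numpy as np

def p_correct(theta, a, b):
    """2PL probability of a correct response."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def estimate_theta(responses, a, b, grid=np.linspace(-4, 4, 801)):
    """Return the grid value of theta that maximizes the response likelihood."""
    p = p_correct(grid[:, None], a, b)          # shape (grid points, items)
    loglik = (responses * np.log(p) + (1 - responses) * np.log(1 - p)).sum(axis=1)
    return grid[np.argmax(loglik)]

a = np.array([1.2, 0.8, 1.5])     # discriminations (hypothetical)
b = np.array([-0.5, 0.3, 1.0])    # difficulties (hypothetical)
responses = np.array([1, 1, 0])   # right, right, wrong
print(estimate_theta(responses, a, b))
```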

Page 8: NATIONAL CONFERENCE ON STUDENT ASSESSMENT JUNE 22, 2011 ORLANDO, FL

Start with the Test Blueprint
• What do you want every student to get?
  – Content: categories and proportions
  – Cognitive characteristics
  – Item types
• How many items in each test event?
• What are you going to report? For individuals? For groups?
  – Overall scores
  – Sub-scores
  – Achievement category
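For illustration only, a blueprint like the one described above can be written down as structured data before any algorithm work starts; every category, proportion, and count here is hypothetical:

```python
# Hypothetical blueprint: content proportions, item counts, and reports
# that every test event should satisfy.
blueprint = {
    "items_per_event": 40,
    "content": {                  # category -> proportion of the test event
        "number_sense": 0.30,
        "algebra": 0.30,
        "geometry": 0.20,
        "data_analysis": 0.20,
    },
    "item_types": {"multiple_choice": 0.90, "constructed_response": 0.10},
    "reports": ["overall_score", "subscores", "achievement_category"],
}
```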

Page 9: NATIONAL CONFERENCE ON STUDENT ASSESSMENT JUNE 22, 2011 ORLANDO, FL

How do you evaluate pool adequacy?
• Reckase – p-optimal pool evaluation. Analysis of “bins.” Satisfy some proportion of a fully informative pool.
• It’s unrealistic to expect that every value of theta will have a maximally informative item. This method specifies a degree of optimality.
• The p-optimal method can be used to evaluate existing pools or to specify pool design (a minimal bin-counting sketch follows).
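A minimal sketch of the bin-counting idea, not Reckase’s full p-optimal procedure; the simulated pool, bin edges, and per-bin targets are all hypothetical:

```python
# Count pool items per difficulty bin and compare to a target count per bin.
import numpy as np

def bin_coverage(difficulties, bin_edges, target_per_bin):
    """Fraction of the per-bin target met by the pool in each bin."""
    counts, _ = np.histogram(difficulties, bins=bin_edges)
    return counts, np.minimum(counts, target_per_bin) / target_per_bin

b = np.random.default_rng(0).normal(0.0, 1.0, 300)   # simulated pool difficulties
edges = np.linspace(-3, 3, 13)                        # twelve half-logit bins
counts, coverage = bin_coverage(b, edges, target_per_bin=20)
print(counts, coverage.round(2), "mean coverage:", coverage.mean().round(2))
```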

Page 10: NATIONAL CONFERENCE ON STUDENT ASSESSMENT JUNE 22, 2011 ORLANDO, FL

How do you evaluate pool adequacy?
• Veldkamp & van der Linden – the shadow-test method:
  1. At every point in the test, assemble a shadow test that meets all constraints and has maximum information at the current ability estimate.
  2. Administer the item in the shadow test with maximum information.
  3. Update the ability estimate.
  4. Return all unused items to the pool.
  5. Adjust the constraints to account for the attributes of the item administered.
  6. Repeat Steps 1–5 until the end of the test.

Page 11: NATIONAL CONFERENCE ON STUDENT ASSESSMENT JUNE 22, 2011 ORLANDO, FL

Adaptive Test Design – Algorithm
• How will you guarantee that each student gets the material in your test design?
  – Item selection, scoring, domain sampling
• How will you guarantee reliable scores and categories?
  – Overall scores
  – Sub-scores
  – Achievement category
• How do you control for item exposure? (One common approach is sketched below.)
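The deck doesn’t name an exposure-control method, so here is a minimal sketch of one common idea, “randomesque” selection: pick at random from the k most informative unused items rather than always administering the single best one. Item parameters are hypothetical.

```python
# Randomesque exposure control: choose randomly among the top-k most
# informative unused items (2PL information).
import numpy as np

def item_information(theta, a, b):
    p = 1.0 / (1.0 + np.exp(-a * (theta - b)))
    return a**2 * p * (1 - p)

def select_item(theta, a, b, administered, k=5, rng=None):
    rng = rng or np.random.default_rng()
    info = item_information(theta, a, b)
    info[list(administered)] = -np.inf       # never repeat an item
    top_k = np.argsort(info)[-k:]            # indices of the k most informative items
    return int(rng.choice(top_k))
```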

Page 12: NATIONAL CONFERENCE ON STUDENT ASSESSMENT JUNE 22, 2011 ORLANDO, FL


Adaptive test event – Start

Assumption: you have a calibrated item pool that supports your test purpose

What do you need to know about the examinee?

How will you choose the initial item?

Jumping into the item pool

Page 13: NATIONAL CONFERENCE ON STUDENT ASSESSMENT JUNE 22, 2011 ORLANDO, FL


Adaptive test event – Finding Theta
Assumption: you have a response to the initial item

How do you estimate ability?

How do you estimate error?

How do you choose the next item?

How do you satisfy your test event design?

Progressing through the item pool
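On the “How do you estimate error?” question: a minimal sketch, assuming 2PL items, of the conditional standard error taken from the test information function at the current estimate:

```python
# Standard error of measurement at theta_hat = 1 / sqrt(test information).
import numpy as np

def standard_error(theta_hat, a, b):
    p = 1.0 / (1.0 + np.exp(-a * (theta_hat - b)))
    test_info = (a**2 * p * (1 - p)).sum()   # sum of item information so far
    return 1.0 / np.sqrt(test_info)          # shrinks as more informative items are given
```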

Page 14: NATIONAL CONFERENCE ON STUDENT ASSESSMENT JUNE 22, 2011 ORLANDO, FL


Adaptive test event – Termination
What triggers the end of the test?
• Number of items
• Error threshold
• Proctor termination

What is reported to the student at the end?

High achiever getting out of the pool
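The first two triggers (item count and error threshold) amount to a one-line stopping rule; the thresholds in this sketch are illustrative, not recommendations:

```python
def should_stop(n_items_given, current_se, max_items=40, se_target=0.30):
    """End the test event at the item budget or once the error threshold is met."""
    return n_items_given >= max_items or current_se <= se_target
```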

Page 15: NATIONAL CONFERENCE ON STUDENT ASSESSMENT JUNE 22, 2011 ORLANDO, FL


How do I know it’s a good test?
• Classical reliability estimates depend on correlation among items. In CAT, inter-item correlation is low. This is an illustration of local independence.
• In general, CATs use the Marginal Reliability Coefficient (Samejima, 1977, 1994). This is based on analysis of the test information function over all values of theta.
• In evaluating tests, it can be interpreted like coefficient alpha.
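The slide gives only the citation, so as an assumption about the intended quantity, one standard way to write a marginal reliability coefficient is the theta variance net of the conditional error variance averaged over the theta distribution:

```latex
% One common formulation of marginal reliability (an assumption, not the deck's notation):
% \sigma^2_\theta = population variance of theta;
% \overline{\sigma^2_E} = conditional error variance 1/I(\theta) averaged over g(\theta).
\bar{\rho} = \frac{\sigma^2_\theta - \overline{\sigma^2_E}}{\sigma^2_\theta},
\qquad
\overline{\sigma^2_E} = \int \frac{1}{I(\theta)}\, g(\theta)\, d\theta
```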

Page 16: NATIONAL CONFERENCE ON STUDENT ASSESSMENT JUNE 22, 2011 ORLANDO, FL


Simulation is your friend
• Using the actual pool, test structure, and algorithm, simulate student responses at interesting levels of theta.
• Compare the test’s estimated thetas with true thetas.
  – Bias: average difference
  – Fit: root mean squared error

How do I know it’s a good test before giving it to zillions of students?
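A minimal end-to-end simulation in the spirit of this slide: a toy fixed-length CAT (2PL pool, maximum-information selection, grid-search scoring) run at several true theta levels, then bias and RMSE of the estimates. The pool, test length, and replication counts are all hypothetical, not the presenter’s actual design.

```python
# Toy CAT simulation: recover bias and RMSE of estimated vs. true theta.
import numpy as np

rng = np.random.default_rng(1)
a = rng.uniform(0.7, 1.8, 300)               # pool discriminations (simulated)
b = rng.normal(0.0, 1.0, 300)                # pool difficulties (simulated)

def prob(theta, a, b):
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def simulate_cat(true_theta, n_items=25):
    theta_hat, used, responses = 0.0, [], []
    grid = np.linspace(-4, 4, 401)
    for _ in range(n_items):
        p_hat = prob(theta_hat, a, b)
        info = a**2 * p_hat * (1 - p_hat)
        info[used] = -np.inf                              # never reuse an item
        j = int(np.argmax(info))                          # most informative unused item
        used.append(j)
        responses.append(rng.random() < prob(true_theta, a[j], b[j]))
        r = np.array(responses)
        p = prob(grid[:, None], a[used], b[used])         # (grid points, items given)
        loglik = (r * np.log(p) + (1 - r) * np.log(1 - p)).sum(axis=1)
        theta_hat = grid[np.argmax(loglik)]               # ML estimate so far
    return theta_hat

true_thetas = np.repeat(np.linspace(-2, 2, 9), 50)        # 50 simulees per level
estimates = np.array([simulate_cat(t) for t in true_thetas])
print("bias:", round(float((estimates - true_thetas).mean()), 3))
print("RMSE:", round(float(np.sqrt(((estimates - true_thetas) ** 2).mean())), 3))
```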

Page 17: NATIONAL CONFERENCE ON STUDENT ASSESSMENT JUNE 22, 2011 ORLANDO, FL


CAT depends on a calibrated bank
• When items are used operationally, responses are gathered from examinees for whom the item is most informative (i.e., ability and difficulty are close).
• Variance is low, so correlational indicators are not appropriate.
• P-values are around .5.

Page 18: NATIONAL CONFERENCE ON STUDENT ASSESSMENT JUNE 22, 2011 ORLANDO, FL


Evaluating item technical quality
• Calibration depends on a common-person link to the scale
• Expose items to a representative sample
• The trick is to get informative responses

Page 19: NATIONAL CONFERENCE ON STUDENT ASSESSMENT JUNE 22, 2011 ORLANDO, FL


Evaluating item technical quality
• In calibration, the process is to find difficulty from the responses of examinees with known abilities.
• Look at a vector of p-values across the range of theta.
• Evaluate the relationship between observed and expected p-values for your IRT model; you may use a chi-square statistic or the correlation of observed with expected p-values.
• What value of difficulty maximizes this relationship? (A minimal sketch follows.)
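A minimal sketch of this idea using the chi-square option named on the slide: bin examinees with known thetas, compute observed p-values, and grid-search the difficulty whose 1PL expected p-values fit them best (best fit = smallest chi-square). Operational calibration software is far more sophisticated; the simulated data here are hypothetical.

```python
# Find the item difficulty whose model-expected p-values best match the
# observed p-values across bins of examinees with known theta.
import numpy as np

def calibrate_difficulty(thetas, responses, n_bins=10):
    edges = np.quantile(thetas, np.linspace(0, 1, n_bins + 1))
    bins = np.clip(np.digitize(thetas, edges[1:-1]), 0, n_bins - 1)
    n_k = np.array([(bins == k).sum() for k in range(n_bins)])
    obs_p = np.array([responses[bins == k].mean() for k in range(n_bins)])
    mid = np.array([thetas[bins == k].mean() for k in range(n_bins)])

    best_b, best_chisq = None, np.inf
    for b in np.linspace(-3, 3, 121):                    # candidate difficulties
        exp_p = 1.0 / (1.0 + np.exp(-(mid - b)))         # 1PL expected p-values
        chisq = (n_k * (obs_p - exp_p) ** 2 / (exp_p * (1 - exp_p))).sum()
        if chisq < best_chisq:                           # smaller chi-square = better fit
            best_b, best_chisq = b, chisq
    return best_b

# Simulated check: 2,000 examinees with known thetas answering an item with b = 0.5
rng = np.random.default_rng(2)
thetas = rng.normal(0.0, 1.0, 2000)
responses = rng.random(2000) < 1.0 / (1.0 + np.exp(-(thetas - 0.5)))
print(calibrate_difficulty(thetas, responses))
```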

Page 20: NATIONAL CONFERENCE ON STUDENT ASSESSMENT JUNE 22, 2011 ORLANDO, FL

Ask lots of questions. Keep pestering until understanding dawns.

Thank you for your attention! Questions, comments?
Contact: [email protected]
