
Page 1: NATIONAL CONFERENCE ON STUDENT ASSESSMENT JUNE 22, 2011 ORLANDO, FL

NATIONAL CONFERENCE ON STUDENT ASSESSMENT JUNE 22, 2011, ORLANDO, FL

Page 2: NATIONAL CONFERENCE ON STUDENT ASSESSMENT JUNE 22, 2011 ORLANDO, FL


What is different about the adaptive context?

How do you conceptualize adaptive assessments?

How do you make the transition from fixed form thinking?

How can you evaluate the quality of these tests?

Page 3: NATIONAL CONFERENCE ON STUDENT ASSESSMENT JUNE 22, 2011 ORLANDO, FL

In the fixed form world…
• Test Blueprint + items = Test Form = Student Test Event
• Percent correct is an indicator of difficulty
• Commonly accepted criteria for acceptance

Page 4: NATIONAL CONFERENCE ON STUDENT ASSESSMENT JUNE 22, 2011 ORLANDO, FL

In the adaptive context…
• The Test Blueprint is a design for the student test event
• Item pool + test structure + algorithm determine each test event
• Variable linking block (all items)
• P-values close to .5
• Metrics not as well established

Page 5: NATIONAL CONFERENCE ON STUDENT ASSESSMENT JUNE 22, 2011 ORLANDO, FL

Everything supports the test event


Page 6: NATIONAL CONFERENCE ON STUDENT ASSESSMENT JUNE 22, 2011 ORLANDO, FL

What’s going on here?
• You are moving from the concept of a population responding to a form into the realm of a person responding to an individual item.
• Indicators based on sets of people responding to sets of items may be uninformative.
• The scale representing the latent trait assumes greater importance.

Page 7: NATIONAL CONFERENCE ON STUDENT ASSESSMENT JUNE 22, 2011 ORLANDO, FL

Move from population-based thinking to responses to items
• Forms are not linked to one another. The pool consists of items linked to the scale. Scores from non-parallel tests are expressed and interpreted on the scale.
• Percent correct is not important in assessing ability. The test event establishes the difficulty of the items a student is getting right about half the time.
• The goal of the test session is to solve for theta (use the IRT equation with your favorite number of parameters; see the sketch below).
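A minimal sketch of what “solve for theta” looks like in code, assuming a 2PL model and hypothetical item parameters (a Rasch or 3PL model would slot in the same way):

```python
# Minimal sketch: maximum-likelihood theta under a 2PL model, by grid search.
# The item parameters and responses below are hypothetical.
import numpy as np

def p_correct(theta, a, b):
    """2PL probability of a correct response."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def estimate_theta(responses, a, b, grid=np.linspace(-4, 4, 801)):
    """Return the grid value of theta that maximizes the response likelihood."""
    p = p_correct(grid[:, None], a, b)          # shape (grid points, items)
    loglik = (responses * np.log(p) + (1 - responses) * np.log(1 - p)).sum(axis=1)
    return grid[np.argmax(loglik)]

a = np.array([1.2, 0.8, 1.5])     # discriminations (hypothetical)
b = np.array([-0.5, 0.3, 1.0])    # difficulties (hypothetical)
responses = np.array([1, 1, 0])   # right, right, wrong
print(estimate_theta(responses, a, b))
```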

Page 8: NATIONAL CONFERENCE ON STUDENT ASSESSMENT JUNE 22, 2011 ORLANDO, FL

Start with the Test Blueprint
• What do you want every student to get?
  – Content: categories and proportions
  – Cognitive characteristics
  – Item types
• How many items in each test event?
• What are you going to report? For individuals? For groups?
  – Overall scores
  – Sub-scores
  – Achievement category
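For illustration only, a blueprint like the one described above can be written down as structured data before any algorithm work starts; every category, proportion, and count here is hypothetical:

```python
# Hypothetical blueprint: content proportions, item counts, and reports
# that every test event should satisfy.
blueprint = {
    "items_per_event": 40,
    "content": {                  # category -> proportion of the test event
        "number_sense": 0.30,
        "algebra": 0.30,
        "geometry": 0.20,
        "data_analysis": 0.20,
    },
    "item_types": {"multiple_choice": 0.90, "constructed_response": 0.10},
    "reports": ["overall_score", "subscores", "achievement_category"],
}
```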

Page 9: NATIONAL CONFERENCE ON STUDENT ASSESSMENT JUNE 22, 2011 ORLANDO, FL

How do you evaluate pool adequacy?
• Reckase – p-optimal pool evaluation. Analysis of “bins.” Satisfy some proportion of a fully informative pool.
• It’s unrealistic to expect that every value of theta will have a maximally informative item. This method specifies a degree of optimality.
• The p-optimal method can be used to evaluate existing pools or to specify pool design (a minimal bin-counting sketch follows).
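A minimal sketch of the bin-counting idea, not Reckase’s full p-optimal procedure; the simulated pool, bin edges, and per-bin targets are all hypothetical:

```python
# Count pool items per difficulty bin and compare to a target count per bin.
import numpy as np

def bin_coverage(difficulties, bin_edges, target_per_bin):
    """Fraction of the per-bin target met by the pool in each bin."""
    counts, _ = np.histogram(difficulties, bins=bin_edges)
    return counts, np.minimum(counts, target_per_bin) / target_per_bin

b = np.random.default_rng(0).normal(0.0, 1.0, 300)   # simulated pool difficulties
edges = np.linspace(-3, 3, 13)                        # twelve half-logit bins
counts, coverage = bin_coverage(b, edges, target_per_bin=20)
print(counts, coverage.round(2), "mean coverage:", coverage.mean().round(2))
```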

Page 10: NATIONAL CONFERENCE ON STUDENT ASSESSMENT JUNE 22, 2011 ORLANDO, FL

How do you evaluate pool adequacy?
• Veldkamp & van der Linden – the shadow-test method:
  1. At every point in the test, assemble a shadow test that meets all constraints and has maximum information at the current ability estimate.
  2. Administer the item in the shadow test with maximum information.
  3. Update the ability estimate.
  4. Return all unused items to the pool.
  5. Adjust the constraints to account for the attributes of the item administered.
  6. Repeat Steps 1–5 until the end of the test.

Page 11: NATIONAL CONFERENCE ON STUDENT ASSESSMENT JUNE 22, 2011 ORLANDO, FL

Adaptive Test Design – Algorithm
• How will you guarantee that each student gets the material in your test design?
  – Item selection, scoring, domain sampling
• How will you guarantee reliable scores and categories?
  – Overall scores
  – Sub-scores
  – Achievement category
• How do you control for item exposure? (One common approach is sketched below.)
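The deck doesn’t name an exposure-control method, so here is a minimal sketch of one common idea, “randomesque” selection: pick at random from the k most informative unused items rather than always administering the single best one. Item parameters are hypothetical.

```python
# Randomesque exposure control: choose randomly among the top-k most
# informative unused items (2PL information).
import numpy as np

def item_information(theta, a, b):
    p = 1.0 / (1.0 + np.exp(-a * (theta - b)))
    return a**2 * p * (1 - p)

def select_item(theta, a, b, administered, k=5, rng=None):
    rng = rng or np.random.default_rng()
    info = item_information(theta, a, b)
    info[list(administered)] = -np.inf       # never repeat an item
    top_k = np.argsort(info)[-k:]            # indices of the k most informative items
    return int(rng.choice(top_k))
```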

Page 12: NATIONAL CONFERENCE ON STUDENT ASSESSMENT JUNE 22, 2011 ORLANDO, FL


Adaptive test event – Start

Assumption: you have a calibrated item pool that supports your test purpose

What do you need to know about the examinee?

How will you choose the initial item?

Jumping into the item pool

Page 13: NATIONAL CONFERENCE ON STUDENT ASSESSMENT JUNE 22, 2011 ORLANDO, FL


Adaptive test event – Finding Theta
Assumption: you have a response to the initial item

How do you estimate ability?

How do you estimate error?

How do you choose the next item?

How do you satisfy your test event design?

Progressing through the item pool
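On the “How do you estimate error?” question: a minimal sketch, assuming 2PL items, of the conditional standard error taken from the test information function at the current estimate:

```python
# Standard error of measurement at theta_hat = 1 / sqrt(test information).
import numpy as np

def standard_error(theta_hat, a, b):
    p = 1.0 / (1.0 + np.exp(-a * (theta_hat - b)))
    test_info = (a**2 * p * (1 - p)).sum()   # sum of item information so far
    return 1.0 / np.sqrt(test_info)          # shrinks as more informative items are given
```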

Page 14: NATIONAL CONFERENCE ON STUDENT ASSESSMENT JUNE 22, 2011 ORLANDO, FL


Adaptive test event – Termination
What triggers the end of the test?
• Number of items
• Error threshold
• Proctor termination

What is reported to the student at the end?

High achiever getting out of the pool
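The first two triggers (item count and error threshold) amount to a one-line stopping rule; the thresholds in this sketch are illustrative, not recommendations:

```python
def should_stop(n_items_given, current_se, max_items=40, se_target=0.30):
    """End the test event at the item budget or once the error threshold is met."""
    return n_items_given >= max_items or current_se <= se_target
```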

Page 15: NATIONAL CONFERENCE ON STUDENT ASSESSMENT JUNE 22, 2011 ORLANDO, FL


How do I know it’s a good test?
• Classical reliability estimates depend on correlation among items. In CAT, inter-item correlation is low. This is an illustration of local independence.
• In general, CATs use the Marginal Reliability Coefficient (Samejima, 1977, 1994). This is based on analysis of the test information function over all values of theta.
• In evaluating tests, it can be interpreted like coefficient alpha.
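The slide gives only the citation, so as an assumption about the intended quantity, one standard way to write a marginal reliability coefficient is the theta variance net of the conditional error variance averaged over the theta distribution:

```latex
% One common formulation of marginal reliability (an assumption, not the deck's notation):
% \sigma^2_\theta = population variance of theta;
% \overline{\sigma^2_E} = conditional error variance 1/I(\theta) averaged over g(\theta).
\bar{\rho} = \frac{\sigma^2_\theta - \overline{\sigma^2_E}}{\sigma^2_\theta},
\qquad
\overline{\sigma^2_E} = \int \frac{1}{I(\theta)}\, g(\theta)\, d\theta
```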

Page 16: NATIONAL CONFERENCE ON STUDENT ASSESSMENT JUNE 22, 2011 ORLANDO, FL


Simulation is your friend
• Using the actual pool, test structure, and algorithm, simulate student responses at interesting levels of theta.
• Compare the test’s estimated thetas with true thetas.
  – Bias: average difference
  – Fit: root mean squared error

How do I know it’s a good test before giving it to zillions of students?
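A minimal end-to-end simulation in the spirit of this slide: a toy fixed-length CAT (2PL pool, maximum-information selection, grid-search scoring) run at several true theta levels, then bias and RMSE of the estimates. The pool, test length, and replication counts are all hypothetical, not the presenter’s actual design.

```python
# Toy CAT simulation: recover bias and RMSE of estimated vs. true theta.
import numpy as np

rng = np.random.default_rng(1)
a = rng.uniform(0.7, 1.8, 300)               # pool discriminations (simulated)
b = rng.normal(0.0, 1.0, 300)                # pool difficulties (simulated)

def prob(theta, a, b):
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def simulate_cat(true_theta, n_items=25):
    theta_hat, used, responses = 0.0, [], []
    grid = np.linspace(-4, 4, 401)
    for _ in range(n_items):
        p_hat = prob(theta_hat, a, b)
        info = a**2 * p_hat * (1 - p_hat)
        info[used] = -np.inf                              # never reuse an item
        j = int(np.argmax(info))                          # most informative unused item
        used.append(j)
        responses.append(rng.random() < prob(true_theta, a[j], b[j]))
        r = np.array(responses)
        p = prob(grid[:, None], a[used], b[used])         # (grid points, items given)
        loglik = (r * np.log(p) + (1 - r) * np.log(1 - p)).sum(axis=1)
        theta_hat = grid[np.argmax(loglik)]               # ML estimate so far
    return theta_hat

true_thetas = np.repeat(np.linspace(-2, 2, 9), 50)        # 50 simulees per level
estimates = np.array([simulate_cat(t) for t in true_thetas])
print("bias:", round(float((estimates - true_thetas).mean()), 3))
print("RMSE:", round(float(np.sqrt(((estimates - true_thetas) ** 2).mean())), 3))
```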

Page 17: NATIONAL CONFERENCE ON STUDENT ASSESSMENT JUNE 22, 2011 ORLANDO, FL


CAT depends on a calibrated bank
• When items are used operationally, responses are gathered from examinees for whom the item is most informative (i.e., ability and difficulty are close).
• Variance is low, so correlational indicators are not appropriate.
• P-values are around .5.

Page 18: NATIONAL CONFERENCE ON STUDENT ASSESSMENT JUNE 22, 2011 ORLANDO, FL


Evaluating item technical quality
• Calibration depends on a common-person link to the scale
• Expose items to a representative sample
• The trick is to get informative responses

Page 19: NATIONAL CONFERENCE ON STUDENT ASSESSMENT JUNE 22, 2011 ORLANDO, FL


Evaluating item technical quality
• In calibration, the process is to find difficulty from the responses of examinees with known abilities.
• Look at a vector of p-values across the range of theta.
• Evaluate the relationship between observed and expected p-values for your IRT model; you may use a chi-square statistic or the correlation of observed with expected p-values.
• What value of difficulty maximizes this relationship? (A minimal sketch follows.)
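A minimal sketch of this idea using the chi-square option named on the slide: bin examinees with known thetas, compute observed p-values, and grid-search the difficulty whose 1PL expected p-values fit them best (best fit = smallest chi-square). Operational calibration software is far more sophisticated; the simulated data here are hypothetical.

```python
# Find the item difficulty whose model-expected p-values best match the
# observed p-values across bins of examinees with known theta.
import numpy as np

def calibrate_difficulty(thetas, responses, n_bins=10):
    edges = np.quantile(thetas, np.linspace(0, 1, n_bins + 1))
    bins = np.clip(np.digitize(thetas, edges[1:-1]), 0, n_bins - 1)
    n_k = np.array([(bins == k).sum() for k in range(n_bins)])
    obs_p = np.array([responses[bins == k].mean() for k in range(n_bins)])
    mid = np.array([thetas[bins == k].mean() for k in range(n_bins)])

    best_b, best_chisq = None, np.inf
    for b in np.linspace(-3, 3, 121):                    # candidate difficulties
        exp_p = 1.0 / (1.0 + np.exp(-(mid - b)))         # 1PL expected p-values
        chisq = (n_k * (obs_p - exp_p) ** 2 / (exp_p * (1 - exp_p))).sum()
        if chisq < best_chisq:                           # smaller chi-square = better fit
            best_b, best_chisq = b, chisq
    return best_b

# Simulated check: 2,000 examinees with known thetas answering an item with b = 0.5
rng = np.random.default_rng(2)
thetas = rng.normal(0.0, 1.0, 2000)
responses = rng.random(2000) < 1.0 / (1.0 + np.exp(-(thetas - 0.5)))
print(calibrate_difficulty(thetas, responses))
```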

Page 20: NATIONAL CONFERENCE ON STUDENT ASSESSMENT JUNE 22, 2011 ORLANDO, FL

Ask lots of questions. Keep pestering until understanding dawns.

Thank you for your attention! Questions, comments?
Contact: [email protected]
