Copyright © 2014 by Educational Testing Service. All rights reserved.
Practical Issues and Challenges in
Operationalizing K-12 CAT
Presentation at the National Conference on Student Assessment (NCSA) 2015, San Diego, CA
Yi Du, Ph.D.
Educational Testing Services
Introduction
• Using computerized adaptive tests (CAT) for State standards-based assessments has become quite attractive.
• The theoretical foundation of CAT, such as the five basic components of a CAT procedure, has been well researched (Weiss, 1984).
• Implementation questions for CAT were well studied (Way, 2005; Davey, 2011).
• However, issues arising from operational CAT practice remain.
Issues Addressed from Operational CAT Practices
– How to measure a student accurately when he or she did not complete a CAT test?
– Does the selection mechanism make use of an exposure control procedure?
– How to ensure the item selection mechanism meets all required test specifications?
– How to replicate students' CAT results at the State end?
– How to ensure the accuracy of the CAT results?
– How to ensure the item selection mechanism provides items at an appropriate difficulty level, frustrating neither low- nor high-achieving students?
– How to communicate CAT results to test users?
– Does CAT allow students to skip items during exams?
How to Ensure the Quality of CAT Results?
• Rigorous quality control (QC) procedures and post hoc analyses are well implemented throughout the entire assessment process for paper-and-pencil tests (PPT) in most states.
• Comprehensive QC procedures and post hoc analyses for online tests, especially for CAT, may not be well established yet:
– QC results from CAT may not be as straightforward as those from PPT.
– Most technical characteristics in CAT are not visible to test users.
– Those invisible parts have significant impact on the quality of CAT results.
Objectives of the Presentation
• Focus on the issues and challenges arising from operational CAT practices
• Provide thoughts for practitioners on operational CAT
• Discuss how post-hoc analysis, as a tool, can help us better understand the issues
• Provide examples of how to use post-hoc analysis to examine and ensure the quality of CAT scores
Issues of the Presentation
Major practical issues of concern related to item selection and score estimation:
– Test specifications (blueprints)
– Item performance
– Item exposure and overlap
– Attemptedness of a test
– Incompletion of tests
– Accuracy of the final results
Test Specifications (Blueprints)
• Test specifications
 – Ensure a CAT system accurately assesses a full range of standards.
 – Should be determined, prior to administering the CAT, through simulations based on the item pools.
• In CAT, every student has a unique test form; this form is built on the fly.
• Components for specifications: item pool, item selection algorithm, simulations.
Test Specifications
• Is it necessary that every student meets the test specifications in CAT?
• Did every student actually meet the test specifications in a CAT?
What analysis can help answer these questions?
• Conventional analysis
• Post-hoc real data simulation
Examining Whether the Test Specifications Were Met
Specifications vs. Actually Tested

Domain     Content Category       Items  Total by Domain  % Met  % Less  % More
Reading    Vocabulary             7-8    21-24            82%    1%      17%
Reading    Literary               7-8                     89%    0%      11%
Reading    Informational          7-8                     96%    1%      3%
Writing    Organization/Purpose   5      10               100%   0%      0%
Writing    Evidence/Elaboration                           100%   0%      0%
Writing    Conventions            5                       100%   0%      0%
Listening  Listening              10     10               95%    0%      5%
Overall                           41-44                   100%   0%      0%
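A per-student blueprint check like the one summarized above can be scripted in a few lines. A minimal sketch, assuming hypothetical category names and item-count ranges (not the operational blueprint):

```python
# Sketch: flag students whose administered items fall outside the blueprint.
# Category names and (min, max) ranges below are illustrative assumptions.
from collections import Counter

BLUEPRINT = {
    "Vocabulary": (7, 8),
    "Literary": (7, 8),
    "Informational": (7, 8),
}

def check_blueprint(administered):
    """administered: list of content-category labels for one student's items.
    Returns {category: 'met' | 'less' | 'more'}."""
    counts = Counter(administered)
    status = {}
    for cat, (lo, hi) in BLUEPRINT.items():
        n = counts.get(cat, 0)
        status[cat] = "met" if lo <= n <= hi else ("less" if n < lo else "more")
    return status

# Example: one student saw 7 Vocabulary, 9 Literary, 6 Informational items.
items = ["Vocabulary"] * 7 + ["Literary"] * 9 + ["Informational"] * 6
print(check_blueprint(items))
# {'Vocabulary': 'met', 'Literary': 'more', 'Informational': 'less'}
```

Aggregating these per-student flags across all examinees yields the "% Met / % Less / % More" percentages reported in the table.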
Test Specifications
Advanced approaches to examine the test specifications:
• Post hoc (real data) simulations
• Software
– Open source
• SimulCAT (Kyung Han, 2012)
• FireStar (Choi, 2009)
• SimuMCAT (Lihua Yao, 2011)
• Concerto (David Magis, 2014)
– Commercial
• CATSim (David Weiss)
Attemptedness Status
• Determine whether a student's score should be counted as valid and reported
• What is attemptedness in CAT?
 – Just logging in to the test, or
 – responding to items?
• How many questions must be responded to in order to qualify for a score?
• How many questions must be responded to in order to qualify for a valid score?
 – Attemptedness rules for overall scores and for domain scores?
• How many students were affected by the policy, and is the policy appropriate?
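One way to examine a policy's impact is to classify every student record under the rule and tally each status. A minimal sketch with an assumed threshold (the status labels and the 10-response minimum are illustrative, not an operational rule):

```python
# Sketch: classify students under a hypothetical attemptedness rule.
MIN_RESPONSES_FOR_SCORE = 10  # assumed policy value, not an operational rule

def attemptedness_status(n_responded, logged_in):
    """Classify one student's record under the assumed rule."""
    if not logged_in:
        return "not attempted"
    if n_responded == 0:
        return "logged in only"
    if n_responded < MIN_RESPONSES_FOR_SCORE:
        return "attempted, no valid score"
    return "scored"

# Illustrative records: (number of responses, logged in?)
records = [(0, False), (0, True), (4, True), (25, True)]
statuses = [attemptedness_status(n, li) for n, li in records]
print(statuses)
# ['not attempted', 'logged in only', 'attempted, no valid score', 'scored']
```

Counting the statuses over the full population shows how many students each candidate threshold would affect, which is exactly the evidence needed to judge whether the policy is appropriate.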
Attemptedness Analysis
Score     Completion  Incompletion       Incompletion        NS  Total N
                      with Valid Score   with Lowest Score
Overall   20529       58                 0                   12  20599
Domain 1  20599       0                  0                   0   20599
Domain 2  20524       0                  0                   75  20599
Domain 3  20599       0                  0                   0   20599
Domain 4  20587       0                  0                   12  20599
Attemptedness Status Analysis
      % Incorrect  % Correct (1,2,3,4)  % Omitted  % Not Seen  % NS
ELA3  49.29        50.65                0.01       0.02        0.03
ELA4  48.99        50.35                0.03       0.05        0.10
ELA5  47.96        51.97                0.03       0.02        0.02
Incompletion of A CAT Test
• Tests are considered “complete” if students respond to the minimum number of operational items. Otherwise, the tests are “incomplete.”
• CAT rules:
– Is a student allowed to skip an item in the middle of a CAT?
• If yes, and if a test is considered attempted and scored:
 – Omit: skip items in the middle but complete the test
 – Incomplete: stop in the middle and never complete the test
• Items were presented in several ways:
– Seen and responded to
– Seen and not responded to
– Not presented
• Scoring rules and QC procedures should consider those cases
Incompletion but Valid Score Adjustment
Several approaches exist to score incomplete tests:
• Score the unanswered items as incorrect.
 – The item parameters of all unanswered items are imputed using
• The average value of items in the item pool
• The average value of items a student answered
• The range of items a student answered
• Score the actually completed portion, and adjust the incomplete portion proportionally with a student’s ability estimate.
• It is an ongoing research topic.
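The first approach above (score unanswered items as incorrect, with pool-average parameters imputed for them) can be sketched as follows. The 3PL parameter values, the pool averages, and the grid-search MLE are all illustrative assumptions, not the operational procedure:

```python
# Sketch: score an incomplete test by treating unanswered items as incorrect,
# imputing pool-average 3PL parameters for them, then re-estimating theta by
# maximum likelihood on a grid. All parameter values are illustrative.
import math

def p3pl(theta, a, b, c):
    """3PL probability of a correct response (D = 1.7 scaling)."""
    return c + (1 - c) / (1 + math.exp(-1.7 * a * (theta - b)))

def mle_theta(responses, params, grid=None):
    """responses: list of 0/1; params: list of (a, b, c) tuples."""
    grid = grid or [g / 100 for g in range(-400, 401)]
    def loglik(t):
        ll = 0.0
        for u, (a, b, c) in zip(responses, params):
            p = p3pl(t, a, b, c)
            ll += math.log(p) if u == 1 else math.log(1 - p)
        return ll
    return max(grid, key=loglik)

answered = [1, 1, 0, 1, 0, 1, 1]
answered_params = [(1.0, -0.5, 0.2), (0.9, 0.0, 0.2), (1.2, 0.5, 0.25),
                   (0.8, -1.0, 0.2), (1.1, 1.0, 0.2), (1.0, 0.2, 0.2),
                   (0.7, -0.2, 0.25)]
# Impute assumed pool-average parameters for 3 unanswered items, scored 0.
pool_avg = (1.0, 0.0, 0.2)
responses = answered + [0, 0, 0]
params = answered_params + [pool_avg] * 3
print(round(mle_theta(responses, params), 2))
```

Note that scoring unanswered items as incorrect can only pull the estimate down relative to scoring the completed portion alone, which is why the choice among these approaches is still an open research question.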
Accuracy of Final Scores
– A detailed technical description of the methodology of the provisional and final scoring computation process should be provided by vendors.
– Statistical properties of the final scores, such as test reliability, standard errors of measurement, and test information functions, should be computed and provided by vendors to evaluate bias and precision.
– At the State end, State psychometricians or researchers may need to conduct additional analyses to ensure the quality of scoring.
Scope of Scoring QC
• Final Ability Estimate Procedures
– MLE, Bayesian (EAP or MAP), Inverse TCC
• Overall scores and domain scores
• Theta to scale scores
– Transformation from theta to scale
– Achievement level
– Range
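The theta-to-scale step can be sketched as a linear transformation with clipping to the reporting range, plus an achievement-level lookup. The slope, intercept, range, and cut scores below are hypothetical; operational values would come from the vendor's technical documentation:

```python
# Sketch: theta -> scale score -> achievement level.
# All constants below are illustrative assumptions, not operational values.
SLOPE, INTERCEPT = 40.0, 500.0   # assumed linear scaling constants
SS_MIN, SS_MAX = 300, 700        # assumed reporting (LOSS/HOSS) range
CUTS = [440, 500, 560]           # assumed Level 2/3/4 cut scores

def scale_score(theta):
    """Linear transform, rounded and clipped to the reporting range."""
    ss = round(SLOPE * theta + INTERCEPT)
    return max(SS_MIN, min(SS_MAX, ss))

def achievement_level(ss):
    """Level 1 plus one for each cut score reached."""
    return 1 + sum(ss >= c for c in CUTS)

for t in (-2.0, 0.0, 1.6):
    ss = scale_score(t)
    print(t, ss, achievement_level(ss))
# -2.0 420 1
# 0.0 500 3
# 1.6 564 4
```

Replicating this step independently catches errors in the transformation constants, the range clipping, and the cut-score lookup, all of which are invisible to test users.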
Conventional Analysis
Percentage of Students in Achievement Level

      Level 1  Level 2  Level 3  Level 4  Top Two
EL3   38.60    26.72    18.90    15.78    34.68
EL4   43.44    21.89    19.81    14.86    34.67
EL5   35.90    22.57    26.93    14.60    41.53
EL6   31.32    29.54    27.64    11.50    39.14
EL7   31.93    27.11    30.75    10.22    40.97
EL8   26.18    30.98    32.74    10.10    42.84
MA3   37.72    27.54    23.97    10.77    34.74
MA4   32.96    36.18    20.46    10.41    30.87
MA5   45.11    29.00    13.83    12.07    25.90
MA6   40.15    32.00    16.40    11.44    27.84
MA7   40.29    30.45    17.78    11.48    29.26
MA8   43.47    27.25    15.70    13.58    29.28
Conventional Analysis
Student Mean and Standard Deviation across Grades
OP2015 OP 2014
Test Theta_mean Theta_std Theta_mean Theta_std
EL3 -1.30 1.01 -1.23 1.05
EL4 -0.86 1.05 -0.74 1.11
EL5 -0.32 1.06 -0.31 1.10
EL6 -0.05 1.06 -0.05 1.11
EL7 0.22 1.08 0.12 1.14
EL8 0.47 1.06 0.39 1.14
MA3 -1.39 0.98 -1.27 0.96
MA4 -0.85 0.99 -0.70 1.00
MA5 -0.55 1.10 -0.33 1.07
MA6 -0.28 1.25 -0.08 1.18
MA7 -0.09 1.32 0.03 1.34
MA8 0.13 1.41 0.28 1.33
Accuracy of Student Testing Results
• The error variance provides estimates of precision
SEM Range           N      %
Grade 3
 SEM >= 2.5         0      0
 1.5 <= SEM < 2.5   1      0
 0.5 <= SEM < 1.5   281    0.57
 0.3 <= SEM < 0.5   11599  23.57
 0 <= SEM < 0.3     37330  75.86
 Total              49211  100.00
Grade 4
 SEM >= 2.5         0      0
 1.5 <= SEM < 2.5   0      0
 0.5 <= SEM < 1.5   169    0.58
 0.3 <= SEM < 0.5   13196  45.18
 0 <= SEM < 0.3     15842  54.24
 Total              29207  100.00
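Conditional SEMs like those tabulated above can be recomputed independently from the administered items' parameters, since SEM(theta) = 1/sqrt(I(theta)). A minimal 3PL sketch with illustrative item parameters:

```python
# Sketch: conditional SEM from 3PL test information, SEM = 1/sqrt(I(theta)).
# Item parameters below are illustrative, not from an operational pool.
import math

def info_3pl(theta, a, b, c):
    """3PL item information at theta (D = 1.7 scaling)."""
    p = c + (1 - c) / (1 + math.exp(-1.7 * a * (theta - b)))
    q = 1 - p
    return (1.7 * a) ** 2 * (q / p) * ((p - c) / (1 - c)) ** 2

def sem(theta, items):
    """items: list of (a, b, c) tuples administered to one student."""
    return 1 / math.sqrt(sum(info_3pl(theta, *it) for it in items))

items = [(1.0, -0.5, 0.2), (1.2, 0.0, 0.2), (0.8, 0.5, 0.25)] * 10
print(round(sem(0.0, items), 3))
```

Running this per student at the final theta estimate and binning the results reproduces the SEM distribution table, which is a direct check on the vendor's reported precision.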
Advanced Analysis
• Check the ability estimation procedures: write a program to compute MLE, EAP, MAP, or inverse-TCC estimates, based on the IRT model used.
• Replicate the vendor's results.
• Track students' response paths to examine whether the item selection algorithm is effective (measurement precision, security, content balance, maximum item usage).
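A sketch of replicating an EAP estimate by quadrature under a standard-normal prior follows. The 2PL parameters and the evenly spaced nodes are illustrative simplifications; operational scoring may use Gauss-Hermite quadrature, a 3PL model, or a different prior:

```python
# Sketch: EAP ability estimate by numerical quadrature, standard-normal prior.
# Item parameters are illustrative 2PL values (D = 1.7 scaling).
import math

def p2pl(theta, a, b):
    return 1 / (1 + math.exp(-1.7 * a * (theta - b)))

def eap(responses, params, n_quad=81):
    """responses: list of 0/1; params: list of (a, b) tuples."""
    nodes = [-4 + 8 * k / (n_quad - 1) for k in range(n_quad)]
    num = den = 0.0
    for t in nodes:
        w = math.exp(-0.5 * t * t)  # unnormalized N(0,1) prior weight
        like = 1.0
        for u, (a, b) in zip(responses, params):
            p = p2pl(t, a, b)
            like *= p if u == 1 else 1 - p
        num += t * w * like
        den += w * like
    return num / den  # posterior mean of theta

params = [(1.0, -1.0), (1.1, 0.0), (0.9, 1.0), (1.2, 0.5)]
print(round(eap([1, 1, 0, 0], params), 3))
```

Feeding the vendor's item parameters and each student's actual response string through such a program, then comparing against the reported scores, is the core of the replication check.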
Conventional Item Analysis
ItemID  a-param  b-param   c-param  d-params         Score Points  Score 0  Score 1  Score 2  P-value  N Students
Item1   0.7049   -0.06291  0.21                      0,1           13%      87%               0.87     23
Item2   0.58321  1.83536   0.25                      0,1           60%      40%               0.40     3457
Item3   0.65237  0.36911   0.25                      0,1           39%      61%               0.61     3457
Item4   0.3744   1.90488   0.20                      0,1           84%      16%               0.16     25
Item5   0.57374  1.90652            .39802, -.39802  0,1,2         63%      11%      26%      0.31     27
Item6   0.43612  2.75943   0.23                      0,1           89%      11%               0.11     27
Item7   0.16635  0.9321    0.34                      0,1           32%      68%               0.68     19
Item8   0.47194  0.82735   0.25                      0,1           42%      58%               0.58     19
Item9   0.78146  0.96798   0.27                      0,1           53%      47%               0.47     19
Item10  0.36417  1.04496   0.31                      0,1           35%      65%               0.65     1383
Copyright © 2014 by Educational Testing Service. All rights reserved.Copyright © 2014 by Educational Testing Service. All rights reserved.
Item Exposure Control
• Item exposure is an important consideration for test security in the continuous testing environment of CATs.
• High item exposure rates pose a serious threat to test security.
• It is important to check whether item exposure control is implemented for security.
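Exposure rates can be checked directly from administration records. A minimal sketch, assuming each record lists the item IDs one student saw (data and IDs are illustrative):

```python
# Sketch: item exposure rate = (students who saw the item) / (total students).
# Administration records and item IDs below are illustrative.
from collections import Counter

def exposure_rates(administrations, pool):
    """administrations: list of item-ID lists, one per student.
    pool: all item IDs, so never-administered items get rate 0."""
    counts = Counter(i for items in administrations for i in items)
    n = len(administrations)
    return {item: counts.get(item, 0) / n for item in pool}

pool = ["i1", "i2", "i3", "i4"]
admins = [["i1", "i2"], ["i1", "i3"], ["i1", "i2"]]
rates = exposure_rates(admins, pool)
print(rates)  # i1 seen by all 3 students -> 1.0; i4 never seen -> 0.0
```

Flagging items whose rate exceeds a target ceiling (and items with rate 0, which waste pool capacity) gives the State an independent view of whether the exposure control in the selection algorithm is actually working.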
Visual Inspection

Number of items by difficulty (b value) and exposure-count band:

b Value      N Items  >=3000  1000-3000  500-1000  100-500  1-100  0
(-2.5,-2.0]
(-2.0,-1.5]  3                                              3
(-1.5,-1.0]  12                                    1        4      7
(-1.0,-0.5]  48       3       5          8         9        23
(-0.5,0.0]   98       11      13         12        18       42     2
(0.0,0.5]    149      13      18         24        25       63     6
(0.5,1.0]    185      12      14         31        34       86     8
(1.0,1.5]    198      11      16         53        41       67     10
(1.5,2.0]    270      17      19         33        94       74     33
(2.0,2.5]    202      5       9          21        65       69     33
(2.5,3.0]    151      8       4          11        56       52     20
(3.0,3.5]    109      2       6          1         49       38     13
(3.5,4.0]    72       1       1          2         17       44     7
(4.0,4.5]    52               1          2         7        35     7
(4.5,5.0]    21                                    2        1      18
(5.0,5.5]    7                                     1        5      1
(5.5,6.0]    4                                              4
>6
Item Exposure Rate by Item Difficulty Level
[Figure: chart of item exposure rates by item difficulty level]
Item Exposure by Content Domains
Domain   N Items  Freq>=3000  1000<=Freq<3000  500<=Freq<1000  100<=Freq<500  1<=Freq<100  Freq=0
1        499      40          22               15              39             276          107
2        437      9           39               96              242            37           14
3        334      26          28               4               6              264          6
4        311      10          18               84              136            50           13
Overall  1581     85          107              199             423            627          140
Final Comments
• Post hoc analysis can help:
 – A State explore the invisible technical characteristics of a CAT system
 – Fully understand the capabilities and limitations of a CAT system in operational tests
 – Build efficient QC procedures for CAT
• Many options for post-hoc analysis of CAT exist.
Thanks very much for your
Time and Consideration!