Copyright © 2014 by Educational Testing Service. All rights reserved.
Practical Issues and Challenges in
Operationalizing K-12 CAT
Presentation at the National Conference on Student Assessment (NCSA) 2015, San Diego, CA
Yi Du, Ph.D.
Educational Testing Services
Introduction
• Using computerized adaptive tests (CAT) for State standards-based assessments has become quite attractive.
• The theoretical foundation of CAT, such as the five basic components of a CAT procedure, has been well researched (Weiss, 1984).
• Implementation questions for CAT were well studied (Way, 2005; Davey, 2011).
• However, issues arising from operational CAT practice remain.
Issues Addressed from Operational CAT Practices
– How to measure a student accurately when he or she did not complete a CAT test?
– Does the selection mechanism make use of an exposure control procedure?
– How to ensure the item selection mechanism meets all required test specifications?
– How to replicate students' CAT results at the State end?
– How to ensure the accuracy of the CAT results?
– How to ensure the item selection mechanism provides items at an appropriate difficulty level, frustrating neither low- nor high-achieving students?
– How to communicate CAT results to test users?
– Does CAT allow students to skip items during exams?
How to Ensure the Quality of CAT Results?
• Rigorous quality control (QC) procedures and post hoc analyses are well implemented throughout the entire assessment process for paper-and-pencil tests (PPT) in most states.
• Comprehensive QC procedures and post hoc analyses for online tests, especially for CAT, may not be well established yet:
– QC results from CAT may not be as straightforward as those from PPT.
– Most technical characteristics in CAT are not visible to test users.
– Those invisible parts have significant impact on the quality of CAT results.
Objectives of the Presentation
• Focus on the issues and challenges arising from operational CAT practices
• Provide thoughts for practitioners on operational CAT
• Discuss how post-hoc analysis, as a tool, can help us better understand the issues
• Provide examples of how to use post-hoc analysis to examine and ensure the quality of CAT scores
Issues of the Presentation
Major practical issues of concern related to item selection and score estimation:
– Test specifications (blueprints)
– Item performance
– Item exposure and overlap
– Attemptedness of a test
– Incompletion of tests
– Accuracy of the final results
Test Specifications (Blueprints)
• Test specifications
 – Ensure a CAT system accurately assesses a full range of standards.
 – Should be determined, prior to administering the CAT, through simulations based on the item pools.
• In CAT, every student has a unique test form; this form is built on the fly.
• Components for specifications: item pool, item selection algorithm, simulations.
Test Specifications
• Is it necessary that every student meets the test specifications in CAT?
• Did every student actually meet the test specifications in a CAT?
What analysis can help answer these questions?
• Conventional analysis
• Post-hoc real data simulation
Examining Whether the Test Specifications Were Met
Specifications vs. Actually Tested

Domain     Content Category       Items  Total by Domain  % Met  % Less  % More
Reading    Vocabulary             7-8    21-24            82%    1%      17%
Reading    Literary               7-8                     89%    0%      11%
Reading    Informational          7-8                     96%    1%      3%
Writing    Organization/Purpose   5      10               100%   0%      0%
Writing    Evidence/Elaboration                           100%   0%      0%
Writing    Conventions            5                       100%   0%      0%
Listening  Listening              10     10               95%    0%      5%
Overall                           41-44                   100%   0%      0%
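A per-student blueprint check like the one summarized above can be scripted in a few lines. A minimal sketch, assuming hypothetical category names and item-count ranges (not the operational blueprint):

```python
# Sketch: flag students whose administered items fall outside the blueprint.
# Category names and (min, max) ranges below are illustrative assumptions.
from collections import Counter

BLUEPRINT = {
    "Vocabulary": (7, 8),
    "Literary": (7, 8),
    "Informational": (7, 8),
}

def check_blueprint(administered):
    """administered: list of content-category labels for one student's items.
    Returns {category: 'met' | 'less' | 'more'}."""
    counts = Counter(administered)
    status = {}
    for cat, (lo, hi) in BLUEPRINT.items():
        n = counts.get(cat, 0)
        status[cat] = "met" if lo <= n <= hi else ("less" if n < lo else "more")
    return status

# Example: one student saw 7 Vocabulary, 9 Literary, 6 Informational items.
items = ["Vocabulary"] * 7 + ["Literary"] * 9 + ["Informational"] * 6
print(check_blueprint(items))
# {'Vocabulary': 'met', 'Literary': 'more', 'Informational': 'less'}
```

Aggregating these per-student flags across all examinees yields the "% Met / % Less / % More" percentages reported in the table.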
Test Specifications
Advanced approaches to examine the test specifications:
• Post hoc (real data) simulations
• Software
– Open source
• SimulCAT (Kyung Han, 2012)
• FireStar (Choi, 2009)
• SimuMCAT (Lihua Yao, 2011)
• Concerto (David Magis, 2014)
– Commercial
• CATSim (David Weiss)
Attemptedness Status
• Determine whether a student's score should be counted as valid and reported
• What is attemptedness in CAT?
 – Just logging in to the test, or
 – responding to items?
• How many questions must be responded to in order to qualify for a score?
• How many questions must be responded to in order to qualify for a valid score?
 – Attemptedness rules for overall scores and for domain scores?
• How many students were affected by the policy, and is the policy appropriate?
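One way to examine a policy's impact is to classify every student record under the rule and tally each status. A minimal sketch with an assumed threshold (the status labels and the 10-response minimum are illustrative, not an operational rule):

```python
# Sketch: classify students under a hypothetical attemptedness rule.
MIN_RESPONSES_FOR_SCORE = 10  # assumed policy value, not an operational rule

def attemptedness_status(n_responded, logged_in):
    """Classify one student's record under the assumed rule."""
    if not logged_in:
        return "not attempted"
    if n_responded == 0:
        return "logged in only"
    if n_responded < MIN_RESPONSES_FOR_SCORE:
        return "attempted, no valid score"
    return "scored"

# Illustrative records: (number of responses, logged in?)
records = [(0, False), (0, True), (4, True), (25, True)]
statuses = [attemptedness_status(n, li) for n, li in records]
print(statuses)
# ['not attempted', 'logged in only', 'attempted, no valid score', 'scored']
```

Counting the statuses over the full population shows how many students each candidate threshold would affect, which is exactly the evidence needed to judge whether the policy is appropriate.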
Attemptedness Analysis
Score     Completion  Incompletion       Incompletion        NS  Total N
                      with Valid Score   with Lowest Score
Overall   20529       58                 0                   12  20599
Domain 1  20599       0                  0                   0   20599
Domain 2  20524       0                  0                   75  20599
Domain 3  20599       0                  0                   0   20599
Domain 4  20587       0                  0                   12  20599
Attemptedness Status Analysis
      % Incorrect  % Correct (1,2,3,4)  % Omitted  % Not Seen  % NS
ELA3  49.29        50.65                0.01       0.02        0.03
ELA4  48.99        50.35                0.03       0.05        0.10
ELA5  47.96        51.97                0.03       0.02        0.02
Incompletion of A CAT Test
• Tests are considered “complete” if students respond to the minimum number of operational items. Otherwise, the tests are “incomplete.”
• CAT rules:
– Is a student allowed to skip an item in the middle of a CAT?
• If yes, and if a test is considered attempted and scored:
 – Omit: skip items in the middle but complete the test
 – Incomplete: stop in the middle and never complete the test
• Items were presented in several ways:
– Seen and responded to
– Seen and not responded to
– Not presented
• Scoring rules and QC procedures should consider those cases
Incompletion but Valid Score Adjustment
Several approaches exist to score incomplete tests:
• Score the unanswered items as incorrect.
 – The item parameters of all unanswered items are imputed using
• The average value of items in the item pool
• The average value of items a student answered
• The range of items a student answered
• Score the actually completed portion, and adjust the incomplete portion proportionally with a student’s ability estimate.
• It is an ongoing research topic.
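The first approach above (score unanswered items as incorrect, with pool-average parameters imputed for them) can be sketched as follows. The 3PL parameter values, the pool averages, and the grid-search MLE are all illustrative assumptions, not the operational procedure:

```python
# Sketch: score an incomplete test by treating unanswered items as incorrect,
# imputing pool-average 3PL parameters for them, then re-estimating theta by
# maximum likelihood on a grid. All parameter values are illustrative.
import math

def p3pl(theta, a, b, c):
    """3PL probability of a correct response (D = 1.7 scaling)."""
    return c + (1 - c) / (1 + math.exp(-1.7 * a * (theta - b)))

def mle_theta(responses, params, grid=None):
    """responses: list of 0/1; params: list of (a, b, c) tuples."""
    grid = grid or [g / 100 for g in range(-400, 401)]
    def loglik(t):
        ll = 0.0
        for u, (a, b, c) in zip(responses, params):
            p = p3pl(t, a, b, c)
            ll += math.log(p) if u == 1 else math.log(1 - p)
        return ll
    return max(grid, key=loglik)

answered = [1, 1, 0, 1, 0, 1, 1]
answered_params = [(1.0, -0.5, 0.2), (0.9, 0.0, 0.2), (1.2, 0.5, 0.25),
                   (0.8, -1.0, 0.2), (1.1, 1.0, 0.2), (1.0, 0.2, 0.2),
                   (0.7, -0.2, 0.25)]
# Impute assumed pool-average parameters for 3 unanswered items, scored 0.
pool_avg = (1.0, 0.0, 0.2)
responses = answered + [0, 0, 0]
params = answered_params + [pool_avg] * 3
print(round(mle_theta(responses, params), 2))
```

Note that scoring unanswered items as incorrect can only pull the estimate down relative to scoring the completed portion alone, which is why the choice among these approaches is still an open research question.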
Accuracy of Final Scores
– A detailed technical description of the methodology of the provisional and final scoring computation process should be provided by vendors.
– Statistical properties of the final scores, such as test reliability, standard errors of measurement, and test information functions, should be computed and provided by vendors to evaluate bias and precision.
– At the State end, State psychometricians or researchers may need to conduct additional analyses to ensure the quality of scoring.
Scope of Scoring QC
• Final Ability Estimate Procedures
– MLE, Bayesian (EAP or MAP), Inverse TCC
• Overall scores and domain scores
• Theta to scale scores
– Transformation from theta to scale
– Achievement level
– Range
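The theta-to-scale step can be sketched as a linear transformation with clipping to the reporting range, plus an achievement-level lookup. The slope, intercept, range, and cut scores below are hypothetical; operational values would come from the vendor's technical documentation:

```python
# Sketch: theta -> scale score -> achievement level.
# All constants below are illustrative assumptions, not operational values.
SLOPE, INTERCEPT = 40.0, 500.0   # assumed linear scaling constants
SS_MIN, SS_MAX = 300, 700        # assumed reporting (LOSS/HOSS) range
CUTS = [440, 500, 560]           # assumed Level 2/3/4 cut scores

def scale_score(theta):
    """Linear transform, rounded and clipped to the reporting range."""
    ss = round(SLOPE * theta + INTERCEPT)
    return max(SS_MIN, min(SS_MAX, ss))

def achievement_level(ss):
    """Level 1 plus one for each cut score reached."""
    return 1 + sum(ss >= c for c in CUTS)

for t in (-2.0, 0.0, 1.6):
    ss = scale_score(t)
    print(t, ss, achievement_level(ss))
# -2.0 420 1
# 0.0 500 3
# 1.6 564 4
```

Replicating this step independently catches errors in the transformation constants, the range clipping, and the cut-score lookup, all of which are invisible to test users.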
Conventional Analysis
Percentage of Students in Achievement Level

      Level 1  Level 2  Level 3  Level 4  Top Two
EL3   38.60    26.72    18.90    15.78    34.68
EL4   43.44    21.89    19.81    14.86    34.67
EL5   35.90    22.57    26.93    14.60    41.53
EL6   31.32    29.54    27.64    11.50    39.14
EL7   31.93    27.11    30.75    10.22    40.97
EL8   26.18    30.98    32.74    10.10    42.84
MA3   37.72    27.54    23.97    10.77    34.74
MA4   32.96    36.18    20.46    10.41    30.87
MA5   45.11    29.00    13.83    12.07    25.90
MA6   40.15    32.00    16.40    11.44    27.84
MA7   40.29    30.45    17.78    11.48    29.26
MA8   43.47    27.25    15.70    13.58    29.28
Conventional Analysis
Student Mean and Standard Deviation across Grades
OP2015 OP 2014
Test Theta_mean Theta_std Theta_mean Theta_std
EL3 -1.30 1.01 -1.23 1.05
EL4 -0.86 1.05 -0.74 1.11
EL5 -0.32 1.06 -0.31 1.10
EL6 -0.05 1.06 -0.05 1.11
EL7 0.22 1.08 0.12 1.14
EL8 0.47 1.06 0.39 1.14
MA3 -1.39 0.98 -1.27 0.96
MA4 -0.85 0.99 -0.70 1.00
MA5 -0.55 1.10 -0.33 1.07
MA6 -0.28 1.25 -0.08 1.18
MA7 -0.09 1.32 0.03 1.34
MA8 0.13 1.41 0.28 1.33
Accuracy of Student Testing Results
• The error variance provides estimates of precision
SEM Range           N      %
Grade 3
 SEM >= 2.5         0      0
 1.5 <= SEM < 2.5   1      0
 0.5 <= SEM < 1.5   281    0.57
 0.3 <= SEM < 0.5   11599  23.57
 0 <= SEM < 0.3     37330  75.86
 Total              49211  100.00
Grade 4
 SEM >= 2.5         0      0
 1.5 <= SEM < 2.5   0      0
 0.5 <= SEM < 1.5   169    0.58
 0.3 <= SEM < 0.5   13196  45.18
 0 <= SEM < 0.3     15842  54.24
 Total              29207  100.00
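Conditional SEMs like those tabulated above can be recomputed independently from the administered items' parameters, since SEM(theta) = 1/sqrt(I(theta)). A minimal 3PL sketch with illustrative item parameters:

```python
# Sketch: conditional SEM from 3PL test information, SEM = 1/sqrt(I(theta)).
# Item parameters below are illustrative, not from an operational pool.
import math

def info_3pl(theta, a, b, c):
    """3PL item information at theta (D = 1.7 scaling)."""
    p = c + (1 - c) / (1 + math.exp(-1.7 * a * (theta - b)))
    q = 1 - p
    return (1.7 * a) ** 2 * (q / p) * ((p - c) / (1 - c)) ** 2

def sem(theta, items):
    """items: list of (a, b, c) tuples administered to one student."""
    return 1 / math.sqrt(sum(info_3pl(theta, *it) for it in items))

items = [(1.0, -0.5, 0.2), (1.2, 0.0, 0.2), (0.8, 0.5, 0.25)] * 10
print(round(sem(0.0, items), 3))
```

Running this per student at the final theta estimate and binning the results reproduces the SEM distribution table, which is a direct check on the vendor's reported precision.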
Advanced Analysis
• Check the ability estimation procedures: write a program to compute MLE, EAP, MAP, or inverse-TCC estimates, based on the IRT model used.
• Replicate the vendor's results.
• Track students' response paths to examine whether the item selection algorithm is effective (measurement precision, security, content balance, maximum item usage).
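A sketch of replicating an EAP estimate by quadrature under a standard-normal prior follows. The 2PL parameters and the evenly spaced nodes are illustrative simplifications; operational scoring may use Gauss-Hermite quadrature, a 3PL model, or a different prior:

```python
# Sketch: EAP ability estimate by numerical quadrature, standard-normal prior.
# Item parameters are illustrative 2PL values (D = 1.7 scaling).
import math

def p2pl(theta, a, b):
    return 1 / (1 + math.exp(-1.7 * a * (theta - b)))

def eap(responses, params, n_quad=81):
    """responses: list of 0/1; params: list of (a, b) tuples."""
    nodes = [-4 + 8 * k / (n_quad - 1) for k in range(n_quad)]
    num = den = 0.0
    for t in nodes:
        w = math.exp(-0.5 * t * t)  # unnormalized N(0,1) prior weight
        like = 1.0
        for u, (a, b) in zip(responses, params):
            p = p2pl(t, a, b)
            like *= p if u == 1 else 1 - p
        num += t * w * like
        den += w * like
    return num / den  # posterior mean of theta

params = [(1.0, -1.0), (1.1, 0.0), (0.9, 1.0), (1.2, 0.5)]
print(round(eap([1, 1, 0, 0], params), 3))
```

Feeding the vendor's item parameters and each student's actual response string through such a program, then comparing against the reported scores, is the core of the replication check.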
Conventional Item Analysis
ItemID  a-param  b-param   c-param  d-params         Score Points  Score 0  Score 1  Score 2  P-value  N Students
Item1   0.7049   -0.06291  0.21                      0,1           13%      87%               0.87     23
Item2   0.58321  1.83536   0.25                      0,1           60%      40%               0.40     3457
Item3   0.65237  0.36911   0.25                      0,1           39%      61%               0.61     3457
Item4   0.3744   1.90488   0.20                      0,1           84%      16%               0.16     25
Item5   0.57374  1.90652            .39802, -.39802  0,1,2         63%      11%      26%      0.31     27
Item6   0.43612  2.75943   0.23                      0,1           89%      11%               0.11     27
Item7   0.16635  0.9321    0.34                      0,1           32%      68%               0.68     19
Item8   0.47194  0.82735   0.25                      0,1           42%      58%               0.58     19
Item9   0.78146  0.96798   0.27                      0,1           53%      47%               0.47     19
Item10  0.36417  1.04496   0.31                      0,1           35%      65%               0.65     1383
Copyright © 2014 by Educational Testing Service. All rights reserved.Copyright © 2014 by Educational Testing Service. All rights reserved.
Item Exposure Control
• Item exposure is an important consideration for test security in the continuous testing environment of CATs.
• High item exposure rates pose a serious threat to test security.
• It is important to check whether item exposure control is implemented for security.
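Exposure rates can be checked directly from administration records. A minimal sketch, assuming each record lists the item IDs one student saw (data and IDs are illustrative):

```python
# Sketch: item exposure rate = (students who saw the item) / (total students).
# Administration records and item IDs below are illustrative.
from collections import Counter

def exposure_rates(administrations, pool):
    """administrations: list of item-ID lists, one per student.
    pool: all item IDs, so never-administered items get rate 0."""
    counts = Counter(i for items in administrations for i in items)
    n = len(administrations)
    return {item: counts.get(item, 0) / n for item in pool}

pool = ["i1", "i2", "i3", "i4"]
admins = [["i1", "i2"], ["i1", "i3"], ["i1", "i2"]]
rates = exposure_rates(admins, pool)
print(rates)  # i1 seen by all 3 students -> 1.0; i4 never seen -> 0.0
```

Flagging items whose rate exceeds a target ceiling (and items with rate 0, which waste pool capacity) gives the State an independent view of whether the exposure control in the selection algorithm is actually working.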
Visual Inspection

Number of items by difficulty (b value) and exposure-count band:

b Value      N Items  >=3000  1000-3000  500-1000  100-500  1-100  0
(-2.5,-2.0]
(-2.0,-1.5]  3                                              3
(-1.5,-1.0]  12                                    1        4      7
(-1.0,-0.5]  48       3       5          8         9        23
(-0.5,0.0]   98       11      13         12        18       42     2
(0.0,0.5]    149      13      18         24        25       63     6
(0.5,1.0]    185      12      14         31        34       86     8
(1.0,1.5]    198      11      16         53        41       67     10
(1.5,2.0]    270      17      19         33        94       74     33
(2.0,2.5]    202      5       9          21        65       69     33
(2.5,3.0]    151      8       4          11        56       52     20
(3.0,3.5]    109      2       6          1         49       38     13
(3.5,4.0]    72       1       1          2         17       44     7
(4.0,4.5]    52               1          2         7        35     7
(4.5,5.0]    21                                    2        1      18
(5.0,5.5]    7                                     1        5      1
(5.5,6.0]    4                                              4
>6
Item Exposure Rate by Item Difficulty Level
[Figure: chart of item exposure rates by item difficulty level]
Item Exposure by Content Domains
Domain   N Items  Freq>=3000  1000<=Freq<3000  500<=Freq<1000  100<=Freq<500  1<=Freq<100  Freq=0
1        499      40          22               15              39             276          107
2        437      9           39               96              242            37           14
3        334      26          28               4               6              264          6
4        311      10          18               84              136            50           13
Overall  1581     85          107              199             423            627          140
Final Comments
• Post hoc analysis can help:
 – A State explore the invisible technical characteristics of a CAT system
 – Fully understand the capabilities and limitations of a CAT system in operational tests
 – Build efficient QC procedures for CAT
• Many options for post-hoc analysis of CAT exist.
Thanks very much for your
Time and Consideration!