Smarter Pilot and Field Test Studies
• Moved the field forward
  – Big data sets
  – Many methods, spectacular researchers
• Immediate, practical results
  – Field test improved on pilot findings
  – We learned a lot about advantages and limitations
The Field Test Items in the Study
• 683 English language arts (ELA)/literacy short-text, constructed-response items
  – Reading short-text items (CAT)
  – Writing brief writes (CAT)
  – Research performance task (PT) questions
• 238 mathematics short-text, constructed-response items
  – Includes 40 mathematical reasoning items
• 66 ELA/literacy essay items
Criteria
• Quadratic weighted kappa between engine score and human score less than 0.70
• Pearson correlation between engine score and human score less than 0.70
• Standardized difference between engine score and human score greater than 0.12 in absolute value
• Degradation in quadratic weighted kappa or correlation from human-human to engine-human >= 0
• Standardized difference between engine score and human score for a subgroup greater than 0.10 in absolute value
• Notable reduction in perfect agreement rates from human-human to engine-human equal to or greater than 0.05
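These agreement statistics can be computed directly from paired human and engine scores. Below is a minimal Python sketch that evaluates the first three flag thresholds listed above; the function names and the small example data are illustrative, and the pooled-SD form of the standardized difference is an assumption, since the study's exact formula is not given here.

```python
import numpy as np

def quadratic_weighted_kappa(a, b, min_score, max_score):
    """Quadratic weighted kappa between two integer score vectors."""
    a = np.asarray(a) - min_score
    b = np.asarray(b) - min_score
    k = max_score - min_score + 1
    observed = np.zeros((k, k))
    for i, j in zip(a, b):
        observed[i, j] += 1
    # Quadratic disagreement weights: (i - j)^2 scaled to [0, 1]
    idx = np.arange(k)
    weights = (idx[:, None] - idx[None, :]) ** 2 / (k - 1) ** 2
    # Expected counts from the marginals, scaled to the same total
    expected = np.outer(observed.sum(1), observed.sum(0)) / observed.sum()
    return 1.0 - (weights * observed).sum() / (weights * expected).sum()

def standardized_difference(engine, human):
    """Standardized mean difference; pooled SD is one common choice (assumed here)."""
    engine, human = np.asarray(engine, float), np.asarray(human, float)
    pooled_sd = np.sqrt((engine.var(ddof=1) + human.var(ddof=1)) / 2.0)
    return (engine.mean() - human.mean()) / pooled_sd

# Illustrative check of an item against the flag thresholds above
human  = np.array([2, 1, 3, 0, 2, 2, 1, 3, 2, 1])
engine = np.array([2, 1, 2, 0, 2, 3, 1, 3, 2, 2])
qwk = quadratic_weighted_kappa(engine, human, 0, 3)
r = np.corrcoef(engine, human)[0, 1]
sd_diff = standardized_difference(engine, human)
flags = {
    "qwk_below_0.70": qwk < 0.70,
    "correlation_below_0.70": r < 0.70,
    "std_diff_above_0.12_abs": abs(sd_diff) > 0.12,
}
print(qwk, r, sd_diff, flags)
```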
Read-Behind Studies
• Costs limit the number of responses getting a second human read.
• Can using scoring engines as a second rater improve scoring?
• Result: Scoring scenarios where an Automated Scoring system serves as a second rater ("read-behind") behind a human rater produce high-quality scores. Machine-human (M-H) and human-human (H-H) results are similar.
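A read-behind design can be expressed as a simple score-resolution rule. The sketch below is a minimal illustration, not the study's operational procedure: it assumes a human first read, an engine second read, and a third human read only when the two scores disagree by more than an allowed gap. The function name, the default gap, and the adjudication trigger are all assumptions.

```python
from typing import Callable

def read_behind_score(
    human_score: int,
    engine_score: int,
    adjudicate: Callable[[], int],
    max_gap: int = 0,
) -> int:
    """Resolve a final score when an engine reads behind a human rater.

    If the two reads differ by more than `max_gap`, the response is routed
    to an additional human (adjudication) read; otherwise the human score
    of record stands. The trigger rule is an assumption for illustration.
    """
    if abs(human_score - engine_score) > max_gap:
        return adjudicate()          # large disagreement: expert human read
    return human_score               # agreement (or small gap): keep human score

# Example: a 0 vs. 3 split triggers adjudication; exact agreement does not
print(read_behind_score(0, 3, adjudicate=lambda: 1))   # -> 1 (adjudicated)
print(read_behind_score(2, 2, adjudicate=lambda: 99))  # -> 2 (human score kept)
```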
Targeting Responses for Human Review
• Can scoring engines detect responses most likely to be rated differently by humans and machines so they can be routed to second raters?
• Result: Using scoring engines to identify candidates for a second human read yielded major reliability improvements over random assignment of responses to second reads.
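One way to target second reads is to route the responses on which the engine is least certain. The sketch below assumes the engine exposes a per-response confidence value; the class, attribute names, and budget-based selection rule are illustrative assumptions, not any vendor's API or the study's routing rule.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class ScoredResponse:
    response_id: str
    engine_score: int
    engine_confidence: float  # assumed available from the engine, in [0, 1]

def select_for_human_review(
    responses: List[ScoredResponse],
    review_budget: int,
) -> List[ScoredResponse]:
    """Pick the responses most likely to disagree with a human rater.

    'Most likely to disagree' is proxied here by low engine confidence;
    the lowest-confidence responses are routed to a second human read,
    up to the available review budget.
    """
    ranked = sorted(responses, key=lambda r: r.engine_confidence)
    return ranked[:review_budget]

# Example batch: the two lowest-confidence responses go to human review
batch = [
    ScoredResponse("r1", 2, 0.93),
    ScoredResponse("r2", 1, 0.41),
    ScoredResponse("r3", 3, 0.78),
    ScoredResponse("r4", 0, 0.55),
]
for r in select_for_human_review(batch, review_budget=2):
    print(r.response_id, r.engine_score, r.engine_confidence)
```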
Item Characteristics that Correlate with Agreement for Human and Automated Scoring - ELA
• Reading short-text items
  – Item-specific rubrics yield higher reliability than generic rubrics
  – Agreement was higher when the text was fictional
• Essays
  – Generic rubrics are associated with higher reliability for the conventions trait
  – For the other traits, prompt-specific engine training is preferred
• Brief writes
  – Significantly higher agreement for narrative stimuli
• All trends above were true for both human and machine scoring.
Item Characteristics that Correlate with Agreement for Human and Automated Scoring - Mathematics
• Using an Automated Scoring system as a read-behind improves score quality, provided non-exact adjudication is used.
• In mathematics, hand-scoring agreement was statistically significantly higher than agreement for the best engine scores.
  – Mathematics responses could be expressed in a large number of ways.
  – Student responses tended to be short.
Moving forward
• Summative tests
  – Use as second rater
  – Target second human reads
  – Smarter rules allow vendors to use scoring engines, but none are currently doing so
• Interim
  – Provide to teachers to score specific tasks
• Classroom Assessment
  – Provide to teachers to allow assignment of more writing tasks
Policy Issues
• Resistance to AI use
  – The Chinese Room argument
  – Threat to training and understanding
• Inflated expectations lead to disappointment
  – Doesn’t always work
  – Requires planning and coordination
  – Is not cheap
Moving forward
• Platform integration (see the sketch after this list)
  – Current engines use batch or stand-alone processing
  – Need trained engine apps that work with online delivery engines in real time
• Item development
  – The studies gave better information about what kinds of items are likely to succeed
  – It is desirable to have scoring-engine experts involved in task development
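The real-time integration point mentioned under "Platform integration" can be pictured as a small scoring service that the online delivery engine calls as each response is submitted. The sketch below uses only the Python standard library; the endpoint, payload fields, and the length-based scoring stub are illustrative assumptions, not any vendor's interface.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def score_response(item_id: str, response_text: str) -> dict:
    """Stand-in for a trained engine; a real engine would load an
    item-specific model and return a score plus a confidence value."""
    return {"item_id": item_id,
            "score": min(len(response_text) // 50, 3),
            "confidence": 0.5}

class ScoringHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Expect JSON like {"item_id": "...", "response": "..."}
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length) or b"{}")
        result = score_response(payload.get("item_id", ""),
                                payload.get("response", ""))
        body = json.dumps(result).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    # The delivery engine would POST each response here as it is submitted.
    HTTPServer(("localhost", 8080), ScoringHandler).serve_forever()
```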
Want Details?
• The Field Test Scoring Study and appendices have been posted to SmarterApp: http://www.smarterapp.org/deployment/FieldTest_AutomatedScoringResearchStudies.html
• An updated version of the pilot study is on the Smarter Balanced website: http://www.smarterbalanced.org/pub-n-res/pilot-test-automated-scoring-research-studies/