Smarter Balanced Studies
CCSSO-NCSA
San Diego, CA
June 2015
Smarter Pilot and Field Test Studies
- Moved the field forward
- Big data sets
- Many methods, spectacular researchers
- Immediate, practical results
- Field test improved on pilot findings
- We learned a lot about advantages and limitations

The Field Test Items in the Study
- 683 English language arts (ELA)/literacy short-text, constructed-response items
  - Reading short text (CAT)
  - Writing brief writes (CAT)
  - Research PT questions
- 238 mathematics short-text, constructed-response items
  - Includes 40 mathematical reasoning items
- 66 ELA/literacy essay items

Criteria
- Quadratic weighted kappa between engine score and human score less than 0.70
- Pearson correlation between engine score and human score less than 0.70
- Standardized difference between engine score and human score greater than 0.12 in absolute value
- Degradation in quadratic weighted kappa or correlation from human-human to engine-human >= 0
- Standardized difference between engine score and human score for a subgroup greater than 0.10 in absolute value
- Notable reduction in perfect agreement rates from human-human to engine-human equal to or greater than 0.05

Read-Behind Studies
- Costs limit the number of responses getting a second human read. Can using scoring engines as a second rater improve scoring?
- Result: Scoring scenarios where an automated scoring system serves as a second rater (read-behind) behind a human rater produce high-quality scores. Machine-human (M-H) and human-human (H-H) results are similar.
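The two agreement statistics the criteria rely on can be computed directly. The sketch below is illustrative only (the function names and the pooled-SD convention for the standardized difference are my assumptions, not the study's code); it shows quadratic weighted kappa, which penalizes disagreements by squared score distance, and the standardized mean difference between engine and human scores.

```python
from collections import Counter
from statistics import mean, pstdev

def quadratic_weighted_kappa(human, engine, min_score, max_score):
    """Quadratic weighted kappa between two raters' integer scores.

    Disagreements are weighted by squared score distance, so a
    two-point gap costs four times as much as an off-by-one.
    Values below 0.70 would trip the flagging criterion above.
    """
    k = max_score - min_score + 1
    n = len(human)
    observed = Counter(zip(human, engine))   # joint (human, engine) counts
    h_marg = Counter(human)                  # human marginal counts
    e_marg = Counter(engine)                 # engine marginal counts
    num = den = 0.0
    for i in range(min_score, max_score + 1):
        for j in range(min_score, max_score + 1):
            w = (i - j) ** 2 / (k - 1) ** 2
            num += w * observed[(i, j)] / n
            den += w * (h_marg[i] / n) * (e_marg[j] / n)  # chance agreement
    return 1.0 - num / den

def standardized_difference(human, engine):
    """(mean engine - mean human) / pooled SD.

    Absolute values above 0.12 overall, or 0.10 for a subgroup,
    would trip the flagging criteria above.
    """
    pooled_var = (pstdev(human) ** 2 + pstdev(engine) ** 2) / 2
    return (mean(engine) - mean(human)) / pooled_var ** 0.5
```

Perfect agreement yields a kappa of 1.0, chance-level agreement yields 0, and the statistic falls as engine-human disagreements grow larger.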
Targeting Responses for Human Review
- Can scoring engines detect the responses most likely to be rated differently by humans and machines, so those responses can be routed to second raters?
- Result: Using scoring engines to identify candidates for a second human read yielded major reliability improvements over random assignment of responses.
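The study does not specify how the engines identified candidate responses; one common heuristic, sketched below purely as an assumption, is to flag a response when the engine's probability distribution over score points is nearly flat, i.e. when its top two candidate scores are close in probability.

```python
def needs_second_read(score_probs, margin=0.20):
    """Flag a response for a second human read when the engine is
    uncertain about its score.

    `score_probs` maps each score point to the engine's estimated
    probability that it is the correct score (a hypothetical engine
    output; real engines expose confidence in different ways).
    A small gap between the two most probable scores means the
    response is a good candidate for routing to a human.
    """
    top, runner_up = sorted(score_probs.values(), reverse=True)[:2]
    return (top - runner_up) < margin
```

For example, a near-tie such as `{0: 0.48, 1: 0.42, 2: 0.10}` is flagged, while a confident `{0: 0.90, 1: 0.05, 2: 0.05}` is not. Routing only flagged responses concentrates the second-read budget where human-machine disagreement is most likely, which is the mechanism behind the reliability gains reported above.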
Item Characteristics that Correlate with Agreement for Human and Automated Scoring - ELA
- Reading short-text items: item-specific rubrics yield higher reliability than generic rubrics, and agreement was higher when the text was fictional.
- Essays: generic rubrics are associated with higher reliability for the conventions trait; for the other traits, prompt-specific engine training is preferred.
- Brief writes: significantly higher agreement for narrative stimuli.
- All trends above held for both human and machine scoring.

Item Characteristics that Correlate with Agreement for Human and Automated Scoring - Mathematics
- Using an automated scoring system as a read-behind improves score quality, provided non-exact adjudication is used.
- In mathematics, hand-scoring agreement was statistically significantly higher than the best engine scores.
- Mathematics responses could be expressed in a large number of ways, and student responses tended to be short.

Moving Forward
- Summative tests
  - Use as a second rater
  - Target second human reads
  - Smarter rules allow vendors to use scoring engines, but none are currently doing so
- Interim
  - Provide to teachers to score specific tasks
- Classroom assessment
  - Provide to teachers to allow assignment of more writing tasks

Policy Issues
- Resistance to AI use: the Chinese Room; perceived threat to training and understanding
- Inflated expectations lead to disappointment
  - Doesn't always work
  - Requires planning and coordination
  - Is not cheap
Moving Forward
- Platform integration
  - Current engines use batch or stand-alone processing
  - Need trained engine apps that work with online delivery engines in real time
- Item development
  - The studies gave better information about what kinds of items are likely to succeed
  - It is desirable to have scoring-engine experts involved in task development

Want Details?
- The Field Test Scoring Study and appendices have been posted to SmarterApp: http://www.smarterapp.org/deployment/FieldTest_AutomatedScoringResearchStudies.html
- An updated version of the pilot study is on the Smarter Balanced website: http://www.smarterbalanced.org/pub-n-res/pilot-test-automated-scoring-research-studies/

Thank you for your attention