Smarter Balanced Studies
CCSSO-NCSA
San Diego, CA
June 2015
Smarter Pilot and Field Test Studies
- Moved the field forward
- Big data sets
- Many methods, spectacular researchers
- Immediate, practical results
- Field test improved on pilot findings
- We learned a lot about advantages and limitations

The Field Test Items in the Study
- 683 English language arts (ELA)/literacy short-text, constructed-response items
  - Reading short text (CAT)
  - Writing brief writes (CAT)
  - Research PT questions
- 238 mathematics short-text, constructed-response items
  - Includes 40 mathematical reasoning items
- 66 ELA/literacy essay items

Criteria
- Quadratic weighted kappa between engine score and human score less than 0.70
- Pearson correlation between engine score and human score less than 0.70
- Standardized difference between engine score and human score greater than 0.12 in absolute value
- Degradation in quadratic weighted kappa or correlation from human-human to engine-human >= 0
- Standardized difference between engine score and human score for a subgroup greater than 0.10 in absolute value
- Notable reduction in perfect agreement rates from human-human to engine-human equal to or greater than 0.05

Read-Behind Studies
- Costs limit the number of responses getting a second human read. Can using scoring engines as a second rater improve scoring?
- Result: Scoring scenarios where an automated scoring system serves as a second rater (read-behind) behind a human rater produce high-quality scores. Machine-human (M-H) and human-human (H-H) results are similar.
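The two agreement statistics the criteria rely on can be computed directly. The sketch below is illustrative only (the function names and the pooled-SD convention for the standardized difference are my assumptions, not the study's code); it shows quadratic weighted kappa, which penalizes disagreements by squared score distance, and the standardized mean difference between engine and human scores.

```python
from collections import Counter
from statistics import mean, pstdev

def quadratic_weighted_kappa(human, engine, min_score, max_score):
    """Quadratic weighted kappa between two raters' integer scores.

    Disagreements are weighted by squared score distance, so a
    two-point gap costs four times as much as an off-by-one.
    Values below 0.70 would trip the flagging criterion above.
    """
    k = max_score - min_score + 1
    n = len(human)
    observed = Counter(zip(human, engine))   # joint (human, engine) counts
    h_marg = Counter(human)                  # human marginal counts
    e_marg = Counter(engine)                 # engine marginal counts
    num = den = 0.0
    for i in range(min_score, max_score + 1):
        for j in range(min_score, max_score + 1):
            w = (i - j) ** 2 / (k - 1) ** 2
            num += w * observed[(i, j)] / n
            den += w * (h_marg[i] / n) * (e_marg[j] / n)  # chance agreement
    return 1.0 - num / den

def standardized_difference(human, engine):
    """(mean engine - mean human) / pooled SD.

    Absolute values above 0.12 overall, or 0.10 for a subgroup,
    would trip the flagging criteria above.
    """
    pooled_var = (pstdev(human) ** 2 + pstdev(engine) ** 2) / 2
    return (mean(engine) - mean(human)) / pooled_var ** 0.5
```

Perfect agreement yields a kappa of 1.0, chance-level agreement yields 0, and the statistic falls as engine-human disagreements grow larger.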
Targeting Responses for Human Review
- Can scoring engines detect the responses most likely to be rated differently by humans and machines, so those responses can be routed to second raters?
- Result: Using scoring engines to identify candidates for a second human read yielded major reliability improvements over random assignment of responses.
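The study does not specify how the engines identified candidate responses; one common heuristic, sketched below purely as an assumption, is to flag a response when the engine's probability distribution over score points is nearly flat, i.e. when its top two candidate scores are close in probability.

```python
def needs_second_read(score_probs, margin=0.20):
    """Flag a response for a second human read when the engine is
    uncertain about its score.

    `score_probs` maps each score point to the engine's estimated
    probability that it is the correct score (a hypothetical engine
    output; real engines expose confidence in different ways).
    A small gap between the two most probable scores means the
    response is a good candidate for routing to a human.
    """
    top, runner_up = sorted(score_probs.values(), reverse=True)[:2]
    return (top - runner_up) < margin
```

For example, a near-tie such as `{0: 0.48, 1: 0.42, 2: 0.10}` is flagged, while a confident `{0: 0.90, 1: 0.05, 2: 0.05}` is not. Routing only flagged responses concentrates the second-read budget where human-machine disagreement is most likely, which is the mechanism behind the reliability gains reported above.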
Item Characteristics that Correlate with Agreement for Human and Automated Scoring - ELA
- Reading short-text items: item-specific rubrics yield higher reliability than generic rubrics, and agreement was higher when the text was fictional.
- Essays: generic rubrics are associated with higher reliability for the conventions trait; for the other traits, prompt-specific engine training is preferred.
- Brief writes: significantly higher agreement for narrative stimuli.
- All trends above held for both human and machine scoring.

Item Characteristics that Correlate with Agreement for Human and Automated Scoring - Mathematics
- Using an automated scoring system as a read-behind improves score quality, provided non-exact adjudication is used.
- In mathematics, hand-scoring agreement was statistically significantly higher than the best engine scores.
- Mathematics responses could be expressed in a large number of ways, and student responses tended to be short.

Moving Forward
- Summative tests
  - Use as a second rater
  - Target second human reads
  - Smarter rules allow vendors to use scoring engines, but none are currently doing so
- Interim
  - Provide to teachers to score specific tasks
- Classroom assessment
  - Provide to teachers to allow assignment of more writing tasks

Policy Issues
- Resistance to AI use: the Chinese Room; perceived threat to training and understanding
- Inflated expectations lead to disappointment
  - Doesn't always work
  - Requires planning and coordination
  - Is not cheap
Moving Forward
- Platform integration
  - Current engines use batch or stand-alone processing
  - Need trained engine apps that work with online delivery engines in real time
- Item development
  - The studies gave better information about what kinds of items are likely to succeed
  - It is desirable to have scoring-engine experts involved in task development

Want Details?
- The Field Test Scoring Study and appendices have been posted to SmarterApp: http://www.smarterapp.org/deployment/FieldTest_AutomatedScoringResearchStudies.html
- An updated version of the pilot study is on the Smarter Balanced website: http://www.smarterbalanced.org/pub-n-res/pilot-test-automated-scoring-research-studies/

Thank you for your attention