
Page 1:

Susan Lottridge

NCSA, June 2014

Automated Scoring in the Next Generation World

© Copyright 2014 Pacific Metrics Corporation

Page 2:

Next Generation World (Winter, Burkhardt, Freidhoff, Stimson, & Leslie, 2013)

• Computer-based testing
• Adaptive testing
• Potentially large item pools
• Technology-enhanced items
• Automated item generation
• Personalized learning/formative uses

Page 3:

Key Topics in Today’s Presentation

• Cataloguing Constructed Response items
• Combining scoring sources
• Adaptive Testing
• TEIs/Automated Item Generation

Page 4:

Cataloguing CR Items

Page 5:

Item Type & Scoring

• Knowing the item type can help determine the appropriate scoring approach and whether the item is ‘score-able’ (Lottridge, Winter & Mugan, 2013)

• There is an almost infinite number of ways to create a Constructed Response Item!

Page 6:

Constructed Response Items

• Covers a very broad range of item types
  – When different from an essay?
  – When different from a performance event or task?

• Need better definition around these items
  – Types
  – Structural considerations
  – Content
  – Score points/rubric

Page 7:

Types

• Technology-enhanced items
• Text-based
  – One-word to phrasal typed response
  – Single sentence
  – Multiple sentence

• Constrained entry
  – Numbers
  – Equations/expressions

Page 8:

Structural Considerations (Ferrara et al., 2003; Scalise & Gifford, 2006)

• CR items are often multi-part
  – What defines a part?
  – Parts can differ from entry boxes
  – How are parts scored in the rubric?
  – How many points per part? (see the sketch after this list)

• CR items consist of multiple types
  – TEI + text ‘explain your answer’
  – Solution + equation (+ text ‘explain your answer’)
  – Solution + text ‘explain your answer’
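A minimal sketch (in Python, with hypothetical field and class names) of how a multi-part CR item and its rubric points could be represented, so that parts, entry boxes, and points per part are explicit:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class ItemPart:
    """One scorable part of a CR item (e.g., a solution box or an explanation)."""
    part_type: str          # e.g., "solution", "equation", "explanation", "TEI"
    entry_boxes: int = 1    # a part may span more than one entry box
    max_points: int = 1     # points allocated to this part in the rubric

@dataclass
class CRItem:
    item_id: str
    parts: List[ItemPart] = field(default_factory=list)

    @property
    def max_score(self) -> int:
        # Total score is the sum of points across parts.
        return sum(p.max_points for p in self.parts)

# Example: a "solution + explanation" math item worth 3 points.
item = CRItem("MATH_001", [ItemPart("solution", max_points=1),
                           ItemPart("explanation", max_points=2)])
print(item.max_score)  # 3
```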

Page 9:

Math CR Items

CR Item Structure                                       Count
Solution + explanation                                     28
[Solution + explanation] + [solution + explanation]        10
[N] Solution + explanation                                  9
[N] Solution                                                5
Equation + solution                                         4
Equation + solution + explanation                           4
TEI with or without labeling                                8
TEI + solution/expression/equation                          4
Other                                                       3

Page 10:

Science CR Items

CR Item Structure                                       Count
Identify 1 + explain/describe                               7
Identify 2 + explain/describe                               7
Identify 4 + explain/describe                               1
Identify [N]                                               11
Explain/describe [N]                                        5
[M] [identify [N] + explain/describe [N]]                   4
TEI + explain/describe [N] + identify [N]                   3
Solution + explain/describe [N] + identify [N]              3
Solution [N]                                                1

Page 11:

Reading CR Items

CR Item Structure                                       Count
Identify 1 + 2 details                                     22
Identify [N] + [N] details                                  9
2 details                                                   7
[N] details                                                 5
[N] List 1 + explain/describe                               5
Summary with specifications                                 5
Inference/generalization + [N] details                      5

Page 12:

Content Considerations

• Math
  – What is required in ‘explain your answer’ or ‘show your work’ responses?
• Science (Baxter & Glaser, 1998)
  – Content lean to content rich
  – Process constrained to process rich
• Reading
  – Detail versus explanation
  – Prediction/generalization
  – Summarization

Page 13:

Combining Scoring Sources

Page 14:

Combining Scoring Sources

• There are many ways to leverage different scoring sources (Lottridge, Winter & Mugan, 2013)
  – 100% human + N% computer second read
  – Complementary human and computer scoring
  – 100% computer with N% human second read (see the sketch after this list)
  – Blended human and computer scoring

• But can we use different computer scoring models to produce better results?
  – Adjudication Model
  – Ensemble Model
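As an illustration of one of the routing schemes above (hypothetical code, not a Pacific Metrics implementation), a "100% computer with N% human second read" model simply samples a fraction of computer-scored responses for human review:

```python
import random

def select_for_human_second_read(response_ids, fraction=0.10, seed=0):
    """Return the subset of computer-scored responses to route to a human second read.

    `fraction` is the N% second-read rate; the rest keep the engine score only.
    """
    rng = random.Random(seed)
    n = round(len(response_ids) * fraction)
    return set(rng.sample(list(response_ids), n))

# Example: flag 10% of 500 responses for human review.
flagged = select_for_human_second_read(range(500), fraction=0.10)
print(len(flagged))  # 50
```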

Page 15:

Adjudication Model

• Two engines independently trained on 5 math CR items
  – Each engine underperforms relative to humans
  – One engine is ‘rule-based’; the other is heavily NLP/machine learning based
• Restrict computer scoring to those responses on which the engines agree (sketched below)
  – The remaining responses go to humans for scoring
• Results show promise
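A minimal sketch of the adjudication rule described above (the function name and return format are hypothetical; the actual engines are not shown): the engine score is kept only when the two engines agree, otherwise the response is routed to a human rater.

```python
def adjudicate(score_engine_1, score_engine_2):
    """Return (source, score): the engine score if the engines agree, else defer to a human."""
    if score_engine_1 == score_engine_2:
        return ("engine", score_engine_1)
    return ("human", None)  # None means a human rater must assign the score

# Example: route a small batch of responses.
engine1_scores = [2, 1, 0, 3]
engine2_scores = [2, 2, 0, 3]
routed = [adjudicate(a, b) for a, b in zip(engine1_scores, engine2_scores)]
print(routed)  # [('engine', 2), ('human', None), ('engine', 0), ('engine', 3)]
```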

Page 16:

Page 17:

Exact Agreement Rates (Complete Validation Sample)

Item    N     Human 1 – Human 2    Engine 1 – Human 2    Engine 2 – Human 2
1       341   91%                  76%                   80%
2       261   88%                  79%                   85%
3       340   85%                  77%                   77%
4       298   96%                  88%                   89%
5       183   94%                  74%                   70%
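For reference, the exact agreement rates in this and the following tables are the proportion of responses on which the two sources assign identical scores; a minimal sketch of the computation, assuming integer score vectors:

```python
def exact_agreement(scores_a, scores_b):
    """Proportion of responses receiving identical scores from both sources."""
    assert len(scores_a) == len(scores_b)
    matches = sum(a == b for a, b in zip(scores_a, scores_b))
    return matches / len(scores_a)

# Example: 3 of 4 scores match -> 75% exact agreement.
print(exact_agreement([2, 1, 0, 3], [2, 2, 0, 3]))  # 0.75
```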

Page 18:

Adjudication Proportions

Item    N     Engine Assigns Score    Humans Assign Score
1       676   71%                     29%
2       539   80%                     20%
3       706   74%                     26%
4       598   90%                     10%
5       372   62%                     38%

Page 19:

Engine Assigns Score Condition (Exact Agreement Performance)

Item    N     Human 1 – Human 2    Engine 1/2 – Human 2
1       246   72%                  74%
2       209   90%                  90%
3       253   86%                  86%
4       274   96%                  92%
5       106   97%                  91%

Page 20:

Humans Assign Score Condition (Exact Agreement Performance)

Item    N    Human 1 – Human 2    Engine 1 – Human 2    Engine 2 – Human 2
1       95   86%                  39%                   52%
2       52   79%                  31%                   59%
3       87   85%                  48%                   51%
4       24   92%                  46%                   54%
5       77   90%                  52%                   42%

Page 21:

Adjudication Summary

• When we restrict computer scoring to the responses on which the engines agree, the engines perform similarly to humans. When they do not agree, the engines perform poorly relative to humans.

• This suggests the adjudication criteria are adequate for identifying the responses that should be scored automatically.

Page 22:

Ensemble Model

• Combining scores from two different engines to produce a score
  – Weighted average (sketched below)
  – Optimization via regression
  – Other methods (decision trees, etc.)
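A minimal sketch of the simplest option above, a weighted average of two engine scores rounded to the nearest rubric point (the weights and rubric range are hypothetical; a regression or tree-based combiner would replace the fixed weights):

```python
def ensemble_score(score_1, score_2, w1=0.5, w2=0.5, max_points=3):
    """Combine two engine scores by weighted average, clipped to the rubric range."""
    combined = w1 * score_1 + w2 * score_2
    return min(max_points, max(0, round(combined)))

# Example: engines disagree (1 vs. 2); equal weights give 1.5, which Python's
# round-half-to-even rounds to 2.
print(ensemble_score(1, 2))  # 2
```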

Page 23:

Results

Exact Agreement Rate with Human Raters

Item    Engine 1    Engine 2    Ensemble    Improvement
1       59%         61%         66%         5%
2       75%         67%         76%         1%
3       57%         58%         63%         5%

• 13 Reading CR items

• Ensembling by averaging scores from two engines

• 10 items exhibited no improvement

• 3 items exhibited some improvement

Page 24:

Adaptive Testing and CRs

• Item pools
  – Potentially large number of CRs (thousands)
  – Low number of examinees per CR (if any)

• Impacts on hand scoring and engine scoring
  – Training readers and engines
  – Requires a large AS staff to train, or a shift from ‘expert-based’ to ‘automated’ training models

Page 25:

TEIs and Automated Item Generation

• Many TEIs/AIG templates are scored 0-1 or are multiple choice (Winter et al., 2013)
  – But often require multiple steps by the examinee

• Can we involve item authors in configuring scoring rules to enable partial-credit scoring? (see the sketch after this list)
  – Expands the usefulness of the item to examinees
  – Removes expert scoring labor from the training process
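As an illustration of what author-configured partial-credit rules might look like (a hypothetical format, not a specific product's schema), each rule maps an observable feature of the response to points, and the item score is the capped sum:

```python
# Hypothetical author-configured rules for a two-step TEI:
# 1 point for selecting the correct region, 1 point for the correct numeric entry.
rules = [
    {"field": "selected_region", "equals": "B", "points": 1},
    {"field": "numeric_entry",   "equals": 42,  "points": 1},
]

def score_response(response, rules, max_points=2):
    """Sum points for every rule the response satisfies, capped at max_points."""
    earned = sum(r["points"] for r in rules if response.get(r["field"]) == r["equals"])
    return min(earned, max_points)

# Example: correct region, wrong number -> partial credit of 1.
print(score_response({"selected_region": "B", "numeric_entry": 40}, rules))  # 1
```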

Page 26:

References

Baxter, G., & Glaser, R. (1998). Investigating the cognitive complexity of science assessments. Educational Measurement: Issues & Practice, 17(3), 37-45.

Ferrara, S., Duncan, T., Perie, M., Freed, R., McGovern, J., & Chilukuri, R. (2003, April). Item construct validity: Early results from a study of the relationship between intended and actual cognitive demands in a middle school science assessment. Paper presented at the annual meeting of the American Educational Research Association, Chicago, IL.

Lottridge, S., Winter, P., & Mugan, L. (2013). The AS decision matrix: Using program stakes and item type to make informed decisions about automated scoring implementations. Pacific Metrics Corporation. Retrieved from http://www.pacificmetrics.com/white-papers/ASDecisionMatrix_WhitePaper_Final.pdf

Scalise, K., & Gifford, B. R. (2006). Computer-based assessment in e-learning: A framework for constructing "intermediate constraint" questions and tasks for technology platforms. Journal of Teaching, Learning and Assessment, 4(6).

Winter, P. C., Burkhardt, A. K., Freidhoff, J. R., Stimson, R. J., & Leslie, S. C. (2013). Astonishing impact: An introduction to five computer-based assessment issues. Michigan Virtual Learning Research Institute. Retrieved from http://media.mivu.org/institute/pdf/astonishing_impact.pdf

Page 27:

Questions?

Pacific Metrics Corporation

Page 28:

1 Lower Ragsdale Drive, Building 1, Suite 150, Monterey, CA 93940

www.pacificmetrics.com

Thank You