Humans and Machines: Modeling the Stochastic Behavior of Raters
in Educational Assessment
Richard J. Patz
BEAR Seminar
UC Berkeley Graduate School of Education
February 13, 2018
Outline of Topics
• Natural human responses in educational assessment
• Technology in education, assessment and scoring
• Computational methods for automated scoring (NLP, LSA, ML)
• Rating information in statistical and psychometric analysis: Challenges
• Unreliability and bias
• Combining information from multiple ratings
• Hierarchical rater model (HRM)
• Applications
• Comparing machine to humans
• Simulating human rating errors to further related research
Why natural, constructed response formats in assessment?
• Learning involves constructing knowledge and expressing through language (written and/or oral)
• Assessments should consist of ‘authentic’ tasks, i.e., of a type that students encounter during instruction
• Artificially contrived item formats (e.g., multiple-choice) advantage skills unrelated to the intended construct
• Some constructs (e.g., essay writing) simply can’t be measured through selected-response formats
Disadvantages of Constructed-Response formats
• Time consuming for examinees (fewer items per unit time)
• Require expensive human ratings (typically)
• Create delay in providing scores, reports
• Human rating is error-prone
• Consistency across rating events difficult to maintain
• Inconsistency impairs comparability
• Combining multiple ratings creates modeling, scoring problems
Practical Balancing of Priorities
• Mix constructed-response formats with selected-response formats to realize the benefits of each
• Leverage technology in the scoring of CR items
• Rule-based scoring (exhaustively enumerated/constrained)
• Natural language processing and subsequent automated rating for written (sometimes spoken) responses
• Made more practical with computer-based test delivery
Technology for Automated Scoring
• Ten years ago there were relatively few providers
• Expensive, proprietary algorithms
• Specialized expertise (NLP, LSA, AI)
• Laborious, ‘hand-crafted’ engine training
• Today, solutions are much more widely available
• Students now fit automated scoring (AS) models in CS and STAT classes
• Open source libraries abound
• Machine learning and neural networks: accessible, powerful, and up to the job
• Validity and reliability challenges remain
• Impact of algorithms on instruction, e.g., in writing? Also threat of gaming strategies
• Managing algorithm improvements, examinee adaptations, over time
• Quality human scores needed to train the machines (supervised learning)
• Biases or other problems in human ratings ‘learned’ by algorithms
• Combining scores from machines and humans
Machine Learning for Automated Essay Scoring
Example characteristics:
• Words processed in relation to a corpus for frequency, etc.
• N-grams (word pairs, triplets, etc.)
• Transformations (non-linear, sinusoidal) and dimensionality reduction
• Iterations improving along a gradient, with memory of previous states
• Data split into training and validation; maximizing prediction accuracy on validation; little else “interpretable” about parameters
Example architecture: Taghipour & Ng (2016) [figure not reproduced]
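To make the listed characteristics concrete, here is a minimal, illustrative Python sketch of a feature-based approach (word and n-gram frequencies plus a simple regression). All data, names, and settings are hypothetical placeholders; this is not the Taghipour & Ng (2016) neural architecture, just a toy counterpart using scikit-learn:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Ridge

# Hypothetical training data: essay texts with human scores on a 1-6 rubric.
essays = ["The author argues that ...", "In my opinion the evidence ...", "This essay explains ..."]
scores = [4, 3, 5]

# Words and n-grams (unigrams and bigrams), weighted relative to corpus frequency (TF-IDF).
vectorizer = TfidfVectorizer(ngram_range=(1, 2), min_df=1)
X = vectorizer.fit_transform(essays)

# Simple regression predicting the human score from the features.
model = Ridge(alpha=1.0).fit(X, scores)

# Score a new (hypothetical) essay; in practice accuracy is checked on a held-out validation split.
print(model.predict(vectorizer.transform(["The evidence supports the claim because ..."])))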
Focus of Research
• Situating the rating process within the overall measurement and statistical context
• The hierarchical rater model (HRM)
• Accounting for multiple ratings ‘correctly’
• Contrast with alternative approaches, e.g., Facets
• Simultaneous analysis of human and machine ratings
• Example from large-scale writing assessment
• Leveraging models of human rating behavior for better simulation, examination of impacts on inferences
Hierarchical Structure of Rated Item Response Data
• If all levels follow normal distributions, then Generalizability Theory applies
• Estimates at any level weight the data mean and the prior mean, using a ‘generalizability coefficient’
• If ‘ideal ratings’ follow an IRT model and observed ratings follow a signal detection model, the HRM applies
Patz, Junker & Johnson, 2002
$$
\left.
\begin{aligned}
\theta_i &\sim \text{i.i.d. } N(\mu,\sigma^2), \quad i = 1,\dots,N \\
\xi_{ij} &\sim \text{an IRT model (e.g., PCM)}, \quad j = 1,\dots,J, \text{ for each } i \\
X_{ijr} &\sim \text{a signal detection model}, \quad r = 1,\dots,R, \text{ for each } i,j
\end{aligned}
\right\}\ \text{HRM levels}
$$
Hierarchical Rater Model
• Raters detect true item score (i.e., ‘ideal rating’) with a degree of bias and imprecision
Example: a rater with bias φ_r = −.2 and variability ψ_r = .5, scoring a response with ideal rating ξ = 3, assigns scores 2, 3, and 4 with probabilities p_{32r} = .08, p_{33r} = .64, and p_{34r} = .27.
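For reference, the signal detection layer of the HRM treats the observed rating as a discretized normal kernel centered at the ideal rating shifted by the rater’s bias. A minimal sketch of one common statement of this layer (following Patz, Junker & Johnson, 2002; the exact parameterization and sign convention for the bias term may differ from the slides) is

$$
P\left[\,X_{ijr} = k \mid \xi_{ij} = \xi\,\right] \;\propto\;
\exp\!\left\{-\frac{1}{2\psi_r^2}\,\bigl(k - (\xi + \phi_r)\bigr)^2\right\},
\qquad k = 0,\dots,K-1,
$$

where φ_r governs rater r’s bias (shift of observed scores relative to the ideal rating) and ψ_r governs rater r’s variability; probabilities such as the p_{3kr} values above are obtained by normalizing a kernel of this form over the score categories.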
Hierarchical Rater Model (cont.)
• Examinees respond to items according to a polytomous item response theory model (here PCM; could be GPCM, GRM, others):
$$
P\left[\,\xi_{ij} = \xi \mid \theta_i, \beta_j, \gamma_j\,\right]
= \frac{\exp\left\{\sum_{k=1}^{\xi}\left(\theta_i - \beta_j - \gamma_{jk}\right)\right\}}
       {\sum_{h=0}^{K-1}\exp\left\{\sum_{k=1}^{h}\left(\theta_i - \beta_j - \gamma_{jk}\right)\right\}},
\qquad
\theta_i \sim \text{i.i.d. } N(\mu,\sigma^2), \quad i = 1,\dots,N.
$$
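A minimal Python sketch of these PCM category probabilities (the function name and input conventions are illustrative assumptions, not code from the talk; the empty sum for the ξ = 0 category contributes exp(0) = 1):

import numpy as np

def pcm_category_probs(theta, beta, gamma):
    # theta: examinee proficiency; beta: item difficulty;
    # gamma: array of K-1 step parameters gamma_{j1}, ..., gamma_{j,K-1}.
    steps = theta - beta - np.asarray(gamma, dtype=float)
    # Cumulative sums give the numerator exponents for xi = 1, ..., K-1;
    # the xi = 0 category gets an exponent of 0.
    exponents = np.concatenate(([0.0], np.cumsum(steps)))
    exponents -= exponents.max()  # numerical stability
    probs = np.exp(exponents)
    return probs / probs.sum()

# Example with hypothetical values for a 5-category item:
# pcm_category_probs(theta=0.5, beta=0.0, gamma=[-1.0, -0.3, 0.4, 1.2])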
HRM Estimation
• Most straightforward to estimate using Markov chain Monte Carlo
• Uninformative priors specified in Patz et al. (2002) and Casabianca et al. (2016)
• WinBUGS/JAGS (may be called from within R)
• HRM has been estimated using maximum likelihood and posterior modal estimation (Donoghue and Hombo, 2001; DeCarlo et al, 2011)
Facets Alternative
• Facets (Linacre) models can capture rater effects:

$$
P\left[\,\xi_{ij} = \xi \mid \theta_i, \beta_j, \gamma_j, \lambda_{rj}\,\right]
= \frac{\exp\left\{\sum_{k=1}^{\xi}\left(\theta_i - \beta_j - \gamma_{jk} - \lambda_{rjk}\right)\right\}}
       {\sum_{h=0}^{K-1}\exp\left\{\sum_{k=1}^{h}\left(\theta_i - \beta_j - \gamma_{jk} - \lambda_{rjk}\right)\right\}}
$$

where 𝜆rjk is the effect rater r has on category k of item j. Note: rater effects 𝜆 may be constant for all levels of an item, for all items at a given level, or for all levels of all items.
Every rater-item combination has a unique ICC
Facets models have proven highly useful in the detection and mitigation of rater effects in operational scoring (e.g., Wang & Wilson, 2005; Myford & Wolfe, 2004)
Dependence structure of Facets models
• Ratings are directly related to proficiency
• Arbitrarily precise 𝜃 estimation is achievable by increasing the number of ratings R
• Alternatives (other than HRM) include:
• Rater Bundle Model (Wilson & Hoskens, 2001)
• Design-effect-like correction (Bock, Brennan, Muraki, 1999)
Applications & Extensions of HRM
• Detecting rater effects and “modality” effects in Florida assessment program (Patz, Junker, Johnson, 2002)
• 360-degree feedback data (Barr & Raju, 2003)
• Rater covariates, applied to Golden State Exam (image vs. paper study) (Mariano & Junker, 2007)
• Latent classes for raters, applied to large-scale language assessment (DeCarlo et al, 2011)
• Machine (i.e., automated) and human scoring (Casabianca et al, 2016)
HRM with rater covariates
• Introduce design matrix 𝛶 associating individual raters to their covariates
• Bias and variability of ratings vary according to rater characteristics
Bias: φ_r modeled as a function of the rater’s covariates in 𝛶 [equation not reproduced]
Variability: ψ_r modeled as a function of the rater’s covariates in 𝛶 [equation not reproduced]
Application with Human and Machine Ratings
• Statewide writing assessment program (provided by CTB)
• 5 dimensions of writing (“items”); each on 1-6 rubric
• 487 examinees
• 36 raters: 18 male, 17 female, 1 machine
• Each paper scored by four raters (1 machine, 3 humans)
• 9740 ratings in total
Results by “gender”
• Male and female raters show very similar (and on average negligible) bias
• The machine is less variable (especially compared with male raters) and more severe (not significant)
• Individual rater bias and severity are informative (next slide)
Individual rater estimates may be diagnostic. Figure highlights (figure not reproduced):
• Most lenient: r = 11
• Most harsh and least variable: r = 20 (problematic pattern confirmed)
• Most variable: r = 29
Continued Research
• HRM presents a systematic way to simulate rater behavior
• What range of variability and bias are typical? Good? Problematic?
• Realistic simulations yielding predictable agreement rates, quadratic weighted kappa statistics, etc.?
• What are the downstream impacts of rater problems on: Measurement accuracy? Equating? Engine training?
• To what degree, and how, might modeling of raters (often unidentified or ignored) improve machine learning results in the training of automated scoring engines?
• Under what conditions should different (esp. more granular) signal detection models be used within the HRM framework?
Quadratic Weighted Kappa
• Penalizes non-adjacent disagreement more than unweighted kappa or linearly (|i − j|) weighted kappa
• Widely used as a prediction accuracy metric in machine learning
• Kappa statistics are an important supplement to rates of agreement (exact/adjacent) in operational rating
$$
\kappa = 1 - \frac{\sum_{i,j} w_{i,j}\, O_{i,j}}{\sum_{i,j} w_{i,j}\, E_{i,j}},
\qquad
w_{i,j} = \frac{(i-j)^2}{(N-1)^2},
$$

where O_{i,j} is the observed count in cell i,j, E_{i,j} is the expected count in cell i,j under chance agreement, and N is the number of score categories.
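A minimal Python sketch of this statistic (function and variable names are illustrative, not from the talk):

import numpy as np

def quadratic_weighted_kappa(scores_a, scores_b, n_categories):
    # Observed rater-by-rater contingency table O.
    O = np.zeros((n_categories, n_categories))
    for a, b in zip(scores_a, scores_b):
        O[a, b] += 1
    # Expected counts E under chance agreement (independent margins).
    E = np.outer(O.sum(axis=1), O.sum(axis=0)) / O.sum()
    # Quadratic weights w_ij = (i - j)^2 / (N - 1)^2.
    idx = np.arange(n_categories)
    w = (idx[:, None] - idx[None, :]) ** 2 / (n_categories - 1) ** 2
    return 1.0 - (w * O).sum() / (w * E).sum()

# Example with two hypothetical raters scoring in categories 0-5 (a 1-6 rubric shifted to 0-5):
# quadratic_weighted_kappa([0, 2, 3, 5, 4], [0, 2, 4, 5, 4], 6)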
HRM Rater Noise
• How does HRM signal detection accuracy impact reliability and agreement statistics for rated items?
• Use HRM to simulate realistic patterns of rater behavior
• Example
• For 10,000 examinees with normally distributed proficiencies
• True item scores (ideal ratings) from PCM/RSM: 10 items, 5 levels per item
• Vary the rater variability parameter ψ_r, with rater bias φ_r = 0
Ideal ratings follow the PCM (ψ_r = 0). [Figure of the ideal-rating distributions not reproduced; a simulation sketch follows below.]
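A minimal sketch of this kind of simulation in Python, reusing pcm_category_probs and quadratic_weighted_kappa from the sketches above and assuming the normal-kernel signal detection form sketched earlier; all parameter values are illustrative placeholders rather than the ones used in the talk:

import numpy as np

rng = np.random.default_rng(0)

def rater_probs(xi, phi, psi, n_categories):
    # Signal detection layer (sketch): observed score centered at ideal rating xi
    # plus rater bias phi, with spread psi, normalized over the categories.
    k = np.arange(n_categories)
    logits = -((k - (xi + phi)) ** 2) / (2.0 * psi ** 2)
    p = np.exp(logits - logits.max())
    return p / p.sum()

n_examinees, n_items, n_categories = 10000, 10, 5
theta = rng.normal(0.0, 1.0, n_examinees)   # normally distributed proficiencies
beta = rng.normal(0.0, 0.5, n_items)        # item difficulties (placeholder values)
gamma = np.sort(rng.normal(0.0, 1.0, (n_items, n_categories - 1)), axis=1)  # step parameters

# Ideal ratings (true item scores) drawn from the PCM.
xi = np.empty((n_examinees, n_items), dtype=int)
for i in range(n_examinees):
    for j in range(n_items):
        xi[i, j] = rng.choice(n_categories, p=pcm_category_probs(theta[i], beta[j], gamma[j]))

# Two independent ratings per response with bias phi = 0 and variability psi.
psi = 0.5
ratings = np.empty((2, n_examinees, n_items), dtype=int)
for r in range(2):
    for i in range(n_examinees):
        for j in range(n_items):
            ratings[r, i, j] = rng.choice(n_categories, p=rater_probs(xi[i, j], 0.0, psi, n_categories))

# Downstream agreement statistics as a function of rater noise.
exact_agreement = (ratings[0] == ratings[1]).mean()
qwk = quadratic_weighted_kappa(ratings[0].ravel(), ratings[1].ravel(), n_categories)
print(f"psi = {psi}: exact agreement = {exact_agreement:.3f}, QWK = {qwk:.3f}")

Varying ψ over a grid (and allowing nonzero φ) traces out how agreement rates and quadratic weighted kappa degrade as rater noise and bias grow.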
Results [figures not reproduced]
Summary
• Constructed response item formats remain important
• Technology making these formats more feasible
• Modeling rater behavior is important
• HRM provides useful framework for characterizing rater error patterns, from humans and/or machines
• HRM signal detection model layer useful in simulation
• Modeling raters may improve machine scoring solutions (TBD)