Humans and Machines: Modeling the Stochastic Behavior of Raters
in Educational Assessment
Richard J. Patz
BEAR Seminar
UC Berkeley Graduate School of Education
February 13, 2018
Outline of Topics
• Natural human responses in educational assessment
• Technology in education, assessment and scoring
• Computational methods for automated scoring (NLP, LSA, ML)
• Rating information in statistical and psychometric analysis: Challenges
• Unreliability and bias
• Combining information from multiple ratings
• Hierarchical rater model (HRM)
• Applications
• Comparing machine to humans
• Simulating human rating errors to further related research
Why natural, constructed response formats in assessment?
• Learning involves constructing knowledge and expressing through language (written and/or oral)
• Assessments should consist of ‘authentic’ tasks, i.e., of a type that students encounter during instruction
• Artificially contrived item formats (e.g., multiple-choice) advantage skills unrelated to the intended construct
• Some constructs (e.g., essay writing) simply can’t be measured through selected-response formats
Disadvantages of Constructed-Response formats
• Time consuming for examinees (fewer items per unit time)
• Require expensive human ratings (typically)
• Create delay in providing scores, reports
• Human rating is error-prone
• Consistency across rating events difficult to maintain
• Inconsistency impairs comparability
• Combining multiple ratings creates modeling, scoring problems
Practical Balancing of Priorities
• Mix constructed-response formats with selected-response formats to realize the benefits of each
• Leverage technology in the scoring of CR items
• Rule-based scoring (exhaustively enumerated/constrained)
• Natural language processing and subsequent automated rating for written (sometimes spoken) responses
• Made more practical with computer-based test delivery
Technology for Automated Scoring
• Ten years ago there were relatively few providers
• Expensive, proprietary algorithms
• Specialized expertise (NLP, LSA, AI)
• Laborious, ‘hand-crafted’ engine training
• Today, solutions are much more widely available
• Students now fit automated scoring (AS) models in CS and STAT classes
• Open source libraries abound
• Machine learning and neural networks: accessible, powerful, and up to the job
• Validity and reliability challenges remain
• Impact of algorithms on instruction, e.g., in writing? Also threat of gaming strategies
• Managing algorithm improvements, examinee adaptations, over time
• Quality human scores needed to train the machines (supervised learning)
• Biases or other problems in human ratings ‘learned’ by algorithms
• Combining scores from machines and humans
Machine Learning for Automated Essay Scoring
Example characteristics:
• Words processed in relation to a corpus for frequency, etc.
• N-grams (word pairs, triplets, etc.)
• Transformations (non-linear, sinusoidal) and dimensionality reduction
• Iterations improving along a gradient, with memory of previous states
• Data split into training and validation; maximizing prediction accuracy on validation; little else “interpretable” about parameters
Example architecture: Taghipour & Ng (2016) [figure not reproduced]
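To make the listed characteristics concrete, here is a minimal, illustrative Python sketch of a feature-based approach (word and n-gram frequencies plus a simple regression). All data, names, and settings are hypothetical placeholders; this is not the Taghipour & Ng (2016) neural architecture, just a toy counterpart using scikit-learn:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Ridge

# Hypothetical training data: essay texts with human scores on a 1-6 rubric.
essays = ["The author argues that ...", "In my opinion the evidence ...", "This essay explains ..."]
scores = [4, 3, 5]

# Words and n-grams (unigrams and bigrams), weighted relative to corpus frequency (TF-IDF).
vectorizer = TfidfVectorizer(ngram_range=(1, 2), min_df=1)
X = vectorizer.fit_transform(essays)

# Simple regression predicting the human score from the features.
model = Ridge(alpha=1.0).fit(X, scores)

# Score a new (hypothetical) essay; in practice accuracy is checked on a held-out validation split.
print(model.predict(vectorizer.transform(["The evidence supports the claim because ..."])))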
Focus of Research
• Situating the rating process within the overall measurement and statistical context
• The hierarchical rater model (HRM)
• Accounting for multiple ratings ‘correctly’
• Contrast with alternative approaches, e.g., Facets
• Simultaneous analysis of human and machine ratings
• Example from large-scale writing assessment
• Leveraging models of human rating behavior for better simulation, examination of impacts on inferences
Hierarchical Structure of Rated Item Response Data
• If all levels follow normal distributions, then Generalizability Theory applies
• Estimates at any level weight the data mean and the prior mean, using a ‘generalizability coefficient’
• If ‘ideal ratings’ follow an IRT model and observed ratings follow a signal detection model, the HRM applies
Patz, Junker & Johnson, 2002
$$
\left.
\begin{aligned}
\theta_i &\sim \text{i.i.d. } N(\mu,\sigma^2), \quad i = 1,\dots,N \\
\xi_{ij} &\sim \text{an IRT model (e.g., PCM)}, \quad j = 1,\dots,J, \text{ for each } i \\
X_{ijr} &\sim \text{a signal detection model}, \quad r = 1,\dots,R, \text{ for each } i,j
\end{aligned}
\right\}\ \text{HRM levels}
$$
Hierarchical Rater Model
• Raters detect true item score (i.e., ‘ideal rating’) with a degree of bias and imprecision
Example: a rater with bias φ_r = −.2 and variability ψ_r = .5, scoring a response with ideal rating ξ = 3, assigns scores 2, 3, and 4 with probabilities p_{32r} = .08, p_{33r} = .64, and p_{34r} = .27.
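For reference, the signal detection layer of the HRM treats the observed rating as a discretized normal kernel centered at the ideal rating shifted by the rater’s bias. A minimal sketch of one common statement of this layer (following Patz, Junker & Johnson, 2002; the exact parameterization and sign convention for the bias term may differ from the slides) is

$$
P\left[\,X_{ijr} = k \mid \xi_{ij} = \xi\,\right] \;\propto\;
\exp\!\left\{-\frac{1}{2\psi_r^2}\,\bigl(k - (\xi + \phi_r)\bigr)^2\right\},
\qquad k = 0,\dots,K-1,
$$

where φ_r governs rater r’s bias (shift of observed scores relative to the ideal rating) and ψ_r governs rater r’s variability; probabilities such as the p_{3kr} values above are obtained by normalizing a kernel of this form over the score categories.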
Hierarchical Rater Model (cont.)
• Examinees respond to items according to a polytomous item response theory model (here PCM; could be GPCM, GRM, others):
$$
P\left[\,\xi_{ij} = \xi \mid \theta_i, \beta_j, \gamma_j\,\right]
= \frac{\exp\left\{\sum_{k=1}^{\xi}\left(\theta_i - \beta_j - \gamma_{jk}\right)\right\}}
       {\sum_{h=0}^{K-1}\exp\left\{\sum_{k=1}^{h}\left(\theta_i - \beta_j - \gamma_{jk}\right)\right\}},
\qquad
\theta_i \sim \text{i.i.d. } N(\mu,\sigma^2), \quad i = 1,\dots,N.
$$
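A minimal Python sketch of these PCM category probabilities (the function name and input conventions are illustrative assumptions, not code from the talk; the empty sum for the ξ = 0 category contributes exp(0) = 1):

import numpy as np

def pcm_category_probs(theta, beta, gamma):
    # theta: examinee proficiency; beta: item difficulty;
    # gamma: array of K-1 step parameters gamma_{j1}, ..., gamma_{j,K-1}.
    steps = theta - beta - np.asarray(gamma, dtype=float)
    # Cumulative sums give the numerator exponents for xi = 1, ..., K-1;
    # the xi = 0 category gets an exponent of 0.
    exponents = np.concatenate(([0.0], np.cumsum(steps)))
    exponents -= exponents.max()  # numerical stability
    probs = np.exp(exponents)
    return probs / probs.sum()

# Example with hypothetical values for a 5-category item:
# pcm_category_probs(theta=0.5, beta=0.0, gamma=[-1.0, -0.3, 0.4, 1.2])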
HRM Estimation
• Most straightforward to estimate using Markov chain Monte Carlo
• Uninformative priors specified in Patz et al. (2002) and Casabianca et al. (2016)
• WinBUGS/JAGS (may be called from within R)
• HRM has been estimated using maximum likelihood and posterior modal estimation (Donoghue and Hombo, 2001; DeCarlo et al, 2011)
Facets Alternative
• Facets (Linacre) models can capture rater effects:

$$
P\left[\,\xi_{ij} = \xi \mid \theta_i, \beta_j, \gamma_j, \lambda_{rj}\,\right]
= \frac{\exp\left\{\sum_{k=1}^{\xi}\left(\theta_i - \beta_j - \gamma_{jk} - \lambda_{rjk}\right)\right\}}
       {\sum_{h=0}^{K-1}\exp\left\{\sum_{k=1}^{h}\left(\theta_i - \beta_j - \gamma_{jk} - \lambda_{rjk}\right)\right\}}
$$

where 𝜆rjk is the effect rater r has on category k of item j. Note: rater effects 𝜆 may be constant for all levels of an item, for all items at a given level, or for all levels of all items.
Every rater-item combination has a unique ICC
Facets models have proven highly useful in the detection and mitigation of rater effects in operational scoring (e.g., Wang & Wilson, 2005; Myford & Wolfe, 2004)
Dependence structure of Facets models
• Ratings are directly related to proficiency
• Arbitrarily precise 𝜃 estimation is achievable by increasing the number of ratings R
• Alternatives (other than HRM) include:
• Rater Bundle Model (Wilson & Hoskens, 2001)
• Design-effect-like correction (Bock, Brennan, Muraki, 1999)
Applications & Extensions of HRM
• Detecting rater effects and “modality” effects in Florida assessment program (Patz, Junker, Johnson, 2002)
• 360-degree feedback data (Barr & Raju, 2003)
• Rater covariates, applied to Golden State Exam (image vs. paper study) (Mariano & Junker, 2007)
• Latent classes for raters, applied to large-scale language assessment (DeCarlo et al, 2011)
• Machine (i.e., automated) and human scoring (Casabianca et al, 2016)
HRM with rater covariates
• Introduce design matrix 𝛶 associating individual raters to their covariates
• Bias and variability of ratings vary according to rater characteristics
Bias: φ_r modeled as a function of the rater’s covariates in 𝛶 [equation not reproduced]
Variability: ψ_r modeled as a function of the rater’s covariates in 𝛶 [equation not reproduced]
Application with Human and Machine Ratings
• Statewide writing assessment program (provided by CTB)
• 5 dimensions of writing (“items”); each on 1-6 rubric
• 487 examinees
• 36 raters: 18 male, 17 female, 1 machine
• Each paper scored by four raters (1 machine, 3 humans)
• 9740 ratings in total
Results by “gender”
• Male and female raters show very similar (and on average negligible) bias
• The machine is less variable (especially compared with male raters) and more severe (not significant)
• Individual rater bias and severity are informative (next slide)
Individual rater estimates may be diagnostic. Figure highlights (figure not reproduced):
• Most lenient: r = 11
• Most harsh and least variable: r = 20 (problematic pattern confirmed)
• Most variable: r = 29
Continued Research
• HRM presents a systematic way to simulate rater behavior
• What range of variability and bias are typical? Good? Problematic?
• Realistic simulations yielding predictable agreement rates, quadratic weighted kappa statistics, etc.?
• What are the downstream impacts of rater problems on: Measurement accuracy? Equating? Engine training?
• To what degree, and how, might modeling of raters (often unidentified or ignored) improve machine learning results in the training of automated scoring engines?
• Under what conditions should different (esp. more granular) signal detection models be used within the HRM framework?
Quadratic Weighted Kappa
• Penalizes non-adjacent disagreement more than unweighted kappa or linearly (|i − j|) weighted kappa
• Widely used as a prediction accuracy metric in machine learning
• Kappa statistics are an important supplement to rates of agreement (exact/adjacent) in operational rating
$$
\kappa = 1 - \frac{\sum_{i,j} w_{i,j}\, O_{i,j}}{\sum_{i,j} w_{i,j}\, E_{i,j}},
\qquad
w_{i,j} = \frac{(i-j)^2}{(N-1)^2},
$$

where O_{i,j} is the observed count in cell i,j, E_{i,j} is the expected count in cell i,j under chance agreement, and N is the number of score categories.
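A minimal Python sketch of this statistic (function and variable names are illustrative, not from the talk):

import numpy as np

def quadratic_weighted_kappa(scores_a, scores_b, n_categories):
    # Observed rater-by-rater contingency table O.
    O = np.zeros((n_categories, n_categories))
    for a, b in zip(scores_a, scores_b):
        O[a, b] += 1
    # Expected counts E under chance agreement (independent margins).
    E = np.outer(O.sum(axis=1), O.sum(axis=0)) / O.sum()
    # Quadratic weights w_ij = (i - j)^2 / (N - 1)^2.
    idx = np.arange(n_categories)
    w = (idx[:, None] - idx[None, :]) ** 2 / (n_categories - 1) ** 2
    return 1.0 - (w * O).sum() / (w * E).sum()

# Example with two hypothetical raters scoring in categories 0-5 (a 1-6 rubric shifted to 0-5):
# quadratic_weighted_kappa([0, 2, 3, 5, 4], [0, 2, 4, 5, 4], 6)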
HRM Rater Noise
• How does HRM signal detection accuracy impact reliability and agreement statistics for rated items?
• Use HRM to simulate realistic patterns of rater behavior
• Example
• For 10,000 examinees with normally distributed proficiencies
• True item scores (ideal ratings) from PCM/RSM: 10 items, 5 levels per item
• Vary the rater variability parameter ψ_r, with rater bias φ_r = 0
Ideal ratings follow the PCM (ψ_r = 0). [Figure of the ideal-rating distributions not reproduced; a simulation sketch follows below.]
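A minimal sketch of this kind of simulation in Python, reusing pcm_category_probs and quadratic_weighted_kappa from the sketches above and assuming the normal-kernel signal detection form sketched earlier; all parameter values are illustrative placeholders rather than the ones used in the talk:

import numpy as np

rng = np.random.default_rng(0)

def rater_probs(xi, phi, psi, n_categories):
    # Signal detection layer (sketch): observed score centered at ideal rating xi
    # plus rater bias phi, with spread psi, normalized over the categories.
    k = np.arange(n_categories)
    logits = -((k - (xi + phi)) ** 2) / (2.0 * psi ** 2)
    p = np.exp(logits - logits.max())
    return p / p.sum()

n_examinees, n_items, n_categories = 10000, 10, 5
theta = rng.normal(0.0, 1.0, n_examinees)   # normally distributed proficiencies
beta = rng.normal(0.0, 0.5, n_items)        # item difficulties (placeholder values)
gamma = np.sort(rng.normal(0.0, 1.0, (n_items, n_categories - 1)), axis=1)  # step parameters

# Ideal ratings (true item scores) drawn from the PCM.
xi = np.empty((n_examinees, n_items), dtype=int)
for i in range(n_examinees):
    for j in range(n_items):
        xi[i, j] = rng.choice(n_categories, p=pcm_category_probs(theta[i], beta[j], gamma[j]))

# Two independent ratings per response with bias phi = 0 and variability psi.
psi = 0.5
ratings = np.empty((2, n_examinees, n_items), dtype=int)
for r in range(2):
    for i in range(n_examinees):
        for j in range(n_items):
            ratings[r, i, j] = rng.choice(n_categories, p=rater_probs(xi[i, j], 0.0, psi, n_categories))

# Downstream agreement statistics as a function of rater noise.
exact_agreement = (ratings[0] == ratings[1]).mean()
qwk = quadratic_weighted_kappa(ratings[0].ravel(), ratings[1].ravel(), n_categories)
print(f"psi = {psi}: exact agreement = {exact_agreement:.3f}, QWK = {qwk:.3f}")

Varying ψ over a grid (and allowing nonzero φ) traces out how agreement rates and quadratic weighted kappa degrade as rater noise and bias grow.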
Results [figures not reproduced]
Summary
• Constructed response item formats remain important
• Technology making these formats more feasible
• Modeling rater behavior is important
• HRM provides useful framework for characterizing rater error patterns, from humans and/or machines
• HRM signal detection model layer useful in simulation
• Modeling raters may improve machine scoring solutions (TBD)