RESEARCH REPORT June 2003 RR-03-15
Research & Development Division Princeton, NJ 08541
Improving the Statistical Aspects of E-rater®: Exploring Alternative Feature Reduction and Combination Rules
Xin Feng Neil J. Dorans Liane N. Patsula Bruce Kaplan
Improving the Statistical Aspects of E-rater®:
Exploring Alternative Feature Reduction and Combination Rules
Xin Feng
Department of Statistics, Columbia University
Neil J. Dorans, Liane N. Patsula, and Bruce Kaplan
Educational Testing Service, Princeton, New Jersey
June 2003
Research Reports provide preliminary and limited dissemination of ETS research prior to publication. They are available without charge from:
Research Publications Office Mail Stop 10-R Educational Testing Service Princeton, NJ 08541
Abstract
This study explores alternative ways of reducing the number of variables/features and additional
ways of combining information across features to produce more stable and accurate e-rater®
scores. Following an explanation of the statistical aspects of the process is a description of
alternatives to the process. Our explorations resulted in certain conclusions and directions for
future research. We have examined enough e-rater data to conclude that stepwise regression
seems to be effective as a feature reduction procedure. However, this may be attributed to the
consistently strong relationship with essay score that is observed for the content vector analysis
(CVA) variables and the two variables used to approximate word length (number of auxiliary
verbs and the ratio of the number of auxiliary verbs to the number of words). To yield better
validation results, we also suggest that the hold-out method for evaluating validity should replace
the current two-stage approach of first developing a model in a quasi-uniform training sample
and then validating these results in a target cross-validation sample. More research is needed in
several areas. First, explicit modeling of the part of essay scores that is unrelated to word length
is warranted. The POM (Proportional Odds Model) approach should be investigated in greater
depth. Also needed is a statistical justification for using essay scores to score CVA variables.
Algorithmic approaches to prediction/classification problems, such as boosting, may prove
fruitful. Further investigation of quantile regression and ridge regression should be conducted.
Key words: e-rater®, automated essay scoring, prediction, classification
Acknowledgements
This research occurred while the first author was an ETS Harold Gulliksen Psychometric Fellow.
This research was guided by the advice of a research team that included Paul Holland, Nan
Kong, and Sandip Sinharay. The authors are grateful to these colleagues for their advice and
critiques. Dan Eignor, Shelby Haberman, and Hariharan Swaminathan also provided helpful
comments. Any opinions expressed in this article are those of the authors and not necessarily of
Educational Testing Service.
Table of Contents
1. Introduction
2. Statistical Aspects of the Process
2.1 Data
2.1.1 Variables
2.1.2 Sample
2.2 Methods
2.2.1 Content Vector Analysis
2.2.2 Forward Stepwise Version of OLS Regression
3. Alternatives to the Process
3.1 Alternative Data Source for Model Development
3.2 Alternatives to Stepwise Regression for Variable Selection/Reduction
3.2.1 No Selection/No Reduction
3.2.2 Equal Weights Regression
3.2.3 Principal Component Analysis
3.2.4 A Novel Application of Content Vector Analysis
3.3 Alternatives to Stepwise Version of OLS Regression for Model Development
3.3.1 Quantile Regression
3.3.2 Ridge Regression
3.3.3 Classification and Proportional Odds Model
4. Results
4.1 How to Assess Adequacy of Models
4.1.1 Kappa and Weighted Kappa
4.1.2 Average Loss
4.2 Results Comparison
4.2.1 OLS, PCA, and EW
4.2.2 Ridge Regression and a Novel Application of CVA
4.2.3 POM and Quantile Regression
4.3 Other Results
4.3.1 Effects of the Number of Words
5. Conclusions, Recommendations, and Future Research
5.1 Conclusions
5.2 Future Research
References
List of Appendixes
List of Tables
Table 1. Essay Score Distribution for Training and Cross-validation Data
Table 2. Confusion Matrix
Table 3. Loss Matrix
Table 4. Exact Agreement Rate and Kappa for a GMAT Essay Prompt
Table 5. Hold-out Exact Agreement Rate and Kappa for a GMAT Essay Prompt
Table 6. Average Loss for a GMAT Essay Prompt
Table 7. Asymmetric Loss Matrix
Table 8. Adjacent Loss Matrix
Table 9. Generalized Kappa for the Same GMAT Essay Prompt in Table 8
Table 10. Exact Agreement of Two-step Regression on a Combined Sample
Table 11. Kappa of Two-step Regression on a Combined Sample
1. Introduction
Educational Testing Service (ETS) developed the e-rater® (electronic essay rater) system
to address the practical concerns that limit the number of essay questions on current tests, such as
the time and costs associated with human scoring. The system, which became operational in
1998, scores essays automatically.
E-rater, by design, provides a distinct scoring model for each new essay topic. Each
topic-specific model is built on a sample of scored essays, distributed across the set of all non-
zero rating points ranging from 1 to 6. As originally formulated, e-rater model development
involved using a hybrid feature methodology that incorporated several variables that are either
derived statistically, extracted through Natural Language Processing (NLP) techniques, or
measured by simple counting procedures. The e-rater approach is constantly changing. At one
time, there were 56 variables, which were referred to as features. These 56 features would be
entered into a stepwise linear regression analysis that is performed in a training dataset to
identify a subset of features that parsimoniously explains observed variation in the human rater
consensus scores. There was a regression model/equation developed for each essay topic.
Finally, score predictions for a cross-validation dataset were obtained by using the
estimated equation from the training sample in a validation sample. When the resulting score
predictions exhibited sufficient agreement with corresponding human ratings, the scoring/e-rater
model was considered to have been validated. For more details, see Burstein et al. (1998).
As already alluded to, there are both linguistic and statistical components to the e-rater
process. We focused on the statistical component. As noted, there is a separate e-rater model
developed for each essay topic/prompt. Stepwise linear regression plays a prominent role in this
process. It reduces the number of features associated with a prompt by, in effect, giving zero
weight to most of the predictors. As a feature reduction procedure, stepwise regression may
capitalize on chance relationships in the data. In one sample, a large regression coefficient
estimate may be assigned to a chosen variable, while the estimated coefficients of variables that
are highly correlated to that chosen variable are near zero in value. Yet in another sample, the
previously chosen variable may be excluded in favor of one of the other variables that had been
excluded in the first sample. In addition, as a combination rule for available data, stepwise
regression, by discarding much of the data, ignores a large amount of potentially useful
information.
We explore alternative ways of reducing the number of variables/features and alternative
ways of combining information across features to produce more stable and accurate e-rater
scores. The remainder of this paper is divided into four sections. Section 2 offers a description of
the statistical aspects of the e-rater process. Section 3 follows with a description of
alternatives to the process circa 2001. Section 4 presents results for the various
approaches. Finally, Section 5 offers conclusions and recommendations for future research.
2. Statistical Aspects of the Process
In this section, the data and methods used in the process circa 2001 are described. Within
each description, shortcomings of the process are highlighted.
2.1 Data
The datasets are collected for each essay prompt. Training data are used to build the
model and then the model is cross-validated on another independent dataset to evaluate model fit
and stability. Each dataset contains the same variables collected on different samples. First, the
variables are described, followed by a description of the sample in each of the two datasets.
2.1.1 Variables
The dependent/response variable is the essay score. In the training dataset, the essay
score represents the combined judgments of two human raters. Two trained human raters score
each candidate’s essay. If the two resulting scores are no more than a single score point apart,
both scores are accepted and their rounded average is taken as the consensus score. In cases
where the two initial scores are not within a single point of each other, a third human rating is
obtained. When the third rating is equidistant from the two initial ratings, the average of all
three scores is taken as the final score. When the third rating is not equidistant from the
two initial ratings, the single most discrepant score is excluded and the average of the two
remaining scores is taken as the consensus score. In the cross-validation dataset, only the score
of one human rater is recorded.
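The adjudication rule above can be sketched in a few lines of Python. This is an illustrative reading, not the operational ETS procedure. The function names are hypothetical, exact halves are assumed to round upward, and the "most discrepant" score is assumed to be the initial rating that lies farther from the third rating. The text mentions rounding only for the two-rater case, so the sketch leaves adjudicated averages unrounded.

```python
import math


def round_half_up(value):
    """Round to the nearest integer, sending exact halves upward (assumed convention)."""
    return math.floor(value + 0.5)


def consensus_score(first, second, third=None):
    """Combine human ratings following the adjudication rule described in the text."""
    if abs(first - second) <= 1:
        # Ratings agree within one point: use their rounded average.
        return round_half_up((first + second) / 2)
    if third is None:
        raise ValueError("ratings differ by more than one point; a third rating is required")
    gap_first = abs(third - first)
    gap_second = abs(third - second)
    if gap_first == gap_second:
        # Third rating is equidistant from both initial ratings: average all three.
        return (first + second + third) / 3
    # Assumed interpretation: drop the initial rating farther from the third rating.
    kept = first if gap_first < gap_second else second
    return (kept + third) / 2
```

For example, ratings of 3 and 4 need no third rater, whereas ratings of 2 and 5 with a third rating of 4 drop the 2 and average the remaining two scores.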
For the version of e-rater we studied, there were 56 feature variables (see Appendix A).
These features are divided into three general classes: 1) discourse-related, word counts done by
NLP techniques, or by simple “counting” procedures; 2) syntax-related, word counts or ratios;
and 3) topic-related, categorical variables, derived statistically by content vector analysis (CVA)
using a “cosine correlation” technique that has the same scale values as essay scores.
2.1.2 Sample
As shown in Table 1, scores in the training sample are almost uniformly distributed and
the score distribution in the cross-validation sample is unimodal with a mild skewness. Later on,
we refer to the training dataset as the “uniform sample” and the cross-validation dataset as the
“pseudo-normal” sample.
Neither of the two datasets is a random sample from the population. Here “population”
refers to all responses for a particular essay prompt when the prompt is active. Every year there
are approximately 200,000 GMAT® examinees and more than 80 active prompts. Thus, each
prompt is administered to approximately 2,500 examinees. For both the training and cross-
validation datasets, processing waits until the pre-assigned sample size is reached. Because
sample selection is not random, the results based on such datasets may not necessarily generalize
to the population of examinees for which e-rater is used. This is described further in section 3,
“Alternatives to the Process.”
Table 1
Essay Score Distribution for Training and Cross-validation Data
Score level    Training    Cross-validation
1                    15                  19
2                    50                  59
3                    50                 124
4                    50                 154
5                    50                  99
6                    50                  39
Total               265                 494
2.2 Methods
Two statistical methods used by e-rater are content vector analysis and the forward
stepwise version of ordinary least squares (OLS) regression. A description of each follows.
2.2.1 Content Vector Analysis
Content vector analysis (CVA) is a statistical technique used to identify relationships
between words and documents. With regard to the approximate specifications in the rubric about
essay content, CVA can identify language (content words) in essays that appear to contribute to
essay score. The description of CVA below is based on Burstein et al. (1998).
Standard CVA characterizes each text document (essay) at the lexical (word) level. The
document is transformed into a list of (word, frequency) pairs, where frequency is simply the
number of times that word appeared in the document. This list constitutes a vector, which
represents the lexical content of the document. Morphological analysis then can be optionally
used to combine the counts of inflectionally related forms. For example, the words “talk,”
“talks,” and “talking” all contribute to the frequency of their stem, “talk.” In this way, a degree
of generalization is realized across morphological variants and the number of (word, frequency)
pairs is reduced. To represent a whole class of documents, such as a score level for a set of
essays, the documents in the score level are merged and concatenated and a single vector is
generated to represent the score level.
CVA as employed by e-rater refines this basic approach by assigning a weight to each
word in the vector based on the word’s salience. Salience is defined by the frequency of the word
in the document (or score level) multiplied by the inverse of its frequency over all documents.
For example, “of” may be very frequent in a given document, but its salience will be low
because it appears in all documents. If the word “infinitesimal” appears even a few times in a
document, it will likely have high salience because there are relatively few documents that
contain this word.
Finally, an essay is compared to a score level by computing a cosine correlation between
their weighted vectors. The cosine correlation is the same as Tucker’s coefficient of congruence
(Tucker, 1951). The coefficient of congruence between two columns x (word counts for a class)
and y (word counts for an individual document) is defined as the normalized inner product
between the columns x and y, namely as
\[ \frac{\sum xy}{\|x\| \cdot \|y\|}, \]
and $\|x\|^2 = \sum x^2$. The larger the value of the cosine correlation, the closer the essay is to the
class. The class that is closest to the test item is selected.
Summarized below are the steps applied to e-rater.
1. Vector construction for each essay (or score level):
a. Extract words from essay (or combined essays at a given score level)
b. Apply morphological analysis (optional)
c. Construct frequency vector
d. Apply weights based on salience to words to form weighted vector
2. Scoring:
a. Compute cosine correlation between essay vector and the vector of each score
level
b. Assign score to essay equal to the score level with the highest cosine correlation
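The steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the e-rater implementation. Tokenization is a plain whitespace split, morphological analysis is skipped, and salience is taken literally as term frequency divided by the number of score-level vectors that contain the word. All function names are hypothetical.

```python
import math
from collections import Counter


def frequency_vector(text):
    """Steps 1a and 1c: split on whitespace and count words (no morphological analysis)."""
    return Counter(text.lower().split())


def salience_weights(vector, doc_freq):
    """Step 1d (assumed form): term frequency times the inverse of the document frequency."""
    return {w: f / doc_freq[w] for w, f in vector.items() if doc_freq.get(w, 0) > 0}


def cosine(x, y):
    """Cosine correlation: normalized inner product of two sparse vectors."""
    dot = sum(v * y.get(w, 0.0) for w, v in x.items())
    norm_x = math.sqrt(sum(v * v for v in x.values()))
    norm_y = math.sqrt(sum(v * v for v in y.values()))
    return dot / (norm_x * norm_y) if norm_x and norm_y else 0.0


def cva_score(essay, essays_by_level):
    """Step 2: return the score level whose weighted vector is closest to the essay."""
    # One merged vector per score level.
    level_vectors = {
        level: frequency_vector(" ".join(texts))
        for level, texts in essays_by_level.items()
    }
    # Document frequency counted over the score-level vectors (an assumption).
    doc_freq = Counter()
    for vec in level_vectors.values():
        doc_freq.update(vec.keys())
    weighted = {lvl: salience_weights(vec, doc_freq) for lvl, vec in level_vectors.items()}
    essay_vec = salience_weights(frequency_vector(essay), doc_freq)
    return max(weighted, key=lambda lvl: cosine(essay_vec, weighted[lvl]))
```

Words that appear in no training essay are dropped from the essay vector, since they carry no information about any score level.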
2.2.2 Forward Stepwise Version of OLS Regression
The forward stepwise version of OLS regression is used to reduce the number of variables
and build e-rater models. Essentially, this search method develops a sequence of regression
models, at each step adding or deleting a predictor variable. The criterion for adding or deleting a
predictor variable can be expressed in terms of residual sum of squares reduction, coefficient of
partial correlation, t-statistic or F-statistic. The full description of the algorithm can be found in
standard linear regression textbooks, for example, chapter 8 in Weisberg (1985) and chapter 8 in
Neter, Kutner, Nachtsheim, and Wasserman (1996).
Forward stepwise regression methods are easy to explain and inexpensive to
compute compared with all-possible-regressions procedures. The comparative simplicity
of the results from stepwise regression seems to appeal to many analysts. It is probably the most
widely used automatic search method. But stepwise methods must be used with caution. The
model selected in a stepwise fashion need not optimize any reasonable criterion function for
choosing a model. The apparent ordering of the predictors is an artifact of the method and need
not reflect relationships of substantive interest. Finally, forward stepwise regression may
seriously overstate significance of results. See chapter 8 in Weisberg (1985) for an example.
The forward selection procedure is a simplified version of forward stepwise regression in
which the statistical test of whether an already-entered variable should be eliminated from the
prediction equation is bypassed. In a pilot study, we applied both forward selection and forward
stepwise schemes to some essay prompts. The variables selected by both search algorithms were
almost identical; some “dominant predictors” were always selected by both procedures and the
differences were at most two variables. Given this finding and the large number of predictors, we
used the forward selection scheme to approximate e-rater in this study.
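Forward selection, as described above, can be sketched in plain Python. This is an illustration only, not the authors' code. It uses an F-to-enter stopping rule, one of the criteria the text lists, with an assumed threshold of 4.0. The function name and the dictionary-of-columns input are hypothetical. Each candidate is kept orthogonal to the features already entered, so its squared projection onto the current residual equals the reduction in the residual sum of squares.

```python
def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _center(values):
    mean = sum(values) / len(values)
    return [v - mean for v in values]


def forward_selection(features, y, f_to_enter=4.0):
    """Greedy forward selection for OLS with an intercept.

    features maps a feature name to its list of values.
    Returns the feature names in the order they entered the model.
    """
    n = len(y)
    residual = _center(y)  # residual of the intercept-only model
    pool = {name: _center(col) for name, col in features.items()}
    tss = _dot(residual, residual)
    rss = tss
    selected = []
    while pool:
        # RSS reduction available from each remaining candidate.
        best_name, best_drop = None, 0.0
        for name, col in pool.items():
            norm_sq = _dot(col, col)
            if norm_sq <= 1e-12:
                continue  # constant, or redundant given the entered features
            drop = _dot(col, residual) ** 2 / norm_sq
            if drop > best_drop:
                best_name, best_drop = name, drop
        if best_name is None:
            break
        rss_after = rss - best_drop
        df_resid = n - len(selected) - 2  # intercept + entered features + candidate
        if df_resid <= 0:
            break
        perfect_fit = rss_after <= 1e-12 * tss
        if not perfect_fit and best_drop / (rss_after / df_resid) < f_to_enter:
            break  # F-to-enter too small: stop adding variables
        chosen = pool.pop(best_name)
        scale = _dot(chosen, chosen)
        beta = _dot(chosen, residual) / scale
        residual = [r - beta * c for r, c in zip(residual, chosen)]
        # Orthogonalize the remaining candidates against the entered feature.
        new_pool = {}
        for name, col in pool.items():
            gamma = _dot(chosen, col) / scale
            new_pool[name] = [v - gamma * c for v, c in zip(col, chosen)]
        pool = new_pool
        selected.append(best_name)
        rss = rss_after
        if perfect_fit:
            break
    return selected
```

Unlike full stepwise regression, this sketch never tests whether an already-entered variable should be removed, which is the simplification the text describes.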
Discussed in the next section are alternatives to the use of forward stepwise regression for
variable selection and feature reduction.
3. Alternatives to the Process
This section describes alternatives to uniform sampling for model
development and alternatives to using forward stepwise regression to reduce the number of
variables for the model.
3.1 Alternative Data Source for Model Development
For the process described in section 2.1.2, sample selection is not random. The results
based on such datasets will not necessarily generalize to the population of examinees for which
e-rater is used. To demonstrate that the samples used are potentially problematic, we reversed
the roles of training (uniform distribution) and cross-validation (pseudo-normal distribution)
data. Figure 1 describes the typical situation in which distributions of multiple Rs in the training
sample are higher than their cross-validated correlations. If samples were chosen from the same
population and the sampling was not problematic, we would expect lower correlations in cross-
validation than in the training sample (Dorans & Drasgow, 1978). In Figure 2, however, where
the pseudo-normal sample was used to build the model and the uniform sample was used to
validate the model, this result did not occur. Instead of finding the decrease of correlations, we
observed increases from the model-building to the model-testing sample in most cases (see
Figure 2 and numerical results in Appendixes B and C). This increase in correlations indicates
that the non-random, non-representative uniform samples typically used for training
are likely to produce optimistic estimates of performance in the population.
The difference in results when the roles of the samples are reversed is caused by direct
selection on the criterion or response variable. Least squares regression estimates the conditional
mean function $E(Y \mid X = x) = a_p + b_p x$, where the subscript p indicates “population.” When
the samples are selected based on response variable Y, least squares regression then estimates
$E(Y_u \mid X = x) = a_u + b_u x$, where “u” represents the uniform-like sample distribution. Usually
$a_p \neq a_u$ and $b_p \neq b_u$. Thus the resulting multiple-R will be misleading. In the case of e-rater, the
multiple-Rs in the training samples are higher than what is found in the full population. Hence,
the sample estimates are too optimistic, even more optimistic than what sampling error for
random samples suggests.
[Figure 1: boxplots; x-axis: Sample (Uniform, Pseudo Normal); y-axis: Correlation.]
Figure 1. Boxplots of correlations between essay score and predicted essay score of 20 essay prompts for current practice (uniform sample used to build the model and pseudo-normal sample used to cross-validate the model).
An alternative to developing an e-rater model using the uniformly distributed training
sample is to use a hold-out procedure (Lachenbruch & Mickey, 1968) on the combined sample
(training and cross-validation samples), which is a jackknife approach. Lachenbruch’s hold-out
procedure proceeds as follows:
(a) Omit one observation from the combined data set and develop the model based on the
remaining data.
(b) Classify (i.e., predict essay score) the hold-out observation using the estimated model
constructed in Step (a).
(c) Repeat Steps (a) and (b) until all of the observations are classified.
(d) Compute Kappa and average loss (more details on Kappa and loss function are in
section 4).
While this procedure takes advantage of larger sample sizes for model building, a shortcoming is
that by holding out one observation at a time, we need to build the model hundreds of times,
which might cause practical computing time concerns.
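Steps (a) through (c) of the hold-out procedure can be sketched as below. This is a minimal illustration, not the study's code. A one-predictor least squares line stands in for the e-rater regression model, and exact agreement stands in for the Kappa and average-loss summaries of Step (d). All function names are hypothetical.

```python
def fit_line(xs, ys):
    """Least squares fit of y = a + b*x for a single predictor (stand-in model)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx else 0.0
    return mean_y - slope * mean_x, slope


def predict_score(model, x, low=1, high=6):
    """Round the fitted value to an integer score and clip it to the score scale.

    Python's round() sends exact halves to the nearest even integer.
    """
    intercept, slope = model
    return max(low, min(high, round(intercept + slope * x)))


def hold_out_predictions(xs, ys):
    """Steps (a)-(c): refit without observation i, then predict observation i."""
    predictions = []
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        model = fit_line(train_x, train_y)
        predictions.append(predict_score(model, xs[i]))
    return predictions


def exact_agreement(predicted, observed):
    """Proportion of essays whose predicted score equals the observed score."""
    return sum(p == o for p, o in zip(predicted, observed)) / len(observed)
```

The loop refits the model once per observation, which is the computing cost the text refers to.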
[Figure 2: boxplots; x-axis: Sample (Pseudo Normal, Uniform); y-axis: Correlation.]
Figure 2. Boxplots of correlations between essay score and predicted essay score of 20 essay prompts for effect of reversing roles of samples (pseudo-normal sample used to build the model and uniform sample used to cross-validate the model).
3.2 Alternatives to Stepwise Regression for Variable Selection/Reduction
In this section, four variable selection/reduction alternatives to forward stepwise
regression are presented: no selection/no reduction, equal weights regression, principal
components analysis, and a novel application of content vector analysis.
3.2.1 No Selection/No Reduction
One alternative to forward stepwise regression for variable selection/reduction is to use
all variables, which implies no feature selection or reduction. The regression model that retains
all 56 features, i.e., that does not use any variable selection scheme, is labeled “LSall” in the
tables that follow.
3.2.2 Equal Weights Regression
An alternative for variable selection/reduction is equal weights (EW) regression. EW
regression (Einhorn & Hogarth, 1975; Dorans & Drasgow, 1978) is widely used in psychology
and social science because of its stability for cross-validation. In the EW method, the criterion
(response variable) is regressed onto a composite (centroid) of standardized predictors.
Procedurally, the first step is to standardize the predictors. Then, the standardized scores are
summed to obtain a composite. A least squares regression of the criterion onto this composite is
performed. Then, through the relationship of the composite to the original predictors, this single
regression weight is converted into raw score weights for predictors. See Dorans and Drasgow
(1980) for more details.
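The EW procedure described above can be sketched as follows. This is an illustration under assumptions, not the authors' code. Predictors are standardized with the population standard deviation, every predictor is assumed to be non-constant, and the function name is hypothetical. The single composite weight b is converted to raw-score weights b / s_j, and the intercept is adjusted accordingly.

```python
import statistics


def equal_weights_regression(columns, y):
    """Regress y on the sum of standardized predictors.

    columns is a list of predictor columns (lists of equal length).
    Returns (intercept, raw_weights) on the original predictor scales.
    """
    means = [statistics.fmean(col) for col in columns]
    sds = [statistics.pstdev(col) for col in columns]  # assumes no constant column
    n = len(y)
    # Composite = sum of standardized predictors for each observation.
    composite = [
        sum((col[i] - m) / s for col, m, s in zip(columns, means, sds))
        for i in range(n)
    ]
    mean_c = statistics.fmean(composite)
    mean_y = statistics.fmean(y)
    scc = sum((c - mean_c) ** 2 for c in composite)
    scy = sum((c - mean_c) * (v - mean_y) for c, v in zip(composite, y))
    b = scy / scc                      # single regression weight on the composite
    a = mean_y - b * mean_c
    raw_weights = [b / s for s in sds]  # weight of predictor j on its raw scale
    intercept = a - b * sum(m / s for m, s in zip(means, sds))
    return intercept, raw_weights
```

Every predictor receives the same weight in standard deviation units, so only one coefficient is estimated from the data regardless of the number of features.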
In addition to the usual EW method, labeled “EWall,” we also tried two variations of the EW
method. Before computing the centroid, we eliminated some features that had very low
correlations with essay scores. We calculated the correlation matrix of features for each essay
prompt, so we had 20 matrices in total. For variation “EWmin00max03,” we included features
whose minimum correlations were greater than 0 across all 20 prompts and whose maximum
correlations all exceeded 0.3. For variation “EWmin03,” we only included features with minimum
correlations greater than 0.3 across all 20 prompts. Figure 3 is a visual summary of correlations
for the three EW-type methods applied to four GMAT essay prompts (see Appendix D for more
correlations for other prompts). Pairs of boxplots of correlations across 20 GMAT essay prompts
were used to evaluate model fit. The first boxplot of each pair, indexed by “Unif,” is for the
correlation coefficient between the essay scores in the “uniform sample” and the prediction built
in the same sample. The second boxplot, indexed by “PN,” is for the correlation coefficient
between the essay scores in the “pseudo normal sample” and the estimated scores using the
model built on the “uniform sample.” The central location of the boxplots shows the predictive
power of each regression method. The drop in location between the boxplots for a given method
shows the stability of that method. The steeper the drop in location, the less stable the method;
the higher the central location, the more predictive the model.
As can be inferred from Figure 3, the two methods that did not use features that have lower
correlations with the essay score give better results. Based on this finding, we did not consider
EWall or EWmin00max03 further. Several of the e-rater features had low, possibly negative,
correlations with human ratings.
[Figure 3: boxplots; y-axis: Correlation.]
Figure 3. Boxplots of correlations between essay score and predicted essay score of 20 essay
prompts for three EW methods.
3.2.3 Principal Component Analysis
Another alternative to forward stepwise regression for reducing the number of variables
is principal component analysis (PCA). The central idea of PCA is to reduce the dimensionality
of a dataset, which consists of a large number of interrelated variables, while retaining as much
as possible of the variation present in the dataset. This is achieved by transforming the original
features into a new set of variables, the principal components, which are uncorrelated, and
ordered so that the first few retain most of the variation present in all of the original variables.
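As a rough illustration of the idea, the first principal component can be found by power iteration on the sample covariance matrix. This is a sketch under assumptions, not the procedure used in the study. Features are centered but not standardized, the start vector is arbitrary, and convergence assumes the largest eigenvalue is strictly dominant and the start vector is not orthogonal to the leading direction. The function name is hypothetical.

```python
import math


def first_principal_component(rows, iterations=200):
    """Return (direction, scores) for the first principal component.

    rows is a list of observations, each a list of feature values.
    """
    n = len(rows)
    p = len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    centered = [[r[j] - means[j] for j in range(p)] for r in rows]
    # Sample covariance matrix of the centered features.
    cov = [
        [sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(p)]
        for a in range(p)
    ]
    # Arbitrary non-uniform start vector, normalized to unit length.
    v = [float(j + 1) for j in range(p)]
    length = math.sqrt(sum(x * x for x in v))
    v = [x / length for x in v]
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(p)) for a in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break  # degenerate case: no variance along the current direction
        v = [x / norm for x in w]
    # Component scores: projection of each centered observation onto v.
    scores = [sum(c[j] * v[j] for j in range(p)) for c in centered]
    return v, scores
```

The scores returned here would replace the original features as a single derived predictor. Further components can be extracted by removing the first direction from the data and repeating.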
3.2.4 A Novel Application of Content Vector Analysis
Finally, CVA was considered as an alternative to forward stepwise regression for
reducing the number of variables. In particular, the cosine correlation method of CVA was used
to construct a few variables out of many variables. Some e-rater features have the same string of
characters in variable names (see Appendix A). Some have the phrase “init,” while others have
the phrase “dev.” We constructed three vectors from the 56 features: “init” (all features with
“init,” which flags a word or phrase that marks the beginning of an argument), “dev” (all features
with “dev,” which flags a word or phrase that marks argument development), and “other” (all the
rest). We also tried another partition scheme. We constructed four vectors: “words” (all features
with “W”), “phrases” (all features with “P”), “claim” (all features with “claim,” which flag
words or terms related to a claim), and “other” (the rest). We identified these two methods as
“I&D” and “W&P,” respectively.
Next, we constructed the typical pattern vector, used by the cosine correlation method, by
merging the vectors of all the examinees in each score level. After getting the typical pattern
vectors, we followed the steps of CVA, calculated cosine correlations and assigned a score based
on those values. Using this CVA approach, we reduced these 56 features to either 3 or 4
variables. Least squares regressions were then run on the two reduced predictor sets.
There are at least two potential problems with cosine correlation. First, cosine correlation
is used to measure the distance between two vectors, by taking the cosine value of the angle
between the two vectors regardless of their lengths, which ignores information about the
distribution of the groups. For example, as shown in Figure 4, two groups indexed by squares
and circles have different variability. When using cosine correlation, we assign a new
observation, shown by an arrow, to the densely packed circle group. However, if we take the
shape of the two groups into account, we would assign it to the highly dispersed square group.
Of a more serious nature is the fact that score levels of the dependent variable, essay
scores, are used to define the categories of vectors against which individual score vectors are
compared and then scored. The “score” assigned to an individual’s vector of (word, frequency)
pairs is the essay score of the group most like (in a cosine correlation sense) that vector of scores.
This CVA “score” is then used to predict that person’s actual essay score. This functional
dependency may yield an artificially high level of association with essay score for variables
defined by the cosine correlation approach.
Figure 4. An illustration of a drawback of classification using cosine correlation.
3.3 Alternatives to Stepwise Version of OLS Regression for Model Development
Currently, the stepwise version of OLS regression is used by e-rater to build topic-
specific models. Although it is computationally convenient, in the context of e-rater, OLS has
several shortcomings, as outlined previously. Alternatives include quantile regression, ridge
regression, classification models, and proportional odds models. Each is described in turn.
3.3.1 Quantile Regression
Quantile regression (Koenker & Bassett, 1978) is an alternative to OLS regression that
may better meet the needs of e-rater for the primary reason that different users may be interested
in different parts of the examinees’ score distribution. For example, a graduate school admission
committee may be interested in highly competitive examinees, the higher end of the conditional
score distribution, say the 90th or 95th percentile of the conditional distribution. On the other hand, a
remedial program may focus on the lower end of the conditional score distribution. In addition,
one of the effective ways to provide diagnostic or instructional information is to compare the
differences in essays written by examinees with different writing proficiencies. Typically, the
regression relationships at different percentiles (or quantiles) of the conditional score
distributions are different. So, a simple location estimate, which is what least squares mean
regression does, may not be sufficient.
It also may be the case that a school may want to penalize underestimation and
overestimation differently. The consequences of each are often quite different. Underestimation
makes the type-I error of judging qualified examinees as less than qualified; while
overestimation makes the type-II error of misclassifying less than proficient examinees as
proficient. The effects of type-II error are bad, and those of type-I may be worse. An asymmetric
penalty would allow underestimation and overestimation to be penalized differently.
The quadratic loss function, which is the objective function of the least squares method,
is very sensitive to outliers. It penalizes each misclassification by the square of the distance
between the human rater score and predicted score. Some milder loss function, say absolute loss,
may be more appropriate.
Quantile regression can be related to OLS. As shown below, it can depict a more
complete picture of the conditional score distributions by changing a tuning parameter τ
accordingly. The objective function it minimizes is an asymmetric absolute loss function, which
is robust to outliers and puts different weights on underestimation and overestimation.
The linear conditional quantile function, Q_Y(τ | X = x_i) = x_i'β(τ), where x_i and β(τ) are
column vectors of the same length, can be estimated by solving
$$\min_{\beta} \sum_{i=1}^{n} \rho_\tau\left(y_i - x_i'\beta\right)$$
where ρ_τ(u) = u {τ − I(u < 0)} and I(.) is an indicator function.
When we choose τ > 0.5, the check function ρ_τ(u) will assign τ | u | loss to
underestimation and (1-τ) | u | to overestimation. In the special case τ = 0.5, the fitted surface is
the regression median of y on x that minimizes the absolute deviations. By linear programming,
the resulting minimization problem can be solved very efficiently (Koenker & D’Orey, 1987).
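A small numerical sketch may help make the check function concrete. It assumes the simplest possible model, a single constant prediction, for which the minimizer of the summed check loss is a sample quantile; the function names and the toy scores are illustrative only, and a real quantile regression would be fitted by linear programming as noted above.

```python
def check_loss(u, tau):
    """rho_tau(u) = u * (tau - I(u < 0)), with u = observed - predicted."""
    return u * (tau - (1.0 if u < 0 else 0.0))

def total_loss(y, pred, tau):
    """Summed check loss when every essay receives the same prediction."""
    return sum(check_loss(yi - pred, tau) for yi in y)

scores = [2, 3, 3, 4, 6]   # hypothetical human scores

# For a constant-only model the minimizer is one of the observed values,
# so a search over the observed scores is sufficient here.
best_median = min(scores, key=lambda c: total_loss(scores, c, 0.5))
best_upper = min(scores, key=lambda c: total_loss(scores, c, 0.9))

under = check_loss(1.0, 0.9)    # prediction one point too low
over = check_loss(-1.0, 0.9)    # prediction one point too high
```

With τ = 0.5 the search returns the sample median, and with τ = 0.9 it moves to the upper end of the scores because underestimation is penalized nine times as heavily as overestimation.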
3.3.2 Ridge Regression
Ridge regression (Hoerl & Kennard, 1970) is a method for overcoming serious
multicollinearity problems of OLS regression by modifying the method of least squares. When a
lot of predictors are used, collinearity may occur. That is, the observations may fall near an
(n-1)-dimensional hyperplane in an n-dimensional space. In this case, the sample covariance
matrix is singular or almost singular. If OLS is used, the resulting coefficient estimates will not
be stable and will usually be exaggerated. Ridge regression is one of several methods that have
been proposed to remedy multicollinearity problems by modifying the method of least squares to
allow biased estimators of regression coefficients. Ridge regression overcomes multicollinearity
by adding a small number, k (between 0 and 1), to the diagonal elements of the covariance
matrix. The constant k reflects the amount of bias in the estimators. When k = 0, the estimators
are reduced to the OLS regression coefficients in standard form. When k > 0, the ridge regression
coefficients are biased but tend to be more stable (i.e. less variable) than OLS estimators.
When an estimator has only a small bias and is substantially more precise than an
unbiased estimator, it may well be the preferred estimator since it will have a larger probability
of being close to the true parameter value. A measure of the combined effect of bias and
sampling variation is the mean square error. This criterion can be used to select k and the ridge
traces can be used for variable selection purpose (Neter et al., 1996). Methods of estimating k
have been suggested by many writers, including Hoerl and Kennard (1970); a survey of
suggested methods is given by Vinod and Ullah (1981).
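The stabilizing effect of k can be seen in a two-predictor sketch. It assumes the standardized (correlation) form of the problem, in which the ridge coefficients b solve (R + kI) b = r, with R the predictor correlation matrix and r the vector of predictor-response correlations. The function name and the correlation values are hypothetical and chosen only to produce strong collinearity.

```python
def ridge_two_predictors(rho, r1, r2, k):
    """Solve (R + kI) b = r for two standardized predictors.

    rho    : correlation between the two predictors
    r1, r2 : correlations of each predictor with the response
    k      : ridge constant added to the diagonal (k = 0 gives OLS)
    """
    a = 1.0 + k
    det = a * a - rho * rho          # determinant of the 2 x 2 matrix R + kI
    b1 = (a * r1 - rho * r2) / det
    b2 = (a * r2 - rho * r1) / det
    return b1, b2

# Hypothetical, highly collinear predictors (rho = 0.95).
ols_coefs = ridge_two_predictors(0.95, 0.6, 0.5, k=0.0)
ridge_coefs = ridge_two_predictors(0.95, 0.6, 0.5, k=0.1)
```

In this example the OLS solution has one large positive and one large negative coefficient even though both predictors correlate positively with the response; adding k = 0.1 pulls both coefficients toward zero.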
3.3.3 Classification and Proportional Odds Model
Finally, classification (CLFN) methods and proportional odds models (POM) may be an
improvement over OLS in estimating essay scores. Each is described in the following.
3.3.3.1 Classification. Due to the discrete nature of the response variable (essay scores),
e-rater modeling can be conceptualized as a classification problem. The goal is to find a
classification scheme to minimize the expected cost of misclassification (ECM). Let
1) Pi = prior probability of score level i, where i = 1, 2, …, 6
2) Lki = loss associated with allocating an essay to score level k when, in fact, it belongs
to score level i and
3) fi be the density associated with score level i.
The ECM is minimized by assigning an essay to the score level k for which
$$\sum_{i=1,\; i \neq k}^{6} P_i \, L_{ki} \, f_i$$
is the smallest (see Johnson & Wichern, 1992).
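The allocation rule can be sketched as follows. The density values are hypothetical stand-ins for f_i evaluated at one essay's feature vector, and the function name `ecm_assign` is an illustrative choice rather than part of e-rater.

```python
def ecm_assign(priors, densities, loss):
    """Return the score level k (1-based) minimizing sum_{i != k} P_i * L_ki * f_i.

    loss[k][i] is the loss for allocating an essay to level k when it
    belongs to level i (both indices are 0-based inside this function).
    """
    levels = range(len(priors))

    def expected_cost(k):
        return sum(priors[i] * loss[k][i] * densities[i]
                   for i in levels if i != k)

    return min(levels, key=expected_cost) + 1

priors = [1 / 6] * 6                                   # uniform prior
densities = [0.01, 0.05, 0.20, 0.40, 0.25, 0.09]       # hypothetical f_i values
zero_one = [[0 if i == k else 1 for i in range(6)] for k in range(6)]
score = ecm_assign(priors, densities, zero_one)
```

With a uniform prior and 0-1 loss, the expected cost of choosing level k is proportional to the sum of all densities other than f_k, so the rule reduces to picking the level with the largest density.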
3.3.3.2 Proportional odds model. E-rater scores can be thought of as ordered categories.
One of the most effective ways to construct a model for ordinal responses such as this is to
invoke the concept of a latent variable, Z, which we will term writing ability. The actual recorded
response Y is envisaged as a discrete manifestation of the continuous latent variable in such a
way that the relationship is monotone:
$$\alpha_{j-1} < Z \le \alpha_j \;\Longrightarrow\; Y = j$$
The “cutting-points,” α_j, are envisaged as unknown points on the latent ability scale.
With e-rater, the z-interval (−∞, α_1] is interpreted as the lowest level, whose score is 1; the
interval (α_{j−1}, α_j] is assigned a score j, j = 2, …, 6 (taking α_6 = ∞). Unless the latent abilities are close to one
of the boundaries, similar values of the latent variable are not distinguished and give rise to
identical observed grades.
The mechanism of the model requires knowledge of latent ability. Like all mathematical
models of behavior, this is an idealization of what really happens, and is not to be taken literally,
particularly at the edges. The important part is not so much the mechanism but the prediction. If
the model predictions are sufficiently close to observations and known limiting behaviors, the
model is considered to be working.
When assuming a linear relationship between the latent ability and the covariates, we have
Z = x'β_0 + ε, where ε is a random variable with cumulative distribution function F(.) and x
and β_0 are column vectors of the same length. Then the probability Pr(Z ≤ z) is F(z − x'β_0).
This relationship specifies the implied model for Y in the form
$$\theta_j = \Pr(Y \le j) = \Pr(Z \le \alpha_j) = F(\alpha_j - x'\beta_0), \qquad (1)$$
or equivalently F^{-1}(θ_j) = α_j − x'β_0.
When we choose F(x) = e^x/(1 + e^x), which is the logistic distribution function for ε, this scheme produces
the proportional odds model
$$\operatorname{logit}(\theta_j) = \alpha_j + x'\beta_1 \qquad (2)$$
where logit(p) = log(p/(1−p)) and β_1 = −β_0. The proportional odds model is a linear logistic
model in which the intercepts depend on category j, but the slopes are all equal. The graph of the
five cumulative logits against x is a series of parallel planes with intercepts α 1 , …, α 5 .
“Proportional odds” refers to the fact that the odds of the event Y ≤ j satisfy
$$\operatorname{odds}(Y \le j) = \frac{\Pr(Y \le j)}{1 - \Pr(Y \le j)} = \exp(\alpha_j + x'\beta_1).$$
Consequently, the ratio of the odds of the event Y ≤ j for x_1 and x_0 is
$$\frac{\operatorname{odds}(Y \le j \mid x_1)}{\operatorname{odds}(Y \le j \mid x_0)} = \exp\left((x_1 - x_0)'\beta_1\right),$$
which is a constant independent of j.
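The proportional odds property can be checked numerically. The sketch below assumes hypothetical cutting-points and slopes (they are not estimates from any e-rater data) and evaluates the cumulative logit model for two feature vectors.

```python
import math

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

def cumulative_probs(x, beta1, cutpoints):
    """theta_j = Pr(Y <= j) = logistic(alpha_j + x'beta_1) for j = 1, ..., 5."""
    lin = sum(xi * bi for xi, bi in zip(x, beta1))
    return [logistic(a + lin) for a in cutpoints]

def category_probs(x, beta1, cutpoints):
    """Pr(Y = j) for j = 1, ..., 6, by differencing the cumulative probabilities."""
    theta = cumulative_probs(x, beta1, cutpoints) + [1.0]
    return [theta[0]] + [theta[j] - theta[j - 1] for j in range(1, len(theta))]

def odds(p):
    return p / (1.0 - p)

alphas = [-3.0, -1.5, 0.0, 1.5, 3.0]   # hypothetical alpha_1 < ... < alpha_5
beta1 = [-0.8, -0.5]                   # hypothetical slopes (beta_1 = -beta_0)
x0, x1 = [0.0, 0.0], [1.0, 2.0]

# Odds ratio of the event Y <= j for x1 versus x0, one value per j.
ratios = [odds(t1) / odds(t0)
          for t1, t0 in zip(cumulative_probs(x1, beta1, alphas),
                            cumulative_probs(x0, beta1, alphas))]
probs = category_probs(x1, beta1, alphas)
```

All five ratios equal exp((x1 − x0)'β_1), here exp(−1.8), regardless of j, and the six category probabilities sum to one.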
3.3.3.3 Connection between classification and POM. In the special situation where a uniform
prior is assumed, i.e., P1 = … = P6, and the loss is Cik = 0 when i = k and Cik = 1 when i ≠ k, the
classification scheme based on ECM in section 3.3.3.1 is the same as assigning group
membership with the largest P(Y = j). In addition, the response variable is ordinal and POM
takes the ordering into account. There are several other methods that can be used to
model ordinal data, as seen in Agresti (1990).
4. Results
In describing the results, we first discuss how we assess the adequacy of models. This is
followed by comparisons of OLS, PCA, and EW; ridge regression and a novel application of
CVA; and POM and quantile regression. Finally, other results are presented, which include the
effect of the number of words and more on quantile regression.
4.1 How to Assess Adequacy of Models
We can evaluate the model fit and stability by R, the correlation between the response
variable and the predictions from a certain model. We use notation R(. | .), where the second dot
represents the dataset on which our model is built, and the first dot is the dataset on which the
developed model is justified or validated. A better model needs to have
(a) higher R(Unif | Unif) or training sample correlation;
(b) smaller decline from R(Unif | Unif) to R(PN | Unif), the validation sample
correlation,
where “Unif” stands for the “uniform sample” and “PN,” the “pseudo normal sample.”
This R is appropriate for the mean regression analysis where the response variable takes on
continuous values. For the classification nature of the problem, two additional measurements are
proposed to assess the fit of the model. One is Cohen’s Kappa (Cohen, 1960) between human
rater and e-rater; the other is the average loss of misclassification. Due to the discrete nature of
the score, the assessment depends on the confusion matrix (Johnson & Wichern, 1992; Kohavi
& Provost, 1998), which shows predicted versus actual group membership.
Table 2
Confusion Matrix
                          Target score
Predicted score     1     2     3     4     5     6
       1           C11   C12   C13   C14   C15   C16
       2            …
       3
       4
       5
       6           C61   C62   C63   C64   C65   C66
The columns are human scores, which are assumed to be the target scores, and rows are
predicted scores (rounded to the nearest integers, with truncation at scores 1 and 6). Cij is the
count of essays whose target scores are j, but are classified as i. So the diagonal elements
represent the successful classifications, and the off-diagonal elements represent misclassifications.
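As an illustration of how such a matrix is built, the following sketch rounds continuous predictions to the nearest integer, truncates them at 1 and 6, and tallies them against the human scores. The predictions and scores are hypothetical, and the half-up rounding rule is an assumption, since the report does not say how ties are handled.

```python
import math

def to_score(pred, low=1, high=6):
    """Round to the nearest integer (halves rounded up, an assumption)
    and truncate the result to the range [low, high]."""
    return max(low, min(high, math.floor(pred + 0.5)))

def confusion_matrix(predicted, target, levels=6):
    """C[i-1][j-1] counts essays with target score j that were classified as i."""
    C = [[0] * levels for _ in range(levels)]
    for p, t in zip(predicted, target):
        C[to_score(p) - 1][t - 1] += 1
    return C

raw_predictions = [0.4, 2.6, 3.2, 4.5, 6.8, 3.9]   # hypothetical model outputs
human_scores = [1, 3, 3, 4, 6, 5]                  # hypothetical target scores
C = confusion_matrix(raw_predictions, human_scores)

exact_agreements = sum(C[i][i] for i in range(6))
```

Rows index the predicted score and columns the target score, matching the layout of Table 2.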
4.1.1 Kappa and Weighted Kappa
Exact agreement rate was used to evaluate the agreement between two raters because it is a
popular single summary index of the confusion matrix, which is easily calculated by
$$P_o = \frac{\sum_{i=1}^{6} C_{ii}}{\sum_{i=1}^{6}\sum_{j=1}^{6} C_{ij}} = \sum_{i=1}^{6} P_{ii}$$
A shortcoming of Po is that it does not compare the agreement to that expected if the
ratings were independent. So it is an inflated measure of agreement between two raters. To
address this problem, Powers (2000) recommended the use of Cohen’s Kappa as a measure of
agreement. More recently, Dorans and Patsula (2003) proposed confusion reduction and
confusion infusion, which adjust for the expected agreement using trivial assignment rules, such
as rules for uniform random assignment, modal assignment, and matched proportion random
assignment.
Let Pe denote the value expected on the basis of chance alone. Under the independence
assumption, it can be computed by
$$P_e = \sum_{i=1}^{6} P_{i\cdot} \, P_{\cdot i}.$$
The obtained excess beyond chance is Po − Pe,
whereas the maximum possible excess is 1 − Pe. The ratio of these two differences is denoted
Kappa,
$$\kappa = \frac{P_o - P_e}{1 - P_e}$$
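A minimal sketch of the Kappa computation follows. It uses the independence-based chance term (the product of the row and column marginals) rather than any other baseline, and a hypothetical 3 x 3 matrix of counts to keep the arithmetic easy to follow; the same function works for a 6 x 6 matrix.

```python
def kappa(C):
    """Cohen's Kappa from a square confusion matrix of counts."""
    n = sum(map(sum, C))
    levels = range(len(C))
    p_o = sum(C[i][i] for i in levels) / n                      # observed agreement
    row = [sum(C[i]) / n for i in levels]                       # predicted-score marginals
    col = [sum(C[i][j] for i in levels) / n for j in levels]    # target-score marginals
    p_e = sum(row[i] * col[i] for i in levels)                  # chance agreement
    return (p_o - p_e) / (1 - p_e)

# Hypothetical counts: rows are predicted scores, columns are target scores.
C = [[20, 5, 0],
     [10, 30, 5],
     [0, 5, 25]]
k = kappa(C)
```

Here Po = .75 and Pe = .345, so Kappa is .405/.655, or about .62, noticeably lower than the raw agreement rate.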
In our study, we choose the matched proportion random (MPR) assignment rule (Dorans &
Patsula, 2003) as the baseline needing improvement. So Pe is calculated using the score
distribution in cross-validation sample shown in Table 1.
Kappa is designed for nominal classification. In the context of e-rater, where the
categories are ordered, the seriousness of a disagreement depends on the difference between the
ratings. Cohen (1968) generalized his Kappa measure to the case where the relative seriousness
of each possible disagreement could be quantified. The measure weighted Kappa (Spitzer,
Cohen, Fleiss, & Endicott, 1967) uses weights wij to describe closeness of agreement. The weights
are restricted to lie in the interval 0 ≤ wij ≤ 1 and wij = wji, i.e., the two raters are considered
symmetrically.
The observed weighted proportion of agreement is, say,
$$P_{o(w)} = \sum_{i=1}^{6}\sum_{j=1}^{6} w_{ij} P_{ij}$$
where the proportions Pij are those in the confusion matrix. The chance-expected weighted proportion of
agreement is
$$P_{e(w)} = \sum_{i=1}^{6}\sum_{j=1}^{6} w_{ij} P_{ij(MPR)}$$
Weighted Kappa is then given by
$$\kappa_w = \frac{P_{o(w)} - P_{e(w)}}{1 - P_{e(w)}}$$
Note that, when wij = 0 for all i ≠ j and wij = 1 for all i = j, weighted Kappa becomes
identical to the commonly used Kappa. Though developed from different perspectives in
statistics, decisions on the weights are closely related to the choices of loss function, as we will
show in the rest of this section.
4.1.2 Average Loss
We use the loss function that takes into account correct and incorrect classifications to evaluate
model fit. We can express the loss function in a format that is similar to the confusion matrix:
The columns are human scores categories, and the rows are rounded predicted scores with
truncation at scores 1 and 6.
Table 3
Loss Matrix
                          Target score
Predicted score     1     2     3     4     5     6
       1           L11   L12   L13   L14   L15   L16
       2            …
       3
       4
       5
       6           L61   L62   L63   L64   L65   L66
Let L11 = L22 = … = L66 = 0 and let Lij be the loss that occurs when score i is assigned to
an examinee whose target score is j. The average loss, denoted by L, is computed by the
following formula:
$$L = \frac{\sum_{i,j=1}^{6} C_{ij} L_{ij}}{\sum_{i,j=1}^{6} C_{ij}}.$$
The exact agreement rate can be expressed through the average loss with Lii = 0 and Lij = 1 for i ≠ j;
under this loss, the average loss is one minus the exact agreement rate.
The quadratic loss function is Lij = | i – j |^2, which is the minimand for the OLS estimator; the
absolute loss function is Lij = | i – j |, which yields the MAD estimator. Both loss
functions are symmetric.
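The average loss formula can be sketched as below for the three symmetric losses just described. The confusion matrix is a hypothetical 3 x 3 example, and the function and variable names are illustrative.

```python
def average_loss(C, loss):
    """Average of loss(i, j) over all essays, where C[i-1][j-1] counts essays
    assigned score i whose target score is j."""
    k = len(C)
    total = sum(map(sum, C))
    weighted = sum(C[i][j] * loss(i + 1, j + 1)
                   for i in range(k) for j in range(k))
    return weighted / total

def zero_one(i, j):
    return 0 if i == j else 1

def absolute(i, j):
    return abs(i - j)

def quadratic(i, j):
    return (i - j) ** 2

# Hypothetical counts: rows are predicted scores, columns are target scores.
C = [[20, 5, 1],
     [10, 30, 5],
     [0, 5, 24]]

loss_eq = average_loss(C, zero_one)
loss_ab = average_loss(C, absolute)
loss_sq = average_loss(C, quadratic)
exact_agreement = sum(C[i][i] for i in range(3)) / sum(map(sum, C))
```

The single essay that is two score points off contributes 1, 2, and 4 to the three losses respectively, which is why the quadratic loss is the most sensitive to large discrepancies.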
Fleiss and Cohen (1973) used the weights
$$w_{ij} = 1 - \frac{(i - j)^2}{(k - 1)^2}$$
where k is the number of score categories.
Independent of Cohen (1968), Cicchetti and Allison (1971) proposed a statistic for measuring
inter-rater reliability that is formally identical to weighted Kappa. They suggested using the
weights
$$w_{ij} = 1 - \frac{|i - j|}{k - 1}$$
The two weights considered above are naturally connected to the quadratic loss and
absolute loss; for example,
$$P_{o(sq)} = 1 - \frac{1}{(k-1)^2} L_{o(sq)},$$
and some simple algebra shows that
$$\kappa_{sq} = \frac{P_{o(sq)} - P_{e(sq)}}{1 - P_{e(sq)}} = \frac{L_{MPR(sq)} - L_{o(sq)}}{L_{MPR(sq)}}$$
where LMPR(sq) = 1 - Pe(sq) and Lo(sq) = 1 - Po(sq).
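The relationship between weighted Kappa and the two weighting schemes can be sketched as follows. One assumption should be stated loudly: the chance term here is computed from the product of the row and column marginals, as a stand-in for the MPR baseline used in the study. The confusion matrix is hypothetical and the function names are illustrative.

```python
def weighted_kappa(C, weight):
    """Weighted Kappa from a square matrix of counts and a weight function.

    Assumption: chance agreement uses the product of the marginals,
    not the MPR baseline described in the text.
    """
    n = sum(map(sum, C))
    k = len(C)
    P = [[C[i][j] / n for j in range(k)] for i in range(k)]
    row = [sum(P[i]) for i in range(k)]
    col = [sum(P[i][j] for i in range(k)) for j in range(k)]
    cells = [(i, j) for i in range(k) for j in range(k)]
    p_o = sum(weight(i, j, k) * P[i][j] for i, j in cells)
    p_e = sum(weight(i, j, k) * row[i] * col[j] for i, j in cells)
    return (p_o - p_e) / (1 - p_e)

def identity_weight(i, j, k):
    return 1.0 if i == j else 0.0

def quadratic_weight(i, j, k):      # weights of Fleiss and Cohen (1973)
    return 1.0 - (i - j) ** 2 / (k - 1) ** 2

def linear_weight(i, j, k):         # weights of Cicchetti and Allison (1971)
    return 1.0 - abs(i - j) / (k - 1)

C = [[20, 5, 0],
     [10, 30, 5],
     [0, 5, 25]]
kappa_plain = weighted_kappa(C, identity_weight)
kappa_sq = weighted_kappa(C, quadratic_weight)
kappa_abs = weighted_kappa(C, linear_weight)
```

For this matrix the rescaled observed and chance quadratic losses are .0625 and .2875, so the loss form of the identity gives (.2875 − .0625)/.2875, which matches the value returned for the quadratic weights.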
An asymmetric loss function typically requires Lij > Lji when j > i. In this
study, we use the asymmetric loss function Lij = | i – j | I(i > j) + | i – j |^2 I(i ≤ j), which is a combination
of the upper triangle of quadratic loss and lower triangle of absolute loss. The currently chosen
asymmetric loss function is best viewed as a starting point.
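The asymmetric loss defined above can be tabulated directly from its formula. The sketch below takes i as the assigned (row) score and j as the target (column) score, following the convention of the loss matrix in Table 3.

```python
def asymmetric_loss(i, j):
    """L_ij = |i - j| when i > j (overestimation), |i - j|^2 when i <= j."""
    d = abs(i - j)
    return d if i > j else d ** 2

# Rows are assigned scores i = 1..6, columns are target scores j = 1..6.
loss_matrix = [[asymmetric_loss(i, j) for j in range(1, 7)]
               for i in range(1, 7)]

for row in loss_matrix:
    print(" ".join(f"{v:2d}" for v in row))
```

The resulting rows agree with the entries of the asymmetric loss matrix reported in Table 7: underestimating by five points costs 25, while overestimating by five points costs 5.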
4.2 Results Comparison
4.2.1 OLS, PCA, and EW
Figure 5 shows the regression results of six methods:
i) EWmin03: equal weight regression using features whose minimum correlations with essay
scores are greater than 0.3 across all 20 prompts
ii) LSall: least squares regression using all the linguistic features
iii) LS18: least squares regression using 18 features selected by the forward selection scheme
iv) LS8: least squares regression using 8 features selected by the forward selection scheme
v) PC40: PCA regression with first 40 principal components
vi) PC8: PCA regression with first 8 principal components
Two boxplots of correlations across 20 GMAT essay prompts were used to evaluate
model fit. The first boxplot of the two, indexed by “Unif,” is for the correlation coefficient
between the essay scores in the “uniform sample” and the prediction built in the same sample.
The second boxplot, indexed by “PN,” is for the correlation coefficient between the essay scores
in the “pseudo normal sample” and the estimated scores using the model built on the “uniform
sample.” The central location of the boxplots shows the predictive power of each regression
method. The drop in location between the boxplots for a given method shows the stability of that
method. The steeper the drop in location, the less stable the method; the higher the central
location, the more predictive the model.
Here are the most obvious findings drawn from Figure 5:
1. One dominant feature in the figure is the overfitting of LSall to the data—it has the
steepest location drop among all the methods, which means it is the least stable
method. This finding was expected because LSall keeps all the features no matter
how relevant they are to the response variable.
2. PCA, in general, is more stable than LSall, but it did not account for as much of the
variation of the response variable as LSall does. We are not surprised to see that the
Rs of LSall are generally higher than those of PCA. This result occurs because the
two methods, PCA and LSall, seek different goals. The predictive components of the
PCA regression are linear combinations of the predictor variables that are chosen to
maximize the variance of the linear combination (the principal component) making
sure at the same time that each successive linear combination is independent of prior
ones. The coefficients of LS are computed to minimize the sum of squares of the
difference between the observed response variable and the predicted values. Thus,
PCA may not be as predictive as LS.
3. EW does not work well in e-rater modeling. Though EWmin03 was a better predictor
than EWall, it still failed to compete with the LSall and PCA methods.
4. LS8, which is our surrogate for forward stepwise OLS, is the best performer in Figure
5 in terms of both predictive power and stability.
5. The variation of results across different methods is greater in the Unif training
samples than it is in the PN validation samples.
Figure 5. Boxplots of correlations between essay score and predicted essay score based on six regression methods for 20 GMAT essay prompts.
4.2.2 Ridge Regression and a Novel Application of CVA
Figure 6 compares ridge regression with LS8, LSall, I&D, and W&P, where LSall and
LS8 are as defined in section 4.2.1; and I&D and W&P are two LS methods that use CVA as
defined in section 3.2.4. In this ridge regression method, we use a formula proposed by Hoerl,
Kennard, and Baldwin (1975) to estimate the ridge parameter k:
$$\hat{k} = \frac{p\,\hat{\sigma}^2}{\hat{\beta}'\hat{\beta}}$$
where p is the number of predictors, $\hat{\beta}$ is the vector of OLS coefficient estimates, and $\hat{\sigma}^2$ is the error
variance estimate that would be obtained when the OLS method is used with all the feature
variables. It can be shown that, with this choice of ridge parameter, the ridge estimates can yield predictions
with smaller mean square error than those produced by the OLS method. In addition, the
ridge estimator that uses the above-mentioned ridge parameter estimate is a special
case of the so-called “double h-class ridge estimates.” See Vinod and Ullah (1981) for details.
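The ridge parameter formula of Hoerl, Kennard, and Baldwin (1975) is simple enough to sketch directly. The coefficient vector and error variance below are hypothetical values standing in for the output of a full OLS fit, and the function name is an illustrative choice.

```python
def hkb_ridge_parameter(beta_ols, sigma2):
    """k = p * sigma^2 / (beta' beta), using quantities from the full OLS fit.

    beta_ols : OLS coefficient estimates (one per predictor)
    sigma2   : OLS estimate of the error variance
    """
    p = len(beta_ols)
    return p * sigma2 / sum(b * b for b in beta_ols)

# Hypothetical OLS output for three standardized predictors.
k_hat = hkb_ridge_parameter([0.5, -0.3, 0.2], sigma2=0.04)
```

The estimate grows with the error variance and shrinks as the OLS coefficients get larger, so noisier fits with weak coefficients receive more shrinkage.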
From the comparisons in Figure 6 we see:
1. The partition scheme using “init” and “dev” (I&D) gave a slightly better result than
that using “words” and “phrases” (W&P) when evaluated in terms of both
predictability and stability.
2. The two CVA methods (W&P and I&D) gave comparable results to LS8, which was
our proxy for e-rater.
3. Ridge regression cross-validated better than LSall, but did not cross-validate as well
as the other three methods.
Figure 6. Boxplots of correlations between essay score and predicted essay score for 20 GMAT essay prompts comparing three regression methods with two methods using cosine correlation.
4.2.3 POM and Quantile Regression
Both POM and quantile regression used the same subset of features as selected by the forward least
squares method. As mentioned in section 3.3.1, the quantile regression method, unlike OLS,
which provides only a single location estimate (the conditional mean function), gives a series
of estimates of conditional quantile functions. For prediction, a single quantile regression, which
yields the highest Kappa or lowest loss, needs to be selected from among this series. To pick the
correct regression quantile, 19 quantile regression procedures, with τ ranging from 5% to 95% in
steps of 5%, were computed on the combined dataset (combining Unif and PN) with 749
examinees. For each quantile regression method, Kappa, weighted Kappa, and several average
losses were computed. We chose the τth regression quantile with highest Kappa.
For example, Table 4 shows the exact agreement rates and Kappas of different regression
quantiles for GMAT essay prompt 2434. Only the 45th to 70th regression quantiles are shown
because this range contained the highest Kappa. The 55th regression quantile (in bold) was
selected. As a reference, we did a hold-out procedure using 50th, 55th, and 60th regression
quantile. These results are shown in Table 5.
Table 4
Exact Agreement Rate and Kappa for a GMAT Essay Prompt
Quantile          45%    50%    55%    60%    65%    70%
Agreement rate    .501   .503   .520   .518   .495   .485
Kappa             .358   .361   .383   .380   .351   .338
Table 5
Hold-out Exact Agreement Rate and Kappa for a GMAT Essay Prompt
Method            OLS    POM    CLFN   50%    55%    60%
Agreement rate    .490   .511   .503   .499   .502   .497
Kappa             .344   .371   .361   .356   .360   .353
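The selection step described above, choosing the regression quantile with the highest Kappa, can be sketched as follows. Only the selection is shown; fitting the quantile regressions is not. The Kappa values are transcribed from Table 4, and the variable names are illustrative.

```python
# Grid of 19 candidate quantiles: 5%, 10%, ..., 95%.
taus = [round(0.05 * i, 2) for i in range(1, 20)]

# Kappa by regression quantile, transcribed from Table 4 (only the
# 45th to 70th regression quantiles are reported there).
kappa_by_tau = {0.45: .358, 0.50: .361, 0.55: .383,
                0.60: .380, 0.65: .351, 0.70: .338}

# Pick the quantile whose fitted model gives the highest Kappa.
best_tau = max(kappa_by_tau, key=kappa_by_tau.get)
```

Applied to the reported values, the rule selects the 55th regression quantile.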
Table 5 contains the hold-out exact agreement rates and corresponding Kappas for four
methods. For CLFN, we assumed a uniform prior. For the other two components, fi and Lki as
defined in section 3.3.3.1, we used POM to fit fi, the density associated with score level i, and
used a combination of an upper triangle of squared loss and lower triangle of absolute loss as
seen in Table 7.
Table 6
Average Loss for a GMAT Essay Prompt
      OLS    POM    CLFN   50%    55%    60%
Sq    .640   .603   .634   .615   .636   .652
Ab    .553   .527   .542   .539   .544   .552
Eq    .553   .527   .540   .539   .544   .551
Aj    .582   .548   .557   .560   .563   .565
Table 7
Asymmetric Loss Matrix
      1    2    3    4    5    6
1     0    1    4    9   16   25
2     1    0    1    4    9   16
3     2    1    0    1    4    9
4     3    2    1    0    1    4
5     4    3    2    1    0    1
6     5    4    3    2    1    0
Table 6 shows four different average losses, where “Sq” means squared loss; “Ab”—absolute
loss; “Eq”—equal loss and “Aj”—adjacent loss, which is shown in Table 8.
Table 8
Adjacent Loss Matrix
      1    2    3    4    5    6
1     0    1    2    2    2    2
2     1    0    1    2    2    2
3     2    1    0    1    2    2
4     2    2    1    0    1    2
5     2    2    2    1    0    1
6     2    2    2    2    1    0
The corresponding weighted Kappas are listed in Table 9.
Table 9
Generalized Kappa for the Same GMAT Essay Prompt in Table 8
OLS POM CLFN 50% 55% 60%
Sq .791 .803 .793 .799 .792 .787
Ab .597 .616 .605 .607 .603 .597
Eq .289 .322 .306 .307 .300 .291
Aj .504 .533 .525 .523 .520 .518
POM had the highest Kappa and weighted Kappa and lowest loss; it outperformed the
other methods in this prompt. The 55th regression quantile and CLFN gave basically the same
results. These results are typical, as seen in Appendix E.
4.3 Other Results
4.3.1 Effects of the Number of Words
In the past, empirical studies (Breland, Bonner, & Kubota, 1995; Page & Petersen, 1995;
Sheehan, 2003) have shown that the number of words is highly correlated with essay scores. We
noticed that two features always selected by forward selection across the 20 GMAT essay
prompts were 1) the number of auxiliary verbs and 2) the ratio of the number of auxiliary verbs
to the number of words. If we divide 1) by 2), we get the total number of words. So the effects of the
number of words deserve in-depth study.
We fit a quadratic regression using number of words to essay scores of Prompt 2438. In
this regression, we predicted essay score from the total number of words, and that total squared.
The dataset on which we built the model combined the uniform sample and pseudo-normal
sample. The Kappa between human scores and the predicted scores based on words and words
squared was .344. When OLS with forward selection was used on this prompt, Kappa was .354.
The quadratic regression using the number of words did a very good job of predicting, as has
been demonstrated by Sheehan (2003) and others. But there remained aspects of the essay score
that the quadratic regression using the number of words could not capture.
To see what lies beyond the number of words and that number squared, we fit a “two-step
regression” to the essay scores. The two-step regression proceeded as follows:
i) fit a quadratic regression using the number of words to essay scores, yi
ii) apply OLS with forward selection to fit the residuals, ri , from step i
The corresponding hold-out procedure for cross-validation involved three steps. For each
examinee in the sample, we:
a) predicted yi using the estimated quadratic regression in step i
b) predicted ri using the estimated regression equation from step ii
c) summed the predicted values from a) and b) to produce a final prediction
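The two-step procedure and its prediction steps can be sketched with plain least squares. Several simplifications are assumed here: the data are a hypothetical toy sample, the second step uses a single extra feature instead of forward selection over many features, and the fit is evaluated in-sample rather than with a hold-out split. All names are illustrative.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for cc in range(c, n + 1):
                M[r][cc] -= f * M[c][cc]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

def ols(X, y):
    """Least squares coefficients from the normal equations X'X b = X'y."""
    p = len(X[0])
    XtX = [[sum(row[a] * row[b] for row in X) for b in range(p)] for a in range(p)]
    Xty = [sum(row[a] * yi for row, yi in zip(X, y)) for a in range(p)]
    return solve(XtX, Xty)

def predict(X, coef):
    return [sum(c * v for c, v in zip(coef, row)) for row in X]

def sse(y, pred):
    return sum((yi - pi) ** 2 for yi, pi in zip(y, pred))

# Hypothetical data: word counts (in hundreds), one other feature, human scores.
words = [1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5]
other = [0.2, 0.9, 0.4, 0.8, 0.3, 0.9, 0.5, 0.7]
score = [1, 3, 3, 4, 4, 6, 5, 6]

# Step i: quadratic regression of essay score on the number of words.
X1 = [[1.0, w, w * w] for w in words]
step1 = predict(X1, ols(X1, score))
resid = [y - f for y, f in zip(score, step1)]

# Step ii: regress the residuals on the remaining feature.
X2 = [[1.0, z] for z in other]
step2 = predict(X2, ols(X2, resid))

# Final prediction: sum of the word-based part and the residual-based part.
final = [a + b for a, b in zip(step1, step2)]
```

On the data used for fitting, the summed prediction can never have a larger squared error than the word-only prediction; whether the gain survives cross-validation is the empirical question addressed in Tables 10 and 11.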
Tables 10 and 11 show that, when taking the effects of the number of words into account
and using OLS, we could improve the prediction accuracy; the Kappa increased from .344 to
.378 (the exact agreement rate rose from .490 to .516). In addition, we looked at using the
quantile regression approach in place of the OLS method as the second step of the two-step
regression. The 60th regression quantile yielded the highest Kappa in the hold-out sample, .396,
which is higher than the OLS two-step regression.
Table 10
Exact Agreement of Two-step Regression on a Combined Sample
Method                                 Fitted values   Hold-out cross-validation
OLS with forward selection                 .498                .490
Two-step OLS regression                    .524                .516
Two-step quantile regression, 50th         .526                .520
Two-step quantile regression, 55th         .526                .521
Two-step quantile regression, 60th         .534                .530
Two-step quantile regression, 65th         .510                .509
Table 11
Kappa of Two-step Regression on a Combined Sample
Method                                 Fitted values   Hold-out cross-validation
OLS with forward selection                 .354                .344
Two-step OLS regression                    .388                .378
Two-step quantile regression, 50th         .390                .383
Two-step quantile regression, 55th         .390                .384
Two-step quantile regression, 60th         .401                .396
Two-step quantile regression, 65th         .370                .369
Another interesting finding was that the two variables, the number of auxiliary verbs and
ratio of the number of auxiliary verbs to the number of words, were not entered in the regression
equation in the second step. We expected to see this because the first step already took
the effects of the number of words into account.
Our conclusion is that number of words is a powerful predictor. However, there remains
something number of words cannot capture—predicting the part not accounted for by words
from other e-rater features and adding that to the part predicted by words yields a better score
prediction.
5. Conclusions, Recommendations, and Future Research
Our explorations to date have resulted in certain conclusions and directions for future
research.
5.1 Conclusions
We have examined enough e-rater data to conclude that:
1. Stepwise regression seemed to be an effective feature reduction procedure with e-rater
data. Not suited were equal weights regression and principal components regression as
these approaches assume dimensional structures that were not consistent with the e-rater
data. Equal weights regression works best when the predictor variables exhibit
collinearity and each predictor is related to the criterion (essay score in this case) to about
the same degree. Principal components regression works best when most of the criterion
variable projects onto the reduced feature space defined by the dimensions that maximize
variance in that space. With e-rater data, there are a few relatively strong predictors.
2. The effectiveness of stepwise regression may be attributed to the consistently strong
relationship with essay score that is observed for the content vector analysis (CVA)
variables and the two variables that are surrogates for word length (number of auxiliary
verbs and ratio of the number of auxiliary verbs to the number of words). Note that
because performance on the criterion variable is used to develop scores on the CVA
variables, there is a functional dependency between essay score and the scores on these
CVA variables that boosts the predictive power of these CVA variables. Furthermore,
word length has been shown to be a powerful predictor of essay score. The ratio of two e-
rater features, (the number of auxiliary verbs / the ratio of the number of auxiliary verbs
to the number of words), is the number of words.
3. Replacing the current two-stage approach of first developing a model in a quasi-uniform
training sample and then validating these results in a target cross-validation sample with
the hold-out method for evaluating validity will yield better validation results. One
possibility is the K-fold cross-validation method (Hastie, Tibshirani, & Friedman, 2001)
in which part of the available data is used to fit the model and a different part to test it.
This approach entails splitting the data into K roughly equal-sized parts. For example,
when K=10, the scenario looks like—
1 2 3 4 5 6 7 8 9 10
Train Train Train Train Test Train Train Train Train Train
For the kth part (5th above), we fit the model to the other K-1 parts of the data, and
calculate the average losses or confusion reductions of the fitted model when predicting
the kth part of the data. We do this for k = 1, 2, …, K and combine the K estimates of the
average loss or confusion reductions. The hold-out procedure is a special case of this K-
fold method.
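The K-fold procedure can be sketched as below. The model is a deliberately trivial stand-in (it always predicts the rounded training mean), the scores are hypothetical, and the folds are formed by taking every Kth index; in practice the data would normally be shuffled first. All function names are illustrative.

```python
def k_fold_indices(n, K):
    """Split indices 0..n-1 into K roughly equal-sized, non-overlapping folds."""
    return [list(range(n))[f::K] for f in range(K)]

def cross_validate(x, y, K, fit, loss):
    """Average hold-out loss over K folds; `fit` returns a prediction function."""
    fold_losses = []
    for test_idx in k_fold_indices(len(y), K):
        held_out = set(test_idx)
        train = [i for i in range(len(y)) if i not in held_out]
        model = fit([x[i] for i in train], [y[i] for i in train])
        fold_loss = sum(loss(model(x[i]), y[i]) for i in test_idx) / len(test_idx)
        fold_losses.append(fold_loss)
    return sum(fold_losses) / K          # combine the K estimates

def fit_mean(x_train, y_train):
    """Trivial stand-in model: always predict the rounded training mean."""
    m = round(sum(y_train) / len(y_train))
    return lambda xi: m

x = list(range(20))                      # placeholder features
y = [1, 2, 3, 3, 4, 4, 4, 5, 3, 2, 6, 4, 3, 5, 4, 3, 2, 4, 5, 3]
folds = k_fold_indices(len(y), 10)
cv_loss = cross_validate(x, y, K=10, fit=fit_mean,
                         loss=lambda pred, target: abs(pred - target))
```

Any of the average losses or agreement indices from section 4.1 could be passed as `loss`; absolute loss is used here only to keep the example short.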
5.2 Future Research
More research is needed to:
1. Explicitly model the part of essay scores that is unrelated to word length.
2. Investigate the POM approach in greater depth because the ordered categorical nature of
essay scores means that automated scoring of essays is essentially a classification
problem. Viewing the essay-grading problem as a classification problem will enable us to
explicitly consider the consequences of attaching different loss functions to different
kinds of misclassifications.
3. See if it matters much whether the response variable is treated as if it is continuous, as is
the case with model building for the current approach, or as a category as is the case
currently for model evaluation. One could argue that the grades assigned by raters to
essays are actually a coarse discrete version of a continuous variable of essay quality. In
that case, a classification model may not be appropriate because the score categories are
not really categories in the sense that gender and/or national origin are categorical. Then
measures of association other than agreement may be of greater interest.
4. Develop a statistical justification for using essay scores to score CVA variables, and if
such a justification can be developed, determine if CVA can be used as part of a model
building process that uses representative samples. For example, it might be possible that
CVA could use a uniformly distributed sample, while a representative sample is used to
build the model.
5. Investigate algorithmic approaches to prediction/classification problems, such as boosting.
The applications mentioned in this report tend to operate within what Breiman (2001)
calls a data modeling approach. Response variables are viewed as functions of predictor
variables, random noise and parameters. For example, the prediction model for e-rater
states that essay scores for a given prompt are a linear combination of about ten variables
that are selected from a set of more than 50 candidate predictor variables. A subset of
variables is selected and parameters are estimated. The resultant model is then used to
score future essays. An alternative to this traditional data modeling approach that is used
increasingly in prediction settings is what Breiman (2001) calls the algorithmic approach.
Instead of postulating a model for predicting a response variable from a string of
predictors, fitting the model, and evaluating its fit, the algorithmic approach tries to find a
function that does the best job of predicting the response variable in the population.
6. Investigate quantile regression further because it appears to give better agreement rates
than OLS and is more flexible to differential treatment of types of error. Feng, Ying, and
Dorans (2002) have begun explorations along these lines (see Appendix F).
7. Investigate whether ridge regression can be used more effectively once changes are made
in the data employed by e-rater.
8. Due to the large pool of potential predictor variables, use of a “best” subsets algorithm
may not be feasible. Approaches that overcome the shortcomings of automatic search
procedures should be investigated. For example, we might use the subset identified by the
automatic search procedures as a starting point in the search for other “good” subsets. One
option might be to treat the number of predictor variables in the regression model
identified by the automatic search procedure as being about the right subset size and then
use the all-possible-regressions procedure for subsets of this and nearby sizes. Another
possibility is to use a logical analysis of the predictor space in conjunction with
statistics to weight variables.
9. Investigate whether prior information about coefficients, based on available
models for other prompts, can be used to produce more stable e-rater models.
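The subset search suggested in item 8 can be sketched roughly as follows. This is only an illustration under stated assumptions, not the procedure used by e-rater: the data are simulated, a simple forward selection stands in for the automatic search procedure, residual sum of squares is used as the comparison criterion, and the names solve, rss, forward_select, and best_subsets_near are hypothetical.

```python
import itertools
import random


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def rss(rows, y, cols):
    """Residual sum of squares of an OLS fit (with intercept) on the chosen columns."""
    design = [[1.0] + [row[c] for c in cols] for row in rows]
    p = len(cols) + 1
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(p)] for i in range(p)]
    xty = [sum(d[i] * yi for d, yi in zip(design, y)) for i in range(p)]
    beta = solve(xtx, xty)
    return sum((yi - sum(b * v for b, v in zip(beta, d))) ** 2
               for d, yi in zip(design, y))


def forward_select(rows, y, size):
    """Stand-in automatic search: greedily add the column that lowers RSS most."""
    chosen = []
    n_cols = len(rows[0])
    while len(chosen) < size:
        rest = [c for c in range(n_cols) if c not in chosen]
        chosen.append(min(rest, key=lambda c: rss(rows, y, chosen + [c])))
    return sorted(chosen)


def best_subsets_near(rows, y, size, slack=1):
    """All-possible-regressions restricted to subset sizes within `slack` of `size`."""
    n_cols = len(rows[0])
    results = {}
    for k in range(max(1, size - slack), min(n_cols, size + slack) + 1):
        best = min(itertools.combinations(range(n_cols), k),
                   key=lambda cols: rss(rows, y, list(cols)))
        results[k] = (list(best), rss(rows, y, list(best)))
    return results


# Simulated data (assumption): 6 candidate predictors, 3 of which matter.
rng = random.Random(0)
rows = [[rng.gauss(0, 1) for _ in range(6)] for _ in range(120)]
y = [2 * r[0] - r[3] + 0.5 * r[5] + rng.gauss(0, 0.5) for r in rows]

start = forward_select(rows, y, size=3)          # subset from the automatic search
near = best_subsets_near(rows, y, size=len(start))  # exhaustive search near that size
```

Restricting the exhaustive search to sizes near the one found automatically keeps the number of candidate regressions manageable, which is the point of the suggestion when the predictor pool is large.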
References
Agresti, A. (1990). Categorical data analysis. New York: Wiley.
Breiman, L. (2001). Statistical modeling: The two cultures (with discussion). Statistical Science,
16(3), 199-231.
Breland, H. M., Bonner, M. W., & Kubota, M. Y. (1995). Factors in performance on brief,
impromptu essay examinations (College Board Report No. 95-4 and ETS RR-95-41).
New York: College Entrance Examination Board.
Burstein, J., Braden-Harder, L., Chodorow, M., Hua, S., Kaplan, B., Kukich, K., Lu, C., Nolan,
J., Rock, D., & Wolff, S. (1998). Computer analysis of essay content for automated score
prediction: A prototype automated scoring system for GMAT analytical writing
assessment essays (ETS RR-98-15). Princeton, NJ: Educational Testing Service.
Cicchetti, D. V., & Allison, T. (1971). A new procedure for assessing reliability of scoring EEG
sleep recordings. American Journal of EEG Technology, 11, 101-109.
Cohen, J. (1960). A coefficient of agreement for nominal scales. Educational and Psychological
Measurement, 20, 37-46.
Cohen, J. (1968). Weighted Kappa: Nominal scale agreement with provision for scaled
disagreement or partial credit. Psychological Bulletin, 70, 213-220.
Dorans, N. J., & Drasgow, F. (1978). Alternative weighting schemes for linear prediction.
Organizational Behavior & Human Performance, 21(3), 316-345.
Dorans, N. J., & Drasgow, F. (1980). A note on cross-validating prediction equations. Journal
of Applied Psychology, 65(6), 728-730.
Dorans, N. J., & Patsula, L. N. (2003). Using confusion infusion and confusion reduction indices
to compare alternative essay scoring rules (ETS RR-03-09). Princeton, NJ: Educational
Testing Service.
Einhorn, H. J., & Hogarth, R. M. (1975). Unit weighting schemes for decision making.
Organizational Behavior & Human Performance, 13(2), 171-192.
Feng, X., Ying, Z., & Dorans, N. J. (2002). Using quantile regression to model simulated
essay scores. Paper presented at the Annual Meeting of the Psychometric Society,
Chapel Hill, NC.
Fleiss, J. L., & Cohen, J. (1973). The equivalence of weighted Kappa and the intraclass
correlation coefficient as measures of reliability. Educational and Psychological Measurement,
33, 613-619.
Hastie, T., Tibshirani, R., & Friedman, J. (2001). The elements of statistical learning: Data
mining, inference, and prediction. New York: Springer.
Hoerl, A. E., & Kennard, R. W. (1970). Ridge regression: Biased estimation for nonorthogonal
problems. Technometrics, 12(1), 55-67.
Hoerl, A. E., Kennard, R. W., & Baldwin, K. F. (1975). Ridge regression: Some simulations.
Communications in Statistics, 4, 105-123.
Johnson, R. A., & Wichern, D. W. (1992). Applied multivariate statistical analysis. Englewood
Cliffs, NJ: Prentice Hall.
Koenker, R., & Bassett, G. (1978). Regression quantiles. Econometrica, 46, 33-50.
Koenker, R., & D’Orey, V. (1987). Computing regression quantiles. Applied Statistics, 36, 383-393.
Kohavi, R., & Provost, F. (1998). Glossary of terms. Machine Learning, 30, 271-274.
Lachenbruch, P. A., & Mickey, M. R. (1968). Estimation of error rates in discriminant analysis.
Technometrics, 10(1), 1-11.
Neter, J., Kutner, M. H., Nachtsheim, C. J., & Wasserman, W. (1996). Applied linear regression
models (3rd ed.). Chicago, IL: Irwin.
Page, E. B., & Peterson, N. (1995). The computer moves into essay grading: Updating the ancient
test. Phi Delta Kappan, 76, 561-565.
Powers, D. E. (2000). Computing reader agreement for the GRE writing assessment
(ETS RM-00-8). Princeton, NJ: Educational Testing Service.
Sheehan, K. M. (2003). Discrepancies in human and computer-generated essay scores for
TOEFL-CBT. Manuscript submitted for publication.
Spitzer, R. L., Cohen, J., Fleiss, J. L., & Endicott, J. (1967). Quantification of agreement in
psychiatric diagnosis. Archives of General Psychiatry, 17, 83-87.
Tucker, L. R. (1951). A method for synthesis of factor analysis studies (Personnel Research
Section Report No. 984). Washington, DC: Dept. of the Army.
Vinod, H. D., & Ullah, A. (1981). Recent advances in regression methods. New York: Marcel
Dekker.
Weisberg, S. (1985). Applied linear regression (2nd ed.). New York: Wiley.
List of Appendixes
A. E-rater® Feature Variables Circa 2001
B. Correlations Between Essay Score and Predicted Essay Score for Nine Methods Given
Practice of Using Uniform Sample to Build the Model and Pseudo-normal Sample to
Cross-validate the Model
C. Correlations Between Essay Score and Predicted Essay Score for Nine Methods After
Reversing Roles of Samples (Pseudo-normal Sample Used to Build the Model and Uniform
Sample Used to Cross-validate the Model)
D. Correlations Between Essay Score and Predicted Essay Score for Three Equal Weights
Methods
E. Cross-validated Exact Agreement Rates and Kappas for Four Methods on All 20 Prompts
F. More on Quantile Regression
Appendix A E-rater® Feature Variables Circa 2001
Discourse-related Features
Note:
Arg_init = word (W) or phrase (P) that flags the beginning of an argument
Arg_dev = word or phrase that flags argument development
Capitalized terms indicate a “class” of discourse; e.g., PARALLEL_W = words or terms such as
First and Second that indicate “parallel” argument structure.
1. total number of paragraphs
2. sentences in paragraph 1
3. number of arg_inits in paragraph 1
4. number of arg_devs in paragraph 1
5. number of arg_aux in paragraph 1
6. total_arg_init in essay
7. total_arg_dev in essay
8. total_arg_init_CLAIM_N
9. total_arg_init_CLAIM_PRO
10. total_arg_init_CLAIM_THAT
11. total_arg_init_CLAIM_THAT_LESS
12. total_arg_init_CLAIM_Ving
13. total_arg_init_PARALLEL_W
14. total_arg_init_PARALLEL_P
15. total_arg_init_SUMMARY_W
16. total_arg_init_SUMMARY_P
17. total_arg_init_TRANSITION_P
18. total_arg_init_To_INFCL
19. total_arg_init_D_SPECIFIC
20. total_arg_init_NEW_PARAGRAPH
21. total_arg_dev_CLAIM_THAT
22. total_arg_dev_CLAIM_THAT_LESS
23. total_arg_dev_CLAIM_Ving
24. total_arg_dev_To_INFCL
25. total_arg_dev_SAME_TOPIC
26. total_arg_dev_BELIEF_W
27. total_arg_dev_BELIEF_P
28. total_arg_dev_CONTRAST_W
29. total_arg_dev_CONTRAST_P
30. total_arg_dev_DETAIL_W
31. total_arg_dev_DETAIL_P
32. total_arg_dev_DISBELIEF_W
33. total_arg_dev_EVIDENCE_W
34. total_arg_dev_EVIDENCE_P
35. total_arg_dev_INFERENCE_W
36. total_arg_dev_INFERENCE_P
37. total_arg_dev_REFORMULATION_W
38. total_arg_dev_REFORMULATION_P
39. total_arg_dev_RHETORICAL_W
40. total_arg_dev_RHETORICAL_P
41. total words in 1st argument
42. total sentences in 1st argument
43. total arg_devs in 1st argument
Syntax-related
1. COMPCL= total number of complement clauses
2. INFCL = total number of infinitive clauses
3. RELCL = total number of relative clauses
4. SUBCL = total number of subordinate clauses
5. AUX_VERB = total number of auxiliary verbs
6. R-COMPCL= ratio of complement clauses based on total number of sentences
7. R-INFCL = ratio of infinitive clauses based on total number of sentences
8. R-RELCL = ratio of relative clauses based on total number of sentences
9. R-SUBCL = ratio of subordinate clauses based on total number of sentences
10. R-AUX_VERB = ratio of auxiliary verbs based on total number of sentences
Topic-related
1. Analysis of all essay vocabulary
2. Analysis of essay vocabulary within each argument (the essay is partitioned into
arguments by the e-rater discourse analyzer)
3. Analysis of essay vocabulary in the first and final arguments of the essay
Appendix B Correlations Between Essay Score and Predicted Essay Score for Nine Methods Given Practice of Using Uniform Sample to
Build the Model and Pseudo-normal Sample to Cross-validate the Model
Lsall LS8 LS18 Ridge I&D W&P PC8 PC40 Ewmin03
Prompt Unif PN Unif PN Unif PN Unif PN Unif PN Unif PN Unif PN Unif PN Unif PN
2399 0.841 -0.094 0.816 -0.069 0.834 -0.082 0.835 -0.077 0.826 -0.053 0.828 -0.063 0.781 -0.027 0.834 -0.092 0.749 -0.028
2400 0.831 -0.043 0.811 -0.019 0.824 -0.036 0.824 -0.023 0.818 -0.017 0.823 -0.037 0.784 -0.006 0.822 -0.037 0.771 0.005
2420 0.852 -0.095 0.823 -0.044 0.841 -0.078 0.843 -0.068 0.824 -0.032 0.831 -0.052 0.773 -0.006 0.837 -0.078 0.780 -0.026
2421 0.898 -0.142 0.873 -0.109 0.890 -0.125 0.894 -0.129 0.878 -0.107 0.880 -0.111 0.832 -0.080 0.877 -0.126 0.811 -0.077
2427 0.861 -0.132 0.831 -0.053 0.850 -0.093 0.855 -0.101 0.834 -0.079 0.834 -0.068 0.792 -0.033 0.847 -0.104 0.766 -0.032
2428 0.852 -0.075 0.825 -0.028 0.839 -0.052 0.845 -0.054 0.832 -0.032 0.842 -0.039 0.786 0.011 0.851 -0.073 0.778 -0.016
2429 0.889 -0.098 0.870 -0.072 0.884 -0.086 0.885 -0.090 0.878 -0.075 0.880 -0.077 0.852 -0.061 0.886 -0.094 0.822 -0.067
2431 0.906 -0.156 0.881 -0.131 0.899 -0.138 0.903 -0.149 0.887 -0.117 0.889 -0.112 0.831 -0.098 0.903 -0.142 0.823 -0.113
2434 0.901 -0.132 0.864 -0.063 0.880 -0.093 0.897 -0.117 0.872 -0.068 0.875 -0.073 0.818 -0.022 0.888 -0.112 0.788 -0.027
2438 0.899 -0.151 0.873 -0.081 0.889 -0.120 0.895 -0.140 0.883 -0.105 0.889 -0.112 0.821 -0.082 0.891 -0.142 0.792 -0.060
2441 0.872 -0.088 0.848 -0.055 0.866 -0.075 0.866 -0.078 0.852 -0.053 0.852 -0.058 0.822 -0.055 0.857 -0.084 0.803 -0.059
2448 0.878 -0.119 0.852 -0.056 0.868 -0.086 0.872 -0.095 0.856 -0.049 0.858 -0.055 0.821 -0.050 0.861 -0.094 0.800 -0.060
2450 0.869 -0.080 0.844 -0.040 0.862 -0.070 0.864 -0.072 0.850 -0.027 0.857 -0.036 0.816 -0.037 0.862 -0.080 0.792 -0.043
2554 0.885 -0.099 0.873 -0.087 0.880 -0.093 0.881 -0.091 0.873 -0.090 0.877 -0.097 0.847 -0.079 0.878 -0.093 0.824 -0.08
2557 0.887 -0.146 0.858 -0.103 0.877 -0.117 0.883 -0.131 0.856 -0.083 0.869 -0.114 0.835 -0.067 0.881 -0.142 0.802 -0.064
2689 0.908 -0.123 0.888 -0.081 0.903 -0.106 0.904 -0.115 0.893 -0.080 0.896 -0.083 0.861 -0.077 0.896 -0.128 0.837 -0.053
2690 0.832 -0.106 0.799 -0.039 0.821 -0.092 0.825 -0.078 0.809 -0.060 0.808 -0.058 0.765 -0.018 0.826 -0.092 0.755 -0.016
2702 0.893 -0.130 0.873 -0.104 0.888 -0.122 0.888 -0.123 0.866 -0.087 0.872 -0.087 0.837 -0.081 0.883 -0.133 0.820 -0.078
2703 0.854 -0.134 0.830 -0.103 0.845 -0.129 0.848 -0.117 0.834 -0.091 0.836 -0.093 0.804 -0.078 0.847 -0.125 0.772 -0.039
0626 0.901 -0.081 0.873 -0.029 0.889 -0.063 0.895 -0.070 0.888 -0.044 0.884 -0.040 0.850 -0.041 0.888 -0.083 0.824 -0.026
Note. Unif denotes uniform sample; PN denotes pseudo-normal sample.
Appendix C Correlations Between Essay Score and Predicted Essay Score for Nine Methods After Reversing Roles of Samples
(Pseudo-normal Sample Used to Build the Model and Uniform Sample Used to Cross-validate the Model)
Lsall LS8 LS18 Ridge I&D W&P PC8 PC40 Ewmin03
Prompt PN Unif PN Unif PN Unif PN Unif PN Unif PN Unif PN Unif PN Unif PN Unif
2399 0.833 -0.070 0.800 -0.018 0.821 -0.044 0.830 -0.063 0.813 -0.020 0.811 -0.031 0.776 -0.013 0.825 -0.067 0.768 -0.004
2400 0.833 -0.037 0.810 -0.021 0.821 -0.031 0.830 -0.033 0.817 -0.014 0.812 -0.010 0.787 -0.011 0.826 -0.033 0.767 0.014
2420 0.830 -0.029 0.803 -0.002 0.816 -0.015 0.826 -0.027 0.815 -0.005 0.811 -0.001 0.774 0.003 0.811 -0.030 0.756 0.007
2421 0.824 0.022 0.801 0.047 0.811 0.035 0.821 0.025 0.806 0.046 0.807 0.045 0.771 0.053 0.814 0.030 0.755 0.062
2427 0.824 -0.017 0.803 0.009 0.811 0.000 0.821 -0.010 0.810 -0.006 0.815 -0.012 0.768 0.024 0.817 -0.013 0.717 0.048
2428 0.838 -0.035 0.819 -0.008 0.825 -0.020 0.835 -0.033 0.820 -0.003 0.826 -0.001 0.803 -0.013 0.825 -0.031 0.762 0.001
2429 0.837 0.016 0.817 0.029 0.825 0.025 0.835 0.022 0.823 0.037 0.825 0.037 0.797 0.047 0.834 0.021 0.755 0.063
2431 0.821 0.042 0.798 0.063 0.810 0.051 0.818 0.041 0.809 0.052 0.812 0.054 0.763 0.055 0.813 0.036 0.726 0.086
2434 0.845 -0.002 0.826 0.021 0.840 0.000 0.842 -0.006 0.835 0.011 0.835 0.014 0.805 -0.005 0.828 -0.003 0.757 0.013
2438 0.819 0.037 0.801 0.049 0.812 0.045 0.816 0.040 0.808 0.051 0.809 0.054 0.758 0.053 0.808 0.042 0.737 0.076
2441 0.827 0.000 0.810 0.021 0.820 0.010 0.824 0.012 0.813 0.023 0.813 0.020 0.768 0.053 0.814 0.013 0.751 0.040
2448 0.840 -0.017 0.828 -0.004 0.831 -0.008 0.837 -0.007 0.829 0.010 0.828 0.012 0.788 0.038 0.836 -0.010 0.754 0.051
2450 0.852 -0.032 0.834 -0.017 0.843 -0.020 0.850 -0.030 0.844 -0.010 0.851 -0.018 0.790 0.011 0.841 -0.032 0.753 0.043
2554 0.843 -0.005 0.815 0.034 0.837 0.003 0.841 0.001 0.826 0.015 0.832 0.004 0.787 0.045 0.840 -0.004 0.766 0.050
2557 0.817 0.019 0.790 0.044 0.809 0.028 0.814 0.027 0.799 0.036 0.797 0.041 0.768 0.066 0.813 0.020 0.733 0.066
2689 0.844 0.014 0.832 0.032 0.841 0.020 0.841 0.022 0.835 0.039 0.837 0.040 0.792 0.060 0.829 0.022 0.789 0.047
2690 0.817 -0.049 0.791 -0.023 0.814 -0.044 0.815 -0.047 0.801 -0.031 0.803 -0.041 0.778 -0.030 0.814 -0.045 0.744 0.011
2702 0.818 0.024 0.800 0.043 0.814 0.027 0.815 0.026 0.803 0.041 0.806 0.039 0.774 0.054 0.812 0.021 0.739 0.083
2703 0.810 -0.014 0.785 0.004 0.803 -0.007 0.806 -0.006 0.789 0.012 0.794 0.006 0.755 0.035 0.801 -0.008 0.739 0.014
0626 0.875 -0.027 0.855 -0.001 0.870 -0.020 0.872 -0.015 0.864 0.000 0.863 -0.003 0.814 0.034 0.864 -0.015 0.786 0.040
Note. Unif denotes uniform sample; PN denotes pseudo-normal sample.
Appendix D Correlations Between Essay Score and Predicted Essay Score for Three Equal Weights
Methods
Ewall Ewmin03 Ewmin00max03
Prompt Unif PN Unif PN Unif PN
2399 0.655 0.629 0.749 0.721 0.696 0.702
2400 0.699 0.688 0.771 0.776 0.727 0.747
2420 0.732 0.699 0.780 0.754 0.771 0.764
2421 0.748 0.688 0.811 0.734 0.797 0.723
2427 0.717 0.660 0.766 0.734 0.754 0.710
2428 0.709 0.704 0.778 0.762 0.746 0.716
2429 0.762 0.709 0.822 0.755 0.798 0.744
2431 0.729 0.599 0.823 0.710 0.789 0.646
2434 0.739 0.736 0.788 0.761 0.779 0.767
2438 0.734 0.638 0.792 0.732 0.751 0.699
2441 0.714 0.671 0.803 0.744 0.752 0.713
2448 0.698 0.644 0.800 0.740 0.748 0.681
2450 0.697 0.675 0.792 0.749 0.737 0.711
2554 0.744 0.656 0.824 0.744 0.795 0.717
2557 0.774 0.696 0.802 0.738 0.798 0.716
2689 0.785 0.712 0.837 0.784 0.833 0.769
2690 0.672 0.691 0.755 0.739 0.714 0.737
2702 0.746 0.681 0.820 0.742 0.780 0.732
2703 0.726 0.680 0.772 0.733 0.771 0.729
0626 0.776 0.723 0.824 0.798 0.795 0.767
Note. Unif denotes uniform sample; PN denotes pseudo-normal sample.
Appendix E Cross-validated Exact Agreement Rates and Kappas for Four Methods on All 20 Prompts
Table E1
Exact Agreement Rate for Cross-validation Using Four Methods
OLS: Ordinary Least Squares; POM: Proportional Odds Model; CLFN: Classification Using Asymmetric Loss Matrix; QR: Quantile Regression, With the τ Values Listed in the Last Column
Prompt OLS POM CLFN QR τ
2399 .484 .481 .490 .486 55th
2400 .457 .474 .451 .481 50th
2420 .481 .486 .485 .494 55th
2421 .495 .487 .514 .509 55th
2427 .473 .469 .472 .470 55th
2428 .457 .453 .452 .453 60th
2429 .523 .540 .531 .534 55th
2431 .491 .502 .503 .513 50th
2434 .490 .511 .503 .502 55th
2438 .490 .522 .524 .514 60th
2441 .502 .511 .518 .520 55th
2448 .524 .524 .519 .534 50th
2450 .503 .514 .503 .510 55th
2554 .489 .498 .495 .490 60th
2557 .448 .462 .455 .460 50th
2689 .513 .518 .507 .515 55th
2690 .491 .518 .495 .507 50th
2702 .470 .482 .493 .497 70th
2703 .431 .468 .452 .453 60th
0626 .531 .535 .538 .535 55th
Table E2
Kappa for Cross-validation Using Four Methods
OLS: Ordinary Least Squares; POM: Proportional Odds Model; CLFN: Classification Using Asymmetric Loss Matrix; QR: Quantile Regression, With the τ Values Listed in the Last Column
Prompt OLS POM CLFN QR τ
2399 0.336 0.333 0.344 0.339 55th
2400 0.302 0.324 0.294 0.333 50th
2420 0.333 0.339 0.338 0.349 55th
2421 0.351 0.340 0.375 0.369 55th
2427 0.322 0.317 0.321 0.318 55th
2428 0.302 0.297 0.295 0.297 60th
2429 0.387 0.408 0.397 0.401 55th
2431 0.345 0.360 0.361 0.374 50th
2434 0.344 0.371 0.361 0.360 55th
2438 0.344 0.385 0.388 0.375 60th
2441 0.360 0.371 0.380 0.383 55th
2448 0.388 0.388 0.381 0.401 50th
2450 0.361 0.375 0.361 0.370 55th
2554 0.343 0.354 0.351 0.344 60th
2557 0.290 0.308 0.299 0.306 50th
2689 0.374 0.380 0.366 0.376 55th
2690 0.345 0.380 0.351 0.366 50th
2702 0.318 0.334 0.348 0.353 70th
2703 0.268 0.316 0.295 0.297 60th
0626 0.397 0.402 0.406 0.402 55th
Appendix F More on Quantile Regression
Because the quantile regression method can look at different
locations in the conditional score distributions, we may use this characteristic to improve
classification/prediction accuracy. For example, with prompt 2427, we applied OLS, POM, and
six quantile regressions, with the tuning parameter set to 5%, 23%, 41%,
59%, 77%, and 95%, to a random sample of 247 essays drawn from the pseudo-normal sample of 494, in
order to demonstrate how quantile regression might be used to produce better classification
near the thresholds. These six regression quantiles were chosen to represent the evenly
distributed essay scores 1 to 6; i.e., the 5% quantile represents the cut point between score levels 1 and 2,
the 23% quantile represents the cut point between score levels 2 and 3, and so on.
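To make the idea of targeting different locations in a score distribution concrete, the following minimal sketch fits a constant-only quantile model by minimizing the check (pinball) loss of Koenker and Bassett (1978). It is a simplification under stated assumptions: there are no predictors, the scores are hypothetical and evenly distributed over 1 to 6, and the function names are illustrative rather than part of e-rater.

```python
def pinball_loss(ys, pred, tau):
    """Average check loss: under-predictions cost tau, over-predictions cost 1 - tau."""
    total = 0.0
    for y in ys:
        diff = y - pred
        total += tau * diff if diff >= 0 else (tau - 1) * diff
    return total / len(ys)


def constant_quantile_fit(ys, tau):
    """Best constant prediction under the check loss, searched over observed values."""
    return min(sorted(set(ys)), key=lambda c: pinball_loss(ys, c, tau))


# Hypothetical, evenly distributed scores on the 1-6 scale (not e-rater data).
scores = [s for s in range(1, 7) for _ in range(3)]

# The six evenly spaced quantile levels (18 percentage points apart).
taus = [0.05, 0.23, 0.41, 0.59, 0.77, 0.95]
fits = [constant_quantile_fit(scores, t) for t in taus]
```

With evenly distributed scores, each of the six levels picks out a different score point, which is the sense in which each regression quantile aims at one region of the score scale. A full quantile regression would minimize the same loss over a linear function of the features.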
For contrast, Tables F1 and F2 show the confusion matrices of OLS and POM,
respectively. The diagonal elements indicate correct classifications.
Table F1
Confusion Matrix of OLS on Prompt 2427
Actual
Predicted 1 2 3 4 5 6
1 1 1 0 0 0 0
2 5 10 12 2 0 0
3 3 15 35 25 1 0
4 0 1 21 44 21 1
5 0 1 1 6 16 11
6 0 0 0 0 6 8
Table F2
Confusion Matrix of POM on Prompt 2427
Actual
Predicted 1 2 3 4 5 6
1 2 3 0 0 0 0
2 6 11 19 4 0 0
3 1 11 25 19 1 0
4 0 2 24 44 17 0
5 0 1 1 10 20 12
6 0 0 0 0 6 8
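As a worked check on how summary indices follow from these matrices, the sketch below computes the exact agreement rate and an unweighted Cohen (1960) kappa for Tables F1 and F2. It assumes rows are predicted scores and columns are actual scores, and that the unweighted form of kappa is the one of interest; the function names are illustrative.

```python
# Rows are predicted scores 1-6; columns are actual scores 1-6.
ols = [  # Table F1
    [1, 1, 0, 0, 0, 0],
    [5, 10, 12, 2, 0, 0],
    [3, 15, 35, 25, 1, 0],
    [0, 1, 21, 44, 21, 1],
    [0, 1, 1, 6, 16, 11],
    [0, 0, 0, 0, 6, 8],
]
pom = [  # Table F2
    [2, 3, 0, 0, 0, 0],
    [6, 11, 19, 4, 0, 0],
    [1, 11, 25, 19, 1, 0],
    [0, 2, 24, 44, 17, 0],
    [0, 1, 1, 10, 20, 12],
    [0, 0, 0, 0, 6, 8],
]


def exact_agreement(cm):
    """Share of essays on the diagonal (predicted score equals actual score)."""
    n = sum(sum(row) for row in cm)
    return sum(cm[i][i] for i in range(len(cm))) / n


def cohen_kappa(cm):
    """Unweighted kappa: agreement beyond what the marginal totals alone imply."""
    n = sum(sum(row) for row in cm)
    row_tot = [sum(row) for row in cm]
    col_tot = [sum(col) for col in zip(*cm)]
    p_obs = sum(cm[i][i] for i in range(len(cm))) / n
    p_exp = sum(r * c for r, c in zip(row_tot, col_tot)) / n ** 2
    return (p_obs - p_exp) / (1 - p_exp)


for name, cm in (("OLS", ols), ("POM", pom)):
    print(name, round(exact_agreement(cm), 3), round(cohen_kappa(cm), 3))
```

Both matrices sum to 247 essays, matching the sample size given in the text. These figures describe this one random half-sample and need not equal the cross-validated values for prompt 2427 in Appendix E.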
Tables F3 through F8 are the confusion matrices for the six regression quantiles. Two
distinguishing findings are:
1) Regression quantiles yielded very high agreement, as measured by Kappa, for
the particular score levels they were aiming at. For
example, at score level 1, the 5th regression quantile misclassified only 1 of 9,
while OLS and POM misclassified 8 and 7 of 9, respectively. The differences were
greater at the higher and lower ends of the quantile regressions.
2) This extra correct prediction of scores had a cost associated with it. For
example, with the 5th regression quantile, 20 of the 28 essays with an actual score of 2
were predicted to be 1, and 21 of the 69 essays with an actual score of 3 were labeled 1.
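The counts quoted in these two findings can be read directly off the columns of Tables F1, F2, and F3. A small sketch, assuming rows are predicted scores and columns are actual scores, with illustrative helper names:

```python
# Rows are predicted scores 1-6; columns are actual scores 1-6.
ols = [[1, 1, 0, 0, 0, 0], [5, 10, 12, 2, 0, 0], [3, 15, 35, 25, 1, 0],
       [0, 1, 21, 44, 21, 1], [0, 1, 1, 6, 16, 11], [0, 0, 0, 0, 6, 8]]   # Table F1
pom = [[2, 3, 0, 0, 0, 0], [6, 11, 19, 4, 0, 0], [1, 11, 25, 19, 1, 0],
       [0, 2, 24, 44, 17, 0], [0, 1, 1, 10, 20, 12], [0, 0, 0, 0, 6, 8]]  # Table F2
q05 = [[8, 20, 21, 3, 0, 0], [1, 6, 35, 37, 8, 0], [0, 1, 12, 33, 17, 3],
       [0, 1, 1, 4, 16, 11], [0, 0, 0, 0, 2, 5], [0, 0, 0, 0, 2, 1]]      # Table F3


def column(cm, actual):
    """Predicted-score counts for all essays with the given actual score."""
    return [row[actual - 1] for row in cm]


def misclassified(cm, actual):
    """(number misclassified, number of essays) at one actual score level."""
    col = column(cm, actual)
    return sum(col) - col[actual - 1], sum(col)


# Finding 1: misclassification at actual score level 1.
level_1 = {name: misclassified(cm, 1)
           for name, cm in (("5th", q05), ("OLS", ols), ("POM", pom))}

# Finding 2: essays pulled down to a predicted score of 1 by the 5th quantile.
twos_as_one = (column(q05, 2)[0], sum(column(q05, 2)))
threes_as_one = (column(q05, 3)[0], sum(column(q05, 3)))
```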
Table F3
Confusion Matrix of 5th Regression Quantile
Actual
Predicted 1 2 3 4 5 6
1 8 20 21 3 0 0
2 1 6 35 37 8 0
3 0 1 12 33 17 3
4 0 1 1 4 16 11
5 0 0 0 0 2 5
6 0 0 0 0 2 1
Table F4
Confusion Matrix of 23rd Regression Quantile
Actual
Predicted 1 2 3 4 5 6
1 6 7 2 1 0 0
2 3 15 28 7 1 0
3 0 5 29 48 14 0
4 0 1 10 18 16 5
5 0 0 0 3 11 10
6 0 0 0 0 2 5
Table F5
Confusion Matrix of 41st Regression Quantile
Actual
Predicted 1 2 3 4 5 6
1 2 3 0 0 0 0
2 5 11 15 3 0 0
3 2 12 42 39 5 0
4 0 1 11 30 21 3
5 0 1 1 5 15 10
6 0 0 0 0 3 7
Table F6
Confusion Matrix of 59th Regression Quantile
Actual
Predicted 1 2 3 4 5 6
1 1 1 0 0 0 0
2 4 6 2 0 0 0
3 4 17 37 12 0 0
4 0 3 28 56 18 0
5 0 1 2 8 20 11
6 0 0 0 1 6 9
Table F7
Confusion Matrix of 77th Regression Quantile
Actual
Predicted 1 2 3 4 5 6
1 0 1 0 0 0 0
2 2 4 0 0 0 0
3 7 15 28 7 0 0
4 0 6 31 51 13 0
5 0 2 10 18 21 6
6 0 0 0 1 10 14
Table F8
Confusion Matrix of 95th Regression Quantile
Actual
Predicted 1 2 3 4 5 6
1 0 0 0 0 0 0
2 0 1 0 0 0 0
3 3 6 4 0 0 0
4 6 13 29 16 0 0
5 0 8 31 54 22 1
6 0 0 5 7 22 19
A possible solution to this problem exists in theory. If we “knew” the position of the
examinee in the conditional score distributions, we would be able to use a more appropriate
regression quantile for prediction. For instance, if we “knew” that a person’s essay score lies
approximately between the 30th and 45th percentiles, then the 41st quantile regression would give a
reasonable prediction. The problem, of course, is that we can never know the percentile of an
examinee’s essay score in practice.
A preliminary study showed that one possible way to solve the puzzle is to use quantile
regression itself. The six regression quantiles provided six predictions for each examinee; for
example, the columns labeled 5th, 23rd, 41st, 59th, 77th, and 95th in Table F9 show these six
predictions for six examinees
whose essay score is 1 according to human raters. The column labeled mean shows the mean of the six
predictions. We used the rounded value of the mean as a rough estimate of the location of the
examinee’s score in the conditional score distributions. For example, for examinee 154, the rough
location estimate is 2, so we used the 23rd quantile regression, which is the 2nd among the
six, to produce the final prediction. The result is 1. By the same argument, this
method provided six correct predictions. In contrast, the 50th regression quantile yielded only
one exactly correct prediction, as shown in the column labeled 50th in Table F9.
Table F9
Combining Several Regression Quantile Results to Give More Accurate Prediction
ID 5th 23rd 41st 50th 59th 77th 95th mean selected
154 1 1 2 2 3 3 4 2.3 ⇒ 23rd
196 1 1 2 2 2 3 3 2.0 ⇒ 23rd
205 1 1 1 1 1 2 3 1.5 ⇒ 23rd
214 1 1 2 2 2 3 4 2.2 ⇒ 23rd
233 1 1 1 2 2 2 3 1.7 ⇒ 23rd
244 1 1 2 3 3 3 4 2.3 ⇒ 23rd
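The combination rule applied in Table F9 can be sketched as follows. This is a minimal illustration under stated assumptions: the mean is rounded half up (consistent with the 1.5 ⇒ 23rd row), the rounded mean is clamped to the range 1 to 6, and the name combine is hypothetical.

```python
import math

LEVELS = ["5th", "23rd", "41st", "59th", "77th", "95th"]

# Table F9: predictions at the six levels above, plus the 50th for comparison.
table_f9 = {
    154: {"six": [1, 1, 2, 3, 3, 4], "q50": 2},
    196: {"six": [1, 1, 2, 2, 3, 3], "q50": 2},
    205: {"six": [1, 1, 1, 1, 2, 3], "q50": 1},
    214: {"six": [1, 1, 2, 2, 3, 4], "q50": 2},
    233: {"six": [1, 1, 1, 2, 2, 3], "q50": 2},
    244: {"six": [1, 1, 2, 3, 3, 4], "q50": 3},
}


def combine(six):
    """Use the rounded mean of the six predictions to pick which quantile to trust."""
    mean = sum(six) / len(six)
    location = math.floor(mean + 0.5)            # round half up (assumption)
    location = min(max(location, 1), len(six))   # keep within score levels 1-6
    return LEVELS[location - 1], six[location - 1]


human_score = 1  # all six examinees were scored 1 by human raters
final = {ex_id: combine(row["six"]) for ex_id, row in table_f9.items()}
combined_hits = sum(pred == human_score for _, pred in final.values())
median_hits = sum(row["q50"] == human_score for row in table_f9.values())
```

For these six examinees the rule selects the 23rd regression quantile every time and recovers the human score in all six cases, whereas the 50th regression quantile matches only once.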
This preliminary result shows a promising property of the quantile regression method.
Future research is needed, such as seeing how prior information may be used to estimate the
location in the conditional score distributions. More complicated combinations of
regression quantiles, for instance, “systematic statistics,” which are linear combinations of
several regression quantiles, also need study. We hope that, with further research, a more effective
decision rule that takes advantage of the quantile regression method can improve the accuracy of
e-rater prediction.