
RESEARCH REPORT June 2003 RR-03-15

Research & Development Division Princeton, NJ 08541

Improving the Statistical Aspects of E-rater®: Exploring Alternative Feature Reduction and Combination Rules

Xin Feng Neil J. Dorans Liane N. Patsula Bruce Kaplan


Xin Feng

Department of Statistics, Columbia University

Neil J. Dorans, Liane N. Patsula, and Bruce Kaplan

Educational Testing Service, Princeton, New Jersey

June 2003

Research Reports provide preliminary and limited dissemination of ETS research prior to publication. They are available without charge from:

Research Publications Office Mail Stop 10-R Educational Testing Service Princeton, NJ 08541

Abstract

This study explores alternative ways of reducing the number of variables/features and additional

ways of combining information across features to produce more stable and accurate e-rater®

scores. Following an explanation of the statistical aspects of the process is a description of

alternatives to the process. Our explorations resulted in certain conclusions and directions for

future research. We have examined enough e-rater data to conclude that stepwise regression

seems to be effective as a feature reduction procedure. However, this may be attributed to the

consistently strong relationship with essay score that is observed for the content vector analysis

(CVA) variables and the two variables used to approximate word length (number of auxiliary

verbs and the ratio of the number of auxiliary verbs to the number of words). To yield better

validation results, we also suggest that the hold-out method for evaluating validity should replace

the current two-stage approach of first developing a model in a quasi-uniform training sample

and then validating these results in a target cross-validation sample. More research is needed in

several areas. First, explicit modeling of the part of essay scores that is unrelated to word length

is warranted. The POM (Proportional Odds Model) approach should be investigated in greater

depth. Also needed is a statistical justification for using essay scores to score CVA variables.

Algorithmic approaches to prediction/classification problems, such as boosting, may prove

fruitful. Further investigation of quantile regression and ridge regression should be conducted.

Key words: e-rater®, automated essay scoring, prediction, classification


Acknowledgements

This research occurred while the first author was an ETS Harold Gulliksen Psychometric Fellow.

This research was guided by the advice of a research team that included Paul Holland, Nan

Kong, and Sandip Sinharay. The authors are grateful to these colleagues for their advice and

critiques. Dan Eignor, Shelby Haberman, and Hariharan Swaminathan also provided helpful

comments. Any opinions expressed in this article are those of the authors and not necessarily of

Educational Testing Service.


Table of Contents

1. Introduction
2. Statistical Aspects of the Process
   2.1 Data
      2.1.1 Variables
      2.1.2 Sample
   2.2 Methods
      2.2.1 Content Vector Analysis
      2.2.2 Forward Stepwise Version of OLS Regression
3. Alternatives to the Process
   3.1 Alternative Data Source for Model Development
   3.2 Alternatives to Stepwise Regression for Variable Selection/Reduction
      3.2.1 No Selection/No Reduction
      3.2.2 Equal Weights Regression
      3.2.3 Principal Component Analysis
      3.2.4 A Novel Application of Content Vector Analysis
   3.3 Alternatives to Stepwise Version of OLS Regression for Model Development
      3.3.1 Quantile Regression
      3.3.2 Ridge Regression
      3.3.3 Classification and Proportional Odds Model
4. Results
   4.1 How to Assess Adequacy of Models
      4.1.1 Kappa and Weighted Kappa
      4.1.2 Average Loss
   4.2 Results Comparison
      4.2.1 OLS, PCA, and EW
      4.2.2 Ridge Regression and a Novel Application of CVA
      4.2.3 POM and Quantile Regression
   4.3 Other Results
      4.3.1 Effects of the Number of Words
5. Conclusions, Recommendations, and Future Research
   5.1 Conclusions
   5.2 Future Research
References
List of Appendixes

List of Tables

Table 1. Essay Score Distribution for Training and Cross-validation Data
Table 2. Confusion Matrix
Table 3. Loss Matrix
Table 4. Exact Agreement Rate and Kappa for a GMAT Essay Prompt
Table 5. Hold-out Exact Agreement Rate and Kappa for a GMAT Essay Prompt
Table 6. Average Loss for a GMAT Essay Prompt
Table 7. Asymmetric Loss Matrix
Table 8. Adjacent Loss Matrix
Table 9. Generalized Kappa for the Same GMAT Essay Prompt in Table 8
Table 10. Exact Agreement of Two-step Regression on a Combined Sample
Table 11. Kappa of Two-step Regression on a Combined Sample

1. Introduction

Educational Testing Service (ETS) developed the e-rater® (electronic essay rater) system

to address the practical concerns that limit the number of essay questions on current tests, such as

the time and costs associated with human scoring. The system, which became operational in

1998, scores essays automatically.

E-rater, by design, provides a distinct scoring model for each new essay topic. Each

topic-specific model is built on a sample of scored essays, distributed across the set of all non-

zero rating points ranging from 1 to 6. As originally formulated, e-rater model development

involved using a hybrid feature methodology that incorporated several variables that are either

derived statistically, extracted through Natural Language Processing (NLP) techniques, or

measured by simple counting procedures. The e-rater approach is constantly changing. At one

time, there were 56 variables, which were referred to as features. These 56 features would be

entered into a stepwise linear regression analysis that is performed in a training dataset to

identify a subset of features that parsimoniously explains observed variation in the human rater

consensus scores. There was a regression model/equation developed for each essay topic.

Finally, score predictions for a cross-validation dataset were obtained by applying the

equation estimated in the training sample to the validation sample. When the resulting score

predictions exhibited sufficient agreement with corresponding human ratings, the scoring/e-rater

model was considered to have been validated. For more details, see Burstein et al. (1998).

As already alluded to, there are both linguistic and statistical components to the e-rater

process. We focused on the statistical component. As noted, there is a separate e-rater model

developed for each essay topic/prompt. Stepwise linear regression plays a prominent role in this

process. It reduces the number of features associated with a prompt by, in effect, giving zero

weight to most of the predictors. As a feature reduction procedure, stepwise regression may

capitalize on chance relationships in the data. In one sample, a large regression coefficient

estimate may be assigned to a chosen variable, while the estimated coefficients of variables that

are highly correlated to that chosen variable are near zero in value. Yet in another sample, the

previously chosen variable may be excluded in favor of one of the other variables that had been

excluded in the first sample. In addition, as a combination rule for available data, stepwise

regression, by discarding much of the data, ignores a large amount of potentially useful

information.


We explore alternative ways of reducing the number of variables/features and alternative

ways of combining information across features to produce more stable and accurate e-rater

scores. The remainder of this paper is divided into four sections. Section 2 offers a description of

the statistical aspects of the e-rater process. Section 3 follows with a description of

alternatives to the process circa 2001. Section 4 is a presentation of results for the various

approaches. Finally, Section 5 presents conclusions and recommendations for future research.

2. Statistical Aspects of the Process

In this section, the data and methods used in the process circa 2001 are described. Within

each description, shortcomings of the process are highlighted.

2.1 Data

Two datasets are collected for each essay prompt. Training data are used to build the

model and then the model is cross-validated on another independent dataset to evaluate model fit

and stability. Each dataset contains the same variables collected on different samples. First, the

variables are described, followed by a description of the sample in each of the two datasets.

2.1.1 Variables

The dependent/response variable is the essay score. In the training dataset, the essay

score represents the combined judgments of two human raters. Two trained human raters score

each candidate’s essay. If the two resulting scores are no more than a single score point apart,

both scores are accepted and their rounded average is taken as the consensus score. In cases

where the two initial scores are not within a single point of each other, a third human rating is

obtained. When the third rating is equidistant to each of the two initial ratings, the average of all

the three scores is taken as the final score. When the third rating is not equidistant from the

two initial ratings, the single most discrepant score is excluded and the average of the two

remaining scores is taken as the consensus score. In the cross-validation data set, only the score

of one human rater is recorded.

For the version of e-rater we studied, there were 56 feature variables (see Appendix A).

These features are divided into three general classes: 1) discourse-related, word counts done by

NLP techniques, or by simple “counting” procedures; 2) syntax-related, word counts or ratios;


and 3) topic-related, categorical variables, derived statistically by content vector analysis (CVA)

using a “cosine correlation” technique that has the same scale values as essay scores.

2.1.2 Sample

As shown in Table 1, scores in the training sample are almost uniformly distributed and

the score distribution in the cross-validation sample is unimodal with a mild skewness. Later on,

we refer to the training dataset as the “uniform sample” and the cross-validation dataset as the

“pseudo-normal” sample.

Neither of the two datasets is a random sample from the population. Here “population”

refers to all responses for a particular essay prompt when the prompt is active. Every year there

are approximately 200,000 GMAT® examinees and more than 80 active prompts. Thus, each

prompt is administered to approximately 2,500 examinees. For both the training and cross-

validation datasets, processing waits until the pre-assigned sample size is reached. Because

sample selection is not random, the results based on such datasets may not necessarily generalize

to the population of examinees for which e-rater is used. This is described further in section 3,

“Alternatives to the Process.”

Table 1

Essay Score Distribution for Training and Cross-validation Data

Score level    Training    Cross-validation
1                  15              19
2                  50              59
3                  50             124
4                  50             154
5                  50              99
6                  50              39
Total             265             494

2.2 Methods

Two statistical methods used by e-rater are content vector analysis and the forward

stepwise version of ordinary least squares (OLS) regression. A description of each follows.


2.2.1 Content Vector Analysis

Content vector analysis (CVA) is a statistical technique used to identify relationships

between words and documents. With regard to the approximate specifications in the rubric about

essay content, CVA can identify language (content words) in essays that appear to contribute to

essay score. The description of CVA below is based on Burstein et al. (1998).

Standard CVA characterizes each text document (essay) at the lexical (word) level. The

document is transformed into a list of (word, frequency) pairs, where frequency is simply the

number of times that word appeared in the document. This list constitutes a vector, which

represents the lexical content of the document. Morphological analysis then can be optionally

used to combine the counts of inflectionally related forms. For example, the words “talk,”

“talks,” and “talking” all contribute to the frequency of their stem, “talk.” In this way, a degree

of generalization is realized across morphological variants and the number of (word, frequency)

pairs is reduced. To represent a whole class of documents, such as a score level for a set of

essays, the documents in the score level are merged and concatenated and a single vector is

generated to represent the score level.

CVA as employed by e-rater refines this basic approach by assigning a weight to each

word in the vector based on the word’s salience. Salience is defined by the frequency of the word

in the document (or score level) multiplied by the inverse of its frequency over all documents.

For example, “of” may be very frequent in a given document, but its salience will be low

because it appears in all documents. If the word “infinitesimal” appears even a few times in a

document, it will likely have high salience because there are relatively few documents that

contain this word.

Finally, an essay is compared to a score level by computing a cosine correlation between

their weighted vectors. The cosine correlation is the same as Tucker’s coefficient of congruence

(Tucker, 1951). The coefficient of congruence between two columns x (word counts for a class)

and y (word counts for an individual document) is defined as the normalized inner product

between the columns x and y, namely as

$$\frac{\sum xy}{\lVert x \rVert \, \lVert y \rVert},$$

where $\lVert x \rVert^2 = \sum x^2$ (and likewise for $y$). The larger the value of the cosine correlation, the closer the essay is to the

class. The class that is closest to the test item is selected.


Summarized below are the steps applied to e-rater.

1. Vector construction for each essay (or score level):

a. Extract words from essay (or combined essays at a given score level)

b. Apply morphological analysis (optional)

c. Construct frequency vector

d. Apply weights based on salience to words to form weighted vector

2. Scoring:

a. Compute cosine correlation between essay vector and the vector of each score

level

b. Assign score to essay equal to the score level with the highest cosine correlation
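
The steps above can be sketched in a few lines of Python. This is a minimal illustration and not the e-rater implementation: the whitespace tokenizer, the omission of morphological analysis, and the specific salience weight (word count multiplied by log(N / document frequency), a common inverse-frequency form) are all assumptions, and the function names are hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenizer; morphological analysis is skipped (assumption)."""
    return text.lower().split()


def idf_weights(essays):
    """Inverse document frequency log(N / df) over the training essays (assumed form)."""
    n = len(essays)
    df = Counter()
    for text in essays:
        df.update(set(tokenize(text)))
    return {word: math.log(n / count) for word, count in df.items()}


def weighted_vector(tokens, idf):
    """Step 1c-1d: frequency vector with each count multiplied by its salience weight."""
    counts = Counter(tokens)
    return {word: c * idf.get(word, 0.0) for word, c in counts.items()}


def cosine(x, y):
    """Step 2a: normalized inner product of two sparse vectors stored as dicts."""
    dot = sum(v * y.get(word, 0.0) for word, v in x.items())
    norm_x = math.sqrt(sum(v * v for v in x.values()))
    norm_y = math.sqrt(sum(v * v for v in y.values()))
    if norm_x == 0.0 or norm_y == 0.0:
        return 0.0
    return dot / (norm_x * norm_y)


def cva_score(essay, training):
    """Step 2b: return the score level whose merged vector is closest to the essay.

    training is a list of (essay_text, score) pairs.
    """
    idf = idf_weights([text for text, _ in training])
    merged = {}
    for text, score in training:
        merged.setdefault(score, []).extend(tokenize(text))
    level_vectors = {s: weighted_vector(toks, idf) for s, toks in merged.items()}
    essay_vector = weighted_vector(tokenize(essay), idf)
    # Ties are resolved in favor of the lowest score level (arbitrary choice).
    return max(sorted(level_vectors), key=lambda s: cosine(essay_vector, level_vectors[s]))
```

In this sketch a word that occurs in every training essay receives weight zero, which mirrors the “of” example in the text.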

2.2.2 Forward Stepwise Version of OLS Regression

The forward stepwise version of OLS regression is used to reduce the number of variables

and build e-rater models. Essentially, this search method develops a sequence of regression

models, at each step adding or deleting a predictor variable. The criterion for adding or deleting a

predictor variable can be expressed in terms of residual sum of squares reduction, coefficient of

partial correlation, t-statistic or F-statistic. The full description of the algorithm can be found in

standard linear regression textbooks, for example, chapter 8 in Weisberg (1985) and chapter 8 in

Neter, Kutner, Nachtsheim, and Wasserman (1996).

The forward stepwise regression methods are easy to explain, and inexpensive to

compute, as compared with the all-possible-regressions procedures. The comparative simplicity

of the results from stepwise regression seems to appeal to many analysts. It is probably the most

widely used automatic search method. But stepwise methods must be used with caution. The

model selected in a stepwise fashion need not optimize any reasonable criterion function for

choosing a model. The apparent ordering of the predictors is an artifact of the method and need

not reflect relationships of substantive interest. Finally, forward stepwise regression may

seriously overstate significance of results. See chapter 8 in Weisberg (1985) for an example.

The forward selection procedure is a simplified version of forward stepwise regression in

which the statistical test of whether an already-entered variable should be eliminated from the

prediction equation is bypassed. In a pilot study, we applied both forward selection and forward

stepwise schemes to some essay prompts. The variables selected by both search algorithms were

almost identical; some “dominant predictors” were always selected by both procedures and the


differences were at most two variables. Given this finding and the large number of predictors, we

used the forward selection scheme to approximate e-rater in this study.
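
A forward selection scheme of the kind described above can be sketched as follows. This is an illustrative sketch only: the entry criterion (a partial F-statistic compared with a fixed threshold), the default threshold of 4.0, and the function names are assumptions, not the settings used by e-rater or in this study.

```python
import numpy as np


def rss(X, y, cols):
    """Residual sum of squares of an OLS fit on an intercept plus the given columns."""
    design = np.column_stack([np.ones(len(y))] + [X[:, j] for j in cols])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta
    return float(resid @ resid)


def forward_selection(X, y, f_to_enter=4.0, max_features=None):
    """Add one predictor at a time; entered predictors are never removed."""
    n, p = X.shape
    limit = p if max_features is None else max_features
    selected = []
    current = rss(X, y, selected)
    while len(selected) < limit:
        # Find the candidate that reduces the residual sum of squares the most.
        best_j, best_rss = None, current
        for j in range(p):
            if j in selected:
                continue
            candidate = rss(X, y, selected + [j])
            if candidate < best_rss:
                best_j, best_rss = j, candidate
        if best_j is None:
            break
        df_resid = n - len(selected) - 2  # n minus intercept minus entered predictors
        if df_resid <= 0:
            break
        if best_rss <= 1e-12:
            f_stat = float("inf")
        else:
            f_stat = (current - best_rss) / (best_rss / df_resid)
        if f_stat < f_to_enter:
            break
        selected.append(best_j)
        current = best_rss
    return selected
```

Because the search never revisits an entered variable, the order of `selected` is an artifact of the procedure, which is the caution raised above about reading substantive meaning into the apparent ordering of predictors.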

Discussed in the next section are alternatives to the use of forward stepwise regression for

variable selection and feature reduction.

3. Alternatives to the Process

This section describes alternatives to uniform sampling for model

development and alternatives to using forward stepwise regression to reduce the number of

variables for the model.

3.1 Alternative Data Source for Model Development

For the process described in section 2.1.2, sample selection is not random. The results

based on such datasets will not necessarily generalize to the population of examinees for which

e-rater is used. To demonstrate that the samples used are potentially problematic, we reversed

the roles of training (uniform distribution) and cross-validation (pseudo-normal distribution)

data. Figure 1 describes the typical situation in which distributions of multiple Rs in the training

sample are higher than their cross-validated correlations. If samples were chosen from the same

population and the sampling was not problematic, we would expect lower correlations in cross-

validation than in the training sample (Dorans & Drasgow, 1978). In Figure 2, however, where

the pseudo-normal sample was used to build the model and the uniform sample was used to

validate the model, this result did not occur. Instead of a decrease in correlations, we

observed increases from the model-building to the model-testing sample in most cases (see

Figure 2 and numerical results in Appendixes B and C). This increase in correlations informs us

that the non-random, non-representative uniform samples typically employed as the training

sample are likely to produce optimistic estimates of the performance in the population.

The difference in results when the roles of the samples are reversed is caused by direct

selection on the criterion or response variable. Least squares regression estimates the conditional

mean function $E(Y \mid X = x) = a_p + b_p x$, where the subscript $p$ indicates “population.” When

the samples are selected based on response variable $Y$, least squares regression then estimates

$E(Y_u \mid X = x) = a_u + b_u x$, where “u” represents the uniform-like sample distribution. Usually

$a_p \neq a_u$ and $b_p \neq b_u$. Thus the resulting multiple-R will be misleading. In the case of e-rater, the

multiple-Rs in the training samples are higher than what is found in the full population. Hence,


the sample estimates are too optimistic, even more optimistic than what sampling error for

random samples suggests.
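
The effect of selecting on the response variable can be illustrated with a small simulation. The sketch below is hypothetical throughout: a single predictor, a latent correlation of 0.8, and a population of 200,000 are assumed values; only the score-level proportions are borrowed from the cross-validation column of Table 1.

```python
import numpy as np


def simulate_selection(seed=0, n_population=200_000, per_level=200):
    """Compare a full population with a sample drawn uniformly over score levels."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=n_population)
    latent = 0.8 * x + 0.6 * rng.normal(size=n_population)

    # Cut the latent variable into six score levels with unimodal proportions.
    props = np.array([19, 59, 124, 154, 99, 39]) / 494
    cuts = np.quantile(latent, np.cumsum(props)[:-1])
    score = np.digitize(latent, cuts) + 1  # score levels 1..6

    # Direct selection on the response: the same number of cases per score level.
    idx = np.concatenate([
        rng.choice(np.flatnonzero(score == s), size=per_level, replace=False)
        for s in range(1, 7)
    ])

    def summarize(xs, ys):
        slope, intercept = np.polyfit(xs, ys, 1)
        r = np.corrcoef(xs, ys)[0, 1]
        return {"r": float(r), "slope": float(slope), "intercept": float(intercept)}

    return {
        "population": summarize(x, score),
        "uniform": summarize(x[idx], score[idx]),
    }
```

Under these assumptions the uniform-like sample yields both a larger correlation and a different slope than the population, which is the pattern described above.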

[Figure 1: boxplots; vertical axis: Correlation; horizontal axis: Sample (Uniform, Pseudo Normal)]

Figure 1. Boxplots of correlations between essay score and predicted essay score of 20 essay prompts for current practice (uniform sample used to build the model and pseudo-normal sample used to cross-validate the model).

An alternative to developing an e-rater model using the uniformly distributed training sample is to use a hold-out procedure (Lachenbruch & Mickey, 1968) on the combined sample (training and cross-validation samples), which is a jackknife approach. Lachenbruch’s hold-out procedure proceeds as follows:

(a) Omit one observation from the combined data set and develop the model based on the remaining data.

(b) Classify (i.e., predict essay score) the hold-out observation using the estimated model constructed in Step (a).

(c) Repeat Steps (a) and (b) until all of the observations are classified.

(d) Compute Kappa and average loss (more details on Kappa and loss function are in section 4).

While this procedure takes advantage of larger sample sizes for model building, a shortcoming is

that by holding out one observation at a time, we need to build the model hundreds of times,

which might cause practical computing time concerns.
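
Steps (a) through (c) of the hold-out procedure can be sketched as below, assuming an OLS model on all supplied predictors and predictions rounded and clipped to the 1 to 6 score scale. Both choices, and the function names, are assumptions for illustration. The agreement measures shown are the exact agreement rate and the standard unweighted Cohen's kappa; the average loss of Step (d) is not covered.

```python
import numpy as np


def holdout_predictions(X, y, lo=1, hi=6):
    """Leave-one-out predictions: refit OLS n times, each time omitting one case."""
    n = len(y)
    design = np.column_stack([np.ones(n), X])
    preds = np.empty(n, dtype=int)
    for i in range(n):
        keep = np.arange(n) != i                      # Step (a): omit observation i
        beta, *_ = np.linalg.lstsq(design[keep], y[keep], rcond=None)
        raw = design[i] @ beta                        # Step (b): predict the held-out case
        preds[i] = int(np.clip(np.rint(raw), lo, hi))
    return preds                                      # Step (c): every case classified


def exact_agreement(y, preds):
    return float(np.mean(np.asarray(y) == np.asarray(preds)))


def cohen_kappa(y, preds, levels=range(1, 7)):
    """Unweighted kappa: (observed - chance agreement) / (1 - chance agreement)."""
    y, preds = np.asarray(y), np.asarray(preds)
    p_obs = np.mean(y == preds)
    p_chance = sum(np.mean(y == k) * np.mean(preds == k) for k in levels)
    if p_chance == 1.0:
        return 1.0
    return float((p_obs - p_chance) / (1.0 - p_chance))
```

The loop makes the computing cost explicit: a combined sample of several hundred essays requires several hundred model fits per prompt.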

[Figure 2: boxplots; vertical axis: Correlation; horizontal axis: Sample (Pseudo Normal, Uniform)]

Figure 2. Boxplots of correlations between essay score and predicted essay score of 20 essay prompts for effect of reversing roles of samples (pseudo-normal sample used to build the model and uniform sample used to cross-validate the model).

3.2 Alternatives to Stepwise Regression for Variable Selection/Reduction

In this section, four variable selection/reduction alternatives to forward stepwise

regression are presented: no selection/no reduction, equal weights regression, principal

components analysis, and a novel application of content vector analysis.

3.2.1 No Selection/No Reduction

One alternative to forward stepwise regression for variable selection/reduction is to use

all variables, which implies no feature selection or reduction. The regression model that retains


all 56 features, i.e., that does not use any variable selection scheme, is labeled “LSall” in the

tables that follow.

3.2.2 Equal Weights Regression

An alternative for variable selection/reduction is equal weights (EW) regression. EW

regression (Einhorn & Hogarth, 1975; Dorans & Drasgow, 1978) is widely used in psychology

and social science because of its stability for cross-validation. In the EW method, the criterion

(response variable) is regressed onto a composite (centroid) of standardized predictors.

Procedurally, the first step is to standardize the predictors. Then, the standardized scores are

summed to obtain a composite. A least squares regression of the criterion onto this composite is

performed. Then, through the relationship of the composite to the original predictors, this single

regression weight is converted into raw score weights for predictors. See Dorans and Drasgow

(1980) for more details.
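
The procedure just described can be sketched as follows. The sketch assumes sample standard deviations (denominator n - 1) for the standardization; the function name and the keys of the returned dictionary are hypothetical.

```python
import numpy as np


def equal_weights_regression(X, y):
    """Regress y on the sum of standardized predictors, then convert to raw-score weights."""
    mean = X.mean(axis=0)
    sd = X.std(axis=0, ddof=1)

    # Standardize the predictors and sum them into a single composite.
    composite = ((X - mean) / sd).sum(axis=1)

    # Least squares regression of the criterion on the composite.
    slope, composite_intercept = np.polyfit(composite, y, 1)

    # One weight on the composite implies weight slope / sd_j on raw predictor j.
    raw_weights = slope / sd
    intercept = composite_intercept - np.sum(slope * mean / sd)
    return {
        "slope": float(slope),
        "composite_intercept": float(composite_intercept),
        "raw_weights": raw_weights,
        "intercept": float(intercept),
    }
```

Only two parameters are estimated from the data regardless of the number of predictors, which is the source of the stability under cross-validation noted above.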

In addition to the usual EW method, labeled “EWall,” we also tried two variations of EW

methods. Before computing the centroid, we eliminated some features that had very low

correlations with essay scores. We calculated the correlation matrix of features for each essay

prompt, so we had 20 matrices in total. For variation “EWmin00max03,” we included features

that had minimum correlations greater than 0 across all 20 prompts and whose maximum

correlations all exceeded 0.3. For variation “EWmin03,” we only included features with minimum

correlations greater than 0.3 across all 20 prompts. Figure 3 is a visual summary of correlations

for the three EW-type methods applied to four GMAT essay prompts (see Appendix D for more

correlations for other prompts). Pairs of boxplots of correlations across 20 GMAT essay prompts

were used to evaluate model fit. The first boxplot of each pair, indexed by “Unif,” is for the

correlation coefficient between the essay scores in the “uniform sample” and the prediction built

in the same sample. The second boxplot, indexed by “PN,” is for the correlation coefficient

between the essay scores in the “pseudo normal sample” and the estimated scores using the

model built on the “uniform sample.” The central location of the boxplots shows the predictive

power of each regression method. The drop in location between the boxplots for a given method

shows the stability of that method. The steeper the drop in location, the less stable the method;

the higher the central location, the more predictive the model.

As can be inferred from Figure 3, the two methods that did not use features that have lower

correlations with the essay score give better results. Based on this finding, we did not consider


EWall or EWmin00max03 further. Several of the e-rater features had low, possibly negative,

correlations with human ratings.

Figure 3. Boxplots of correlations between essay score and predicted essay score of 20 essay

prompts for three EW methods.


3.2.3 Principal Component Analysis

Another alternative to forward stepwise regression for reducing the number of variables

is principal component analysis (PCA). The central idea of PCA is to reduce the dimensionality

of a dataset, which consists of a large number of interrelated variables, while retaining as much

as possible of the variation present in the dataset. This is achieved by transforming the original

features into a new set of variables, the principal components, which are uncorrelated, and

ordered so that the first few retain most of the variation present in all of the original variables.
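
The transformation can be sketched with a singular value decomposition. This sketch assumes that the features are standardized first (so that the components are those of the correlation matrix) and that the first k component scores would then serve as regression predictors; neither detail is specified in this section, and the function name is hypothetical.

```python
import numpy as np


def principal_components(X, k):
    """Return scores on the first k principal components of the standardized features."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

    # Rows of Vt are the component directions, ordered by decreasing variance.
    _, singular_values, Vt = np.linalg.svd(Z, full_matrices=False)
    explained = singular_values ** 2 / np.sum(singular_values ** 2)

    scores = Z @ Vt[:k].T  # uncorrelated new variables
    return scores, Vt[:k], explained
```

The `explained` vector gives the share of total variation carried by each component, which is one way to choose how many components to retain.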


3.2.4 A Novel Application of Content Vector Analysis

Finally, CVA was considered as an alternative to forward stepwise regression for

reducing the number of variables. In particular, the cosine correlation method of CVA was used

to construct a few variables out of many variables. Some e-rater features have the same string of

characters in variable names (see Appendix A). Some have the phrase “init,” while others have

the phrase “dev.” We constructed three vectors from the 56 features: “init” (all features with

“init,” which flags a word or phrase that marks the beginning of an argument), “dev” (all features

with “dev,” which flags a word or phrase that marks argument development), and “other” (all the

rest). We also tried another partition scheme. We constructed four vectors: “words” (all features

with “W”), “phrases” (all features with “P”), “claim” (all features with “claim,” which flag

words or terms related to a claim), and “other” (the rest). We identified these two methods as

“I&D” and “W&P,” respectively.

Next, we constructed the typical pattern vector, used by the cosine correlation method, by

merging the vectors of all the examinees in each score level. After getting the typical pattern

vectors, we followed the steps of CVA, calculated cosine correlations and assigned a score based

on those values. Using this CVA approach, we reduced these 56 features to either 3 or 4

variables. Least squares regressions were then run on the two reduced predictor sets.
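
The reduction described above can be sketched as follows, under the assumption that “merging” means summing the feature vectors of all examinees at a score level. The column groupings passed in are hypothetical stand-ins for the “init,” “dev,” and “other” partitions, and the function name is illustrative.

```python
import numpy as np


def cva_group_scores(X, y, groups):
    """Reduce the feature matrix to one CVA-style score per feature group.

    X: (n, p) feature matrix; y: (n,) essay scores;
    groups: dict mapping a group name to a list of column indices.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    levels = np.unique(y)
    out = np.empty((X.shape[0], len(groups)))
    for g, cols in enumerate(groups.values()):
        sub = X[:, cols]
        # Typical pattern vector per score level: merged (summed) examinee vectors.
        typical = np.vstack([sub[y == s].sum(axis=0) for s in levels])
        num = sub @ typical.T
        den = (np.linalg.norm(sub, axis=1, keepdims=True)
               * np.linalg.norm(typical, axis=1))
        cos = np.divide(num, den, out=np.zeros_like(num), where=den > 0)
        # Assign the score level with the highest cosine correlation.
        out[:, g] = levels[np.argmax(cos, axis=1)]
    return out
```

Note that in this sketch each examinee's own vector contributes to the typical pattern for that examinee's score level, so the new variables are built partly from the criterion they are later used to predict.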

There are at least two potential problems with cosine correlation. First, cosine correlation

is used to measure the distance between two vectors, by taking the cosine value of the angle

between the two vectors regardless of their lengths, which ignores information about the

distribution of the groups. For example, as shown in Figure 4, two groups indexed by squares

and circles have different variability. When using cosine correlation, we assign a new

observation, shown by an arrow, to the densely packed circle group. However, if we take the

shape of the two groups into account, we would assign it to the highly dispersed square group.

Of a more serious nature is the fact that score levels of the dependent variable, essay

scores, are used to define the categories of vectors against which individual score vectors are

compared and then scored. The “score” assigned to an individual’s vector of (word, frequency)

pairs is the essay score of the group most like (in a cosine correlation sense) that vector of scores.

This CVA “score” is then used to predict that person’s actual essay score. This functional

dependency may yield an artificially high level of association with essay score for variables

defined by the cosine correlation approach.


Figure 4. An illustration of a drawback of classification using cosine correlation.

3.3 Alternatives to Stepwise Version of OLS Regression for Model Development

Currently, the stepwise version of OLS regression is used by e-rater to build topic-

specific models. Although it is computationally convenient, in the context of e-rater, OLS has

several shortcomings, as outlined previously. Alternatives include quantile regression, ridge

regression, classification models, and proportional odds models. Each is described in turn.

3.3.1 Quantile Regression

Quantile regression (Koenker & Bassett, 1978) is an alternative to OLS regression that

may better meet the needs of e-rater for the primary reason that different users may be interested

in different parts of the examinees’ score distribution. For example, a graduate school admission

committee may be interested in highly competitive examinees, the higher end of the conditional

score distribution, say the 90th or 95th percentile of the conditional distribution. On the other hand, a

remedial program may focus on the lower end of the conditional score distribution. In addition,

one of the effective ways to provide diagnostic or instructional information is to compare the

differences in essays written by examinees with different writing proficiencies. Typically, the

regression relationships at different percentiles (or quantiles) of the conditional score

distributions are different. So, a simple location estimate, which is what least squares mean

regression does, may not be sufficient.

It also may be the case that a school may want to penalize underestimation and

overestimation differently. The consequences of each are often quite different. Underestimation

makes the type-I error of judging qualified examinees as less than qualified; while

overestimation makes the type-II error of misclassifying less than proficient examinees as

proficient. The effects of type-II error are bad, and those of type-I may be worse. An asymmetric

penalty would allow underestimation and overestimation to be penalized differently.

The quadratic loss function, which is the objective function of the least squares method,

is very sensitive to outliers. It penalizes each misclassification by the square of the distance

between the human rater score and predicted score. Some milder loss function, say absolute loss,

may be more appropriate.

Quantile regression can be related to OLS. As shown below, it can depict a more

complete picture of the conditional score distributions by changing a tuning parameter τ

accordingly. The objective function it minimizes is an asymmetric absolute loss function, which

is robust to outliers and puts different weights on underestimation and overestimation.

The linear conditional quantile function, $Q_Y(\tau \mid X = x_i) = x_i'\beta(\tau)$, where $x_i$ and $\beta(\tau)$ are

column vectors of the same length, can be estimated by solving

$$\min_{\beta} \sum_{i=1}^{n} \rho_\tau(y_i - x_i'\beta),$$

where $\rho_\tau(u) = u\{\tau - I(u < 0)\}$ and $I(\cdot)$ is an indicator function.

When we choose $\tau > 0.5$, the check function $\rho_\tau(u)$ assigns a loss of $\tau|u|$ to

underestimation and $(1-\tau)|u|$ to overestimation, so underestimation is penalized more heavily. In the special case $\tau = 0.5$, the fitted surface is

the regression median of y on x that minimizes the absolute deviations. By linear programming,

the resulting minimization problem can be solved very efficiently (Koenker & D’Orey, 1987).
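The behavior of the check function can be illustrated with a short sketch. It assumes the simplest possible model, an intercept only, for which the minimizer of the summed check loss is a sample quantile; the function names are hypothetical, and a brute-force search over the observed values stands in for the linear programming solution cited above.

```python
def check_loss(u, tau):
    """Check function rho_tau(u) = u * (tau - I(u < 0))."""
    return u * (tau - (1.0 if u < 0 else 0.0))

def total_check_loss(y, fitted, tau):
    """Objective minimized by quantile regression: sum of rho_tau(y_i - fit_i)."""
    return sum(check_loss(yi - fi, tau) for yi, fi in zip(y, fitted))

def best_constant(y, tau):
    """Intercept-only quantile regression by brute force.

    For a constant fit, a minimizer always exists among the observed values,
    so searching over them is enough for this illustration.
    """
    candidates = sorted(set(y))
    return min(candidates, key=lambda c: total_check_loss(y, [c] * len(y), tau))

scores = list(range(1, 10))           # toy responses 1, 2, ..., 9
median_fit = best_constant(scores, 0.5)
upper_fit = best_constant(scores, 0.75)
```

With tau = 0.75 a positive residual (underestimation) of size 2 costs 1.5, while a negative residual of the same size costs 0.5, so the fitted constant moves toward the upper part of the distribution.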

3.3.2 Ridge Regression

Ridge regression (Hoerl & Kennard, 1970) is a method for overcoming serious

multicollinearity problems of OLS regression by modifying the method of least squares. When a

lot of predictors are used, collinearity may occur. That is, the observations may fall near an

(n-1)-dimensional hyperplane in an n-dimensional space. In this case, the sample covariance

matrix is singular or almost singular. If OLS is used, the resulting coefficient estimates will not

be stable and will usually be exaggerated. Ridge regression is one of several methods that have

been proposed to remedy multicollinearity problems by modifying the method of least squares to

allow biased estimators of regression coefficients. Ridge regression overcomes multicollinearity

by adding a small number, k (between 0 and 1), to the diagonal elements of the covariance

matrix. The constant k reflects the amount of bias in the estimators. When k = 0, the estimators

are reduced to the OLS regression coefficients in standard form. When k > 0, the ridge regression

coefficients are biased but tend to be more stable (i.e. less variable) than OLS estimators.
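A small numerical sketch may help show why adding k to the diagonal stabilizes the estimates. It assumes two standardized predictors, so that the matrix being modified is the 2 × 2 predictor correlation matrix; the function name and the correlation values are hypothetical and were chosen only to mimic near-collinearity.

```python
def ridge_two_predictors(r12, r1y, r2y, k):
    """Standardized ridge coefficients for two predictors.

    Solves (R + k I) b = r, where R = [[1, r12], [r12, 1]] is the predictor
    correlation matrix and r = (r1y, r2y) holds the predictor-response
    correlations. With k = 0 this is OLS in standard form.
    """
    d = 1.0 + k                       # diagonal element after adding k
    det = d * d - r12 * r12           # determinant of R + k I
    b1 = (d * r1y - r12 * r2y) / det
    b2 = (d * r2y - r12 * r1y) / det
    return b1, b2

# Two nearly collinear predictors (hypothetical correlations).
ols_b1, ols_b2 = ridge_two_predictors(0.99, 0.80, 0.78, k=0.0)
ridge_b1, ridge_b2 = ridge_two_predictors(0.99, 0.80, 0.78, k=0.1)
```

In this toy case the OLS coefficients are large and of opposite sign even though both predictors correlate positively with the response, whereas the ridge coefficients with k = 0.1 are both positive and moderate in size.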

When an estimator has only a small bias and is substantially more precise than an

unbiased estimator, it may well be the preferred estimator since it will have a larger probability

of being close to the true parameter value. A measure of the combined effect of bias and

sampling variation is the mean square error. This criterion can be used to select k and the ridge

traces can be used for variable selection (Neter et al., 1996). Methods of estimating k

have been suggested by many writers, including Hoerl and Kennard (1970); a survey of

suggested methods is given by Vinod and Ullah (1981).

3.3.3 Classification and Proportional Odds Model

Finally, classification (CLFN) methods and proportional odds models (POM) may be an

improvement over OLS in estimating essay scores. Each is described in the following.

3.3.3.1 Classification. Due to the discrete nature of the response variable (essay scores),

e-rater modeling can be conceptualized as a classification problem. The goal is to find a

classification scheme to minimize the expected cost of misclassification (ECM). Let

1) Pi = prior probability of score level i, where i = 1, 2, …, 6

2) Lki = loss associated with allocating an essay to score level k when, in fact, it belongs

to score level i and

3) fi be the density associated with score level i.

The rule that minimizes the ECM assigns an essay to the score level k for which

$$\sum_{i=1,\, i \neq k}^{6} P_i L_{ki} f_i$$

is the smallest (see Johnson & Wichern, 1992).
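The allocation rule can be sketched as follows. The function name `ecm_classify`, the density values, and the two loss matrices are hypothetical; in practice the densities would come from a fitted model evaluated at an essay's feature vector.

```python
def ecm_classify(priors, losses, densities):
    """Assign the score level k minimizing sum over i != k of P_i * L_ki * f_i.

    priors[i] and densities[i] refer to score level i + 1;
    losses[k][i] is the loss of allocating to level k + 1 when the truth is i + 1.
    """
    n = len(priors)

    def expected_cost(k):
        return sum(priors[i] * losses[k][i] * densities[i]
                   for i in range(n) if i != k)

    return min(range(n), key=expected_cost) + 1

n = 6
uniform = [1.0 / n] * n
# Zero-one loss: every misclassification costs the same.
zero_one = [[0 if k == i else 1 for i in range(n)] for k in range(n)]
# Asymmetric loss: squared when the true level exceeds the assigned one
# (underestimation), absolute otherwise.
asymmetric = [[(i - k) ** 2 if i > k else abs(i - k) for i in range(n)]
              for k in range(n)]

densities = [0.01, 0.04, 0.15, 0.32, 0.30, 0.18]   # hypothetical f_i values
level_zero_one = ecm_classify(uniform, zero_one, densities)
level_asymmetric = ecm_classify(uniform, asymmetric, densities)
```

With a uniform prior and zero-one loss the rule simply picks the level with the largest density. Under the asymmetric loss the same essay is moved up one level, because underestimating a possible 5 or 6 is penalized more heavily.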

3.3.3.2 Proportional odds model. E-rater scores can be thought of as ordered categories.

One of the most effective ways to construct a model for ordinal responses such as this is to

invoke the concept of a latent variable, Z, which we will term writing ability. The actual recorded

response Y is envisaged as a discrete manifestation of the continuous latent variable in such a

way that the relationship is monotone:

$$\alpha_{j-1} < Z \le \alpha_j \;\Longrightarrow\; Y = j$$

The “cutting-points,” $\alpha_j$, are envisaged as unknown points on the latent ability scale.

With e-rater, the z-interval $(-\infty, \alpha_1]$ is interpreted as the lowest level, whose score is 1; the

interval $(\alpha_{j-1}, \alpha_j]$ is assigned a score j, j = 2, …, 6. Unless the latent abilities are close to one

of the boundaries, similar values of the latent variable are not distinguished and give rise to

identical observed grades.

The mechanism of the model requires knowledge of latent ability. Like all mathematical

models of behavior, this is an idealization of what really happens, and is not to be taken literally,

particularly at the edges. The important part is not so much the mechanism but the prediction. If

the model predictions are sufficiently close to observations and known limiting behaviors, the

model is considered to be working.

When assuming a linear relationship between the latent ability and the covariates, we have

$Z = x'\beta_o + \varepsilon$, where $\varepsilon$ is a random variable with cumulative distribution function $F(\cdot)$ and $x$

and $\beta_o$ are column vectors of the same length. Then the probability $\Pr(Z \le z)$ is $F(z - x'\beta_o)$.

Relationship (1) specifies the implied model for Y in the form

$$\theta_j = \Pr(Y \le j) = \Pr(Z \le \alpha_j) = F(\alpha_j - x'\beta_o), \qquad (1)$$

or equivalently $F^{-1}(\theta_j) = \alpha_j - x'\beta_o$.

When we choose $F(x) = e^x/(1 + e^x)$, which is the logistic distribution function for $\varepsilon$, this scheme produces

the proportional odds model

$$\operatorname{logit}(\theta_j) = \alpha_j + x'\beta_1, \qquad (2)$$

where $\operatorname{logit}(p) = \log(p/(1-p))$ and $\beta_1 = -\beta_o$. The proportional odds model is a linear logistic

model in which the intercepts depend on category j, but the slopes are all equal. The graph of the

five cumulative logits against x is a series of parallel planes with intercepts α 1 , …, α 5 .

“Proportional odds” refers to the fact that the odds of the event $Y \le j$ satisfy

$$\operatorname{odds}(Y \le j) = \frac{\Pr(Y \le j)}{1 - \Pr(Y \le j)} = \exp(\alpha_j + x'\beta_1).$$

Consequently, the ratio of the odds of the event $Y \le j$ for $x_1$ and $x_0$ is

$$\frac{\operatorname{odds}(Y \le j \mid x_1)}{\operatorname{odds}(Y \le j \mid x_0)} = \exp((x_1 - x_0)'\beta_1),$$

which is a constant independent of j.
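The proportional odds property can be checked numerically. The sketch below assumes the logistic form of F and uses hypothetical cutting-points and coefficients; the function names are illustrative only.

```python
import math

def logistic(z):
    """Logistic distribution function F(z) = e^z / (1 + e^z)."""
    return 1.0 / (1.0 + math.exp(-z))

def cumulative_probs(cutpoints, beta_o, x):
    """theta_j = Pr(Y <= j) = F(alpha_j - x'beta_o) for each cutting-point."""
    eta = sum(b * xi for b, xi in zip(beta_o, x))
    return [logistic(a - eta) for a in cutpoints]

def category_probs(cutpoints, beta_o, x):
    """Pr(Y = j) for j = 1, ..., len(cutpoints) + 1, as differences of theta_j."""
    theta = cumulative_probs(cutpoints, beta_o, x) + [1.0]
    return [theta[0]] + [theta[j] - theta[j - 1] for j in range(1, len(theta))]

def cumulative_odds(cutpoints, beta_o, x):
    """odds(Y <= j) = theta_j / (1 - theta_j)."""
    return [t / (1.0 - t) for t in cumulative_probs(cutpoints, beta_o, x)]

cut = [-2.0, -1.0, 0.0, 1.0, 2.0]     # hypothetical alpha_1 < ... < alpha_5
beta_o = [0.8, 0.5]                   # hypothetical coefficients
probs = category_probs(cut, beta_o, [1.0, 2.0])
odds_x0 = cumulative_odds(cut, beta_o, [0.0, 0.0])
odds_x1 = cumulative_odds(cut, beta_o, [1.0, 2.0])
ratios = [a / b for a, b in zip(odds_x1, odds_x0)]
```

The five odds ratios agree up to rounding error, and each equals exp(-(x1 - x0)'beta_o), that is, exp((x1 - x0)'beta_1) with beta_1 = -beta_o.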

3.3.3.3 Connection between classification and POM. In the special situation where a uniform

prior is assumed, i.e. $P_1 = \dots = P_6$, and $L_{ki} = 0$ when $i = k$ and $L_{ki} = 1$ when $i \neq k$, the

classification scheme based on ECM in section 3.3.3.1 is the same as assigning the group

membership with the largest P(Y = j). In addition, the response variable is ordinal and POM

takes the ordering into account. There are several other methods that can be used to

model ordinal data, as seen in Agresti (1990).

4. Results

In describing the results, we first discuss how we assess the adequacy of models. This is

followed by comparisons of OLS, PCA, and EW; ridge regression and a novel application of

CVA; and POM and quantile regression. Finally, other results are presented, which include the

effect of the number of words and more on quantile regression.

4.1 How to Assess Adequacy of Models

We can evaluate the model fit and stability by R, the correlation between the response

variable and the predictions from a certain model. We use notation R(. | .), where the second dot

represents the dataset on which our model is built, and the first dot is the dataset on which the

developed model is justified or validated. A better model needs to have

(a) higher R(Unif | Unif) or training sample correlation;

(b) smaller decline from R(Unif | Unif) to R(PN | Unif), the validation sample

correlation,

where “Unif” stands for the “uniform sample” and “PN,” the “pseudo normal sample.”

This R is appropriate for the mean regression analysis where the response variable takes on

continuous values. For the classification nature of the problem, two additional measurements are

proposed to assess the fit of the model. One is Cohen’s Kappa (Cohen, 1960) between human

rater and e-rater; the other is the average loss of misclassification. Due to the discrete nature of

the score, the assessment depends on the confusion matrix (Johnson & Wichern, 1992; Kohavi

& Provost, 1998), which shows predicted versus actual group membership.

Table 2

Confusion Matrix

                     Target score
Predicted score    1     2     3     4     5     6
       1          C11   C12   C13   C14   C15   C16
       2           …
       3
       4
       5
       6          C61   C62   C63   C64   C65   C66

The columns are human scores, which are assumed to be the target scores, and rows are

predicted scores (rounded to the nearest integers, with truncation at scores 1 and 6). Cij is the

count of essays whose target scores are j, but are classified as i. So the diagonal elements

represent the successful classifications, and the off-diagonal elements represent misclassifications.
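Building the confusion matrix from continuous predictions can be sketched as below. The helper names are hypothetical, and rounding halves upward is an assumption, since the tie rule is not stated.

```python
import math

def round_and_truncate(pred, low=1, high=6):
    """Round to the nearest integer (halves up, an assumption), then truncate."""
    return min(high, max(low, int(math.floor(pred + 0.5))))

def confusion_matrix(targets, predictions, n_levels=6):
    """C[i-1][j-1] counts essays with target score j that were classified as i."""
    C = [[0] * n_levels for _ in range(n_levels)]
    for target, pred in zip(targets, predictions):
        i = round_and_truncate(pred, 1, n_levels)
        C[i - 1][target - 1] += 1
    return C

# Toy data: human scores and raw model predictions.
targets = [1, 3, 4, 6, 4]
predictions = [0.2, 3.4, 4.6, 7.3, 3.9]
C = confusion_matrix(targets, predictions)
```

Rows index the predicted score and columns the target score, matching the layout of Table 2.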

4.1.1 Kappa and Weighted Kappa

Exact agreement rate was used to evaluate the agreement between two raters because it is a

popular single summary index of the confusion matrix, which is easily calculated by

$$P_o = \frac{\sum_{i=1}^{6} C_{ii}}{\sum_{i=1}^{6}\sum_{j=1}^{6} C_{ij}} = \sum_{i=1}^{6} P_{ii}.$$

A shortcoming of Po is that it does not compare the agreement to that expected if the

ratings were independent. So it is an inflated measure of agreement between two raters. To

address this problem, Powers (2000) recommended the use of Cohen’s Kappa as a measure of

agreement. More recently, Dorans and Patsula (2003) proposed confusion reduction and

confusion infusion, which adjust for the expected agreement using trivial assignment rules, such

as rules for uniform random assignment, modal assignment, and matched proportion random

assignment.

Let $P_e$ denote the value expected on the basis of chance alone. Under the independence

assumption, it can be computed by $\sum_{i=1}^{6} P_{i\cdot} P_{\cdot i}$. The obtained excess beyond chance is $P_o - P_e$,

whereas the maximum possible excess is $1 - P_e$. The ratio of these two differences is denoted

Kappa,

$$\kappa = \frac{P_o - P_e}{1 - P_e}.$$

In our study, we choose the matched proportion random (MPR) assignment rule (Dorans &

Patsula, 2003) as the baseline needing improvement. So Pe is calculated using the score

distribution in the cross-validation sample shown in Table 1.
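As an illustration, Kappa can be computed directly from a confusion matrix. The sketch below uses the independence baseline built from the row and column marginals; the MPR baseline used in this study would replace `p_e` and is not reproduced here. The function names and the 3 × 3 toy matrix are hypothetical.

```python
def exact_agreement(C):
    """P_o: share of essays on the diagonal of the confusion matrix."""
    total = float(sum(sum(row) for row in C))
    return sum(C[i][i] for i in range(len(C))) / total

def cohen_kappa(C):
    """Kappa = (P_o - P_e) / (1 - P_e) with P_e from the marginal proportions."""
    n = len(C)
    total = float(sum(sum(row) for row in C))
    row = [sum(C[i]) / total for i in range(n)]                        # P_i.
    col = [sum(C[i][j] for i in range(n)) / total for j in range(n)]   # P_.i
    p_o = exact_agreement(C)
    p_e = sum(row[i] * col[i] for i in range(n))
    return (p_o - p_e) / (1.0 - p_e)

# Toy 3-level confusion matrix (rows: predicted, columns: target).
toy = [[20, 5, 0],
       [10, 30, 5],
       [0, 5, 25]]
p_o = exact_agreement(toy)
kappa = cohen_kappa(toy)
```

Here the exact agreement rate is .75 but the chance-expected agreement is .345, so Kappa is noticeably lower than the raw agreement rate.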

Kappa is designed for nominal classification. In the context of e-rater, where the

categories are ordered, the seriousness of a disagreement depends on the difference between the

ratings. Cohen (1968) generalized his Kappa measure to the case where the relative seriousness

of each possible disagreement could be quantified. The weighted Kappa measure (Spitzer,

Cohen, Fleiss, & Endicott, 1967) uses weights $w_{ij}$ to describe closeness of agreement. The weights

are restricted to lie in the interval $0 \le w_{ij} \le 1$ and $w_{ij} = w_{ji}$, i.e. the two raters are considered

symmetrically.

The observed weighted proportion of agreement is, say,

$$P_{o(w)} = \sum_{i=1}^{6} \sum_{j=1}^{6} w_{ij} P_{ij},$$

where the proportions $P_{ij}$ are in the confusion matrix. The chance-expected weighted proportion of

agreement is

$$P_{e(w)} = \sum_{i=1}^{6} \sum_{j=1}^{6} w_{ij} P_{ij(MPR)}.$$

Weighted Kappa is then given by

$$\kappa_w = \frac{P_{o(w)} - P_{e(w)}}{1 - P_{e(w)}}.$$

Note that, when $w_{ij} = 0$ for all $i \neq j$, and $w_{ij} = 1$ for all $i = j$, then weighted Kappa becomes

identical to the commonly used Kappa. Though developed from different perspectives in

statistics, decisions on the weights are closely related to the choices of loss function, as we will

show in the rest of this section.

4.1.2 Average Loss

We use the loss function that takes into account correct and incorrect classifications to evaluate

model fit. We can express the loss function in a format that is similar to the confusion matrix:

The columns are human scores categories, and the rows are rounded predicted scores with

truncation at scores 1 and 6.

Table 3

Loss Matrix

                     Target score
Predicted score    1     2     3     4     5     6
       1          L11   L12   L13   L14   L15   L16
       2           …
       3
       4
       5
       6          L61   L62   L63   L64   L65   L66

Let $L_{11} = L_{22} = \dots = L_{66} = 0$ and let $L_{ij}$ be the loss that occurs when score i is assigned to

an examinee whose target score is j. The average loss, denoted by L, is computed by the

following formula:

$$L = \frac{\sum_{i,j=1}^{6} C_{ij} L_{ij}}{\sum_{i,j=1}^{6} C_{ij}}.$$

The exact agreement rate can be expressed through the average loss: with $L_{ii} = 0$ and $L_{ij} = 1$ for $i \neq j$, it equals $1 - L$.

The quadratic loss function is $L_{ij} = |i - j|^2$, which is the minimand for the OLS estimator; the

absolute loss function is $L_{ij} = |i - j|$, which yields the MAD estimator. Both loss

functions are symmetric.

Fleiss and Cohen (1973) used the weights

$$w_{ij} = 1 - \frac{(i - j)^2}{(k - 1)^2},$$

where k is the number of score categories. Independent of Cohen (1968), Cicchetti and Allison (1971) proposed a statistic for measuring

inter-rater reliability that is formally identical to weighted Kappa. They suggested using the

weights

$$w_{ij} = 1 - \frac{|i - j|}{k - 1}.$$

The two weights considered above are naturally connected to the quadratic loss and

absolute loss. For example,

$$P_{o(sq)} = 1 - \frac{1}{(k-1)^2} L_{o(sq)},$$

and some simple algebra shows that

$$\kappa_{sq} = \frac{P_{o(sq)} - P_{e(sq)}}{1 - P_{e(sq)}} = \frac{L_{MPR(sq)} - L_{o(sq)}}{L_{MPR(sq)}},$$

where, with the losses rescaled by $1/(k-1)^2$, $L_{MPR(sq)} = 1 - P_{e(sq)}$ and $L_{o(sq)} = 1 - P_{o(sq)}$.
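The equivalence between quadratically weighted Kappa and the squared-loss form can be verified numerically. This sketch again uses the independence baseline from the marginals in place of the MPR baseline, which is an assumption; the function names and the toy matrix are hypothetical.

```python
def proportions(C):
    """Cell proportions plus row and column marginal proportions."""
    n = len(C)
    total = float(sum(sum(row) for row in C))
    P = [[C[i][j] / total for j in range(n)] for i in range(n)]
    row = [sum(P[i]) for i in range(n)]
    col = [sum(P[i][j] for i in range(n)) for j in range(n)]
    return P, row, col

def quadratic_weights(k):
    """w_ij = 1 - (i - j)^2 / (k - 1)^2."""
    return [[1.0 - (i - j) ** 2 / float((k - 1) ** 2) for j in range(k)]
            for i in range(k)]

def weighted_kappa(C, w):
    """(P_o(w) - P_e(w)) / (1 - P_e(w)) with a chance term built from marginals."""
    P, row, col = proportions(C)
    n = len(C)
    p_o = sum(w[i][j] * P[i][j] for i in range(n) for j in range(n))
    p_e = sum(w[i][j] * row[i] * col[j] for i in range(n) for j in range(n))
    return (p_o - p_e) / (1.0 - p_e)

def kappa_from_squared_loss(C):
    """(L_chance - L_observed) / L_chance using average squared loss."""
    P, row, col = proportions(C)
    n = len(C)
    loss_obs = sum((i - j) ** 2 * P[i][j] for i in range(n) for j in range(n))
    loss_chance = sum((i - j) ** 2 * row[i] * col[j]
                      for i in range(n) for j in range(n))
    return (loss_chance - loss_obs) / loss_chance

toy = [[20, 5, 0],
       [10, 30, 5],
       [0, 5, 25]]
kappa_w = weighted_kappa(toy, quadratic_weights(3))
kappa_l = kappa_from_squared_loss(toy)
```

The two values coincide because the factor 1/(k - 1)^2 appears in both the numerator and the denominator of the ratio and cancels.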

An asymmetric loss function typically requires $L_{ij} > L_{ji}$ when j > i. In this

study, we use the asymmetric loss function $L_{ij} = |i - j|\, I(i > j) + |i - j|^2\, I(i \le j)$, which is a combination

of the upper triangle of quadratic loss and lower triangle of absolute loss. The currently chosen

asymmetric loss function is best viewed as a starting point.
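The loss functions of this section, and the average loss they induce, can be generated as in the following sketch. The function names and the `kind` labels are hypothetical; rows index the predicted score i and columns the target score j.

```python
def loss_matrix(kind, n=6):
    """Build an n x n loss matrix L[i-1][j-1] for predicted i and target j."""
    def loss(i, j):
        d = abs(i - j)
        if kind == "equal":
            return 0 if d == 0 else 1
        if kind == "absolute":
            return d
        if kind == "quadratic":
            return d * d
        if kind == "asymmetric":
            # absolute loss when i > j, quadratic loss when i <= j
            return d if i > j else d * d
        raise ValueError("unknown loss kind: %s" % kind)
    return [[loss(i, j) for j in range(1, n + 1)] for i in range(1, n + 1)]

def average_loss(C, L):
    """Sum of C_ij * L_ij divided by the total count."""
    n = len(C)
    total = float(sum(sum(row) for row in C))
    return sum(C[i][j] * L[i][j] for i in range(n) for j in range(n)) / total

asym = loss_matrix("asymmetric")

# Toy 3-level confusion matrix to illustrate the average loss.
toy = [[20, 5, 0],
       [10, 30, 5],
       [0, 5, 25]]
equal_loss = average_loss(toy, loss_matrix("equal", 3))
```

The first and last rows of `asym` reproduce the corresponding rows of the asymmetric loss matrix in Table 7, and the average equal loss is one minus the exact agreement rate.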

4.2 Results Comparison

4.2.1 OLS, PCA, and EW

Figure 5 shows the regression results of six methods:

i) EWmin03: equal weight regression using features whose minimum correlations with essay

scores are greater than 0.3 across all 20 prompts

ii) LSall: least squares regression using all the linguistic features

iii) LS18: least squares regression using 18 features selected by the forward selection scheme

iv) LS8: least squares regression using 8 features selected by the forward selection scheme

v) PC40: PCA regression with first 40 principal components

vi) PC8: PCA regression with first 8 principal components

Two boxplots of correlations across 20 GMAT essay prompts were used to evaluate

model fit. The first boxplot of the two, indexed by “Unif,” is for the correlation coefficient

between the essay scores in the “uniform sample” and the prediction built in the same sample.

The second boxplot, indexed by “PN,” is for the correlation coefficient between the essay scores

in the “pseudo normal sample” and the estimated scores using the model built on the “uniform

sample.” The central location of the boxplots shows the predictive power of each regression

method. The drop in location between the boxplots for a given method shows the stability of that

method. The steeper the drop in location, the less stable the method; the higher the central

location, the more predictive the model.

Here are the most obvious findings drawn from Figure 5:

1. One dominant feature in the figure is the overfitting of LSall to the data—it has the

steepest location drop among all the methods, which means it is the least stable

method. This finding was expected because LSall keeps all the features no matter

how relevant they are to the response variable.

2. PCA, in general, is more stable than LSall, but it did not account for as much of the

variation of the response variable as LSall does. We are not surprised to see that the

Rs of LSall are generally higher than those of PCA. This result occurs because the

two methods, PCA and LSall, seek different goals. The predictive components of the

PCA regression are linear combinations of the predictor variables that are chosen to

maximize the variance of the linear combination (the principal component) making

sure at the same time that each successive linear combination is independent of prior

ones. The coefficients of LS are computed to minimize the sum of squares of the

difference between the observed response variable and the predicted values. Thus,

PCA may not be as predictive as LS.

3. EW does not work well in e-rater modeling. Though EWmin03 was a better predictor

than EWall, it still failed to compete with the LSall and PCA methods.

4. LS8, which is our surrogate for forward stepwise OLS, is the best performer in Figure

5 in terms of both predictive power and stability.

5. The variation of results across different methods is greater in the Unif training

samples than it is in the PN validation samples.

Figure 5. Boxplots of correlations between essay score and predicted essay score based on six regression methods for 20 GMAT essay prompts.

4.2.2 Ridge Regression and a Novel Application of CVA

Figure 6 compares ridge regression with LS8, LSall, I&D, and W&P, where LSall and

LS8 are as defined in section 4.2.1; and I&D and W&P are two LS methods that use CVA as

defined in section 3.2.4. In this ridge regression method, we use a formula proposed by Hoerl,

Kennard, and Baldwin (1975) to estimate the ridge parameter k:

$$\hat{k} = \frac{p\,\hat{\sigma}^2}{\hat{\beta}'\hat{\beta}},$$

where p is the number of predictors, $\hat{\beta}$ is the vector of OLS coefficient estimates, and $\hat{\sigma}^2$ is the error

variance estimate that would be obtained when the OLS method is used with all the feature

variables. It can be shown that, with this choice of ridge parameter, the ridge estimates can yield predictions

with smaller mean square error than those produced by the OLS method. In addition, the

corresponding ridge estimates using the above-mentioned ridge parameter estimate are a special

case of so-called “double h-class ridge estimates.” See Vinod and Ullah (1981) for details.

From the comparisons in Figure 6 we see:

1. The partition scheme using “init” and “dev” (I&D) gave a slightly better result than

that using “words” and “phrases” (W&P) when evaluated in terms of both

predictability and stability.

2. The two CVA methods (W&P and I&D) gave comparable results to LS8, which was

our proxy for e-rater.

3. Ridge regression cross-validated better than LSall, but did not cross-validate as well

as the other three methods.

Figure 6. Boxplots of correlations between essay score and predicted essay score for 20 GMAT essay prompts comparing three regression methods with two methods using cosine correlation.

4.2.3 POM and Quantile Regression

Both POM and quantile regression used the same subset of features as selected by forward least

squares method. As mentioned in section 3.3.1, the quantile regression method—unlike OLS,

which only provides a single location estimation, the conditional mean function—gives a series

of estimates of conditional quantile functions. For prediction, a single quantile regression, which

yields the highest Kappa or lowest loss, needs to be selected from among this series. To pick the

correct regression quantile, 19 quantile regression procedures, with τ ranging from 5% to 95% in

steps of 5%, were computed on the combined dataset (combining Unif and PN) with 749

examinees. For each quantile regression method, Kappa, weighted Kappa, and several average

losses were computed. We chose the τth regression quantile with highest Kappa.

For example, Table 4 shows the exact agreement rates and Kappas of different regression

quantiles for GMAT essay prompt 2434. Only the 45th to 70th regression quantiles are shown

because this range contained the highest Kappa. The 55th regression quantile (in bold) was

selected. As a reference, we did a hold-out procedure using the 50th, 55th, and 60th regression

quantiles. These results are shown in Table 5.

Table 4

Exact Agreement Rate and Kappa for a GMAT Essay Prompt

Quantile          45%    50%    55%    60%    65%    70%
Agreement rate    .501   .503   .520   .518   .495   .485
Kappa             .358   .361   .383   .380   .351   .338

Table 5

Hold-out Exact Agreement Rate and Kappa for a GMAT Essay Prompt

Method            OLS    POM    CLFN   50%    55%    60%
Agreement rate    .490   .511   .503   .499   .502   .497
Kappa             .344   .371   .361   .356   .360   .353

Table 5 contains the hold-out exact agreement rates and corresponding Kappas for four

methods. For CLFN, we assumed a uniform prior. For the other two components, fi and Lki as

defined in section 3.3.3.1, we used POM to fit fi, the density associated with score level i, and

used a combination of an upper triangle of squared loss and lower triangle of absolute loss as

seen in Table 7.

Table 6

Average Loss for a GMAT Essay Prompt

        OLS    POM    CLFN   50%    55%    60%
Sq      .640   .603   .634   .615   .636   .652
Ab      .553   .527   .542   .539   .544   .552
Eq      .553   .527   .540   .539   .544   .551
Aj      .582   .548   .557   .560   .563   .565

Table 7

Asymmetric Loss Matrix

      1    2    3    4    5    6
1     0    1    4    9    16   25
2     1    0    1    4    9    16
3     2    1    0    1    4    9
4     3    2    1    0    1    4
5     4    3    2    1    0    1
6     5    4    3    2    1    0

Table 6 shows four different average losses, where “Sq” means squared loss; “Ab”—absolute

loss; “Eq”—equal loss and “Aj”—adjacent loss, which is shown in Table 8.

Table 8

Adjacent Loss Matrix

      1    2    3    4    5    6
1     0    1    2    2    2    2
2     1    0    1    2    2    2
3     2    1    0    1    2    2
4     2    2    1    0    1    2
5     2    2    2    1    0    1
6     2    2    2    2    1    0

The corresponding weighted Kappas are listed in Table 9.

Table 9

Generalized Kappa for the Same GMAT Essay Prompt in Table 8

OLS POM CLFN 50% 55% 60%

Sq .791 .803 .793 .799 .792 .787

Ab .597 .616 .605 .607 .603 .597

Eq .289 .322 .306 .307 .300 .291

Aj .504 .533 .525 .523 .520 .518

POM had the highest Kappa and weighted Kappa and lowest loss; it outperformed the

other methods in this prompt. The 55th regression quantile and CLFN gave basically the same

results. These results are typical, as seen in Appendix E.

4.3 Other Results

4.3.1 Effects of the Number of Words

In the past, empirical studies (Breland, Bonner, & Kubota, 1995; Page & Petersen, 1995;

Sheehan, 2003) have shown that the number of words is highly correlated with essay scores. We

noticed that two features always selected by forward selection across the 20 GMAT essay

prompts were 1) the number of auxiliary verbs and 2) the ratio of the number of auxiliary verbs

to the number of words. If we divide 1) by 2), we get the total number of words. So the effect of the

number of words deserves in-depth study.

We fit a quadratic regression using number of words to essay scores of Prompt 2438. In

this regression, we predicted essay score from the total number of words, and that total squared.

The dataset on which we built the model combined the uniform sample and pseudo-normal

sample. The Kappa between human scores and the predicted scores based on words and words

squared was .344. When OLS with forward selection was used on this prompt, Kappa was .354.

The quadratic regression using the number of words did a very good job of predicting, as has

been demonstrated by Sheehan (2003) and others. But there remained aspects of the essay score

that the quadratic regression using the number of words could not capture.

To see what lies beyond the number of words and that number squared, we fit a “two-step

regression” to the essay scores. The two-step regression proceeded as follows:

i) fit a quadratic regression using the number of words to essay scores, yi

ii) apply OLS with forward selection to fit the residuals, ri , from step i

The corresponding hold-out procedure for cross-validation involved three steps. For each

examinee in the sample, we:

a) predicted yi using the estimated quadratic regression in step i

b) predicted ri using the estimated regression equation from step ii

c) summed the predicted values from a) and b) to produce a final prediction
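The two-step procedure and its prediction rule can be sketched as follows. Several simplifications are assumptions made for brevity: step ii uses plain OLS on all supplied features rather than forward selection, least squares is solved through the normal equations, word counts are expressed in hundreds, and the data and function names are hypothetical.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        pivot = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[pivot] = M[pivot], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x

def ols(X, y):
    """Least squares coefficients from the normal equations X'X b = X'y."""
    p = len(X[0])
    XtX = [[sum(row[a] * row[b] for row in X) for b in range(p)] for a in range(p)]
    Xty = [sum(row[a] * yi for row, yi in zip(X, y)) for a in range(p)]
    return solve(XtX, Xty)

def predict(X, coef):
    return [sum(c * v for c, v in zip(coef, row)) for row in X]

def quadratic_design(words):
    return [[1.0, w, w * w] for w in words]

def two_step_fit(words, features, y):
    """Step i: quadratic regression on words. Step ii: OLS on the residuals."""
    b1 = ols(quadratic_design(words), y)
    resid = [yi - fi for yi, fi in zip(y, predict(quadratic_design(words), b1))]
    b2 = ols([[1.0] + list(f) for f in features], resid)
    return b1, b2

def two_step_predict(words, features, b1, b2):
    """Steps a) to c): predicted score plus predicted residual."""
    part_a = predict(quadratic_design(words), b1)
    part_b = predict([[1.0] + list(f) for f in features], b2)
    return [a + b for a, b in zip(part_a, part_b)]

# Hypothetical data: word counts (in hundreds), one extra feature, essay scores.
words = [1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5]
features = [[0.0], [1.0], [0.0], [1.0], [1.0], [0.0], [1.0], [0.0]]
y = [1.0, 2.0, 2.0, 3.0, 4.0, 4.0, 5.0, 5.0]

b1, b2 = two_step_fit(words, features, y)
step_one_only = predict(quadratic_design(words), b1)
combined = two_step_predict(words, features, b1, b2)
```

On the fitting sample the combined prediction can never have a larger sum of squared errors than step i alone, because setting all step ii coefficients to zero is one of the candidate fits; whether the gain carries over to new essays is what the hold-out procedure checks.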

Tables 10 and 11 show that, when taking the effects of the number of words into account

and using OLS, we could improve the prediction accuracy; the Kappa increased from .344 to

.378 (the exact agreement rate rose from .490 to .516). In addition, we looked at using the

quantile regression approach in place of the OLS method as the second step of the two-step

regression. The 60th regression quantile yielded the highest Kappa in the hold-out sample, .396,

which is higher than the OLS two-step regression.

Table 10

Exact Agreement of Two-step Regression on a Combined Sample

                                      Fitted values   Hold-out cross-validation
OLS with forward selection                .498                 .490
Two-step OLS regression                   .524                 .516
Two-step quantile regression   50th       .526                 .520
                               55th       .526                 .521
                               60th       .534                 .530
                               65th       .510                 .509

Table 11

Kappa of Two-step Regression on a Combined Sample

                                      Fitted values   Hold-out cross-validation
OLS with forward selection                .354                 .344
Two-step OLS regression                   .388                 .378
Two-step quantile regression   50th       .390                 .383
                               55th       .390                 .384
                               60th       .401                 .396
                               65th       .370                 .369

Another interesting finding was that the two variables, the number of auxiliary verbs and

ratio of the number of auxiliary verbs to the number of words, were not entered in the regression

equation in the second step. We expected to see this because the first step already took into

account the effects of the number of words.

Our conclusion is that number of words is a powerful predictor. However, there remains

something number of words cannot capture—predicting the part not accounted for by words

from other e-rater features and adding that to the part predicted by words yields a better score

prediction.

5. Conclusions, Recommendations, and Future Research

Our explorations to date have resulted in certain conclusions and directions for future

research.

5.1 Conclusions

We have examined enough e-rater data to conclude that:

1. Stepwise regression seemed to be an effective feature reduction procedure with e-rater

data. Not suited were equal weights regression and principal components regression as

these approaches assume dimensional structures that were not consistent with the e-rater

data. Equal weights regression works best when the predictor variables exhibit

collinearity and each predictor is related to the criterion (essay score in this case) to about

the same degree. Principal components regression works best when most of the criterion

variable projects onto the reduced feature space defined by the dimensions that maximize

variance in that space. With e-rater data, there are a few relatively strong predictors.

2. The effectiveness of stepwise regression may be attributed to the consistently strong

relationship with essay score that is observed for the content vector analysis (CVA)

variables and the two variables that are surrogates for word length (number of auxiliary

verbs and ratio of the number of auxiliary verbs to the number of words). Note that

because performance on the criterion variable is used to develop scores on the CVA

variables, there is a functional dependency between essay score and the scores on these

CVA variables that boosts the predictive power of these CVA variables. Furthermore,

word length has been shown to be a powerful predictor of essay score. The ratio of two e-

rater features, (the number of auxiliary verbs / the ratio of the number of auxiliary verbs

to the number of words), is the number of words.

3. Replacing the current two-stage approach of first developing a model in a quasi-uniform

training sample and then validating these results in a target cross-validation sample with

the hold-out method for evaluating validity will yield better validation results. One

possibility is the K-fold cross-validation method (Hastie, Tibshirani, & Friedman, 2001)

in which part of the available data is used to fit the model and a different part to test it.

This approach entails splitting the data into K roughly equal-sized parts. For example,

when K = 10, the scenario looks like this:

1      2      3      4      5      6      7      8      9      10
Train  Train  Train  Train  Test   Train  Train  Train  Train  Train

For the kth part (5th above), we fit the model to the other K-1 parts of the data, and

calculate the average losses or confusion reductions of the fitted model when predicting

the kth part of the data. We do this for k = 1, 2, …, K and combine the K estimates of the

average loss or confusion reductions. The hold-out procedure is a special case of this K-

fold method.
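A minimal sketch of the K-fold procedure is given below. The function names, the `fit` and `loss` callables, and the toy data are hypothetical; in practice the observations would usually be shuffled before splitting, and the average loss would be one of the losses or agreement measures discussed in section 4.1.

```python
def k_fold_indices(n, K):
    """Split indices 0..n-1 into K consecutive, roughly equal-sized parts."""
    sizes = [n // K + (1 if i < n % K else 0) for i in range(K)]
    folds, start = [], 0
    for size in sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def cross_validate(data, K, fit, loss):
    """Average held-out loss over K folds.

    fit(train) returns a model; loss(model, observation) returns a number.
    """
    total = 0.0
    for test_idx in k_fold_indices(len(data), K):
        held_out = set(test_idx)
        train = [d for i, d in enumerate(data) if i not in held_out]
        model = fit(train)                      # fit on the other K - 1 parts
        total += sum(loss(model, data[i]) for i in test_idx)
    return total / len(data)

# Fold sizes for a sample of 749 examinees with K = 10.
fold_sizes = [len(f) for f in k_fold_indices(749, 10)]

# Toy example: the "model" is the training mean, evaluated with squared loss.
cv_loss = cross_validate(
    [1.0, 2.0, 3.0, 4.0], 4,
    fit=lambda train: sum(train) / len(train),
    loss=lambda mean, obs: (obs - mean) ** 2,
)
```

Every observation is held out exactly once, so each prediction entering the combined loss comes from a model that did not see that observation.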

5.2 Future Research

More research is needed to:

1. Explicitly model the part of essay scores that is unrelated to word length.

2. Investigate the POM approach in greater depth because the ordered categorical nature of

essay scores means that automated scoring of essays is essentially a classification

problem. Viewing the essay-grading problem as a classification problem will enable us to

explicitly consider the consequences of attaching different loss functions to different

kinds of misclassifications.

3. See if it matters much whether the response variable is treated as if it is continuous, as is

the case with model building for the current approach, or as a category as is the case

currently for model evaluation. One could argue that the grades assigned by raters to

essays are actually a coarse discrete version of a continuous variable of essay quality. In

that case, a classification model may not be appropriate because the score categories are

not really categories in the sense that gender and/or national origin are categorical. Then

measures of association other than agreement may be of greater interest.

4. Develop a statistical justification for using essay scores to score CVA variables, and if

such a justification can be developed, determine if CVA can be used as part of a model

building process that uses representative samples. For example, it might be possible that

CVA could use a uniformly distributed sample, while a representative sample is used to

build the model.

5. Investigate algorithmic approaches to the prediction/classification problem, such as boosting.

The applications mentioned in this report tend to operate within what Breiman (2001)

calls a data modeling approach. Response variables are viewed as functions of predictor

variables, random noise and parameters. For example, the prediction model for e-rater

states that essay scores for a given prompt are a linear combination of about ten variables

that are selected from a set of more than 50 candidate predictor variables. A subset of

variables is selected and parameters are estimated. The resultant model is then used to

score future essays. An alternative to this traditional data modeling approach that is used

increasingly in prediction settings is what Breiman (2001) calls the algorithmic approach.

Instead of postulating a model for predicting the response variable from a string of

predictors, fitting the model, and evaluating its fit, the algorithmic approach tries to find a

function that does the best job of predicting the response variable in the population.

6. Investigate quantile regression further because it appears to give better agreement rates

than OLS and is more flexible to differential treatment of types of error. Feng, Ying, and

Dorans (2002) have begun explorations along these lines (see Appendix F).

7. Investigate whether ridge regression can be used more effectively once changes are made

in the data employed by e-rater.

8. Due to the large pool of potential predictor variables, use of a “best” subsets algorithm

may not be feasible. Approaches that overcome the shortcomings of automatic search

procedures should be investigated. For example, we might use the subset identified by the

automatic search procedures as a starting point for search for other “good” subsets. One

option might be to treat the number of predictor variables in the regression model

identified by the automatic search procedure as being about the right subset size and then

use the all-possible-regressions procedures for subsets of this and nearby sizes. Another

possibility is to use a logical analysis of the predictors’ space in conjunction with

statistics to weight variables.

9. Investigate whether incorporating prior information about coefficients based on available

models for other prompts can be used to produce more stable e-rater models.
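The point made in item 2, that a classification view lets different misclassifications carry different losses, can be sketched as follows. This is a minimal illustration, not the procedure used in the report: the category probabilities, the penalty weights, and the function names are all hypothetical, and the only idea shown is choosing the score category with the smallest expected loss.

```python
# Hypothetical sketch: pick the score category that minimizes expected loss.
SCORES = [1, 2, 3, 4, 5, 6]

def symmetric_loss(actual, predicted):
    """Absolute distance between predicted and actual score."""
    return abs(predicted - actual)

def asymmetric_loss(actual, predicted, over_penalty=2.0, under_penalty=1.0):
    """Distance-based loss where over-prediction costs more (assumed weights)."""
    diff = predicted - actual
    return over_penalty * diff if diff > 0 else under_penalty * -diff

def min_expected_loss_score(probs, loss):
    """probs[k] is an estimated probability that the true score is SCORES[k]."""
    def expected_loss(predicted):
        return sum(p * loss(actual, predicted) for actual, p in zip(SCORES, probs))
    return min(SCORES, key=expected_loss)

# Made-up probabilities for one essay (illustration only).
probs = [0.05, 0.10, 0.25, 0.35, 0.20, 0.05]
print(min_expected_loss_score(probs, symmetric_loss))
print(min_expected_loss_score(probs, asymmetric_loss))
```

With these made-up numbers the two loss functions select different scores for the same essay, which is the kind of consequence the classification framing makes explicit.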

References

Agresti, A. (1990). Categorical data analysis. New York: Wiley.

Breiman, L. (2001). Statistical modeling: The two cultures (with discussion). Statistical Science,

16(3), 199-231.

Breland, H. M., Bonner, M. W., & Kubota, M. Y. (1995). Factors in performance on brief,

impromptu essay examinations (College Board Report No. 95-4 and ETS RR-95-41).

New York: College Entrance Examination Board.

Burstein, J., Braden-Harder, L., Chodorow, M., Hua, S., Kaplan, B., Kukich, K., Lu, C., Nolan,

J., Rock, D., & Wolff, S. (1998). Computer analysis of essay content for automated score

prediction: A prototype automated scoring system for GMAT analytical writing

assessment essays (ETS RR-98-15). Princeton, NJ: Educational Testing Service.

Cicchetti, D. V., & Allison, T. (1971). A new procedure for assessing reliability of scoring EEG

sleep recordings. American Journal of EEG Technology, 11, 101-109.

Cohen, J. (1960). A coefficient of agreement for nominal scales. Educational and Psychological

Measurement, 20, 37-46.

Cohen, J. (1968). Weighted Kappa: Nominal scale agreement with provision for scaled

disagreement or partial credit. Psychological Bulletin, 70, 213-220.

Dorans, N. J., & Drasgow, F. (1978). Alternative weighting schemes for linear prediction.

Organizational Behavior & Human Performance, 21(3), 316-345.

Dorans, N. J., & Drasgow, F. (1980). A note on cross-validating prediction equations. Journal

of Applied Psychology, 65(6), 728-730.

Dorans, N. J., & Patsula, L. N. (2003). Using confusion infusion and confusion reduction indices

to compare alternative essay scoring rules (ETS RR-03-09). Princeton, NJ: Educational

Testing Service.

Einhorn, H. J., & Hogarth, R. M. (1975). Unit weighting schemes for decision making.

Organizational Behavior & Human Performance, 13(2), 171-192.

Feng, X., Ying, Z., & Dorans, N. J. (2002). Using quantile regression to model simulated

essay scores. Paper presented at the Annual Meeting of the Psychometric Society,

Chapel Hill, NC.

Fleiss, J. L., & Cohen, J. (1973). The equivalence of weighted Kappa and the intraclass

correlation coefficient as measure of reliability. Educational and Psychological Measurement,

33, 613-619.

Hastie, T., Tibshirani, R., & Friedman, J. (2001). The elements of statistical learning: Data

mining, inference, and prediction. New York: Springer.

Hoerl, A. E., & Kennard, R. W. (1970). Ridge regression: Biased estimation for nonorthogonal

problems. Technometrics, 12(3), 55-67.

Hoerl, A. E., Kennard, R. W., & Baldwin, K. F. (1975). Ridge regression: some simulations.

Communications in Statistics, 4, 105-123.

Johnson, R. A., & Wichern, D. W. (1992). Applied multivariate statistical analysis. Englewood

Cliffs, NJ: Prentice Hall.

Koenker, R., & Bassett, G. (1978). Regression quantiles. Econometrica, 46, 33-50.

Koenker, R., & D’Orey, V. (1987). Computing regression quantiles. Applied Statistics, 36, 383-393.

Kohavi, R., & Provost, F. (1998). Glossary of terms. Journal of Machine Learning, 30, 271-274.

Lachenbruch, P. A., & Mickey, M. R. (1968). Estimation of error rates in discriminant analysis.

Technometrics, 10(1), 1-11.

Neter, J., Kutner, M. H., Nachtsheim, C. J., & Wasserman, W. (1996). Applied linear regression

models (3rd ed.). Chicago, IL: Irwin.

Page, E. B., & Peterson, N. (1995). The computer moves into essay grading: updating the ancient

test. Phi Delta Kappa, 76, 561-565.

Powers, D. E. (2000). Computing reader agreement for the GRE writing assessment (ETS RM-

00-8). Princeton, NJ: Educational Testing Service.

Sheehan, K. M. (2003). Discrepancies in human and computer-generated essay scores for

TOEFL-CBT. Manuscript submitted for publication.

Spitzer, R. L., Cohen, J., Fleiss, J. L., & Endicott, J. (1967). Quantification of agreement in

psychiatric diagnosis. Archives of General Psychiatry, 17, 83-87.

Tucker, L. R. (1951). A method for synthesis of factor analysis studies (Personnel Research

Section Report No.984). Washington, DC: Dept. of the Army.

Vinod, H. D., & Ullah, A. (1981). Recent advances in regression methods. New York: Marcel Dekker.

Weisberg, S. (1985). Applied linear regression (2nd ed.). New York: Wiley.

List of Appendixes

A. E-rater® Feature Variables Circa 2001

B. Correlations Between Essay Score and Predicted Essay Score for Nine Methods Given Practice of Using Uniform Sample to Build the Model and Pseudo-normal Sample to Cross-validate the Model

C. Correlations Between Essay Score and Predicted Essay Score for Nine Methods After Reversing Roles of Samples (Pseudo-normal Sample Used to Build the Model and Uniform Sample Used to Cross-validate the Model)

D. Correlations Between Essay Score and Predicted Essay Score for Three Equal Weights Methods

E. Cross-validated Exact Agreement Rates and Kappas for Four Methods on All 20 Prompts

F. More on Quantile Regression

Appendix A E-rater® Feature Variables Circa 2001

Discourse-related Features

Note:

Arg_init = word (W) or phrase (P) that flags the beginning of an argument

Arg_dev = word or phrase that flags argument development

Capitalized terms indicate a “class” of discourse, e.g., PARALLEL_W = words or terms like,

First, Second, that indicate “parallel” argument structure.

1. total number of paragraphs

2. sentences in paragraph 1

3. number of arg_inits in paragraph 1

4. number of arg_devs in paragraph 1

5. number of arg_aux in paragraph 1

6. total_arg_init in essay

7. total_arg_dev in essay

8. total_arg_init_CLAIM_N

9. total_arg_init_CLAIM_PRO

10. total_arg_init_CLAIM_THAT

11. total_arg_init_CLAIM_THAT_LESS

12. total_arg_init_CLAIM_Ving

13. total_arg_init_PARALLEL_W

14. total_arg_init_PARALLEL_P

15. total_arg_init_SUMMARY_W

16. total_arg_init_SUMMARY_P

17. total_arg_init_TRANSITION_P

18. total_arg_init_To_INFCL

19. total_arg_init_D_SPECIFIC

20. total_arg_init_NEW_PARAGRAPH

21. total_arg_dev_CLAIM_THAT

22. total_arg_dev_CLAIM_THAT_LESS

23. total_arg_dev_CLAIM_Ving

24. total_arg_dev_To_INFCL

25. total_arg_dev_SAME_TOPIC

26. total_arg_dev_BELIEF_W

27. total_arg_dev_BELIEF_P

28. total_arg_dev_CONTRAST_W

29. total_arg_dev_CONTRAST_P

30. total_arg_dev_DETAIL_W

31. total_arg_dev_DETAIL_P

32. total_arg_dev_DISBELIEF_W

33. total_arg_dev_EVIDENCE_W

34. total_arg_dev_EVIDENCE_P

35. total_arg_dev_INFERENCE_W

36. total_arg_dev_INFERENCE_P

37. total_arg_dev_REFORMULATION_W

38. total_arg_dev_REFORMULATION_P

39. total_arg_dev_RHETORICAL_W

40. total_arg_dev_RHETORICAL_P

41. total words in 1st argument

42. total sentences in 1st argument

43. total arg_devs in 1st argument

Syntax-related

1. COMPCL= total number of complement clauses

2. INFCL = total number of infinitive clauses

3. RELCL = total number of relative clauses

4. SUBCL = total number of subordinate clauses

5. AUX_VERB = total number of auxiliary verbs

6. R-COMPCL= ratio of complement clauses based on total number of sentences

7. R-INFCL = ratio of infinitive clauses based on total number of sentences

8. R-RELCL = ratio of relative clauses based on total number of sentences

9. R-SUBCL = ratio of subordinate clauses based on total number of sentences

10. R-AUX_VERB = ratio of auxiliary verbs based on total number of sentences

Topic-related

1. Analysis of all essay vocabulary

2. Analysis of essay vocabulary within each argument (the essay is partitioned into arguments by the e-rater discourse analyzer)

3. Analysis of essay vocabulary in the first and final arguments of the essay

Appendix B Correlations Between Essay Score and Predicted Essay Score for Nine Methods Given Practice of Using Uniform Sample to

Build the Model and Pseudo-normal Sample to Cross-validate the Model

Lsall LS8 LS18 Ridge I&D W&P PC8 PC40 Ewmin03

Prompt Unif PN Unif PN Unif PN Unif PN Unif PN Unif PN Unif PN Unif PN Unif PN

2399 0.841 -0.094 0.816 -0.069 0.834 -0.082 0.835 -0.077 0.826 -0.053 0.828 -0.063 0.781 -0.027 0.834 -0.092 0.749 -0.028

2400 0.831 -0.043 0.811 -0.019 0.824 -0.036 0.824 -0.023 0.818 -0.017 0.823 -0.037 0.784 -0.006 0.822 -0.037 0.771 0.005

2420 0.852 -0.095 0.823 -0.044 0.841 -0.078 0.843 -0.068 0.824 -0.032 0.831 -0.052 0.773 -0.006 0.837 -0.078 0.780 -0.026

2421 0.898 -0.142 0.873 -0.109 0.890 -0.125 0.894 -0.129 0.878 -0.107 0.880 -0.111 0.832 -0.080 0.877 -0.126 0.811 -0.077

2427 0.861 -0.132 0.831 -0.053 0.850 -0.093 0.855 -0.101 0.834 -0.079 0.834 -0.068 0.792 -0.033 0.847 -0.104 0.766 -0.032

2428 0.852 -0.075 0.825 -0.028 0.839 -0.052 0.845 -0.054 0.832 -0.032 0.842 -0.039 0.786 0.011 0.851 -0.073 0.778 -0.016

2429 0.889 -0.098 0.870 -0.072 0.884 -0.086 0.885 -0.090 0.878 -0.075 0.880 -0.077 0.852 -0.061 0.886 -0.094 0.822 -0.067

2431 0.906 -0.156 0.881 -0.131 0.899 -0.138 0.903 -0.149 0.887 -0.117 0.889 -0.112 0.831 -0.098 0.903 -0.142 0.823 -0.113

2434 0.901 -0.132 0.864 -0.063 0.880 -0.093 0.897 -0.117 0.872 -0.068 0.875 -0.073 0.818 -0.022 0.888 -0.112 0.788 -0.027

2438 0.899 -0.151 0.873 -0.081 0.889 -0.120 0.895 -0.140 0.883 -0.105 0.889 -0.112 0.821 -0.082 0.891 -0.142 0.792 -0.060

2441 0.872 -0.088 0.848 -0.055 0.866 -0.075 0.866 -0.078 0.852 -0.053 0.852 -0.058 0.822 -0.055 0.857 -0.084 0.803 -0.059

2448 0.878 -0.119 0.852 -0.056 0.868 -0.086 0.872 -0.095 0.856 -0.049 0.858 -0.055 0.821 -0.050 0.861 -0.094 0.800 -0.060

2450 0.869 -0.080 0.844 -0.040 0.862 -0.070 0.864 -0.072 0.850 -0.027 0.857 -0.036 0.816 -0.037 0.862 -0.080 0.792 -0.043

2554 0.885 -0.099 0.873 -0.087 0.880 -0.093 0.881 -0.091 0.873 -0.090 0.877 -0.097 0.847 -0.079 0.878 -0.093 0.824 -0.08

2557 0.887 -0.146 0.858 -0.103 0.877 -0.117 0.883 -0.131 0.856 -0.083 0.869 -0.114 0.835 -0.067 0.881 -0.142 0.802 -0.064

2689 0.908 -0.123 0.888 -0.081 0.903 -0.106 0.904 -0.115 0.893 -0.080 0.896 -0.083 0.861 -0.077 0.896 -0.128 0.837 -0.053

2690 0.832 -0.106 0.799 -0.039 0.821 -0.092 0.825 -0.078 0.809 -0.060 0.808 -0.058 0.765 -0.018 0.826 -0.092 0.755 -0.016

2702 0.893 -0.130 0.873 -0.104 0.888 -0.122 0.888 -0.123 0.866 -0.087 0.872 -0.087 0.837 -0.081 0.883 -0.133 0.820 -0.078

2703 0.854 -0.134 0.830 -0.103 0.845 -0.129 0.848 -0.117 0.834 -0.091 0.836 -0.093 0.804 -0.078 0.847 -0.125 0.772 -0.039

0626 0.901 -0.081 0.873 -0.029 0.889 -0.063 0.895 -0.070 0.888 -0.044 0.884 -0.040 0.850 -0.041 0.888 -0.083 0.824 -0.026

Note. Unif denotes uniform sample; PN denotes pseudo-normal sample.

Appendix C Correlations Between Essay Score and Predicted Essay Score for Nine Methods After Reversing Roles of Samples (Pseudo-

normal Sample Used to Build the Model and Uniform Sample Used to Cross-validate the Model)

Prompt PN Unif PN Unif PN Unif PN Unif PN Unif PN Unif PN Unif PN Unif PN Unif

2399 0.833 -0.070 0.800 -0.018 0.821 -0.044 0.830 -0.063 0.813 -0.020 0.811 -0.031 0.776 -0.013 0.825 -0.067 0.768 -0.004

2400 0.833 -0.037 0.810 -0.021 0.821 -0.031 0.830 -0.033 0.817 -0.014 0.812 -0.010 0.787 -0.011 0.826 -0.033 0.767 0.014

2420 0.830 -0.029 0.803 -0.002 0.816 -0.015 0.826 -0.027 0.815 -0.005 0.811 -0.001 0.774 0.003 0.811 -0.030 0.756 0.007

2421 0.824 0.022 0.801 0.047 0.811 0.035 0.821 0.025 0.806 0.046 0.807 0.045 0.771 0.053 0.814 0.030 0.755 0.062

2427 0.824 -0.017 0.803 0.009 0.811 0.000 0.821 -0.010 0.810 -0.006 0.815 -0.012 0.768 0.024 0.817 -0.013 0.717 0.048

2428 0.838 -0.035 0.819 -0.008 0.825 -0.020 0.835 -0.033 0.820 -0.003 0.826 -0.001 0.803 -0.013 0.825 -0.031 0.762 0.001

2429 0.837 0.016 0.817 0.029 0.825 0.025 0.835 0.022 0.823 0.037 0.825 0.037 0.797 0.047 0.834 0.021 0.755 0.063

2431 0.821 0.042 0.798 0.063 0.810 0.051 0.818 0.041 0.809 0.052 0.812 0.054 0.763 0.055 0.813 0.036 0.726 0.086

2434 0.845 -0.002 0.826 0.021 0.840 0.000 0.842 -0.006 0.835 0.011 0.835 0.014 0.805 -0.005 0.828 -0.003 0.757 0.013

2438 0.819 0.037 0.801 0.049 0.812 0.045 0.816 0.040 0.808 0.051 0.809 0.054 0.758 0.053 0.808 0.042 0.737 0.076

2441 0.827 0.000 0.810 0.021 0.820 0.010 0.824 0.012 0.813 0.023 0.813 0.020 0.768 0.053 0.814 0.013 0.751 0.040

2448 0.840 -0.017 0.828 -0.004 0.831 -0.008 0.837 -0.007 0.829 0.010 0.828 0.012 0.788 0.038 0.836 -0.010 0.754 0.051

2450 0.852 -0.032 0.834 -0.017 0.843 -0.020 0.850 -0.030 0.844 -0.010 0.851 -0.018 0.790 0.011 0.841 -0.032 0.753 0.043

2554 0.843 -0.005 0.815 0.034 0.837 0.003 0.841 0.001 0.826 0.015 0.832 0.004 0.787 0.045 0.840 -0.004 0.766 0.050

2557 0.817 0.019 0.790 0.044 0.809 0.028 0.814 0.027 0.799 0.036 0.797 0.041 0.768 0.066 0.813 0.020 0.733 0.066

2689 0.844 0.014 0.832 0.032 0.841 0.020 0.841 0.022 0.835 0.039 0.837 0.040 0.792 0.060 0.829 0.022 0.789 0.047

2690 0.817 -0.049 0.791 -0.023 0.814 -0.044 0.815 -0.047 0.801 -0.031 0.803 -0.041 0.778 -0.030 0.814 -0.045 0.744 0.011

2702 0.818 0.024 0.800 0.043 0.814 0.027 0.815 0.026 0.803 0.041 0.806 0.039 0.774 0.054 0.812 0.021 0.739 0.083

2703 0.810 -0.014 0.785 0.004 0.803 -0.007 0.806 -0.006 0.789 0.012 0.794 0.006 0.755 0.035 0.801 -0.008 0.739 0.014

0626 0.875 -0.027 0.855 -0.001 0.870 -0.020 0.872 -0.015 0.864 0.000 0.863 -0.003 0.814 0.034 0.864 -0.015 0.786 0.040


Appendix D Correlations Between Essay Score and Predicted Essay Score for Three Equal Weights

Methods

Ewall Ewmin03 Ewmin00max03

Prompt Unif PN Unif PN Unif PN

2399 0.655 0.629 0.749 0.721 0.696 0.702

2400 0.699 0.688 0.771 0.776 0.727 0.747

2420 0.732 0.699 0.780 0.754 0.771 0.764

2421 0.748 0.688 0.811 0.734 0.797 0.723

2427 0.717 0.660 0.766 0.734 0.754 0.710

2428 0.709 0.704 0.778 0.762 0.746 0.716

2429 0.762 0.709 0.822 0.755 0.798 0.744

2431 0.729 0.599 0.823 0.710 0.789 0.646

2434 0.739 0.736 0.788 0.761 0.779 0.767

2438 0.734 0.638 0.792 0.732 0.751 0.699

2441 0.714 0.671 0.803 0.744 0.752 0.713

2448 0.698 0.644 0.800 0.740 0.748 0.681

2450 0.697 0.675 0.792 0.749 0.737 0.711

2554 0.744 0.656 0.824 0.744 0.795 0.717

2557 0.774 0.696 0.802 0.738 0.798 0.716

2689 0.785 0.712 0.837 0.784 0.833 0.769

2690 0.672 0.691 0.755 0.739 0.714 0.737

2702 0.746 0.681 0.820 0.742 0.780 0.732

2703 0.726 0.680 0.772 0.733 0.771 0.729

0626 0.776 0.723 0.824 0.798 0.795 0.767


Appendix E Cross-validated Exact Agreement Rates and Kappas for Four Methods on All 20 Prompts

Table E1 Exact Agreement Rate for Cross-validation Using Four Methods

Note. OLS denotes ordinary least squares; POM denotes proportional odds model; CLFN denotes classification using asymmetric loss matrix; QR denotes quantile regression, with the τ values listed in the last column.

Prompt OLS POM CLFN QR τ

2399 .484 .481 .490 .486 55th

2400 .457 .474 .451 .481 50th

2420 .481 .486 .485 .494 55th

2421 .495 .487 .514 .509 55th

2427 .473 .469 .472 .470 55th

2428 .457 .453 .452 .453 60th

2429 .523 .540 .531 .534 55th

2431 .491 .502 .503 .513 50th

2434 .490 .511 .503 .502 55th

2438 .490 .522 .524 .514 60th

2441 .502 .511 .518 .520 55th

2448 .524 .524 .519 .534 50th

2450 .503 .514 .503 .510 55th

2554 .489 .498 .495 .490 60th

2557 .448 .462 .455 .460 50th

2689 .513 .518 .507 .515 55th

2690 .491 .518 .495 .507 50th

2702 .470 .482 .493 .497 70th

2703 .431 .468 .452 .453 60th

0626 .531 .535 .538 .535 55th

Table E2 Kappa for Cross-validation Using Four Methods

Note. OLS denotes ordinary least squares; POM denotes proportional odds model; CLFN denotes classification using asymmetric loss matrix; QR denotes quantile regression, with the τ values listed in the last column.

Prompt OLS POM CLFN QR τ

2399 0.336 0.333 0.344 0.339 55th

2400 0.302 0.324 0.294 0.333 50th

2420 0.333 0.339 0.338 0.349 55th

2421 0.351 0.340 0.375 0.369 55th

2427 0.322 0.317 0.321 0.318 55th

2428 0.302 0.297 0.295 0.297 60th

2429 0.387 0.408 0.397 0.401 55th

2431 0.345 0.360 0.361 0.374 50th

2434 0.344 0.371 0.361 0.360 55th

2438 0.344 0.385 0.388 0.375 60th

2441 0.360 0.371 0.380 0.383 55th

2448 0.388 0.388 0.381 0.401 50th

2450 0.361 0.375 0.361 0.370 55th

2554 0.343 0.354 0.351 0.344 60th

2557 0.290 0.308 0.299 0.306 50th

2689 0.374 0.380 0.366 0.376 55th

2690 0.345 0.380 0.351 0.366 50th

2702 0.318 0.334 0.348 0.353 70th

2703 0.268 0.316 0.295 0.297 60th

0626 0.397 0.402 0.406 0.402 55th


Appendix F More on Quantile Regression

Because the quantile regression method can examine different locations in the

conditional score distributions, we may use this characteristic to improve

classification/prediction accuracy. For example, with prompt 2427, we applied OLS, POM, and

six quantile regressions with the tuning parameter set to 5%, 23%, 41%, 59%, 77%, and 95%

to a random sample of size 247 from the pseudo-normal sample of 494 in

order to demonstrate how quantile regression might be used to produce better classification

near the thresholds. These six regression quantiles were chosen to represent the evenly

distributed essay scores 1 to 6; that is, the 5% quantile represents the cutting point between score

levels 1 and 2, the 23% quantile represents the cutting point between score levels 2 and 3, and so on.
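As a reminder of what the tuning parameter τ does, the sketch below uses the check (pinball) loss of Koenker and Bassett (1978) in the simplest possible setting, a model with no predictors. It is an illustration under stated assumptions rather than the fitting procedure used in this study: the scores are made up, the function names are hypothetical, and the search is restricted to observed values. Minimizing the check loss at a small τ pulls the fitted value toward the low end of the distribution, and a large τ pulls it toward the high end.

```python
def check_loss(residual, tau):
    """Check (pinball) loss for one residual, where residual = observed - fitted."""
    return residual * (tau - (1.0 if residual < 0 else 0.0))

def best_constant(values, tau):
    """Observed value that minimizes total check loss (a sample tau-quantile)."""
    candidates = sorted(set(values))
    return min(candidates,
               key=lambda c: sum(check_loss(v - c, tau) for v in values))

# Made-up essay scores (illustration only).
scores = [1, 2, 2, 3, 3, 3, 4, 4, 4, 4, 5, 5, 6]

for tau in (0.05, 0.50, 0.95):
    print(tau, best_constant(scores, tau))
```

In a regression setting the constant is replaced by a linear function of the predictors, but the role of τ in weighting positive and negative residuals is the same.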

For contrast, Tables F1 and F2 show the confusion matrices of OLS and POM,

respectively. The diagonal elements, as bolded, indicate correct classifications.

Table F1

Confusion Matrix of OLS on Prompt 2427

Actual

Predicted 1 2 3 4 5 6

1 1 1 0 0 0 0

2 5 10 12 2 0 0

3 3 15 35 25 1 0

4 0 1 21 44 21 1

5 0 1 1 6 16 11

6 0 0 0 0 6 8


Table F2

Confusion Matrix of POM on Prompt 2427

Actual

1 2 3 4 5 6

1 2 3 0 0 0 0

2 6 11 19 4 0 0

3 1 11 25 19 1 0

4 0 2 24 44 17 0

5 0 1 1 10 20 12

6 0 0 0 0 6 8
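The exact agreement rate and Cohen's (1960) kappa can be computed directly from a confusion matrix such as Table F1. The sketch below is a minimal illustration with hypothetical function names; the counts are copied from Table F1, with rows taken as predicted scores and columns as actual scores. Because Table F1 is based on a random subsample of 247 essays, the resulting values need not match the prompt 2427 entries in Appendix E.

```python
# Counts from Table F1 (OLS, prompt 2427): rows = predicted 1-6, columns = actual 1-6.
OLS_CONFUSION = [
    [1, 1, 0, 0, 0, 0],
    [5, 10, 12, 2, 0, 0],
    [3, 15, 35, 25, 1, 0],
    [0, 1, 21, 44, 21, 1],
    [0, 1, 1, 6, 16, 11],
    [0, 0, 0, 0, 6, 8],
]

def exact_agreement(matrix):
    """Proportion of essays on the diagonal (predicted score equals actual score)."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total

def cohen_kappa(matrix):
    """Unweighted kappa: agreement corrected for chance agreement from the marginals."""
    total = sum(sum(row) for row in matrix)
    row_totals = [sum(row) for row in matrix]
    col_totals = [sum(col) for col in zip(*matrix)]
    observed = exact_agreement(matrix)
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / total ** 2
    return (observed - expected) / (1 - expected)

print(round(exact_agreement(OLS_CONFUSION), 3))
print(round(cohen_kappa(OLS_CONFUSION), 3))
```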

Tables F3 through F8 are the confusion matrices for the six regression quantiles. Two

distinguishing findings are:

1) Regression quantiles yielded very high agreement as measured by Kappa for

the particular score levels they were aiming at (see the bolded numbers). For

example, at score level 1, the 5th regression quantile only misclassified 1 of 9;

while OLS and POM misclassified 8 and 7 out of 9. The differences were

greater at higher and lower ends of the quantile regressions.

2) This extra correct prediction of scores had a cost associated with it. For

example, at score level 2, 20 out of 28 2s were predicted to be 1, and 21 of the

69 of 3s were labeled 1.

Table F3

Confusion Matrix of 5th Regression Quantile

Actual

Predicted 1 2 3 4 5 6

1 8 20 21 3 0 0

2 1 6 35 37 8 0

3 0 1 12 33 17 3

4 0 1 1 4 16 11

5 0 0 0 0 2 5

6 0 0 0 0 2 1


Table F4

Confusion Matrix of 23rd Regression Quantile

Actual

Predicted 1 2 3 4 5 6

1 6 7 2 1 0 0

2 3 15 28 7 1 0

3 0 5 29 48 14 0

4 0 1 10 18 16 5

5 0 0 0 3 11 10

6 0 0 0 0 2 5

Table F5

Confusion Matrix of 41st Regression Quantile

Actual

Predicted 1 2 3 4 5 6

1 2 3 0 0 0 0

2 5 11 15 3 0 0

3 2 12 42 39 5 0

4 0 1 11 30 21 3

5 0 1 1 5 15 10

6 0 0 0 0 3 7

Table F6

Confusion Matrix of 59th Regression Quantile

Actual

Predicted 1 2 3 4 5 6

1 1 1 0 0 0 0

2 4 6 2 0 0 0

3 4 17 37 12 0 0

4 0 3 28 56 18 0

5 0 1 2 8 20 11

6 0 0 0 1 6 9


Table F7

Confusion Matrix of 77th Regression Quantile

Actual

Predicted 1 2 3 4 5 6

1 0 1 0 0 0 0

2 2 4 0 0 0 0

3 7 15 28 7 0 0

4 0 6 31 51 13 0

5 0 2 10 18 21 6

6 0 0 0 1 10 14

Table F8

Confusion Matrix of 95th Regression Quantile

Actual

Predicted 1 2 3 4 5 6

1 0 0 0 0 0 0

2 0 1 0 0 0 0

3 3 6 4 0 0 0

4 6 13 29 16 0 0

5 0 8 31 54 22 1

6 0 0 5 7 22 19

A possible solution to this problem exists in theory. If we “know” the position of the

examinee in the conditional score distributions, we would be able to use a more appropriate

regression quantile for prediction. For instance, if we “know” a person’s essay score lies

approximately between the 30th and 45th percentiles, then the 41st quantile regression would give a decent

prediction. Of course, the problem is that we can never know the percentile of an

examinee’s essay score in practice.

A preliminary study showed that one possible way to solve the puzzle is to use quantile

regression itself. The six regression quantiles provided six predictions for each examinee; for

example, the columns labeled 5th, 23rd, 41st, 59th, 77th, and 95th in Table F9 show these six

predictions for six examinees whose essay score is 1 according to human raters. The column

labeled mean shows the mean of the six predictions. We used the rounded value of the mean as a

rough estimate of the location of the examinee’s score in the conditional score distributions. For

example, for examinee 154, the rough location estimate is 2, so we used the 23rd quantile

regression, which is the 2nd among the six, to produce the final prediction. The result is 1, as

bolded. By the same argument, this method provided six correct predictions. In contrast, the 50th

regression quantile yielded only one exactly correct prediction, as shown in the column labeled

50th in Table F9.

Table F9

Combining Several Regression Quantile Results to Give More Accurate Prediction

ID 5th 23rd 41st 50th 59th 77th 95th mean

154 1 1 2 2 3 3 4 2.3 ⇒23rd

196 1 1 2 2 2 3 3 2.0 ⇒23rd

205 1 1 1 1 1 2 3 1.5 ⇒23rd

214 1 1 2 2 2 3 4 2.2 ⇒23rd

233 1 1 1 2 2 2 3 1.7 ⇒23rd

244 1 1 2 3 3 3 4 2.3 ⇒23rd
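The combination rule applied in Table F9 can be sketched as follows. This is a minimal illustration rather than the code used in the study: the function and variable names are hypothetical, rounding halves upward is an assumption (it is consistent with the mean of 1.5 for examinee 205 being mapped to the 23rd quantile), and the predictions are copied from Table F9 with the 50th-quantile column left out because it does not enter the mean.

```python
QUANTILES = ["5th", "23rd", "41st", "59th", "77th", "95th"]

# Predictions from the six regression quantiles for the examinees in Table F9.
PREDICTIONS = {
    154: [1, 1, 2, 3, 3, 4],
    196: [1, 1, 2, 2, 3, 3],
    205: [1, 1, 1, 1, 2, 3],
    214: [1, 1, 2, 2, 3, 4],
    233: [1, 1, 1, 2, 2, 3],
    244: [1, 1, 2, 3, 3, 4],
}

def combined_prediction(preds):
    """Use the rounded mean prediction to choose which quantile's prediction to keep."""
    mean = sum(preds) / len(preds)
    location = int(mean + 0.5)                     # round half up (assumed tie rule)
    location = min(max(location, 1), len(preds))   # keep within score levels 1 to 6
    return QUANTILES[location - 1], preds[location - 1]

for examinee, preds in PREDICTIONS.items():
    print(examinee, combined_prediction(preds))
```

For each of the six examinees the rule selects the 23rd regression quantile, whose prediction equals the human score of 1.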

This preliminary result shows a promising property of the quantile regression method.

Future research is needed, such as seeing how prior information may be used to estimate the

location in conditional score distributions. More complicated combinations of

regression quantiles, for instance, the “systematic statistics,” which are linear combinations of

several regression quantiles, also need study. We hope that, with further research, a more effective

decision rule that takes advantage of the quantile regression method can improve the accuracy of

e-rater prediction.
