English 588 Teaching Freshman Composition - "Machine Scoring of Student Essays: For Better or Worse" essay for English Rhetoric Portfolio written by Andrea Edwards.
Edwards 1
Andrea M. Edwards
Dr. John Edlund
English 588- Teaching Freshman Composition
11 June 2013
Revised: Fall 2015
Machine Scoring of Student Essays: For Better or For Worse?
Abstract
This article addresses machine, or computer, scoring of students' essays, a topic of discussion since 1966 (Cushing Weigle 336). Machine scoring is a method in which computers or machines grade or score student papers using a set rubric or algorithm, and it will remain a topic of discussion for generations to come in the field of education for K-12 and college instructors alike. I analyze the benefits and the setbacks of using machine scoring for student essays from a historical and economic point of view, beginning with the history and the methodology of machine scoring for student essays.
Machine scoring of students' essays has been a topic of discussion since 1966 (Cushing Weigle 336), and it will remain one for generations of K-12 and college instructors alike. We are now living in the "Computer Era," and as such we use computers for the majority of our daily tasks, but there is a downside to using computer technology when it involves student writing and grading. Machine or computer scoring is a method in which computers or machines grade or score student papers using a set rubric or algorithm. Much like Watson from Jeopardy (Hesse 1), a machine that can only process a question but not analyze it, these machines are not perfect: they can score the mechanical and grammatical errors in students' papers, but they will not always check for understanding (Page et al. 7). According to Page et al., "results of these studies have also suggested that computers perform as well or better than human raters" (6), but I argue this cannot be the case, for as Page et al. also state, "the claim that the computer can actually 'understand' the text is not accurate" (7). I acknowledge that machine scoring has been implemented at several universities and has been around for quite some time, but these machines are not yet fully capable of doing what a human rater can do. This is not to say machine scoring should not be used, but if it is to be used, it should be used with the knowledge that it can perform certain grading tasks and not others. This is seen in the different types of grading engines used to perform these tasks (Page et al. 6-7) and in the various limits on what the computers can do (Page et al. 7). Page et al. use the example of "Queen America sailed to Santa Maria with 1492 ships" (7): although the sentence makes sense grammatically, it does not make sense in terms of comprehension.
This system of grading is also fueled by economics and politics: "noting that automated scoring has already been authorized, funded, and supported by government officials and business leaders, Ericsson and Haswell offer their collection as means of educating teachers, administrators, and students as members of a community responsible for action on behalf of higher education to count the inroads made by others" (Rutz 139). But is machine scoring economical and feasible in the long run? What are the reasons behind using this kind of mechanical system to grade student papers? How beneficial is machine or computer scoring to the students and to the instructors? I plan to explore these questions, analyzing the benefits and the setbacks of using machine or computer scoring for student essays from a historical and economic point of view, and addressing the ideas and different functions of these systems. I begin with the history and the different types of machine scoring systems for student essays.

(I can currently state that these computer scoring systems are not substitutes for human raters; in fact, human scoring is still necessary to ensure a student's level of analytical and critical understanding of what they are reading and writing.)
Machine scoring of essays has been around since the 1960s, roughly 1966 (Cushing Weigle 336). According to Sara Cushing Weigle in her article "Validation of Automated Scores of TOEFL iBT Tasks Against Non-test Indicators of Writing Ability," machine scoring "has only recently been used on a large scale. For some years e-rater was used operationally, along with human raters, on the Graduate Management Admission Test (GMAT), and e-rater is also used in ETS' web-based essay evaluation service, known as Criterion" (336). This e-rater "is trained on a large set of essays to extract a small set of features that are predictive of scores given by human raters" (Cushing Weigle 336), and the current version has "a standard set of features across prompts, allowing both general and prompt-specific modeling for scoring" (Cushing Weigle 336). It scores for specific features: errors in grammar, usage, mechanics, and style; organization; development; lexical complexity; and prompt-specific vocabulary usage (Cushing Weigle 336). Automated essay scoring, or AES (Steier et al. 126), was used to study how an e-rater would score essays compared to a human rater, and the study found that the e-rater and the human rater scored essays in nearly the same manner (Steier et al. 136-39). Many groups have developed machine scoring systems, and the first group at the heart of this topic and controversy is EdX.

(The usage of these grading systems is highly fueled by economics and politics: from a financial viewpoint, educators would not have to be paid to grade papers. However, students are not benefitting from machine scoring due to flaws and glitches in these various means of machine scoring. I completely stand by the arguments made here due to the personal experiences I have had with Achieve 3000 and SBAC at SFiAM.)
EdX is a nonprofit enterprise founded by Harvard and the Massachusetts Institute of Technology to offer courses online. EdX has created an automated essay scoring system and has made its software available free online to any institution that wishes to use it (Markoff and Strauss). The company's president, Anant Agarwal, an electrical engineer, "predicted that the scoring software would be a useful pedagogical tool, enabling students to take tests and write essays over and over and improve the quality of their answers" (Markoff). Valerie Strauss also brings up EdX in her article "Can Computers Really Grade Essay Tests?", describing the software as new and saying it "uses artificial intelligence to grade student essays and short written answers." While I can see that this software is designed to grade students' papers and answers, this does not answer the question of whether the scoring is accurate, and it does not indicate how accurate the software's scoring is. Stephen P. Balfour states in his article "Assessing Writing in MOOCs: Automated Essay Scoring and Calibrated Peer Review" that "Although EdX's application is not yet available for testing, three longstanding commercial Automated Essay Scoring (AES) applications have been tested and are established in the academic literature (Shermis, Burstein, Higgins, & Zechner, 2010)" (40). This appears problematic: although three AES applications have undergone testing, one has to wonder why EdX has yet to test its own system. Could it be that there are still glitches in the system, and if so, when will those errors be addressed?

(EdX reminds me of Achieve 3000 and SBAC in the sense that the systems are relatively similar; however, I have discovered there is no correlation between the companies. Achieve 3000 and SBAC are computer grading systems created for K-12 and designed to have the same capabilities as EdX and other AES applications, though there is some question as to how similar or dissimilar these systems might be.)

Now, EdX is only one example of many systems available for machine scoring, and I will list the names of these systems.
Other systems available are:
E-rater, developed by Educational Testing Service (ETS), is used
as one reader for evaluating the essay portion of the Graduate
Management Admissions Test—a human is still the other reader.
Intellimetric, developed by Vantage Technologies, is used for evaluating
writing in a range of applications, K through college. WritePlacer Plus,
developed by Vantage for the College Board, is being marketed as a cheap
and reliable placement instrument. The Intelligent Essay Assessor,
developed by Landauer, Laham, and Foltz at the University of Colorado,
is being marketed through their company, Knowledge Analysis
Technologies, to evaluate essay exams for college courses across
disciplines (Herrington and Moran 480).
These systems have similar yet distinct designs and standards for scoring and grading writing that may or may not agree with what a human rater would say. There are also websites such as Turnitin.com that allow instructors to correct student essays and check for plagiarism (turnitin.com), but this system has its own share of flaws when correcting papers: it flags ideas and quotes that should not be deemed plagiarism, yet often are, because of the software embedded in the system. This technology was created by people who are not themselves using it for the purposes it was designed for, and it is up to professors and instructors to make sure these e-raters are performing at the same level as a human rater. In the studies discussed above, there were only slight variations in how the essays were scored. There are also various names for machine scoring, and more insight to be had into how e-raters work.
Machine or computer scoring systems have different tasks and usually have different names associating them with their different strategies and methods for grading or scoring student essays. There are various nicknames for these systems, such as "robo-reader" (Hesse 1), "e-rater" (Cushing Weigle 336), "essay grading software" and "artificial intelligence technology" (Markoff), and Automated Essay Scoring, or AES. Regardless of the name, the task is the same: to score student essays in the most efficient way possible, and there are several methodologies e-raters employ.

Here are some of the methodologies and criteria these systems use for scoring student essays: articulates a clear and insightful position on an issue; develops the position fully with compelling reasons and persuasive examples; sustains a well-focused, well-organized analysis, connecting ideas logically; and demonstrates facility with the conventions of standard written English (Steier et al. 129). With criteria such as these, instructors would wish to believe computers are scoring student papers as they should be, and in "Scoring with the Computer," Michael Steier et al. mention that "essay ratings typically use a small number of categories that correspond to the descriptor levels in the scoring rubrics. Many large-scale assessment programs use a six-point scale" (127), which indicates the e-rater should be using the same scale and system that a human rater uses. But what is it really scoring?
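To make concrete how an e-rater might reduce an essay to scoreable features, here is a minimal sketch in Python. Everything in it is invented for illustration: the features are crude surface measures, and the weights stand in for coefficients a real system would fit against a corpus of human-scored essays; an actual engine like e-rater uses a far richer feature set.

```python
import re

def extract_features(essay: str) -> dict:
    """Reduce an essay to a few surface features (illustrative only)."""
    words = re.findall(r"[A-Za-z']+", essay)
    sentences = [s for s in re.split(r"[.!?]+", essay) if s.strip()]
    return {
        "word_count": len(words),
        "avg_sentence_length": len(words) / max(len(sentences), 1),
        "avg_word_length": sum(len(w) for w in words) / max(len(words), 1),
        "vocab_diversity": len({w.lower() for w in words}) / max(len(words), 1),
    }

# Hypothetical weights, standing in for coefficients fit to human scores.
WEIGHTS = {"word_count": 0.005, "avg_sentence_length": 0.08,
           "avg_word_length": 0.3, "vocab_diversity": 2.0}

def score(essay: str) -> float:
    """Weighted sum of surface features, clamped to a six-point scale."""
    feats = extract_features(essay)
    raw = sum(WEIGHTS[name] * value for name, value in feats.items())
    return max(1.0, min(6.0, raw))

# The scorer never looks at meaning: a factually absurd sentence with clean
# grammar and varied vocabulary earns a respectable score.
print(round(score("Queen America sailed to Santa Maria with 1492 ships."), 2))
```

The point of the sketch is exactly the essay's argument: nothing in such a pipeline checks whether the text is true or even sensible, only how its surface measures out.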
In Valerie Strauss's article "Grading Writing: The Art and Science--and Why Computers Can't Do It," she gives examples of how computers might miss something, or score a piece of writing as correct when in reality a human rater would mark it as wrong. She starts with the question "Which writing is better?" and lists it three times with two choices each. Her first pair of answers is A) "See Dick run! Run, Dick, run!" and B) "Dick's running merits attention and encouragement." The correct answer here is B because it is the more sophisticated answer, written at a high school level; the only case in which A would be correct is for a student at the elementary school level, and that is not the case here because the student has surpassed elementary school ("Grading Writing"). She gives other examples as well, but they are similar to the one I have provided. Grammatically, either answer would be correct, but in terms of lexical complexity and vocabulary usage, A would not be strong enough for a student in high school or college. (In addition: this had also proven to be an issue with systems such as Achieve 3000, in which students would type in answers to questions from articles they had read; if a student was showing improvement, or lack of improvement, in their reading, writing, and comprehension levels, the computer was not necessarily taking that into consideration. This is a case where educators have to come in and double-check where the student actually stands academically. Educators cannot fully rely on the system to do all of the work for them.) Steier et al. state that:
In the context of large-scale, high-stakes writing assessments in particular, a primary goal is to ensure that raters think similarly enough about what constitutes a high- or low-quality student response to achieve reasonable consistency of scores across ratings. [...] Moreover, rater training has not been able to eliminate completely these differences. (126)

Due to problems such as this, e-raters still have a long way to go until they can score essays by the same criteria and in the same manner that a human rater would.
Human raters are often said to be biased and slightly unreliable when it comes to grading student papers. In their article "Trait Ratings for Automated Essay Grading," Page et al. claim that "the results of these studies have also suggested that the computer performs as well or better than human raters" (6) and later go on to say, "when good reliability among human raters is obtained, it is sometimes for different reasons. The best conclusion that can be reached is it is hard to get raters to articulate why an essay is good (or bad) but that they can recognize good writing when they see it" (6). This would lead us to believe that human raters cannot score as well as computers, but in their article "Scoring with the Computer: Alternative Procedures for Improving the Reliability of Holistic Essay Scoring," Steier et al. state that:

There is indeed empirical support for the close resemblance between human and automated scores. Based on a sample of 2000 sixth- to twelfth-grade students, each of whom wrote two essays, Attali and Burstein (2006) estimated the true-score correlations between e-rater and human essay scores to be .97. In other words, the alternate-form correlations between human and machine scores were almost the same as the alternate-form reliability of the human scores. (126)
So the e-raters are not so different from human raters, and this demonstrates that human raters are still reliable when it comes to scoring essays. If the problem is not with human raters, then the problem is the cost and economics of machine scoring.
There is the cost of the computer hardware, the software, and the maintenance for these computers, and if these computers can only score papers grammatically, then a human scorer is still needed to make sure that a paper is comprehensible. In his article, Douglas D. Hesse states that "there are two primary reasons given for having computers score writing. One is economic, a savings that accrues in mass testing situations" (5). In a massive testing situation such as the SAT or the ACT, using a computer scoring system for writing would make sense because, as Hesse goes on to state, "if you have hundreds of thousands of essays to grade for a national testing situation (as for example with the Common Core standards), hiring and training enough human raters to complete the task in a reasonable time is expensive" (5). Valerie Strauss appeared to agree with what Hesse had to say, and she articulated this in her own article.
In her article "Can Computers Really Grade Essay Tests?", Valerie Strauss states that "in 2010, the federal government awarded $330 million to two consortia of states 'to provide ongoing feedback to teachers during the course of the school year, measure annual school growth, and move beyond narrowly focused bubble tests'" (United States Department of Education). She also goes on to say, "by combining the already existing National Assessment of Educational Progress (NAEP) assessment structures for evaluating school system performance with ongoing portfolio assessment of student learning by educators, we can cost-effectively assess writing without relying on flawed machine-scoring methods" (Strauss). Throughout my ongoing research, there has been no real indicator of how much money will be spent, or how much will be saved, by using machine or computer scoring for student essays. There have been attempts to save money through machine scoring by paying human raters less for their time and services.
However, human raters do not necessarily cost any more than e-raters. In their article "What Happens When Machines Read Our Students' Writing?", Herrington and Moran state that one system, WPP, "is marketed as costing $5.00 per scored essay test" (487). They also note that "this is more costly than our present human reader placement program at U Mass Amherst ($3.70 per essay) and more expensive than the placement programs now established at some other Massachusetts public colleges" (487). Herrington and Moran go on to say:

We would respond that the alleged cost for Accuplacer and WPP do not themselves include in-network management, user training, and depreciation. Interestingly, the College Board also offers "trained reader scoring"—at $10.00 per essay test!—for those schools still clinging to old ways. (487)
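The per-essay figures Herrington and Moran quote make the comparison easy to run for any cohort size. A quick sketch (the 10,000-essay cohort is hypothetical; the per-essay rates are the ones quoted above):

```python
# Per-essay rates quoted by Herrington and Moran (487).
rates = {
    "WPP machine scoring": 5.00,
    "UMass Amherst human readers": 3.70,
    "College Board trained readers": 10.00,
}

essays = 10_000  # hypothetical placement cohort

for program, per_essay in rates.items():
    # Total cost of scoring the cohort at this program's quoted rate.
    print(f"{program}: ${per_essay * essays:,.2f}")
```

At these quoted rates the human reader program is actually the cheapest of the three, which is the authors' point: machine scoring is not automatically the economical option.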
What this all means is that oftentimes the e-raters can cost as much as human raters, and the cost does not seem to vary much. What a university or college pays for an e-rater could be the same as what it pays a human rater, depending on the program and the level of training raters are given in order to score those essays. In other cases, it makes sense to use both an e-rater and a human rater to score the essays, "while recognizing the limitations of most scoring approaches… teachers can nonetheless push strongly for human scorers so that computers, with their mechanical counting of syllables and words, do not become substitutes for the human interchange between writer and reader that lies at the heart of communication" (Herrington and Moran 484). Another critic gives a different take on the matter.

(Measurability and cost appeared to be the core issues when addressing the usage of e-raters. Was cost truly an issue, and if so, how would academic institutions address it in the years to come? And as technology advances and becomes more available, would cost still be an issue?)
In her article "Validation of Automated Scores of TOEFL iBT Tasks Against Non-test Indicators of Writing Ability," Sara Cushing Weigle states that "one approach involves investigating the relationships between automated scores and scores given by human raters" (337), and she later goes on to state in the same article:

Several studies have demonstrated the comparability of scores between human raters and automated scores. The literature in the second category, the criterion-related validity of automated scores, is scant, although some researchers have looked at the relationship between human scores on writing assessments and performance on other measures of writing. Breland, Bridgeman, and Fowles (1999) provide an overview of this research, noting that essay test performance correlated more highly with other writing performance than with grades, GPA, or instructor's ratings. (337)
The argument being made is that a student's writing performance aligns more closely with what the computer has to say than with what his or her instructors have to say, but it is the instructors, not the computers, who are teaching our students. If that is the case, then the student should be more inclined to listen to what the instructor has to say, and not to what the computer instructs them to do. I think it is best to note that cost, and the interaction between student and computer, should not be the issue here; it is maintaining the interaction between writer and reader that is the heart and soul of this issue.
Interaction between reader and writer is crucial, and instructors tend to learn more about their students' writing when they are grading their students' papers. From Page et al. again, here is an example of how a computer will miss a key problem in a piece of writing that an instructor would catch within seconds of reading the paper:

Consider the following: "Queen America sailed to the Santa Maria with 1492 ships. Her husband, King Columbus, looked to Indian explorer, Niña Pinta, to find vast wealth on the beaches of Isabella, but would settle for spices from the continent of Ferdinand." Of course, the answer above is designed to be ridiculous, though some parsers might give it a high score for content because the passage contains many of the keywords associated with Columbus's discovery of America. (7)

Many instructors know their U.S. and world history, and they would know the above piece of writing does not make sense, but a computer may still give it a high score because there are no grammar or spelling errors in it and it has all of the key vocabulary terms. If a human rater were to look at this, they would give it a lower score for not being comprehensible. A computer cannot fully comprehend meaning in an essay; although it sees the content, grammar, spelling, structure, and mechanics, it is not analyzing the essay to see if it makes any sense.
Here are the traditional rubric and scoring standards for an e-rater, and why the e-rater may score for some things and not others. According to Cushing Weigle, "these features are typically the following, though they may vary for different versions of e-rater: Errors in grammar, usage, mechanics and style, organization and development, lexical complexity, and prompt-specific vocabulary usage" (336). This gives insight into what these e-raters can score, but it does not mean they all follow the same structure or rules for scoring. Cushing Weigle states "though they may vary," which means these e-raters may not follow the same protocol, and how one e-rater scores an essay might differ from how another does. If this is the case, then, much like human raters, e-raters can be biased as well. In his article, Hesse claims that one version of the ETS e-rater scoring engine looks at the same material as the e-raters Cushing Weigle refers to; however, Hesse dives into what they are really scoring: "one problem, as Les Perelman famously and dramatically illustrated, is the 'truth' or logic of essays is undervalued or overlooked altogether" (2). This means the e-rater is looking at what is on the surface of the student's writing, but it may not be viewing what is below the surface, the depth and understanding of the writing the student is producing. Here is another example of what the e-rater rewards, and how a human rater would see it differently:

6 (Lower). Dogs are interesting animals. Dogs are friendly to their owners. Dogs show affection by wagging their tails.

7 (Higher). Friendly to their owners, wagging tails to show affection, dogs are interesting animals. (Hesse 3)
Although number 7 is correct in terms of grammar, spelling, and vocabulary usage, it does not appear fully comprehensible at first glance. Number 6 may have a lower level of lexical complexity, but it is still comprehensible: we know who or what is being talked about, and we know the topic, whereas in number 7 we do not necessarily receive that knowledge until the end of the sentence. Hesse explores the issue further in a section he calls "Beating the System" (3).
In the section of his article entitled "Beating the System" (3), Hesse addresses the issue of sophistication and computers. He states, "the same technology that allows people to have 'conversations' with the iPhone's Siri, is improving the analysis of writing. Still, just as Siri is not well-equipped to discuss with you whether Nietzsche or Wittgenstein is the better philosopher, so too do computer scoring systems run into difficulty with complex tasks; they can surely score the elements pretty well, but whether these elements add up to a strong or weak piece of writing is another matter" (Hesse 3). What he is pointing out is that these computers can give a student the basics of what is wrong with their paper, but they will not be able to reason with the student about what is wrong with it. I have an iPhone, and I can ask Siri for the best place to eat based on the overall ratings of a certain restaurant, but the answer would not be Siri's personal opinion (she is, after all, just a female voice on my smartphone), and the same goes for e-raters. Their judgment is based on what is programmed into their system, not on personal opinion. Some people would argue it is wonderful that the computer has no personal opinion and thus is not biased in any way, but that also leads to a kind of stubbornness: a student's paper might be correct in terms of comprehension and overall creativity, yet in the eyes of the computer it will not be correct because of grammar and mechanical issues. This now leads me to my conclusion.

(I had considered this as well with my students, as they had been utilizing Apple products and using Achieve 3000 last year; the students could manipulate the system if they wanted to, and if they chose to do that, the e-rater would be none the wiser. Unless it is a system such as Turnitin.com, students could find a way to master and cheat the system, and the machine scoring system wouldn't know the difference.)
These machines are still being studied, and there are still several focus groups and tests being developed to understand how machine scoring of essays works, and whether it works accurately ("Can Computers Really Grade Essay Tests?"). Machine or computer scoring may sound feasible at first, and these systems may have come a long way from their first inception. To put it another way, e-raters are in their Watson phase, where they can only pay attention to grammar, mechanics, organization, development, and vocabulary usage; they have yet to reach their Data stage (that grammatically correct android from Star Trek: The Next Generation), or their HAL stage (let's hope they never get to the HAL stage), where they could actually comprehend meaning in a student's essay and analyze all of its content. Hesse mentioned the iPhone's Siri in his article (3), and how people can ask her a question and get a basic answer, though she will not be able to chat with them about her personal opinion of anything. There is also the issue of cost and of the accuracy of the e-rater when it comes to scoring essays; as I have discovered, the e-rater can cost as much as a human rater. As to accuracy, the e-rater can be just as accurate as a human rater, but it cannot yet comprehend or analyze the meaning of a student's essay.

(I used Data and HAL as examples because I believe Data represents what many educators and creators of machine scoring would hope to achieve: an e-rater that finds flaws in student papers, yet one we can reason with as to how those errors should be handled and corrected. HAL represents what we don't want to see in machine scoring: a negative approach wherein we reject student papers based on what the computer deems a great paper, neglecting human judgment in what a great paper is or can be.)
Works Cited

"Always Already: Automated Essay Scoring and Grammar-Checkers in College Writing Courses." Machine Scoring of Student Essays: Truth and Consequences. National Writing Project. Logan, UT: Utah State UP, 2006. Web. 5 May 2013.

Balfour, Stephen P. "Assessing Writing in MOOCs: Automated Essay Scoring and Calibrated Peer Review." RPAJournal.com. RPA Journal, Summer 2013. Web. 18 Oct. 2015.

Cushing Weigle, Sara. "Validation of Automated Scores of TOEFL iBT Tasks Against Non-test Indicators of Writing Ability." Language Testing, 27.3 (2010): 335-353.

Herrington, Anne, and Charles Moran. "What Happens When Machines Read Our Students' Writing?" College English, 63.4 (2001): 480-499.

Hesse, Douglas D. "Can Computers Grade Writing? Should They?" U. of Denver, Feb. 2012. 1-7.

Markoff, John. "Essay-Grading Software Offers Professors a Break." The New York Times [New York] 4 Apr. 2013: n. pag. Web. 25 May 2013.

Page, Ellis, Timothy Keith, Mark Shermis, Chantal Koch, and Susanmarie Harrington. "Trait Ratings for Automated Essay Grading." Educational and Psychological Measurement, 62.1 (2002): 5-18.

Rutz, Carol. "Scoring by Machine." College Composition and Communication, 59.1 (2007): 139-144.

Steier, Michael, Yigal Attali, and Will Lewis. "Scoring with the Computer: Alternative Procedures for Improving the Reliability of Holistic Essay Scoring." Language Testing, 30.1 (2013): 125-141.

Strauss, Valerie. "Can Computers Really Grade Essay Tests?" The Washington Post [Washington] 25 Apr. 2013: n. pag. Web. 1 June 2013.

Strauss, Valerie. "Grading Writing: The Art and Science--and Why Computers Can't Do It." The Washington Post [Washington] 2 May 2013: n. pag. Web. 30 May 2013.

Turnitin.com. n.d. Web. 3 June 2013.

Warnock, Scott. Teaching Writing Online: How and Why. Urbana, IL: National Council of Teachers of English, 2009. Print.