English 588 Teaching Freshman Composition - "Machine Scoring of Student Essays: For Better or Worse" essay for English Rhetoric Portfolio written by Andrea Edwards.
Edwards 1
Andrea M. Edwards
Dr. John Edlund
English 588- Teaching Freshman Composition
11 June 2013
Revised: Fall 2015
Machine Scoring of Student Essays: For Better or For Worse?
Abstract
This article addresses machine, or computer, scoring of students' essays, a topic of discussion since 1966 (Cushing Weigle 336). Machine scoring is a method in which computers or machines grade or score student papers using a set rubric or algorithm, and it will remain a topic of discussion for generations to come in the field of education for K-12 and college instructors alike. I analyze the benefits and the setbacks of using machine scoring for student essays from a historical and economic point of view, beginning with the history and the methodology of machine scoring for student essays.
Machine scoring of students' essays has been a topic of discussion since 1966 (Cushing Weigle 336), and it will remain one for generations of K-12 and college instructors alike. We are now living in the "Computer Era," and as such we use computers for the majority of our daily tasks, but there is a downside to using computer technology when it involves student writing and grading. Machine or computer scoring is a method in which computers or machines grade or score student papers using a set rubric or algorithm. Much like Watson from Jeopardy (Hesse 1), a machine that can only process a question but not analyze it, these machines are not perfect: they can score the mechanical and grammatical errors in students' papers, but they will not always check for understanding (Page et al. 7). According to Page et al., "results of these studies have also suggested that computers perform as well or better than human raters" (6), but I argue this cannot be the case, for as Page et al. also state, "the claim that the computer can actually 'understand' the text is not accurate" (7). I acknowledge that machine scoring has been implemented at several universities and has been around for quite some time, but these machines are not yet fully capable of doing what a human rater can do. This is not to say machine scoring should not be used, but if it is to be used, it should be used with the knowledge that it can perform certain grading tasks and not others. This is seen in the different types of grading engines used to perform these tasks (Page et al. 6-7) and in the various limits on what the computers can do (Page et al. 7). Page et al. use the example of "Queen America sailed to Santa Maria with 1492 ships" (7): although the sentence makes sense grammatically, it does not make sense in terms of comprehension.
This system of grading is also fueled by economics and politics: "noting that automated scoring has already been authorized, funded, and supported by government officials and business leaders, Ericsson and Haswell offer their collection as means of educating teachers, administrators, and students as members of a community responsible for action on behalf of higher education to count the inroads made by others" (Rutz 139). But is machine scoring economical and feasible in the long run? What are the reasons behind using this kind of mechanical system to grade student papers? How beneficial is machine or computer scoring to the students and to the instructors? I plan to explore these questions, analyzing the benefits and the setbacks of using machine or computer scoring for student essays from a historical and economic point of view, and addressing the ideas and different functions of these systems. I begin with the history and the different types of machine scoring systems for student essays.

(I can currently state that these computer scoring systems are not substitutes for human raters; in fact, human scoring is still necessary to ensure a student's level of analytical and critical understanding of what they are reading and writing.)
Machine scoring of essays has been around since the 1960s, roughly 1966 (Cushing Weigle 336). According to Sara Cushing Weigle in her article "Validation of Automated Scores of TOEFL iBT Tasks Against Non-test Indicators of Writing Ability," machine scoring "has only recently been used on a large scale. For some years e-rater was used operationally, along with human raters, on the Graduate Management Admission Test (GMAT), and e-rater is also used in ETS' web-based essay evaluation service, known as Criterion" (336). This e-rater "is trained on a large set of essays to extract a small set of features that are predictive of scores given by human raters" (Cushing Weigle 336), and the current version has "a standard set of features across prompts, allowing both general and prompt-specific modeling for scoring" (Cushing Weigle 336). It scores for specific features: errors in grammar, usage, mechanics, and style; organization; development; lexical complexity; and prompt-specific vocabulary usage (Cushing Weigle 336). Automated essay scoring, or AES (Steier et al. 126), was used to study how an e-rater would score essays compared to a human rater, and the study found that the e-rater and the human rater scored essays in nearly the same manner (Steier et al. 136-39). Many groups have developed machine scoring systems, and the first group at the heart of this topic and controversy is EdX.

(The usage of these grading systems is highly fueled by economics and politics: from a financial viewpoint, educators would not have to be paid to grade papers. However, students are not benefitting from machine scoring due to flaws and glitches in these various means of machine scoring. I completely stand by the arguments made here due to the personal experiences I have had with Achieve 3000 and SBAC at SFiAM.)
EdX is a nonprofit enterprise founded by Harvard and the Massachusetts Institute of Technology to offer courses online. EdX has created an automated essay scoring system and has made its software available free online to any institution that wishes to use it (Markoff and Strauss). The company's president, Anant Agarwal, an electrical engineer, "predicted that the scoring software would be a useful pedagogical tool, enabling students to take tests and write essays over and over and improve the quality of their answers" (Markoff). Valerie Strauss also brings up EdX in her article "Can Computers Really Grade Essay Tests?", describing the software as new and saying it "uses artificial intelligence to grade student essays and short written answers." While I can see that this software is designed to grade students' papers and answers, this does not answer the question of whether the scoring is accurate, and it does not indicate how accurate the software's scoring is. Stephen P. Balfour states in his article "Assessing Writing in MOOCs: Automated Essay Scoring and Calibrated Peer Review" that "Although EdX's application is not yet available for testing, three longstanding commercial Automated Essay Scoring (AES) applications have been tested and are established in the academic literature (Shermis, Burstein, Higgins, & Zechner, 2010)" (40). This appears problematic: although three AES applications have undergone testing, one has to wonder why EdX has yet to test its own system. Could it be that there are still glitches in the system, and if so, when will those errors be addressed?

(EdX reminds me of Achieve 3000 and SBAC in the sense that the systems are relatively similar; however, I have discovered there is no correlation between the companies. Achieve 3000 and SBAC are computer grading systems created for K-12 and designed to have the same capabilities as EdX and other AES applications, though there is some question as to how similar or dissimilar these systems might be.)

Now, EdX is only one example of many systems available for machine scoring, and I will list the names of these systems.
Other systems available are:
E-rater, developed by Educational Testing Service (ETS), is used
as one reader for evaluating the essay portion of the Graduate
Management Admissions Test—a human is still the other reader.
Intellimetric, developed by Vantage Technologies, is used for evaluating
writing in a range of applications, K through college. WritePlacer Plus,
developed by Vantage for the College Board, is being marketed as a cheap
and reliable placement instrument. The Intelligent Essay Assessor,
developed by Landauer, Laham, and Foltz at the University of Colorado,
is being marketed through their company, Knowledge Analysis
Technologies, to evaluate essay exams for college courses across
disciplines (Herrington and Moran 480).
These systems have similar yet distinct designs and standards for scoring and grading writing that may or may not agree with what a human rater would say. There are also websites such as Turnitin.com that allow instructors to correct student essays and check for plagiarism (turnitin.com), but this system has its own share of flaws when correcting papers: it flags ideas and quotes that should not be deemed plagiarism, yet often are, because of the software embedded in the system. This technology was created by people who are not themselves using it for the purposes it was designed for, and it is up to professors and instructors to make sure these e-raters are performing at the same level as a human rater. In the studies discussed above, there were only slight variations in how the essays were scored. There are also various names for machine scoring, and more insight to be had into how e-raters work.
Machine or computer scoring systems have different tasks and usually have different names associating them with their different strategies and methods for grading or scoring student essays. There are various nicknames for these systems, such as "robo-reader" (Hesse 1), "e-rater" (Cushing Weigle 336), "essay grading software" and "artificial intelligence technology" (Markoff), and Automated Essay Scoring, or AES. Regardless of the name, the task is the same: to score student essays in the most efficient way possible, and there are several methodologies e-raters employ.

Here are some of the methodologies and criteria these systems use for scoring student essays: articulates a clear and insightful position on an issue; develops the position fully with compelling reasons and persuasive examples; sustains a well-focused, well-organized analysis, connecting ideas logically; and demonstrates facility with the conventions of standard written English (Steier et al. 129). With criteria such as these, instructors would wish to believe computers are scoring student papers as they should be, and in "Scoring with the Computer," Michael Steier et al. mention that "essay ratings typically use a small number of categories that correspond to the descriptor levels in the scoring rubrics. Many large-scale assessment programs use a six-point scale" (127), which indicates the e-rater should be using the same scale and system that a human rater uses. But what is it really scoring?
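To make concrete how an e-rater might reduce an essay to scoreable features, here is a minimal sketch in Python. Everything in it is invented for illustration: the features are crude surface measures, and the weights stand in for coefficients a real system would fit against a corpus of human-scored essays; an actual engine like e-rater uses a far richer feature set.

```python
import re

def extract_features(essay: str) -> dict:
    """Reduce an essay to a few surface features (illustrative only)."""
    words = re.findall(r"[A-Za-z']+", essay)
    sentences = [s for s in re.split(r"[.!?]+", essay) if s.strip()]
    return {
        "word_count": len(words),
        "avg_sentence_length": len(words) / max(len(sentences), 1),
        "avg_word_length": sum(len(w) for w in words) / max(len(words), 1),
        "vocab_diversity": len({w.lower() for w in words}) / max(len(words), 1),
    }

# Hypothetical weights, standing in for coefficients fit to human scores.
WEIGHTS = {"word_count": 0.005, "avg_sentence_length": 0.08,
           "avg_word_length": 0.3, "vocab_diversity": 2.0}

def score(essay: str) -> float:
    """Weighted sum of surface features, clamped to a six-point scale."""
    feats = extract_features(essay)
    raw = sum(WEIGHTS[name] * value for name, value in feats.items())
    return max(1.0, min(6.0, raw))

# The scorer never looks at meaning: a factually absurd sentence with clean
# grammar and varied vocabulary earns a respectable score.
print(round(score("Queen America sailed to Santa Maria with 1492 ships."), 2))
```

The point of the sketch is exactly the essay's argument: nothing in such a pipeline checks whether the text is true or even sensible, only how its surface measures out.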
In Valerie Strauss's article "Grading Writing: The Art and Science--and Why Computers Can't Do It," she gives examples of how computers might miss something, or score a piece of writing as correct when in reality a human rater would mark it as wrong. She starts with the question "Which writing is better?" and lists it three times with two choices each. Her first pair of answers is A) "See Dick run! Run, Dick, run!" and B) "Dick's running merits attention and encouragement." The correct answer here is B because it is the more sophisticated answer, written at a high school level; the only case in which A would be correct is for a student at the elementary school level, and that is not the case here because the student has surpassed elementary school ("Grading Writing"). She gives other examples as well, but they are similar to the one I have provided. Grammatically, either answer would be correct, but in terms of lexical complexity and vocabulary usage, A would not be strong enough for a student in high school or college. (In addition: this had also proven to be an issue with systems such as Achieve 3000, in which students would type in answers to questions from articles they had read; if a student was showing improvement, or lack of improvement, in their reading, writing, and comprehension levels, the computer was not necessarily taking that into consideration. This is a case where educators have to come in and double-check where the student actually stands academically. Educators cannot fully rely on the system to do all of the work for them.) Steier et al. state that:
In the context of large-scale, high-stakes writing assessments in particular, a primary goal is to ensure that raters think similarly enough about what constitutes a high- or low-quality student response to achieve reasonable consistency of scores across ratings. [...] Moreover, rater training has not been able to eliminate completely these differences. (126)

Due to problems such as this, e-raters still have a long way to go until they can score essays by the same criteria and in the same manner that a human rater would.
Human raters are often said to be biased and slightly unreliable when it comes to grading student papers. In their article "Trait Ratings for Automated Essay Grading," Page et al. claim that "the results of these studies have also suggested that the computer performs as well or better than human raters" (6) and later go on to say, "when good reliability among human raters is obtained, it is sometimes for different reasons. The best conclusion that can be reached is it is hard to get raters to articulate why an essay is good (or bad) but that they can recognize good writing when they see it" (6). This would lead us to believe that human raters cannot score as well as computers, but in their article "Scoring with the Computer: Alternative Procedures for Improving the Reliability of Holistic Essay Scoring," Steier et al. state that:

There is indeed empirical support for the close resemblance between human and automated scores. Based on a sample of 2000 sixth- to twelfth-grade students, each of whom wrote two essays, Attali and Burstein (2006) estimated the true-score correlations between e-rater and human essay scores to be .97. In other words, the alternate-form correlations between human and machine scores were almost the same as the alternate-form reliability of the human scores. (126)
So the e-raters are not so different from human raters, and this demonstrates that human raters are still reliable when it comes to scoring essays. If the problem is not with human raters, then the problem is the cost and economics of machine scoring.
There is the cost of the computer hardware, the software, and the maintenance for these computers, and if these computers can only score papers grammatically, then a human scorer is still needed to make sure that a paper is comprehensible. In his article, Douglas D. Hesse states that "there are two primary reasons given for having computers score writing. One is economic, a savings that accrues in mass testing situations" (5). In a massive testing situation such as the SAT or the ACT, using a computer scoring system for writing would make sense because, as Hesse goes on to state, "if you have hundreds of thousands of essays to grade for a national testing situation (as for example with the Common Core standards), hiring and training enough human raters to complete the task in a reasonable time is expensive" (5). Valerie Strauss appeared to agree with what Hesse had to say, and she articulated this in her own article.
In her article "Can Computers Really Grade Essay Tests?", Valerie Strauss states that "in 2010, the federal government awarded $330 million to two consortia of states 'to provide ongoing feedback to teachers during the course of the school year, measure annual school growth, and move beyond narrowly focused bubble tests'" (United States Department of Education). She also goes on to say, "by combining the already existing National Assessment of Educational Progress (NAEP) assessment structures for evaluating school system performance with ongoing portfolio assessment of student learning by educators, we can cost-effectively assess writing without relying on flawed machine-scoring methods" (Strauss). Throughout my ongoing research, there has been no real indicator of how much money will be spent, or how much will be saved, by using machine or computer scoring for student essays. There have been attempts to save money through machine scoring by paying human raters less for their time and services.
However, human raters do not necessarily cost any more than e-raters. In their article "What Happens When Machines Read Our Students' Writing?", Herrington and Moran state that one system, WPP, "is marketed as costing $5.00 per scored essay test" (487). They also note that "this is more costly than our present human reader placement program at U Mass Amherst ($3.70 per essay) and more expensive than the placement programs now established at some other Massachusetts public colleges" (487). Herrington and Moran go on to say:

We would respond that the alleged cost for Accuplacer and WPP do not themselves include in-network management, user training, and depreciation. Interestingly, the College Board also offers "trained reader scoring"—at $10.00 per essay test!—for those schools still clinging to old ways. (487)
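The per-essay figures Herrington and Moran quote make the comparison easy to run for any cohort size. A quick sketch (the 10,000-essay cohort is hypothetical; the per-essay rates are the ones quoted above):

```python
# Per-essay rates quoted by Herrington and Moran (487).
rates = {
    "WPP machine scoring": 5.00,
    "UMass Amherst human readers": 3.70,
    "College Board trained readers": 10.00,
}

essays = 10_000  # hypothetical placement cohort

for program, per_essay in rates.items():
    # Total cost of scoring the cohort at this program's quoted rate.
    print(f"{program}: ${per_essay * essays:,.2f}")
```

At these quoted rates the human reader program is actually the cheapest of the three, which is the authors' point: machine scoring is not automatically the economical option.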
What this all means is that oftentimes the e-raters can cost as much as human raters, and the cost does not seem to vary much. What a university or college pays for an e-rater could be the same as what it pays a human rater, depending on the program and the level of training raters are given in order to score those essays. In other cases, it makes sense to use both an e-rater and a human rater to score the essays, "while recognizing the limitations of most scoring approaches… teachers can nonetheless push strongly for human scorers so that computers, with their mechanical counting of syllables and words, do not become substitutes for the human interchange between writer and reader that lies at the heart of communication" (Herrington and Moran 484). Another critic gives a different take on the matter.

(Measurability and cost appeared to be the core issues when addressing the usage of e-raters. Was cost truly an issue, and if so, how would academic institutions address it in the years to come? And as technology advances and becomes more available, would cost still be an issue?)
In her article "Validation of Automated Scores of TOEFL iBT Tasks Against Non-test Indicators of Writing Ability," Sara Cushing Weigle states that "one approach involves investigating the relationships between automated scores and scores given by human raters" (337), and she later goes on to state in the same article:

Several studies have demonstrated the comparability of scores between human raters and automated scores. The literature in the second category, the criterion-related validity of automated scores, is scant, although some researchers have looked at the relationship between human scores on writing assessments and performance on other measures of writing. Breland, Bridgeman, and Fowles (1999) provide an overview of this research, noting that essay test performance correlated more highly with other writing performance than with grades, GPA, or instructor's ratings. (337)
The argument being made is that a student's writing performance aligns more closely with what the computer has to say than with what his or her instructors have to say, but it is the instructors, not the computers, who are teaching our students. If that is the case, then the student should be more inclined to listen to what the instructor has to say, and not to what the computer instructs them to do. I think it is best to note that cost, and the interaction between student and computer, should not be the issue here; it is maintaining the interaction between writer and reader that is the heart and soul of this issue.
Interaction between reader and writer is crucial, and instructors tend to learn more about their students' writing when they are grading their students' papers. From Page et al. again, here is an example of how a computer will miss a key problem in a piece of writing that an instructor would catch within seconds of reading the paper:

Consider the following: "Queen America sailed to the Santa Maria with 1492 ships. Her husband, King Columbus, looked to Indian explorer, Niña Pinta, to find vast wealth on the beaches of Isabella, but would settle for spices from the continent of Ferdinand." Of course, the answer above is designed to be ridiculous, though some parsers might give it a high score for content because the passage contains many of the keywords associated with Columbus's discovery of America. (7)

Many instructors know their U.S. and world history, and they would know the above piece of writing does not make sense, but a computer may still give it a high score because there are no grammar or spelling errors in it and it has all of the key vocabulary terms. If a human rater were to look at this, they would give it a lower score for not being comprehensible. A computer cannot fully comprehend meaning in an essay; although it sees the content, grammar, spelling, structure, and mechanics, it is not analyzing the essay to see if it makes any sense.
Here are the traditional rubric and scoring standards for an e-rater, and why the e-rater may score for some things and not others. According to Cushing Weigle, "these features are typically the following, though they may vary for different versions of e-rater: Errors in grammar, usage, mechanics and style, organization and development, lexical complexity, and prompt-specific vocabulary usage" (336). This gives insight into what these e-raters can score, but it does not mean they all follow the same structure or rules for scoring. Cushing Weigle states "though they may vary," which means these e-raters may not follow the same protocol, and how one e-rater scores an essay might differ from how another does. If this is the case, then, much like human raters, e-raters can be biased as well. In his article, Hesse claims that one version of the ETS e-rater scoring engine looks at the same material as the e-raters Cushing Weigle refers to; however, Hesse dives into what they are really scoring: "one problem, as Les Perelman famously and dramatically illustrated, is the 'truth' or logic of essays is undervalued or overlooked altogether" (2). This means the e-rater is looking at what is on the surface of the student's writing, but it may not be viewing what is below the surface, the depth and understanding of the writing the student is producing. Here is another example of what the e-rater rewards, and how a human rater would see it differently:

6 (Lower). Dogs are interesting animals. Dogs are friendly to their owners. Dogs show affection by wagging their tails.

7 (Higher). Friendly to their owners, wagging tails to show affection, dogs are interesting animals. (Hesse 3)
Although number 7 is correct in terms of grammar, spelling, and vocabulary usage, it does not appear fully comprehensible at first glance. Number 6 may have a lower level of lexical complexity, but it is still comprehensible: we know who or what is being talked about, and we know the topic, whereas in number 7 we do not necessarily receive that knowledge until the end of the sentence. Hesse explores the issue further in a section he calls "Beating the System" (3).
In the section of his article entitled "Beating the System" (3), Hesse addresses the issue of sophistication and computers. He states, "the same technology that allows people to have 'conversations' with the iPhone's Siri, is improving the analysis of writing. Still, just as Siri is not well-equipped to discuss with you whether Nietzsche or Wittgenstein is the better philosopher, so too do computer scoring systems run into difficulty with complex tasks; they can surely score the elements pretty well, but whether these elements add up to a strong or weak piece of writing is another matter" (Hesse 3). What he is pointing out is that these computers can give a student the basics of what is wrong with their paper, but they will not be able to reason with the student about what is wrong with it. I have an iPhone, and I can ask Siri for the best place to eat based on the overall ratings of a certain restaurant, but the answer would not be Siri's personal opinion (she is, after all, just a female voice on my smartphone), and the same goes for e-raters. Their judgment is based on what is programmed into their system, not on personal opinion. Some people would argue it is wonderful that the computer has no personal opinion and thus is not biased in any way, but that also leads to a kind of stubbornness: a student's paper might be correct in terms of comprehension and overall creativity, yet in the eyes of the computer it will not be correct because of grammar and mechanical issues. This now leads me to my conclusion.

(I had considered this as well with my students, as they had been utilizing Apple products and using Achieve 3000 last year; the students could manipulate the system if they wanted to, and if they chose to do that, the e-rater would be none the wiser. Unless it is a system such as Turnitin.com, students could find a way to master and cheat the system, and the machine scoring system wouldn't know the difference.)
These machines are still being studied, and there are still several focus groups and tests being developed to understand how machine scoring of essays works, and whether it works accurately ("Can Computers Really Grade Essay Tests?"). Machine or computer scoring may sound feasible at first, and these systems may have come a long way from their first inception. To put it another way, e-raters are in their Watson phase, where they can only pay attention to grammar, mechanics, organization, development, and vocabulary usage; they have yet to reach their Data stage (that grammatically correct android from Star Trek: The Next Generation), or their HAL stage (let's hope they never get to the HAL stage), where they could actually comprehend meaning in a student's essay and analyze all of its content. Hesse mentioned the iPhone's Siri in his article (3), and how people can ask her a question and get a basic answer, though she will not be able to chat with them about her personal opinion of anything. There is also the issue of cost and of the accuracy of the e-rater when it comes to scoring essays; as I have discovered, the e-rater can cost as much as a human rater. As to accuracy, the e-rater can be just as accurate as a human rater, but it cannot yet comprehend or analyze the meaning of a student's essay.

(I used Data and HAL as examples because I believe Data represents what many educators and creators of machine scoring would hope to achieve: an e-rater that finds flaws in student papers, yet one we can reason with as to how those errors should be handled and corrected. HAL represents what we don't want to see in machine scoring: a negative approach wherein we reject student papers based on what the computer deems a great paper, neglecting human judgment in what a great paper is or can be.)
Works Cited

"Always Already: Automated Essay Scoring and Grammar-Checkers in College Writing Courses." Machine Scoring of Student Essays: Truth and Consequences. National Writing Project. Logan, UT: Utah State UP, 2006. Web. 5 May 2013.

Balfour, Stephen P. "Assessing Writing in MOOCs: Automated Essay Scoring and Calibrated Peer Review." RPAJournal.com. RPA Journal, Summer 2013. Web. 18 Oct. 2015.

Cushing Weigle, Sara. "Validation of Automated Scores of TOEFL iBT Tasks Against Non-test Indicators of Writing Ability." Language Testing, 27.3 (2010): 335-353.

Herrington, Anne, and Charles Moran. "What Happens When Machines Read Our Students' Writing?" College English, 63.4 (2001): 480-499.

Hesse, Douglas D. "Can Computers Grade Writing? Should They?" U. of Denver, Feb. 2012. 1-7.

Markoff, John. "Essay-Grading Software Offers Professors a Break." The New York Times [New York] 4 Apr. 2013: n. pag. Web. 25 May 2013.

Page, Ellis, Timothy Keith, Mark Shermis, Chantal Koch, and Susanmarie Harrington. "Trait Ratings for Automated Essay Grading." Educational and Psychological Measurement, 62.1 (2002): 5-18.

Rutz, Carol. "Scoring by Machine." College Composition and Communication, 59.1 (2007): 139-144.

Steier, Michael, Yigal Attali, and Will Lewis. "Scoring with the Computer: Alternative Procedures for Improving the Reliability of Holistic Essay Scoring." Language Testing, 30.1 (2013): 125-141.

Strauss, Valerie. "Can Computers Really Grade Essay Tests?" The Washington Post [Washington] 25 Apr. 2013: n. pag. Web. 1 June 2013.

Strauss, Valerie. "Grading Writing: The Art and Science--and Why Computers Can't Do It." The Washington Post [Washington] 2 May 2013: n. pag. Web. 30 May 2013.

Turnitin.com. n.d. Web. 3 June 2013.

Warnock, Scott. Teaching Writing Online: How and Why. Urbana, IL: National Council of Teachers of English, 2009. Print.