
Assessing Writing 19 (2014) 51–65


The effects of computer-generated feedback on the quality of writing

Marie Stevenson*, Aek Phakiti
University of Sydney, Australia
* Corresponding author. E-mail addresses: [email protected] (M. Stevenson), [email protected] (A. Phakiti).

Article history: Available online 17 December 2013

Keywords: Automated writing evaluation (AWE); Computer-generated feedback; Effects on writing quality; Critical review

Abstract

This study provides a critical review of research into the effects of computer-generated feedback, known as automated writing evaluation (AWE), on the quality of students' writing. An initial research survey revealed that only a relatively small number of studies have been carried out and that most of these studies have examined the effects of AWE feedback on measures of written production such as scores and error frequencies. The critical review of the findings for written production measures suggested that there is modest evidence that AWE feedback has a positive effect on the quality of the texts that students produce using AWE, and that as yet there is little evidence that the effects of AWE transfer to more general improvements in writing proficiency. Paucity of research, the mixed nature of research findings, heterogeneity of participants, contexts and designs, and methodological issues in some of the existing research were identified as factors that limit our ability to draw firm conclusions concerning the effectiveness of AWE feedback. The study provides recommendations for further AWE research, and in particular calls for more research that places emphasis on how AWE can be integrated effectively in the classroom to support writing instruction.

© 2013 Elsevier Ltd. All rights reserved. http://dx.doi.org/10.1016/j.asw.2013.11.007

1. Introduction

This study provides a critical review of literature on the pedagogical effectiveness of computer-based educational technology for providing students with feedback on their writing that is commonly known as Automated Writing Evaluation (AWE).1 AWE software provides computer-generated feedback on the quality of written texts. A central component of AWE software is a scoring engine that generates automated scores based on techniques such as artificial intelligence, natural language processing and latent semantic analysis (see Dikli, 2006; Philips, 2007; Shermis & Burstein, 2003; Yang, Buckendahl, Juszkiewicz, & Bhola, 2002). AWE software that is used for pedagogical purposes also provides written feedback in the form of general comments, specific comments and/or corrections.

Originally, AWE was primarily used in high-stakes testing situations to generate summative scores to be used for assessment purposes. Widely used, commercially available scoring engines are Project Essay Grader™ (PEG), e-rater®, Intelligent Essay Assessor™ (IEA), and IntelliMetric™. In recent years, the use of AWE for the provision of formative feedback in the writing classroom has steadily increased, particularly in classrooms in the United States. AWE programs are currently being used in many elementary, high school, college and university classrooms with a range of writers from diverse backgrounds. Examples of commercially available AWE programs designed for classroom use are: Criterion (Educational Testing Service); MY Access! (Vantage Learning); Write to Learn and Summary Street (Pearson Knowledge Technologies); and Writing Roadmap (McGraw Hill). These programs sometimes incorporate the same scoring engine as used in summative programs. For example, Criterion incorporates the e-rater scoring engine and MY Access! incorporates the IntelliMetric scoring engine.

Common to all AWE programs designed for classroom use is that they provide writers with multiple drafting opportunities, and upon receiving feedback writers can choose whether or not to use this feedback to revise their texts. AWE programs vary in the kinds of feedback they provide writers. Some provide feedback on both global writing skills and language use (e.g., Criterion, MY Access!), whereas others focus on language use (e.g., QBL) and some claim to focus primarily on content knowledge (e.g., Write to Learn and Summary Street). Some programs incorporate other tools such as model essays, scoring rubrics, graphic organizers, and dictionaries and thesauri.

Like many other forms of educational technology, the use of AWE in the classroom has been the subject of controversy, with scholars taking divergent stances. On the one hand, AWE has been hailed as a means of liberating instructors, freeing them up to devote valuable time to aspects of writing instruction other than marking assignments (e.g., Burstein, Chodorow, & Leacock, 2004; Herrington & Moran, 2001; Hyland & Hyland, 2006; Philips, 2007). It has been seen as impacting positively on the quality of students' writing, due to the immediacy of its 'on-line' feedback (Dikli, 2006), and the multiple practice and revision opportunities it provides (Warschauer & Ware, 2006). It has also been claimed to have positive effects on student autonomy (Chen & Cheng, 2008).

On the other hand, the notion that computers are capable of providing effective writing feedback has aroused considerable suspicion, perhaps fueled by the fearful specter of a world in which humans are replaced by machines. Criticisms have been made concerning the capacity of AWE to provide accurate and meaningful scores (e.g., Anson, 2006; Freitag Ericsson, 2006). There is a common perception that computers are not capable of scoring human texts, as they do not possess human inferencing skills and background knowledge (Anson, 2006). Other criticisms relate to the effects that AWE has on students' writing. AWE has been accused of reflecting and promoting a primarily formalist approach to writing, in which writing is viewed as simply being "mastery of a set of subskills" (Hyland & Hyland, 2006, p. 95). Comments generated by AWE have been said to place too much emphasis on surface features of writing, such as grammatical correctness (Hyland & Hyland, 2006), and the effects of writing for a non-human audience have been decried. There is also fear that using AWE feedback may be more of an exercise in developing test-taking strategies than in developing writing skills, with students writing to the test by consciously or unconsciously adjusting their writing to meet the criteria of the software (Patterson, 2005).

Positive and negative claims regarding the effects of AWE on students' writing are not always based on empirical evidence, and at times appear to reflect authors' own 'techno-positivistic' or 'technophobic' stances toward technology in the writing classroom. Moreover, quite a lot of the research that has been carried out is by authors who have been involved in developing a particular AWE program or who are affiliated with organizations that have developed these programs, and so could contain a bias toward showing AWE in a positive light. Consequently, there is a lack of clarity concerning the current state of evidence for the effects on the quality of students' writing of AWE programs designed for teaching and learning purposes.

1 Other terms found in the literature are automated essay evaluation (AEE) (see Shermis & Burstein, 2013) and writing evaluation technology.

However, it is important to be aware that over the past decades there has also been controversy about the effects of teacher feedback on writing. Perhaps the strongest opponent of classroom writing feedback was Truscott (1996), who claimed that feedback on grammar should be abandoned, as it ignored deeper learning processes, only led to pseudo-learning and had a negative effect on the quality of students' writing. While most scholars have taken less extreme positions, in a review of issues relating to feedback in the classroom, Hyland and Hyland (2006) concluded that there was surprisingly little consensus about the kinds of feedback that are effective and in particular about the long-term effects of feedback on writing development. However, some evidence from research syntheses exists for the effectiveness of teacher feedback. In a recent meta-analytic study, Biber, Nekrasova, and Horn (2011) found that, when compared to no feedback, teacher feedback was associated with gains in writing development for both first and second language writers. They found that a focus on content and language use was more effective than a focus on form only, especially for second language writers. They also found that comments were more effective than error correction, even for improving grammatical accuracy. It is therefore timely to evaluate whether there is evidence that computer-generated feedback is also associated with improvements in writing.

To date, the thrust of AWE research has been on validation through the examination of the psychometric properties of AWE scores by, for example, calculating the degree of correlation between computer-generated scores and scores given by human raters. Studies have frequently found high correlations between AWE scores and human scores, and these results have been taken as providing evidence that AWE scores provide a psychometrically valid measure of students' writing (see the two volumes edited by Shermis and Burstein (2003, 2013) for detailed results and in-depth discussion of the reliability and validity of specific AWE systems). Such studies, however, do not inform us about whether AWE is effective as a classroom tool to actually improve students' writing. As Warschauer and Ware (2006) pointed out, while evidence of psychometric reliability and validity is a necessary prerequisite, it is not sufficient for understanding whether AWE 'works' in the sense of contributing to positive outcomes for student learning. Even the recently published 'Handbook of Automated Essay Evaluation' (Shermis & Burstein, 2013), although it pays some attention to AWE as a teaching and learning tool, still has a strong psychometric and assessment focus.
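To make the kind of machine-human agreement statistic referred to above concrete, the sketch below computes two common indices for a small set of scores. It is purely illustrative: the scores are invented and the choice of indices (Pearson correlation and exact agreement) is ours, not drawn from any particular study reviewed here.

```python
# Illustrative sketch only: hypothetical data, not taken from any study in this review.
import numpy as np

# Hypothetical holistic scores (1-6 scale) given to ten essays.
human_scores = np.array([4, 3, 5, 2, 4, 6, 3, 4, 5, 2])
awe_scores = np.array([4, 3, 4, 2, 5, 6, 3, 4, 5, 3])

# Pearson correlation between machine and human scores, one commonly
# reported index of machine-human agreement.
r = np.corrcoef(human_scores, awe_scores)[0, 1]

# Exact agreement: the proportion of essays receiving identical scores.
exact_agreement = np.mean(human_scores == awe_scores)

print(f"Pearson r = {r:.2f}, exact agreement = {exact_agreement:.0%}")
```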

Although a number of individual studies have examined the effects of AWE feedback in the classroom, no comprehensive review of the literature exists that examines whether AWE feedback improves the quality of students' writing. Warschauer and Ware (2006) provided a thought-provoking discussion of some existing research on AWE in the classroom and used this to make recommendations for future AWE research. However, they only provided a limited review that did not include all of the then available research and did not provide an overview of the evidence for the effects of AWE on students' writing. Moreover, since their paper was written, a number of studies have been published in this area.

2. The current study

The current study provides an evaluation of the available evidence for the effects of AWE feedback in the writing classroom in terms of written production. The study focuses on research involving AWE systems specifically designed as tools for providing formative evaluation in the writing classroom, rather than AWE systems designed to provide summative assessment in testing situations. The purpose of formative evaluation is to provide writers with individual feedback that can form the basis for further learning (Philips, 2007). In formative evaluation, there is a need to inform students not only about their level of achievement, but also about their specific strengths and weaknesses. Formative evaluation can be said to involve assessment for learning, rather than assessment of learning (Taylor, 2005). In this study, feedback is viewed as encompassing both numeric feedback (i.e., scores and ratings) and written feedback (i.e., global or specific comments on the quality of the text and/or identification of specific problems in the actual text).

The study focuses on the effects of AWE on written production, because the capability to improve the quality of students' texts is central to claims made about the effectiveness of AWE feedback, and because, likely as a consequence of this, the bulk of AWE pedagogical research focuses on written production outcomes. The study includes AWE research on students from diverse backgrounds, in diverse teaching contexts, and receiving diverse kinds of feedback from diverse AWE programs. The scope of the research included is broad due to the relatively small number of existing studies and the heterogeneity of these studies. The study does not aim to make comparisons or draw conclusions about the relative effects of AWE feedback on student writing for specific populations, contexts, feedback types or programs. Instead, it aims to critically evaluate the effects of AWE feedback on written production by identifying general patterns and trends, and identifying issues and factors that may impact on these effects.

The study is divided into two stages: a research survey and a critical review. The objective of the research survey is to determine the maturity of the research domain, and to provide a characterization of the existing research that can be drawn on in the critical review. The objective of the critical review, which is the central stage, is to identify overall patterns in the research findings and to evaluate and interpret these findings, taking account of relevant issues and factors.

3. Method

3.1. The literature search

A comprehensive and systematic literature search was conducted to identify relevant primary sources for inclusion in the research survey and critical review. Both published research (i.e., journal articles, book chapters and reports) and unpublished research (i.e., theses and conference papers) were identified.

The following means of identifying research were used:

a) Search engines: Google Scholar, Google.
b) Databases: ERIC, MLA, PsychInfo, SSCI, Ovid, PubPsych, Linguistics and Language Behavior Abstracts (LLBA), Dissertation Abstracts International, Academic Search Elite, Expanded Academic, ProQuest Dissertation and Theses Full-text, and Australian Education Index.
c) Search terms used: automated writing evaluation, automated writing feedback, computer-generated feedback, computer feedback, automated essay scoring, automated evaluation, electronic feedback, and program names (e.g., Criterion, Summary Street, Intelligent Essay Assessor, Write to Learn, MY Access!).
d) Websites: ETS website (ets.org) (ETS Research Reports, TOEFL iBT Insight series, TOEFL iBT research series, TOEFL Research Reports); AWE software websites.
e) Journals from 1990 to 2011: CAELL Journal; CALICO Journal; College English; English Journal; Computer Assisted Language Learning; Computers and Composition; Educational Technology Research and Development; English for Specific Purposes; IEEE Intelligent Systems; Journal of Basic Writing; Journal of Computer-Based Instruction; Journal of Educational Computing Research; Journal of Research on Technology in Education; Journal of Second Language Writing; Journal of Technology, Learning and Assessment; Language Learning and Technology; Language Learning; Language Teaching Research; ReCALL; System; TESL-EJ.
f) Reference lists of already identified publications, in particular the Ericsson and Haswell (2006) bibliography.

To be included, a primary source had to focus on empirical research on the use of AWE feedback generated by one or more commercially or non-commercially available programs for the formative evaluation of texts in the writing classroom. The program reported on needed to provide text-specific feedback. Studies were excluded that reported on programs that provided generic writing guidelines (e.g., The Writing Partner: Zellermayer, Salomon, Globerson, & Givon, 1991; Essay Assist: Chandrasegaran, Ellis, & Poedjosoedarmo, 2005). Studies that reported results already reported elsewhere were also excluded. Where the same results were reported more than once, published studies were chosen over unpublished ones, or if both were published, the first publication was chosen. This led to the exclusion of Grimes (2005) and Kintsch et al. (2000).

Based on the above criteria, 33 primary sources were identified for inclusion in the research survey (see Appendix A).

3.2. Coding of research survey

A coding scheme of study descriptors was developed for the research survey. The unit of coding was the study. A study was defined as consisting of "a set of data collected under a single research plan from a designated sample of respondents" (Lipsey & Wilson, 2001, p. 76). As one of the publications, Elliot and Mikulas (2004), included four studies with different samples, this led to a total of 36 studies being identified.

In order to obtain an overview of the scope of the research domain, the studies were first classified in terms of constructs of effectiveness: Product, Process and Perceptions. Lai (2010) defined effectiveness of AWE feedback in terms of three dimensions: (1) the effects on written production (e.g., quality scores, error frequencies and rates, lexical measures and text length); (2) the effects on writing processes (e.g., rates and types of revisions, editing time, time on task, and rates of text production); and (3) perceived usefulness. In our study, combinations of these constructs were possible, as some studies included more than one construct.

Subsequently, as the focus of the study is writing outcomes, only studies that included Product measurements were coded in terms of Substantive descriptors and Methodological descriptors (see Lipsey & Wilson, 2001). Substantive descriptors relate to substantive aspects of the study, such as the characteristics of the intervention and the research context. Methodological descriptors relate to the methods and procedures used in the study. Table 1 lists the coding categories for both kinds of descriptors and the coding options within each category. In the methodological descriptors, 'Control group' refers to whether the study included a control condition and whether this involved comparing AWE feedback with a no feedback condition or with a teacher feedback condition. 'Text' refers to whether outcomes were measured using texts for which AWE feedback had been received or other texts, such as writing assessment tasks. 'Outcome measure' refers to the measure(s) of written production that were included in the study.

Table 1
Research survey coding scheme.

Substantive descriptors
  Publication type: ISI-listed journal; non-ISI-listed journal; book chapter; thesis; report; unpublished paper
  AWE program: Open coding
  Country: Open coding
  Educational context: Elementary; high school; elementary/high school; university & college
  Language background: L1; L1 & ESL; EFL/ESL only; unspecified

Methodological descriptors
  Design: Between group; within-groups; between & within group; single group
  Reporting: Statistical testing; descriptive statistics; no statistics
  Control group: No feedback; teacher feedback; no feedback & teacher feedback; different AWE conditions; no control group
  Text: AWE texts; other texts; AWE texts & other texts
  Outcome measure: Scores; scores & other product measures; errors; citations

The coding categories and options were developed inductively by reading through the sample studies. Developing the coding scheme was a cyclical process, and each study was coded a number of times, until the coding scheme was sufficiently refined. These coding cycles were carried out by the first researcher. The reliability of the coding was checked through the coding of 12 studies (one third of the data) by the second researcher. Rater reliability was calculated using Cohen's kappa. For the substantive descriptors the kappa values were all 1.00, except for language background, which was .75. For the methodological descriptors the kappa values were .85 for Design, 1.00 for Reporting, .85 for Control group, 1.00 for Text and .85 for Outcome measure. Any disagreements were resolved through discussion.
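As a concrete illustration of the double-coding reliability check described above, the sketch below computes Cohen's kappa for one coding category using scikit-learn. The category labels and the two coders' assignments are invented for the example; they are not the actual codes assigned in the survey.

```python
# Illustrative sketch only: hypothetical codes, not the actual survey data.
from sklearn.metrics import cohen_kappa_score

# Hypothetical 'Design' codes assigned to the same 12 studies by two coders.
coder_1 = ["between", "within", "between", "single", "between", "within",
           "between&within", "between", "within", "between", "single", "between"]
coder_2 = ["between", "within", "between", "single", "between", "between",
           "between&within", "between", "within", "between", "single", "between"]

# Cohen's kappa corrects the raw agreement rate for agreement expected by chance.
kappa = cohen_kappa_score(coder_1, coder_2)
print(f"Cohen's kappa for the Design category: {kappa:.2f}")
```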

For the research survey, the frequencies of the coding categories were collated and this information was used to describe the characteristics of the studies in the research sample. For the critical literature review, the findings of the sample studies were critically discussed in relation to the characteristics of the studies identified in the research survey and also in relation to strengths or weaknesses of particular studies.

3.3. Research survey

Table 2 shows that the primary focus of AWE research has so far been on the effects of AWE on written production. Thirty of the thirty-six studies include Product measures: 17 focus solely on Product, and another 13 studies involve Product in combination with one or more of the other constructs. The secondary focus has been on Perceptions, with five studies focusing solely on Perceptions, and another 10 including Perceptions. No studies have focused solely on Process. In the remainder of the survey, the thirty studies involving product measurements are characterized.

Table 2
Research survey: constructs (frequency).

Product: 17
Product & process: 4
Product & perceptions: 5
Product, process, & perceptions: 4
Perceptions: 5
Perceptions & process: 1
Total: 36

Table 3 shows that, in terms of types of publication, relatively few of the studies have appeared in ISI-listed journals or in books. A number of the studies are from non-ISI-listed journals, and a number are unpublished papers from conferences or websites. Table 3 also shows that 10 AWE programs are involved in the sample and that the majority of these have been developed by organizations that are major players in the field of educational technology: Criterion from ETS, MY Access! from Vantage Learning, IEA and Summary Street from Pearson Knowledge Analysis Technologies. Criterion is the program that has been examined most frequently.

Table 3
Publication, program and feedback (K = number of studies).

Publication: ISI-listed 7; Non-ISI-listed 7; Book chapter 1; Thesis 5; Report 1; Unpublished paper 9
Program: Criterion 11; MY Access! 5; Writing Roadmap 1; ETIPS 2; IEA 1; LSA semantic space 1; Summary Street 3; ECS 1; SAIF 1; QBL 4
Feedback: Content & language 20; Content 5; Language 4; Citations 1

Criterion, MY Access! and Writing Roadmap provide scores and feedback on both content and language. However, one of the studies that examined Criterion (i.e., Chodorow, Gamon, & Tetreault, 2010) limited itself to examining feedback on article errors. Summary Street, IEA, LSA and ECS are all based on a technique known as latent semantic analysis that purports to focus primarily on content feedback. ETIPS provides feedback for pre-service teachers on tasks carried out in an on-line case-based learning environment. SAIF provides feedback on the citations in a text. QBL provides comments on language errors only. The table shows that most of the studies have involved programs that provide both content and language feedback.

Table 4 shows that the majority of studies were carried out in classrooms in the United States, with the remaining studies being carried out in Asian countries, with the exception of a single study carried out in Egypt. University and college contexts were the most common, followed by high school contexts, and then elementary contexts. Almost half the studies do not specify the language background of the participants. Among the studies that did report the language backgrounds of the participants, only two (i.e., Chodorow et al., 2010; Choi, 2010) investigated language background as a variable in the effects of AWE feedback. Chodorow et al. (2010) compared the effects of Criterion feedback on the article errors of native and non-native speakers, and Choi (2010) compared the effects of Criterion feedback on written production measures of EFL students in Korea and ESL students in the U.S.

Table 4
Country, context, language background and sample size (K = number of studies).

Country: USA 21; Taiwan 4; USA & Korea 1; Japan 1; China 1; Hong Kong 1; Egypt 1
Educational context: University & college 17; High school 8; Elementary 3; Elementary & high school 2
Language background: L1 1; Mixed 6; EFL 8; EFL & ESL 1; Unspecified 14
Sample size: <10 1; 11–50 4; 51–100 9; 101–200 5; >200 10; Unspecified 1

3.4. Methodological features

Table 5 shows that most of the studies involved statistical testing, and that between-group designs, in which one or more AWE conditions were compared with one or more control conditions, are the most common. There were also a number of within-group comparisons in which the same group of students was compared across drafts and/or texts. One study (i.e., Scharber, Dexter, & Riedel, 2008) used a single-group design in which students' ETIPS scores were correlated with the number of drafts they submitted.

Table 5
Methodological features (K = number of studies).

Design: Between groups 20; Within groups 7; Between & within 2; Single group 1
Reporting: Statistical testing 23; Descriptive statistics 3; No statistics 4
Control: No feedback 17; Teacher feedback 3; No feedback & teacher feedback 1; Different AWE conditions 1; No control 8
Text: AWE text 19; Other text 9; Both AWE and other text 2
Outcome: Scores 13; Scores + other measures 11; Errors 5; Citations 1

Table 5 also shows that the most common control group for the between-group comparisons involved a condition in which students received no feedback. In some cases, students in this condition wrote the same texts as students in the experimental condition(s) but received no feedback on them, and in other cases students in the control condition did not produce any experimental texts. However, it is unclear in most of the studies whether students in the control condition did receive some teacher feedback during their normal classroom instruction. Only three studies have explicitly compared AWE feedback to teacher feedback.

In addition, the table shows that many of the studies have examined the effects of AWE feedback on AWE texts. However, 11 of the studies focus partly or exclusively on the transfer effects of AWE to the quality of texts that were not written using AWE.

Lastly, Table 5 shows that scores, followed by errors, are the most common writing production measures that have been examined in the studies. Other measures that have been examined include text length, sentence length, lexical measures and number of citations.

4. Critical review

The research survey has shown that the AWE pedagogical research domain is not a very mature one. Even though written production has been the main focus of research to date, the total number of studies carried out remains relatively small, and a number of these studies are either unpublished papers or published in unranked journals, and perhaps as a consequence are lacking in rigor. Moreover, these studies are highly heterogeneous, varying in terms of factors such as the AWE program that is examined, the design of the study, and the educational context in which the studies were carried out. Hence, not surprisingly, the research has produced mixed and sometimes contradictory results. As a result, there is only modest evidence that AWE feedback has a positive effect on the quality of students' writing and, as the research survey showed, much of the available evidence relates to the effectiveness of AWE in improving the quality of texts written using AWE feedback.

The evidence for the effects of AWE on writing quality from within-group comparisons can be said to be stronger than the evidence from between-group comparisons. In general, within-group studies have shown that AWE scores increase and the number of errors decreases across AWE drafts and texts produced by the same writers (e.g., Attali, 2004; Choi, 2010; El Ebyary & Windeatt, 2010; Foltz, Laham, & Landauer, 1999; Shermis, Garvan, & Diao, 2008; Warden & Chen, 1995). This would appear to indicate that writers are able to incorporate AWE feedback to improve the quality and accuracy of AWE texts, at least according to the criteria that AWE programs use to evaluate texts. However, due to methodological issues, some of the results of within-group studies need to be carefully interpreted. To give an example, Attali (2004) excluded 71% of his data set from analysis because the writers did not undertake any revising or redrafting. While the remaining students did on average increase their score across drafts of the same texts, the lack of utilization of AWE by over two thirds of the cohort at the very least places a question mark against the efficacy of AWE for stimulating students to revise their texts. Moreover, an obvious limitation of within-group comparisons is that the lack of a control group makes it difficult to conclude with certainty that improvements are actually attributable to the use of AWE software. Improvements made by students to successive drafts of a particular text could be attributable to their own revising skills rather than to their use of revisions suggested by AWE feedback. Improvements made to successive texts could possibly be attributable to other instructional factors or possibly even to developmental factors.

The findings from between-group comparisons, which compare one or more AWE conditions with one or more control conditions, are more mixed, and those findings that provide positive evidence frequently suffer from serious methodological drawbacks. More than half the studies using between-group comparisons showed either mixed effects or no effects for AWE feedback on writing outcomes. Mixed effects involve effects being found for some texts but not for others (e.g., Riedel, Dexter, Scharber, & Doering, 2006), for some measures but not for others (e.g., Rock, 2007), or for some groups of writers and not for others (e.g., Schroeder, Grohe, & Pogue, 2008). In a number of cases, these studies largely ignore any negative evidence in their discussions and hence draw conclusions about the effectiveness of AWE that are more optimistic than appear to be warranted. For example, in a study by Schroeder et al. (2008) on the effectiveness of Criterion in improving writing in a criminal justice writing course, one of the three groups of students utilizing AWE feedback did not achieve significantly higher final course grades than the control group. However, possible reasons for the non-significance of the results for this third group are not mentioned and a very strong positive conclusion is drawn: "Results from this study overwhelmingly point toward the value of technology when teaching writing skills" (p. 444). We also found an example in which the authors did not appear to do justice to their findings. Chodorow et al. (2010) found that Criterion reduced the article error rate of non-native speakers, but not of native speakers. However, the study did not report the article error rates for the native speakers and does not raise the point that AWE may be less effective for native speakers simply because native speakers do not tend to make many article errors. In this particular case, the lack of a significant effect for native speakers should not be taken at face value as negative evidence for the effectiveness of AWE.

A number of studies comparing AWE feedback to no feedback have found significant positive effects for AWE on writing outcomes. For example, in a study by Franzke et al. (2005) on Summary Street using a pretest/posttest design with random assignment to an AWE or a no-feedback condition, students in both conditions wrote four texts, the quality of which was scored by human raters. It was found that the AWE condition had higher holistic and content scores both for the averaged score for the four texts and for orthogonal comparisons of the scores for the first two texts with the last two texts. However, many of the studies are not as well designed, and do not include a pretest or other information on the comparability of students in experimental and control groups. In particular, results of studies that have compared writing outcomes of students who received AWE with those of students in previous cohorts should be viewed with caution. For example, Grimes (2008) found that in three out of four schools students who used MY Access! had higher external test scores than students from a previous year who did not receive AWE feedback. However, the author acknowledges that it is difficult to attribute this improvement to AWE, as during the intervention period important improvements to the quality of writing instruction provided by teachers were also instituted.

As shown by the research survey, only three studies have explicitly compared AWE feedback with teacher feedback (i.e., Frost, 2008; Rock, 2007; Warden, 2000). As the evidence from these studies is also mixed, it seems premature to draw any firm conclusions. However, it should be pointed out that none of the studies shows that AWE feedback is less effective than teacher feedback, which as such could be taken as a positive sign. Nonetheless, of concern is that these studies report little about the nature of the teacher feedback given or about whether this feedback was comparable to the AWE feedback. For example, in Warden (2000), an AWE condition in which students received specific error feedback is compared with a teacher feedback condition in which students received no specific feedback, but only general comments on content, organization, and grammar. As students in the teacher feedback condition received no specific feedback on the accuracy of their texts, it is hardly surprising that the number of errors decreased more in the AWE condition.

In general, there appears to be more support for improvement of error rates than improvement of holistic scores. For example, Kellogg, Whiteford, and Quinlan (2010) found that holistic scores did not improve, but that errors were reduced. As the error types that were reduced largely related to linguistic aspects of the text, they drew the conclusion that there was tentative support for learning about mechanical aspects of writing from AWE. In contrast, Chen (1997) found that an AWE group and a no-feedback control group decreased linguistic errors equally. However, the results of this study could well be attributable to a methodological drawback, as both experimental and control groups were in the same classes. In these classes, the teachers spent time reviewing the most common error types found by the computer, in the presence of all the students. Hence, both groups of students may have benefited from this instruction.

There appears to be no clear evidence as yet concerning whether AWE feedback is associated with more generalized improvements in writing proficiency. Some of the studies that have examined transfer of the effects of AWE to texts for which no AWE feedback has been provided found no significant differences between scores for AWE and non-AWE conditions (i.e., Choi, 2010; Kellogg et al., 2010; Shermis, Burstein, & Bliss, 2004). Moreover, although three studies did find evidence of transfer (Elliot & Mikulas, 2004; Grimes, 2008; Wang & Wang, 2012), none of these studies is rigorously designed. The Wang and Wang (2012) study had only one participant in each condition. The flaws in the Grimes (2008) study have already been discussed. In Elliot and Mikulas (2004), in each of four sub-studies it was claimed that AWE feedback was associated with better exam performance. However, there was no random assignment to conditions and the reader is given no information concerning the characteristics of the participants in the two conditions. In one of the sub-studies, students' results are compared with students from a year 2000 baseline. In addition, results for two of the four sub-studies were not tested statistically, and those that were tested were tested non-parametrically. Also, some of the claims seem rather remarkable, such as that a group who used MY Access! between February and March of 2003 had a pass rate of 81% compared to only 46% for a group who did not receive AWE feedback. It seems rather unlikely that such a short AWE intervention could lead to such a substantial change in assessment outcomes, indicating that other factors may also be in operation.

However, it is important to be aware that one of the big unknowns of writing feedback received from teachers is also whether it leads to any generalized improvements in students' revising ability or in the quality of their texts. Hyland and Hyland (2006) pointed out that research on human feedback rarely looks beyond immediate correction in a subsequent draft, so AWE research is not alone in neglecting this area. Closely connected to whether feedback can lead to generalized improvements in writing is whether it assists students in developing their ability to revise independently. One of the first steps in developing revising skills is that writers are able to notice aspects of their texts that have not, up to that point, been salient (Schmidt, 1990; Truscott, 1998). Once a feature has been noticed it becomes available for reflection and analysis. As Hyland and Hyland (2006) pointed out, demonstrating that a student can utilize feedback to edit a draft tells us little about whether the student has successfully acquired a feature. Similarly, it tells us little about whether the student has developed the meta-cognitive skills to be able to notice, and then subsequently evaluate and correct, textual problems in other texts successfully.

Currently, we know little about whether AWE actually promotes independent revising. However, there is some evidence that receiving AWE feedback may not actually encourage students to make changes either between or within drafts. Attali (2004) reported that 71% of students did not redraft their essays and 48% of those who did redraft did this only once. Grimes (2005) reported that a typical revision pattern for students was to submit a first draft, correct a few mechanical errors and resubmit as fast as possible to see if the score improved. Warden (2000) found that students who were offered a redrafting opportunity after receiving AWE feedback from QBL actually spent significantly less time revising their first drafts than students who received AWE feedback on a single draft with no redrafting opportunity, or who received teacher feedback instead of AWE feedback. Students who received no redrafting opportunity revised their texts before they received any feedback. They then submitted their texts for marking, received a mark and AWE feedback, but were not given an opportunity to redraft the text. In contrast, students who received AWE feedback and had an opportunity to redraft appeared to carry out little independent editing, instead waiting for the program to tell them what was wrong with their texts and then specifically correcting these errors. While these students were successful in correcting errors detected by AWE, they made few other changes to their texts. Moreover, this trend continued across successive assignments, suggesting that AWE feedback was not leading to much development in revising skills. However, it is important to remember that these findings corroborate findings from revision research that writers, particularly younger writers, revise little and revise superficially (Faigley & Witte, 1981; Whalen & Ménard, 1995). It may be that some students simply do not possess the revising skills needed to allow them to benefit from the revision opportunities afforded by AWE.

5. Conclusions and recommendations

This critical review suggests that there is only modest evidence that AWE feedback has a positive effect on the quality of the texts that students produce using AWE, and that as yet there is little clarity about whether AWE is associated with more general improvements in writing proficiency. Paucity of research, heterogeneity of existing research, the mixed nature of research findings, and methodological issues in some of the existing research are factors that limit our ability to draw firm conclusions concerning the effectiveness of AWE feedback.

Initially, we endeavored to meta-analyze effect sizes for the product studies in this sample. However, due to methodological issues, many of the studies had to be excluded, leaving us with a very small but still highly heterogeneous sample. Heterogeneity necessitates the inclusion of moderator analyses that examine the effects of variables such as AWE program, educational context and whether AWE feedback was compared with no feedback or teacher feedback. However, with such a small sample, there was insufficient power to conduct moderator analyses. We felt that simply providing an overall effect size that ignores possible effects of moderator variables was not a viable or meaningful option. Instead, by carrying out a critical review we have been able to identify patterns in the existing research, as well as to discuss gaps in the findings and issues in the methodologies. Below, after a brief illustration of the kind of effect size such a meta-analysis would have aggregated, we present recommendations that follow from this review and that can serve as guidelines for further research in this area.
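The sketch below shows how a single standardized mean difference (Hedges' g) could be computed for a hypothetical AWE versus no-feedback comparison, which is the kind of effect size the abandoned meta-analysis would have aggregated. All summary statistics are invented for illustration and do not come from any study in this review.

```python
# Illustrative sketch only: invented summary statistics, not data from any reviewed study.
import math

def hedges_g(mean_awe, sd_awe, n_awe, mean_ctrl, sd_ctrl, n_ctrl):
    """Standardized mean difference between two groups with Hedges' small-sample correction."""
    pooled_sd = math.sqrt(((n_awe - 1) * sd_awe**2 + (n_ctrl - 1) * sd_ctrl**2)
                          / (n_awe + n_ctrl - 2))
    d = (mean_awe - mean_ctrl) / pooled_sd           # Cohen's d
    correction = 1 - 3 / (4 * (n_awe + n_ctrl) - 9)  # small-sample bias correction
    return d * correction

# Hypothetical posttest essay scores (1-6 scale) for an AWE group and a no-feedback group.
g = hedges_g(mean_awe=4.2, sd_awe=0.8, n_awe=30,
             mean_ctrl=3.9, sd_ctrl=0.9, n_ctrl=28)
print(f"Hedges' g = {g:.2f}")
```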

Although this review has not allowed us to differentiate the effectiveness of specific AWE programs, given differences in the objectives of the programs and the nature of the feedback provided, it is likely that such differences do exist. So far, more research on the effects of AWE has been carried out for Criterion than for other programs. Therefore, more studies examining other programs are called for, and in particular studies comparing the effectiveness of more than one AWE program.

A number of the studies provided only sketchy descriptions of their participants in terms of factors such as SES, language background, literacy levels, and computer literacy. Future research needs to be more rigorous in reporting participant characteristics, in controlling for participant variables and, where appropriate, in including these as variables in the research design. In particular, further research is needed that examines the effectiveness of AWE feedback in ESL and EFL settings, and compares these to L1 settings. Given the tremendous diversity of student populations within the United States, not to mention the diversity in potential markets for AWE programs in both English-speaking and EFL contexts outside the United States, it is of particular importance that the effectiveness of AWE feedback for second language learners be investigated. The commercial programs in use in the United States were not originally designed for English as a second language populations, even though they are being marketed with such populations in mind (Warschauer & Ware, 2006).

In addition, further research examining the relative effects of AWE feedback and teacher feedback is needed, in which greater explanation of the nature and quality of feedback provided by teachers is given and in which it is ensured that the kinds of feedback offered by teachers and AWE programs are more comparable. As there are so many factors in play, it is likely to turn out to be too simplistic to make overall pronouncements about whether human feedback or computer feedback is better. What needs to be disentangled is whether it really is the source of the feedback that matters, or whether it is other factors, such as the way the feedback is delivered and the nature of the feedback provided, that make the difference. It is also important to be aware that, as developers and researchers alike frequently reiterate, AWE feedback is intended to augment teacher feedback rather than replace it (e.g., Chen & Cheng, 2008; Kellogg et al., 2010; Philips, 2007), so research into the relative effects of different ways of integrating AWE feedback into classroom writing instruction may have greater ecological validity. In a qualitative study involving the use of AWE feedback in three classrooms, Chen and Cheng (2008) found indications that AWE feedback may indeed be more effective when it is combined with human feedback. However, this study did not examine the effects of different methods of integration on written production. There are a variety of possible ways of combining AWE with teacher feedback, and of scaffolding AWE feedback. To name just a few: students can use AWE to help them improve the quality of initial drafts and then submit them to the teacher for feedback; teachers can use AWE as a diagnostic tool for identifying the problems that students have with their writing; and/or teachers can provide initial training. Research that investigates different possibilities for integrating AWE into classroom writing instruction would also be of pedagogical value.

Some might argue that in terms of the effectiveness of AWE feedback, the bottom line is whether the scores it generates correlate with external assessment outcomes and whether its repeated use in the classroom improves students' test results. However, while it is highly desirable that the transfer of the effects of AWE feedback to non-AWE texts be established, it is questionable whether external exams provide the most appropriate means of doing so. Firstly, as Warschauer and Ware (2006) remark, exam writing is generally based on a single draft in timed circumstances, whereas the whole point of AWE is that it encourages multiple drafting. Secondly, the scoring on exams may be too far removed from the aspects for which AWE provides feedback. Thirdly, AWE feedback may not be robust enough as an instructional intervention to impact noticeably on exam scores. Instead, we would recommend examining transfer of the effects of AWE feedback in non-test situations using texts that are similar in terms of genres and topics to the AWE texts students have been writing.

The question remains, of course, whether the kinds of writing that AWE feedback gives writers the opportunity to engage in actually reflect the kinds of writing that students do in their classrooms. AWE programs generally offer only a limited number of genres, such as persuasive, narrative and informative genres, though some programs such as MY Access! additionally enable teachers to use their own prompts (see Grimes & Warschauer, 2010). Moreover, as mentioned, AWE has been accused of promoting formulaic writing with an unimaginative five-paragraph structure. The way lies open for AWE research to include a greater consideration of genre by controlling for genre as a variable, and by systematically examining the influence of genre on the effectiveness of AWE feedback, for example, by comparing the effects of AWE when standard prompts are used with the effects when teachers' own prompts are used.

In conclusion, this study has carried out a critical review of research that examines the effects of formative AWE feedback on the quality of texts that students produce. It has illuminated what is known and what is not known about the effects of AWE feedback on writing. It could be argued that a limitation of the study is that it takes a narrow view of effectiveness in terms of a single dimension: written production measures. It does not focus on either of the other two dimensions of effectiveness identified by Lai (2010): the effects on writing processes or perceived usefulness. However, we feel that Lai's first dimension is an appropriate and valuable focal point for a critical review, because improving students' writing is central to the objectives of AWE and to claims regarding its effectiveness, both of which are reflected in the fact that, as this study has shown, the bulk of research conducted so far focuses on written production. We certainly do applaud AWE research that takes a triangulated approach to AWE by incorporating the effects of AWE on written production (product perspective), on revision processes and learning and teaching processes (process perspective) and on writers' and teachers' perceptions (perception perspective) (e.g., Choi, 2010; Grimes, 2008). We would also join in the plea made by Liu et al. (2002) concerning research on computer-based technology: "rather than focusing on the benefits and potentials of computer technology, research needs to move toward explaining how computers can be used to support (second) language learning – i.e., what kind of tasks or activities should be used and in what kinds of settings" (pp. 26–27). Consequently, as the next step, in a follow-up study we will examine the use of AWE feedback in the classroom, including teaching and learning processes and teacher and learner perceptions.

Appendix A. Research survey sample2

2 References marked with an asterisk indicate studies that examine solely or partially the effects of AWE on writing outcomes, and which therefore have been included in the critical review.

*Attali, Y. (2004). Exploring feedback and revision features of Criterion. Paper presented at the National Council on Measurement in Education, San Diego, April 12–16, 2004.
*Chen, J. F. (1997). Computer generated error feedback and writing process: A link [Electronic Version]. TESL-EJ, 2. Retrieved from http://tesl-ej.org/ej07/a1.html
Chen, C. E., & Cheng, W. (2008). Beyond the design of automated writing evaluation: Pedagogical practices and perceived learning effectiveness in EFL writing classes. Language Learning and Technology, 12(2), 94–112.
*Chodorow, M., Gamon, M., & Tetreault, J. (2010). The utility of article and preposition error correction systems for English language learners: Feedback and assessment. Language Testing, 27(3), 419–436.
*Choi, J. (2010). The impact of automated essay scoring (AES) for improving English language learners' essay writing. (Doctoral dissertation, University of Virginia, 2010).
*El Ebyary, K., & Windeatt, S. (2010). The impact of computer-based feedback on students' written work. International Journal of English Studies, 10(2), 121–142.
*Elliot, S., & Mikulas, C. (2004). The impact of MY Access! use on student writing performance: A technology overview and four studies. Paper presented at the Annual Meeting of the American Educational Research Association.
*Foltz, P. W., Laham, D., & Landauer, T. K. (1999). The intelligent essay assessor: Applications to educational technology. Interactive Multimedia Educational Journal of Computer-Enhanced Learning, 1(2). Retrieved from www.knowledge-technologies.com
*Franzke, M., Kintsch, E., Caccamise, D., & Johnson, N. (2005). Summary Street: Computer support for comprehension and writing. Journal of Educational Computing Research, 33(1), 53–80.
*Frost, K. L. (2008). The effects of automated essay scoring as a high school classroom intervention. PhD thesis. Las Vegas: University of Nevada.
*Grimes, D. C. (2008). Middle school use of automated writing evaluation: A multi-site case study. PhD thesis. Irvine: University of California.
*Grimes, D., & Warschauer, M. (2010). Utility in a fallible tool: A multi-site case study of automated writing evaluation. Journal of Technology, Language, and Assessment, 8(6), 1–43.
*Kellogg, R., Whiteford, A., & Quinlan, T. (2010). Does automated feedback help students learn to write? Journal of Educational Computing Research, 42, 173–196.
Lai, Y.-H. (2010). Which do students prefer to evaluate their essays: Peers or computer program. British Journal of Educational Technology, 41(3), 432–454.
*Riedel, E., Dexter, S. L., Scharber, C., & Doering, A. (2006). Experimental evidence on the effectiveness of automated essay scoring in teacher education cases. Journal of Educational Computing Research, 35(3), 267–287.
*Rock, J. (2007). The impact of short-term use of Criterion on writing skills in 9th grade (Research Report RR-07-07). Princeton, NJ: Educational Testing Service.
Scharber, C., Dexter, S., & Riedel, E. (2008). Students' experiences with an automated essay scorer. The Journal of Technology, Learning and Assessment, 7(1), 1–44.
*Shermis, M. D., Burstein, J., & Bliss, L. (2004). The impact of automated essay scoring on high stakes writing assessments. Paper presented at the Annual Meeting of the National Council on Measurement in Education.
*Shermis, M., Garvan, C. W., & Diao, Y. (2008). The impact of automated essay scoring on writing outcomes. Paper presented at the Annual Meetings of the National Council on Measurement in Education, March 25–27, 2008.
*Schroeder, J. A., Grohe, B., & Pogue, R. (2008). The impact of Criterion writing evaluation technology on criminal justice student writing skills. Journal of Criminal Justice Education, 19(3), 432–445.
*Wang, F., & Wang, S. (2012). A comparative study on the influence of automated evaluation system and teacher grading on students' English writing. Procedia Engineering, 29, 993–997.
*Warden, C. A. (2000). EFL business writing behavior in differing feedback environments. Language Learning, 50(4), 573–616.
Warden, C. A., & Chen, J. F. (1995). Improving feedback while decreasing teacher burden in ROC ESL business English classes. In P. Porythiaux, T. Boswood, & B. Babcock (Eds.), Explorations in English for professional communications. Hong Kong: City University of Hong Kong.

Other references
Anson, C. M. (2006). Can't touch this: Reflections on the servitude of computers as readers. In P. Freitag Ericsson, & R. Haswell (Eds.), Machine scoring of student essays (pp. 38–56). Logan, Utah: Utah State University Press.
Biber, D., Nekrasova, T., & Horn, B. (2011). The effectiveness of feedback for L1-English and L2-writing development: A meta-analysis (ETS Research Report RR-11-05). Princeton, NJ: ETS.
Burstein, J., Chodorow, M., & Leacock, C. (2004). Automated essay evaluation: The Criterion online writing service. AI Magazine (Fall), 27–36.
Chandrasegaran, A., Ellis, M., & Poedjosoedarmo, G. (2005). Essay Assist: Developing software for writing skills improvement in partnership with students. RELC Journal, 36(2), 137–155.


Dikli, S. (2006). An overview of automated scoring of essays. The Journal of Technology, Learning and Assessment, 5(1), 1–35.
Faigley, L., & Witte, S. (1981). Analyzing revision. College Composition and Communication, 32, 400–414.
Freitag Ericsson, P. (2006). The meaning of meaning. In P. Freitag Ericsson, & R. Haswell (Eds.), Machine scoring of student essays. Logan, Utah: Utah State University Press.
Grimes, D. (2005). Assessing automated assessment: Essay evaluation software in the classroom. Paper presented at the Computers and Writing Conference, Stanford, CA.
Herrington, A., & Moran, C. (2001). What happens when machines read our students' writing? College English, 63(4), 480–499.
Hyland, K., & Hyland, F. (2006). Feedback on second language students' writing. Language Teaching, 39, 83–101.
Patterson, N. (2005). Computerized writing assessment: Technology gone wrong. Voices From the Middle, 13(2), 56–57.
Philips, S. M. (2007). Automated essay scoring: A literature review (SAEE research series #30). Kelowna, BC: Society for the Advancement of Excellence in Education.
Schmidt, R. W. (1990). The role of consciousness in second language learning. Applied Linguistics, 11(2), 129–158.
Shermis, M. D., & Burstein, J. (Eds.). (2003). Automated essay scoring: A cross-disciplinary perspective. Hillsdale, NJ: Lawrence Erlbaum Associates.
Shermis, M. D., & Burstein, J. (Eds.). (2013). Handbook of automated essay evaluation: Current applications and new directions. New York and London: Routledge.
Taylor, A. R. (2005). A future in the process of arrival: Using computer technologies for the assessment of learning. TASA Institute, Society for the Advancement of Excellence in Education.
Truscott, J. (1996). The case against grammar correction in L2 writing classes. Language Learning, 46(2), 327–369.
Truscott, J. (1998). Noticing in second language acquisition: A critical review. Second Language Research, 14(2), 103–135.
Warschauer, M., & Ware, J. (2006). Automated writing evaluation: Defining the classroom research agenda. Language Teaching Research, 10(2), 1–24.
Whalen, K., & Ménard, N. (1995). L1 and L2 writers' strategic and linguistic knowledge: A model of multiple-level discourse processing. Language Learning, 44(3), 381–418.
Yang, Y., Buckendahl, C. W., Juszkiewicz, P. J., & Bhola, D. S. (2002). A review of strategies for validating computer-automated scoring. Applied Measurement in Education, 15(4), 391–412.
Zellermayer, M., Salomon, G., Globerson, T., & Givon, H. (1991). Enhancing writing-related metacognitions through a computerized writing partner. American Educational Research Journal, 28(2), 373–391.

Further reading

*Britt, A., Wiemer-Hastings, P., Larson, A., & Perfetti, C. (2004). Using intelligent feedback to improve sourcing and integration in students' essays. International Journal of Artificial Intelligence in Education, 14, 359–374.
Dikli, S. (2007). Automated essay scoring in an ESL setting. (Doctoral dissertation, Florida State University, 2007).
*Hoon, T. (2006). Online automated essay assessment: Potentials for writing development. Retrieved from http://ausweb.scu.edu.au/aw06/papers/refereed/tan3/paper.html
*Lee, C., Wong, K. C. K., Cheung, W. K., & Lee, F. S. L. (2009). Web-based essay critiquing system and EFL students' writing: A quantitative and qualitative investigation. Computer Assisted Language Learning, 22(1), 57–72.
*Matsumoto, K., & Akahori, K. (2008). Evaluation of the use of automated writing assessment software. In C. Bonk, et al. (Eds.), Proceedings of World Conference on E-Learning in Corporate, Government, Healthcare, and Higher Education 2008 (pp. 1827–1832). Chesapeake, VA: AACE.
*Schreiner, M. E. (2002). The role of automatic feedback in the summarization of narrative text. PhD thesis. University of Colorado.
*Steinhart, D. J. (2001). An intelligent tutoring system for improving student writing through the use of latent semantic analysis. Boulder: University of Colorado.
Wade-Stein, D., & Kintsch, E. (2004). Summary Street: Interactive computer support for writing. Cognition and Instruction, 22(3), 333–362.
Warschauer, M., & Grimes, D. (2008). Automated writing assessment in the classroom. Pedagogies: An International Journal, 3, 22–36.
Yao, Y. C., & Warden, C. A. (1996). Process writing and computer correction: Happy wedding or shotgun marriage? [Electronic Version]. CALL Electronic Journal. Available at http://www.lerc.ritsumei.ac.jp/callej/1-1/Warden1.html