
Assessing Writing 9 (2004) 122–159

Developing a common scale for the assessment of writing

Roger Hawkey a,*, Fiona Barker b
a Consultant in language testing research to Cambridge ESOL
b Validation Officer, Research & Validation, Cambridge ESOL

Available online 23 July 2004

Keywords: Writing assessment; Proficiency; Levels; Criteria; Scales

1. Background to the study

This article reports a qualitative analysis of a corpus of candidate scripts from three Cambridge ESOL examination levels, but all in response to the same writing task. The analyses, using both intuitive and computer-assisted approaches, are used to propose key language features distinguishing performance in writing at four pre-assessed proficiency levels, and to suggest how these features might be incorporated in a common scale for writing. The work is part of the Cambridge ESOL Common Scale for Writing (CSW) project, which aims to produce a scale of descriptors of writing proficiency levels to appear alongside the common scale for speaking in the Handbooks for the Main Suite and other Cambridge ESOL international exams. Such a scale would assist test users in interpreting levels of performance across exams and locating the level of one examination in relation to another.

Fig. 1 conceptualises the relationship between a common scale for writing and the levels typically covered by candidates for Cambridge ESOL examinations, the Key English Test (KET), the Preliminary English Test (PET), the First Certificate in English (FCE), the Certificate in Advanced English (CAE) and the Certificate of Proficiency in English (CPE), each of which has its own benchmark pass level, the 'C' in Fig. 1. Writing tasks set as part of the tests in Table 1 are currently scored by rating degree of task fulfilment and evidence of target language control according to criteria such as communicative effectiveness, register, organisation, linguistic range and accuracy. From the assessment by trained raters of performance on these and other criteria, a candidate is assigned a score for the writing task set.

* Tel.: +44 1840 212 080; fax: +44 1840 211 295. E-mail address: [email protected] (R. Hawkey).

1075-2935/$ – see front matter © 2004 Elsevier Inc. All rights reserved. doi:10.1016/j.asw.2004.06.001

Fig. 1. Conceptual diagram of a common scale across examination levels and ranges.

As Fig. 1 indicates, a common scale may cover "the whole conceptual range of proficiency" (Common European Framework of Reference for Languages (CEF) 2001, p. 40). If examinations for candidates at different levels are located on a common scale of proficiency, then, to cite the CEF again, "it should be possible, over a period of time, to establish the relationship between the grades on one examination in the series with the grades of another" (ibid., p. 41). The investigations and analyses of the common scale for writing project should also inform key writing assessment issues such as the development and application of criteria, the generalisability of proficiency ratings and the use of exam candidate script corpora in writing assessment research.

Table 1
CEF overall written production (illustrative scale)

C2: Can write clear, smoothly flowing, complex texts in an appropriate and effective style and a logical structure which helps the reader to find significant points.
C1: Can write clear, well-structured texts of complex subjects, underlining the relevant salient issues, expanding and supporting points of view at some length with subsidiary points, reasons and relevant examples, and rounding off with an appropriate conclusion.
B2: Can write clear, detailed texts on a variety of subjects related to his/her field of interest, synthesising and evaluating information and arguments from a number of sources.
B1: Can write straightforward connected texts on a range of familiar subjects within his field of interest, by linking a series of shorter discrete elements into a linear sequence.
A2: Can write a series of simple phrases and sentences linked with simple connectors like 'and,' 'but' and 'because.'
A1: Can write simple isolated phrases and sentences.

2. Relevant issues in the testing of writing

2.1. Communicative writing constructs

Cambridge ESOL claims to undertake "to assess skills at a range of levels, each of them having a clearly defined relevance to the needs of language learners" and "to assess skills which are directly relevant to the range of uses to which learners will need to apply the language they have learnt, and cover the four language skills — listening, speaking, reading and writing." Saville, in Weir and Milanovic (2003, p. 66), a volume describing the history and recent revision of the Cambridge ESOL Certificate of Proficiency in English exam, sees it as the task of the examination developer to construct "definitions or traits of ability for the purpose of measurement," and claims that "it is these definitions which are the constructs." In the context of language testing, Saville continues, a model of language ability represents the construct.

Given the stated Cambridge ESOL aim to assess the four skills as they are "directly relevant to the range of uses to which learners will need to apply the language they have learnt," it is the construct of communicative language ability that should underlie the language tests concerned. The communicative language ability construct, derived from models such as Bachman's (1990) view of language competence, comprises organisational competences (including grammatical and textual competences) and pragmatic competence (including illocutionary and sociolinguistic competences). The CEF describes "language use, embracing language learning" similarly as comprising "the actions performed by persons who as individuals and as social agents develop a range of competences, both general and in particular communicative language competences" (2002, p. 9).

A communicative writing construct in the context of such models of communicative language ability will also entail Bachman's (1990) pragmatic, organisational and sociolinguistic competences. Cumming (1998, p. 61), again taking a context-rooted view, reminds us that the construct 'writing' "refers not only to text in written script but also to the acts of thinking, composing, and encoding language into such text," these necessarily entailing "discourse interactions within a socio-cultural context." The CEF, echoing this interactive view of the nature of writing, notes (2001, p. 61) that in "written production activities the language user as writer produces written text which is received by a readership of one or more readers," and exemplifies writing activities (e.g., completing forms; writing articles, reports, memoranda; making notes and taking messages) which are performed for communicative purposes.

The research described in this article analyses test candidate responses to a writing task that was selected for its potential to engage the interest and communicative abilities of a wide range of language learners. The status of the task concerned, in terms of its appropriateness to the communicative writing construct, is explored in Section 5.

2.2. The assessment of communicative proficiency

Hamp-Lyons (1990), whose own participation in the first phase of the CSW project is described below, notes that once the indirect (often multiple-choice) writing tests of the 1960s and 1970s had been "chased from the battlefield," direct tests of writing held sway. The communicative writing construct invited, in the interests of test construct (and content) validity, assessment methods that measured communicative proficiency. Such methods were likely to involve the development of direct tests to elicit candidate performance on tasks with a context, purpose, authentic discourse and behavioural outcomes. Milanovic, Saville, and Shen (1992) claim that "direct tests of writing have always been standard practice in Cambridge examinations for the assessment of both first (L1) and second language (L2) writing abilities" (p. 62). In Cambridge ESOL exams, according to Saville (2003), "authenticity of test content and the authenticity of the candidate's interaction with that content are important considerations for the examination developer in achieving high validity" (p. 67).

But as Hamp-Lyons (1990) notes, direct tests of communicative language ability raise problems to which there are no easy answers, "each aspect – task, writer, scoring procedure, and reader – interacts with the others, creating a complex network of effects which to date has eluded our efforts to control" (p. 87). Bachman's (1990) framework for test method characteristics, which also emphasises the complexity of language ability assessment, includes among the variables: testing environments, test rubric, test input, expected response and relationship between input and response. Authenticity also remains a major issue. Bachman (1990, 1991) defines the notion in terms of the appropriacy of language users' response to language as communication. Bachman and Palmer (1996, pp. 23–25) re-analyse this notion into authenticity and interactiveness, the former defined as "the degree of correspondence of the characteristics of a given language task to the features of a TLU task," the latter as "the extent and type of involvement of the test taker's individual characteristics in accomplishing the test task."

The assessment of writing through communicative tasks brings a change in the relationship between reliability and validity. Saville (2003, p. 69) notes "a potential tension between them; when high reliability is achieved, for example by narrowing the range of task types or the range of skills tested, this restricts the interpretations that can be placed on performance in the test, and hence its validity for many purposes." When language tests were more discrete-item and objective, it was easier to obtain stable, consistent results free from bias and random error. With the task-based assessment of communicative writing proficiency, however, validity, in Bachman's (1990, p. 161) sense of maximising the effects of the language abilities we want to measure, often decrees test tasks that are not susceptible to discrete-point marking but require the kind of rating, for example through criteria and band scales, that is more vulnerable to problems of intra- and inter-rater reliability.

Writing tasks set as part of the Cambridge ESOL examinations in Fig. 1 are rated according to criteria incorporated in band descriptors used to place a candidate in terms of "what learners can be expected to do" (Cambridge ESOL exam Handbooks) and within the proficiency level represented by the exam (s)he has taken. Cambridge ESOL describes the exams as "linked to the Common European Framework for Modern Languages, published by the Council of Europe" and as "the only certificated exams referred to in the Framework document as specifically linked to it by a long-term research programme" (Cambridge Examinations in English, a brief guide). The Cambridge ESOL First Certificate in English (FCE) exam, for example, claims to certify successful candidates at CEF B2 or Vantage level.

A further, related problem with communicative test tasks is that, while they attempt to mirror specific authentic activities, they are also expected to offer generalisability to other task performances and extrapolation to future abilities. Morrow (1979, 1990) sees generalisability potential in the analysis of communicative tasks into their enabling skills or "micro-skills" (see Munby, 1978), since the "assessment of ability in using these skills ... yields data which are relevant across a broad spectrum of global tasks, and are not limited to a single instance of performance" (Morrow, 1979, p. 20). Saville and Hawkey (2004), among others, however, note the difficulty of isolating the particular enabling skills actually used in the performance of tasks.

North (2000) notes that "there are arguments for and against using the same rating categories for different assessment tasks," but warns that "if the categories are based on the task rather than a generic ability model, results from the assessment are less likely to be generalisable" (p. 568). North's "categories" are assessment criteria such as range, accuracy, and interaction. The CEF (2001) suggests that a "common framework scale should be context-free in order to accommodate generalisable results from different specific contexts." But the test performance descriptors concerned also need "to be context-relevant, relatable or translatable into each and every relevant context ..." (p. 21).

Weir (1993) considers rigour in the content coverage of direct performance tasks as one way to increase generalisability. The sample of communicative language ability elicited from test-takers by a test task must be "as representative as possible," in accordance with "the general descriptive parameters of the intended target situation particularly with regard to the task setting and task demands" (p. 11). Bachman (2002) proposes, in similar vein, assessments "based on the planned integration of both tasks and constructs in the way they are designed, developed and used." For Bachman, a "fundamental aim of most language performance assessments" is to identify tasks that "correspond to tasks in 'real-world' settings and that will engage test-takers in language use or the creation of discourse" (p. 471).

In the context of the research project described in this article, the constructs and problems mentioned above, communicative language ability, the socio-cultural context, task and other effects, generalisability, authenticity, and, of course, validity and reliability, are to a greater or lesser extent at issue. The authenticity and thus the validity of the particular task used to collect a range of candidate performances must be considered. Inter-rater reliability will have to be established to permit inferences to be made about communicative performance features typical of different levels of proficiency, and the issue of generalisability is, of course, critical to any work related to the development of common scales. Since the analyses in the research described below are based on a wide range of learners performing on a single, common task, the generalisability question will demand particularly convincing answers.

2.3. Bands and scales in the assessment of performance on communicative writing tasks

Alderson's (1990) paper on direct test assessment bands remains an insightful survey of the principles and practices of proficiency level descriptor band scales in the assessment and reporting of direct language test performances. Alderson notes that bands represent a range of scores rather than 'precisely defined' performances, which, he suggests, may help testers avoid a spurious impression of accuracy. He also makes the useful distinction between 'constructor-oriented,' 'assessor-oriented' and 'user-oriented' scales; band descriptions may be used in test development, to rate test performance or to interpret performance for test candidates or receiving organisations. These categories of scale are not, of course, mutually exclusive. The Cambridge ESOL CSW project is intended eventually to inform test constructors, assessors and users.

Alderson's analysis of band scale development raises issues that must receive attention in this study. These include: deciding which assessment criteria to include and how to define them; distinguishing the end of one band or level from the beginning of the next (without, e.g., resorting entirely to indefinite distinctions between 'always,' 'usually,' 'sometimes,' 'occasionally'); avoiding long, over-detailed descriptions (see also CEF, 2001; North, 2000; Porter, 1990), and achieving intra- and inter-rater consistency when bands are used to assess proficiency. Spolsky (1995) expresses stronger doubts about the viability of band scales. Such scales may be attractive for their "easy presentation" of language test results, but risk an "over-simplification" that "ultimately misrepresents the nature of language proficiency" and "leads to necessarily inaccurate, and therefore questionable, statements about individuals placed on such a scale" (p. 350).


The message for language testers is not, however, that scales are not feasible but that their development and use should take account of the "underlying complexity of writing and its measurement" as part of "an open and contextual approach to language proficiency assessment" (p. 353).

2.4. Developing and revising rating scales

On the question of how band scales are developed, suggestions from the Common European Framework are examined here in some detail as they will help categorise the approaches of the research that is the subject of this article.

The CEF (2001) suggests the following on scale development methodologies in general:

There are a number of possible ways in which descriptions of language proficiency can be assigned to different levels. The available methods can be categorised in three groups: intuitive methods, qualitative methods and quantitative methods. Most existing scales of language proficiency and other sets of levels have been developed through one of the three intuitive methods in the first group. The best approaches combine all three approaches in a complementary and cumulative process. (p. 207)

Qualitative methods, according to the CEF account (2001, p. 207), "require the intuitive preparation and selection of material and the interpretation of results." The use of quantitative methods involves scale developers in quantifying "qualitatively pre-tested material, and will require the intuitive interpretation of results."

The CEF then provides examples of intuitive, qualitative and quantitative methods. Intuitive methods are seen as requiring "the principled interpretation of experience" rather than "structured data collection," probably involving the drafting of a scale using existing scales and other relevant source materials, "possibly after undertaking a needs analysis of the target group," after which "they may pilot and revise the scale, possibly using informants." This process is seen as being led by an individual, a committee (e.g., a development team and consultants) or as "experiential" (the committee approach but over a longer period, developing a "house consensus" and possibly with "piloting and feedback").

The CEF account describes qualitative methods of scale development as involving "small workshops with groups of informants and a qualitative rather than statistical interpretation of the information obtained" (ibid., p. 209), the use of expert or participant-informant reactions to draft scales, or the analysis of typical writing performances using "key features" or "traits" to refine provisional criteria and scales and relate them to proficiency levels.

The CEF then proposes three quantitative methods of developing band scales. Discriminant analysis sees sets of performances rated and "subjected to a detailed discourse analysis to identify key features," after which multiple regression is used "to determine which of the identified features are significant in determining the rating which the assessors gave." These features can then be incorporated in the required level descriptors. In Fulcher's (1996) multi-dimensional scaling, "a descriptive technique to identify key features and the relationships between them" (p. 210) is used on performance ratings to identify features decisive in determining level and "provides a diagram mapping the proximity or distance of the different categories to each other." Item response theory (IRT) or latent trait analysis, for example using the Rasch model to "scale descriptors of communicative proficiency," associates descriptors of communicative performance with proficiency levels. The generalisation advantages of Rasch analysis are significant; such analysis can provide "sample-free, scale-free measurement ... scaling that is independent of the samples" (p. 211).
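The CEF does not spell out the computation, and the CSW study itself did not use IRT; purely as an illustration of the Rasch idea referred to above, the sketch below (in Python, with invented logit values) shows the dichotomous Rasch model, in which the probability that a candidate endorses or achieves a descriptor depends only on the difference between that candidate's ability and the descriptor's calibrated difficulty.

```python
import math

def rasch_probability(theta: float, difficulty: float) -> float:
    """Dichotomous Rasch model: probability that a person of ability
    `theta` (in logits) succeeds on a descriptor of the given
    `difficulty` (also in logits)."""
    return 1.0 / (1.0 + math.exp(-(theta - difficulty)))

# Illustrative values only: three descriptors calibrated at increasing
# difficulty keep the same ordering for any candidate ability, which is
# what underlies the "sample-free" claim for Rasch scaling.
for theta in (-1.0, 0.0, 1.5):
    probs = [round(rasch_probability(theta, b), 2) for b in (-2.0, 0.0, 2.0)]
    print(f"ability {theta:+.1f}: {probs}")
```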

Insights for the study here are also taken from Fulcher (2003, p. 92), who classifies approaches to rating-scale development as either intuitive (including expert, committee or experiential judgement) or empirical (data-based or -driven, empirically-derived binary choice boundary definition scales, and the ranking of scaling descriptors by experts). There are some similarities, too, with Upshur and Turner (1995, 1999) and Turner and Upshur (1996), who develop empirically derived, binary choice, boundary definition scales (EBBs). This method rank-orders target language samples, scores them, then identifies "features that were decisive in allocating the samples to particular bands or score ranges" (Fulcher, 2003, p. 104).

In terms of the CEF categorisation of approaches above, and of the methods suggested by Fulcher, and by Upshur and Turner, the study described in this article, which is but one channel of inquiry in what Cambridge ESOL (see Saville, 2003, p. 64) calls "an on-going programme of validation and test revision," may be characterised as follows:

• led by an individual researcher using "the principled interpretation of experience" with a co-researcher responsible for the computer linguistic analysis of some data;
• using the "intuitive preparation and selection of material and the interpretation of results," but also "structured data collection";
• using data-based or -driven approaches;
• using analysis of typical writing performances through "key features" or "traits," but also reference to "existing scales and other relevant sources," to develop rating criteria for a draft proficiency scale, to be "piloted and revised";
• using some quantitative methods to validate the sorting of data, for example inter-rater reliability statistics on the candidate scripts to be grouped according to level;
• the use of "small workshops with groups of informants and a qualitative rather than statistical interpretation of the information obtained" to refine provisional criteria and scales and relate them to proficiency levels; and
• referring interim work regularly to a committee developing a "house consensus," namely the Cambridge ESOL Writing Steering Group.

Sections 4–8, which describe the study in some detail, will illustrate these approaches.


2.5. Example criteria and bands

The current study towards a common scale for writing is informed by the bands and assessment criteria for writing used in a number of existing band scales. Table 1 shows the CEF illustrative six-level scale for "overall written production" (A1 Breakthrough to C2 Mastery) (2001, p. 61).

It is possible to infer criteria such as the following from this scale: clarity, fluency, complexity; appropriacy and effectiveness of style; logical structure and connections; links, helping the reader to find significant points; range/variety of topics. For its six-level "overall written interaction" scale, the CEF (2001, p. 84) refers to the following overlapping but not identical criteria: clarity, precision, flexibility, effectiveness; emotional, allusive and joking language; conveying degrees of emotion; highlighting the significance of events and experiences. Such descriptors for writing proficiency, derived as they are according to the CEF's guidelines of positiveness, definiteness, clarity, brevity, and independence (pp. 206–207), inform the descriptors identified later for use in a draft scale proposed in this study.

Table 2 summarises bands and assessment criteria used for Cambridge ESOL Main Suite and other key international tests of writing.

Table 2
Main writing assessment criteria used with Cambridge ESOL Main Suite and other exams

Certificate of Proficiency in English (CPE)
Bands/levels: 6 bands/levels; effect on reader: very positive, positive, achieves desired effect, negative, very negative, nil
Main criteria for assessment: Task realisation: content, organisation, cohesion, range of structures, vocabulary, register and format, target reader; General impression: sophistication and range of language, style, register, format, organisation and coherence, topic development, errors

Certificate in Advanced English (CAE)
Bands/levels: 6 bands/levels; effect on reader: very positive, positive, would achieve required effect, negative, very negative, nil
Main criteria for assessment: Task specific: content; range; organisation and cohesion; register; target reader; General impression: task realisation: coverage, resourcefulness; organisation and cohesion; appropriacy of register; language: control, naturalness, range of vocabulary and structure, errors

First Certificate in English (FCE)
Bands/levels: 6 bands/levels; effect on reader: very positive, positive, would achieve required effect, negative, very negative, nil
Main criteria for assessment: Task specific: content; range; organisation and cohesion; appropriacy of register and format; target reader; General impression: task realisation: full, good, reasonable, not adequate, not at all; coverage of points, relevance, omissions, original output; organisation and links; control of language: range and accuracy; appropriacy of presentation and register

Preliminary English Test (PET)
Bands/levels: 5 marks for 'task'; 5 for language
Main criteria for assessment: Task coverage, elaboration, organisation; language range, variety, complexity, errors

Key English Test (KET)
Bands/levels: 10, 5 and 5 marks for three tasks
Main criteria for assessment: Message communication, grammatical structure, vocabulary, spelling, punctuation

Certificates in English Language Skills (CELS)
Bands/levels: 6 bands/levels
Main criteria for assessment: Content points, length; format and register: appropriacy; organisation: clarity, intent; cohesion: complexity, variety of links; structure and vocabulary range: range, ± distortion; accuracy: ± impeding errors; paragraphing, spelling, punctuation

International English Language Testing System (IELTS)
Bands/levels: 9 bands/levels; effect on reader: expert, very good, good, competent, modest, limited, extremely limited, intermittent, non-user
Main criteria for assessment: Task fulfilment: requirements, exploitation, relevance, arguments, ideas, evidence: logic, development, point of view, support, clarity; coherence and cohesion; Communicative quality: impact on reader, fluency, complexity; Vocabulary and sentence structure: range, appropriacy, accuracy, error types


These descriptors and criteria also inform decisions on the draft band descriptors proposed below. Noted, in particular, were the following, used in the descriptions of proficiency levels for one or more Cambridge ESOL exams, and classified here under four headings that appear appropriate superordinates for the descriptors and criteria covered:

Fulfilment of the task set

task realisation: full, good, reasonable, not adequate, not at all

coverage of points, relevance,

content points, omissions, original output

task fulfilment: requirements, exploitation

relevance, arguments, ideas, evidence: logic,

development, point of view, support, clarity;

length

Communicative command of the target language

communicative quality: impact on reader, effect on target reader

sophistication and range of language

fluency, complexity,

range of vocabulary and structure

language: control, naturalness, structure and vocabulary range

style/register and format

appropriacy of register and format

appropriacy of presentation and register

format and register: appropriacy

Organisation of discourse

organisation and coherence; organisation and links;

organisation: clarity, intent; paragraphing,

cohesion: complexity, variety of links

coherence and cohesion

topic development

Linguistic errors

accuracy: impeding, non-impeding errors

spelling, punctuation

errors

accuracy, error types

distortion


In the study reported here, it will be necessary to identify the characteristics or key features of the writing of candidates who take exams used to certificate learners at different proficiency levels, and who perform at different proficiency levels across those exams. The descriptors and criteria listed above will be one reference source for decisions on descriptors and criteria for use in the study. The groupings do suggest that, judging from their prominence in the Cambridge ESOL scales sample in Table 2, task realisation, linguistic error (or accuracy?), and the organisation of discourse could figure in the drafting of any new scale. Not, perhaps, belonging under such headings, however, because they are less directly connected with the content of response to the task set, with lexico-grammatical correctness or with the way a text is organised, are descriptors or criteria such as appropriacy, fluency, complexity, sophistication and range of language, included above under the very tentative heading "Communicative command of the target language."

Fulcher (2003, p. 96), warning relevantly that "the intuitive approach to scale development has led to a certain amount of vagueness and generality in the descriptors used to define bands," reminds us of the need to define and distinguish between key "components of assessment":

In language testing, the attention of raters has been drawn to the accuracy of structure and vocabulary in speech as one component of assessment, and the quality and speed of delivery as a separate component. This is an attempt at construct definition: the operational definition of two related but distinct components that make up the construct of speaking. (p. 27)

The same need to attempt construct definition applies, of course, to the development of descriptors and scales for the assessment of writing, especially when samples of candidate writing are analysed for features that might typify their level of proficiency. On linguistic error or accuracy, for example, Fulcher (2003) appears to accept teacher definitions of the errors made by speakers. He notes, in addition, that "some of these errors interfere with communication, and others do not." Fulcher also accepts that accuracy in the use of a second language may be associated with "students who concentrate on building up grammatical rules and aim for accuracy" (p. 26), unlike those "who concentrate on communicating fluently, paying little attention to accuracy."

Helpfully for the use of accuracy, in particular lexico-grammatical accuracy, as a criterion in the analysis of candidate scripts in this study, Fulcher refers to low and high gravity errors, and to the following types of accuracy error areas: agreement, word order, pronouns and relative clauses, tense, prepositions. All these categories of error are used later in this study.

Accuracy is frequently juxtaposed in language assessment contexts with appropriacy, a construct that will also emerge from the analysis of candidate scripts below. Appropriacy entails the application of Hymes' (1971) "rules of use without which the rules of grammar would be useless." Appropriacy would seem to belong mainly under Bachman's (1990) pragmatic competence construct, which includes sociolinguistic competences such as sensitivity to dialect, variety, register.

The fluency construct also appears in CEF descriptors, for example in the global scales, at C2 level: "Can express him/herself fluently and spontaneously ...," or in the Common Reference Level self-assessment grid for writing, again at C2 level: "I can write clear, smoothly flowing text ..."

The development of descriptors, criteria and bands in this study will be informed both by relevant existing descriptors, criteria and bands, and by the intuitive and qualitative analyses of candidate scripts. The study being part of an iterative research process, insights will also come from previous phases in the Cambridge ESOL CSW project.

3. Research methodology

3.1. Lessons from CSW Research Phase 1

The CSW project is phased. Phase 1 is summarised here as background to the more detailed description of Phase 2, the main topic of this paper.

The Phase 1 project design called for a two-fold approach to research towards the development of a common scale for writing. A senior Cambridge ESOL examiner, Annette Capel, revisited existing Cambridge exam mark schemes and modified intuitively the descriptors for the levels represented by the main suite of Cambridge exams. The outcome was a five-band draft common scale for writing using criteria such as: operational command of written language; length, complexity and organisation of texts; register and appropriacy; range of structures and vocabulary; and accuracy errors (Saville & Capel, 1995). The scale, shown in Fig. 2, is based on 'Pass' level descriptors for each of the five Cambridge Main Suite exam levels.

As an applied linguist with a particular interest in writing assessment, Liz Hamp-Lyons (1995) was invited to investigate a representative corpus of candidate scripts from PET, FCE, CAE and CPE exams. From this corpus, she proposed "can do," "can sometimes do," and "cannot do" statements to characterise the proficiency levels of the scripts, and identified criteria such as task completion; communicative effectiveness; syntactic accuracy and range; lexical appropriacy; chunking, paragraphing and organisation; register control; and personal stance and perspective. As the scripts in the corpus were from candidates taking a range of Cambridge ESOL exams and thus responding to different prompts, Hamp-Lyons noted significant task effects on candidate writing performance. These provided interesting insights into task–performance relationships, but made more difficult the identification of consistent features of writing at different levels.


Fig. 2. Phase 1: draft common scale for writing (Capel, 1995).

Learning from the experience of Phase 1 of the CSW project, the following decisions were made on the approach to Phase 2:

1. Insights from existing scales and criteria would continue to inform work towards the development of a common scale for writing (as they had Capel's work in Phase 1).
2. The tabula rasa analysis of candidate scripts (as performed by Hamp-Lyons) would also be continued.
3. Task effect would be controlled by using a corpus of candidate writing in response to the same communicative task across exam levels.
4. The qualitative analyses of scripts would be carried out by a single researcher, thus filtered through the corpus analyst's own experience and preferences, but taking account (see Sections 4–8) of the analyses of existing scales and of candidate scripts carried out in Phase 1.
5. The intuitive analyses would be backed by some computer corpus analyses of the scripts, to be carried out by a second researcher in close contact with the first.
6. The work would continue to be monitored by the Cambridge ESOL Writing Steering Group.

3.2. Corpus linguistics and computer analyses

The decision to use computer corpus analysis in Phase 2 of the CSW project was partly motivated by advances in this approach since Phase 1.

Corpus Linguistics (CL) is the "study of language based on examples of 'real life' language use," in this case the written language produced in live and simulated test situations in response to 'real life' tasks (McEnery & Wilson, 1996, p. 1). Using CL techniques in this study would allow certain checks on the assertion that "actual patterns of use in natural discourse are often quite different from linguists' perceptions, and many patterns simply go unnoticed until they are uncovered by empirical analysis" (Biber, Conrad, & Reppen, 1998, p. 145).

The reasons for the growing popularity of CL techniques in language testing include the ease of access to data and the range of research tools available. Computerised analyses can reveal many aspects of language use quickly and accurately, reducing the need for painstaking manual analysis and revealing patterns or facts that may be undetectable to the naked eye. CL techniques are used alongside other methodologies to reveal important facts about language.

There has been relatively little corpus-based research specifically related to learner writing. A notable exception is Granger and Rayson (1998), who investigated features of learner writing by comparing native and non-native texts from two corpora (the International Corpus of Learner English and the Louvain Corpus of Native English Essays). Granger and Rayson found that the non-native speakers "overused three categories significantly: determiners, pronouns and adverbs, and also significantly underused three: conjunctions, prepositions and nouns" (p. 123). Equally important is Kennedy, Dudley-Evans, and Thorp (2001), which is examined in greater detail below.
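Granger and Rayson's analysis used their own corpora and tools; purely as an illustration of the kind of over/underuse comparison described above, the sketch below computes the log-likelihood (G2) statistic commonly used for comparing the frequency of a word-class category across two corpora. The corpus sizes and counts are invented.

```python
import math

def log_likelihood(freq_a: int, size_a: int, freq_b: int, size_b: int) -> float:
    """Log-likelihood (G2) statistic comparing the frequency of one
    category (e.g. pronouns) in corpus A against corpus B."""
    expected_a = size_a * (freq_a + freq_b) / (size_a + size_b)
    expected_b = size_b * (freq_a + freq_b) / (size_a + size_b)
    g2 = 0.0
    if freq_a:
        g2 += freq_a * math.log(freq_a / expected_a)
    if freq_b:
        g2 += freq_b * math.log(freq_b / expected_b)
    return 2 * g2

# Invented counts: pronoun tokens in a learner corpus vs. a native corpus.
g2 = log_likelihood(freq_a=5200, size_a=100_000,
                    freq_b=4100, size_b=100_000)
print(f"G2 = {g2:.1f} (above ~3.84 is significant at p < .05, 1 df)")
```

Whether a significant difference counts as overuse or underuse then depends simply on which corpus shows the higher relative frequency.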

Corpora and corpus linguistic techniques are increasingly used in Cambridge ESOL's research projects for analysing candidate performance or developing examination tasks alongside established methodologies such as qualitative analysis and intuition, and the knowledge and experience of item writers, examiners and subject officers. Although Cambridge ESOL has been developing corpora for over a decade (most notably the Cambridge Learner Corpus, see Ball, 2001), the related analytical techniques have tended to be used for small-scale research or test validation projects such as the comparison of scripts from different versions of the same examination or the analysis of transcripts of Young Learners speaking tests (see, e.g., Ball & Wilson, 2002). CL techniques have also been used by ETS, for example, to study the writing section of the TOEFL computer-based test, and in the development of the new writing and listening tests for TOEFL 2005.


Size is not everything when using corpora, because of the detailed scrutiny that can be applied to a small corpus. Aston and Burnard (1998, p. 21) note that "it is striking how many descriptive studies have analysed only small corpora (or small samples of larger ones), often because of the need to inspect and categorise manually." This is relevant to Phase 2 of the CSW study because the total data submitted to computer analysis amounted to 18,000 words in 98 scripts, that is the four sub-corpora selected after the initial analysis of the 53,000-word, 288-script corpus (see below). Whatever the size of the corpus, Leech (1998) urges researchers to be cautious when drawing general inferences from corpus findings and to stay alert to "the influence of hidden variables implicit in the way we collected or sampled the data." In this study, having built our own small corpus, we were able to avoid this potential problem although we should still bear in mind that "a corpus is a finite sample of an (in principle) infinite population: we cannot easily extrapolate from what is found in a corpus to what is true of the language or language variety it supposedly represents."

A description of the recent growth in the use of corpora in Cambridge ESOL can be found in a series of articles detailing key projects over the last 2 years (Ball, 2001, 2002; Ball & Wilson, 2002; Boyle & Booth, 2000).

A second reason for applying CL techniques to the data in the study was the use of similar techniques in a contemporary study at the University of Birmingham (UK), where Dr. Chris Kennedy was working with colleagues on the analysis of a corpus of 150 candidates' scripts in response to writing tasks from the International English Language Testing System (IELTS) (Kennedy et al., 2001). The methodology of the Kennedy et al. project involved re-typing candidate scripts into text files, including all errors, performing a 'manual analysis' to note features of interest by band level, performing statistical analyses on essay length, and then applying the Concord and Wordlist tools of the WordSmith Tools program (Scott, 2002). This is a text analysis software package that has three main actions: producing wordlists from texts, converting texts into concordances (contextualised extracts showing where a key word or phrase occurs) and identifying the keywords in a text (by comparing the words in the text with a larger reference list).
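WordSmith Tools is a commercial package; the fragment below is only a rough sketch, in Python, of the first two of the three operations just described (a frequency wordlist and a key-word-in-context concordance), run over an invented script fragment. A keyword listing would then compare such a wordlist against one built from a larger reference corpus, for example with the log-likelihood statistic sketched earlier.

```python
import re
from collections import Counter

def wordlist(text: str) -> Counter:
    """Frequency list of word forms, lower-cased."""
    return Counter(re.findall(r"[a-zA-Z']+", text.lower()))

def concordance(text: str, node: str, span: int = 30):
    """Key-word-in-context lines: `span` characters of context either
    side of each occurrence of `node` (case-insensitive)."""
    lines = []
    for m in re.finditer(rf"\b{re.escape(node)}\b", text, re.IGNORECASE):
        left = text[max(0, m.start() - span):m.start()]
        right = text[m.end():m.end() + span]
        lines.append(f"{left:>{span}}[{m.group()}]{right}")
    return lines

# Invented candidate-script fragment, for illustration only.
script = ("I prefer live music because live concerts have a special "
          "atmosphere, although recorded music is easier to enjoy at home.")
print(wordlist(script).most_common(5))
print("\n".join(concordance(script, "music")))
```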

Interesting points emerging from the Birmingham research and relevant to Phase 2 of the CSW study include features of writing identified at different proficiency levels, for example:

• longer essays with broader vocabulary range at higher IELTS proficiency levels;
• more rhetorical questions, interactivity, idioms, colloquial, colourful and metaphorical language at higher levels; and
• the over-use of explicit (probably rote-learnt) cohesion devices by candidates at lower writing proficiency levels.

Some of these aspects of high and low level proficiency in writing are echoed in findings in Phase 2 of the CSW study reported below.


4. CSW project Phase 2 research design

The main research questions for Phase 2 of the Common Scale for Writing project were:

• What are the distinguishing features in the writing performance of ESOL learners or users across three Cambridge English examination levels, addressing a common task?
• How can these be incorporated into a scale of band descriptors?

The research design for an empirical study to answer these questions was as follows:

• A corpus of writing performances, all responding to the same communicative task, was obtained from candidates at three different Cambridge exam levels.
• Each script was graded by more than one experienced and trained rater, using a single assessment scale.
• All scripts were read and graded by the main researcher, who wrote comments on salient features of each.
• Sub-corpora of scripts, selected at four proficiency levels according to the band scores assigned by raters, were identified for closer analysis by the main researcher.
• Analyses of the writing samples in the sub-corpora were carried out, to identify, then check through expert consultation, typical features of the writing samples for each level.
• The manual analyses of the sub-corpora were supplemented by computer analyses run by the second researcher to re-examine some of the characteristics identified as typical of each sub-corpus and explore other features of potential interest.
• The characteristics and criteria identified were rationalised into a draft scale of band descriptions for the proficiency levels specified, this scale to be proposed as a draft common scale for writing.

5. CSW project Phase 2 research Steps 1 and 2

To obtain scripts for the Phase 2 corpus, a communicative writing task was needed which was suitable for candidates for Cambridge exams at three levels: FCE, CAE and CPE. After consideration of the Principal Examiner's report on the December 1998 FCE exam session, which suggested it as relevant to candidates' real world and interests, and likely to "engage test-takers in language use or the creation of discourse" (see Bachman, 2002), a task prompt was selected. The task rubric was: Competition: Do you prefer listening to live music or recorded music? Write us an article giving your opinion. The best article will be published in the magazine and the writer will receive £500. The task served its purpose well in the sense that none of the candidates in the study sample misunderstood the topic.

A sample of 108 live test scripts was selected from the December 1998 administration of the exam. Two hundred further international candidates were identified, split between those preparing for the CAE and the CPE exams. Pilot test papers were sent to six Cambridge ESOL test centres, with the instruction that the test, including the writing task already completed by the FCE candidates, should be administered to the candidates for CAE and CPE within a 2-week period. One hundred and eighty pilot CAE and CPE scripts were returned, to be added to the 108 live FCE scripts to form the 288-script (53,000-word) corpus used in Phase 2 of the study.

Ten experienced Cambridge writing test examiners were invited to a marking day and oriented to the mark scheme to be used. The 180 pilot scripts (113 CAE candidates and 67 CPE candidates) were to be marked by examiners, all using the mark scheme already used for the same task in the live FCE exam. Once the 180 pilot CAE/CPE scripts had been marked, the 108 live FCE candidate scripts were remarked by members from the same team of examiners.

To identify distinguishing features in writing performance on the common task across three proficiency levels, the main researcher then rated all 288 scripts in the corpus, using the same FCE mark scheme as the experienced raters. This rating was carried out without his prior knowledge of which scripts belonged to which level of exam candidates, or of the ratings already assigned by the Cambridge ESOL raters.

The overall FCE mark scheme used by all raters included criteria such as task points and own output; range and control of structure and vocabulary; organisation and cohesion; appropriacy of presentation and register, and effectiveness of communication of message on target reader. The task-specific mark scheme included relevance, range of structure and vocabulary; presentation and register, adjusted for length, spelling, handwriting, irrelevance. The band ratings assigned were on a scale from 0 to 5 with the option of three selections within a band (e.g., 4.3, 4.2 or 4.1). Agreement between the ratings assigned by the script analyst and those of the experienced UCLES raters was then checked.

Table 3 gives rater identifications, numbers of ratings, average ratings assigned (with the FCE bandings from 1 to 5 divided into 15 points, three for each band on the FCE assessment scale, thus 5.3 = 15, 5.2 = 14, 4.3 = 12 and so on) and inter-rater correlations (Pearson r).

Table 3 shows reasonably high inter-rater correlations across the raters and between the trained raters and the analyst, given that with correlations of ratings of the same task, Pearson r levels around .8 and above would be desirable (Hatch & Lazaraton, 1991, p. 441). It should also be noted, however, that the experienced raters each rated relatively few of the FCE papers (see Table 3 for the number of ratings assigned by each rater).
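The article reports only the resulting statistics; as a sketch of the two computations described above (mapping a band score such as 4.3 onto the 15-point scale, then correlating two raters' scores), assuming hypothetical rater data, the conversion and the Pearson r might look like this in Python:

```python
from statistics import correlation  # Pearson r; available from Python 3.10

def band_to_points(band: float) -> int:
    """Map an FCE band score such as 4.3 onto the 15-point scale
    described in the text (5.3 = 15, 5.2 = 14, ..., 1.1 = 1)."""
    whole, sub = divmod(round(band * 10), 10)
    return (whole - 1) * 3 + sub

# Hypothetical ratings of the same six scripts by one trained rater
# and by the script analyst.
rater_a = [band_to_points(b) for b in (5.3, 4.2, 3.1, 4.3, 2.2, 3.3)]
analyst = [band_to_points(b) for b in (5.2, 4.3, 3.2, 4.1, 2.1, 3.3)]
print(round(correlation(rater_a, analyst), 2))  # inter-rater Pearson r
```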

Table 3
FCE candidate scripts: ratings, standard deviations and inter-rater correlations between experienced raters and the script analyst

Raters: A, B, H, I, K, L
No. of ratings: 19, 12, 14, 32, 21, 107
Average score assigned: 8.72, 9.50, 10.71, 10.90, 10.14, 9.82
S.D.: 2.24, 2.75, 2.53, 2.41, 2.90, 2.68
Correlation with analyst ratings (L): .84, .92, .76, .89, .90, –
Correlation between analyst ratings and all other raters combined: .84
Note: 5 ratings in the original data have no rater ID.

The scripts provided by candidates at Cambridge ESOL centres who were preparing to take the CAE or the CPE exam were each assigned a band rating by 3 of 10 trained raters, each rater rating an average of 50 scripts, allocated differentially from the total of 180 CAE/CPE scripts. The averages and standard deviations across raters thus differ somewhat since each rater was rating a different set of scripts, apart from the corpus analyst, who again rated all the (180) scripts concerned. Table 4 gives the numbers of scripts rated by the raters, standard deviations for raters rating 50 or more scripts, and inter-rater correlations between each of the raters and the analyst.

Table 4
CAE/CPE candidate scripts: ratings, standard deviations and inter-rater correlations with the script analyst

Raters: A, B, D, E, F, G, H, K, L
No. of ratings: 50, 69, 71, 48, 62, 70, 51, 72, 180
Average score assigned: 10.36, 12.14, 11.38, 9.46, 10.19, 9.84, 10.25, 10.04, 10.35
S.D.: 2.84, 2.28, 2.80, 2.11, 1.93, 2.35, 2.69, 2.99, 2.56
Correlation with analyst ratings (L): .75, .72, .68, .75, .75, .70, .62, .74, 1.00

It will be noticed that the correlations in Table 4 are, on the whole, lower than those for the FCE candidate ratings in Table 3. This may be because this study covers groups of candidates from three different exam levels but, because of the nature of its research questions, each candidate responded to a common task and was rated according to the same (FCE) assessment scale. It is possible that raters found the FCE scale less appropriate for the CAE and CPE than for the FCE candidates. After all, the current FCE and CPE (though not the CAE) Handbooks state that the FCE and CPE mark schemes should be interpreted at FCE and CPE levels respectively, which might suggest that the schemes are potentially level-specific.

The inter-rater correlations were, however, considered high enough for the purpose for which they were intended in the context of the study. That was to facilitate the division of the corpus into four different levels of target language proficiency, each of which could be analysed according to characteristics that might be said to be representative of the level concerned.

6. CSW project Phase 2 research Step 3

As well as rating all 288 scripts, the corpus analyst had written brief intuitive comments on the distinguishing performance features of each (see Table 5).

The four most common features noted in the analyst's first round of qualitative analysis were impact, fluency, organisation and accuracy.

Table 5
Summary of features and occurrences in the first analysis of 288 scripts
(number of mentions, with mentions for Bands 5 and 4 scripts (n = 161) and for Bands 3 and 2 scripts (n = 127))

Impact, 30 mentions
Positive descriptors ('impact,' 'lively,' 'powerful' ['+impact']): 30 mentions; Bands 5 and 4: 29; Bands 3 and 2: 1
Negative descriptors: NA

Fluency, 83 mentions
Positive descriptors ('fluent' ['+fluent']): 40 mentions; Bands 5 and 4: 37; Bands 3 and 2: 3
Negative descriptors ('awkward,' 'strained,' 'stiff,' 'stilted,' 'disjointed' ['-fluent'])

Organisation, 120 mentions
Positive descriptors ('well-organised' ['+organised']): 19 mentions; Bands 5 and 4: 16; Bands 3 and 2: 3
Negative descriptors ('disorganised,' 'muddled,' 'confused' ['-organised']): 101 mentions; Bands 5 and 4: 40; Bands 3 and 2: 61

Accuracy (of vocabulary, grammatical structure, etc.), 213 mentions
Positive descriptors ('accuracy,' 'accurate' ['+accurate']): 52 mentions; Bands 5 and 4: 51; Bands 3 and 2: 1
Negative descriptors ('inaccuracy,' 'inaccurate' ['-accurate']): 161 mentions; Bands 5 and 4: 51; Bands 3 and 2: 110

Impact on the reader is a feature of writing already noted (usually called "effect on the reader") in existing Cambridge ESOL rating scales (see Table 2) and inherent perhaps in the interactive nature of the writing construct as espoused by the CEF and others above. Thirty comments using the term impact itself or reflected in adjectives suggesting the making of impact on the reader (e.g., "lively," "powerful") were made, all but one applied to scripts rated at Bands 5 and 4. From this a provisional inference could be made that the ability to make an impact through writing presupposes a fairly high level of target language proficiency. The impact criterion, though clearly somewhat subjective in nature, was deemed worthy of further investigation in terms of communicative features, which might be associated with it. Impact was already seeming a rather broad concept, however, possibly overlapping fluency, a term which tended to be applied here to scripts that were also seen as making an impact on the reader.

Table 5 shows that fluency (see Section 2.4), referring to ease of use of the target language, was mentioned 83 times in the initial analysis. As a positive feature of writing performance, fluency received very significantly more mentions (37 to 3) with scripts awarded Bands 5 and 4 than with those awarded Bands 3 and 2. Descriptors characterising a lack of fluency, however, were mentioned only somewhat more frequently with reference to Bands 3 and 2 than with the Bands 4 and 5 scripts.

Organisation and cohesion (see also Section 2.4), covering the overall structure and coherence of the writing and its use of links, or cohesion devices, was mentioned 120 times, around 80% of these being negative. Whereas nearly all positive references to organisation referred to Bands 4 and 5 scripts, the negative references were more evenly shared by the higher and lower proficiency scripts, at 40% and 60%, respectively.

Accuracy (of vocabulary, grammatical structure, etc., see Section 2.4 above, and Section 7.2 below for a detailed analysis) was referred to more than any other feature in the initial comments. This prominence may have been because the chosen writing task invited a newspaper article, a discourse mode particularly liable to lose impact through errors of linguistic accuracy. But it may also be that, even in a language teaching and testing world where communicative approaches hold sway, with emphasis on message rather than form, accuracy plays a key part in the impact of communication on interlocutors. Of the 213 mentions of accuracy, 52 were positive, all but one of these referring to Bands 5 and 4 scripts. Of the more than three times as many references to inaccuracy, more than two-thirds applied to the scripts assigned Bands 2 and 3.

Despite the subjective nature, some circularity and the fact that the analysis was performed by a single individual, the findings were considered by the monitoring committee to be significant enough to inform the next research steps.

7. CSW project Phase 2 research Steps 4 and 5

All the ratings of the corpus of 288 scripts as assigned using the FCE assessment scale were now used to select four sub-corpora according to proficiency level but still regardless of examination candidacy (CPE, CAE or FCE). The first sub-corpus (n = 29 scripts) consisted of all scripts banded only at 5 on the FCE mark scheme (including scores of 5.3, 5.2 and 5.1) by all raters; the second (n = 18) of scripts banded at 4 by all raters, the third (n = 43) of those banded at 3 by all raters, and the fourth (n = 8 only) all banded at 2.
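A minimal sketch of the selection rule just described (keep a script only if every rater placed it in the same whole band), with invented script identifiers and ratings:

```python
from collections import defaultdict

# Invented data: the whole-band component of each rating a script received.
ratings = {
    "script_001": [5, 5, 5],   # unanimous Band 5 -> Band 5 sub-corpus
    "script_002": [4, 5, 4],   # mixed bands      -> excluded
    "script_003": [3, 3],      # unanimous Band 3 -> Band 3 sub-corpus
}

sub_corpora = defaultdict(list)
for script_id, bands in ratings.items():
    if len(set(bands)) == 1:          # all raters agree on the band
        sub_corpora[bands[0]].append(script_id)

print(dict(sub_corpora))              # {5: ['script_001'], 3: ['script_003']}
```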


Each of the scripts in the four sub-corpora (98 scripts, totalling around 18,000 words) was then submitted to detailed re-examination by the main researcher, involving: a re-reading; a characterisation according to features that made a favourable or less favourable impact; a count and classification of errors; and a selection of excerpts (see Table 6) considered typical of certain of the communicative characteristics of the level concerned, these excerpts to be checked with an expert panel.

The emergence from the first reading of the whole corpus of impact on the reader as a key characteristic of some of the higher-scored scripts would now be re-visited, alongside the other criteria that featured in the initial subjective analysis: fluency, organisation, and lexico-grammatical (in)accuracy. The outcome of this re-visit was a more rationalised set of features that could form the basis of a draft scale for writing: sophistication of language, accuracy, and organisation and cohesion.

7.1. Sophistication of language

It may be noted that certain of the criteria specified by Capel in her Phase 1 draft common scale for writing (see Fig. 2) are not among the emerging Phase 2 criteria, for example: operational command of written language; register and appropriacy; range of structures and vocabulary. Also missing from the developing list were Hamp-Lyons' suggested criteria (see Section 3.1): communicative effectiveness, lexical appropriacy, register control and personal stance and perspective. The absence of such criteria from the emerging labels was not, however, because the criteria concerned were not found relevant to a potential common scale for writing. Their absence as explicit, independent categories was caused by decisions on which criteria to specify separately, and which to subsume in others, given the issues of criterion definition, independence and of over-detailed or overlapping descriptions (see Alderson, 1990; CEF, 2001).

A portmanteau criterion of the nature of sophistication of language appeared to be emerging. As Table 2 indicates, a similar criterion, 'sophisticated use of an extensive range of vocabulary, collocation and expression, entirely appropriate to the task set', already appears in the general mark scheme for the CPE exam. Perhaps this criterion could be taken to subsume features such as fluency, appropriacy and register, for example, or Capel's useful Phase 1 operational command of written language, criteria notoriously difficult to distinguish from range of structures and vocabulary. It is difficult to achieve a balance between a criterion that is narrow and thus risks too discrete a rating of a communicative task, and one that is too broad, thus risking overlap with other criteria. But sophistication of language seemed to cover language use making an impact on the reader in addition to accuracy and organisation, and to offer interesting differentiation potential across proficiency levels.

The corpus analyst's own comments and examples from scripts seen as making extra impact may clarify the concept of sophistication of language.


Table 6
Descriptions and samples of 'sophistication of language'

Band 5 scripts

Script 1. Analyst comment: 'exceptionally fluent, stylish (meaning that the writer is able to adopt a particular style to increase impact), thus impactful'
Script sample: 'Just a year ago, it was easy: me and the guys on a Friday night (in fact, every night was Friday night back then), hitting rock clubs, drinking cold beer and absorbing loud live music. But it's different these days...'

Script 2. Analyst comment: 'fluent, brisk, thus makes impact'
Script sample: 'But live music has something more really. Maybe it is because of that fusion between the artist and you, the crowd effect that makes you feel things a different way, and enjoy music you would have disliked on the radio for instance.'

Script 3. Analyst comment: 'makes powerful statements'
Script sample: 'Listening to live music can be an extraordinary experience or a total fiasco... Actually I can't think of anything better than the excitement of listening to a new CD you have just acquired. Every song is something new and unknown. It's like discovering a whole new world, full of new possibilities.'

Script 4. Analyst comment: 'very atmospheric, personal and fluent, with native-speaker-like idiom and style'
Script sample: '... The voice of the fado singer. In it were all the sorrow and longing of a broken heart, in her face the traces of a life lived to the limit. Not a word did I understand but the music moved me as no music had ever done before.'

Band 4 scripts

Script 5. Analyst comment: 'effective use of rhetorical questions'
Script sample: 'Have you ever stood on a stage, singing to hundreds of people? Have you ever looked at their faces and seen the pleasure you're giving them? If you have, then you know what live music is.'

Script 6. Analyst comment: 'fluent, evocative...'
Script sample: 'Close your eyes and feel the sounds of the opera with your ears and your heart; you will not regret having done it.'

Band 3 scripts

Script 7. Analyst comment: 'fluent, some style'
Script sample: 'Live music is totally different than recorded music: it's warm, it's real, you can't cheat... You have to be yourself in front of the audience.'

Script 8. Analyst comment: 'quite hip'
Script sample: 'you cannot describe the thrill you feel listening to your favourite group live. A live concert is really exciting. You can feel a really big rush, the show can finish, but that continues for months and months.'


Table 6 gives examples from the Bands 5, 4 and 3 sub-corpora, along with the script analyst's comments at the time.

The corpus analyst commented on the following script features for their impact on the reader through sophistication of language:

• (adopting a) style;
• use of idiom and colloquial language;
• use of rhetoric (words used to influence and persuade);
• rich, lively vocabulary and collocation;
• humour and irony;
• using personal experience to enhance an argument and/or strengthen the writer:reader relationship; and
• variation of sentence and paragraph length.

These features suggest that the sophistication of language criterion may indeed subsume writing assessment criteria such as those referred to by Capel and Hamp-Lyons in Phase 1, for example: fluency, operational command of written language, appropriacy, range of structures and vocabulary, register control, and personal stance and perspective. The sophistication of language criterion is probably similar in its intention and scope to Hamp-Lyons' "communicative effectiveness," and indicates features of communication at the upper ranges of proficiency; their absence or only partial presence, however, possibly reduces communicative effectiveness at lower levels.

Since these criteria emerged from the analyses of an individual corpus analyst, they were next checked against the views of expert informants. Eight Cambridge ESOL exam chairs, chief examiners and subject officers were thus requested to comment on excerpts taken, neither labelled nor otherwise characterised, from some of the highest-graded scripts. The hypothesis was that the excerpts would be seen to display some of the characteristics proposed by the corpus analyst as exemplifying sophistication of language (Table 7).

Table 7
Expert references to features of a possible "sophistication of language" criterion in high Band 5 responses to the writing task

'Sophisticated language' features identified by Expert Group (number of mentions in brackets after each category):

Explicit reference to "sophistication" (4): "sophisticated approach to the task," "attempt at complex, sophisticated language," "fairly sophisticated ideas," "sophistication of ideas"

Reference to control of language and complexity (15): "controlled use of language" (4), "skilful control of language," "good control of structure," "complex language" (3), "complex structure" (2), "complex sentences," "compound sentence structure," "resourceful" (2)

Reference to impact on the reader, tone, feeling (29): "positive impression on the reader" (2), "giving the reader something to think about" (2), "achieving effect," "addresses reader directly," "communicates directly with the reader," "enables reader to imagine the situation," "enthusiastic tone," "lively," "persuasive," "engaging," "spontaneous," "refreshing," "individual," "intimate," "evocative" (3), "emotive" (4), "sensations," "atmospheric," "ambitious" (3), "downbeat"

Reference to naturalness of language, native-speaker competence (20): "natural flow" (3), "natural use of words/language" (2), "very natural," "colloquial" (2), "ease and familiarity," "feels real," "could be a native speaker," "could have been written by a native speaker," "almost native-like," "unnatural" (4), "awkwardness" (3)

Reference to register, genre, rhetoric (30): "good attempt at genre" (3), "uniform register," "register appropriate," "suitable style," "appropriate to the question," "inappropriate," "stylistically assured," "stylised word order" (2), "use of appropriate buzz word," "use of contrast" (3), "contrast of emotion with analysis," "juxtaposition of the poetic and prosaic," "balance," "rhetorical devices," "figurative use of words," "imagery," "good phrases," "personification," "repetition," "exaggeration," "poetic" (2), "literary," "catchy," "cool"

Reference to vocabulary (13): "range of vocabulary" (4), "good use of vocabulary" (4), "excellent vocabulary," "(in)appropriate/suitable vocabulary" (3), "competent vocabulary"

On the whole, the features identified by the expert participants in this small-scale study fit with those suggested by the corpus analyst, namely: adopting a style, use of idiom and colloquial language, use of rhetoric, rich, lively vocabulary and collocation, and the use of personal experience to enhance an argument and/or strengthen the writer:reader relationship. Use of humour and irony, and variation of sentence and paragraph length are not specifically noted by the group, the latter probably because the participants were dealing with short excerpts rather than complete responses to the writing task.

Further corroboration of the sophistication of language concept is found in the Birmingham University study (see Section 3.2). Kennedy et al. (2001) identify the following characteristics in their IELTS Band 8 (very good user) scripts:

reader awareness, including sophisticated linguistic devices to modify and qualify; more interactive, conversational register; more interpersonal, topical, textual themes; more idiomatic language; more evidence of the range of vocabulary needed to describe wider experience.

There are significantly fewer examples of such manifestations of sophisticated language at each lower level of proficiency among the 98 scripts. At Band 3, the analyst's comments frequently suggest writing proficiency between a level where a user has only one way of communicating a particular meaning, and a level where (s)he has the competence and confidence to branch out. This recalls the CEF Vantage (B2) level, described as beyond Threshold (B1) level in its refinement of functional and notional categories, with a consequent growth in the available inventory of exponents "to convey degrees of emotion and highlight the personal significance of events and experiences" (2001, p. 76).


The sophistication of language criterion also resembles quite closely these criteria from the General Mark Scheme of the C2 level CPE exam referred to above and used for the assessment of writing across a range of tasks:

• sophisticated use of extensive range of vocabulary;
• collocation and expression, entirely appropriate to the task set;
• effective use of stylistic devices; register and format wholly appropriate;
• impressive use of a wide range of structures.

The features shared by the emerging sophistication of language criterion and the other assessment criteria it may subsume begin to suggest some generalisability for the criterion. There may also be tentative implications that the four levels represented by the four sub-corpora of scripts may tend to relate to the levels of the CPE, CAE, FCE and PET exams, themselves, according to Cambridge ESOL exam information, linked to CEF Levels C2, C1, B2 and B1, respectively. The inferences are still intuitive at this stage, however.

7.2. Accuracy

The criterion of accuracy was noted as apparently significant across the four sub-corpora in CSW Research Step 3. In Step 4, accuracy errors in the sub-corpora scripts were counted by the corpus analyst, and their frequency calculated in relation to average length of response (see Table 8).

The key inference from this table is that response length and error frequency differ across the four sub-corpora. Band 2 candidates appear to write noticeably less than Band 5 candidates. Table 8 also indicates that the Bands 4 and 3 candidates fit between the two extremes in terms both of task response length and error frequency. They also make many more errors as counted (manually, as the WordSmith programme cannot yet fully identify linguistic errors) according to the lexico-grammatical error categories specified by the main researcher, namely: word choice, word order, word form (apart from verb forms), verb forms, prepositions/adverbials, number/quantity, articles, deixis, spelling and punctuation.

Table 8
Comparisons of task response length and error ranges, averages and frequencies across Bands 5, 4, 3 and 2 sub-corpora

Sub-corpus (Band) | Average length of task response (words) | Range of number of errors | Average errors | Error frequency (per no. of words)
5 | 210 | 1–11 | 5 | 1 per 42
4 | 194 | 6–18 | 11 | 1 per 17.5
3 | 188 | 11–31 | 17 | 1 per 11.1
2 | 154 | 8–39 | 24 | 1 per 6.4


Although the error analysis is subjective, it is likely that the counts of errors per script are accurate. More doubtful will be some of the counts across individual error categories, given the occasional difficulty of distinguishing between certain error types, for example, verb form and number (as in *Another person prefer recorded music).
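The arithmetic behind Table 8 is straightforward and can be illustrated with a short sketch. This is not a tool used in the study (the error counts were made manually); the per-script figures below are invented placeholders, and only the derived statistics follow the layout of Table 8.

    # Recomputes Table 8-style statistics from per-script (word_count, error_count)
    # pairs. The figures below are invented placeholders, not the CSW data.
    def band_statistics(scripts):
        """scripts: list of (word_count, error_count) pairs for one sub-corpus."""
        n = len(scripts)
        total_words = sum(words for words, _ in scripts)
        total_errors = sum(errors for _, errors in scripts)
        error_range = (min(e for _, e in scripts), max(e for _, e in scripts))
        return total_words / n, error_range, total_errors / n, total_words / total_errors

    band_sample = [(220, 4), (205, 6), (205, 5)]  # hypothetical sub-corpus of three scripts
    avg_length, err_range, avg_errors, words_per_error = band_statistics(band_sample)
    print(f"Average length of task response: {avg_length:.0f} words")
    print(f"Range of number of errors: {err_range[0]}-{err_range[1]}")
    print(f"Average errors: {avg_errors:.1f}")
    print(f"Error frequency: 1 per {words_per_error:.1f} words")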

Earlier inferences on the impact of accuracy (see Section 6) are corroborated rather than refuted by the error analyses and counts. There are indications, inviting further investigation, that:

• very frequent language inaccuracies may significantly reduce the impact of a text, lessening the reader's confidence in the writer's ability to convey the intended meaning, sometimes even obscuring that meaning;

• the negative impact of certain categories of lexico-grammatical error may be greater than that made by others;

• learners with weaker target language competence are more likely to make certain types of accuracy error.

Table 9 below suggests that, whereas the choice of the correct word is a problem at all proficiency levels in the corpus, it is four times more frequently evidenced by Band 2 candidates than Band 5. Errors with verb forms are significantly less of a problem for Bands 4 and 5 candidates, both in terms of frequency of error and in terms of rank order of error type. Analyst reaction that such errors are surprising at Band 5 appears to support a hypothesis that accuracy errors have a negative impact on the reader, some errors more than others.

7.3. Organisation and cohesion

Assessment criteria used across language exams (see Section 2.4 above) suggest that the organisation of a piece of writing affects its impact on the reader. Hence the identification on band scales (see above) of criteria such as organisation and linking devices [FCE], organisation and cohesion [CAE], organisation and coherence [CPE], organisation: clarity, intent [CELS], and coherence and cohesion [IELTS]. In the CSW study, organisation is referred to explicitly in only 7 of the 18 Band 4 sub-corpus rater analyses (five times positively and twice negatively). This suggests a generally satisfactory handling of the organisation of ideas at this level, with little of the sometimes forced and over-explicit linking mentioned in the analyses of the Band 3 scripts. Similarly with the four negative-only references to organisation and links in the eight Band 2 scripts, where the message of the text may have already been lost at the lexico-grammatical level.

The corpus analyst's identification of errors with links at the cohesion level (i.e., within or between sentences) in the four sub-corpora indicated that they are relatively common at Band 5 compared with the accuracy errors analysed in Table 8, although still at an average of only 1.34 per script. This compares interestingly with the Bands 4, 3 and 2 sub-corpora, where links would be only seventh in the accuracy rank order, averaging 1.17, 1.19 and 1.5 per script, respectively.


Table 9
Error types, occurrences and rank orders across bands

Error type | Word choice | Verb form | Preposition/adverbial | Number | Spelling | Article | Deixis | Punctuation | Word order | Word form

Band 5 (n = 29)
Totals | 28 | 14 | 22 | 8 | 17 | 14 | 1 | 5 | 7 | 1
Mean/script | 1 | 0.5 | 0.76 | 0.28 | 0.59 | 0.5 | 0.03 | 0.18 | 0.24 | 0.03
Ranks | 1 | 4= | 2 | 6 | 3 | 4= | 9= | 8 | 7 | 9=

Band 4 (n = 18)
Totals | 33 | 25 | 29 | 17 | 17 | 27 | 5 | 11 | 7 | 3
Mean/script | 1.8 | 1.4 | 1.61 | 0.94 | 0.94 | 1.5 | 0.28 | 0.61 | 0.37 | 0.17
Ranks | 1 | 4 | 2 | 5= | 5= | 3 | 9 | 7 | 8 | 10

Band 3 (n = 43)
Totals | 153 | 125 | 110 | 88 | 69 | 50 | 22 | 9 | 12 | 7
Mean/script | 3.6 | 2.91 | 2.56 | 2.05 | 1.6 | 1.16 | 0.51 | 0.21 | 0.28 | 0.16
Ranks | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 9 | 8 | 10

Band 2 (n = 8)
Totals | 32 | 28 | 22 | 19 | 19 | 14 | 8 | 6 | 4 | 2
Mean/script | 4.0 | 3.5 | 2.75 | 2.38 | 2.38 | 1.75 | 1.0 | 0.75 | 0.5 | 0.25
Ranks | 1 | 2 | 3 | 4= | 4= | 6 | 7 | 8 | 9 | 10

Mean ranks | 2 | 3 | 3.5 | 5.8 | 4.3 | 4.8 | 9.5 | 8.3 | 9.8 | 9.8


It may be that the Band 5 candidates experiment more with their intra- and inter-sentential links (the term over-ambitious is used to describe parts of seven of the 29 scripts in this sub-corpus). But the Band 5 script analysis also makes 12 positive comments on the macro-organisation (or coherence) of the scripts, compared with only three negatives, yet four positive to six negative comments at the micro- or cohesion level. One of the problems emerging on cohesion (borne out by the Kennedy et al. study) is the tendency for candidates to learn a set of link words or phrases (firstly, therefore, furthermore, etc.) and force them into their writing, sometimes incorrectly or inappropriately, risking a negative impression on the reader.

A portmanteau criterion such as organisation and cohesion seems to warrant inclusion in a common scale band set. To separate organisation from links, including the latter among errors of accuracy, seems counter-intuitive.

8. CSW project Phase 2 research Step 6

At this stage of the study, a draft working scale for writing was emerging, using three criteria: sophistication of language, accuracy, and organisation and cohesion. Since the scale was being derived from a close study of written task responses from candidates assessed on multiple ratings as at four different levels of proficiency, the draft scale has the potential width of a common scale (see Fig. 1), although, of course, the responses were all to the same task.

The computerised re-analysis of certain aspects of the sub-corpora was now initiated by the co-researcher (also the co-writer of this article). This re-analysis sought evidence that might support or undermine the script analyst's findings in areas where computer corpus analysis applied, and aimed to add to the coverage of the study where the computer analysis could do things that the manual analysis could not.

The methodology was informed by that of Kennedy, Dudley-Evans and Thorp (2001), summarised above. The 98 scripts in the four sub-corpora were keyed in to Microsoft Word to provide a text file for each candidate's script. The original layout of the script was kept in the typed-up version, although corrections and crossings-out were not included. Because these text files were to be analysed in various ways using WordSmith Tools software, each was re-saved as a corrected version in which spelling errors, where possible, were corrected. This was done to enable WordSmith to produce accurate wordlists based on individual or sets of text files without incorrectly spelt words misrepresenting the vocabulary range of a candidate.

It was envisaged that the WordSmith Tools software would be used to investigate:

• whole script, sentence and paragraph lengths;
• title use;
• vocabulary range;
• words in concordances and collocations; and
• errors.

Table 10
Sub-corpora mean sentence and paragraph lengths

Bands | Mean sentence length (words) | Mean paragraph length (words)
5 | 19.5 | 61.4
4 | 19.1 | 48.9
3 | 17.3 | 88.6
2 | 13.3 | 33.1

The first analysis corroborated evidence from the manual analysis of a correlation between response length and proficiency level (see Table 8). It then revealed that, while sentences tended to be longer (though not significantly so) in the Bands 5 and 4 sub-corpora, a hypothesis of paragraph length as a distinguishing feature of proficiency level was, contrary to indications in the Birmingham study, not corroborated (see Table 10).
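For readers wishing to replicate this kind of length analysis outside WordSmith Tools, a minimal sketch is given below. It assumes one plain-text file per script with blank lines separating paragraphs; the sentence-splitting and tokenisation rules are crude approximations rather than those used by WordSmith, and the folder name is hypothetical.

    # Rough re-implementation of the length measures reported in Table 10.
    # Assumes one plain-text file per script, with paragraphs separated by blank lines.
    import re
    from pathlib import Path

    def script_means(text):
        paragraphs = [p for p in re.split(r"\n\s*\n", text) if p.strip()]
        sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]  # crude splitter
        words = re.findall(r"[A-Za-z']+", text)
        mean_sentence = len(words) / len(sentences) if sentences else 0.0
        mean_paragraph = len(words) / len(paragraphs) if paragraphs else 0.0
        return mean_sentence, mean_paragraph

    def sub_corpus_means(folder):
        """Average the per-script means over all .txt files in one band's folder."""
        per_script = [script_means(p.read_text(encoding="utf-8"))
                      for p in Path(folder).glob("*.txt")]
        sentences = sum(s for s, _ in per_script) / len(per_script)
        paragraphs = sum(p for _, p in per_script) / len(per_script)
        return sentences, paragraphs

    # Hypothetical folder layout, one directory of text files per band sub-corpus:
    # print(sub_corpus_means("csw_corpus/band5"))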

Further manual analysis is suggested to check variations of sentence length in sequence, an element of sophisticated language apparently used for effect by some of the CSW Band 5 writers. The WordSmith software cannot yet handle such an analysis.

The use or not of a title in the task response was found to vary according to level (see Table 11), CSW Band 5 candidates being more likely, at 48%, to use a title than Bands 3, 4 or 2 candidates. Given that the task discourse type, a newspaper article, invites the inclusion of a title, its presence or absence is relevant to task fulfilment. It may also be that titling should be considered as an aspect of the organisation criterion (see above).

The computer analysis of the range of vocabulary of each sub-corpus supports inferences from the manual analysis. Table 12 illustrates the vocabulary range in each sub-corpus in terms of the normal and standardised type:token ratios, and the average number of different types found at each level. We would anticipate that higher-level scripts would display a greater range of vocabulary, i.e., a higher type:token ratio, than lower-level scripts.

Table 11
Use of a title in candidate scripts

Bands | Title | % | No title | %
5 | 14 | 48 | 15 | 52
4 | 4 | 22 | 14 | 78
3 | 14 | 33 | 29 | 67
2 | 1 | 13 | 7 | 87


Table 12
Normal and standardised type:token ratios

Band | Tokens | Types | Type:token ratio | Standardised type:token ratio | No. of scripts | Av. types per script
5 | 6130 | 1191 | 19.43 | 38.97 | 29 | 41
4 | 3112 | 619 | 19.89 | 32.77 | 18 | 34
3 | 7999 | 1116 | 13.95 | 32.77 | 43 | 26
2 | 1206 | 342 | 28.36 | 30.10 | 8 | 43

The type:token ratio column expresses the number of different words in each sub-corpus as a percentage of the total number of words in that sub-corpus. The normal type:token ratios are indeed higher at Bands 4 and 5, indicating that more lexical items are used at these proficiency levels. The normal type:token ratio for the lower-proficiency Band 2 scripts, however, appears high, an anomaly probably accounted for by the small number of Band 2 scripts in this sub-corpus (8). The standardised type:token ratio measure, computed every n words rather than once for the whole text (the default is every 1000 words), permits a comparison across texts of different lengths and is thus a more appropriate measure for analysing the CSW data. This measure confirms an increased vocabulary range as proficiency levels increase, from 30 new words in every 100 at Band 2 level, to 33 new words at Bands 3 and 4, and 39 words at Band 5 level. Range of vocabulary is thus possibly a feature distinguishing proficiency levels.
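The two ratios can be approximated in a few lines of code. The sketch below follows the usual definition of a standardised type:token ratio, the mean of the TTRs computed over successive 1000-word chunks; WordSmith's own implementation may differ in detail, so its output would not necessarily match Table 12 exactly.

    # Normal vs standardised type:token ratios, expressed as percentages.
    # Illustrative re-implementation; not the WordSmith Tools algorithm itself.
    import re

    def tokenize(text):
        return re.findall(r"[a-z']+", text.lower())

    def ttr(tokens):
        """Types as a percentage of tokens."""
        return 100 * len(set(tokens)) / len(tokens) if tokens else 0.0

    def standardised_ttr(tokens, chunk=1000):
        """Mean TTR over successive full chunks of `chunk` tokens."""
        chunks = [tokens[i:i + chunk] for i in range(0, len(tokens) - chunk + 1, chunk)]
        if not chunks:  # text shorter than one chunk: fall back to the plain TTR
            return ttr(tokens)
        return sum(ttr(c) for c in chunks) / len(chunks)

    # Example (hypothetical variable holding one band sub-corpus as a single string):
    # tokens = tokenize(sub_corpus_text)
    # print(round(ttr(tokens), 2), round(standardised_ttr(tokens), 2))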

Figures on words occurring only once were obtained by comparing the frequency wordlists produced by WordSmith Tools for each level, as a possible measure of the use of less frequent vocabulary items.
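A frequency wordlist of the kind WordSmith produces is easy to approximate, and the single-occurrence counts then fall out directly, both as raw numbers and as a percentage of types (cf. Table 13). The sketch below is illustrative only; its tokenisation will not match WordSmith's exactly.

    # Count single-occurrence words (hapax legomena) in a sub-corpus, both as a raw
    # number and as a percentage of types (cf. Table 13).
    import re
    from collections import Counter

    def single_occurrence_stats(text):
        freq = Counter(re.findall(r"[a-z']+", text.lower()))  # simple frequency wordlist
        hapaxes = sum(1 for count in freq.values() if count == 1)
        pct_of_types = 100 * hapaxes / len(freq) if freq else 0.0
        return hapaxes, pct_of_types

    # Example (hypothetical variable holding one band sub-corpus as a single string):
    # count, pct = single_occurrence_stats(sub_corpus_text)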

Table 13 suggests that, while candidates at all levels produce some less usual vocabulary items, candidates at higher levels may produce more one-off words. While this could be an indication of the command of a richer vocabulary, an aspect of our criterion of sophistication of language, the percentage differences in Table 13 are not significant, with the small Band 2 sample appearing once again to contradict the trend.

The lengths of words in each sub-corpus were calculated using the Statistics table in the Wordlists application in WordSmith Tools, to ascertain whether word length correlated with proficiency level. The percentages of different word lengths found in each sub-corpus were, however, almost identical (Table 14).

Table 13
Single-occurrence words

Band | No. | % of types
5 | 655 | 55
4 | 317 | 51
3 | 575 | 52
2 | 198 | 58


Table 14
Percentages of words with 1–11 letters

Word length (letters) | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11
Band 5 (%) | 5 | 19 | 20 | 18 | 13 | 7 | 7 | 4 | 4 | 2 | 1
Cumulative % | | 24 | 44 | 62 | 75 | 82 | 89 | 93 | 97 | 99 | 100
Band 4 (%) | 5 | 19 | 20 | 21 | 13 | 7 | 5 | 4 | 3 | 1 | 0
Cumulative % | | 24 | 44 | 65 | 78 | 85 | 89 | 94 | 97 | 98 |
Band 3 (%) | 5 | 19 | 19 | 20 | 13 | 7 | 6 | 4 | 3 | 1 | 0
Cumulative % | | 24 | 43 | 63 | 76 | 83 | 89 | 93 | 96 | 97 |
Band 2 (%) | 4 | 22 | 18 | 18 | 14 | 9 | 3 | 5 | 3 | 2 | 1
Cumulative % | | 26 | 44 | 62 | 76 | 85 | 88 | 93 | 96 | 98 | 99
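Percentages such as those in Table 14 come straight from the length distribution of the running words. The following sketch is again only an approximation of WordSmith's Statistics output; lumping words longer than 11 letters into the top bin is a simplification made here for illustration.

    # Percentage of running words at each length in letters (cf. Table 14).
    # Illustrative approximation of WordSmith's Statistics output.
    import re
    from collections import Counter

    def word_length_percentages(text, max_len=11):
        words = re.findall(r"[a-z]+", text.lower())
        if not words:
            return {}
        # Simplification for illustration: words longer than max_len go into the top bin.
        lengths = Counter(min(len(word), max_len) for word in words)
        return {n: round(100 * lengths[n] / len(words)) for n in range(1, max_len + 1)}

    # Example (hypothetical variable holding one band sub-corpus as a single string):
    # print(word_length_percentages(sub_corpus_text))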

The finding that the difference in word lengths across levels is not significant somewhat contradicts the Birmingham study, which notes significantly more words longer than 12 letters in its IELTS Band 8 (i.e., "very good user") scripts. These differences may well relate to the difference between the FCE exam prompt used in this study and typical IELTS test tasks. The single task to which all the candidates in our study responded was, it will be recalled, on a relatively straightforward, everyday-view journalistic topic. The writing corpora used by Kennedy et al., however, will have been responses to more academic questions, requiring a more formal language register and inviting, perhaps, the use of some longer words.

In the Kennedy et al. study, the collocates of "I" and "it" were identified as possible indications of the organisation of task response, and of impact on the reader. A new analysis procedure was thus trialled on the CSW sub-corpora, using concordances and collocational information to investigate words occurring one place to the right of "I" (its collocates). Results showed a range of collocates, most of which, unsurprisingly, are verb forms.

Some verbs occur at all levels (e.g., prefer and think) whilst others occur at some levels only. The data here appear inconclusive, however, apart, perhaps, from an apparent stronger inclination for candidates writing at higher levels to use the first person "I" in their responses. This could be an aspect of the use of personal experience to enhance a general argument and/or strengthen the writer:reader relationship, seen as part of sophistication of language in Section 7.1. Collocational analysis certainly has possibilities for future analyses of writing corpora.
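The collocation analysis itself can be approximated with a simple concordance-style pass over the tokens, collecting whatever word occurs one place to the right of each occurrence of "I". The sketch below illustrates the idea; it is not the WordSmith Concord tool used in the study.

    # Words occurring one place to the right of "I" (its R1 collocates), ranked by
    # frequency. Illustrative sketch only, not the WordSmith Concord tool.
    import re
    from collections import Counter

    def r1_collocates(text, node="i"):
        tokens = re.findall(r"[a-z']+", text.lower())
        following = Counter(
            tokens[i + 1]
            for i, token in enumerate(tokens[:-1])
            if token == node
        )
        return following.most_common()

    # Example (hypothetical variable holding one band sub-corpus as a single string):
    # for word, freq in r1_collocates(sub_corpus_text):
    #     print(word, freq)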

The final analysis using WordSmith aimed to investigate errors at each level, based on the wordlists produced from the original text files (note that the corrected text files were used for all other analyses). These wordlists were saved in an Excel spreadsheet and the Excel spell-checker was used to identify incorrectly spelt items in these lists. This analysis was intended as a check on the manual error analysis (see Section 7 above). But lexico-grammatical errors could not be easily analysed using the WordSmith Tools, so were not re-checked at this stage in the study.


In fact, WordSmith Tools software can be used to identify some lexico-grammatical errors, but only if the texts concerned have been pre-coded for parts of speech.
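The spell-check step can be mimicked by comparing a wordlist built from the original (uncorrected) text files against a reference list of known words. In the sketch below, english_words.txt is a hypothetical dictionary file; the study itself used the Excel spell-checker, so results would differ, and items flagged by either method still need manual review, since proper nouns and rare words are not necessarily errors.

    # Flag wordlist items that are absent from a reference dictionary as possible
    # misspellings. Illustrative only: the study used the Excel spell-checker, and
    # "english_words.txt" is a hypothetical reference wordlist, not a file from the study.
    import re
    from collections import Counter

    def possible_misspellings(text, dictionary_path="english_words.txt"):
        with open(dictionary_path, encoding="utf-8") as f:
            known = {line.strip().lower() for line in f if line.strip()}
        freq = Counter(re.findall(r"[a-z']+", text.lower()))
        # Items not in the reference list still need manual review: proper nouns,
        # contractions and rare words are not necessarily spelling errors.
        return sorted((word, count) for word, count in freq.items() if word not in known)

    # Example (hypothetical variable holding one sub-corpus of uncorrected scripts):
    # for word, count in possible_misspellings(original_sub_corpus_text):
    #     print(word, count)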

The production of concordances for specific words or phrases in the corpus was also trialled during the computerised analysis. This remains an avenue for future research when more detailed analysis of sophisticated language, accuracy or organisation is called for on new corpora of examination scripts.

The computerised corpus analyses of the CSW scripts, which facilitated both cross-checking and original analyses, proved revealing and helpful. On matters where the manual analysis could be replicated by the computer analysis, most of the script analyst's findings were corroborated. Additional features were also indicated as significant by the computer corpus analysis, namely titling, greater word length and vocabulary range across levels.

9. CSW project Phase 2 research Step 7

9.1. The draft common scale for writing

Three criteria, sophistication of language, organisation and cohesion, and accuracy, have thus been identified by multiple ratings of sub-corpora of scripts for four levels. Qualitative analyses of these scripts have identified and exemplified features consistently characterising their proficiency levels.

Collocates occurring one place to the right of "I" in each sub-corpus (see Section 8):

Level 5: prefer, have, listen, like, think, always, don't, want, enjoy, had, would, can, can't, find, never, went, really
Level 4: am, prefer, think, like, can, have, was, don't, feel, would, enjoy, love, want, had, buy, choose
Level 3: listen, think, like, prefer, have
Level 2: listen, prefer, think, would


Some of these features have been checked with expert informants, and through computer corpus analyses, which had also directed attention to other features, some already suggested by the manual analyses, some new.

A scale of level descriptions for four bands of proficiency in writing could now be drafted using insights from the study and from previous related scales. The draft scale would initially, since it had been derived through assessor-oriented use, be worded in negative as well as positive can-do terms (see the CEF distinction between positive and negative formulation, CEF, 2001, p. 205).

The draft scale attempts to avoid the complexity and length against which Alderson (1990), Porter (1991) and North (2000) warn (see Section 2.2). The problem of distinguishing one band from the next only by the use of distinctions between always, usually, sometimes and occasionally (also see Alderson, 1990 in Section 2.2) is not entirely solved, but such descriptors are as far as possible combined with other distinguishing criterial features. The refining of the descriptors and the broadening of their generalisability are being attempted through their application to candidate writing corpora from other Cambridge ESOL exams in response to a range of tasks at various levels (see conclusions below).

9.2. Number of levels

One of the key considerations in the imminent comparisons between the draft descriptors and other scales must be the number of levels to be covered. The four levels so far identified from the four sub-corpora of scripts might represent a first step towards a matching with CEF Levels A2 to C2, although these, with their aim of avoiding negative references, sometimes appear to indicate a rather higher performance level in some aspects. Fig. 4 gives examples from the CEF "analysis of functions, notions, grammar and vocabulary necessary to perform the communicative tasks described on the scales" (2001, p. 33) at four levels of proficiency.

Fig. 4. CEF B1 to C2 level description extracts.


Fig. 3. Draft four-level scale for writing.

There are certainly similarities of features here with the draft scale in Fig. 3 above, indicating that the draft scale may indeed have possibilities for refinement into a common scale. But the four corpora derived from our analyses do not include a level representing CEF A2 (Waystage). This level is confined to communication such as "short, basic descriptions of events and activities" (2001, p. 34), and should thus be investigated through the analysis of scripts written in response to a briefer, simpler communicative writing task than the one performed by the corpus of 288. Below this A2 level of proficiency would be a level similar to CEF Level A1 (Breakthrough), "the lowest level of generative language use" (2001, p. 33).

It is likely that the common scale will eventually have six levels, but empirical research is still required on the two lower levels.

10. Conclusions and further research

This study has analysed an extensive corpus of scripts written by candidates at three exam levels, FCE, CAE and CPE (offered by Cambridge ESOL for certification at CEF B2, C1 and C2 levels, respectively), in response to a single communicative task.


The two-stage qualitative analyses, carried out by one researcher but with some consultation with expert opinion and regular feedback from the Cambridge ESOL Writing Steering Group, have been supported where feasible by computer corpus analyses conducted by the second researcher and co-writer of this article.

Features derived from the analyses of the corpus have been incorporated in a draft four-level assessor-oriented scale based on three criteria: sophistication of language, organisation and links, and accuracy.

This draft scale is being used for further research designed to increase its generalisability for potential use in a common scale for writing. The draft scale is being applied to corpora of IELTS scripts with previous ratings from Bands 3 to 9, and to Business English Certificates (BEC) and Certificates in English Language Skills (CELS) exam scripts at Preliminary, Vantage and Higher levels. With each of these three corpora the scripts concerned cover a wide range of writing tasks. Results so far indicate that the draft scale derived from the study described in this paper does have generalisability across exams and writing tasks and will be useful in helping to specify relationships between proficiency levels measured by Cambridge ESOL Main Suite, Business English and IELTS test bands.

On a broader research front, the study would seem to offer useful insights into the writing construct, with the identification and development of criteria which are relevant to the communicative language testing construct, and which should be useful for the assessment of writing beyond Cambridge ESOL exams. Methodologically, the study appears to support the use of learner corpora in the investigation of target language proficiency levels, and the use of a combination of qualitative and computer-linguistic analytic approaches, including those starting from tabula rasa and analyst intuition, though checked through expert opinion and reference to existing criteria and scales.

Acknowledgements

We would like to acknowledge the involvement of Nick Saville, Janet Bojan, Annette Capel and Liz Hamp-Lyons in Phase 1 of the CSW Project, and Cambridge ESOL team members Chris Banks, Neil Jones, Tony Green, Nick Saville, Stuart Shaw, Lynda Taylor and Beth Weighill for their work on Phase 2. Thanks also to Cyril Weir for comments on an early draft of the paper.

References

Alderson, C. (1990). Bands and scores. In: C. Alderson & B. North (Eds.), Language testing in the 1990s (pp. 71–86). London: Modern English Publications and the British Council.

Aston, G., & Burnard, L. (1998). The BNC handbook: Exploring the British National Corpus with SARA. Edinburgh, UK: Edinburgh University Press.

Bachman, L. (1990). Fundamental considerations in language testing. Oxford: Oxford University Press.


Bachman, L. (1991). What does language testing have to offer? TESOL Quarterly, 25(4), 671–704.

Bachman, L. (2002). Some reflections on task-based language performance assessment. Language Testing, 19(4), 453–476.

Bachman, L., & Palmer, A. (1996). Language testing in practice: Designing and developing useful language tests. Oxford: Oxford University Press.

Ball, F. (2001). Using corpora in language testing. In: Research notes (Vol. 6). Cambridge: Cambridge ESOL.

Ball, F. (2002). Developing wordlists for BEC. In: Research notes (Vol. 8). Cambridge: Cambridge ESOL.

Ball, F., & Wilson, J. (2002). Research projects related to YLE speaking tests. In: Research notes (Vol. 7). Cambridge: Cambridge ESOL.

Biber, D., Conrad, S., & Reppen, R. (1998). Corpus linguistics: Investigating language structure and use. Cambridge: Cambridge University Press.

Boyle, A., & Booth, D. (2000). The UCLES/CUP learner corpus. In: Research notes (Vol. 1). Cambridge: Cambridge ESOL.

Council of Europe. (2001). Common European framework of reference for languages: Learning, teaching, assessment. Cambridge: Cambridge University Press.

Cumming, A. (1998). Theoretical perspectives on writing. Annual Review of Applied Linguistics, 18, 61–78.

Fulcher, G. (1996). Testing tasks: Issues in task design and the group oral. Language Testing, 13(1).

Fulcher, G. (2003). Testing second language speaking. London: Pearson Longman.

Granger, S., & Rayson, P. (1998). Automatic profiling of learner texts. In: S. Granger (Ed.), Learner English on computer (pp. 119–131). London: Longman.

Hamp-Lyons, L. (1990). Second language writing: Assessment issues. In: B. Kroll (Ed.), Second language writing assessment issues and options. New York: Macmillan.

Hamp-Lyons, L. (1995). Summary report on Writing Meta-Scale Project (UCLES EFL Internal Report).

Hatch, E., & Lazaraton, A. (1991). The research manual: Design and statistics for applied linguistics. Boston: Heinle and Heinle.

Hymes, D. (1971). On communicative competence. Philadelphia, PA: University of Philadelphia Press.

Kennedy, C., Dudley-Evans, T., & Thorp, D. (2001). Investigation of linguistic output of academic writing Task 2 (British Council-funded IELTS Research Project 1999–2000, Final Report).

Leech, G. (1998). Preface. In: S. Granger (Ed.), Learner English on computer (pp. XIV–XX). London: Addison Wesley Longman Limited.

McEnery, T., & Wilson, A. (1996). Corpus linguistics (2nd ed.). Edinburgh: Edinburgh University Press.

Milanovic, M., Saville, N., & Shen, S. (1992). Studies on direct assessment of writing and speaking (UCLES EFL Internal Report).

Morrow, K. (1979). Communicative language testing: Revolution or evolution? In: C. Brumfit & K. Johnson (Eds.), The communicative approach to language teaching. Oxford: Oxford University Press.

Morrow, K. (1990). Evaluating communicative tests. In: S. Anivan (Ed.), Current developments in language testing. Singapore: SEAMEO Regional Centre.

Munby, J. (1978). Communicative syllabus design. Cambridge: Cambridge University Press.

North, B. (2000). Linking language assessments: An example in the low stakes context. System, 28, 555–577.

Porter, D. (1990). Affective factors in the assessment of oral interaction. In: S. Anivan (Ed.), Current developments in language testing. Singapore: SEAMEO Regional Language Centre.

Saville, N. (2003). The process of test development and revision within UCLES EFL. In: C. J. Weir & M. Milanovic (Eds.), Continuity and innovation: Revising the Cambridge Proficiency in English examination 1913–2002. Cambridge: Cambridge University Press.

Saville, N., & Capel, A. (1996). Common scale writing (Interim Project Report: UCLES EFL).


Saville, N., & Hawkey, R. (2004). The IELTS impact study: Investigating washback on teaching materials. In: L. Cheng & Y. Watanabe (Eds.), Washback in language testing: Research contexts and methods. New Jersey: Lawrence Erlbaum Associates Inc.

Scott, M. (2002). WordSmith Tools version 3. Oxford: Oxford University Press. Available at: http://www.lexically.net/wordsmith/version3/index.html.

Spolsky, B. (1995). Measured words. Oxford: Oxford University Press.

Turner, J., & Upshur, C. (1996, August). Scale development factors as factors of test method. Paper presented at the 18th Language Testing Research Colloquium, Tampere, Finland.

Upshur, J., & Turner, C. (1995). Constructing rating scales for second language tests. English Language Teaching Journal, 49(1), 3–12.

Upshur, J., & Turner, C. (1999). Systematic effects in the rating of second language speaking ability: Test method and learner discourse. Language Testing, 16(1), 82–111.

Weir, C. J. (1993). Communicative language testing. New York: Prentice-Hall.

Weir, C. J., & Milanovic, M. (Eds.). (2003). Continuity and innovation: Revising the Cambridge Proficiency in English examination 1913–2002. Cambridge: Cambridge University Press.