
Studies in Higher Education, Vol. 36, No. 3, May 2011, 251–273

ISSN 0307-5079 print/ISSN 1470-174X online
© 2011 Society for Research into Higher Education
DOI: 10.1080/03075070903545082
http://www.informaworld.com

Summative assessment: dealing with the ‘measurement fallacy’

Mantz Yorke*

Department of Educational Research, Lancaster University, Lancaster, UK
*Email: [email protected]

Much grading of student work is based, overtly or tacitly, on assumptions derived from scientific measurement. However, the practice of grading and the cumulation of grades into an overall index of achievement are socially constructed activities that fall a long way short of what is expected of scientific measurement. If scientific measurement is an unattainable ideal, how should the summative assessment of student achievement be approached? The case is argued that the professional judgement of assessors has to be given prominence, and that this implies a sustained commitment to developmental work at institutional and sectoral levels. Some suggestions along these lines are outlined.

Keywords: academic achievement; grading; academic success; assessment; measurement

Signals and noise

The grading of student work is of obvious importance for students and others, yet it is an activity that is compromised by flaws of various kinds. This article argues that grading tends to be treated inappropriately, and often implicitly, as an act akin to measurement, and that assessment practices at different levels reflect this. Grading is quasi-measurement (since grades do not possess the characteristics of true measures), and is affected by a number of socially-driven factors whose influence is generally not well understood. The very strong and the very weak student will stand out irrespective of the grading system being used, because the signal is strong enough to overpower the ‘noise’ that is inherent in the practice of grading. It is for the bulk of the student population, where differences between individuals tend to be small but can have a large impact on opportunities, that uncertainty in grading is of the greatest significance.

Grades are signals or indicators of performance, and have value as such – but even then the signals can be quite badly compromised by ‘noise’ from various sources. As an analogy from astronomy, the belief that canals existed on Mars was fostered by various intrusions of ‘noise’ into the light reflected from the planet’s surface: telescope optics at the time were relatively rudimentary; the turbulence in the Earth’s atmosphere affected the ‘seeing’; and there were those who wanted to believe in life beyond the Earth. Grading is similarly affected by a host of factors that seriously compromise many of the practices that are involved in assessing summatively, and reporting the outcomes to interested parties. If grades are fallible, what can be done to improve matters?

It is proposed that, if broad categories are used as the basis of grading, then grading becomes more overtly judgemental and less a matter of measurement. Assessment becomes more a social science activity and less one driven by attempts to measure scientifically. How judgements are made, and how assessors can develop their expertise, are then issues that have to be addressed.

Do the flaws really matter?

Some might wish to argue that, although assessment practices have their flaws, the practices function well enough for the purposes to which they are put, so why rock the boat? This article takes the opposite view and argues that the flaws really do matter, for a variety of reasons that emerge at various points in the text. Three are picked out here which have particular resonance for higher education in the UK (elsewhere, other considerations may have greater salience).

(1) Politically, there is a growing concern about the appropriateness of the honours degree classification as an index of a student’s achievement, as is evident from a number of exchanges in the report Students and universities published by the Innovation, Universities, Science and Skills Committee of the House of Commons (2009). The classification depends on grades and the way that these are brought together. The emerging political requirement is that something should be done to improve matters. Mitigating the flaws in grading (elimination of them is probably too ambitious) would be an important contribution.

(2) There is a disparity in the profiles of honours degree classifications in different disciplines in the UK. This makes it difficult for a user of the classification as an overall index of attainment, since they cannot be sure that the standard of a first-class honours degree is equivalent in, say, engineering, business studies, nursing and graphic design.

(3) Reasons 1 and 2 (and others) provide higher education institutions (and the sector as a whole) with a prod to do what they should in any event be doing, because of the need to be able to justify any claim that they are adopting assessment practices that are as good as can reasonably be expected, given the resources available.

World views

Two major world views are in competition where grading is concerned – the realist and the relativist. The former broadly characterises assessment in science-based subjects; the latter is more associated with the arts, humanities and social sciences. However, the realist and relativist views are Weberian ‘ideal types’ which do not exist in a pure form, since each is in practice inflected by the other – and in any case some subjects (such as economics and nursing) combine elements of both science and the social sciences. Similar dichotomisations appear elsewhere in the assessment literature: for example, Hager and Butler (1996) drew a distinction between a scientific measurement model of assessment and a model in which judgement dominated, and Elton and Johnston (2002) differentiated positivist from interpretivist approaches.

Whatever the particular dichotomisation preferred, its ideal types are useful in that they offer points of reference for discussion about assessment. Table 1, developed from Yorke (2008a, 27), summarises the distinction between the realist and relativist perspectives.


Table 1 gives an impression of even-handedness, though the evidence presented in this article comes down heavily on the side of assessment as an exercise in interpretation and judgement (i.e. an activity based in social science), rather than as the exercise of scientific measurement. The argument for interpretation and judgement is at its strongest where the ‘situatedness’ of the performance has to be taken into account. Govaerts et al. (2007), writing with specific reference to in-training assessment in medicine, criticise the psychometric approach to assessment, arguing that it does not take sufficient account of the context of medical workplaces. They argue for a social psychological approach to performance assessment in which an acknowledgement of context plays a part.

Table 1. Assessment, as viewed from realist and relativist standpoints.

| Realist | Relativist | Comment |
| --- | --- | --- |
| Standards are objectively defined | Standards are normative and consensual | The realist’s objectivity reflects underlying values |
| Performances can be measured against these standards | Performances are assessed with reference to these standards | The difference here is between measurement and judgement |
| The assessor is objective and detached | The assessor interprets the extent to which the performance relates to the standards | The underlying distinction is between context-free assessment (the realist’s position) and context-relevant assessment (the relativist’s position) |
| Values play no part in the assessment | Value positions are embedded in the norming of standards | |
| The situation of the student is not taken into account | The context of the assessment is taken into account | |
| Explicit criteria and rubrics are invoked | There are broad statements of expectations | The issue is the degree of detail that can be encompassed by criteria |
| Measurements are taken as true and reliable representations of achievement | Assessments are judgements of the extent to which achievements fit with expectations | The distinction is between measurement and judgement |
| Tasks are set by assessors | Tasks may be selected by students to suit their strengths and interests | The selection of project topics is a feature of both the sciences and the social sciences |

A belief in assessors’ capacity to measure student achievements implies that the assessor can utilise the kind of measurement that is typical of science (say, with respect to mass, length and time), in which numerical measurement has over centuries demonstrated its power. This belief might be termed the ‘measurement fallacy’. Attempts to measure student achievement are scientistic rather than scientific (save perhaps in some relatively trivial aspects of performance, such as routine calculations). The discussion that follows argues (not for the first time – Bloxham [2009] and others have done so) that the assessment of student achievement is so compromised by factors, ranging from the capacities of the individual assessor to the institutional procedures and practices relating to assessment, that any pretensions to measurement (such as those implicit in the computation of grade-point averages and honours degree classifications) should be abandoned.

It is easy to knock down the sandcastle of ‘grading as measurement’. The greater challenge is to come up with something that has more secure foundations, and that is robust enough for practical purposes. The article concludes by sketching out what this might involve.

Why ‘measurement’ is inappropriate as a concept

The ‘measurement’ approach assumes there is a scale against which performances can be placed. A quantitative scale must embody both order and additivity. There is generally relatively little difficulty in ordering students’ performances along a dimension such as strong–weak. It is the requirement of additivity that is the main problem: one cannot assert with any confidence that a mark of 60 indicates a performance that is 1.5 times as good as one awarded 40, and so on. If the grading scale is not quantitative in this way, then standard arithmetical operations on the grades are illegitimate (Dalziel 1998).
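Dalziel’s point can be made concrete with a small, invented example: if marks carry only ordinal information, any order-preserving rescaling of the scale is equally defensible, yet such a rescaling can reverse a comparison of averages.

```python
# A small, invented example (not data from the article): an order-preserving
# rescaling of the marks, which leaves their ranking untouched, can reverse
# which student has the higher average.

def rescale(mark):
    """Monotone transform: identical below 70, compressed above 70."""
    return mark if mark <= 70 else 70 + (mark - 70) * 0.2

student_a = [40, 80]   # uneven performer
student_b = [60, 58]   # consistent performer

def mean(marks):
    return sum(marks) / len(marks)

print(mean(student_a), mean(student_b))        # 60.0 59.0 -> A ahead
print(mean([rescale(m) for m in student_a]),
      mean([rescale(m) for m in student_b]))   # 56.0 59.0 -> B ahead
```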

A measurement scale should incorporate all of the achievement that is to be expected of the student in respect of what is to be assessed (and hence will not cater for the solving of unbounded problems, such as those demanding creativity). It should be context-free. Lastly, the assumption that a grade is a measure carries with it further assumptions about the grade’s psychometric properties, such as validity, reliability and so on.

To add to the challenges, measurement is inappropriate as an undergirding concept in assessment because of the impact of a host of interconnected variables, which include the following (the interconnections are to some extent hidden by treating them separately):

● the assessor’s approach to grading;
● norm- versus criterion-referencing;
● subject or study unit norms;
● psychosocial pressures;
● assessment criteria;
● the assessment of complex achievements;
● assessment regulations;
● grading scales and the combination of grades;
● grade boundaries.

The earlier part of the list is dominated by aspects of assessment that have a particular relevance for assessors as individuals, whereas the latter part reflects primarily institutional and sectoral issues (for a more detailed consideration than is possible here, see Yorke 2008a).

The assessor’s approach to grading

There is a consensus amongst psychologists who study the making of judgements that there are basically two information processing mechanisms at work in the brain. One mechanism is rule-governed and deals sequentially with components of the matter at hand; the other is more intuitive and associative, and is much faster because it involves parallel processing (Kahneman and Frederick 2002) and the adoption of heuristics (Stanovich and West 2002). These mechanisms find practical representation in the ways in which assessors go about their tasks – judging serially against a set of criteria and reaching an overall judgement by combining in some way the component judgements, or judging holistically. In practice, the distinction is likely to be less clear-cut than this.

An analytical or ‘menu-marking’ approach (Hornby 2003) involves the allocation of marks against each criterion, with the overall mark being determined by simple summation. In its ideal form, the assessor works through the student’s work a number of times, concentrating on each assessment criterion in turn. Practical considerations render unlikely such a level of diligence in marking. In reality, the assessor is likely to be attending to multiple criteria simultaneously, which places considerable demands on the cognitive system. In a well-known article, Miller (1956) suggested that humans’ capacity for simultaneously processing ‘chunks’ of information was limited to around seven chunks, plus or minus two. A review of the literature by Cowan (2000) suggests that Miller might have erred on the side of optimism, and that the figure could be closer to four. The processing of chunks is probably not the same as assessing against criteria, but the work of Miller and Cowan points to the need for conservatism regarding what can be realistically expected of assessors.

The analytical approach can be construed as a weak form of a measurement-oriented approach to assessment. As Sadler (2009a) and others have pointed out, a problem that surfaces from time to time with the menu-marking approach is that the overall mark is discrepant from the marker’s more holistic appreciation of the essay’s merits. The whole may be more, or it may be less, than the sum of the parts. In psychological terms, the rule-governed and holistic processing mechanisms can come up with different answers. A perceived discrepancy may encourage a marker to find a way of overriding the mechanism of the menu-marking, and instead submitting a mark that they believe is more appropriate (see Grainger, Purnell, and Zipf 2008, 135; and, for a study that provided empirical evidence of overriding, see Baume and Yorke, with Coffey 2004).

Sadler (2009a) also notes that the menu may omit features that the assessor wishes to value. Walvoord and Anderson (1998, 99) suggest a ‘fudge factor’ which can be invoked as an ‘extra’ in order to accommodate components of the achievement that were not envisaged when the assessment task was specified. However, it is a matter of contention as to whether matters such as effort or improvement should be incorporated in grading: Sadler’s (2010) discussion of fidelity in grading strongly suggests that they should not.

Assessors vary in a number of respects, even when they come from the same disciplinary area. For example:

● their interpretation of assessment criteria may vary, and/or unstated criteria may be applied (Webster, Pepper, and Jenkins 2000; Woolf 2004);

● they may mark work against different reference points, perhaps reflecting the extent to which they have been socialised into disciplinary or broader cultural practices (Hand and Clewes 2000; Yorke, Bridges, and Woolf 2000; Johnston 2004);


● they may understand in different ways what marking involves, and act accordingly (Ekstrom and Villegas 1994; Yorke, Bridges, and Woolf 2000);

● they occupy different positions on the generosity–meanness dimension as regards grading;

● the order in which student work is assessed is likely to have some effect;
● there may be attempts to adjust grading in the interests of fairness;
● personal circumstances, such as tiredness, may obtrude into the assessment process.

Double-marking, ideally ‘blind’, may mitigate some of the variation between individuals. However, the pressures on contemporary higher education in the UK are such that the scale of double-marking is reduced, often to a sample of the submitted assignments rather than to the totality – and in any case the virtues of double-marking as a check on standards are not as clear-cut as some believe (Brooks 2004).

Norm- versus criterion-referencing

Assessment in higher education remains influenced by expectations that the grades from a reasonable number of students will fall roughly into a normal distribution. Student achievements can be banded – for example, such that 10% obtain an A, 25% a B, 30% a C, 25% a D and 10% an E. Many readers will have experienced something of this sort. The key here is that students are being referenced against the cohort’s mean (and hence against each other), and only a relatively small proportion is awarded the top grade.

In recent times, curricula have become increasingly specified in terms of expected (or intended) learning outcomes, and students are graded on the extent to which they have attained the expected outcomes. This is a variant of mastery learning. There is no reason in theory to prevent every student from being awarded an A, provided that they have attained the specified outcomes (it is possible to infer from data given by Johnson [2003] that this was sometimes the case at Duke University in the USA, the only variation being in the nature of the A [i.e. A+, A or A−]). In reality, distributions are rarely so extreme, and the typical distribution is skewed much less dramatically towards the upper end of the scale. Figure 1 illustrates the likely difference between norm- and criterion-referenced grading. The skew that can occur under criterion-referencing is probably a contributory factor to the steady upward drift in honours degree classifications that has been detected since data were first systematically collected for the UK by the Higher Education Statistics Agency (HESA) in the academic year 1994–95 (see Yorke 2009b).

Figure 1. An illustration of possible differences in grade distributions under norm- and criterion-referenced assessment.

Some academics believe that student achievement is (or should be) graded roughly according to the normal curve. Criterion-referenced assessment can, with such a mindset, lead to resistance to the award of ‘too many As’. The European Commission attempted to use the 10-25-30-25-10 distribution of grades (as noted above) in the European Credit Transfer and Accumulation System, in order to facilitate the cross-national movement of students, but found that this kind of norm-referencing did not fit with practical experience. As a consequence, it was forced to adopt a more flexible approach (see European Commission 2009, Annex 3).
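The contrast can be sketched computationally. In the following illustration the cohort, the banding shares and the criterion thresholds are all invented; it simply shows that the two referencing approaches can award quite different grade distributions to the same set of marks.

```python
# A minimal sketch (hypothetical marks and cut-offs, not the article's data) contrasting
# norm-referenced banding with criterion-referenced grading for the same cohort.
from collections import Counter
import numpy as np

rng = np.random.default_rng(0)
marks = rng.normal(62, 10, size=200).clip(0, 100)   # a fictitious cohort of 200 marks

# Norm-referencing: fixed shares of the cohort receive each grade (10/25/30/25/10).
shares = {"A": 0.10, "B": 0.25, "C": 0.30, "D": 0.25, "E": 0.10}
order = np.argsort(-marks)                # best performer first
norm_grades = np.empty(len(marks), dtype="<U1")
start = 0
for grade, share in shares.items():
    n = round(share * len(marks))
    norm_grades[order[start:start + n]] = grade
    start += n
norm_grades[order[start:]] = "E"          # any rounding remainder goes to the bottom band

# Criterion-referencing: a grade reflects whether stated thresholds are attained,
# so in principle every student could earn an A.
thresholds = [(70, "A"), (60, "B"), (50, "C"), (40, "D"), (0, "E")]
crit_grades = [next(g for cut, g in thresholds if m >= cut) for m in marks]

print("norm-referenced:", Counter(norm_grades))
print("criterion-referenced:", Counter(crit_grades))
```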

It is probably the case that most grading takes place without awareness of the distributional assumptions that are embedded in the process. Further, anyone drawing on the grade as a source of information may apply their own (probably tacit) assumptions.


Subject or study unit norms

Academic ‘insiders’ (and particularly those who have to deal with assessments from a range of subject areas) know that the pattern of grades varies between subject disciplines. Bridges et al. (1999) pointed to the potential for inequity. At the ‘macro’ level, statistics produced by HESA in the UK show publicly the variability that exists in the distribution of honours degree classifications. For instance, an engineering student is more than three times as likely as a law student to obtain first-class honours (see Yorke 2009a), a magnitude of discrepancy that is unlikely to be explained by differences in entry qualifications. It is more likely that the discrepancy arises because the full grading scale tends to be used to a greater extent in science-based subjects than in the social sciences. Further, the need in law to demonstrate all-round capabilities may make it particularly difficult for the latter student to get a ‘first’ (see Attwood 2009). There is a need for a better understanding of disciplinary differences and their implications for assessment, since grades and overall indexes of attainment (e.g. honours degree classification or grade-point average) conceal the ‘mix’ of assessment demands on which they are based.

The variability is also present at study unit (or module) level: Johnson (2003) provides evidence from Duke University as part of his study of ‘grade inflation’ in the USA, and examples from module level in the UK are given by Yorke et al. (1996) and Yorke (2008b). The norms in operation are in part disciplinary, and in part local to the institution.


Psychosocial pressures

Grading is subject to a number of psychosocial pressures. Amongst the pressures bearing on grading are the following.

● To award a grade higher than is merited. This can occur for reasons such as encouragement of students; offering them a chance to redeem failure by opening the door to ‘compensation’ because of better performances elsewhere, or to a resitting of the assessment; and giving them an improved chance in the labour market. Associated with this is, in some professionally-related subjects, the need to resolve the tension inherent in the dual role of mentor and assessor. The pressure is particularly strong where the ethos of the profession is nurturing, as in nursing, social work or teacher education (see, for example, Brandon and Davies 1979; Hawe 2003). Some assessors may also feel under pressure to award higher grades, because low grades may be perceived as a signal of their low capability as teachers – an issue flagged by Johnson (2003) in relation to tenure in the USA.

● The influence of the interaction between assessor and student. Dennis, Newstead, and Wright (1996) found that roughly 30% of the variance in the marking of students’ dissertations in psychology was attributable to influences relating to the supervisor/assessor (largely related to the assessor’s knowledge of the student).

● The tension between, on the one hand, retention and completion (with their implications for institutional funding) and, on the other, the need for academics to uphold standards and in some subject areas to act as gatekeepers to a profession.

● Justifying the grading awarded, not only for the purposes of quality assurance, but also to enable evidence to be provided in response to any appeals lodged by students.

● The time-consuming nature of the appeals process, should a student be unhappy with the grade they have been awarded.

Assessment criteria

Curriculum designers are tempted to provide detailed specifications of assessment tasks and criteria, which is unexceptionable and understandable from the point of view of transparency. Indeed, the Quality Assurance Agency (QAA) in the UK seems to be implicitly expressing the view that the problems of assessment can substantially be solved if only assessment criteria are expressed with sufficient clarity. A similar inference can be drawn from Walvoord and Anderson (1998, ch. 5) regarding practice in the USA.

In Section 6 of the first edition of the QAA’s Code of practice for the assurance of academic quality and standards in higher education, institutions were exhorted to publish and implement consistently clear criteria for the grading of students’ work. A similar sentiment is expressed in the second edition (QAA 2006). However, in reports of audits carried out whilst the first edition of the Code was current, there seems to have been a tension between expectations that institutions will have generic assessment criteria (for purposes of consistency) and subject-specific assessment criteria (in order to reflect disciplinary interests): see QAA (2008, para. 38ff.). Such tension is unavoidable and irresolvable. It cannot be mitigated by the use of apparently generic terms (such as ‘critically evaluate’), since these mean different things in different subject areas.

The exhortation to apply generic assessment criteria reflects a desire to standardise assessments in qualitative terms. However, this kind of standardisation is a chimera. Sadler, in a number of articles (see, for example, Sadler 1987, 2005, 2009a, b), and others have pointed to a number of problems in the application of criteria that undercut any belief that assessment criteria, if only stated clearly and applied consistently, can make a major contribution to the resolution of the problems of assessment. These problems include:

● differential meanings applied to terms;
● variation in understanding of terms, even within a subject discipline;
● variation in the ways in which achievements against multiple criteria are combined to provide an overall mark;
● different ways of dealing with achievements that fall outside the formal specification of the work.

The fuzziness inherent in generic assessment criteria, which is widespread, is exemplified by the criteria relating to ‘conceptualisation’ which formed part of a study by Price and Rust (1999, 134):

A/1st: Able to recognise consistency and reconcile inconsistency between information using cognitive and hypothesising skills;
B+/2.1: Consistent understanding demonstrated in a logical and lucid manner;
B/2.2: Demonstrate understanding in a style which is mostly logical, consistent and flowing;
C/3rd: Attempts to demonstrate a logical and coherent understanding of the subject area but aspects become confused or are underdeveloped;
Refer/Fail: Understanding of the assignment not apparent, or lacks a logical and coherent framework, or the subject is confused or underdeveloped.

The fuzziness is apparent as soon as one tries to resolve a borderline performance one way or another (it is not eliminated if the criteria are given a discipline-specific inflection). There is no precise decision-making process to which an assessor can appeal. Resolution is only possible through the exercise of professional judgement, which implies developmental work on the part of academics regarding the practice of assessment (an important matter that is discussed later).

The assessment of complex achievements

Most of the achievements expected of students in higher education are complex. Writing a good essay, for example, requires inter alia the capacity to appreciate what is needed to support the argument, to be evaluative and selective in the deployment of evidence, to present a coherent argument, and to write with some literary style whilst paying attention to conventions such as citation. The assessment criteria for an essay might be expressed in terms akin to the following, with associated weightings:

● Introduction, 5%
● Relevance of essay to the topic set, 10%
● Structure of the argument, including conclusions, 25%
● Coverage of relevant sources, including accuracy, 25%
● Quality of analysis, 25%
● Literary style and presentation, 10%
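A minimal sketch of the analytic (‘menu-marking’) route, using the illustrative weightings above, is given below; the per-criterion marks and the holistic override step are hypothetical and are not a procedure prescribed in the article.

```python
# A minimal sketch of 'menu-marking' with the illustrative weightings above.
# The per-criterion marks are invented, and the override step is merely one way a
# marker might reconcile the analytic total with an overall (holistic) impression.

WEIGHTS = {
    "Introduction": 0.05,
    "Relevance of essay to the topic set": 0.10,
    "Structure of the argument, including conclusions": 0.25,
    "Coverage of relevant sources, including accuracy": 0.25,
    "Quality of analysis": 0.25,
    "Literary style and presentation": 0.10,
}

def menu_mark(criterion_marks):
    """Weighted sum of per-criterion marks out of 100 (the analytic route)."""
    return sum(WEIGHTS[c] * m for c, m in criterion_marks.items())

essay = {
    "Introduction": 55,
    "Relevance of essay to the topic set": 68,
    "Structure of the argument, including conclusions": 62,
    "Coverage of relevant sources, including accuracy": 58,
    "Quality of analysis": 70,
    "Literary style and presentation": 60,
}

analytic = menu_mark(essay)   # about 63
holistic = 58                 # the marker's overall impression of the essay
# If the two routes diverge markedly, a marker may override the analytic total.
final = holistic if abs(analytic - holistic) > 4 else analytic
print(round(analytic, 2), final)
```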

Whether the assessor operates primarily according to ‘menu-marking’ or holistically, the determination of an overall grade for an essay is challenging. More challenging still is the assessment of ‘generic’ achievements (such as those located under the banner of ‘employability’ or ‘graduate attributes’ that have risen to prominence as higher education institutions have responded to political pressure to ensure that graduates are suited to the labour market). Take, for example, the ‘wicked competences’ whose assessment by academics across a range of subjects was investigated by Knight and Page (2007). Knight and Page described these competences as ‘achievements that cannot be neatly pre-specified, take time to develop and resist measurement-based approaches to assessment’ (2). They elicited from their respondents competences which they grouped under the following broad headings:

● developing supportive relationships;
● emotional intelligence;
● group work;
● listening and assimilating;
● oral communication;
● professional subject knowledge;
● relating to clients;
● self-management;
● ‘taking it onwards – acting on diagnoses’.

Whilst one might take issue with the labelling of these as competences, the kinds of achievement that underlie them are, as the authors pointed out, not easy to grade. When the locus of the performance is the workplace, the challenge of grading is yet more severe. However, their online survey of 83 ‘key informants’ from broadly vocational subject areas found that the informants’ view, in general, was that the listed competences were ‘not … especially difficult to assess’ (2). Perhaps these informants felt able to recognise the manifestation of the ‘competences’ when they saw them, but it is doubtful whether such ‘competences’ could be captured adequately in a ‘menu’ such as that provided for the marking of an essay, since their social components will often be demonstrated in particular – and perhaps quite varied – contexts. For example, assessing a trainee teacher’s performance in a tough inner-city school probably requires a balancing of the assessment criteria that would differ from that when the performance being assessed is in a middle-class school in a leafy suburb.

Assessment tasks such as a set of multiple-choice or short-answer questions, where the performance can be judged straightforwardly (quite often in binary right/wrong terms), present a different kind of problem. A way has to be found to map the student’s profile of responses on to a broad grade. At its simplest, this might be achieved by dividing up the range of possible performances pro rata according to the grading scale being used: this might be appropriate when the tasks are of roughly equivalent difficulty. Where tasks differ markedly in difficulty (and evidence might be obtained from statistics in the relevant item bank), the grade attained by the student might be determined by the profile of their responses.
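The simplest, pro rata mapping might look something like the following sketch; the item count and the grade labels are invented.

```python
# A minimal sketch of the 'pro rata' mapping described above: the possible score
# range on a set of equally weighted right/wrong items is divided evenly across a
# short grade scale. The item count and grade labels are hypothetical.

GRADES = ["E", "D", "C", "B", "A"]          # worst to best

def pro_rata_grade(n_correct, n_items):
    """Divide the 0..n_items score range evenly across the grade scale."""
    band_width = (n_items + 1) / len(GRADES)        # +1 so 0 and n_items both fit
    band = min(int(n_correct / band_width), len(GRADES) - 1)
    return GRADES[band]

print([pro_rata_grade(k, 40) for k in (0, 10, 20, 30, 40)])
# ['E', 'D', 'C', 'B', 'A'] for a 40-item test
```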


Assessment regulations

Brumfield’s (2004) survey of US institutions showed considerable variation between institutions as regards their assessment regulations. A study of assessment regulations in 35 varied higher education institutions in the UK also revealed considerable institutional differences (Yorke et al. 2008). A feature of the latter survey was the variation in the way that a student’s performances in curricular units were combined to give the honours degree classification: some institutions used arithmetic averaging, some determined the classification on the basis of the profile of the student’s achievements (the number of module grades in each of the grade-bands available), and some awarded under certain circumstances the higher class where there was a discrepancy between the ‘averaging’ and ‘profiling’ methods. In Australia, likewise, there are differing approaches to the determination of honours at the bachelor’s level (see Australian Vice-Chancellors’ Committee 2002).

If there is institutional variation in assessment regulations, then only a rough and ready interpretation of a student’s overall grade (such as magna cum laude honors in the USA, or upper second class honours in the UK) becomes possible. There is no common yardstick within the respective higher education sectors. Those outside the higher education system, such as employers, interpret the level of the overall award in the light of their appreciations (not always accurate) of the institution from which the award was gained. Students from prestigious institutions have a built-in advantage.

Grading scales and the combination of grades

In the UK, the majority of institutions have opted to use for grading purposes a scale running from 0 to 100 points (it is inappropriate to treat the grades as percentages because there is usually no referent of perfection). A minority have opted to use scales of around 20 points in length. The shorter scales have the advantage that it is psychologically easier for the assessor to award the maximum number of points on the scale (for outstanding work at the relevant level of the study programme) than it is to award 100 points, which is typically construed as 100% and hence perfection. There is some empirical evidence that shorter scales give rise to wider spreads of grades (Yorke et al. 2002), but that the effect seems to be dissipated when such grades are combined in the determination of honours degree classifications.

Since grading scales are ordinal rather than interval (let alone ratio) in character, grades should not be treated as if they possess the mathematical properties inherent in an interval scale (Dalziel 1998). The point is strengthened when account is taken of the variation in use of the grading scale between subject disciplines and/or study units. The widely-used practice of averaging ‘percentages’ in determining a student’s overall grade is fundamentally flawed, as is acknowledged in a – perhaps rueful – comment from the Vice-Chancellor of the University of Bedfordshire: ‘It always used to strike me as a chemist that I would be telling my students not to average the unaverageable, and then I would walk into an examination board and do exactly that!’ (Innovation, Universities, Science and Skills Committee of the House of Commons 2009, vol. 1, para. 262). It should be noted that the use of grade-points derived from ‘percentages’ is akin to skimming plaster over cracks in a subsiding wall. The fundamental problem of ‘measurement’ does not go away.

If an overall categorisation of a student’s performances is required, it is preferable that this be based on a ‘profiling’ approach, which embeds an ordinality (which ought to extend into the grades awarded in respect of components of study units as well), rather than on averaging (with its assumption of an interval scale). However, the profiling is affected by the grade distribution in the subject. Judging by the degree classifications reported by HESA for higher education in the UK, the grades awarded in subjects allied to medicine (63.2% of awards in 2008 at upper second or first class honours level) tend to lie closer to the upper end of the range than those in computer science (56.2%) and business and administrative studies (53.4%). Entry qualifications are not likely to be a major factor in these outcomes.
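The difference between the ‘averaging’ and ‘profiling’ routes can be illustrated with a deliberately simplified example; the module marks, band boundaries and profiling rule below are hypothetical rather than drawn from any institution’s regulations.

```python
# A deliberately simplified sketch of the 'averaging' and 'profiling' routes to an
# overall classification. Module marks, band boundaries and the profiling rule
# (highest band whose cut-off is reached by at least half of the module marks)
# are hypothetical, not regulations described in the article.

BANDS = [(70, "First"), (60, "Upper second"), (50, "Lower second"), (40, "Third"), (0, "Fail")]

def band(mark):
    return next(label for cut, label in BANDS if mark >= cut)

def classify_by_average(marks):
    return band(sum(marks) / len(marks))

def classify_by_profile(marks):
    # Highest band whose cut-off is reached by at least half of the module marks.
    for cut, label in BANDS:
        if sum(m >= cut for m in marks) * 2 >= len(marks):
            return label

module_marks = [72, 71, 68, 59, 58, 57, 56, 55]
print(classify_by_average(module_marks))   # Upper second (mean = 62.0)
print(classify_by_profile(module_marks))   # Lower second (only 3 of 8 marks reach 60)
```

The divergence between the two answers is the kind of discrepancy that, as noted above, leads some institutions to award the higher of the two classes.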

Grade boundaries

In scales running from 0 to 100, it is not unusual to find that the distribution of grades in UK institutions is jagged, even when the number of grades is large (and hence when a smooth distribution would be anticipated). In UK grading practice, peaks at grades of 40 and 60 are sometimes preceded by marked dips at 39 and 59. Similar effects can be noticed at other multiples of 10, but those at 40 and 60 are of particular significance: 40 is typically taken as the threshold for a pass, and a grade of 60 signals that the achievement is of upper second class honours standard. It should be noted that grading norms tend to be higher in the USA and Australia, and hence the ‘banding’ of performances differs.
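One way to inspect such jaggedness is simply to tabulate the frequency of each awarded mark around the boundaries; the marks in the sketch below are invented.

```python
# A minimal sketch of how the 'jaggedness' described above might be inspected:
# count the frequency of each awarded mark and compare marks just below a
# boundary (39, 59) with the boundary itself. The list of marks is fictitious.
from collections import Counter

awarded_marks = [40, 40, 41, 38, 42, 60, 60, 61, 58, 62, 57, 63, 40, 59, 60]
freq = Counter(awarded_marks)

for boundary in (40, 60):
    below, at = freq[boundary - 1], freq[boundary]
    print(f"marks at {boundary - 1}: {below}, at {boundary}: {at}")
# A marked excess at 40/60 relative to 39/59 is the pattern the article describes.
```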

The pass grade is important, not only to students (for obvious reasons), but also to institutions. Institutions have to organise opportunities for students to redeem failure, they have to deal with any student appeals, and non-continuation has an impact on funding. There is an implicit pressure (economic and social) on them to give the student the benefit of the doubt when deciding whether to award a pass. The frequency of grades at 40 is often exaggerated because of the widely-adopted practice of ‘capping’ a grade awarded for a retaken assessment at the threshold level for a pass.

In the UK, the dividing line between upper and lower second class honours at bachelor’s level is important, since an ‘upper second’ can be an entry ticket to prestigious opportunities whereas those with a ‘lower second’ tend not to be admitted. Yet the ‘cut’ between upper and lower second class is made at the point in the distribution of grades where ‘noise’ (i.e. measurement error) can have a particularly adverse effect on the discrimination between signals that differ in relatively small but important ways. The same applies to other ‘cut-points’ on the classification scale: the dividing line between third class and fail (or, more likely than an unequivocal fail, the award of the degree without honours because the required number of credits for honours was not reached), and that between upper second and first class, are also of significance.

Institutional markers may wish to minimise the uncertainty that exists at borderlines (e.g. 39 and 59) in order to pre-empt wrangling at an examination board. There is some anecdotal evidence that external examiners have, on occasion, asked internal markers to make definite decisions in respect of borderline performances, and not to leave them to be adjudicated by those who are not ideally placed to offer an opinion. Considerations such as these may help to explain – where they occur – relatively low frequencies of grades such as 39 and 59.

All grading, unless of a performance for which there are unambiguous markers of achievement (such as calculating correctly), is to some extent touched with uncertainty. Psychometrics tells us that, to get a grade that is acceptably reliable (setting aside considerations of validity), it is necessary to assess a performance on a number of occasions in order to reduce the ‘error of measurement’ to an acceptable level. In higher education it is not practicable in many circumstances to undertake multiple assessments of a particular type of performance. Perhaps this is why, in many UK institutions, there is an implicit appreciation of the fallibility of grades when they allow reconsideration of overall performances that fall just below an honours classification boundary. It is not unusual for such reconsideration to take place when the arithmetical average is within 1 or 2 points of 40, 50, 60 and 70, where the ‘percentage scale’ is used for grading. Such reconsideration is applied to mean grades just below the borderline; rarely – if ever – when they are marginally above. In other words, it is a one-way bet: students can gain but not lose.
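The underlying statistical point can be illustrated with a small simulation (the ‘true’ mark and the noise level are invented): averaging independent grading occasions shrinks the spread of the resulting grade roughly in proportion to the square root of the number of occasions.

```python
# A small simulation of the psychometric point above: if each grading occasion adds
# independent 'noise' to a student's underlying standard, the spread of the averaged
# grade shrinks roughly as 1/sqrt(number of occasions). The 'true' mark and the
# noise level are invented for illustration.
import numpy as np

rng = np.random.default_rng(1)
true_mark, noise_sd, trials = 62, 6, 10_000

for occasions in (1, 2, 4, 8):
    observed = true_mark + rng.normal(0, noise_sd, size=(trials, occasions))
    averaged = observed.mean(axis=1)
    print(f"{occasions} occasion(s): sd of averaged grade = {averaged.std():.2f}")
# Expected output: roughly 6.0, 4.2, 3.0 and 2.1 respectively
```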

From measurement to judgement

If measurement is flawed, what can be done? The key is to acknowledge the pervasive effect of professional judgement in the assessment process – for example, in grading an essay, or rather earlier in the assessment process, such as when making choices about assessment tasks (the achievement of some of which – such as solving numerical problems – might be felt to be amenable to a measurement approach). If a judgemental approach is to be adopted, then the implications of the qualifier ‘professional’ have to be addressed.

The challenge of assessment is at its most severe when the student work has to be assessed against a multiplicity of criteria (which may overlap). A parsimonious approach to assessment would bear in mind the outcomes specified for the assessment task, but would allow the assessor the freedom to make a judgement that would be more holistic than marking against separate criteria – say, on two or three broadly-described dimensions at most. As noted above, the specification of expected outcomes is more challenging when the assessment relates to performances in naturalistic settings – as with creativity, all that can probably be done is to list some broad criteria and to assess (judge) how the student has responded to the set task – hence Eisner’s (1985) argument for ‘connoisseurship’.

Knight (2002), whilst criticising the relatively unquestioned application, to summative assessment, of concepts derived from psychometrics, did not entirely escape from their thrall since he used some psychometric terminology when offering his alternative stance. The more one shifts perspective from scientific measurement towards judgement, the more problematic some of the widely-used concepts become. Referring to evaluation rather than to the assessment of student achievements, Guba and Lincoln (1989) suggested that, whereas positivist approaches (i.e. based on scientific measurement) emphasised validity, reliability, objectivity and generalisability, the constructivist approach would focus on, respectively, credibility, dependability, confirmability and transferability. There is some commonality between a context of evaluation and one of summative assessment, though the mapping of the details of the constructivist model of evaluation as set out by these authors would need some adaptation for the assessment of student achievement. Other aspects of assessment, such as fairness, efficiency, ‘cheat-proofness’, intelligibility and utility would – it would seem – be relatively unaffected by the model chosen. The more that summative assessment is seen to have an influence on learning, the greater is the ‘reactivity’ (the extent to which behaviour is subsequently influenced, which is of particular significance when assessment is conducted in naturalistic settings such as workplaces, and the assessor is looking for examples of the way in which the student has learned from their workplace and other experiences).


Research on interviewing in simulated employment settings suggests that judgements of interviewees tend to be accurate whether or not notes are taken (Sanchez and de la Torre 1996; Middendorf and Macan 2002), though it should be appreciated that the taking of notes had some practical value in respect of good employment practices. Einhorn’s (2000/1972) study of expert judges’ interpretation of histology slides from Hodgkin’s disease patients who subsequently died is not entirely comforting, because the judges varied to some extent in their interpretations. Despite the variability of the judges, Einhorn was prepared to allow that expert judges might nevertheless reach the right result by various means. From the point of view of summative assessment, it may be the case that assessors can do likewise – but the assessors would need to have developed appropriate expertise first. It can be inferred from these examples that a judgemental approach to summative assessment might be a viable proposition (even though it will not be perfect – but current methods fall short of perfection, too).

Bias is a potential problem when applying the judgemental model (though it is not absent from the scientific measurement model either: it is just less obvious there). Elander and Hardman (2002) list a range of studies of assessment situations where the combination of statistical data proved superior to the judgement of individuals. They mention clinical assessment, student selection, parole board decisions and the prediction of business failure. As they observe, the situations have in common the availability of a criterion external to the assessment against which the outcomes of the assessments can be weighed. This amounts to operating with respect to a correspondence theory of truth. In the assessment of student achievement there is typically no such external criterion. The overall grade is derived from some combination (depending on the institutional assessment regulations) of the grades already awarded. This reflects a coherence theory of truth. If there is no external criterion, the effect of bias might be mitigated (it is unlikely to be eliminated) through collegial discussion focusing on standards – a point that is developed later.

Knight (2002, 285) suggested that it would be better:

to explore assessment as complex systems of communication, as practices of sense-making and claim-making. This is about placing psychometrics under erasure while revaluing assessment practices as primarily communicative practices.

Further, Knight (2006) has argued that assessment is essentially ‘local’ in character. One cannot make sense of the outcomes of assessment unless one knows about the task, the circumstances under which it was undertaken, and so on. The judgemental approach to assessment has, arguably, a greater potential for communication than an approach based on notions of measurement. Realising this potential implies the non-trivial task of educating interested parties as to what summative assessment can, and cannot, offer them. Knight’s (2007) comments on what an institution can, and cannot, warrant could make a useful starting point.

Adopting a judgemental approach

What would follow from the adoption of an overtly judgemental approach to summative assessment?

(1) It would make it unambiguously clear that summative assessments of complex achievements are almost always judgements and not measurements. The use of finely-graded scales, such as the so-called percentage scale, tends to seduce assessors into believing that assessment can be conducted with a precision which it manifestly does not possess. The eradication of the false consciousness regarding precision would be beneficial across the higher education system, and beyond it.

(2) It would deal to some extent with the variation manifested in the use of the ‘percentage scale’, where the mark distributions differ considerably between subject disciplines and within subject disciplines too (for example, courses in biology involve laboratory exercises and essay-writing, and economics involves studies that are mathematically-based and studies that are rooted in the social sciences). When students comment, as they often do, to the effect that a percentage mark would be fairer than an honours degree classification (see examples in Yorke 2008a, 135 and Innovation, Universities, Science and Skills Committee of the House of Commons 2009, vol. 2, at Q179 and Q229), they demonstrate a touching and misguided faith in the commensurability of percentage marks across and within modules and subject areas, and in the capacity of assessors from across the range of subject areas to calibrate extremely fine discriminations with consistency as regards academic standards. Academics, too, are not immune from an unsustainable belief in the accuracy with which students’ work can be marked: one institution, for example, operates a scale of marking for essays in which some letter grades are translated to a single percentage mark, such as 59%. These particular percentages may well be signalling ‘borderline performance, perhaps consider an upgrade’: one might infer that such a grade is more of a signal of borderline standard than it is a precise measure.

(3) It would require that students accept that assessment grades are judgements of the merits of their work, or combinations of pieces of work, and not precise measurements. Hence the overall assessment across the programme represents a judgement which cannot be replaced by the apparent precision of a percentage mark.

The relationship between standard and level

Consideration needs to be given to the issue of the standards to be applied at different stages in a programme of study. On a developmental rationale, work produced in the final year of (full-time) study would be assessed against higher standards than that produced in the penultimate year. It is unclear whether UK institutions that have an undifferentiated ‘Part 2’ to the honours programme (Part 2 comprising the modules that count towards the classification, typically the second and third years of a full-time bachelor’s programme) have resolved this issue. The passing grades might, for ease of reference, be related to those of the honours degree classification. Academics in the UK develop over time an understanding of the broad meaning of these labels for their own subject discipline against standards determined for the particular stage of the programme.

The grade to be awarded could be determined in a manner similar to that suggested by Sadler (2005), in which outcomes are divided into primary and secondary importance, and whose level of achievement could be categorised according to a scale such as: all achieved; most achieved; some achieved; none achieved. The grade awarded to the work would be determined with reference to the particular combination of achievements regarding primary and secondary outcomes. However, it is highly improbable that the grade could be determined mechanistically via mapping rules, since each would require a conflation of achievement levels, intended outcomes and the categorisation of the outcomes as essential or desirable. Assessors would generally, for reasons of practicality, be reluctant to undertake such a fine level of scrutiny of each piece of submitted work. They might, however, be prepared to use a limited number of key assessment criteria to produce an initial judgement that would be refined with reference to other criteria deemed relevant. Gigerenzer, Todd, and the ABC Research Group (1999) would probably suggest something even simpler, applying a heuristic to a very small number of key pieces of evidence and basing a rough and ready judgement on what the heuristic produced. This is not so very different from advice to focus assessment at the module level on a small number of key learning outcomes, instead of trying to assess all of those that are stated for the module (see, for example, Knight and Yorke 2006, 9).
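Purely to make concrete what such a combination rule would involve, the sketch below maps hypothetical ‘all/most/some/none achieved’ judgements on primary and secondary outcomes onto a broad grade; as the paragraph above notes, a mechanistic mapping of this kind is unlikely to be adequate in practice.

```python
# A deliberately crude sketch of a Sadler-style combination rule, purely to make
# concrete what mapping 'all/most/some/none achieved' on primary and secondary
# outcomes onto a broad grade would involve. The rule itself is hypothetical, and
# the article cautions that a mechanistic mapping of this kind is unlikely to be adequate.

LEVELS = {"all": 3, "most": 2, "some": 1, "none": 0}

def broad_grade(primary, secondary):
    """primary/secondary: achievement level on the primary and secondary outcomes."""
    p, s = LEVELS[primary], LEVELS[secondary]
    if p == 3 and s >= 2:
        return "A"
    if p >= 2 and s >= 1:
        return "B"
    if p >= 1:
        return "C"
    return "Refer"

print(broad_grade("all", "most"))   # A
print(broad_grade("most", "some"))  # B
print(broad_grade("some", "none"))  # C
```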

The findings of Elander and Hardman (2002), regarding the marking of essay-type productions by students, can be interpreted as pointing in a similar direction, since they found that seven aspects of the graded work gave rise to three or fewer dimensions following principal components analysis of their data. Their principal components analyses produced a large first component with no more than two other components. Drawing on Kline (1994), they argued that the first (large) component led to the underestimation of the importance of smaller components, although there is argument about the number of components that should be taken into account when eigenvalues are close to unity (as was the case with a number of the factors extracted by Elander and Hardman: see Zwick and Velicer 1986; Cliff 1988; Kline 1994, for contrasting views). Whatever statistical position one takes, the findings of Elander and Hardman press towards parsimony in the number of dimensions against which essay-type work should be assessed. Grainger, Purnell, and Zipf (2008), qualitatively, found markers tending to converge on content and technical aspects of the submitted work as indicators of quality (they might, perhaps, have meant ‘standards’).
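As an illustration of the statistical point at issue, the sketch below applies the eigenvalues-greater-than-one rule to synthetic marking data. The data are invented (they are not Elander and Hardman’s) and are constructed so that one dimension dominates; the contested cases arise when, unlike here, several eigenvalues hover close to 1.0.

```python
# Illustrative sketch: synthetic marks on seven criteria that share one strong
# underlying dimension, with the eigenvalue-greater-than-one rule applied to
# the correlation matrix. Not Elander and Hardman's (2002) data or analysis.
import numpy as np

rng = np.random.default_rng(0)
n_students, n_criteria = 200, 7

overall_quality = rng.normal(size=n_students)  # one dominant latent dimension
marks = 0.8 * overall_quality[:, None] + 0.6 * rng.normal(size=(n_students, n_criteria))

corr = np.corrcoef(marks, rowvar=False)               # 7 x 7 correlation matrix
eigenvalues = np.sort(np.linalg.eigvalsh(corr))[::-1]  # largest first

print("eigenvalues:", np.round(eigenvalues, 2))
print("components with eigenvalue > 1:", int(np.sum(eigenvalues > 1.0)))
```

With this deliberately one-dimensional data the rule retains a single large component; real marking data in which several eigenvalues sit just above or below 1.0 is precisely where the rule is contested (Zwick and Velicer 1986; Cliff 1988).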

The part/whole conundrum

The tension between achievements at various levels in an academic programme draws attention to the philosophical issue of the relationship between the parts and the whole. Can, for instance, a student succeed in terms of the programme as a whole whilst having not achieved success in a few of the programme’s component parts? As Knight and Yorke (2003) have argued, following Claxton’s (1998) ideas regarding slow learning, some kinds of achievement may not be demonstrated adequately within the tight confines of a single module, and may require the longer span of a programme to be done justice. Whereas the answer given by Knight and Yorke to the question would be ‘yes’ (subject to limitations such as a mandatory requirement to pass in particular curricular components), those with a strong credit accumulation perspective would probably take the opposite view.

Compensation (of weak performances by strong ones) is a perennial issue. For some modules, such as those that bear on public safety, a failing performance cannot be permitted to be subject to compensation. In others, compensation between module components may be acceptable. For example, compensation might be permitted when one piece of written work in, say, a sociology module, is substandard but the majority of material submitted for the module is of an adequate standard. Requiring a pass in every component of a module is a stringent criterion, and can build up a heavy backlog of reassessments (which may be to both students’ and staff’s disadvantage).
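A minimal sketch of how a compensation rule of the kind just described might be expressed is given below. The pass mark, the ‘majority’ test and the treatment of safety-critical modules are assumptions made for illustration, not any institution’s actual regulations.

```python
# Illustrative sketch only: one possible compensation rule for a single module.
# The pass mark, the majority test and the compensation flag are assumptions.

PASS_MARK = 40  # assumed module pass threshold on a percentage scale

def module_passes(component_marks: list[float], compensation_allowed: bool = True) -> bool:
    """Pass the module if every component passes, or (where compensation is
    allowed, e.g. not a safety-critical module) if the majority of components
    pass and the overall mean is at or above the pass mark."""
    all_pass = all(m >= PASS_MARK for m in component_marks)
    if all_pass:
        return True
    if not compensation_allowed:
        return False
    majority_pass = sum(m >= PASS_MARK for m in component_marks) > len(component_marks) / 2
    mean_ok = sum(component_marks) / len(component_marks) >= PASS_MARK
    return majority_pass and mean_ok

print(module_passes([62, 55, 35]))                              # True: one weak piece compensated
print(module_passes([62, 55, 35], compensation_allowed=False))  # False: compensation barred
```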

Are lengthy grading scales necessary?

The question has been raised (Yorke 2010) as to whether the grading of students’ work needs to be as finely grained as it often is. An analysis of the records of 144 students whose studies were predominantly in modules of law showed that, had the grading been undertaken with reference to a scale considerably shorter than those typically used, there would have been relatively little difference to the students’ overall gradings (in this case, their honours degree classifications) for the programme as a whole. The study was in effect a simulation based on real data, but what it was unable to do was to test whether marking to a very short scale from the outset would have effects on marks at both module and overall levels.
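The sketch below indicates, in outline, how a simulation of this kind can be set up: percentage marks are collapsed onto a short scale and the overall classifications recomputed. The marks are randomly generated and the band boundaries and representative marks are assumed; this is not the procedure or data reported in Yorke (2010).

```python
# Illustrative sketch of a re-grading simulation of the kind described above.
# Synthetic marks; assumed classification boundaries and short-scale bands.
import numpy as np

rng = np.random.default_rng(1)
module_marks = rng.normal(60, 10, size=(144, 12)).clip(0, 100)  # 144 students x 12 modules

def classify(mean_mark: float) -> str:
    """Crude honours classification from a mean percentage mark (assumed boundaries)."""
    if mean_mark >= 70: return "First"
    if mean_mark >= 60: return "Upper second"
    if mean_mark >= 50: return "Lower second"
    if mean_mark >= 40: return "Third"
    return "Fail"

def to_short_scale(mark: float) -> float:
    """Collapse a percentage mark onto a five-point scale, represented here by
    a single assumed mark for each band."""
    bands = [(70, 75), (60, 65), (50, 55), (40, 45), (0, 35)]  # (lower bound, representative mark)
    for lower, representative in bands:
        if mark >= lower:
            return representative
    return bands[-1][1]

original = [classify(m) for m in module_marks.mean(axis=1)]
regraded = [classify(m) for m in np.vectorize(to_short_scale)(module_marks).mean(axis=1)]

changed = sum(o != r for o, r in zip(original, regraded))
print(f"{changed} of {len(original)} classifications change under the short scale")
```

What such a simulation cannot show, as the article notes, is how markers would behave if they used the short scale from the outset.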

Gigerenzer, Todd, and the ABC Research Group (1999) referred to the utility of using what they termed ‘fast and frugal heuristics’ in decision-making. For some purposes, rough and ready information can have greater practical value than finely-detailed information. Law (2004, 105ff.) gives the example of a senior manager at the Daresbury Laboratory (a facility funded by the Science and Engineering Research Council) who realised, from monthly time-sheets completed in an approximate way by employees, that a development project was running seriously behind schedule. The information-value of the rough and ready time-sheet data was, for the manager’s purposes, sufficient to provide the ‘big picture’ view he needed. To follow the aphorism, less turned out to be more. For other managerial purposes, information would have been needed at a finer level of detail. The (rather trite) lesson for the assessment of student work is that what is needed by way of information depends on the purposes that the outcomes serve.

The grading of student work is an exercise in decision-making (often in considerable detail, and sometimes with anguish), but has received insufficient research attention. It is reasonable to ask whether it is necessary to attempt to assess summatively using lengthy grading scales, or whether short scales (of, say, no more than five points) would be ‘good enough’ to serve the interests of students, employers and the academy. The University of Wolverhampton has recently adopted a six-letter scale for those undergraduate assessments that precede those on which the honours degree classification is based: from the research perspective, it could be viewed as a pilot field study. One can envisage academics adopting different stances towards the innovation, and its evaluation ought to be illuminating.

A collateral benefit to academics of adopting an approach based on judgement and short grading scales is that they might find themselves able to undertake grading comparatively rapidly, thereby freeing up time for other aspects of academic work. In a period of considerable resource pressure on institutions, this has an obvious attraction.

The development of expertise as an assessor

Higher education is, in general, thinly populated with academics who are experts in assessment (there are probably rather more who consider themselves experts). If judgement is to have a greater emphasis in assessment, then it has to be trustworthy. Gaining an acceptable level of trust is no easy task, since it requires an appropriate level of expertise that is widely acknowledged as such: as Carless (2009, 85) observes, ‘there is a need for higher levels of assessment literacy at all levels of an institution from senior management to frontline teaching staff’. Further, he points to the critical role of middle and senior management in this respect, arguing that they may be most in need of professional development regarding their knowledge and understanding of assessment. Such development implies a social constructivist approach to the appreciation by academic staff of tacit and explicit knowledge about assessment: Rust, O’Donovan, and Price (2005) make the point, but their argument curves away from developing it.

Whatever view one takes regarding summative assessment, the disarray noted by Knight (2002), Rust (2007) and Yorke (2008a) – again evident in this article – will not be susceptible to a ‘quick fix’. There is a need for a substantial developmental programme at sectoral and institutional level.

Moving towards a more judgemental approach to summative assessment should not lead to an unthinking dismissal of ‘menu-marking’. Menu-marking provides a framework against which the inexperienced can develop their expertise as assessors. Over time, and with reflection on the process of assessment, the crutch of ‘menu-marking’ can be discarded and greater confidence placed on the capacity to judge in a more holistic way. In a small-scale empirical study, Ecclestone (2001) found that novices were more rule-governed than competent or expert assessors, though the relationship between the level of competence of the assessor and the actual assessment outcomes was not clear-cut. It is possible to think in terms of mapping the development as an assessor against the Dreyfus brothers’ model for the development of expertise, in which the novice follows a set of rules and passes through levels of competence before (perhaps) reaching expert status with the capacity to judge holistically whilst not necessarily being able to articulate in detail the basis of the judgement (see Dreyfus and Dreyfus 2005). From the point of view of the development of people as assessors, assessment practice that is in part tacit and bordering on the ineffable (however accurate it is) is not the most helpful exemplification for the less expert. The ‘proficient’ (using the Dreyfus brothers’ terminology) assessor who can explain how they arrived at their judgement, drawing on the relevant assessment criteria, is likely to be more useful in this regard.

Expertise in assessment is not a generic capability. Whereas one might be expert in assessing, say, an essay on the strength of the case for the invasion of Iraq or an evaluation of the evidence-base for homeopathy, it does not follow that the expertise transfers to the assessment of a student’s contribution to a group activity focusing on the same issues, or to performance in a work-based situation. More broadly, the assessment of what are often called ‘generic skills’ involves capabilities that differ from those required for the assessment of academic work. Such capabilities cannot be readily disregarded in higher education (perhaps on the grounds that they are ‘not academic’), since many kinds of programme deliberately intertwine academic and professional learning.

A key issue is how assessors can acquire proficiency in the assessment of standards. There are perhaps two main dimensions: to develop an understanding of standards, and to move from the strict rule-governed behaviour of the novice assessor to the adoption of a more holistic appreciation of the merits of a performance.

One of the arguments for peer-assessment by students is that, by discussing varied examples of student work, they can come to appreciate whether or not a piece of work is of high standard, and hence what the desired standards are. The argument can be extended to academic staff, as Rust (2007) makes clear in his article on the scholarship of assessment. Examples of staff collectivity in engaging with assessment are given by Saunders and Davis (1998) and Jawitz (2009). It should be feasible for institutional and national disciplinary communities to organise developmental activities of a similar kind, supported where appropriate by colleagues who have specific expertise in assessment methodology: this is consistent with a recommendation made by the Assessment Standards Knowledge Exchange Centre for Excellence in Teaching and Learning to the Innovation, Universities, Science and Skills Committee of the House of Commons in the UK (see Innovation, Universities, Science and Skills Committee of the House of Commons 2009, at Ev 194). However, the establishment of functional communities of practice regarding assessment within a disciplinary area is not without its problems (see Price 2005), and groups of assessors may find themselves heading off down the track of ever-increasing specification unless they see such specification as a necessary ground-clearing step on the route to enhanced judgement regarding the standards of submitted work. Evidence-based collegial discussion about standards might help to reduce unconscious biases that may have built up in assessors over time (Ecclestone [2001, 308] warns of this danger, and Lievens [2001] touches on the issue of assessor bias in respect of assessment centres).

With greater experience, rule-governed information processing by assessors can evolve into associative processing (Kahneman and Frederick 2002), which is consistent with the developmental trajectory from novice to expert that has been articulated by Dreyfus and Dreyfus (2005).

Trust

The acceptance of a judgemental approach to assessment makes overt the need for trust in the professional judgement of assessors. The need is just as great where the assessments are taken to be based on a measurement approach: it is just that the judgemental approach means that the issue of trust has to be faced openly rather than left relatively unquestioned. In the context of assessing practical achievements in medical education (but with implicitly a much wider relevance), Govaerts et al. (2007) make a very useful contribution towards opening up areas for debate and development.

Carless (2009) points out that there has been an erosion in the preparedness to trust: one of the points that he makes is that the move towards more explicitness in accountability for assessment has had the paradoxical effect of decreasing trust. The re-establishment of trust in academics’ judgement will take time, and cannot be achieved by a simple reversion to previous practices. Necessary conditions for trust include:

● developmental work to ensure that assessors have an adequate level of expertise;
● the ability and willingness on the part of assessors to provide justifications for their judgements (there are obvious connections with formative assessment);
● institutional procedures that allow for challenge to judgements (where these make a material difference). It needs to be borne in mind that, for some kinds of assessment, student anonymity is not a feasible proposition (and may militate against the dialogue that can assist learning);
● an appropriate level of quality assurance as regards the assessment process;
● dealing openly with the fact that judgements (as of course do all assessments) have political implications at various levels, for individuals, institutions and the higher education sector as a whole;
● a campaign to educate those outside the higher education system about what summative assessment can and cannot do (in other words, a debunking of widely-held myths). Such a campaign would need to include professional and statutory bodies because of the influence that they wield in respect of some higher education programmes.

Concluding comment

The evidence presented in this article demonstrates that summative assessment is flawed, for a variety of reasons. Acknowledging this is a necessary first step if higher education is to mitigate the manifest disarray. The integrity of summative assessments would be more secure if the flaws in current practice were addressed, and acknowledgement were made of the limitations that exist regarding the precision with which grading can be undertaken.

Dealing with the flaws in summative assessment will require a sustained developmental effort at sectoral and institutional levels. Patching up current summative assessment practices with sticking plaster, whilst tactically attractive, is strategically quite inadequate.

Acknowledgement

The constructive comments of anonymous referees are very much appreciated and have helped in refining the argument I have presented here.

References

Attwood, R. 2009. Why is it so difficult for law students to get a first? The Times, October 15. http://business.timesonline.co.uk/tol/business/law/student/article6872381.ece (accessed November 24, 2009).
Australian Vice-Chancellors’ Committee. 2002. Grades for honours programs (concurrent with pass degree), 2002. http://www.avcc.edu.au/documents/universities/key_survey_summaries/Grades_for_Degree_Subjects_Jun02.xls (accessed November 24, 2009).
Baume, D., and M. Yorke, with M. Coffey. 2004. What is happening when we assess, and how can we use our understanding of this to improve assessment? Assessment and Evaluation in Higher Education 29, no. 4: 451–77.
Bloxham, S. 2009. Marking and moderation in the UK: False assumptions and wasted resources. Assessment and Evaluation in Higher Education 34, no. 2: 209–20.
Brandon, J., and M. Davies. 1979. The limits of competence in social work: The assessment of marginal work in social work education. British Journal of Social Work 9, no. 3: 295–347.
Bridges, P., B. Bourdillon, D. Collymore, A. Cooper, W. Fox, C. Haines, D. Turner, H. Woolf, and M. Yorke. 1999. Discipline-related marking behaviour using percentages: A potential cause of inequity in assessment. Assessment and Evaluation in Higher Education 24, no. 3: 285–300.
Brooks, V. 2004. Double marking revisited. British Journal of Educational Studies 52, no. 1: 29–46.
Brumfield, C. 2004. Current trends in grades and grading practices in higher education: Results of the 2004 AACRAO survey. Washington, DC: American Association of Collegiate Registrars and Admissions Officers.
Carless, D. 2009. Trust, distrust and their impact on assessment reform. Assessment and Evaluation in Higher Education 34, no. 1: 79–89.
Claxton, G. 1998. Hare brain, tortoise mind. London: Fourth Estate.
Cliff, N. 1988. The eigenvalues-greater-than-one rule and the reliability of components. Psychological Bulletin 103, no. 2: 276–79.


Cowan, N. 2000. The magical number 4 in short-term memory: A reconsideration of mental storage capacity [plus subsequent commentaries]. Behavioral and Brain Science 24, no. 1: 87–185.
Dalziel, J. 1998. Using marks to assess student performance: Some problems and alternatives. Assessment and Evaluation in Higher Education 23, no. 4: 351–66.
Dennis, I., S.E. Newstead, and D.E. Wright. 1996. A new approach to exploring biases in educational assessment. British Journal of Psychology 87, no. 4: 515–34.
Dreyfus, H.L., and S.E. Dreyfus. 2005. Expertise in real world contexts. Organization Studies 26, no. 5: 779–92.
Ecclestone, K. 2001. ‘I know a 2:1 when I see it’: Understanding criteria for degree classifications in franchised university programmes. Journal of Further and Higher Education 25, no. 3: 301–13.
Einhorn, H.J. 1972/2000. Expert judgement: Some necessary conditions and an example. In Judgement and decision making: An interdisciplinary reader, ed. T. Connolly, H.R. Arkes, and K.R. Hammond, 2nd ed., 324–35. Cambridge: Cambridge University Press. [Abridged version of ‘Expert judgement and mechanical combination’, Organizational Behavior and Human Performance 7, no. 1: 86–106.]
Eisner, E.W. 1985. The art of educational evaluation: A personal view. London: Falmer Press.
Ekstrom, R.B., and A.M. Villegas. 1994. College grades: An exploratory study of policies and practices. New York: College Entrance Examination Board.
Elander, J., and D. Hardman. 2002. An application of judgment analysis to examination marking in psychology. British Journal of Psychology 93, no. 3: 303–28.
Elton, L., and B. Johnston. 2002. Assessment in universities: A critical review of research. http://www.heacademy.ac.uk/resources/detail/resource_database/id13_assessment_in_universities (accessed November 24, 2009).
European Commission. 2009. ECTS users’ guide. Brussels: European Commission. http://ec.europa.eu/education/lifelong-learning-policy/doc/ects/guide_en.pdf (accessed September 13, 2009).
Gigerenzer, G., P. Todd, and the ABC Research Group. 1999. Simple heuristics that make us smart. Oxford: Oxford University Press.
Govaerts, M.J.B., C.P.M. van der Vleuten, L.W.T. Schuwirth, and A.M.M. Muitjens. 2007. Broadening perspectives on clinical performance assessment: Rethinking the nature of in-training assessment. Advances in Health Sciences Education 12, no. 2: 239–60.
Grainger, P., K. Purnell, and R. Zipf. 2008. Judging quality through substantive conversations between markers. Assessment and Evaluation in Higher Education 33, no. 2: 133–42.
Guba, E.G., and Y.S. Lincoln. 1989. Fourth generation evaluation. London: Sage.
Hager, P., and J. Butler. 1996. Two models of educational assessment. Assessment and Evaluation in Higher Education 21, no. 4: 367–78.
Hand, L., and D. Clewes. 2000. Marking the difference: An investigation of the criteria used for assessing undergraduate dissertations in a business school. Assessment and Evaluation in Higher Education 25, no. 1: 5–21.
Hawe, E. 2003. ‘It’s pretty difficult to fail’: The reluctance of lecturers to award a failing grade. Assessment and Evaluation in Higher Education 28, no. 4: 371–82.
Hornby, W. 2003. Assessing using grade-related criteria: A single currency for universities? Assessment and Evaluation in Higher Education 28, no. 4: 435–54.
Innovation, Universities, Science and Skills Committee of the House of Commons. 2009. Students and universities. Eleventh Report of Session 2008–09, 2 volumes. London: The Stationery Office.
Jawitz, J. 2009. Learning in the academic workplace: The harmonisation of the collective and the individual habitus. Studies in Higher Education 34, no. 6: 601–14.
Johnson, V.E. 2003. Grade inflation: A crisis in college education. New York: Springer.
Johnston, B. 2004. Summative assessment of portfolios: An examination of different approaches to agreement over outcomes. Studies in Higher Education 29, no. 3: 395–412.
Kahneman, D., and S. Frederick. 2002. Representativeness revisited: Attribute substitution in intuitive judgment. In Heuristics and biases: The psychology of intuitive judgment, ed. T. Gilovich, D. Griffin, and D. Kahneman, 49–81. Cambridge: Cambridge University Press.


Kline, P. 1994. An easy guide to factor analysis. London: Routledge.
Knight, P.T. 2002. Summative assessment in higher education: Practices in disarray. Studies in Higher Education 27, no. 3: 275–86.
Knight, P.T. 2006. The local practices of assessment. Assessment and Evaluation in Higher Education 31, no. 4: 435–52.
Knight, P.T. 2007. Grading, classifying and future learning. In Rethinking assessment in higher education: Learning for the longer term, ed. D. Boud and N. Falchikov, 72–86. Abingdon: Routledge.
Knight, P.T., and A. Page. 2007. The assessment of ‘wicked’ competences. Report to the Practice-Based Professional Learning Centre. http://www.open.ac.uk/cetl-workspace/cetlcontent/documents/460d21bd645f8.pdf (accessed November 24, 2009).
Knight, P.T., and M. Yorke. 2003. Assessment, learning and employability. Maidenhead: Open University Press.
Knight, P.T., and M. Yorke. 2006. Employability: Judging and communicating achievements. York: The Higher Education Academy. http://www.heacademy.ac.uk/resources/detail/ourwork/employability/employability338 (accessed November 24, 2009).
Law, J. 2004. After method: Mess in social science research. London: Routledge.
Lievens, F. 2001. Assessor training strategies and their effects on accuracy, interrater reliability and discriminant validity. Journal of Applied Psychology 86, no. 2: 225–64.
Middendorf, C.H., and T.H. Macan. 2002. Note-taking in the employment interview: Effects on recall and judgments. Journal of Applied Psychology 87, no. 2: 293–303.
Miller, G.A. 1956. The magic number seven, plus or minus two. Psychological Review 63, no. 2: 81–97.
Price, M. 2005. Assessment standards: The role of communities of practice and the scholarship of assessment. Assessment and Evaluation in Higher Education 30, no. 3: 215–30.
Price, M., and C. Rust. 1999. The experience of introducing a common criteria assessment grid across an academic department. Quality in Higher Education 5, no. 2: 133–44.
Quality Assurance Agency for Higher Education (QAA). 2006. Code of practice for the assurance of academic quality and standards in higher education. Section 6: Assessment of students. Gloucester: QAA.
Quality Assurance Agency for Higher Education (QAA). 2008. Outcomes from institutional audit: Assessment of students. Second series. Gloucester: QAA.
Rust, C. 2007. Towards a scholarship of assessment. Assessment and Evaluation in Higher Education 32, no. 2: 229–37.
Rust, C., B. O’Donovan, and M. Price. 2005. A social constructivist assessment process model: How the research literature shows us this could be best practice. Assessment and Evaluation in Higher Education 30, no. 3: 231–40.
Sadler, D.R. 1987. Specifying and promulgating achievement standards. Oxford Review of Education 13, no. 2: 191–209.
Sadler, D.R. 2005. Interpretations of criteria-based assessment and grading in higher education. Assessment and Evaluation in Higher Education 30, no. 2: 176–94.
Sadler, D.R. 2009a. Indeterminacy in the use of preset criteria for assessment and grading. Assessment and Evaluation in Higher Education 34, no. 2: 159–79.
Sadler, D.R. 2009b. Transforming holistic assessment and grading into a vehicle for learning. In Assessment, learning and judgement in higher education, ed. G. Joughin, 45–63. Dordrecht: Springer.
Sadler, D.R. 2010. Fidelity as a precondition for integrity in grading academic achievement. Assessment and Evaluation in Higher Education 35, no. 6: 727–43.
Sanchez, J.I., and P. de la Torre. 1996. A second look at the relationship between rating and behavioral accuracy in performance appraisal. Journal of Applied Psychology 81, no. 1: 3–10.
Saunders, M.N.K., and S.M. Davis. 1998. The use of assessment criteria to ensure consistency of marking: Some implications for good practice. Quality Assurance in Education 6, no. 3: 162–71.
Stanovich, K., and R. West. 2002. Individual differences in reasoning. In Heuristics and biases: The psychology of intuitive judgement, ed. T. Gilovich, D. Griffin, and D. Kahneman, 421–40. Cambridge: Cambridge University Press.


Walvoord, B.E., and V.J. Anderson. 1998. Effective grading: A tool for learning and assessment. San Francisco: Jossey-Bass.
Webster, F., D. Pepper, and A. Jenkins. 2000. Assessing the undergraduate dissertation. Assessment and Evaluation in Higher Education 25, no. 1: 71–80.
Woolf, H. 2004. Assessment criteria: Reflections on current practices. Assessment and Evaluation in Higher Education 29, no. 4: 479–93.
Yorke, M. 2008a. Grading student achievement in higher education: Signals and shortcomings. Abingdon: Routledge.
Yorke, M. 2008b. Faulty signals? Inadequacies of grading systems and a possible response. In Assessment, learning and judgement, ed. G. Joughin, 65–84. Dordrecht: Springer.
Yorke, M. 2009a. Honours degree classifications: What we can and cannot tell from the statistics. Gloucester: Quality Assurance Agency for Higher Education.
Yorke, M. 2009b. Trends in honours degree classifications, 1994–95 to 2006–07, for England, Wales and Northern Ireland. http://www.heacademy.ac.uk/resources/detail/publications/trends_in_honours_degree_classifications (accessed November 24, 2009).
Yorke, M. 2010. How finely grained does summative assessment need to be? Studies in Higher Education 35, no. 6: 677–89.
Yorke, M., G. Barnett, P. Bridges, P. Evanson, C. Haines, D. Jenkins, P. Knight, D. Scurry, M. Stowell, and H. Woolf. 2002. Does grading method influence honours degree classification? Assessment and Evaluation in Higher Education 27, no. 3: 269–79.
Yorke, M., P. Bridges, and H. Woolf. 2000. Mark distributions and marking practices in UK higher education. Active Learning in Higher Education 1, no. 1: 7–27.
Yorke, M., A. Cooper, W. Fox, C. Haines, P. McHugh, D. Turner, and H. Woolf. 1996. Module mark distributions in eight subject areas and some issues they raise. In Modular higher education in the UK, ed. N. Jackson, 105–7. London: Higher Education Quality Council.
Yorke, M., H. Woolf, M. Stowell, R. Allen, C. Haines, M. Redding, D. Scurry, G. Taylor-Russell, W. Turnbull, and L. Walker. 2008. Enigmatic variations: Honours degree assessment regulations in the UK. Higher Education Quarterly 62, no. 3: 157–80.
Zwick, W.R., and W.F. Velicer. 1986. Comparison of five rules for determining the number of components to retain. Psychological Bulletin 99, no. 3: 432–42.
