© Cambridge English Language Assessment 2015
Score Comparability of
Holistic and By-Part
Marking Models
Nahal Khabbazbashi and Evelina Galaczi Cambridge English Language Assessment
Language Testing Forum
Department of Education, University of Oxford
November 2015
Marking by part:
Is it really worth the effort?
Why mark by part?

• A single score for a number of tasks may not provide a clear reflection of a candidate’s speaking proficiency (O’Sullivan, 2007; O’Sullivan & Nakatsuhara, 2011; Taylor & Galaczi, 2011)
• Task types affect candidates’ conversational styles (O’Sullivan & Nakatsuhara, 2011)
• Examiners’ final decision making may be impacted by particularly good or bad performance on a given task (O’Sullivan & Nakatsuhara, 2011)
• Noticeable differences in band scores of candidates on different parts of the IELTS speaking test (Nakatsuhara, 2011)
Why not mark by part?

• Practical constraints
• Cognitive load on examiners
• Sufficient speech sample for reliable scoring

[Diagram: aggregating scores across Parts 1 to n into an overall mark vs. awarding a single overall mark — how do the two compare?]

It’s an empirical question!
Assessment Context: Cambridge English Placement Test

Part X  Read aloud (sound check)
Part 1  Short interview questions (personal topics)
Part 2  Describing and comparing two photographs
Part 3  Responding to a series of questions on a familiar topic
Part 4  Long turn on an abstract topic (prompts provided)
Marking Model: Cambridge English Placement Test

Part X  Read aloud (sound check)
Part 1  Short interview questions (personal topics)
Part 2  Describing and comparing two photographs
Part 3  Responding to a series of questions on a familiar topic
Part 4  Long turn on an abstract topic (prompts provided)

Single Holistic Mark

Criteria:
• Coherence & Discourse Management
• Language Resource
• Pronunciation
• Hesitation & Extent
Research Questions

RQ1: What is the relationship between Holistic and By-Part marking models in an online speaking test?

RQ2: To what extent are the different parts of the speaking test statistically distinct?

RQ3: Do raters perform differently when test parts from different candidates are allocated at random compared to marking the same candidates’ performances on different parts?
Method

Test Set
• CEPT trial data (N=240)
• Live examiners (N=10)
• Range of L1s
• Range of CEFR levels

Rating Matrix — designed to ensure that:
(a) candidates in the test set receive at least one holistic mark and one mark for each test part;
(b) raters are linked through common performances to meet the requirements of Many-Facet Rasch Measurement (MFRM); and
(c) halo effects are minimised.

Rating Process
• Raters (N=4)
• Two rounds of rating (counter-balanced design)

Analysis
• MFRM using FACETS (Linacre, 2011)
• Separate analyses for the different marking models (Holistic, By-Part, Random)
Rating Matrix

Rater 1
  Round 1: Candidates 1-30 + CB (By Part); Candidates 31-60 + CB (Holistic)
           Isolated parts: Part 1 (61-70), Part 2 (71-80), Part 3 (81-90), Part 4 (201-210); Parts 1-4 (CB)
  Round 2: Candidates 1-30 + CB (Holistic); Candidates 61-90 + CB (By Part)
           Isolated parts: Part 1 (1-10), Part 2 (11-20), Part 3 (21-30), Part 4 (221-230)

Rater 2
  Round 1: Candidates 61-90 + CB (Holistic); Candidates 91-120 + CB (By Part)
           Isolated parts: Part 1 (11-20), Part 2 (1-10), Part 3 (31-40), Part 4 (41-50); Parts 1-4 (CB)
  Round 2: Candidates 31-60 + CB (By Part); Candidates 91-120 + CB (Holistic)
           Isolated parts: Part 1 (91-100), Part 2 (201-210), Part 3 (221-230), Part 4 (101-110)
Results (RQ1): Marking models & impact on candidate scores (group level)

CEFR Level     Holistic (%)   By-Part (%)
A1 (1)         3.3            3.8
A2 (2)         10.9           15.0
B1 (3)         48.8           52.1
B2 (4)         26.2           22.0
C1 (5)         8.3            5.4
C2 (6)         2.5            1.7
Mean (N=240)   3.78           3.65
SD (N=240)     0.91           0.83

• Similar distribution of CEFR levels & similar mean scores
BUT
• The By-Part model resulted in a lower percentage of candidates at the higher CEFR levels and a statistically significantly lower overall mean (p=0.01) with a small effect size.
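The small effect size can be illustrated from the slide’s own summary statistics. This is a sketch, not the authors’ analysis: it uses Cohen’s d with a simple pooled standard deviation, whereas the paired-design statistic the study actually computed may differ slightly.

```python
# Illustrative sketch: Cohen's d for the Holistic vs. By-Part means
# reported in the results table (3.78 vs. 3.65; SDs 0.91 and 0.83).
# Assumes equal group sizes; not the authors' own computation.

def cohens_d(mean1: float, sd1: float, mean2: float, sd2: float) -> float:
    """Cohen's d with a pooled standard deviation (equal n)."""
    pooled_sd = ((sd1 ** 2 + sd2 ** 2) / 2) ** 0.5
    return (mean1 - mean2) / pooled_sd

d = cohens_d(3.78, 0.91, 3.65, 0.83)
print(round(d, 2))  # 0.15 -- conventionally a "small" effect
```

By Cohen’s conventions, values around 0.2 or below count as small, consistent with the slide’s description.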
Results (RQ1): Marking models & impact on candidate CEFR classifications (individual level)

• 66.3% of candidates received the exact same CEFR classification under the Holistic and By-Part marking models
• The majority of disagreements fall within an adjacent upper/lower band (98%)
• Trend of higher CEFR levels when Holistic marking is used

Holistic vs. By-Part   % of agreements
-2                     0.0
-1                     10.0
Exact (0)              66.3
+1                     22.4
+2                     1.3
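The agreement percentages above can be tabulated mechanically from two sets of level classifications. A minimal sketch with hypothetical data (the candidate levels below are invented, not from the study):

```python
# Illustrative sketch (hypothetical data): tabulating exact and adjacent
# agreement between two marking models, as in the table on the slide.
from collections import Counter

def agreement_table(holistic, by_part):
    """Percentage of candidates at each Holistic-minus-By-Part difference."""
    diffs = Counter(h - b for h, b in zip(holistic, by_part))
    n = len(holistic)
    return {d: 100 * c / n for d, c in sorted(diffs.items())}

# Hypothetical CEFR levels (1=A1 ... 6=C2) for four candidates:
print(agreement_table([3, 4, 3, 5], [3, 3, 3, 5]))  # {0: 75.0, 1: 25.0}
```

A difference of 0 is an exact match; ±1 is an adjacent-band disagreement, the category covering nearly all disagreements in the study.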
Results (RQ2): How statistically distinct are the different test parts?

• Narrow spread of difficulty
• Two statistically distinct difficulty strata
• Moderate separation reliability

Part     Measure (logits)   Fair-M Average   Model S.E.   Infit Mean Square
Part 1   0.17               3.57             0.08         1.51
Part 2   0.01               3.60             0.08         0.67
Part 3   -0.23              3.65             0.08         0.67
Part 4   0.06               3.59             0.08         1.09

Separation 1.41; Strata 2.21; Reliability 0.67
Model, Fixed (all same) chi-square: 12.0; d.f.: 3; significance (probability): .01
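The strata value reported with the table follows directly from the separation index via the standard Rasch relation H = (4G + 1) / 3. A minimal sketch (not FACETS output) reproducing the slide’s figure:

```python
# Illustrative sketch: the Rasch strata index H = (4G + 1) / 3,
# where G is the separation index reported by FACETS.

def strata(separation: float) -> float:
    """Number of statistically distinct performance strata."""
    return (4 * separation + 1) / 3

print(round(strata(1.41), 2))  # 2.21 -- matches the slide's Strata value
print(round(strata(0.00), 2))  # 0.33
```

A strata value just over 2 supports the bullet point above: the four parts occupy roughly two statistically distinct difficulty levels.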
Raters’ Perspective

Preference expressed for marking by part:

‘A candidate's performance can vary quite widely in different parts of the test, so to give an overall mark could be difficult to do and/or less reliable. To give marks for each criterion in each part of the test is overcomplicated, and unrealistic to expect examiners to do while listening to a test. I am confident however in marking by part, as I feel the marks awarded are an accurate reflection of candidates' levels.’

BUT

Not enough language elicited in each part:

‘Each section has its limits and there will not always be enough evidence to mark each part’
RQ3: Do raters perform differently when test parts from different candidates are allocated at random compared to marking the same candidates’ performances on different parts?

Cand   Rater   Marking Condition   P1    P2   P3    P4
31     MG      Part                3.5   4    3.5   3.5
31     MG      Random              3.5   4    3     3.5
71     ML      Part                2     3    3     3
71     ML      Random              2     3    3     3
135    MW      Part                5     5    5.5   5
135    MW      Random              4.5   5    6     6
233    NA      Part                3.5   3    3     3
233    NA      Random              3.5   3    3     3
Results (RQ3)

Marking Condition            Measure (logits)   Fair-M Average   Model S.E.   Infit Mean Square
Part (candidate known)       -0.05              3.63             0.04         0.91
Random (candidate unknown)   0.05               3.60             0.06         1.37

Separation 0.00; Strata 0.33; Reliability 0.00
Model, Fixed (all same) chi-square: 1.8; d.f.: 1; significance (probability): .18

• No statistically significant difference between the two conditions
• Higher Infit Mean Square for the Random condition, but within acceptable control limits
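For readers unfamiliar with the fit statistic in these tables: the infit mean square is an information-weighted ratio of observed squared residuals to model variance, with 1.0 indicating data that fit the Rasch model as expected. A minimal sketch with hypothetical dichotomous responses (the probabilities below are invented for illustration):

```python
# Illustrative sketch (hypothetical data): infit mean square as
# sum of squared score residuals / sum of model variances.

def infit_mean_square(observed, expected, variances):
    """Information-weighted mean-square fit statistic."""
    residual_sq = sum((x - e) ** 2 for x, e in zip(observed, expected))
    return residual_sq / sum(variances)

# For dichotomous Rasch responses with model probability p:
# expected = p, variance = p * (1 - p).
p = [0.8, 0.6, 0.3]
x = [1, 1, 0]
print(round(infit_mean_square(x, p, [q * (1 - q) for q in p]), 2))  # 0.48
```

Values well above 1 (such as the Random condition’s 1.37) signal more noise than the model expects, though as the slide notes this remained within acceptable control limits.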
Summary

• Similar distribution of CEFR levels and mean scores for the two marking models
• Statistically significantly lower mean scores in the By-Part marking model, with a small effect size
• 66.3% of candidates received the exact same CEFR classification
• 98% fell within adjacent upper or lower bands, with a trend of higher CEFR levels under Holistic marking
Summary (cont.)

• Different test parts exhibited similar difficulty measures
• Raters did not perform differently when test parts were allocated at random
• Raters expressed a preference for marking by part, seeing it as increasing test validity and reliability
So what?

• The choice of marking model does have an impact on CEFR classifications
• Stronger arguments for marking by part
BUT
• Test purpose, assessment context and practical constraints are key considerations
• Alternative approach? e.g. mark by part but match specific criteria to task types
Thank you!
References

Linacre, J. M. (2011). Facets computer program for many-facet Rasch measurement. Beaverton, Oregon: Winsteps.

Nakatsuhara, F. (2011). The relationship between test-takers' listening proficiency and their performance on the IELTS Speaking Test. In J. Osborne (ed.), IELTS Research Reports, vol. 12. IDP: IELTS Australia and British Council. Available online at: www.ielts.org/researchers/research/volume_12.aspx

O’Sullivan, B. (2007). Modelling performance in tests of spoken language. Frankfurt: Peter Lang.

O'Sullivan, B. and Nakatsuhara, F. (2011). Quantifying conversational styles in group oral test discourse. In B. O'Sullivan (ed.), Language testing: theories and practices. Basingstoke: Palgrave Macmillan, pp. 164-185.

Taylor, L. and Galaczi, E. (2011). Scoring validity. In L. Taylor (ed.), Examining speaking: research and practice in assessing second language speaking, Studies in Language Testing 30. Cambridge: UCLES/Cambridge University Press, pp. 171-233.