© Cambridge English Language Assessment 2015
Score Comparability of
Holistic and By-Part
Marking Models
Nahal Khabbazbashi and Evelina Galaczi Cambridge English Language Assessment
Language Testing Forum
Department of Education, University of Oxford
November 2015
Marking by part:
Is it really worth the effort?
Why mark by part?

• A single score for a number of tasks may not provide a clear reflection of a candidate’s speaking proficiency (O’Sullivan, 2007; O’Sullivan & Nakatsuhara, 2011; Taylor & Galaczi, 2011)
• Task types affect candidates’ conversational styles (O’Sullivan & Nakatsuhara, 2011)
• Examiners’ final decision making may be impacted by particularly good or bad performance on a given task (O’Sullivan & Nakatsuhara, 2011)
• Noticeable differences in band scores of candidates on different parts of the IELTS speaking test (Nakatsuhara, 2011)
Why not mark by part?

• Practical constraints
• Cognitive load on examiners
• Sufficient speech sample for reliable scoring

[Diagram: aggregating scores across Parts 1 to n into an overall mark vs. awarding a single overall mark — how do the two compare?]

It’s an empirical question!
Assessment Context: Cambridge English Placement Test

Part X  Read aloud (sound check)
Part 1  Short interview questions (personal topics)
Part 2  Describing and comparing two photographs
Part 3  Responding to a series of questions on a familiar topic
Part 4  Long turn on an abstract topic (prompts provided)
Marking Model: Cambridge English Placement Test

Part X  Read aloud (sound check)
Part 1  Short interview questions (personal topics)
Part 2  Describing and comparing two photographs
Part 3  Responding to a series of questions on a familiar topic
Part 4  Long turn on an abstract topic (prompts provided)

Single Holistic Mark

Criteria:
• Coherence & Discourse Management
• Language Resource
• Pronunciation
• Hesitation & Extent
Research Questions

RQ1: What is the relationship between Holistic and By-Part marking models in an online speaking test?

RQ2: To what extent are the different parts of the speaking test statistically distinct?

RQ3: Do raters perform differently when test parts from different candidates are allocated at random compared to marking the same candidates’ performances on different parts?
Method

Test Set
• CEPT trial data (N=240)
• Live examiners (N=10)
• Range of L1s
• Range of CEFR levels

Rating Matrix — designed to ensure that:
(a) candidates in the test set receive at least one holistic mark and one mark for each test part;
(b) raters are linked through common performances to meet the requirements of Many-Facet Rasch Measurement (MFRM); and
(c) halo effects are minimised.

Rating Process
• Raters (N=4)
• Two rounds of rating (counter-balanced design)

Analysis
• MFRM using FACETS (Linacre, 2011)
• Separate analyses for the different marking models (Holistic, By-Part, Random)
Rating Matrix

Rater 1
  Round 1: Candidates 1-30 + CB (By Part); Candidates 31-60 + CB (Holistic)
           Isolated parts: Part 1 (61-70), Part 2 (71-80), Part 3 (81-90), Part 4 (201-210); Parts 1-4 (CB)
  Round 2: Candidates 1-30 + CB (Holistic); Candidates 61-90 + CB (By Part)
           Isolated parts: Part 1 (1-10), Part 2 (11-20), Part 3 (21-30), Part 4 (221-230)

Rater 2
  Round 1: Candidates 61-90 + CB (Holistic); Candidates 91-120 + CB (By Part)
           Isolated parts: Part 1 (11-20), Part 2 (1-10), Part 3 (31-40), Part 4 (41-50); Parts 1-4 (CB)
  Round 2: Candidates 31-60 + CB (By Part); Candidates 91-120 + CB (Holistic)
           Isolated parts: Part 1 (91-100), Part 2 (201-210), Part 3 (221-230), Part 4 (101-110)
Results (RQ1): Marking models & impact on candidate scores (group level)

CEFR Level     Holistic (%)   By-Part (%)
A1 (1)         3.3            3.8
A2 (2)         10.9           15.0
B1 (3)         48.8           52.1
B2 (4)         26.2           22.0
C1 (5)         8.3            5.4
C2 (6)         2.5            1.7
Mean (N=240)   3.78           3.65
SD (N=240)     0.91           0.83

• Similar distribution of CEFR levels & similar mean scores
BUT
• The By-Part model resulted in a lower percentage of candidates at the higher CEFR levels and a statistically significantly lower overall mean (p=0.01) with a small effect size.
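The small effect size can be illustrated from the slide’s own summary statistics. This is a sketch, not the authors’ analysis: it uses Cohen’s d with a simple pooled standard deviation, whereas the paired-design statistic the study actually computed may differ slightly.

```python
# Illustrative sketch: Cohen's d for the Holistic vs. By-Part means
# reported in the results table (3.78 vs. 3.65; SDs 0.91 and 0.83).
# Assumes equal group sizes; not the authors' own computation.

def cohens_d(mean1: float, sd1: float, mean2: float, sd2: float) -> float:
    """Cohen's d with a pooled standard deviation (equal n)."""
    pooled_sd = ((sd1 ** 2 + sd2 ** 2) / 2) ** 0.5
    return (mean1 - mean2) / pooled_sd

d = cohens_d(3.78, 0.91, 3.65, 0.83)
print(round(d, 2))  # 0.15 -- conventionally a "small" effect
```

By Cohen’s conventions, values around 0.2 or below count as small, consistent with the slide’s description.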
Results (RQ1): Marking models & impact on candidate CEFR classifications (individual level)

• 66.3% of candidates received the exact same CEFR classification under the Holistic and By-Part marking models
• The majority of disagreements fall within an adjacent upper/lower band (98%)
• Trend of higher CEFR levels when Holistic marking is used

Holistic vs. By-Part   % of agreements
-2                     0.0
-1                     10.0
Exact (0)              66.3
+1                     22.4
+2                     1.3
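The agreement percentages above can be tabulated mechanically from two sets of level classifications. A minimal sketch with hypothetical data (the candidate levels below are invented, not from the study):

```python
# Illustrative sketch (hypothetical data): tabulating exact and adjacent
# agreement between two marking models, as in the table on the slide.
from collections import Counter

def agreement_table(holistic, by_part):
    """Percentage of candidates at each Holistic-minus-By-Part difference."""
    diffs = Counter(h - b for h, b in zip(holistic, by_part))
    n = len(holistic)
    return {d: 100 * c / n for d, c in sorted(diffs.items())}

# Hypothetical CEFR levels (1=A1 ... 6=C2) for four candidates:
print(agreement_table([3, 4, 3, 5], [3, 3, 3, 5]))  # {0: 75.0, 1: 25.0}
```

A difference of 0 is an exact match; ±1 is an adjacent-band disagreement, the category covering nearly all disagreements in the study.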
Results (RQ2): How statistically distinct are the different test parts?

• Narrow spread of difficulty
• Two statistically distinct difficulty strata
• Moderate separation reliability

Part     Measure (logits)   Fair-M Average   Model S.E.   Infit Mean Square
Part 1   0.17               3.57             0.08         1.51
Part 2   0.01               3.60             0.08         0.67
Part 3   -0.23              3.65             0.08         0.67
Part 4   0.06               3.59             0.08         1.09

Separation 1.41; Strata 2.21; Reliability 0.67
Model, Fixed (all same) chi-square: 12.0; d.f.: 3; significance (probability): .01
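The strata value reported with the table follows directly from the separation index via the standard Rasch relation H = (4G + 1) / 3. A minimal sketch (not FACETS output) reproducing the slide’s figure:

```python
# Illustrative sketch: the Rasch strata index H = (4G + 1) / 3,
# where G is the separation index reported by FACETS.

def strata(separation: float) -> float:
    """Number of statistically distinct performance strata."""
    return (4 * separation + 1) / 3

print(round(strata(1.41), 2))  # 2.21 -- matches the slide's Strata value
print(round(strata(0.00), 2))  # 0.33
```

A strata value just over 2 supports the bullet point above: the four parts occupy roughly two statistically distinct difficulty levels.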
Raters’ Perspective

Preference expressed for marking by part:

‘A candidate's performance can vary quite widely in different parts of the test, so to give an overall mark could be difficult to do and/or less reliable. To give marks for each criterion in each part of the test is overcomplicated, and unrealistic to expect examiners to do while listening to a test. I am confident however in marking by part, as I feel the marks awarded are an accurate reflection of candidates' levels.’

BUT

Not enough language elicited in each part:

‘Each section has its limits and there will not always be enough evidence to mark each part’
RQ3: Do raters perform differently when test parts from different candidates are allocated at random compared to marking the same candidates’ performances on different parts?

Cand   Rater   Marking Condition   P1    P2   P3    P4
31     MG      Part                3.5   4    3.5   3.5
31     MG      Random              3.5   4    3     3.5
71     ML      Part                2     3    3     3
71     ML      Random              2     3    3     3
135    MW      Part                5     5    5.5   5
135    MW      Random              4.5   5    6     6
233    NA      Part                3.5   3    3     3
233    NA      Random              3.5   3    3     3
Results (RQ3)

Marking Condition            Measure (logits)   Fair-M Average   Model S.E.   Infit Mean Square
Part (candidate known)       -0.05              3.63             0.04         0.91
Random (candidate unknown)   0.05               3.60             0.06         1.37

Separation 0.00; Strata 0.33; Reliability 0.00
Model, Fixed (all same) chi-square: 1.8; d.f.: 1; significance (probability): .18

• No statistically significant difference between the two conditions
• Higher Infit Mean Square for the Random condition, but within acceptable control limits
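For readers unfamiliar with the fit statistic in these tables: the infit mean square is an information-weighted ratio of observed squared residuals to model variance, with 1.0 indicating data that fit the Rasch model as expected. A minimal sketch with hypothetical dichotomous responses (the probabilities below are invented for illustration):

```python
# Illustrative sketch (hypothetical data): infit mean square as
# sum of squared score residuals / sum of model variances.

def infit_mean_square(observed, expected, variances):
    """Information-weighted mean-square fit statistic."""
    residual_sq = sum((x - e) ** 2 for x, e in zip(observed, expected))
    return residual_sq / sum(variances)

# For dichotomous Rasch responses with model probability p:
# expected = p, variance = p * (1 - p).
p = [0.8, 0.6, 0.3]
x = [1, 1, 0]
print(round(infit_mean_square(x, p, [q * (1 - q) for q in p]), 2))  # 0.48
```

Values well above 1 (such as the Random condition’s 1.37) signal more noise than the model expects, though as the slide notes this remained within acceptable control limits.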
Summary

• Similar distribution of CEFR levels and mean scores for the two marking models
• Statistically significantly lower mean scores in the By-Part marking model, with a small effect size
• 66.3% of candidates received the exact same CEFR classification
• 98% fell within adjacent upper or lower bands, with a trend of higher CEFR levels under Holistic marking
Summary (cont.)

• Different test parts exhibited similar difficulty measures
• Raters did not perform differently when test parts were allocated at random
• Raters expressed a preference for marking by part, seeing it as increasing test validity and reliability
So what?

• The choice of marking model does have an impact on CEFR classifications
• Stronger arguments for marking by part
BUT
• Test purpose, assessment context and practical constraints are key considerations
• Alternative approach? e.g. mark by part but match specific criteria to task types
Thank you!
References

Linacre, J. M. (2011). Facets computer program for many-facet Rasch measurement. Beaverton, Oregon: Winsteps.

Nakatsuhara, F. (2011). The relationship between test-takers' listening proficiency and their performance on the IELTS Speaking Test. In J. Osborne (ed.), IELTS Research Reports, vol. 12. IDP: IELTS Australia and British Council. Available online at: www.ielts.org/researchers/research/volume_12.aspx

O’Sullivan, B. (2007). Modelling performance in tests of spoken language. Frankfurt: Peter Lang.

O'Sullivan, B. and Nakatsuhara, F. (2011). Quantifying conversational styles in group oral test discourse. In B. O'Sullivan (ed.), Language testing: theories and practices. Basingstoke: Palgrave Macmillan, pp. 164-185.

Taylor, L. and Galaczi, E. (2011). Scoring validity. In L. Taylor (ed.), Examining speaking: research and practice in assessing second language speaking, Studies in Language Testing 30. Cambridge: UCLES/Cambridge University Press, pp. 171-233.