Showcasing the potential of error-annotated learner corpora for profiling research


Jennifer Thewissen

Centre for English Corpus Linguistics (CECL)


Profiling research

Definition: finding ‘criterial features’ that discriminate between different levels of proficiency (e.g. Hawkins & Buttery, 2010)

CEF levels: C2, C1, B2, B1, A2, A1


Feature we focussed on

Construct of accuracy, viz. errors

Focus on four proficiency levels, viz. B1, B2, C1, C2

Aim: to see whether errors constitute a ‘criterial feature’ that distinguishes these levels


Data & methodology


International Corpus of Learner English (Granger et al., 2009)

L1       Total scripts   Total tokens
FR       74              50060
GE       71              49540
SP       78              51385
Total    223             150985

Threefold analysis

Error annotation, i.e. error tagging phase

CEF rating phase

Error counting phase


Error annotation

Broad error category   Description
F                      Form, spelling errors
G                      Grammatical errors
L                      Lexical errors
X                      Lexico-grammatical errors
Q                      Punctuation errors
W                      Word missing, word redundant, word order
S                      Sentence unclear, incomplete


Error tagging examples

The fast spread of television can transform it into a double-edged (FS) wheapon $weapon$.

I will try to give several (XNUC) proofs $proof$ of the truth of the sentence.

46 error subcategories

Result: a detailed error profile per text
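The two tagged examples above follow a visible pattern: an error tag in parentheses, the erroneous form, then the correction between dollar signs. A minimal Python sketch of how such tags could be pulled out of a text and counted into an error profile is given below. The regular expression, the function names and the idea that the first letter of a tag gives the broad category are assumptions made for illustration; they are not the tool actually used in the study.

```python
import re
from collections import Counter

# Assumed tag layout, inferred from the two examples: (TAG) error $correction$
TAG_PATTERN = re.compile(
    r"\((?P<tag>[A-Z]+)\)\s+(?P<error>[^$()]+?)\s+\$(?P<correction>[^$]*)\$"
)

# Broad categories from the error annotation table; mapping a tag to its
# broad category via its first letter is an assumption of this sketch.
BROAD_CATEGORIES = {
    "F": "Form, spelling",
    "G": "Grammatical",
    "L": "Lexical",
    "X": "Lexico-grammatical",
    "Q": "Punctuation",
    "W": "Word missing, word redundant, word order",
    "S": "Sentence unclear, incomplete",
}


def extract_errors(text):
    """Return (tag, erroneous form, correction) triples found in a tagged text."""
    return [
        (m.group("tag"), m.group("error"), m.group("correction"))
        for m in TAG_PATTERN.finditer(text)
    ]


def error_profile(text):
    """Count how often each error tag occurs in one text."""
    return Counter(tag for tag, _, _ in extract_errors(text))


sample = (
    "The fast spread of television can transform it into a double-edged "
    "(FS) wheapon $weapon$. I will try to give several (XNUC) proofs $proof$ "
    "of the truth of the sentence."
)

errors = extract_errors(sample)
profile = error_profile(sample)
for tag, wrong, right in errors:
    print(tag, BROAD_CATEGORIES[tag[0]], wrong, "->", right)
```

Run over a whole script, `error_profile` would give one count per tag, which is the kind of per-text error profile the slide refers to.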


The CEF rating procedure

Individual rating of the 223 learner scripts according to the linguistic descriptors in the Common European Framework of Reference for Languages (CEF) (Council of Europe, 2001)

B1, B2, C1 or C2 (with + and – increments)

2 professional raters (+ 1 rater in cases of wide disagreement) (r = 0.70)

Tracking development


CEF score + error profile

Development: Progress? Stabilisation? Regression?


Error counting: potential occasion analysis (GNN)

Learner corpus sample

Error-tagged data: total noun-number errors

POS-tagged data (CLAWS7): total nouns used
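The potential occasion idea above can be sketched in a few lines of Python: the number of noun-number (GNN) errors is set against the number of nouns the learner actually used. The sketch below assumes a `word_TAG` token format for the POS-tagged text, treats every CLAWS7 tag starting with `N` as a noun, and expresses the result per 100 nouns. All three choices, like the function names and the sample sentence, are illustrative assumptions and not details confirmed by the slides.

```python
def count_nouns(tagged_text):
    """Count tokens whose POS tag starts with 'N' (assumed to cover CLAWS7 noun tags)."""
    count = 0
    for token in tagged_text.split():
        word, sep, tag = token.rpartition("_")  # assumed word_TAG format
        if sep and tag.upper().startswith("N"):
            count += 1
    return count


def potential_occasion_rate(error_count, occasion_count):
    """Errors per 100 potential occasions (here: per 100 nouns used)."""
    if occasion_count == 0:
        raise ValueError("no potential occasions, rate is undefined")
    return 100.0 * error_count / occasion_count


# Hypothetical tagged sentence with one noun-number error ('book' for 'books').
sample = "The_AT students_NN2 read_VVD many_DA2 book_NN1 ._."
nouns_used = count_nouns(sample)
gnn_errors = 1  # hypothetical count taken from the error-tagged version

rate = potential_occasion_rate(gnn_errors, nouns_used)
print(f"{gnn_errors} GNN error(s) in {nouns_used} nouns = {rate:.1f} per 100 nouns")
```

Normalising by nouns rather than by total tokens means a learner who uses few nouns is not credited with a low noun-number error rate simply for lack of opportunity.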


Statistical analyses: ANOVA & Ryan (GNN)

CEF score   N    Ryan-derived groupings
C2          28   0.32
C1          67   0.70   0.70
B2          62          0.99   0.99
B1          66                 1.23

GNN = [B1/B2]>[B2/C1]>[C1/C2]
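The bracket notation above can be read straight off the groupings: levels that share a grouping are not distinguished from each other, and the groupings are ordered from the highest to the lowest mean value. A small Python sketch of that reading follows. The assignment of the means to three overlapping groupings is inferred from the table and the summary line, and the function name is hypothetical.

```python
# Mean GNN values per CEF level, as given in the table.
means = {"C2": 0.32, "C1": 0.70, "B2": 0.99, "B1": 1.23}

# Groupings as inferred from the table (assumed column layout).
subsets = [["C2", "C1"], ["C1", "B2"], ["B2", "B1"]]


def pattern_notation(means, subsets):
    """Write groupings as [X/Y]>[Y/Z], ordered from highest to lowest mean."""

    def subset_mean(levels):
        return sum(means[lv] for lv in levels) / len(levels)

    ordered = sorted(subsets, key=subset_mean, reverse=True)
    parts = []
    for levels in ordered:
        by_value = sorted(levels, key=lambda lv: means[lv], reverse=True)
        parts.append("[" + "/".join(by_value) + "]")
    return ">".join(parts)


notation = pattern_notation(means, subsets)
print("GNN = " + notation)
```

With the values from the table this reproduces the summary line for GNN shown above.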

Results for profiling research


4 main error developmental patterns

Error developmental pattern           Illustration
Improvement-only pattern              B1>B2>C1>C2
Improvement & stabilisation pattern   e.g. B1>[B2/C1/C2]
Stabilisation-only pattern            [B1/B2/C1/C2]
Partly regressive pattern             B2>B1
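One way to make the four patterns above concrete is to look at each step between adjacent levels (B1 to B2, B2 to C1, C1 to C2) and record whether the error rate is significantly lower, not significantly different, or significantly higher. The Python sketch below assigns a pattern on that basis. The three step labels and the decision rules are an assumed operationalisation for illustration only, not the procedure reported in the study.

```python
ALLOWED = {"fewer", "same", "more"}


def classify_pattern(steps):
    """Classify a sequence of adjacent-level outcomes (B1->B2, B2->C1, C1->C2).

    Each outcome is 'fewer' (significantly fewer errors at the higher level),
    'same' (no significant difference) or 'more' (significantly more errors).
    """
    if not steps or any(step not in ALLOWED for step in steps):
        raise ValueError("each step must be 'fewer', 'same' or 'more'")
    if "more" in steps:
        return "Partly regressive pattern"
    if all(step == "fewer" for step in steps):
        return "Improvement-only pattern"
    if all(step == "same" for step in steps):
        return "Stabilisation-only pattern"
    return "Improvement & stabilisation pattern"


# B1>[B2/C1/C2]: progress from B1 to B2, then no further significant change.
example = classify_pattern(("fewer", "same", "same"))
print(example)
```

Under these assumed rules, B1>B2>C1>C2 corresponds to three 'fewer' steps and [B1/B2/C1/C2] to three 'same' steps.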

Two dominating error patterns

Dominating error pattern   Number of error categories   Examples
B1>[B2/C1/C2]              17 (37%)                     Spelling, uncountable nouns, lexical phrases, adjective number errors, unclear sentences
[B1/B2/C1/C2]              16 (35%)                     Tenses, punctuation confusion, verb complementation, noun complementation
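The percentages in the table are consistent with the 46 error subcategories of the tagging scheme being the denominator. A short Python check, assuming that denominator, is given below; the variable names are illustrative.

```python
TOTAL_SUBCATEGORIES = 46  # number of error subcategories in the tagging scheme

pattern_counts = {
    "B1>[B2/C1/C2]": 17,
    "[B1/B2/C1/C2]": 16,
}

# Share of all subcategories following each pattern, rounded to whole percent.
shares = {
    pattern: round(100 * count / TOTAL_SUBCATEGORIES)
    for pattern, count in pattern_counts.items()
}

# Share covered by the two dominating patterns together.
combined = round(100 * sum(pattern_counts.values()) / TOTAL_SUBCATEGORIES)

for pattern, share in shares.items():
    print(f"{pattern}: {share}%")
print(f"Both patterns together: {combined}%")
```

On this reading, the two dominating patterns together account for 33 of the 46 subcategories.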


Where do progress and stabilisation mainly occur? Discriminating power of errors

Adjacent proficiency levels   Number of discriminating error types
B1>B2                         20
B2>C1                         3
C1>C2                         2
[B2/C1/C2]                    33

Preliminary observations for profiling research


Some concluding remarks

Errors (negative features): stronger discriminatory power between certain levels (viz. B1 vs. B2) than others (viz. B2 vs. C1 vs. C2)

Need to capture features other than errors (e.g. positive features)

Conclusion for profiling research: errors are useful but they are not enough in and of themselves

