Augmenting the kappa statistic to determine interannotator reliability for multiply labeled data points
Andrew Rosenberg and Ed Binkowski
5/4/2004
Overview
• Corpus Description
• Kappa Shortcomings
• Kappa Augmentation
• Classification of messages
• Next step: Sharpening method
Corpus Description
• 312 email messages exchanged within the Columbia chapter of the ACM.
• Annotated by 2 annotators with one or two of the following 10 labels:
– question, answer, broadcast, attachment transmission, planning, planning scheduling, planning-meeting scheduling, action item, technical discussion, social chat
Kappa Shortcomings
• Kappa is used to determine interannotator reliability and validate gold standard corpora.
• p(A) - # observed agreements / # data points
• p(E) - # expected agreements / # data points
• How do you determine agreement with an optional secondary label?
K = (p(A) - p(E)) / (1 - p(E))
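For concreteness, here is a minimal Python sketch of the standard two-annotator kappa for single-label data; the function and variable names are illustrative, not from the slides.

```python
from collections import Counter

def kappa(labels_1, labels_2):
    """Standard two-annotator kappa for single-label data (illustrative sketch)."""
    n = len(labels_1)
    # p(A): fraction of messages on which the two annotators agree exactly.
    p_a = sum(l1 == l2 for l1, l2 in zip(labels_1, labels_2)) / n
    # p(E): chance agreement, from each annotator's relative label frequencies.
    freq_1, freq_2 = Counter(labels_1), Counter(labels_2)
    p_e = sum((freq_1[l] / n) * (freq_2[l] / n)
              for l in set(labels_1) | set(labels_2))
    return (p_a - p_e) / (1 - p_e)
```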
Kappa Shortcomings (ctd.)
• Ignoring the secondary label isn’t acceptable for two reasons.
– It is inconsistent with the annotation guidelines.
– It ignores partial agreements.
• {a,ba} - singleton matches secondary
• {ab,ca} - primary matches secondary
• {ab,cb} - secondary matches secondary
• {ab,ba} - secondary matches primary, and vice versa
• Note: The purpose is not to inflate the kappa value, but to accurately assess the data.
Kappa Augmentation
• When a labeler employs a secondary label, consider it as a single annotation divided between two categories.
• Select a value of p, where 0.5 ≤ p ≤ 1.0, based on how heavily to weight the secondary label (a code sketch follows this list):
– Singleton annotations assigned a score of 1.0
– Primary: p
– Secondary: 1 - p
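One way to realize this weighting, as a sketch in Python: each annotation becomes a vector over the label set, with all of the mass on a singleton label, p on a primary label, and 1 - p on a secondary label. The helper name and the tuple representation of an annotation are illustrative assumptions.

```python
def annotation_vector(labels, label_set, p):
    """Turn one annotation into a weighted vector over the label set.

    labels: a 1-tuple (singleton,) or a 2-tuple (primary, secondary).
    p: weight on the primary label, with 0.5 <= p <= 1.0.
    """
    vec = {label: 0.0 for label in label_set}
    if len(labels) == 1:
        vec[labels[0]] = 1.0        # singleton annotation scores 1.0
    else:
        primary, secondary = labels
        vec[primary] = p            # primary label gets p
        vec[secondary] = 1.0 - p    # secondary label gets 1 - p
    return vec
```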
Kappa Augmentation: Counting Agreements
• To calculate p(A), sum the agreement scores and divide by the number of messages.
• Partial agreements are counted as in the example below (and in the code sketch that follows):
Annotator 1: {a} → vector over (a, b, c, d) = (1, 0, 0, 0)
Annotator 2: {ba} → vector over (a, b, c, d) = (1-p, p, 0, 0)
Score = 1*(1-p) + 0*p = 1-p
• p(E) is calculated using the relative frequencies of label use based on the annotation vectors.
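Putting the pieces together, here is a sketch of kappa’ that reuses the annotation_vector helper above. It takes the dot product of the two annotators’ vectors as the per-message agreement score, which reproduces the worked example; the slides do not spell out the combination rule, so treat this as one plausible reading.

```python
def augmented_kappa(annotations_1, annotations_2, label_set, p):
    """kappa' for two annotators whose annotations may carry a secondary label."""
    n = len(annotations_1)
    vecs_1 = [annotation_vector(a, label_set, p) for a in annotations_1]
    vecs_2 = [annotation_vector(a, label_set, p) for a in annotations_2]

    # p(A): per-message agreement score (dot product of the two annotation
    # vectors), summed over messages and divided by the number of messages.
    p_a = sum(sum(v1[l] * v2[l] for l in label_set)
              for v1, v2 in zip(vecs_1, vecs_2)) / n

    # p(E): chance agreement from each annotator's relative label-use
    # frequencies, read off the annotation vectors.
    marg_1 = {l: sum(v[l] for v in vecs_1) / n for l in label_set}
    marg_2 = {l: sum(v[l] for v in vecs_2) / n for l in label_set}
    p_e = sum(marg_1[l] * marg_2[l] for l in label_set)

    return (p_a - p_e) / (1 - p_e)
```

Note that at p = 1.0 the secondary label carries no weight, so this reduces to scoring primary labels only; at p = 0.5 the primary and secondary labels are weighted equally.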
Classification of messages
• This augmentation allows us to classify messages based on their individual kappa’ values at different values of p (a classification sketch follows this list).
– Class 1: high kappa’ at all values of p.
• Use in ML experiments.
– Class 2: low kappa’ at all values of p.
• Discard.
– Class 3: high kappa’ at p = 1.0.
• Ignore the secondary label.
– Class 4: high kappa’ at p = 0.5.
• Use to revise the annotation manual.
• Note: mathematically kappa’ needn’t be monotonic w.r.t. p, but with 2 annotators it is.
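Assuming per-message kappa’ values at p = 0.5 and p = 1.0 are already available, the four-way classification can be sketched as below; the 0.67 threshold is an illustrative cut-off for “high” agreement, not a value from the slides.

```python
def classify_message(kappa_at_half, kappa_at_one, threshold=0.67):
    """Classify one message from its kappa' values at p = 0.5 and p = 1.0."""
    high_at_half = kappa_at_half >= threshold
    high_at_one = kappa_at_one >= threshold
    if high_at_half and high_at_one:
        return 1   # constant high: use in ML experiments
    if not high_at_half and not high_at_one:
        return 2   # constant low: discard
    if high_at_one:
        return 3   # low to high: ignore the secondary label
    return 4       # high to low: use to revise the annotation manual
```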
[Four plots, one per message class: kappa’ (y-axis, 0 to 1) as a function of p (x-axis, 0 to 1).]
Corpus Annotation Analysis: Class Distribution
Constant high (Class 1): 82 messages (26.3%)
Constant low (Class 2): 150 messages (48.1%)
Low to high (Class 3): 40 messages (12.8%)
High to low (Class 4): 40 messages (12.8%)
Total: 312 messages
Next step: Sharpening method
• How can a gold standard corpus be obtained when an annotation effort yields a low kappa?
• In determining interannotator agreement with kappa, etc., two available pieces of information are overlooked:
– Some annotators are “better” than others
– Some messages are “easier to label” than others
• By limiting the contribution of known poor annotators and difficult messages, we gain confidence in the final category assignment of each message.
Sharpening Method (ctd.)
• Ranking annotators
– “Better” annotators have higher agreement with the group.
• Ranking messages
– Messages whose annotations have high variance over the label set (i.e., are concentrated on a few labels) are more consistently annotated.
• To improve confidence in annotations (sketched in code after this list):
1. Weight annotator contributions, and recompute message rankings.
2. Weight message contributions, and recompute annotator rankings.
3. Repeat until convergence.
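The slides do not give the exact weighting formulas, so the following is only a schematic sketch of the alternating loop with illustrative scoring choices: annotators are weighted by agreement with a weighted group consensus, and messages by how concentrated (high-variance) their consensus label distribution is.

```python
import numpy as np

def sharpen(votes, n_iters=20):
    """Alternating re-weighting sketch (illustrative scoring choices).

    votes: array of shape (n_annotators, n_messages, n_labels); votes[a, m]
    is annotator a's annotation vector for message m (each row sums to 1).
    """
    n_annotators, n_messages, _ = votes.shape
    annotator_w = np.ones(n_annotators) / n_annotators
    message_w = np.ones(n_messages) / n_messages

    for _ in range(n_iters):  # fixed iteration count stands in for a convergence test
        # Weighted group consensus distribution for each message.
        consensus = np.einsum('a,aml->ml', annotator_w, votes)

        # 1. Re-rank annotators: agreement with the consensus, with
        #    difficult (low-weight) messages contributing less.
        agreement = np.einsum('aml,ml->am', votes, consensus)
        annotator_w = agreement @ message_w
        annotator_w /= annotator_w.sum()

        # 2. Re-rank messages: more weight for messages whose consensus is
        #    concentrated on a few labels (high variance over the label set).
        message_w = consensus.var(axis=1)
        message_w /= message_w.sum()

    # Final category assignment: most probable label under the consensus.
    return consensus.argmax(axis=1), annotator_w, message_w
```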