Augmenting the kappa statistic to determine interannotator reliability for multiply labeled data points
Andrew Rosenberg and Ed Binkowski
5/4/2004
Overview
• Corpus Description
• Kappa Shortcomings
• Kappa Augmentation
• Classification of messages
• Next step: Sharpening method
Corpus Description
• 312 email messages exchanged within the Columbia chapter of the ACM.
• Annotated by 2 annotators with one or two of the following 10 labels:
– question, answer, broadcast, attachment transmission, planning, planning scheduling, planning-meeting scheduling, action item, technical discussion, social chat
Kappa Shortcomings
• Kappa is used to determine interannotator reliability and validate gold standard corpora.
• p(A) - # observed agreements / # data points
• p(E) - # expected agreements / # data points
• How do you determine agreement with an optional secondary label?
K = (p(A) - p(E)) / (1 - p(E))
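For concreteness, here is a minimal Python sketch of the standard two-annotator kappa for single-label data; the function and variable names are illustrative, not from the slides.

```python
from collections import Counter

def kappa(labels_1, labels_2):
    """Standard two-annotator kappa for single-label data (illustrative sketch)."""
    n = len(labels_1)
    # p(A): fraction of messages on which the two annotators agree exactly.
    p_a = sum(l1 == l2 for l1, l2 in zip(labels_1, labels_2)) / n
    # p(E): chance agreement, from each annotator's relative label frequencies.
    freq_1, freq_2 = Counter(labels_1), Counter(labels_2)
    p_e = sum((freq_1[l] / n) * (freq_2[l] / n)
              for l in set(labels_1) | set(labels_2))
    return (p_a - p_e) / (1 - p_e)
```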
Kappa Shortcomings (ctd.)
• Ignoring the secondary label isn’t acceptable for two reasons.
– It is inconsistent with the annotation guidelines.
– It ignores partial agreements.
• {a,ba} - singleton matches secondary
• {ab,ca} - primary matches secondary
• {ab,cb} - secondary matches secondary
• {ab,ba} - secondary matches primary, and vice versa
• Note: The purpose is not to inflate the kappa value, but to accurately assess the data.
Kappa Augmentation
• When a labeler employs a secondary label, consider it as a single annotation divided between two categories.
• Select a value of p, where 0.5 ≤ p ≤ 1.0, based on how heavily to weight the secondary label (a code sketch follows this list):
– Singleton annotations assigned a score of 1.0
– Primary: p
– Secondary: 1 - p
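One way to realize this weighting, as a sketch in Python: each annotation becomes a vector over the label set, with all of the mass on a singleton label, p on a primary label, and 1 - p on a secondary label. The helper name and the tuple representation of an annotation are illustrative assumptions.

```python
def annotation_vector(labels, label_set, p):
    """Turn one annotation into a weighted vector over the label set.

    labels: a 1-tuple (singleton,) or a 2-tuple (primary, secondary).
    p: weight on the primary label, with 0.5 <= p <= 1.0.
    """
    vec = {label: 0.0 for label in label_set}
    if len(labels) == 1:
        vec[labels[0]] = 1.0        # singleton annotation scores 1.0
    else:
        primary, secondary = labels
        vec[primary] = p            # primary label gets p
        vec[secondary] = 1.0 - p    # secondary label gets 1 - p
    return vec
```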
Kappa Augmentation: Counting Agreements
• To calculate p(A), sum the agreement scores and divide by the number of messages.
• Partial agreements are counted as in the example below (and in the code sketch that follows):
Annotator 1: {a} → vector over (a, b, c, d) = (1, 0, 0, 0)
Annotator 2: {ba} → vector over (a, b, c, d) = (1-p, p, 0, 0)
Score = 1*(1-p) + 0*p = 1-p
• p(E) is calculated using the relative frequencies of label use based on the annotation vectors.
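Putting the pieces together, here is a sketch of kappa’ that reuses the annotation_vector helper above. It takes the dot product of the two annotators’ vectors as the per-message agreement score, which reproduces the worked example; the slides do not spell out the combination rule, so treat this as one plausible reading.

```python
def augmented_kappa(annotations_1, annotations_2, label_set, p):
    """kappa' for two annotators whose annotations may carry a secondary label."""
    n = len(annotations_1)
    vecs_1 = [annotation_vector(a, label_set, p) for a in annotations_1]
    vecs_2 = [annotation_vector(a, label_set, p) for a in annotations_2]

    # p(A): per-message agreement score (dot product of the two annotation
    # vectors), summed over messages and divided by the number of messages.
    p_a = sum(sum(v1[l] * v2[l] for l in label_set)
              for v1, v2 in zip(vecs_1, vecs_2)) / n

    # p(E): chance agreement from each annotator's relative label-use
    # frequencies, read off the annotation vectors.
    marg_1 = {l: sum(v[l] for v in vecs_1) / n for l in label_set}
    marg_2 = {l: sum(v[l] for v in vecs_2) / n for l in label_set}
    p_e = sum(marg_1[l] * marg_2[l] for l in label_set)

    return (p_a - p_e) / (1 - p_e)
```

Note that at p = 1.0 the secondary label carries no weight, so this reduces to scoring primary labels only; at p = 0.5 the primary and secondary labels are weighted equally.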
Classification of messages
• This augmentation allows us to classify messages based on their individual kappa’ values at different values of p (a classification sketch follows this list).
– Class 1: high kappa’ at all values of p.
• Use in ML experiments.
– Class 2: low kappa’ at all values of p.
• Discard.
– Class 3: high kappa’ at p = 1.0.
• Ignore the secondary label.
– Class 4: high kappa’ at p = 0.5.
• Use to revise the annotation manual.
• Note: mathematically kappa’ needn’t be monotonic w.r.t. p, but with 2 annotators it is.
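Assuming per-message kappa’ values at p = 0.5 and p = 1.0 are already available, the four-way classification can be sketched as below; the 0.67 threshold is an illustrative cut-off for “high” agreement, not a value from the slides.

```python
def classify_message(kappa_at_half, kappa_at_one, threshold=0.67):
    """Classify one message from its kappa' values at p = 0.5 and p = 1.0."""
    high_at_half = kappa_at_half >= threshold
    high_at_one = kappa_at_one >= threshold
    if high_at_half and high_at_one:
        return 1   # constant high: use in ML experiments
    if not high_at_half and not high_at_one:
        return 2   # constant low: discard
    if high_at_one:
        return 3   # low to high: ignore the secondary label
    return 4       # high to low: use to revise the annotation manual
```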
[Four plots, one per message class: kappa’ (y-axis, 0 to 1) as a function of p (x-axis, 0 to 1).]
Corpus Annotation Analysis: Class Distribution
Constant high (Class 1): 82 messages (26.3%)
Constant low (Class 2): 150 messages (48.1%)
Low to high (Class 3): 40 messages (12.8%)
High to low (Class 4): 40 messages (12.8%)
Total: 312 messages
Next step: Sharpening method
• How can a gold standard corpus be obtained when an annotation effort yields a low kappa?
• In determining interannotator agreement with kappa, etc., two available pieces of information are overlooked:
– Some annotators are “better” than others
– Some messages are “easier to label” than others
• By limiting the contribution of known poor annotators and difficult messages, we gain confidence in the final category assignment of each message.
Sharpening Method (ctd.)
• Ranking annotators
– “Better” annotators have higher agreement with the group.
• Ranking messages
– Messages whose annotations have high variance over the label set (i.e., are concentrated on a few labels) are more consistently annotated.
• To improve confidence in annotations (sketched in code after this list):
1. Weight annotator contributions, and recompute message rankings.
2. Weight message contributions, and recompute annotator rankings.
3. Repeat until convergence.
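The slides do not give the exact weighting formulas, so the following is only a schematic sketch of the alternating loop with illustrative scoring choices: annotators are weighted by agreement with a weighted group consensus, and messages by how concentrated (high-variance) their consensus label distribution is.

```python
import numpy as np

def sharpen(votes, n_iters=20):
    """Alternating re-weighting sketch (illustrative scoring choices).

    votes: array of shape (n_annotators, n_messages, n_labels); votes[a, m]
    is annotator a's annotation vector for message m (each row sums to 1).
    """
    n_annotators, n_messages, _ = votes.shape
    annotator_w = np.ones(n_annotators) / n_annotators
    message_w = np.ones(n_messages) / n_messages

    for _ in range(n_iters):  # fixed iteration count stands in for a convergence test
        # Weighted group consensus distribution for each message.
        consensus = np.einsum('a,aml->ml', annotator_w, votes)

        # 1. Re-rank annotators: agreement with the consensus, with
        #    difficult (low-weight) messages contributing less.
        agreement = np.einsum('aml,ml->am', votes, consensus)
        annotator_w = agreement @ message_w
        annotator_w /= annotator_w.sum()

        # 2. Re-rank messages: more weight for messages whose consensus is
        #    concentrated on a few labels (high variance over the label set).
        message_w = consensus.var(axis=1)
        message_w /= message_w.sum()

    # Final category assignment: most probable label under the consensus.
    return consensus.argmax(axis=1), annotator_w, message_w
```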