Real-World Semi-Supervised Learning of POS-Taggers for
Low-Resource Languages
Dan Garrette, Jason Mielens, and Jason Baldridge
Proceedings of ACL 2013
Semi-Supervised Training
HMM with Expectation-Maximization (EM)
Need:
• Large raw corpus
• Tag dictionary
[Kupiec, 1992; Merialdo, 1994]
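As a sketch of this setup, the snippet below trains a bigram HMM with forward-backward EM while restricting emissions to tag-dictionary-licensed (tag, word) pairs. The corpus, dictionary, and tags are invented toy data, and this simplified model is only an illustration of the technique, not the paper's implementation.

```python
from collections import defaultdict

# Toy raw corpus and tag dictionary (invented; the dictionary here
# covers the whole vocabulary for simplicity).
RAW = [["the", "dog", "walks"], ["the", "cat", "walks"]]
TAG_DICT = {"the": {"DT"}, "dog": {"NN"}, "cat": {"NN"},
            "walks": {"VBZ", "NNS"}}
TAGS = sorted({t for ts in TAG_DICT.values() for t in ts})

def em(raw, tag_dict, iters=10):
    """Forward-backward EM for a bigram HMM, with emissions restricted
    to dictionary-licensed (tag, word) pairs."""
    allowed = lambda w: tag_dict[w]
    # rough uniform initialization over licensed pairs
    trans = {(a, b): 1.0 / len(TAGS) for a in TAGS + ["<s>"] for b in TAGS}
    emit = {(t, w): 1.0 for w in tag_dict for t in allowed(w)}
    for _ in range(iters):
        tc, ec = defaultdict(float), defaultdict(float)
        for sent in raw:
            # forward pass, over licensed tags only
            fwd = [{t: trans[("<s>", t)] * emit[(t, sent[0])]
                    for t in allowed(sent[0])}]
            for w in sent[1:]:
                fwd.append({t: emit[(t, w)] *
                            sum(p * trans[(s, t)] for s, p in fwd[-1].items())
                            for t in allowed(w)})
            # backward pass
            bwd = [{t: 1.0 for t in allowed(sent[-1])}]
            for i in range(len(sent) - 2, -1, -1):
                w_next = sent[i + 1]
                bwd.append({s: sum(trans[(s, t)] * emit[(t, w_next)] * p
                                   for t, p in bwd[-1].items())
                            for s in allowed(sent[i])})
            bwd.reverse()
            z = sum(fwd[-1][t] * bwd[-1][t] for t in fwd[-1]) or 1.0
            # expected emission and sentence-initial counts
            for i, w in enumerate(sent):
                for t in allowed(w):
                    g = fwd[i][t] * bwd[i][t] / z
                    ec[(t, w)] += g
                    if i == 0:
                        tc[("<s>", t)] += g
            # expected transition counts
            for i in range(len(sent) - 1):
                for s in allowed(sent[i]):
                    for t in allowed(sent[i + 1]):
                        tc[(s, t)] += (fwd[i][s] * trans[(s, t)] *
                                       emit[(t, sent[i + 1])] *
                                       bwd[i + 1][t]) / z
        # M-step: renormalize expected counts into probabilities
        for s in TAGS + ["<s>"]:
            tot = sum(tc[(s, t)] for t in TAGS) or 1.0
            for t in TAGS:
                trans[(s, t)] = tc[(s, t)] / tot
        for t in TAGS:
            tot = sum(ec[(t, w)] for w in tag_dict) or 1.0
            for w in tag_dict:
                if t in allowed(w):
                    emit[(t, w)] = ec[(t, w)] / tot
    return trans, emit
```

The dictionary does the supervision here: EM never has to consider tags the dictionary rules out, which is what makes this weak signal usable.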
Previous Works: Supervised Learning
• Provides high accuracy for POS tagging (Manning, 2011).
• Performs poorly when little supervision is available.

Previous Works: Semi-Supervised Learning
• Done by training sequence models such as HMMs using the EM algorithm.
• Work in this area has still relied on relatively large amounts of data (Kupiec, 1992; Merialdo, 1994).
Previous Works: Goldberg et al. (2008)
• Manually constructed lexicon for Hebrew to train an HMM tagger.
• The lexicon was developed over a long period of time by expert lexicographers.

Previous Works: Täckström et al. (2013)
• Evaluated the use of mixed type and token constraints generated by projecting information from a high-resource language to low-resource languages.
• Large parallel corpora required.
Low-Resource Languages
6,900 languages in the world
~30 have non-negligible quantities of data
No million-word corpus for any endangered language
[Maxwell and Hughes, 2006; Abney and Bird, 2010]
Low-Resource Languages
• Kinyarwanda (KIN): Niger-Congo. Morphologically rich.
• Malagasy (MLG): Austronesian. Spoken in Madagascar.
Also, English
Collecting Annotations
• Supervised training is not an option.
• Semi-supervised training:
  • Annotate some data by hand in 4 hours (in 30-minute intervals) for two tasks:
    • Type supervision.
    • Token supervision.
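To make the two annotation tasks concrete, here is a sketch of the data each one yields (the English words and tags are invented examples, not the paper's data):

```python
# Type supervision: the annotator lists valid tags per word TYPE,
# directly producing tag-dictionary entries.
type_annotations = {
    "the": {"DT"},
    "dog": {"NN"},
    "walks": {"VBZ", "NNS"},  # ambiguous out of context
}

# Token supervision: the annotator tags whole sentences, one label
# per TOKEN in context.
token_annotations = [
    [("the", "DT"), ("dog", "NN"), ("walks", "VBZ")],
]

def to_tag_dict(tagged_sents):
    """Token annotations also induce type-level dictionary entries."""
    d = {}
    for sent in tagged_sents:
        for word, tag in sent:
            d.setdefault(word, set()).add(tag)
    return d
```

Note the asymmetry: type annotations capture ambiguity ("walks" can be VBZ or NNS), while token annotations resolve it but only for the tokens actually seen.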
Tag Dict Generalization
These annotations are too sparse!
Generalize to the entire vocabulary
Tag Dict Generalization
Haghighi and Klein (2006) do this with a vector space.
We don’t have enough raw data.
Das and Petrov (2011) do this with a parallel corpus.
We don’t have a parallel corpus.
Tag Dict Generalization
Strategy: Label Propagation
• Connect annotations to raw corpus tokens
• Push tag labels to entire corpus
[Talukdar and Crammer, 2009]
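The propagation step can be sketched as plain iterative neighbour averaging with clamped seeds (a simplification; Talukdar and Crammer's modified adsorption adds abandonment and regularization terms). The graph and seed labels below are invented toy values:

```python
def propagate(edges, seeds, tags, iters=50):
    """Push seed tag distributions across an undirected graph.
    Seed nodes are clamped; all others average their neighbours."""
    nbrs = {}
    for a, b in edges:
        nbrs.setdefault(a, set()).add(b)
        nbrs.setdefault(b, set()).add(a)
    # every node starts at its seed distribution, or all zeros
    dist = {n: dict(seeds.get(n, {t: 0.0 for t in tags})) for n in nbrs}
    for _ in range(iters):
        new = {}
        for n in nbrs:
            if n in seeds:                      # clamp annotated nodes
                new[n] = dict(seeds[n])
            else:                               # average the neighbours
                new[n] = {t: sum(dist[m][t] for m in nbrs[n]) / len(nbrs[n])
                          for t in tags}
        dist = new
    return dist

# Tiny graph: type nodes link to their corpus tokens and context nodes.
edges = [("TYPE_the", "TOK_the_1"), ("TYPE_the", "TOK_the_4"),
         ("TYPE_dog", "TOK_dog_2"), ("TOK_the_1", "PREV_<b>")]
seeds = {"TYPE_the": {"DT": 1.0, "NN": 0.0},
         "TYPE_dog": {"DT": 0.0, "NN": 1.0}}
```

Running `propagate(edges, seeds, ["DT", "NN"])` pushes the type-level DT label onto every `TOK_the_*` node, which is exactly the generalization step: annotations on a handful of types become (noisy) labels on the whole corpus.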
Morphological Transducers
• Finite-state transducers (FSTs) are used for morphological analysis.
• An FST accepts a word type and produces a set of morphological features.
• Power of FSTs: analyze out-of-vocabulary items by looking for known affixes and guessing the stem of the word.
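A toy stand-in for that out-of-vocabulary behaviour: strip a known suffix and guess the remainder is the stem. Real morphological analyzers are genuine FSTs; the English suffix table here is invented purely for illustration.

```python
# Hypothetical suffix-to-feature table (a real analyzer would encode
# this as transducer arcs, with far richer morphology).
KNOWN_SUFFIXES = {"s": "PLURAL_OR_3SG", "ed": "PAST", "ing": "PROGRESSIVE"}

def analyze(word):
    """Guess (stem, features) for an unseen word by longest-suffix match."""
    for suf in sorted(KNOWN_SUFFIXES, key=len, reverse=True):
        # require a plausible stem of at least two characters
        if word.endswith(suf) and len(word) > len(suf) + 1:
            return word[: -len(suf)], {KNOWN_SUFFIXES[suf]}
    return word, set()                # no known affix: bare stem
```

For a morphologically rich language like Kinyarwanda, this kind of affix evidence is precisely what lets the graph connect unseen word types to annotated ones.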
Tag Dict Generalization
[Figure, built up over several slides: the label-propagation graph for a small example corpus. Token nodes (TOK_the_1, TOK_the_4, TOK_the_9, TOK_thug_5, TOK_dog_2) connect to word-type nodes (TYPE_the, TYPE_thug, TYPE_dog), context nodes (PREV_<b>, PREV_the, NEXT_thug, NEXT_walks), and affix-feature nodes (PRE1_t, PRE2_th, SUF1_e, SUF1_g, PRE1_d, PRE2_do). Type annotations (the/DT, dog/NN) seed labels on the TYPE nodes; token annotations ("the dog walks" tagged DT NN VBZ) seed labels directly on the TOK nodes. The labels then propagate across the graph to the unannotated tokens and types.]
Model Minimization
[Ravi et al., 2010; Garrette and Baldridge, 2012]
• The LP graph has a node for each corpus token.
• Each node is labelled with a distribution over POS tags.
• The graph thus provides a corpus of sentences labelled with noisy tag distributions.
• Greedily seek the minimal set of tag bigrams that describes the raw corpus.
• An HMM is then trained with EM.
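The greedy selection can be sketched as a set-cover pass: each adjacent token pair is an item, and a tag bigram "covers" it when its two tags are candidates for the two tokens. This is a simplification of the Ravi et al.-style objective, and the candidate sets below are invented:

```python
def minimize_bigrams(corpus_candidates):
    """Greedily pick the smallest set of tag bigrams such that every
    adjacent token pair has at least one chosen bigram consistent
    with both tokens' candidate tags."""
    items = []                        # one item per adjacent pair
    for sent in corpus_candidates:
        for left, right in zip(sent, sent[1:]):
            items.append((frozenset(left), frozenset(right)))
    chosen = set()
    uncovered = list(range(len(items)))
    while uncovered:
        # score every consistent bigram by how many uncovered pairs it explains
        covers = {}
        for i in uncovered:
            left, right = items[i]
            for a in left:
                for b in right:
                    covers.setdefault((a, b), []).append(i)
        best = max(covers, key=lambda bg: len(covers[bg]))
        chosen.add(best)
        done = set(covers[best])
        uncovered = [i for i in uncovered if i not in done]
    return chosen

# Toy corpus: candidate tags per token, as produced by label propagation.
corpus_candidates = [
    [{"DT"}, {"NN"}, {"VBZ", "NNS"}],
    [{"DT"}, {"NN"}, {"VBZ", "NNS"}],
]
```

Two bigrams suffice for this toy corpus, and the surviving bigram set then constrains the transition structure of the HMM that EM goes on to train.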
Overall Accuracy
[Bar chart (accuracy, 0–100%): KIN using all types; MLG using half types and half tokens; ENG using all types and the maximal amount of data.]
All of these values were achieved using both FST and affix LP features.
Results
• Types versus Tokens
• Mixing Type and Token Annotations
• Morphological Analysis
• Annotator Experience
Conclusion
• Type annotations are the most useful input from a linguist.
• We can train effective POS-taggers for low-resource languages given only a small amount of unlabeled text and a few hours of annotation by a linguist who is not a native speaker.