Real-World Semi-Supervised Learning of POS-Taggers for
Low-Resource Languages
Dan Garrette, Jason Mielens, and Jason Baldridge
Proceedings of ACL 2013
Semi-Supervised Training
HMM with Expectation-Maximization (EM)
Need:
• Large raw corpus
• Tag dictionary
[Kupiec, 1992; Merialdo, 1994]
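As a sketch of this setup, the snippet below trains a bigram HMM with forward-backward EM while restricting emissions to tag-dictionary-licensed (tag, word) pairs. The corpus, dictionary, and tags are invented toy data, and this simplified model is only an illustration of the technique, not the paper's implementation.

```python
from collections import defaultdict

# Toy raw corpus and tag dictionary (invented; the dictionary here
# covers the whole vocabulary for simplicity).
RAW = [["the", "dog", "walks"], ["the", "cat", "walks"]]
TAG_DICT = {"the": {"DT"}, "dog": {"NN"}, "cat": {"NN"},
            "walks": {"VBZ", "NNS"}}
TAGS = sorted({t for ts in TAG_DICT.values() for t in ts})

def em(raw, tag_dict, iters=10):
    """Forward-backward EM for a bigram HMM, with emissions restricted
    to dictionary-licensed (tag, word) pairs."""
    allowed = lambda w: tag_dict[w]
    # rough uniform initialization over licensed pairs
    trans = {(a, b): 1.0 / len(TAGS) for a in TAGS + ["<s>"] for b in TAGS}
    emit = {(t, w): 1.0 for w in tag_dict for t in allowed(w)}
    for _ in range(iters):
        tc, ec = defaultdict(float), defaultdict(float)
        for sent in raw:
            # forward pass, over licensed tags only
            fwd = [{t: trans[("<s>", t)] * emit[(t, sent[0])]
                    for t in allowed(sent[0])}]
            for w in sent[1:]:
                fwd.append({t: emit[(t, w)] *
                            sum(p * trans[(s, t)] for s, p in fwd[-1].items())
                            for t in allowed(w)})
            # backward pass
            bwd = [{t: 1.0 for t in allowed(sent[-1])}]
            for i in range(len(sent) - 2, -1, -1):
                w_next = sent[i + 1]
                bwd.append({s: sum(trans[(s, t)] * emit[(t, w_next)] * p
                                   for t, p in bwd[-1].items())
                            for s in allowed(sent[i])})
            bwd.reverse()
            z = sum(fwd[-1][t] * bwd[-1][t] for t in fwd[-1]) or 1.0
            # expected emission and sentence-initial counts
            for i, w in enumerate(sent):
                for t in allowed(w):
                    g = fwd[i][t] * bwd[i][t] / z
                    ec[(t, w)] += g
                    if i == 0:
                        tc[("<s>", t)] += g
            # expected transition counts
            for i in range(len(sent) - 1):
                for s in allowed(sent[i]):
                    for t in allowed(sent[i + 1]):
                        tc[(s, t)] += (fwd[i][s] * trans[(s, t)] *
                                       emit[(t, sent[i + 1])] *
                                       bwd[i + 1][t]) / z
        # M-step: renormalize expected counts into probabilities
        for s in TAGS + ["<s>"]:
            tot = sum(tc[(s, t)] for t in TAGS) or 1.0
            for t in TAGS:
                trans[(s, t)] = tc[(s, t)] / tot
        for t in TAGS:
            tot = sum(ec[(t, w)] for w in tag_dict) or 1.0
            for w in tag_dict:
                if t in allowed(w):
                    emit[(t, w)] = ec[(t, w)] / tot
    return trans, emit
```

The dictionary does the supervision here: EM never has to consider tags the dictionary rules out, which is what makes this weak signal usable.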
Previous Works: Supervised Learning
• Provides high accuracy for POS tagging (Manning, 2011).
• Performs poorly when little supervision is available.

Previous Works: Semi-Supervised Learning
• Done by training sequence models such as HMMs using the EM algorithm.
• Work in this area has still relied on relatively large amounts of data (Kupiec, 1992; Merialdo, 1994).
Previous Works: Goldberg et al. (2008)
• Manually constructed lexicon for Hebrew to train an HMM tagger.
• The lexicon was developed over a long period of time by expert lexicographers.

Previous Works: Täckström et al. (2013)
• Evaluated the use of mixed type and token constraints generated by projecting information from a high-resource language to low-resource languages.
• Large parallel corpora required.
Low-Resource Languages
6,900 languages in the world
~30 have non-negligible quantities of data
No million-word corpus for any endangered language
[Maxwell and Hughes, 2006; Abney and Bird, 2010]
Low-Resource Languages
• Kinyarwanda (KIN): Niger-Congo. Morphologically rich.
• Malagasy (MLG): Austronesian. Spoken in Madagascar.
Also, English
Collecting Annotations
• Supervised training is not an option.
• Semi-supervised training:
  • Annotate some data by hand in 4 hours (in 30-minute intervals) for two tasks:
    • Type supervision.
    • Token supervision.
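To make the two annotation tasks concrete, here is a sketch of the data each one yields (the English words and tags are invented examples, not the paper's data):

```python
# Type supervision: the annotator lists valid tags per word TYPE,
# directly producing tag-dictionary entries.
type_annotations = {
    "the": {"DT"},
    "dog": {"NN"},
    "walks": {"VBZ", "NNS"},  # ambiguous out of context
}

# Token supervision: the annotator tags whole sentences, one label
# per TOKEN in context.
token_annotations = [
    [("the", "DT"), ("dog", "NN"), ("walks", "VBZ")],
]

def to_tag_dict(tagged_sents):
    """Token annotations also induce type-level dictionary entries."""
    d = {}
    for sent in tagged_sents:
        for word, tag in sent:
            d.setdefault(word, set()).add(tag)
    return d
```

Note the asymmetry: type annotations capture ambiguity ("walks" can be VBZ or NNS), while token annotations resolve it but only for the tokens actually seen.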
Tag Dict Generalization
These annotations are too sparse!
Generalize to the entire vocabulary
Tag Dict Generalization
Haghighi and Klein (2006) do this with a vector space.
We don’t have enough raw data.
Das and Petrov (2011) do this with a parallel corpus.
We don’t have a parallel corpus.
Tag Dict Generalization
Strategy: Label Propagation
• Connect annotations to raw corpus tokens
• Push tag labels to entire corpus
[Talukdar and Crammer, 2009]
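The propagation step can be sketched as plain iterative neighbour averaging with clamped seeds (a simplification; Talukdar and Crammer's modified adsorption adds abandonment and regularization terms). The graph and seed labels below are invented toy values:

```python
def propagate(edges, seeds, tags, iters=50):
    """Push seed tag distributions across an undirected graph.
    Seed nodes are clamped; all others average their neighbours."""
    nbrs = {}
    for a, b in edges:
        nbrs.setdefault(a, set()).add(b)
        nbrs.setdefault(b, set()).add(a)
    # every node starts at its seed distribution, or all zeros
    dist = {n: dict(seeds.get(n, {t: 0.0 for t in tags})) for n in nbrs}
    for _ in range(iters):
        new = {}
        for n in nbrs:
            if n in seeds:                      # clamp annotated nodes
                new[n] = dict(seeds[n])
            else:                               # average the neighbours
                new[n] = {t: sum(dist[m][t] for m in nbrs[n]) / len(nbrs[n])
                          for t in tags}
        dist = new
    return dist

# Tiny graph: type nodes link to their corpus tokens and context nodes.
edges = [("TYPE_the", "TOK_the_1"), ("TYPE_the", "TOK_the_4"),
         ("TYPE_dog", "TOK_dog_2"), ("TOK_the_1", "PREV_<b>")]
seeds = {"TYPE_the": {"DT": 1.0, "NN": 0.0},
         "TYPE_dog": {"DT": 0.0, "NN": 1.0}}
```

Running `propagate(edges, seeds, ["DT", "NN"])` pushes the type-level DT label onto every `TOK_the_*` node, which is exactly the generalization step: annotations on a handful of types become (noisy) labels on the whole corpus.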
Morphological Transducers
• Finite-state transducers (FSTs) are used for morphological analysis.
• An FST accepts a word type and produces a set of morphological features.
• Power of FSTs: analyze out-of-vocabulary items by looking for known affixes and guessing the stem of the word.
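A toy stand-in for that out-of-vocabulary behaviour: strip a known suffix and guess the remainder is the stem. Real morphological analyzers are genuine FSTs; the English suffix table here is invented purely for illustration.

```python
# Hypothetical suffix-to-feature table (a real analyzer would encode
# this as transducer arcs, with far richer morphology).
KNOWN_SUFFIXES = {"s": "PLURAL_OR_3SG", "ed": "PAST", "ing": "PROGRESSIVE"}

def analyze(word):
    """Guess (stem, features) for an unseen word by longest-suffix match."""
    for suf in sorted(KNOWN_SUFFIXES, key=len, reverse=True):
        # require a plausible stem of at least two characters
        if word.endswith(suf) and len(word) > len(suf) + 1:
            return word[: -len(suf)], {KNOWN_SUFFIXES[suf]}
    return word, set()                # no known affix: bare stem
```

For a morphologically rich language like Kinyarwanda, this kind of affix evidence is precisely what lets the graph connect unseen word types to annotated ones.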
Tag Dict Generalization
[Figure, built up over several slides: the label-propagation graph for a small example corpus. Token nodes (TOK_the_1, TOK_the_4, TOK_the_9, TOK_thug_5, TOK_dog_2) connect to word-type nodes (TYPE_the, TYPE_thug, TYPE_dog), context nodes (PREV_<b>, PREV_the, NEXT_thug, NEXT_walks), and affix-feature nodes (PRE1_t, PRE2_th, SUF1_e, SUF1_g, PRE1_d, PRE2_do). Type annotations (the/DT, dog/NN) seed labels on the TYPE nodes; token annotations ("the dog walks" tagged DT NN VBZ) seed labels directly on the TOK nodes. The labels then propagate across the graph to the unannotated tokens and types.]
Model Minimization
[Ravi et al., 2010; Garrette and Baldridge, 2012]
• The LP graph has a node for each corpus token.
• Each node is labelled with a distribution over POS tags.
• The graph thus provides a corpus of sentences labelled with noisy tag distributions.
• Greedily seek the minimal set of tag bigrams that describes the raw corpus.
• An HMM is then trained with EM.
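The greedy selection can be sketched as a set-cover pass: each adjacent token pair is an item, and a tag bigram "covers" it when its two tags are candidates for the two tokens. This is a simplification of the Ravi et al.-style objective, and the candidate sets below are invented:

```python
def minimize_bigrams(corpus_candidates):
    """Greedily pick the smallest set of tag bigrams such that every
    adjacent token pair has at least one chosen bigram consistent
    with both tokens' candidate tags."""
    items = []                        # one item per adjacent pair
    for sent in corpus_candidates:
        for left, right in zip(sent, sent[1:]):
            items.append((frozenset(left), frozenset(right)))
    chosen = set()
    uncovered = list(range(len(items)))
    while uncovered:
        # score every consistent bigram by how many uncovered pairs it explains
        covers = {}
        for i in uncovered:
            left, right = items[i]
            for a in left:
                for b in right:
                    covers.setdefault((a, b), []).append(i)
        best = max(covers, key=lambda bg: len(covers[bg]))
        chosen.add(best)
        done = set(covers[best])
        uncovered = [i for i in uncovered if i not in done]
    return chosen

# Toy corpus: candidate tags per token, as produced by label propagation.
corpus_candidates = [
    [{"DT"}, {"NN"}, {"VBZ", "NNS"}],
    [{"DT"}, {"NN"}, {"VBZ", "NNS"}],
]
```

Two bigrams suffice for this toy corpus, and the surviving bigram set then constrains the transition structure of the HMM that EM goes on to train.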
Overall Accuracy
[Bar chart (accuracy, 0–100%): KIN using all types; MLG using half types and half tokens; ENG using all types and the maximal amount of data.]
All of these values were achieved using both FST and affix LP features.
Results
• Types versus Tokens
• Mixing Type and Token Annotations
• Morphological Analysis
• Annotator Experience
Conclusion
• Type annotations are the most useful input from a linguist.
• We can train effective POS-taggers for low-resource languages given only a small amount of unlabeled text and a few hours of annotation by a linguist who is not a native speaker.