Corpora and Statistical Methods Lecture 9



Albert Gatt
Part 2: POS tagging overview; HMM taggers, TBL tagging

The task
Assign each word in continuous text a tag indicating its part of speech. Essentially a classification problem.

Current state of the art:
- taggers typically have 96-97% accuracy
- this figure is evaluated on a per-word basis
- in a corpus with sentences of average length 20 words, 96% accuracy can mean one tagging error per sentence

Sources of difficulty in POS tagging
Mostly due to ambiguity, when words have more than one possible tag:
- we need context to make a good guess about POS
- context alone won't suffice

A simple approach which assigns only the most common tag to each word performs with 90% accuracy!
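As a point of reference, here is a minimal sketch of such a most-frequent-tag baseline; the corpus format (lists of (word, tag) pairs) and the function names are illustrative assumptions, not from the lecture:

```python
from collections import Counter, defaultdict

def train_baseline(tagged_sentences):
    # Count how often each word occurs with each tag.
    counts = defaultdict(Counter)
    tag_freq = Counter()
    for sent in tagged_sentences:
        for word, tag in sent:
            counts[word][tag] += 1
            tag_freq[tag] += 1
    # The most frequent tag overall serves as a fallback for unseen words.
    default_tag = tag_freq.most_common(1)[0][0]
    most_likely = {w: c.most_common(1)[0][0] for w, c in counts.items()}
    return most_likely, default_tag

def tag_baseline(words, most_likely, default_tag):
    # Assign every word its single most common tag, ignoring context.
    return [(w, most_likely.get(w, default_tag)) for w in words]
```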

The information sources
Syntagmatic information: the tags of the other words in the context of a word w. Not sufficient on its own: e.g. Greene & Rubin (1971) describe a context-only tagger with only 77% accuracy.

Lexical information (dictionary): the most common tag(s) for a given word
- e.g. in English, many nouns can be used as verbs (flour the pan, wax the car)
- however, their most likely tag remains NN
- the distribution of a word's usages across different POSs is uneven: usually one is highly likely, the others much less so

Tagging in other languages (than English)
In English, high reliance on context is a good idea, because of fixed word order.

Free word order languages make this assumption harder.
Compensation: these languages typically have rich morphology, a good source of clues for a tagger.

Evaluation and error analysis
Training a statistical POS tagger requires splitting the corpus into training and test data. Often, we need a development set as well, to tune parameters.

Using (n-fold) cross-validation is a good idea to save data:
- randomly divide the data into train + test
- train, and evaluate on the test portion
- repeat n times and take an average
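A minimal sketch of this procedure, assuming a hypothetical train_and_score(train, test) helper that trains the tagger and returns per-word accuracy:

```python
import random

def cross_validate(sentences, train_and_score, n=10, seed=0):
    # Shuffle once, then partition the data into n folds.
    data = list(sentences)
    random.Random(seed).shuffle(data)
    folds = [data[i::n] for i in range(n)]
    scores = []
    for i in range(n):
        # Hold out fold i as the test set; train on the rest.
        test = folds[i]
        train = [s for j, fold in enumerate(folds) if j != i for s in fold]
        scores.append(train_and_score(train, test))
    # Report the average accuracy over the n runs.
    return sum(scores) / n
```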

NB: cross-validation over the whole corpus requires that the entire corpus remain blind (never examined during development). If we want to examine the training data, it is best to have fixed training and test sets: perform cross-validation on the training data, and a final evaluation on the test set.

Evaluation
Typically carried out against a gold standard, based on accuracy (% correct).

It is ideal to compare the accuracy of our tagger with:
- a baseline (lower bound): the standard is to choose the unigram most likely tag
- a ceiling (upper bound): e.g. see how well humans do at the same task; humans apparently agree on 96-97% of tags, which means it is highly suspect for a tagger to get 100% accuracy

HMM taggers

Using Markov models
Basic idea: sequences of tags are a Markov chain:
- Limited horizon assumption: it is sufficient to look at the previous tag for information about the current tag
- Time invariance: the probability of a sequence remains the same over time

Implications/limitations
Limited horizon ignores long-distance dependencies:
- e.g. it can't deal with WH-constructions
- Chomsky (1957): this was one of the reasons cited against probabilistic approaches

Time invariance:
- e.g. P(finite verb | pronoun) is constant
- but we may be more likely to find a finite verb following a pronoun at the start of a sentence than in the middle!

Notation
- We let t_i range over tags; w_i ranges over words.
- Subscripts denote position in a sequence.

- Use superscripts to denote types: w^j = an instance of word type j in the lexicon; t^j = the j-th tag in the tagset.

Limited horizon property becomes:
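In the notation above, this is the standard formulation:

$$ P(t_{i+1} = t^j \mid t_1, \ldots, t_i) = P(t_{i+1} = t^j \mid t_i) $$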

Basic strategy
Training set of manually tagged text. From this, extract probabilities of tag sequences:
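In the standard formulation, these are maximum likelihood estimates from corpus counts C(.):

$$ P(t^k \mid t^j) = \frac{C(t^j, t^k)}{C(t^j)} $$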

e.g. using the Brown Corpus, P(NN | JJ) = 0.45, but P(VBP | JJ) = 0.0005.

Next step: estimate the word/tag probabilities:
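Again by maximum likelihood estimation from counts, in the standard formulation:

$$ P(w^l \mid t^j) = \frac{C(w^l, t^j)}{C(t^j)} $$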

These are basically symbol emission probabilities.

Training the tagger: basic algorithm
1. Estimate the probability of all possible sequences of 2 tags in the tagset from the training data.

2. For each tag t^j and for each word w^l, estimate P(w^l | t^j).

3. Apply smoothing (a minimal count-based sketch of these steps follows).
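A sketch of the training steps, using add-alpha smoothing; the corpus format and the choice of smoothing method are illustrative assumptions, not fixed by the lecture:

```python
from collections import Counter, defaultdict

def train_hmm(tagged_sentences, alpha=1.0):
    trans = defaultdict(Counter)  # trans[prev_tag][tag] = tag bigram count
    emit = defaultdict(Counter)   # emit[tag][word] = co-occurrence count
    vocab = set()
    for sent in tagged_sentences:
        prev = "PERIOD"  # treat the sentence boundary as a period
        for word, tag in sent:
            trans[prev][tag] += 1
            emit[tag][word] += 1
            vocab.add(word)
            prev = tag
        trans[prev]["PERIOD"] += 1
    tags = set(trans) | {t for c in trans.values() for t in c}

    def p_trans(prev, tag):
        # Smoothed estimate of P(tag | prev).
        c = trans[prev]
        return (c[tag] + alpha) / (sum(c.values()) + alpha * len(tags))

    def p_emit(tag, word):
        # Smoothed estimate of P(word | tag); unseen words get a small mass.
        c = emit[tag]
        return (c[word] + alpha) / (sum(c.values()) + alpha * (len(vocab) + 1))

    return tags, p_trans, p_emit
```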

Finding the best tag sequence
Given: a sentence of n words.
Find: t_{1,n} = the best n tags.

Application of Bayes' rule: the denominator can be eliminated, as it's the same for all tag sequences.
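Spelled out, with the denominator dropped in the final step:

$$ \hat{t}_{1,n} = \arg\max_{t_{1,n}} P(t_{1,n} \mid w_{1,n}) = \arg\max_{t_{1,n}} \frac{P(w_{1,n} \mid t_{1,n})\, P(t_{1,n})}{P(w_{1,n})} = \arg\max_{t_{1,n}} P(w_{1,n} \mid t_{1,n})\, P(t_{1,n}) $$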

The expression P(w_{1,n} | t_{1,n}) P(t_{1,n}) needs to be reduced to parameters that can be estimated from the training corpus.

We need to make some simplifying assumptions:
- words are independent of each other
- a word's identity depends only on its tag

The independence assumption
The probability of a sequence of words given a sequence of tags is computed as a function of each word independently:
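That is, in the standard formulation:

$$ P(w_{1,n} \mid t_{1,n}) = \prod_{i=1}^{n} P(w_i \mid t_{1,n}) $$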

The identity assumption
The probability of a word given a tag sequence = the probability of the word given its own tag: P(w_i | t_{1,n}) = P(w_i | t_i).

Applying these assumptions
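Combining the tag Markov chain with these two assumptions yields the quantity the tagger maximises (standard formulation):

$$ \hat{t}_{1,n} = \arg\max_{t_{1,n}} \prod_{i=1}^{n} P(w_i \mid t_i)\, P(t_i \mid t_{i-1}) $$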

Tagging with the Markov Model
Can use the Viterbi algorithm to find the best sequence of tags given a sequence of words (sentence).
Reminder:
- δ_i(j) = the probability of being in state (tag) j at word i on the best path
- ψ_{i+1}(j) = the most probable state (tag) at word i, given that we're in state j at word i+1
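In standard Viterbi notation, the first of these quantities can be written as:

$$ \delta_i(j) = \max_{t_1, \ldots, t_{i-1}} P(t_1, \ldots, t_{i-1},\; w_1, \ldots, w_i,\; t_i = t^j) $$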

The algorithm: initialisation
Assume that P(PERIOD) = 1 at the end of a sentence; set all other tag probabilities to 0.

Algorithm: induction step
for i = 1 to n step 1:
    for all tags t^j do:
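The updates computed inside the loop, in the standard formulation (T is the number of tags in the tagset):

$$ \delta_{i+1}(t^j) = \max_{1 \le k \le T} \left[ \delta_i(t^k)\, P(w_{i+1} \mid t^j)\, P(t^j \mid t^k) \right] $$

$$ \psi_{i+1}(t^j) = \arg\max_{1 \le k \le T} \left[ \delta_i(t^k)\, P(w_{i+1} \mid t^j)\, P(t^j \mid t^k) \right] $$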

Here δ_{i+1}(t^j) is the probability of tag t^j at i+1 on the best path through i, and ψ_{i+1}(t^j) is the most probable tag leading to t^j at i+1.

Algorithm: backtrace

State at n+1: X_{n+1} = arg max_j δ_{n+1}(j), the most probable final state.

for j = n to 1 do:
    X_j = ψ_{j+1}(X_{j+1})
This retrieves the most probable tags for every point in the sequence; finally, calculate the probability for the sequence of tags selected.
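Putting initialisation, induction and backtrace together, a minimal sketch of the decoder, assuming the tags/p_trans/p_emit interface from the training sketch above (a real implementation would work with log probabilities to avoid underflow):

```python
def viterbi(words, tags, p_trans, p_emit):
    # Initialisation: all probability mass starts on the boundary "PERIOD" tag.
    delta = {"PERIOD": 1.0}
    psi = []  # psi[i][tag] = best previous tag for `tag` at word i
    for w in words:
        new_delta, back = {}, {}
        for t in tags:
            # Induction: best path reaching tag t at this word.
            prev, score = max(
                ((p, delta[p] * p_trans(p, t)) for p in delta),
                key=lambda x: x[1],
            )
            new_delta[t] = score * p_emit(t, w)
            back[t] = prev
        delta = new_delta
        psi.append(back)
    # Backtrace: start from the most probable final tag and walk backwards.
    best = max(delta, key=delta.get)
    seq = [best]
    for back in reversed(psi[1:]):
        seq.append(back[seq[-1]])
    return list(reversed(seq))
```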

Some observations
The model is a Hidden Markov Model: when we tag, we only observe the words. In actuality, during training we have a visible Markov Model, because the training corpus provides both words and tags.

True HMM taggers
Applied to cases where we do not have a large training corpus.

We maintain the usual Markov Model assumptions.

Initialisation: use a dictionary; set the emission probability for a word/tag pair to 0 if it's not in the dictionary.

Training: apply the model to the data, using the forward-backward (Baum-Welch) algorithm.

Tagging: exactly as before (Viterbi decoding).