Corpora and Statistical Methods
Lecture 9, Part 2
Albert Gatt

POS Tagging overview; HMM taggers, TBL tagging

The task
Assign each word in continuous text a tag indicating its part of speech. This is essentially a classification problem.
Current state of the art:
taggers typically have 96-97% accuracy
the figure is evaluated on a per-word basis
in a corpus with sentences of average length 20 words, 96% accuracy can mean one tagging error per sentence

Sources of difficulty in POS tagging
Mostly due to ambiguity, when words have more than one possible tag. We need context to make a good guess about POS, but context alone won't suffice.

A simple approach which assigns only the most common tag to each word performs with 90% accuracy!
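That most-common-tag baseline can be sketched as follows. This is a minimal illustration, not the lecture's own code: the corpus format (a list of word/tag pairs) and the NN fallback for unknown words are assumptions.

```python
from collections import Counter, defaultdict

def train_baseline(tagged_words):
    # count how often each word carries each tag
    counts = defaultdict(Counter)
    for word, tag in tagged_words:
        counts[word][tag] += 1
    # keep only the single most frequent tag per word
    return {w: c.most_common(1)[0][0] for w, c in counts.items()}

def tag_baseline(model, words, default="NN"):
    # unknown words fall back to the default tag (an assumption)
    return [(w, model.get(w, default)) for w in words]

train = [("the", "DT"), ("flour", "NN"), ("flour", "VB"),
         ("flour", "NN"), ("pan", "NN")]
model = train_baseline(train)
print(tag_baseline(model, ["flour", "the", "pan"]))
# → [('flour', 'NN'), ('the', 'DT'), ('pan', 'NN')]
```

Note how "flour" gets NN despite also appearing as VB: the baseline simply ignores context, which is exactly why it tops out around 90%.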
The information sources
Syntagmatic information: the tags of other words in the context of w. Not sufficient on its own: e.g. Greene & Rubin (1971) describe a context-only tagger with only 77% accuracy.
Lexical information (dictionary): the most common tag(s) for a given word. E.g. in English, many nouns can be used as verbs (flour the pan, wax the car); however, their most likely tag remains NN. The distribution of a word's usages across different POSs is uneven: usually one is highly likely, the others much less so.

Tagging in other languages (than English)
In English, high reliance on context is a good idea because of fixed word order.
Free word order languages make this assumption harder. As compensation, these languages typically have rich morphology, which is a good source of clues for a tagger.
Evaluation and error analysis
Training a statistical POS tagger requires splitting the corpus into training and test data. Often we need a development set as well, to tune parameters.

Using (n-fold) cross-validation is a good idea to save data:
randomly divide the data into train + test
train, and evaluate on test
repeat n times and take an average

NB: cross-validation requires the whole corpus to be blind. To examine the training data, it is best to have fixed training and test sets, perform cross-validation on the training data, and carry out the final evaluation on the test set.

Evaluation
Typically carried out against a gold standard, based on accuracy (% correct).
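The cross-validation recipe above can be sketched as follows. The `train_fn` and `eval_fn` arguments are hypothetical stand-ins for a real tagger's training and evaluation routines; this would be run on the training portion only, keeping the final test set untouched.

```python
import random

def cross_validate(data, n, train_fn, eval_fn, seed=0):
    # shuffle a copy so the caller's list is untouched
    data = data[:]
    random.Random(seed).shuffle(data)
    # split into n roughly equal folds
    folds = [data[i::n] for i in range(n)]
    scores = []
    for i in range(n):
        held_out = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        model = train_fn(train)
        scores.append(eval_fn(model, held_out))
    # average accuracy over the n train/test repetitions
    return sum(scores) / n
```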
It is ideal to compare the accuracy of our tagger with:
baseline (lower bound): the standard is to choose the unigram most likely tag
ceiling (upper bound): e.g. see how well humans do at the same task; humans apparently agree on 96-97% of tags, which means it is highly suspect for a tagger to get 100% accuracy

HMM taggers

Using Markov models
Basic idea: sequences of tags are a Markov chain:
Limited horizon assumption: it is sufficient to look at the previous tag for information about the current tag
Time invariance: the probability of a sequence remains the same over time

Implications/limitations
The limited horizon ignores long-distance dependencies: e.g. it can't deal with WH-constructions. Chomsky (1957): this was one of the reasons cited against probabilistic approaches.

Time invariance: e.g. P(finite verb | pronoun) is constant, but we may be more likely to find a finite verb following a pronoun at the start of a sentence than in the middle!
Notation
We let t_i range over tags and w_i range over words. Subscripts denote position in a sequence.

We use superscripts to denote word types:
w^j = an instance of word type j in the lexicon
t^j = the tag t assigned to word w^j

The limited horizon property becomes:

P(t_{i+1} | t_1, ..., t_i) = P(t_{i+1} | t_i)
Basic strategy
Take a training set of manually tagged text and extract the probabilities of tag sequences:

P(t_i | t_{i-1}) = C(t_{i-1}, t_i) / C(t_{i-1})

e.g. using the Brown Corpus, P(NN|JJ) = 0.45, but P(VBP|JJ) = 0.0005.

Next step: estimate the word/tag probabilities:

P(w_i | t_i) = C(w_i, t_i) / C(t_i)

These are basically the symbol emission probabilities.

Training the tagger: basic algorithm
1. Estimate the probability of all possible sequences of 2 tags in the tagset from the training data.
2. For each tag t^j and for each word w^l, estimate P(w^l | t^j).
3. Apply smoothing.

Finding the best tag sequence
Given: a sentence of n words.
Find: t_{1,n} = the best n tags.
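The training steps above can be sketched as follows. This is a minimal sketch: the corpus format (a list of tagged sentences), the <s> start-of-sentence marker, and applying add-one smoothing to emissions only are assumptions made for illustration.

```python
from collections import Counter

def train_hmm(tagged_sents):
    trans, emit, tag_count = Counter(), Counter(), Counter()
    vocab = set()
    for sent in tagged_sents:
        # count tag-bigram transitions, including from the start marker
        tags = ["<s>"] + [t for _, t in sent]
        for prev, cur in zip(tags, tags[1:]):
            trans[(prev, cur)] += 1
        tag_count["<s>"] += 1
        # count word/tag co-occurrences for the emission probabilities
        for w, t in sent:
            emit[(t, w)] += 1
            tag_count[t] += 1
            vocab.add(w)
    V = len(vocab)

    def p_trans(prev, cur):
        # maximum-likelihood estimate of P(cur | prev)
        return trans[(prev, cur)] / tag_count[prev]

    def p_emit(tag, word):
        # add-one smoothed estimate of P(word | tag)
        return (emit[(tag, word)] + 1) / (tag_count[tag] + V)

    return p_trans, p_emit
```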
Application of Bayes' rule

argmax_{t_{1,n}} P(t_{1,n} | w_{1,n}) = argmax_{t_{1,n}} P(w_{1,n} | t_{1,n}) P(t_{1,n}) / P(w_{1,n})

The denominator can be eliminated, as it is the same for all tag sequences.

Finding the best tag sequence
The expression needs to be reduced to parameters that can be estimated from the training corpus. We need to make some simplifying assumptions:
words are independent of each other
a word's identity depends only on its tag

The independence assumption
The probability of a sequence of words given a sequence of tags is computed as a function of each word independently:

P(w_{1,n} | t_{1,n}) = prod_{i=1..n} P(w_i | t_{1,n})

The identity assumption
The probability of a word given a tag sequence equals the probability of the word given its own tag:

P(w_i | t_{1,n}) = P(w_i | t_i)

Applying these assumptions:

argmax_{t_{1,n}} P(t_{1,n} | w_{1,n}) = argmax_{t_{1,n}} prod_{i=1..n} P(w_i | t_i) P(t_i | t_{i-1})
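A toy numeric instance of the resulting product formula, with all probabilities invented purely for illustration:

```python
p_trans = {("<s>", "DT"): 0.4, ("DT", "NN"): 0.5}   # toy P(tag | previous tag)
p_emit = {("DT", "the"): 0.6, ("NN", "dog"): 0.01}  # toy P(word | tag)

def sequence_score(words, tags):
    # product of P(w_i | t_i) * P(t_i | t_{i-1}), starting from <s>
    score, prev = 1.0, "<s>"
    for w, t in zip(words, tags):
        score *= p_trans[(prev, t)] * p_emit[(t, w)]
        prev = t
    return score

print(round(sequence_score(["the", "dog"], ["DT", "NN"]), 6))
# → 0.0012  (= 0.4 * 0.6 * 0.5 * 0.01)
```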
Tagging with the Markov Model
We can use the Viterbi algorithm to find the best sequence of tags given a sequence of words (the sentence).

Reminder:
delta_i(j) = the probability of being in state (tag) j at word i on the best path
psi_i(j) = the most probable state (tag) at word i, given that we are in state j at word i+1

The algorithm: initialisation

delta_1(PERIOD) = 1.0
delta_1(t) = 0.0 for all other tags t

i.e. assume that P(PERIOD) = 1 at the end of a sentence, and set all other tag probabilities to 0.

Algorithm: induction step
for i = 1 to n step 1:
for all tags t^j do:
delta_{i+1}(t^j) = max_{1<=k<=T} [ delta_i(t^k) P(w_{i+1} | t^j) P(t^j | t^k) ]
(probability of tag t^j at i+1 on the best path through i)
psi_{i+1}(t^j) = argmax_{1<=k<=T} [ delta_i(t^k) P(w_{i+1} | t^j) P(t^j | t^k) ]
(most probable tag leading to t^j at i+1)

Algorithm: backtrace
X_{n+1} = argmax_{1<=j<=T} delta_{n+1}(j)
(state at n+1)
for j = n to 1 do:
X_j = psi_{j+1}(X_{j+1})
(retrieve the most probable tag for every point in the sequence)
P(X_1, ..., X_n) = max_{1<=j<=T} delta_{n+1}(t^j)
(calculate the probability for the sequence of tags selected)

Some observations
The model is a Hidden Markov Model: we only observe the words when we tag.
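The initialisation, induction, and backtrace steps can be sketched compactly as follows. This is a minimal sketch of the algorithm, not the lecture's own code: a generic <s> start symbol stands in for the PERIOD initialisation, and the probability tables are toy values invented for illustration.

```python
def viterbi(words, tags, p_trans, p_emit):
    # delta[i][t]: probability of the best tag path ending in tag t at word i
    # psi[i][t]: the best previous tag leading to t at word i
    delta = [{t: p_trans.get(("<s>", t), 0.0) * p_emit.get((t, words[0]), 0.0)
              for t in tags}]
    psi = [{}]
    for i in range(1, len(words)):
        delta.append({})
        psi.append({})
        for t in tags:
            best = max(tags, key=lambda s: delta[i-1][s] * p_trans.get((s, t), 0.0))
            delta[i][t] = (delta[i-1][best] * p_trans.get((best, t), 0.0)
                           * p_emit.get((t, words[i]), 0.0))
            psi[i][t] = best
    # backtrace from the most probable final tag
    path = [max(tags, key=lambda t: delta[-1][t])]
    for i in range(len(words) - 1, 0, -1):
        path.append(psi[i][path[-1]])
    return list(reversed(path))

# Toy probabilities, invented purely for illustration
p_trans = {("<s>", "DT"): 0.9, ("<s>", "NN"): 0.1,
           ("DT", "NN"): 0.9, ("DT", "DT"): 0.1,
           ("NN", "NN"): 0.5, ("NN", "DT"): 0.5}
p_emit = {("DT", "the"): 1.0, ("NN", "dog"): 1.0}
print(viterbi(["the", "dog"], ["DT", "NN"], p_trans, p_emit))
# → ['DT', 'NN']
```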
In actuality, during training we have a visible Markov Model, because the training corpus provides both words and tags.

True HMM taggers
Applied to cases where we do not have a large training corpus.
We maintain the usual Markov Model assumptions.
Initialisation: use a dictionary; set the emission probability for a word/tag pair to 0 if it is not in the dictionary.
Training: apply the model to the data, using the forward-backward algorithm.
Tagging: exactly as before.
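Full forward-backward (Baum-Welch) training is more than a short sketch, but its core is the forward pass below, which sums over all tag paths rather than maximising over them as Viterbi does. The <s> start symbol and toy probabilities are again assumptions for illustration.

```python
def forward(words, tags, p_trans, p_emit):
    # alpha[i][t]: total probability of all tag paths ending in tag t at word i
    alpha = [{t: p_trans.get(("<s>", t), 0.0) * p_emit.get((t, words[0]), 0.0)
              for t in tags}]
    for i in range(1, len(words)):
        alpha.append({t: sum(alpha[i - 1][s] * p_trans.get((s, t), 0.0)
                             for s in tags) * p_emit.get((t, words[i]), 0.0)
                      for t in tags})
    # likelihood of the whole sentence under the current parameters
    return sum(alpha[-1].values())

# Toy probabilities, invented purely for illustration
p_trans = {("<s>", "DT"): 0.9, ("<s>", "NN"): 0.1,
           ("DT", "NN"): 0.9, ("DT", "DT"): 0.1,
           ("NN", "NN"): 0.5, ("NN", "DT"): 0.5}
p_emit = {("DT", "the"): 1.0, ("NN", "dog"): 1.0}
print(round(forward(["the", "dog"], ["DT", "NN"], p_trans, p_emit), 6))
# → 0.81
```

In Baum-Welch, this forward pass is combined with a symmetric backward pass to re-estimate the transition and emission probabilities iteratively.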