N-grams
L545
Dept. of Linguistics, Indiana University
Spring 2013
Morphosyntax
We just finished talking about morphology (cf. words)
- And pretty soon we’re going to discuss syntax (cf. sentences)

In between, we’ll handle words in context
- Today: n-gram language modeling
- Next time: POS tagging
Both of these topics are covered in more detail in L645
N-grams: Motivation
An n-gram is a stretch of text n words long
- Approximation of language: n-grams tell us something about language, but don’t capture structure
- Efficient: finding and using every, e.g., two-word collocation in a text is quick and easy to do

N-grams can help in a variety of NLP applications:
- Word prediction: n-grams can be used to aid in predicting the next word of an utterance, based on the previous n − 1 words
- Context-sensitive spelling correction
- Machine Translation post-editing
- ...
Corpus-based NLP
Corpus (pl. corpora) = a computer-readable collection of text and/or speech, often with annotations

- Use corpora to gather probabilities & other information about language use
- A corpus used to gather prior information = training data
- Testing data = the data one uses to test the accuracy of a method
- A “word” may refer to:
  - type = distinct word (e.g., like)
  - token = distinct occurrence of a word (e.g., the type like might have 20,000 tokens in a corpus)
Simple n-grams
Let’s assume we want to predict the next word, based on the previous context of I dreamed I saw the knights in

- What we want to find is the likelihood of w_8 being the next word, given that we’ve seen w_1, ..., w_7
- This is P(w_8 | w_1, ..., w_7)
- But, to start with, we examine P(w_1, ..., w_8)

In general, for w_n, we are concerned with:

(1) P(w_1, ..., w_n) = P(w_1) P(w_2 | w_1) ... P(w_n | w_1, ..., w_{n-1})

These probabilities are impractical to calculate, as such long word sequences hardly ever occur in a corpus, if at all
- Too much data to store, even if we could calculate them
Unigrams
Approximate these probabilities with n-grams, for a given n
- Unigrams (n = 1):

(2) P(w_n | w_1, ..., w_{n-1}) ≈ P(w_n)

- Easy to calculate, but lack contextual information

(3) The quick brown fox jumped

- We would like to say that over has a higher probability in this context than lazy does
Bigrams
Bigrams (n = 2) give context & are still easy to calculate:

(4) P(w_n | w_1, ..., w_{n-1}) ≈ P(w_n | w_{n-1})

(5) P(over | The, quick, brown, fox, jumped) ≈ P(over | jumped)

The probability of a sentence:

(6) P(w_1, ..., w_n) = P(w_1) P(w_2 | w_1) P(w_3 | w_2) ... P(w_n | w_{n-1})
Markov models
A bigram model is also called a first-order Markov model
- first-order because it has one element of memory (one token in the past)
- Markov models are essentially weighted FSAs, i.e., the arcs between states have probabilities
- The states in the FSA are words
More on Markov models when we hit POS tagging ...
Bigram example
What is the probability of seeing the sentence The quick brown fox jumped over the lazy dog?

(7) P(The quick brown fox jumped over the lazy dog) = P(The | START) P(quick | The) P(brown | quick) ... P(dog | lazy)

- Probabilities are generally small, so log probabilities are usually used

Q: Does this favor shorter sentences?
- A: Yes, but it also depends upon P(END | last word)
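To make the log-probability point concrete, here is a minimal sketch in Python; the bigram probabilities and the <s>/</s> boundary markers are invented for this example, not taken from any corpus.

```python
import math

# Toy bigram probabilities (invented values, for illustration only)
bigram_prob = {
    ("<s>", "The"): 0.20, ("The", "quick"): 0.05, ("quick", "brown"): 0.30,
    ("brown", "fox"): 0.25, ("fox", "jumped"): 0.10, ("jumped", "over"): 0.15,
    ("over", "the"): 0.40, ("the", "lazy"): 0.02, ("lazy", "dog"): 0.50,
    ("dog", "</s>"): 0.60,
}

def sentence_logprob(words):
    """Sum of log P(w_i | w_{i-1}) over the sentence, including START/END markers."""
    padded = ["<s>"] + words + ["</s>"]
    return sum(math.log(bigram_prob[(prev, cur)])
               for prev, cur in zip(padded, padded[1:]))

# Summing log probabilities avoids underflow from multiplying many small numbers
print(sentence_logprob("The quick brown fox jumped over the lazy dog".split()))
```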
Trigrams
Trigrams (n = 3) encode more context
- Wider context: P(know | did, he) vs. P(know | he)
- Generally, trigrams are still short enough that we will have enough data to gather accurate probabilities
Training n-gram models
Go through the corpus and calculate relative frequencies:

(8) P(w_n | w_{n-1}) = C(w_{n-1}, w_n) / C(w_{n-1})

(9) P(w_n | w_{n-2}, w_{n-1}) = C(w_{n-2}, w_{n-1}, w_n) / C(w_{n-2}, w_{n-1})

This technique of gathering probabilities from a training corpus is called maximum likelihood estimation (MLE)
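A minimal sketch of MLE bigram estimation on a tiny invented corpus; the sentences and the <s>/</s> padding convention are illustrative choices, not part of the slides.

```python
from collections import Counter

# Toy corpus (invented sentences); <s> and </s> mark sentence boundaries
corpus = [
    "i dreamed i saw the knights in armor".split(),
    "i saw the quick brown fox".split(),
]

unigram_counts = Counter()
bigram_counts = Counter()
for sentence in corpus:
    tokens = ["<s>"] + sentence + ["</s>"]
    unigram_counts.update(tokens)
    bigram_counts.update(zip(tokens, tokens[1:]))

def p_mle(w, w_prev):
    """Relative frequency estimate: C(w_prev, w) / C(w_prev)."""
    return bigram_counts[(w_prev, w)] / unigram_counts[w_prev]

print(p_mle("saw", "i"))   # C(i, saw) / C(i) = 2 / 3
```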
Know your corpus
Earlier we mentioned splitting data into training & testing sets

- It’s important to remember what your training data is when applying your technology to new data
- If you train your trigram model on Shakespeare, then you have learned the probabilities in Shakespeare, not the probabilities of English overall
- What corpus you use depends on your purpose
Smoothing: Motivation
Assume: a bigram model has been trained on a good corpus (i.e., learned MLE bigram probabilities)

- It won’t have seen every possible bigram:
  - lickety split is a possible English bigram, but it may not be in the corpus
- Problem = data sparsity → zero-probability bigrams that are actually possible bigrams in the language

Smoothing techniques account for this
- Adjust probabilities to account for unseen data
- Make zero probabilities non-zero
Add-One Smoothing
One way to smooth is to add a count of one to every bigram:
- In order for the result to still be a probability distribution, all probabilities need to sum to one
- Thus: add the number of word types to the denominator
- We added one to every type of bigram, so we need to account for all our numerator additions

(10) P*(w_n | w_{n-1}) = (C(w_{n-1}, w_n) + 1) / (C(w_{n-1}) + V)

V = total number of word types in the lexicon
Smoothing example
So, if treasure trove never occurred in the data, but treasure occurred twice, we have:

(11) P*(trove | treasure) = (0 + 1) / (2 + V)

The probability won’t be very high, but it will be better than 0
- If the surrounding probabilities are high, treasure trove could be the best pick
- If the probability were zero, it would have no chance of appearing at all
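A minimal sketch of this calculation, with invented counts: treasure seen twice, treasure trove never, and a toy lexicon of only four word types.

```python
from collections import Counter

# Invented toy counts: "treasure" occurs twice, "treasure trove" never
unigram_counts = Counter({"treasure": 2, "trove": 1, "chest": 1, "map": 1})
bigram_counts = Counter({("treasure", "chest"): 2})
V = len(unigram_counts)   # number of word types in the (toy) lexicon

def p_add_one(w, w_prev):
    """P*(w | w_prev) = (C(w_prev, w) + 1) / (C(w_prev) + V)"""
    return (bigram_counts[(w_prev, w)] + 1) / (unigram_counts[w_prev] + V)

# The unseen bigram now gets (0 + 1) / (2 + V) instead of zero:
print(p_add_one("trove", "treasure"))   # 1 / 6 ≈ 0.167 with V = 4
```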
Discounting
An alternate way of viewing smoothing is as discounting
- Lowering non-zero counts to get the probability mass we need for the zero-count items
- The discounting factor can be defined as the ratio of the smoothed count to the MLE count

⇒ Jurafsky and Martin show that add-one smoothing can discount probabilities by a factor of 10!
- Too much of the probability mass is now in the zeros
We will examine one way of handling this; more in L645
Witten-Bell Discounting
Idea: Use the counts of words you have seen once to estimate those you have never seen

- Instead of simply adding one to every n-gram, compute the probability of w_{i-1}, w_i by seeing how likely w_{i-1} is at starting any bigram
- Words that begin lots of bigrams lead to higher “unseen bigram” probabilities
- Non-zero bigrams are discounted in essentially the same manner as zero-count bigrams

→ Jurafsky and Martin show that they are only discounted by about a factor of one
Witten-Bell Discounting formula
(12) zero-count bigrams: p*(w_i | w_{i-1}) = T(w_{i-1}) / (Z(w_{i-1}) (N(w_{i-1}) + T(w_{i-1})))

- T(w_{i-1}) = number of bigram types starting with w_{i-1}
  → determines how high the value will be (numerator)
- N(w_{i-1}) = number of bigram tokens starting with w_{i-1}
  → N(w_{i-1}) + T(w_{i-1}) gives the total number of “events” to divide by
- Z(w_{i-1}) = number of bigram types starting with w_{i-1} and having zero count
  → this distributes the probability mass between all zero-count bigrams starting with w_{i-1}
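A minimal sketch of equation (12) with invented counts; here Z(w_{i-1}) is computed as the number of vocabulary words never seen following w_{i-1}, which assumes any vocabulary word is a possible continuation.

```python
from collections import Counter

# Toy vocabulary and bigram counts (invented for illustration)
vocab = {"the", "quick", "brown", "fox", "jumped"}
bigram_counts = Counter({("the", "quick"): 3, ("the", "fox"): 1})

def p_wb_zero(w, w_prev):
    """Witten-Bell estimate for a zero-count bigram: T / (Z * (N + T))."""
    starting = {bg: c for bg, c in bigram_counts.items() if bg[0] == w_prev}
    T = len(starting)              # bigram types seen starting with w_prev
    N = sum(starting.values())     # bigram tokens starting with w_prev
    Z = len(vocab) - T             # continuations of w_prev never seen (zero count)
    return T / (Z * (N + T))

print(p_wb_zero("brown", "the"))   # T=2, N=4, Z=3 -> 2 / (3 * 6) ≈ 0.11
```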
Kneser-Ney Smoothing (absolute discounting)
Witten-Bell Discounting is based on using relative discounting factors
- Kneser-Ney simplifies this by using absolute discounting factors
- Instead of multiplying by a ratio, simply subtract some discounting factor
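A minimal sketch of plain absolute discounting under invented counts: subtract a fixed discount d from each seen bigram count and give the freed probability mass to a lower-order (here, unigram MLE) estimate. This shows only the discounting step; full Kneser-Ney additionally replaces the unigram term with a “continuation” probability, which is not shown here.

```python
from collections import Counter

# Invented toy counts, for illustration only
d = 0.75   # an arbitrary absolute discount
unigram_counts = Counter({"the": 4, "quick": 3, "fox": 2, "jumped": 1})
bigram_counts = Counter({("the", "quick"): 3, ("the", "fox"): 1})
total_tokens = sum(unigram_counts.values())

def p_abs_discount(w, w_prev):
    """max(C(w_prev, w) - d, 0) / C(w_prev), plus the freed mass times P(w)."""
    c_prev = sum(c for (w1, _), c in bigram_counts.items() if w1 == w_prev)
    types_prev = sum(1 for (w1, _) in bigram_counts if w1 == w_prev)
    discounted = max(bigram_counts[(w_prev, w)] - d, 0) / c_prev
    reserved = d * types_prev / c_prev      # probability mass freed by discounting
    return discounted + reserved * unigram_counts[w] / total_tokens

print(p_abs_discount("quick", "the"))    # seen bigram, slightly discounted
print(p_abs_discount("jumped", "the"))   # unseen bigram, still non-zero
```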
Class-based N-grams
Intuition: we may not have seen a word before, but we may have seen a word like it
- Never observed Shanghai, but have seen other cities
- Can use a type of hard clustering, where each word is only assigned to one class (IBM clustering)

(13) P(w_i | w_{i-1}) ≈ P(c_i | c_{i-1}) × P(w_i | c_i)
POS tagging equations will look fairly similar to this ...
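A minimal sketch of equation (13) with hypothetical hard class assignments and invented probabilities; real classes and estimates would come from a clustering algorithm and corpus counts.

```python
# Hypothetical hard word-to-class assignments and class probabilities
# (all values invented; real classes would come from a clustering algorithm)
word_class = {"Shanghai": "CITY", "London": "CITY", "in": "PREP"}
p_class_bigram = {("PREP", "CITY"): 0.30}          # P(c_i | c_{i-1})
p_word_given_class = {("Shanghai", "CITY"): 0.05}  # P(w_i | c_i)

def p_class_based(w, w_prev):
    """Equation (13): P(w_i | w_{i-1}) ≈ P(c_i | c_{i-1}) * P(w_i | c_i)."""
    c, c_prev = word_class[w], word_class[w_prev]
    return p_class_bigram[(c_prev, c)] * p_word_given_class[(w, c)]

# Even if "in Shanghai" was never seen as a word bigram, its class bigram was:
print(p_class_based("Shanghai", "in"))   # 0.30 * 0.05 = 0.015
```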
Backoff models: Basic idea
Assume a trigram model for predicting language, where we haven’t seen a particular trigram before

- Maybe we’ve seen the bigram or the unigram
- Backoff models allow one to try the most informative n-gram first and then back off to lower n-grams
Backoff equations
Roughly speaking, this is how a backoff model works:
- If this trigram has a non-zero count, use that:

  (14) P̂(w_i | w_{i-2} w_{i-1}) = P(w_i | w_{i-2} w_{i-1})

- Else, if the bigram count is non-zero, use that:

  (15) P̂(w_i | w_{i-2} w_{i-1}) = α_1 P(w_i | w_{i-1})

- In all other cases, use the unigram information:

  (16) P̂(w_i | w_{i-2} w_{i-1}) = α_2 P(w_i)
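A minimal sketch of this backoff scheme; the count tables and the α weights are placeholders here, whereas in practice the weights come from a discounting method so that the result remains a proper probability distribution.

```python
# Illustrative back-off weights; in practice each alpha depends on the context
# and is computed from a discounting method so probabilities still sum to one
ALPHA1, ALPHA2 = 0.4, 0.1

def p_backoff(w, w2, w1, trigram_p, bigram_p, unigram_p):
    """P-hat(w | w2 w1): use the trigram if seen, else the bigram, else the unigram."""
    if (w2, w1, w) in trigram_p:
        return trigram_p[(w2, w1, w)]
    if (w1, w) in bigram_p:
        return ALPHA1 * bigram_p[(w1, w)]
    return ALPHA2 * unigram_p.get(w, 0.0)

# Unseen trigram "maples want more", but the bigram "want more" was seen:
print(p_backoff("more", "maples", "want",
                trigram_p={}, bigram_p={("want", "more"): 0.2},
                unigram_p={"more": 0.05}))   # 0.4 * 0.2 = 0.08
```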
Backoff models: example
Assume: never seen the trigram maples want more before
- If we have seen want more, we use that bigram to calculate a probability estimate (P(more | want))
- But we’re now assigning probability to P(more | maples, want), which was zero before
- We won’t have a true probability model anymore
- This is why α_1 was used in the previous equations: to re-weight the backed-off probability downward

In general, backoff models are combined with discounting models
A note on information theory
Some very useful notions for n-gram work can be found in information theory. Basic ideas:

- entropy = a measure of how much information is needed to encode something
- perplexity = a measure of the amount of surprise of an outcome
- mutual information = the amount of information one item has about another item (e.g., collocations have high mutual information)
Take L645 to find out more ...
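To make the first two notions concrete, here is a minimal sketch computing entropy and perplexity for a toy unigram distribution; the probabilities are invented for illustration.

```python
import math

# A toy unigram distribution (invented probabilities)
p = {"the": 0.5, "quick": 0.25, "fox": 0.25}

entropy = -sum(prob * math.log2(prob) for prob in p.values())   # bits per word
perplexity = 2 ** entropy                                       # "effective" number of choices
print(entropy, perplexity)   # 1.5 bits, perplexity ≈ 2.83
```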