N-grams
L545
Dept. of Linguistics, Indiana University
Spring 2013
Morphosyntax
We just finished talking about morphology (cf. words)
- And pretty soon we’re going to discuss syntax (cf. sentences)

In between, we’ll handle words in context
- Today: n-gram language modeling
- Next time: POS tagging
Both of these topics are covered in more detail in L645
N-grams: Motivation
An n-gram is a stretch of text n words long
- Approximation of language: n-grams tell us something about language, but don’t capture structure
- Efficient: finding and using every, e.g., two-word collocation in a text is quick and easy to do

N-grams can help in a variety of NLP applications:
- Word prediction: n-grams can be used to aid in predicting the next word of an utterance, based on the previous n − 1 words
- Context-sensitive spelling correction
- Machine Translation post-editing
- ...
Corpus-based NLP
Corpus (pl. corpora) = a computer-readable collection of text and/or speech, often with annotations

- Use corpora to gather probabilities & other information about language use
- A corpus used to gather prior information = training data
- Testing data = the data one uses to test the accuracy of a method
- A “word” may refer to:
  - type = distinct word (e.g., like)
  - token = distinct occurrence of a word (e.g., the type like might have 20,000 tokens in a corpus)
Simple n-grams
Let’s assume we want to predict the next word, based on the previous context of I dreamed I saw the knights in

- What we want to find is the likelihood of w_8 being the next word, given that we’ve seen w_1, ..., w_7
- This is P(w_8 | w_1, ..., w_7)
- But, to start with, we examine P(w_1, ..., w_8)

In general, for w_n, we are concerned with:

(1) P(w_1, ..., w_n) = P(w_1) P(w_2 | w_1) ... P(w_n | w_1, ..., w_{n-1})

These probabilities are impractical to calculate, as such long word sequences hardly ever occur in a corpus, if at all
- Too much data to store, even if we could calculate them
Unigrams
Approximate these probabilities with n-grams, for a given n
- Unigrams (n = 1):

(2) P(w_n | w_1, ..., w_{n-1}) ≈ P(w_n)

- Easy to calculate, but lack contextual information

(3) The quick brown fox jumped

- We would like to say that over has a higher probability in this context than lazy does
Bigrams
Bigrams (n = 2) give context & are still easy to calculate:

(4) P(w_n | w_1, ..., w_{n-1}) ≈ P(w_n | w_{n-1})

(5) P(over | The, quick, brown, fox, jumped) ≈ P(over | jumped)

The probability of a sentence:

(6) P(w_1, ..., w_n) = P(w_1) P(w_2 | w_1) P(w_3 | w_2) ... P(w_n | w_{n-1})
Markov models
A bigram model is also called a first-order Markov model
- first-order because it has one element of memory (one token in the past)
- Markov models are essentially weighted FSAs, i.e., the arcs between states have probabilities
- The states in the FSA are words
More on Markov models when we hit POS tagging ...
Bigram example
What is the probability of seeing the sentence The quick brown fox jumped over the lazy dog?

(7) P(The quick brown fox jumped over the lazy dog) = P(The | START) P(quick | The) P(brown | quick) ... P(dog | lazy)

- Probabilities are generally small, so log probabilities are usually used

Q: Does this favor shorter sentences?
- A: Yes, but it also depends upon P(END | last word)
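To make the log-probability point concrete, here is a minimal sketch in Python; the bigram probabilities and the <s>/</s> boundary markers are invented for this example, not taken from any corpus.

```python
import math

# Toy bigram probabilities (invented values, for illustration only)
bigram_prob = {
    ("<s>", "The"): 0.20, ("The", "quick"): 0.05, ("quick", "brown"): 0.30,
    ("brown", "fox"): 0.25, ("fox", "jumped"): 0.10, ("jumped", "over"): 0.15,
    ("over", "the"): 0.40, ("the", "lazy"): 0.02, ("lazy", "dog"): 0.50,
    ("dog", "</s>"): 0.60,
}

def sentence_logprob(words):
    """Sum of log P(w_i | w_{i-1}) over the sentence, including START/END markers."""
    padded = ["<s>"] + words + ["</s>"]
    return sum(math.log(bigram_prob[(prev, cur)])
               for prev, cur in zip(padded, padded[1:]))

# Summing log probabilities avoids underflow from multiplying many small numbers
print(sentence_logprob("The quick brown fox jumped over the lazy dog".split()))
```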
Trigrams
Trigrams (n = 3) encode more context
- Wider context: P(know | did, he) vs. P(know | he)
- Generally, trigrams are still short enough that we will have enough data to gather accurate probabilities
Training n-gram models
Go through the corpus and calculate relative frequencies:

(8) P(w_n | w_{n-1}) = C(w_{n-1}, w_n) / C(w_{n-1})

(9) P(w_n | w_{n-2}, w_{n-1}) = C(w_{n-2}, w_{n-1}, w_n) / C(w_{n-2}, w_{n-1})

This technique of gathering probabilities from a training corpus is called maximum likelihood estimation (MLE)
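A minimal sketch of MLE bigram estimation on a tiny invented corpus; the sentences and the <s>/</s> padding convention are illustrative choices, not part of the slides.

```python
from collections import Counter

# Toy corpus (invented sentences); <s> and </s> mark sentence boundaries
corpus = [
    "i dreamed i saw the knights in armor".split(),
    "i saw the quick brown fox".split(),
]

unigram_counts = Counter()
bigram_counts = Counter()
for sentence in corpus:
    tokens = ["<s>"] + sentence + ["</s>"]
    unigram_counts.update(tokens)
    bigram_counts.update(zip(tokens, tokens[1:]))

def p_mle(w, w_prev):
    """Relative frequency estimate: C(w_prev, w) / C(w_prev)."""
    return bigram_counts[(w_prev, w)] / unigram_counts[w_prev]

print(p_mle("saw", "i"))   # C(i, saw) / C(i) = 2 / 3
```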
Know your corpus
Earlier we mentioned splitting data into training & testing sets

- It’s important to remember what your training data is when applying your technology to new data
- If you train your trigram model on Shakespeare, then you have learned the probabilities in Shakespeare, not the probabilities of English overall
- What corpus you use depends on your purpose
Smoothing: Motivation
Assume: a bigram model has been trained on a good corpus (i.e., learned MLE bigram probabilities)

- It won’t have seen every possible bigram:
  - lickety split is a possible English bigram, but it may not be in the corpus
- Problem = data sparsity → zero-probability bigrams that are actually possible bigrams in the language

Smoothing techniques account for this
- Adjust probabilities to account for unseen data
- Make zero probabilities non-zero
Add-One Smoothing
One way to smooth is to add a count of one to every bigram:
- In order for the result to still be a probability distribution, all probabilities need to sum to one
- Thus: add the number of word types to the denominator
- We added one to every type of bigram, so we need to account for all our numerator additions

(10) P*(w_n | w_{n-1}) = (C(w_{n-1}, w_n) + 1) / (C(w_{n-1}) + V)

V = total number of word types in the lexicon
Smoothing example
So, if treasure trove never occurred in the data, but treasure occurred twice, we have:

(11) P*(trove | treasure) = (0 + 1) / (2 + V)

The probability won’t be very high, but it will be better than 0
- If the surrounding probabilities are high, treasure trove could be the best pick
- If the probability were zero, it would have no chance of appearing at all
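A minimal sketch of this calculation, with invented counts: treasure seen twice, treasure trove never, and a toy lexicon of only four word types.

```python
from collections import Counter

# Invented toy counts: "treasure" occurs twice, "treasure trove" never
unigram_counts = Counter({"treasure": 2, "trove": 1, "chest": 1, "map": 1})
bigram_counts = Counter({("treasure", "chest"): 2})
V = len(unigram_counts)   # number of word types in the (toy) lexicon

def p_add_one(w, w_prev):
    """P*(w | w_prev) = (C(w_prev, w) + 1) / (C(w_prev) + V)"""
    return (bigram_counts[(w_prev, w)] + 1) / (unigram_counts[w_prev] + V)

# The unseen bigram now gets (0 + 1) / (2 + V) instead of zero:
print(p_add_one("trove", "treasure"))   # 1 / 6 ≈ 0.167 with V = 4
```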
Discounting
An alternate way of viewing smoothing is as discounting
- Lowering non-zero counts to get the probability mass we need for the zero-count items
- The discounting factor can be defined as the ratio of the smoothed count to the MLE count

⇒ Jurafsky and Martin show that add-one smoothing can discount probabilities by a factor of 10!
- Too much of the probability mass is now in the zeros
We will examine one way of handling this; more in L645
Witten-Bell Discounting
Idea: Use the counts of words you have seen once to estimate those you have never seen

- Instead of simply adding one to every n-gram, compute the probability of w_{i-1}, w_i by seeing how likely w_{i-1} is at starting any bigram
- Words that begin lots of bigrams lead to higher “unseen bigram” probabilities
- Non-zero bigrams are discounted in essentially the same manner as zero-count bigrams

→ Jurafsky and Martin show that they are only discounted by about a factor of one
Witten-Bell Discounting formula
(12) zero-count bigrams: p*(w_i | w_{i-1}) = T(w_{i-1}) / (Z(w_{i-1}) (N(w_{i-1}) + T(w_{i-1})))

- T(w_{i-1}) = number of bigram types starting with w_{i-1}
  → determines how high the value will be (numerator)
- N(w_{i-1}) = number of bigram tokens starting with w_{i-1}
  → N(w_{i-1}) + T(w_{i-1}) gives the total number of “events” to divide by
- Z(w_{i-1}) = number of bigram types starting with w_{i-1} and having zero count
  → this distributes the probability mass between all zero-count bigrams starting with w_{i-1}
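A minimal sketch of equation (12) with invented counts; here Z(w_{i-1}) is computed as the number of vocabulary words never seen following w_{i-1}, which assumes any vocabulary word is a possible continuation.

```python
from collections import Counter

# Toy vocabulary and bigram counts (invented for illustration)
vocab = {"the", "quick", "brown", "fox", "jumped"}
bigram_counts = Counter({("the", "quick"): 3, ("the", "fox"): 1})

def p_wb_zero(w, w_prev):
    """Witten-Bell estimate for a zero-count bigram: T / (Z * (N + T))."""
    starting = {bg: c for bg, c in bigram_counts.items() if bg[0] == w_prev}
    T = len(starting)              # bigram types seen starting with w_prev
    N = sum(starting.values())     # bigram tokens starting with w_prev
    Z = len(vocab) - T             # continuations of w_prev never seen (zero count)
    return T / (Z * (N + T))

print(p_wb_zero("brown", "the"))   # T=2, N=4, Z=3 -> 2 / (3 * 6) ≈ 0.11
```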
Kneser-Ney Smoothing (absolute discounting)
Witten-Bell Discounting is based on using relative discounting factors
- Kneser-Ney simplifies this by using absolute discounting factors
- Instead of multiplying by a ratio, simply subtract some discounting factor
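A minimal sketch of plain absolute discounting under invented counts: subtract a fixed discount d from each seen bigram count and give the freed probability mass to a lower-order (here, unigram MLE) estimate. This shows only the discounting step; full Kneser-Ney additionally replaces the unigram term with a “continuation” probability, which is not shown here.

```python
from collections import Counter

# Invented toy counts, for illustration only
d = 0.75   # an arbitrary absolute discount
unigram_counts = Counter({"the": 4, "quick": 3, "fox": 2, "jumped": 1})
bigram_counts = Counter({("the", "quick"): 3, ("the", "fox"): 1})
total_tokens = sum(unigram_counts.values())

def p_abs_discount(w, w_prev):
    """max(C(w_prev, w) - d, 0) / C(w_prev), plus the freed mass times P(w)."""
    c_prev = sum(c for (w1, _), c in bigram_counts.items() if w1 == w_prev)
    types_prev = sum(1 for (w1, _) in bigram_counts if w1 == w_prev)
    discounted = max(bigram_counts[(w_prev, w)] - d, 0) / c_prev
    reserved = d * types_prev / c_prev      # probability mass freed by discounting
    return discounted + reserved * unigram_counts[w] / total_tokens

print(p_abs_discount("quick", "the"))    # seen bigram, slightly discounted
print(p_abs_discount("jumped", "the"))   # unseen bigram, still non-zero
```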
Class-based N-grams
Intuition: we may not have seen a word before, but we may have seen a word like it
- Never observed Shanghai, but have seen other cities
- Can use a type of hard clustering, where each word is only assigned to one class (IBM clustering)

(13) P(w_i | w_{i-1}) ≈ P(c_i | c_{i-1}) × P(w_i | c_i)
POS tagging equations will look fairly similar to this ...
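A minimal sketch of equation (13) with hypothetical hard class assignments and invented probabilities; real classes and estimates would come from a clustering algorithm and corpus counts.

```python
# Hypothetical hard word-to-class assignments and class probabilities
# (all values invented; real classes would come from a clustering algorithm)
word_class = {"Shanghai": "CITY", "London": "CITY", "in": "PREP"}
p_class_bigram = {("PREP", "CITY"): 0.30}          # P(c_i | c_{i-1})
p_word_given_class = {("Shanghai", "CITY"): 0.05}  # P(w_i | c_i)

def p_class_based(w, w_prev):
    """Equation (13): P(w_i | w_{i-1}) ≈ P(c_i | c_{i-1}) * P(w_i | c_i)."""
    c, c_prev = word_class[w], word_class[w_prev]
    return p_class_bigram[(c_prev, c)] * p_word_given_class[(w, c)]

# Even if "in Shanghai" was never seen as a word bigram, its class bigram was:
print(p_class_based("Shanghai", "in"))   # 0.30 * 0.05 = 0.015
```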
Backoff models: Basic idea
Assume a trigram model for predicting language, where we haven’t seen a particular trigram before

- Maybe we’ve seen the bigram or the unigram
- Backoff models allow one to try the most informative n-gram first and then back off to lower n-grams
Backoff equations
Roughly speaking, this is how a backoff model works:
- If this trigram has a non-zero count, use that:

  (14) P̂(w_i | w_{i-2} w_{i-1}) = P(w_i | w_{i-2} w_{i-1})

- Else, if the bigram count is non-zero, use that:

  (15) P̂(w_i | w_{i-2} w_{i-1}) = α_1 P(w_i | w_{i-1})

- In all other cases, use the unigram information:

  (16) P̂(w_i | w_{i-2} w_{i-1}) = α_2 P(w_i)
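A minimal sketch of this backoff scheme; the count tables and the α weights are placeholders here, whereas in practice the weights come from a discounting method so that the result remains a proper probability distribution.

```python
# Illustrative back-off weights; in practice each alpha depends on the context
# and is computed from a discounting method so probabilities still sum to one
ALPHA1, ALPHA2 = 0.4, 0.1

def p_backoff(w, w2, w1, trigram_p, bigram_p, unigram_p):
    """P-hat(w | w2 w1): use the trigram if seen, else the bigram, else the unigram."""
    if (w2, w1, w) in trigram_p:
        return trigram_p[(w2, w1, w)]
    if (w1, w) in bigram_p:
        return ALPHA1 * bigram_p[(w1, w)]
    return ALPHA2 * unigram_p.get(w, 0.0)

# Unseen trigram "maples want more", but the bigram "want more" was seen:
print(p_backoff("more", "maples", "want",
                trigram_p={}, bigram_p={("want", "more"): 0.2},
                unigram_p={"more": 0.05}))   # 0.4 * 0.2 = 0.08
```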
Backoff models: example
Assume: never seen the trigram maples want more before
- If we have seen want more, we use that bigram to calculate a probability estimate (P(more | want))
- But we’re now assigning probability to P(more | maples, want), which was zero before
- We won’t have a true probability model anymore
- This is why α_1 was used in the previous equations: to re-weight the backed-off probability downward

In general, backoff models are combined with discounting models
A note on information theory
Some very useful notions for n-gram work can be found in information theory. Basic ideas:

- entropy = a measure of how much information is needed to encode something
- perplexity = a measure of the amount of surprise of an outcome
- mutual information = the amount of information one item has about another item (e.g., collocations have high mutual information)
Take L645 to find out more ...
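To make the first two notions concrete, here is a minimal sketch computing entropy and perplexity for a toy unigram distribution; the probabilities are invented for illustration.

```python
import math

# A toy unigram distribution (invented probabilities)
p = {"the": 0.5, "quick": 0.25, "fox": 0.25}

entropy = -sum(prob * math.log2(prob) for prob in p.values())   # bits per word
perplexity = 2 ** entropy                                       # "effective" number of choices
print(entropy, perplexity)   # 1.5 bits, perplexity ≈ 2.83
```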