Page 1

Topics in Computational Linguistics
Week 5: ngrams and language model

Shu-Kai Hsieh

Lab of Ontologies, Language Processing and e-Humanities
GIL, National Taiwan University

March 28, 2014

Page 2

1 N-grams model
  Evaluation
  Smoothing Techniques

2 Web-scaled N-grams

3 Related Topics

4 The Entropy of Natural Languages

5 Lab

Page 3

Language models

• Statistical/probabilistic language models aim to compute either:
  • the probability of a sentence or sequence of words, P(S) = P(w1, w2, w3, ..., wn), or
  • the probability of an upcoming word, P(wn | w1, w2, w3, ..., wn−1),
  which will turn out to be closely related to computing the probability of a sequence of words.
• The N-gram model is one of the most important tools in speech and language processing.
• Varied applications: spell checking, MT, speech recognition, QA, etc.

Page 4

1 N-grams model
  Evaluation
  Smoothing Techniques

2 Web-scaled N-grams

3 Related Topics

4 The Entropy of Natural Languages

5 Lab

Page 5

Simple n-gram model

• Let’s start with calculating P(S), say, P(S) = P(學, 語言, 很, 有趣) (“learning languages is fun”).

Page 6

Review of Joint and Conditional Probability

• Recall that the conditional probability of X given Y, P(X|Y), is defined in terms of the probability of Y, P(Y), and the joint probability of X and Y, P(X,Y):

P(X|Y) = P(X,Y) / P(Y)

Page 7

Review of Chain Rule of Probability

Conversely, the joint probability P(X,Y) can be expressed in terms of the conditional probability P(X|Y):

P(X,Y) = P(X|Y) P(Y)

which leads to the chain rule

P(X1, X2, X3, ..., Xn)
= P(X1) P(X2|X1) P(X3|X1,X2) ... P(Xn|X1, ..., Xn−1)
= P(X1) ∏_{i=2}^{n} P(Xi | X1, ..., Xi−1)

Page 8

The Chain Rule applied to calculate the joint probability of words in a sentence

chain rule of probability

P(S) = P(w_1^n)
= P(w1) P(w2|w1) P(w3|w_1^2) ... P(wn | w_1^{n−1})
= ∏_{k=1}^{n} P(wk | w_1^{k−1})
= P(學) · P(語言|學) · P(很|學 語言) · P(有趣|學 語言 很)

Page 9

How to Estimate these Probabilities?

• Maximum Likelihood Estimation (MLE): simply count occurrences in a corpus and normalize the counts so that they lie between 0 and 1. (There are, of course, more sophisticated algorithms.)¹

count and divide
P(嗎|學語言很有趣) = Count(學語言很有趣嗎) / Count(學語言很有趣)

¹ MLE is sometimes called relative frequency estimation.
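A minimal sketch of this count-and-divide estimate in Python (the toy corpus, tokenization, and function name are made up for illustration):

```python
from collections import Counter

def mle_prob(tokens, context, word, n=2):
    """MLE: P(word | context) = Count(context + word) / Count(context),
    where `context` is a tuple of the n-1 preceding words."""
    histories = Counter(zip(*[tokens[i:] for i in range(n - 1)]))
    ngrams = Counter(zip(*[tokens[i:] for i in range(n)]))
    if histories[context] == 0:
        return 0.0  # unseen context: MLE gives no answer; smoothing handles this later
    return ngrams[context + (word,)] / histories[context]

# Hypothetical pre-tokenized corpus
corpus = "學 語言 很 有趣 嗎 學 語言 很 好玩".split()
print(mle_prob(corpus, ("語言",), "很"))  # Count(語言 很) / Count(語言) = 2/2 = 1.0
```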

Page 10

Markov Assumption: Don’t look too far into the past

Simplified idea: instead of computing the probability of a word given its entire history, we can approximate the history by just the last few words.

P(嗎|學語言很有趣) ≈ P(嗎|有趣), or
P(嗎|學語言很有趣) ≈ P(嗎|很有趣)

Page 11

In other words

• Bi-gram model: approximates the probability of a word given all the previous words, P(wn | w_1^{n−1}), by using only the conditional probability of the preceding word, P(wn | wn−1). This generalizes to the N-gram case as P(wn | w_1^{n−1}) ≈ P(wn | w_{n−N+1}^{n−1}).
• Tri-gram: (your turn)
• We can extend to trigrams, 4-grams, 5-grams, knowing that in general this is an insufficient model of language, because language has long-distance dependencies, e.g. 我 在一個非常奇特的機緣巧合 之下 學 梵文 (“under a very peculiar coincidence, I learned Sanskrit”).

Page 12

In other words

• So, given the bi-gram assumption for the probability of an individual word, we can compute the probability of the entire sentence as

P(S) = P(w_1^n) ≈ ∏_{k=1}^{n} P(wk | wk−1)

• Recall the MLE in the JM book, equations (4.13)-(4.14).
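A rough sketch of this computation on toy counts; treating P(w1) as a unigram MLE is my assumption here, since the slide does not show how the first word is handled:

```python
from collections import Counter

def bigram_sentence_prob(sentence, unigrams, bigrams, total):
    """P(S) ≈ P(w1) · ∏ P(w_k | w_{k-1}) under the bigram assumption."""
    p = unigrams[sentence[0]] / total  # P(w1) as a unigram MLE (assumption)
    for prev, w in zip(sentence, sentence[1:]):
        if unigrams[prev] == 0:
            return 0.0
        p *= bigrams[(prev, w)] / unigrams[prev]  # MLE bigram factor
    return p

corpus = "學 語言 很 有趣 學 語言 很 好玩".split()
uni = Counter(corpus)
bi = Counter(zip(corpus, corpus[1:]))
print(bigram_sentence_prob("學 語言 很 有趣".split(), uni, bi, len(corpus)))  # 0.125
```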

Page 13

Example: Language Modeling of Alice.txt

Page 14

Page 15

Exercise

• Walk through the example of the Berkeley Restaurant Project sentences (JM pp. 90-91).

By the way, in practice we do everything in log space to avoid underflow (also, adding is faster than multiplying):

log(p1 · p2 · p3) = log p1 + log p2 + log p3
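The same trick in a few lines of Python (toy factor values):

```python
import math

factors = [0.25, 1.0, 1.0, 0.5]            # e.g. the bigram factors of one sentence
log_p = sum(math.log(p) for p in factors)  # add log-probabilities instead of multiplying
print(log_p, math.exp(log_p))              # -2.0794... and 0.125, with no underflow risk
```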

Page 16

Google n-gram and Google Suggestion

Page 17

Generating the Wall Street Journal vs Generating Shakespeare

Page 18

Generating the Wall Street Journal vs Generating Shakespeare

Page 19

• The quadrigram output looks like Shakespeare because it is Shakespeare.
• The N-gram model is very sensitive to the training corpus!

Overfitting issue

• N-grams only work well for word prediction if the test corpus looks like the training corpus, but in real life it often doesn’t.
• We need to train more robust models that generalize, e.g. to handle the zeros issue: things that never occur in the training set but do occur in the test set.

Page 20

Evaluation

1 N-grams model
  Evaluation
  Smoothing Techniques

2 Web-scaled N-grams

3 Related Topics

4 The Entropy of Natural Languages

5 Lab

Page 21

Evaluation

Evaluating n-gram models

How good is our model? How can we make it better (more robust)?

• N-gram language models are evaluated by separating the corpus into a training set and a test set, training the model on the training set, and evaluating it on the test set. An evaluation metric tells us how well our model does on the test set.
• Extrinsic (in vivo) evaluation
• Intrinsic evaluation: perplexity (2^H of the language model on a test set is used to compare language models)

Page 22

Evaluation

Evaluating the N-gram Model

But the model relies heavily on the corpus it was trained on, and thus often overfits!

Example
• Given a vocabulary of 20,000 types, the potential number of bigrams is 20,000² = 400,000,000, and with trigrams it amounts to the astronomical figure of 20,000³ = 8 × 10¹². No corpus yet has the size to cover the corresponding word combinations.
• MLE gives no hint on how to estimate their probabilities.
• Here we use smoothing (or discounting) techniques to estimate the probabilities of unseen n-grams, presumably so called because a distribution without zeros is smoother than one with zeros.

Page 23

Evaluation

Perplexity

• The best language model is one that best predicts an unseen test set (i.e., gives the highest P(sentence)).
• Perplexity is defined as the inverse probability of the test set, normalized by the number of words.
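In formula form (the standard JM definition; the slide states it only in words):

```latex
\mathrm{PP}(W) = P(w_1 w_2 \ldots w_N)^{-\frac{1}{N}}
             = \sqrt[N]{\frac{1}{P(w_1 w_2 \ldots w_N)}}
```

Lower perplexity therefore means the model assigns higher probability to the test set.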

Page 24

Smoothing Techniques

The intuition of smoothing (from Dan Klein)

Page 25

Smoothing Techniques

Smoothing Techniques

Smoothing n-gram probabilities

• Sparse data: the corpus is not big enough to have all the bigrams covered with a realistic estimate.
• Smoothing algorithms provide a better way of estimating the probability of n-grams than Maximum Likelihood Estimation.

Page 26

Smoothing Techniques

Smoothing Techniques

• Laplace Smoothing (a.k.a. the add-one method)
• Interpolation
• Backoff
• Good-Turing Estimation (Discounting)
• Kneser-Ney Smoothing

Page 27

Smoothing Techniques

Laplace Smoothing

• Pretend we saw each word one more time than we did.
• Re-estimate by just adding one to all the counts!
• Read the BeRP examples (JM pp. 99-100); a sketch follows below.
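A minimal sketch of the add-one estimate, assuming a closed vocabulary of size V (toy counts, not the BeRP data):

```python
from collections import Counter

def laplace_bigram_prob(w_prev, w, bigrams, unigrams, V):
    """Add-one: P(w | w_prev) = (Count(w_prev, w) + 1) / (Count(w_prev) + V)."""
    return (bigrams[(w_prev, w)] + 1) / (unigrams[w_prev] + V)

corpus = "學 語言 很 有趣 學 語言 很 好玩".split()
uni = Counter(corpus)
bi = Counter(zip(corpus, corpus[1:]))
V = len(uni)  # vocabulary size of the toy corpus: 5
print(laplace_bigram_prob("很", "有趣", bi, uni, V))  # seen bigram:   (1+1)/(2+5)
print(laplace_bigram_prob("很", "學", bi, uni, V))    # unseen bigram: (0+1)/(2+5)
```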

Page 28

Smoothing Techniques

Laplace Smoothing: Comparing with Raw Bigram Counts

Page 29

Smoothing Techniques

Laplace Smoothing: It’s a blunt estimate

• Too much probability mass is moved to all the zeros.

The guest upstages the host (喧賓奪主): to accommodate the huge number of zeros, the count of “Chinese food” can shrink by a factor of 10!

Page 30

Smoothing Techniques

(Katz) Backoff and Interpolation

Intuition

Sometimes it helps to use less context: condition on less context for contexts you haven’t learned much about.

• Backoff and Interpolation are two more strategies that utilize n-grams of variable length.
• Backoff: use the trigram if you have good evidence; otherwise the bigram; otherwise the unigram.
• Interpolation: mix unigram, bigram, and trigram.

Page 31

Smoothing Techniques

Katz Back-off

• The idea is to use the frequency of the longest available n-gram, and if no such n-gram is available, to back off to the (n−1)-gram, and then to the (n−2)-gram, and so on.
• If n = 3, we first try trigrams, then bigrams, and finally unigrams.

Page 32

Smoothing Techniques

P∗ and α?

• P∗: the discounted probability (rather than the MLE probability), e.g. from Good-Turing.
• α: the normalizing factor.
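Assembled into the bigram case of Katz backoff as given in JM (the slide itself showed the equation as an image, so this is the standard textbook form, not a transcription):

```latex
P_{\text{katz}}(w_i \mid w_{i-1}) =
\begin{cases}
  P^{*}(w_i \mid w_{i-1}) & \text{if } C(w_{i-1} w_i) > 0 \\
  \alpha(w_{i-1}) \, P^{*}(w_i) & \text{otherwise}
\end{cases}
```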

Page 33

Smoothing Techniques

Linear Interpolation 線性插值

Linearly combine the higher-order and the lower-order models.

• Simple interpolation

• Lambdas conditional on context
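A sketch of the simple (context-independent) version; the λ values below are illustrative placeholders, constrained to sum to 1:

```python
def interpolated_prob(p_uni, p_bi, p_tri, lambdas=(0.1, 0.3, 0.6)):
    """P-hat(w_n | w_{n-2}, w_{n-1}) = l1*P(w_n) + l2*P(w_n|w_{n-1}) + l3*P(w_n|w_{n-2},w_{n-1})."""
    l1, l2, l3 = lambdas
    assert abs(l1 + l2 + l3 - 1.0) < 1e-9, "the lambdas must sum to 1"
    return l1 * p_uni + l2 * p_bi + l3 * p_tri

# Toy component probabilities for one word in one context:
print(interpolated_prob(p_uni=0.01, p_bi=0.2, p_tri=0.5))  # 0.001 + 0.06 + 0.3 = 0.361
```

In practice the λs are trained on a held-out set rather than guessed.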

Page 34

Smoothing Techniques

Advanced Discounting Techniques

Intuition

Use the count of things you’ve seen once to help estimate the count of things you’ve never seen.

• Good-Turing
• Witten-Bell
• Kneser-Ney

Page 35

Smoothing Techniques

Good-Turing Smoothing: Notations

• A word or N-gram (or any event) that occurs only once is called a singleton, or a hapax legomenon.
• Nc: the number of things we’ve seen c times, i.e., the frequency of frequency c.

Example (in terms of bigrams)

N0 is the number of bigrams with count 0, N1 the number of bigrams with count 1 (singletons), etc.
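The resulting Good-Turing re-estimate in its standard form (added here because the slide’s “answer” appears on a later image-only slide):

```latex
c^{*} = (c + 1)\,\frac{N_{c+1}}{N_{c}},
\qquad
P^{*}_{\text{GT}}(\text{unseen}) = \frac{N_{1}}{N}
```

where N is the total number of observed events.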

Page 36

Smoothing Techniques

Good-Turing Smoothing: Intuition

[2]: pp. 101-102

Page 37

Smoothing Techniques

Good-Turing Smoothing: Answer

Page 38

Smoothing Techniques

Other advanced Smoothing Techniques

Page 39

1 N-grams model
  Evaluation
  Smoothing Techniques

2 Web-scaled N-grams

3 Related Topics

4 The Entropy of Natural Languages

5 Lab

Page 40

How to deal with huge web-scale n-grams

How might one build a language model (an n-gram model) that scales to very large amounts of training data?

• Naive pruning: only store N-grams with count ≥ some threshold, and remove singletons of higher-order n-grams.
• Entropy-based pruning

Page 41

Smoothing for Web-scaled N-grams

“Standard backoff” uses variations of context-dependent backoff, where the p are pre-computed and stored probabilities, and the λ are back-off weights.

Page 42

Smoothing for Web-scaled N-grams

“Stupid backoff” [1] doesn’t apply any discounting and instead directly uses the relative frequencies (S is used instead of P to emphasize that these are not probabilities but scores):
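The score from [1], reconstructed here because the slide displayed it as an image:

```latex
S(w_i \mid w_{i-k+1}^{i-1}) =
\begin{cases}
  \dfrac{f(w_{i-k+1}^{i})}{f(w_{i-k+1}^{i-1})} & \text{if } f(w_{i-k+1}^{i}) > 0 \\[1ex]
  \lambda \, S(w_i \mid w_{i-k+2}^{i-1}) & \text{otherwise}
\end{cases}
```

The recursion ends at the unigram score S(w_i) = f(w_i)/N, and [1] report that λ = 0.4 worked well.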

Page 43

LM Tools and n-gram Resources

• CMU Statistical Language Modeling Toolkit: http://www.speech.cs.cmu.edu/SLM/toolkit.html
• SRILM: http://www.speech.sri.com/projects/srilm/
• Google Web 1T 5-gram: http://googleresearch.blogspot.com/2006/08/all-our-n-gram-are-belong-to-you.html
• Google Books N-grams
• Chinese Web 5-gram: http://www.ldc.upenn.edu/Catalog/catalogEntry.jsp?catalogId=LDC2010T06

Page 44

Quick demo of CMU-LM

Page 45

Google book ngrams

Page 46

From Corpus-based to Google-based Linguistics

Enhancing Linguistic Search with the Google Books Ngram Viewer

Page 47

From Corpus-based to Google-based Linguistics

Syntactic N-grams are coming out too!
http://commondatastorage.googleapis.com/books/syntactic-ngrams/index.html

Page 48

Exercise

The Google Web 1T 5-Gram Database — SQLite Index & Web Interface

Page 49

Applications

What can next-word prediction (based on probabilistic language models) do today?

source: fandywang, 2012

Example
Products: SwiftKey, XT9 by Nuance / pierre...

Page 50

You’d definitely like to try this

An Automatic CS Paper Generator: http://pdos.csail.mit.edu/scigen/

Page 51

Collocations

• Collocations are recurrent combinations of words.

Example
• Simple collocations are fixed n-grams, such as The Wall Street.
• Collocations with predicative relations involve morpho-syntactic variation, such as the one linking make and decision: to make a decision, decisions to be made, made an important decision, etc.

Page 52

Collocations

• Statistically, collocates are events that co-occur more often than by chance.
• Measures used to calculate the strength of word preference are Mutual Information, the t-score, and the likelihood ratio.

MI
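The standard pointwise mutual information of words x and y (the slide’s own formula was an image):

```latex
I(x, y) = \log_{2} \frac{P(x, y)}{P(x)\,P(y)}
```

Values well above 0 indicate that x and y co-occur more often than chance would predict.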

Page 53

Lab

• ngramR for Google Books ngrams
• Python nltk [see the extra IPython notebook]

Example

For newbies in Python: https://www.coursera.org/course/interactivepython
For a quick start (develop and host Python from your browser): https://www.pythonanywhere.com/
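A small warm-up for the nltk part of the lab, tying back to the Alice.txt example earlier in the deck (a sketch; assumes nltk and its Gutenberg corpus data are installed):

```python
import nltk
from nltk.corpus import gutenberg

# nltk.download('gutenberg')  # uncomment on first use
words = [w.lower() for w in gutenberg.words('carroll-alice.txt')]

# Conditional frequency distribution over bigrams: counts of w2 for each w1
cfd = nltk.ConditionalFreqDist(nltk.bigrams(words))

print(cfd['alice'].most_common(5))             # the most frequent words after 'alice'
print(cfd['alice']['was'] / cfd['alice'].N())  # bigram MLE estimate of P(was | alice)
```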

Page 54

Homework.week5

80%: Exercise 4.3, JM book p. 122.
20%: Preview chapter 5 [2].

Page 55

Homework.week6

20%: Read the Academia Sinica Balanced Corpus manual (http://app.sinica.edu.tw/kiwi/mkiwi/98-04.pdf) and preview chapter 6.

80%: Implement a language model of the Cross-Strait Service Trade Agreement (服貿) debate (data will be provided), and use it to build an automatic PRO/CON text generator.

Page 56

References

[1] Thorsten Brants, Ashok C. Popat, Peng Xu, Franz J. Och, and Jeffrey Dean. Large language models in machine translation. In Proceedings of the Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning. Citeseer, 2007.

[2] Dan Jurafsky and James H. Martin. Speech & Language Processing. Pearson Education India, 2000.
