
Lecture 4: Language Model Evaluation and Advanced Methods

Kai-Wei Chang, CS @ University of Virginia

kw@kwchang.net

Course webpage: http://kwchang.net/teaching/NLP16


This lecture

- Kneser-Ney smoothing
- Discriminative language models
- Neural language models
- Evaluation: cross-entropy and perplexity

Recap: Smoothing

- Add-one smoothing
- Add-λ smoothing
  - Parameters tuned by cross-validation
- Witten-Bell smoothing
  - T: # word types, N: # tokens
  - T/(N+T): total prob. mass for unseen words
  - N/(N+T): total prob. mass for observed tokens
- Good-Turing
  - Reallocate the probability mass of n-grams that occur r+1 times to n-grams that occur r times.

Recap: Back-off and interpolation

- Idea: even if we've never seen "red glasses", we know it is more likely to occur than "red abacus"
- Interpolation: p_average(z | xy) = µ3 p(z | xy) + µ2 p(z | y) + µ1 p(z), where µ3 + µ2 + µ1 = 1 and all are ≥ 0 (a sketch in code follows)
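A minimal Python sketch of this interpolation (illustrative, not the lecture's code; the Counter inputs and the particular weight values are assumptions, with the weights normally tuned on held-out data):

def p_interpolated(z, x, y, trigram, bigram, unigram, total,
                   mu3=0.6, mu2=0.3, mu1=0.1):
    """p_average(z | x y) = mu3*p(z | x y) + mu2*p(z | y) + mu1*p(z)."""
    p_tri = trigram[(x, y, z)] / bigram[(x, y)] if bigram[(x, y)] else 0.0
    p_bi = bigram[(y, z)] / unigram[y] if unigram[y] else 0.0
    p_uni = unigram[z] / total               # total = number of training tokens
    return mu3 * p_tri + mu2 * p_bi + mu1 * p_uni

Here `trigram`, `bigram`, and `unigram` are assumed to be collections.Counter objects of n-gram counts from the training corpus.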


Absolute Discounting

- Save ourselves some time and just subtract 0.75 (or some discount d)!

  P_AbsoluteDiscounting(w_i | w_{i-1}) = (c(w_{i-1}, w_i) - d) / c(w_{i-1}) + λ(w_{i-1}) P(w_i)

  The first term is the discounted bigram, λ(w_{i-1}) is the interpolation weight, and P(w_i) is the unigram probability.
- But should we really just use the regular unigram P(w)? (A code sketch of the formula follows.)
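A rough Python sketch of the absolute-discounting estimate above (illustrative only; the Counter-style `bigram`/`unigram` inputs and `total` token count are assumptions):

def p_absolute_discount(w, prev, bigram, unigram, total, d=0.75):
    """(c(prev, w) - d) / c(prev) + lambda(prev) * P(w)."""
    c_prev = unigram[prev]
    if c_prev == 0:
        return unigram[w] / total                 # no bigram evidence: fall back to the unigram
    discounted = max(bigram[(prev, w)] - d, 0) / c_prev
    n_types = sum(1 for (a, _b) in bigram if a == prev)   # word types observed after `prev`
    lam = (d / c_prev) * n_types                  # the probability mass we discounted
    return discounted + lam * (unigram[w] / total)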


Kneser-Ney Smoothing

- Better estimate for probabilities of lower-order unigrams!
- Shannon game: "I can't see without my reading ___________?"
  - "Francisco" is more common than "glasses"
  - ... but "Francisco" always follows "San"

Kneser-Ney Smoothing

- Instead of P(w): "How likely is w?"
- P_continuation(w): "How likely is w to appear as a novel continuation?"
- For each word, count the number of bigram types it completes
- Every bigram type was a novel continuation the first time it was seen

  P_CONTINUATION(w) ∝ |{w_{i-1} : c(w_{i-1}, w) > 0}|

Kneser-Ney Smoothing

- How many times does w appear as a novel continuation?
- Normalized by the total number of word bigram types:

  P_CONTINUATION(w) = |{w_{i-1} : c(w_{i-1}, w) > 0}| / |{(w_{j-1}, w_j) : c(w_{j-1}, w_j) > 0}|

Kneser-Ney Smoothing

- Alternative metaphor: the number of word types seen to precede w,

  |{w_{i-1} : c(w_{i-1}, w) > 0}|

- normalized by the number of word types preceding all words:

  P_CONTINUATION(w) = |{w_{i-1} : c(w_{i-1}, w) > 0}| / Σ_{w'} |{w'_{i-1} : c(w'_{i-1}, w') > 0}|

- A frequent word ("Francisco") occurring in only one context ("San") will have a low continuation probability. (A counting sketch follows.)
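A small sketch of computing continuation probabilities from a list of observed bigrams (illustrative; the function and variable names are made up):

from collections import defaultdict

def continuation_prob(bigrams):
    """P_CONTINUATION(w) = |{w_prev : c(w_prev, w) > 0}| / |{bigram types}|."""
    preceding = defaultdict(set)          # word -> set of distinct left contexts
    bigram_types = set()
    for prev, w in bigrams:
        preceding[w].add(prev)
        bigram_types.add((prev, w))
    return {w: len(ctx) / len(bigram_types) for w, ctx in preceding.items()}

# "Francisco" may be frequent, but if it only ever follows "San",
# len(preceding["Francisco"]) == 1 and its continuation probability stays tiny.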


Kneser-Ney Smoothing

P_KN(w_i | w_{i-1}) = max(c(w_{i-1}, w_i) - d, 0) / c(w_{i-1}) + λ(w_{i-1}) P_CONTINUATION(w_i)

λ(w_{i-1}) = (d / c(w_{i-1})) ⋅ |{w : c(w_{i-1}, w) > 0}|

λ(w_{i-1}) is a normalizing constant: the probability mass we've discounted. d / c(w_{i-1}) is the normalized discount, and |{w : c(w_{i-1}, w) > 0}| is the number of word types that can follow w_{i-1} = # of word types we discounted = # of times we applied the normalized discount. (A code sketch follows.)
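A compact sketch of the bigram Kneser-Ney estimate above (illustrative; `p_cont` is assumed to map each word to its continuation probability, e.g. from the counting sketch earlier):

def p_kneser_ney(w, prev, bigram, unigram, p_cont, d=0.75):
    c_prev = unigram[prev]
    if c_prev == 0:
        return p_cont.get(w, 0.0)                 # unseen context: rely on the continuation prob.
    discounted = max(bigram[(prev, w)] - d, 0) / c_prev
    n_follow = sum(1 for (a, _b) in bigram if a == prev)   # word types that can follow `prev`
    lam = (d / c_prev) * n_follow                 # normalized discount * # of types discounted
    return discounted + lam * p_cont.get(w, 0.0)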


Kneser-Ney Smoothing: Recursive formulation

P_KN(w_i | w_{i-n+1}^{i-1}) = max(c_KN(w_{i-n+1}^{i}) - d, 0) / c_KN(w_{i-n+1}^{i-1}) + λ(w_{i-n+1}^{i-1}) P_KN(w_i | w_{i-n+2}^{i-1})

where c_KN(⋅) = count(⋅) for the highest order, and continuation-count(⋅) for lower orders.
Continuation count = the number of unique single-word contexts for ⋅.

Practical issue: Huge web-scale n-grams

- How to deal with, e.g., the Google N-gram corpus?
- Pruning
  - Only store N-grams with count > threshold.
  - Remove singletons of higher-order n-grams.

Huge web-scale n-grams

- Efficiency
  - Efficient data structures, e.g., tries
  - Store words as indexes, not strings
  - Quantize probabilities (4-8 bits instead of an 8-byte float)
  (A toy sketch follows the link below.)

https://en.wikipedia.org/wiki/Trie
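Purely as an illustration of the storage tricks above (not the actual Google n-gram format), a tiny trie keyed on word indexes with log-probabilities quantized to 8 bits:

class TrieNode:
    __slots__ = ("children", "qprob")
    def __init__(self):
        self.children = {}       # word index -> TrieNode
        self.qprob = None        # 8-bit bucket for the log-probability

def insert(root, ngram_ids, logprob, lo=-20.0, hi=0.0):
    node = root
    for wid in ngram_ids:                       # words stored as integer indexes, not strings
        node = node.children.setdefault(wid, TrieNode())
    node.qprob = int(round(255 * (logprob - lo) / (hi - lo)))   # quantize to 0..255

def lookup(root, ngram_ids, lo=-20.0, hi=0.0):
    node = root
    for wid in ngram_ids:
        if wid not in node.children:
            return None                         # n-gram pruned or unseen
        node = node.children[wid]
    return None if node.qprob is None else lo + node.qprob * (hi - lo) / 255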


Smoothing
(Slide from J. Eisner, 600.465 Intro to NLP.)

This dark art is why NLP is taught in the engineering school.

There are more principled smoothing methods, too. We'll look next at log-linear models, which are a good and popular general technique.

Conditional Modeling

- Generative language model (trigram model):

  P(w_1, ..., w_n) = P(w_1) P(w_2 | w_1) ... P(w_n | w_{n-2}, w_{n-1})

  Then, we compute the conditional probabilities by maximum likelihood estimation.
- Can we model P(w_i | w_{i-2}, w_{i-1}) directly?
- Given a context x, which outcomes y are likely in that context?

  P(NextWord = y | PrecedingWords = x)

Modeling conditional probabilities

- Let's assume P(y | x) = exp(score(x, y)) / Σ_{y'} exp(score(x, y')), where y is the next word and x the preceding words
- P(y | x) is high ⇔ score(x, y) is high
- This is called the soft-max
- It requires P(y | x) ≥ 0 and Σ_y P(y | x) = 1, which is not true of score(x, y) itself (a sketch follows)
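A minimal sketch of the soft-max parameterization above (illustrative; `score` is any function of (x, y) and `vocab` is the candidate vocabulary, both assumptions):

import math

def p_softmax(y, x, vocab, score):
    """P(y | x) = exp(score(x, y)) / sum_{y'} exp(score(x, y'))."""
    m = max(score(x, yp) for yp in vocab)         # subtract the max for numerical stability
    Z = sum(math.exp(score(x, yp) - m) for yp in vocab)
    return math.exp(score(x, y) - m) / Z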


Linear Scoring

- Score(x, y): How well does y go with x?
- Simplest option: a linear function of (x, y). But (x, y) isn't a number ⇒ describe it by some numbers (i.e., numeric features).
- Then just use a linear function of those numbers:

  Score(x, y) = Σ_k θ_k f_k(x, y)

  where k ranges over all features; f_k(x, y) is whether (x, y) has feature k (0 or 1), or how many times it fires (≥ 0), or how strongly it fires (a real number); and θ_k is the weight of the k-th feature, to be learned.

What features should we use?

- Model p(w_i | w_{i-1}, w_{i-2}). Features f_k((w_{i-1}, w_{i-2}), w_i) for Score((w_{i-1}, w_{i-2}), w_i) can be:
  - # of times "w_{i-1}" appears in the training corpus
  - 1 if "w_{i-2}" is an unseen word; 0 otherwise
  - 1 if the context "w_{i-2} w_{i-1}" = "a red"; 0 otherwise
  - 1 if "w_{i-1}" belongs to the "color" category; 0 otherwise

What features should we use?

- Model p("glasses" | "a red"). Features f_k("red", "a", "glasses") for Score("red", "a", "glasses"):
  - # of times "red" appears in the training corpus
  - 1 if "a" is an unseen word; 0 otherwise
  - 1 if "a red" = "a red"; 0 otherwise
  - 1 if "red" belongs to the "color" category; 0 otherwise
  (A feature-function sketch follows.)
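A hypothetical encoding of features like these in Python (the feature names, the `color_words` set, and the `train_counts`/`vocab` inputs are all made up for illustration):

color_words = {"red", "blue", "green", "yellow"}

def features(context, y, train_counts, vocab):
    w_prev2, w_prev1 = context                  # e.g., ("a", "red")
    return {
        "count(prev1)": train_counts[w_prev1],                        # how often "red" was seen
        "prev2_unseen": 1.0 if w_prev2 not in vocab else 0.0,
        "context_is_a_red": 1.0 if (w_prev2, w_prev1) == ("a", "red") else 0.0,
        "prev1_is_color": 1.0 if w_prev1 in color_words else 0.0,
    }

def linear_score(context, y, theta, train_counts, vocab):
    """Score(x, y) = sum_k theta_k * f_k(x, y)."""
    feats = features(context, y, train_counts, vocab)
    return sum(theta.get(k, 0.0) * v for k, v in feats.items())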


Log-Linear Conditional Probability

p(y | x) = exp(Score(x, y)) / Z(x) = (1/Z(x)) exp(Σ_k θ_k f_k(x, y))

The numerator exp(Score(x, y)) is an unnormalized probability (at least it's positive!). We choose Z(x) = Σ_{y'} exp(Score(x, y')) to ensure that Σ_y p(y | x) = 1; Z(x) is thus called the partition function.

Training θ

- n training examples
- feature functions f_1, f_2, ...
- Want to maximize p(training data | θ)
- Easier to maximize the log of that:

  maximize Σ_i log p(y_i | x_i; θ)

- Alas, some weights θ_k may be optimal at -∞ or +∞. When would this happen? What's going "wrong"?

This version is "discriminative training": to learn to predict y from x, maximize p(y | x). Whereas in "generative models", we learn to model x too, by maximizing p(x, y).

Generalization via Regularization

- n training examples
- feature functions f_1, f_2, ...
- Want to maximize p(training data | θ) ⋅ p_prior(θ)
- Easier to maximize the log of that:

  maximize Σ_i log p(y_i | x_i; θ) + log p_prior(θ)

- "L2 regularization": a Gaussian prior p(θ) ∝ e^(-||θ||² / 2σ²), which encourages weights close to 0.

Gradient-based training

- Gradually adjust θ in a direction that improves the objective.

Gradient ascent to gradually increase f(θ):

  while (∇f(θ) ≠ 0)       // not at a local max or min
      θ = θ + η ∇f(θ)     // for some small learning rate η > 0

Remember: ∇f(θ) = (∂f(θ)/∂θ_1, ∂f(θ)/∂θ_2, ...), so the update means θ_k += η ∂f(θ)/∂θ_k. (A runnable sketch for the log-linear model follows.)
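A sketch of gradient ascent on the regularized conditional log-likelihood of the log-linear model (illustrative, not the lecture's code; `feats(x, y)` returning a dict of feature values and the `vocab` set are assumptions):

import math
from collections import defaultdict

def train_loglinear(data, feats, vocab, iters=50, eta=0.1, sigma2=1.0):
    theta = defaultdict(float)
    for _ in range(iters):
        grad = defaultdict(float)
        for x, y in data:
            scores = {yp: sum(theta[k] * v for k, v in feats(x, yp).items()) for yp in vocab}
            m = max(scores.values())
            Z = sum(math.exp(s - m) for s in scores.values())
            # gradient of log p(y | x): observed features minus expected features
            for k, v in feats(x, y).items():
                grad[k] += v
            for yp, s in scores.items():
                p = math.exp(s - m) / Z
                for k, v in feats(x, yp).items():
                    grad[k] -= p * v
        for k in set(theta) | set(grad):
            g = grad[k] - theta[k] / sigma2     # add the gradient of the Gaussian prior (L2 term)
            theta[k] += eta * g                 # theta_k += eta * d f(theta) / d theta_k
    return theta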


Gradient-based training

- Gradually adjust θ in a direction that improves the objective.
- Gradient w.r.t. θ for the log-linear model (observed minus expected feature counts):

  ∂/∂θ_k Σ_i log p(y_i | x_i; θ) = Σ_i [ f_k(x_i, y_i) - Σ_y p(y | x_i; θ) f_k(x_i, y) ]

More complex assumption?

- P(y | x) = exp(score(x, y)) / Σ_{y'} exp(score(x, y')), where y is the next word and x the preceding words
- Assume we saw:

  red glasses; yellow glasses; green glasses; blue glasses
  red shoes; yellow shoes; green shoes;

  What is P("shoes" | "blue")?
- Can we learn categories of words (representations) automatically?
- Can we build a high-order n-gram model without blowing up the model size?

Neural language model

- Model P(y | x) with a neural network
- How do we represent the input words?
  - Example 1: a one-hot vector, where each component of the vector represents one word, e.g., [0, 0, 1, 0, 0]
  - Example 2: word embeddings

Neural language model

- Model P(y | x) with a neural network:
  - Project the input vectors with learned matrices
  - Concatenate the projected vectors
  - Apply a non-linear function, e.g., h = tanh(Wc + b)
  - Obtain P(y | x) by performing a softmax over the output
  (A toy forward pass follows.)
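A toy forward pass for a feed-forward neural LM in the spirit of the sketch above (illustrative; the dimensions, matrix names, and 2-word context are assumptions):

import numpy as np

V, d, h = 10000, 50, 100          # vocab size, embedding dim, hidden dim (assumed values)
rng = np.random.default_rng(0)
E  = rng.normal(scale=0.1, size=(V, d))        # word embeddings (projection matrix)
W1 = rng.normal(scale=0.1, size=(h, 2 * d))    # hidden layer for a 2-word context
b1 = np.zeros(h)
W2 = rng.normal(scale=0.1, size=(V, h))        # output layer
b2 = np.zeros(V)

def p_next_word(context_ids):
    c = np.concatenate([E[i] for i in context_ids])   # project and concatenate the context words
    hidden = np.tanh(W1 @ c + b1)                     # non-linearity: h = tanh(W c + b)
    logits = W2 @ hidden + b2
    logits -= logits.max()                            # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()                        # softmax over the vocabulary

# p_next_word([3, 17]) returns a length-V distribution P(y | x) over the next word.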


Why?

- Potentially generalizes to unseen contexts
- Example: P("red" | "the", "shoes", "are")
  - This does not occur in the training corpus, but ["the", "glasses", "are", "red"] does.
  - If the word representations of "red" and "blue" are similar, then the model can generalize.
- Why are "red" and "blue" similar?
  - Because the NN saw "red skirt", "blue skirt", "red pen", "blue pen", etc.

Training neural language models

- Can use gradient ascent as well
- Use the chain rule to derive the gradient (a.k.a. backpropagation)
- More complex NN architectures can be used, e.g., LSTMs or character-based models

Language model evaluation

- How do we compare models?
  - We need an unseen test set. Why?
- Information theory: the study of the resolution of uncertainty
- Perplexity: measures how well a probability distribution predicts a sample

Cross-Entropy

- A common measure of model quality
  - Task-independent
  - Continuous: slight improvements show up here even if they don't change the # of right answers on a task
- Just measure the probability of (enough) test data
  - Higher probability means the model better predicts the future
  - There's a limit to how well you can predict random stuff
  - The limit depends on "how random" the dataset is (easier to predict weather than headlines, especially in Arizona)


Cross-Entropy (“xent”)

- Want the probability of the test data to be high:

  p(h | <S>, <S>) * p(o | <S>, h) * p(r | h, o) * p(s | o, r) ... = 1/8 * 1/8 * 1/8 * 1/16 ...

- High prob → low xent, via 3 cosmetic improvements:
  - Take the logarithm (base 2) to prevent underflow:
    log(1/8 * 1/8 * 1/8 * 1/16 ...) = log 1/8 + log 1/8 + log 1/8 + log 1/16 ... = (-3) + (-3) + (-3) + (-4) + ...
  - Negate to get a positive value in bits: 3 + 3 + 3 + 4 + ...
  - Divide by the length of the text → 3.25 bits per letter (or per word)

Average? This is the geometric average of 1/2^3, 1/2^3, 1/2^3, 1/2^4, i.e., 1/2^3.25 ≈ 1/9.5. (A quick numeric check follows.)
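A quick check of the arithmetic above, using the per-word probabilities 1/8, 1/8, 1/8, 1/16 from the slide:

import math

probs = [1/8, 1/8, 1/8, 1/16]
xent = -sum(math.log2(p) for p in probs) / len(probs)   # (3 + 3 + 3 + 4) / 4 = 3.25 bits
perplexity = 2 ** xent                                   # 2^3.25, roughly 9.5
print(xent, perplexity)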


Cross-Entropy (“xent”)

- Want the probability of the test data to be high:

  p(h | <S>, <S>) * p(o | <S>, h) * p(r | h, o) * p(s | o, r) ... = 1/8 * 1/8 * 1/8 * 1/16 ...

- Cross-entropy → 3.25 bits per letter (or per word)
- Want this to be small (equivalent to wanting good compression!)
- The lower limit is called entropy: obtained in principle as the cross-entropy of the true model measured on an infinite amount of data
- Perplexity = 2^xent (meaning ≈ 9.5 choices)

More math: Entropy H(X)

- The entropy H(p) of a discrete random variable X is the expected negative log probability:

  H(p) = -Σ_x p(x) log_2 p(x)

- Entropy is a measure of uncertainty

Entropy of coin tossing

- Toss a coin: P(H) = p, P(T) = 1 - p
- H(p) = -p log_2 p - (1 - p) log_2(1 - p)
  - p = 0.5: H(p) = 1
  - p = 1: H(p) = 0

How many bits to encode messages

- Consider four letters (A, B, C, D):
- If p = (½, ½, 0, 0), how many bits per letter on average to encode a message ~ p?
  - Encode A as 0, B as 1; AAABBBAA ⇒ 00011100
- If p = (¼, ¼, ¼, ¼):
  - A: 00, B: 01, C: 10, D: 11; ABDA ⇒ 00011100
- How about p = (½, ¼, ¼, 0)?
  - A: 0, B: 10, C: 11; AAACBA ⇒ 00011100

More math: Cross Entropy

- Cross-entropy: the average # of bits needed to encode events ~ p(x) using a coding scheme m(x):

  H(p, m) = -Σ_x p(x) log_2 m(x)

- Not symmetric: H(p, m) ≠ H(m, p)
- Lower bounded by H(p)
- Example: let p = (½, ¼, ¼, 0) and encode with A: 00, B: 01, C: 10, D: 11 (i.e., m = (¼, ¼, ¼, ¼))
  - AAACBA? ⇒ 000000100100, i.e., 2 bits per letter rather than H(p) = 1.5. (A numeric check follows.)
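A quick numeric check of the example above:

import math

def cross_entropy(p, m):
    """H(p, m) = -sum_x p(x) * log2 m(x)."""
    return -sum(pi * math.log2(mi) for pi, mi in zip(p, m) if pi > 0)

p = [0.5, 0.25, 0.25, 0.0]       # true distribution over A, B, C, D
m = [0.25, 0.25, 0.25, 0.25]     # uniform 2-bit code
print(cross_entropy(p, p))       # H(p)    = 1.5 bits per letter
print(cross_entropy(p, m))       # H(p, m) = 2.0 bits per letter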

Perplexity and geometric mean

- Perplexity is the inverse probability of the test corpus, normalized by its length: equivalently, the geometric mean of 1/m(w_i | history) over the test words (see the formula on the next slide).

Perplexity
(Slide from CS498JH: Introduction to NLP.)

Language model m is better than m' if it assigns lower perplexity (i.e., lower cross-entropy, and higher probability) to the test corpus w_1 ... w_N:

  Perplexity(w_1 ... w_N) = 2^H(w_1 ... w_N)
                          = 2^(-(1/N) log_2 m(w_1 ... w_N))
                          = m(w_1 ... w_N)^(-1/N)
                          = the N-th root of 1 / m(w_1 ... w_N)

An experiment

- Train: 38M words of WSJ text, |V| = 20k
- Test: 1.5M words of WSJ text
- Word-level LSTM: perplexity ≈ 85
- Character-level model: ≈ 79

An experiment
(Slide from CS498JH: Introduction to NLP.)

- Models: unigram, bigram, and trigram models (with Good-Turing smoothing)
- Training data: 38M words of WSJ text (vocabulary: 20K types)
- Test data: 1.5M words of WSJ text
- Results:

  Model       Unigram   Bigram   Trigram
  Perplexity  962       170      109
