CS 595-052 Machine Learning and Statistical
Natural Language Processing
Prof. Shlomo Argamon, [email protected]
Office: 237C    Office Hours: Mon 3-4 PM
Book: Foundations of Statistical Natural Language Processing, C. D. Manning and H. Schütze
Requirements:
– Several programming projects
– Research proposal
Machine Learning

[Diagram: Training Examples → Learning Algorithm → Learned Model;
Test Examples + Learned Model → Classification/Labeling Results]
Modeling
• Decide how to represent learned models:
– Decision rules
– Linear functions
– Markov models
– …
• Type chosen affects generalization accuracy (on new data)
Generalization
Example Representation

• Set of features:
– Continuous
– Discrete (ordered and unordered)
– Binary
– Sets vs. Sequences
• Classes:
– Continuous vs. discrete
– Binary vs. multivalued
– Disjoint vs. overlapping
Learning Algorithms
• Find a “good” hypothesis “consistent” with the training data
– Many hypotheses may be consistent, so we may need a “preference bias”
– No hypothesis may be consistent, so we may need to settle for a “nearly” consistent one
• May rule out some hypotheses to start with:
– Feature reduction
Estimating Generalization Accuracy
• Accuracy on the training set says nothing about new examples!
• Must train and test on different example sets
• Estimate generalization accuracy over multiple train/test divisions
• Sources of estimation error:
– Bias: systematic error in the estimate
– Variance: how much the estimate changes between different runs
Cross-validation
1. Divide the training data into k sets
2. Repeat for each set i:
   a. Train on the remaining k-1 sets
   b. Test on the held-out set i
3. Average the k accuracies (and compute statistics)
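The steps above can be sketched in Python. The `train_and_eval` callback and the majority-class "learner" below are illustrative assumptions, not part of the slides:

```python
import random

def k_fold_cv(examples, k, train_and_eval):
    """Estimate generalization accuracy by k-fold cross-validation.
    train_and_eval(train, test) must return an accuracy in [0, 1]."""
    data = list(examples)
    random.shuffle(data)
    folds = [data[i::k] for i in range(k)]  # k roughly equal-sized sets
    accuracies = []
    for i in range(k):
        test = folds[i]                     # hold out fold i
        train = [x for j, f in enumerate(folds) if j != i for x in f]
        accuracies.append(train_and_eval(train, test))
    mean = sum(accuracies) / k
    var = sum((a - mean) ** 2 for a in accuracies) / k
    return mean, var

# Usage with a trivial majority-class "learner" (illustrative only):
random.seed(0)
data = [(i, i % 2) for i in range(20)]
def majority_acc(train, test):
    labels = [y for _, y in train]
    guess = max(set(labels), key=labels.count)
    return sum(1 for _, y in test if y == guess) / len(test)
mean_acc, var_acc = k_fold_cv(data, 5, majority_acc)
```

The returned variance is the "statistics" step: it indicates how stable the accuracy estimate is across the k train/test divisions.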
Bootstrapping
For a corpus of n examples:
1. Choose n examples randomly (with replacement)
   Note: we expect ~0.632n distinct examples
2. Train model, and evaluate:
   • acc0 = accuracy of model on the non-chosen examples
   • accS = accuracy of model on the n training examples
3. Estimate accuracy as 0.632·acc0 + 0.368·accS
4. Average accuracies over b different runs

Also note: there are other similar bootstrapping techniques
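A minimal sketch of the 0.632 bootstrap, again assuming a `train_and_eval` callback (an illustration, not a prescribed interface):

```python
import random

def bootstrap_632(examples, b, train_and_eval):
    """0.632 bootstrap: average 0.632*acc0 + 0.368*accS over b resamples.
    train_and_eval(train, test) returns the accuracy of a model trained
    on `train` and evaluated on `test`."""
    n = len(examples)
    estimates = []
    for _ in range(b):
        idx = [random.randrange(n) for _ in range(n)]  # sample with replacement
        chosen = set(idx)
        train = [examples[i] for i in idx]
        unseen = [examples[i] for i in range(n) if i not in chosen]
        if not unseen:                       # degenerate resample; skip it
            continue
        acc0 = train_and_eval(train, unseen)  # accuracy on non-chosen examples
        accS = train_and_eval(train, train)   # accuracy on the n training examples
        estimates.append(0.632 * acc0 + 0.368 * accS)
    return sum(estimates) / len(estimates)

random.seed(1)
data = [(i, i % 2) for i in range(30)]
def majority_acc(train, test):
    labels = [y for _, y in train]
    guess = max(set(labels), key=labels.count)
    return sum(1 for _, y in test if y == guess) / len(test)
est = bootstrap_632(data, 10, majority_acc)
```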
Bootstrapping vs. Cross-validation
• Cross-validation:
– Equal participation of all examples
– Class distribution in the test sets depends on the distribution in training
– Stratified cross-validation: equalize class distributions
• Bootstrap:
– Often has higher bias (fewer distinct examples)
– Best for small datasets
Natural Language Processing
• Extract useful information from natural language texts (articles, books, web pages, queries, etc.)
• Traditional method: Handcrafted lexicons, grammars, parsers
• Statistical approach: Learn how to process language from a corpus of real usage
Some Statistical NLP Tasks
1. Part of speech tagging - How to distinguish between book the noun, and book the verb.
2. Shallow parsing – Pick out phrases of different types from a text, such as the purple people eater or would have been going
3. Word sense disambiguation - How to distinguish between river bank and bank as a financial institution.
4. Alignment – Find the correspondence between words, sentences and paragraphs of a source text and its translation.
A Paradigmatic Task
• Language Modeling: predict the next word of a text (probabilistically):

P(wn | w1w2…wn-1) = m(wn | w1w2…wn-1)

• To do this perfectly, we must capture true notions of grammaticality
• So:

Better approximation of the probability of “the next word”
⇕
Better language model
Measuring “Surprise”

• The lower the probability of the actual word, the more the model is “surprised”:

H(wn | w1…wn-1) = −log2 m(wn | w1…wn-1)

(the conditional entropy of wn given w1,n-1)

• Cross-entropy: suppose the actual distribution of the language is p(wn | w1,n-1); then our model is, on average, surprised by:

Ep[H(wn | w1,n-1)] = Σw p(wn=w | w1,n-1) H(wn=w | w1,n-1)
                   = Ep[−log2 m(wn | w1,n-1)]
Estimating the Cross-Entropy
How can we estimate Ep[H(wn|w1,n-1)] when we don’t (by definition) know p?
Assume:
• Stationarity: The language doesn’t change
• Ergodicity: The language never gets “stuck”
Then:
Ep[H(wn | w1,n-1)] = limn→∞ (1/n) Σk=1..n H(wk | w1,k-1)
Perplexity
Commonly used measure of “model fit”:
perplexity(w1,n, m) = 2^H(w1,n, m) = m(w1,n)^(−1/n)
How many “choices” for next word on average?
• Lower perplexity = better model
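The formula above can be computed in log space to avoid underflow. A minimal sketch, assuming `model(w)` returns the per-word probability (a unigram model, for simplicity; the slides allow any m):

```python
import math

def perplexity(words, model):
    """perplexity = 2**H = m(w_1..n)**(-1/n), computed in log space."""
    log_prob = sum(math.log2(model(w)) for w in words)
    return 2 ** (-log_prob / len(words))

# Sanity check: a uniform model over 10 word types is "choosing among
# 10 words" at every step, so its perplexity is exactly 10.
uniform = lambda w: 1 / 10
pp = perplexity(["a", "b", "a", "c", "a"], uniform)
```

This matches the "average number of choices" reading of perplexity on the slide.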
N-gram Models
• Assume a “limited horizon”:

P(wk | w1w2…wk-1) = P(wk | wk-n+1…wk-1)

– Each word depends only on the previous n-1 words
• Specific cases:
– Unigram model: P(wk) – words independent
– Bigram model: P(wk | wk-1)
• Learning task: estimate these probabilities from a given corpus
Using Bigrams
• Compute probability of a sentence:
W = The cat sat on the mat
P(W) = P(The|START) P(cat|The) P(sat|cat) P(on|sat) P(the|on) P(mat|the) P(END|mat)
• Generate a random text and examine for “reasonableness”
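A small sketch of the computation, with MLE bigram counts estimated from a toy two-sentence corpus (the corpus and the `<s>`/`</s>` marker names are assumptions for illustration):

```python
from collections import Counter

corpus = [["the", "cat", "sat"], ["the", "cat", "ran"]]  # toy corpus

bigrams, unigrams = Counter(), Counter()
for sent in corpus:
    padded = ["<s>"] + sent + ["</s>"]        # START/END markers as on the slide
    for w1, w2 in zip(padded, padded[1:]):
        bigrams[(w1, w2)] += 1
        unigrams[w1] += 1

def p_bigram(w2, w1):
    """MLE estimate P(w2 | w1) = C(w1 w2) / C(w1)."""
    return bigrams[(w1, w2)] / unigrams[w1]

def sentence_prob(sent):
    """Chain-rule product of bigram probabilities, as in the slide's example."""
    padded = ["<s>"] + sent + ["</s>"]
    prob = 1.0
    for w1, w2 in zip(padded, padded[1:]):
        prob *= p_bigram(w2, w1)
    return prob

p = sentence_prob(["the", "cat", "sat"])
```

Here "cat" is followed by "sat" in one of its two occurrences, so the sentence gets probability 1 · 1 · 1/2 · 1 = 0.5.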
Maximum Likelihood Estimation
• PMLE(w1…wn) = C(w1…wn) / N
• PMLE(wn | w1…wn-1) = C(w1…wn) / C(w1…wn-1)
Problem: Data Sparseness!!
• For the vast majority of possible n-grams, we get 0 probability, even in a very large corpus
• The larger the context, the greater the problem
• But there are always new cases not seen before!
Smoothing
• Idea: Take some probability away from seen events and assign it to unseen events
• Simple method (Laplace): give every event an a priori count of 1

PLap(X) = (C(X) + 1) / (N + B)

where X is any entity, N is the number of observed tokens, and B is the number of entity types
• Problem: assigns too much probability to new events
– The more event types there are, the worse this becomes
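A one-line sketch of the Laplace estimator; the toy counts and the choice B = 4 (i.e., two unseen types) are illustrative assumptions:

```python
def p_laplace(x, counts, N, B):
    """P_Lap(X) = (C(X)+1) / (N+B): every event type gets a prior count of 1.
    counts maps seen events to frequencies; B counts *all* event types,
    seen and unseen."""
    return (counts.get(x, 0) + 1) / (N + B)

counts = {"a": 2, "b": 1}   # N = 3 observed tokens
N, B = 3, 4                 # 4 event types total, so 2 are unseen
probs = [p_laplace(x, counts, N, B) for x in ["a", "b", "c", "d"]]
```

Note that the two unseen types together already receive 2/7 of the probability mass, which illustrates the slide's complaint.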
Interpolation
Lidstone:

PLid(X) = (C(X) + d) / (N + dB)    [d < 1]

Johnson showed this is an interpolation between MLE and the uniform distribution:

PLid(X) = μ PMLE(X) + (1 − μ)(1/B)

where μ = N/(N + dB)
• How to choose d?
• Doesn’t match low-frequency events well
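Both forms of the Lidstone estimator can be checked against each other in a few lines (toy counts are assumptions):

```python
def p_lidstone(x, counts, N, B, d=0.5):
    """P_Lid(X) = (C(X)+d) / (N+dB); d = 1 recovers Laplace."""
    return (counts.get(x, 0) + d) / (N + d * B)

def p_interpolated(x, counts, N, B, d=0.5):
    """Johnson's form: mu*P_MLE(X) + (1-mu)/B with mu = N/(N+dB)."""
    mu = N / (N + d * B)
    return mu * counts.get(x, 0) / N + (1 - mu) / B

counts, N, B = {"a": 2, "b": 1}, 3, 4
lid = p_lidstone("a", counts, N, B)
intp = p_interpolated("a", counts, N, B)
```

The two functions agree for every event, which is exactly Johnson's observation.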
Held-out Estimation
Idea: estimate the frequency of unseen events from separate “held-out” data

• Divide data into “training” and “held-out” subsets:
C1(X) = frequency of X in the training data
C2(X) = frequency of X in the held-out data
Nr = number of types X with C1(X) = r
Tr = Σ{X: C1(X)=r} C2(X)

Pho(X) = Tr / (Nr · N)    where r = C1(X) and N = number of held-out tokens
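The definitions above translate almost directly into code; the tiny token lists below stand in for real corpora (an assumption for illustration):

```python
from collections import Counter

def held_out_estimator(train, heldout):
    """P_ho(X) = T_r / (N_r * N), where r = C1(X), N_r = #types with C1 = r,
    T_r = total held-out count of those types, N = #held-out tokens."""
    c1, c2 = Counter(train), Counter(heldout)
    types = set(train) | set(heldout)
    N = len(heldout)
    Nr, Tr = Counter(), Counter()
    for x in types:
        Nr[c1[x]] += 1        # how many types have training count r
        Tr[c1[x]] += c2[x]    # held-out mass of those types
    def p(x):
        r = c1[x]
        return Tr[r] / (Nr[r] * N)
    return p

p = held_out_estimator(["a", "a", "b"], ["a", "b", "b", "c"])
total = p("a") + p("b") + p("c")
```

Every type with the same training count r gets the same probability: the held-out data decides how much mass count-r types deserve on average.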
Deleted Estimation

Generalize to use all the data:
• Divide the data into 2 subsets a ∈ {0, 1}:
N_r^a = number of entity types X s.t. C_a(X) = r
T_r^ab = Σ{X: C_a(X)=r} C_b(X)

P_del(X) = (T_r^01 + T_r^10) / (N (N_r^0 + N_r^1))    [C(X) = r]
• Needs a large data set
• Overestimates unseen data, underestimates infrequent data
Good-Turing

For observed items, discount the item count:

r* = (r + 1) E[Nr+1] / E[Nr]

• The idea: the chance of seeing the item one more time is about E[Nr+1] / E[Nr]

For unobserved items, the total probability mass is E[N1] / N
– So, if we assume a uniform distribution over the N0 unknown items, we have:

P(X) = E[N1] / (N0 · N)
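A sketch of the count adjustment, using the observed Nr as the estimate of E[Nr] (the fallback for empty N_{r+1} is one common choice, an assumption here):

```python
from collections import Counter

def good_turing_adjusted(counts):
    """r* = (r+1) * N_{r+1} / N_r, with observed Nr standing in for E[Nr].
    Types whose N_{r+1} = 0 keep their raw count (one common fallback)."""
    Nr = Counter(counts.values())     # Nr[r] = number of types seen r times
    adjusted = {}
    for x, r in counts.items():
        if Nr[r + 1] > 0:
            adjusted[x] = (r + 1) * Nr[r + 1] / Nr[r]
        else:
            adjusted[x] = float(r)    # no higher-count mass to estimate from
    return adjusted

counts = {"a": 1, "b": 1, "c": 1, "d": 2}   # N1 = 3, N2 = 1
adj = good_turing_adjusted(counts)
```

Here singletons are discounted from 1 to 2·(1/3) ≈ 0.67, freeing mass for unseen items; "d" hits the zero-N_{r+1} problem discussed on the next slide.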
Good-Turing Issues
• Has problems with high-frequency items (consider rmax: r*max = E[Nrmax+1] / E[Nrmax] = 0)
• Usual answers:
– Use only for low-frequency items (r < k)
– Smooth E[Nr] by a function S(r)
• How to divide probability among unseen items?
– Uniform distribution
– Estimate which seem more likely than others…
Back-off Models
• If the high-order n-gram has insufficient data, use a lower-order n-gram:

Pbo(wi | wi-n+1,i-1) =
  { (1 − d(wi-n+1,i-1)) P(wi | wi-n+1,i-1)    if enough data
  { α(wi-n+1,i-1) Pbo(wi | wi-n+2,i-1)        otherwise

• Note the recursive formulation
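A deliberately simplified bigram-to-unigram sketch: a real Katz back-off derives the discount d and weight α from Good-Turing counts so that the distribution normalizes, whereas here they are fixed constants purely for illustration:

```python
from collections import Counter

class SimpleBackoff:
    """Simplified back-off: discounted bigram if observed, else a scaled
    unigram. d and alpha are fixed constants (an assumption); Katz back-off
    would compute them so probabilities sum to one."""
    def __init__(self, corpus, d=0.1, alpha=0.4):
        self.bi, self.uni = Counter(), Counter()
        for sent in corpus:
            for w1, w2 in zip(sent, sent[1:]):
                self.bi[(w1, w2)] += 1
            self.uni.update(sent)
        self.N = sum(self.uni.values())
        self.d, self.alpha = d, alpha

    def p(self, w2, w1):
        if self.bi[(w1, w2)] > 0:                  # enough data: discounted bigram
            return (1 - self.d) * self.bi[(w1, w2)] / self.uni[w1]
        return self.alpha * self.uni[w2] / self.N  # otherwise: back off to unigram

lm = SimpleBackoff([["a", "b"], ["a", "c"]])
seen = lm.p("b", "a")      # bigram "a b" was observed
unseen = lm.p("b", "c")    # bigram "c b" was not: back off
```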
Linear Interpolation
More generally, we can interpolate:
Pint(wi | h) = Σk λk(h) Pk(wi | h)

• Interpolation between different orders
• Usually set weights by iterative training (gradient descent, or the EM algorithm)
• Partition histories h into equivalence classes
• Need to be responsive to the amount of data!
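A minimal sketch of interpolating a bigram with a unigram model. The weights are fixed here for illustration; as the slide notes, λk(h) is normally trained (e.g. by EM on held-out data) and may depend on the history's equivalence class:

```python
from collections import Counter

def make_interpolated(corpus, l1=0.7, l2=0.3):
    """P_int(w|h) = l1 * P_bigram(w|h) + l2 * P_unigram(w).
    Fixed weights l1 + l2 = 1 (an assumption; normally trained)."""
    bi, uni = Counter(), Counter()
    for sent in corpus:
        for w1, w2 in zip(sent, sent[1:]):
            bi[(w1, w2)] += 1
        uni.update(sent)
    N = sum(uni.values())
    def p(w2, w1):
        p_bi = bi[(w1, w2)] / uni[w1] if uni[w1] else 0.0
        return l1 * p_bi + l2 * uni[w2] / N
    return p

p = make_interpolated([["a", "b"], ["a", "b"]])
val = p("b", "a")    # 0.7 * P_bigram + 0.3 * P_unigram
```

Because the weights sum to one and each component distribution normalizes, the interpolated model normalizes as well.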