CS 595-052 Machine Learning and Statistical
Natural Language Processing
Prof. Shlomo Argamon, [email protected]
Office: 237C    Office Hours: Mon 3-4 PM
Book: Foundations of Statistical Natural Language Processing, C. D. Manning and H. Schütze
Requirements:
– Several programming projects
– Research proposal
Machine Learning

[Diagram: Training Examples → Learning Algorithm → Learned Model;
Test Examples + Learned Model → Classification/Labeling Results]
Modeling
• Decide how to represent learned models:
– Decision rules
– Linear functions
– Markov models
– …
• Type chosen affects generalization accuracy (on new data)
Generalization
Example Representation

• Set of features:
– Continuous
– Discrete (ordered and unordered)
– Binary
– Sets vs. Sequences
• Classes:
– Continuous vs. discrete
– Binary vs. multivalued
– Disjoint vs. overlapping
Learning Algorithms
• Find a “good” hypothesis “consistent” with the training data
– Many hypotheses may be consistent, so we may need a “preference bias”
– No hypothesis may be consistent, so we may need to settle for a “nearly” consistent one
• May rule out some hypotheses to start with:
– Feature reduction
Estimating Generalization Accuracy
• Accuracy on the training set says nothing about new examples!
• Must train and test on different example sets
• Estimate generalization accuracy over multiple train/test divisions
• Sources of estimation error:
– Bias: systematic error in the estimate
– Variance: how much the estimate changes between different runs
Cross-validation
1. Divide the training data into k sets
2. Repeat for each set i:
   a. Train on the remaining k-1 sets
   b. Test on the held-out set i
3. Average the k accuracies (and compute statistics)
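The steps above can be sketched in Python. The `train_and_eval` callback and the majority-class "learner" below are illustrative assumptions, not part of the slides:

```python
import random

def k_fold_cv(examples, k, train_and_eval):
    """Estimate generalization accuracy by k-fold cross-validation.
    train_and_eval(train, test) must return an accuracy in [0, 1]."""
    data = list(examples)
    random.shuffle(data)
    folds = [data[i::k] for i in range(k)]  # k roughly equal-sized sets
    accuracies = []
    for i in range(k):
        test = folds[i]                     # hold out fold i
        train = [x for j, f in enumerate(folds) if j != i for x in f]
        accuracies.append(train_and_eval(train, test))
    mean = sum(accuracies) / k
    var = sum((a - mean) ** 2 for a in accuracies) / k
    return mean, var

# Usage with a trivial majority-class "learner" (illustrative only):
random.seed(0)
data = [(i, i % 2) for i in range(20)]
def majority_acc(train, test):
    labels = [y for _, y in train]
    guess = max(set(labels), key=labels.count)
    return sum(1 for _, y in test if y == guess) / len(test)
mean_acc, var_acc = k_fold_cv(data, 5, majority_acc)
```

The returned variance is the "statistics" step: it indicates how stable the accuracy estimate is across the k train/test divisions.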
Bootstrapping
For a corpus of n examples:
1. Choose n examples randomly (with replacement)
   Note: we expect ~0.632n distinct examples
2. Train model, and evaluate:
   • acc0 = accuracy of model on the non-chosen examples
   • accS = accuracy of model on the n training examples
3. Estimate accuracy as 0.632·acc0 + 0.368·accS
4. Average accuracies over b different runs

Also note: there are other similar bootstrapping techniques
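A minimal sketch of the 0.632 bootstrap, again assuming a `train_and_eval` callback (an illustration, not a prescribed interface):

```python
import random

def bootstrap_632(examples, b, train_and_eval):
    """0.632 bootstrap: average 0.632*acc0 + 0.368*accS over b resamples.
    train_and_eval(train, test) returns the accuracy of a model trained
    on `train` and evaluated on `test`."""
    n = len(examples)
    estimates = []
    for _ in range(b):
        idx = [random.randrange(n) for _ in range(n)]  # sample with replacement
        chosen = set(idx)
        train = [examples[i] for i in idx]
        unseen = [examples[i] for i in range(n) if i not in chosen]
        if not unseen:                       # degenerate resample; skip it
            continue
        acc0 = train_and_eval(train, unseen)  # accuracy on non-chosen examples
        accS = train_and_eval(train, train)   # accuracy on the n training examples
        estimates.append(0.632 * acc0 + 0.368 * accS)
    return sum(estimates) / len(estimates)

random.seed(1)
data = [(i, i % 2) for i in range(30)]
def majority_acc(train, test):
    labels = [y for _, y in train]
    guess = max(set(labels), key=labels.count)
    return sum(1 for _, y in test if y == guess) / len(test)
est = bootstrap_632(data, 10, majority_acc)
```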
Bootstrapping vs. Cross-validation
• Cross-validation:
– Equal participation of all examples
– Class distribution in the test sets depends on the distribution in training
– Stratified cross-validation: equalize class distributions
• Bootstrap:
– Often has higher bias (fewer distinct examples)
– Best for small datasets
Natural Language Processing
• Extract useful information from natural language texts (articles, books, web pages, queries, etc.)
• Traditional method: Handcrafted lexicons, grammars, parsers
• Statistical approach: Learn how to process language from a corpus of real usage
Some Statistical NLP Tasks
1. Part of speech tagging - How to distinguish between book the noun, and book the verb.
2. Shallow parsing – Pick out phrases of different types from a text, such as the purple people eater or would have been going
3. Word sense disambiguation - How to distinguish between river bank and bank as a financial institution.
4. Alignment – Find the correspondence between words, sentences and paragraphs of a source text and its translation.
A Paradigmatic Task
• Language Modeling: predict the next word of a text (probabilistically):

P(wn | w1w2…wn-1) = m(wn | w1w2…wn-1)

• To do this perfectly, we must capture true notions of grammaticality
• So:

Better approximation of the probability of “the next word”
⇕
Better language model
Measuring “Surprise”

• The lower the probability of the actual word, the more the model is “surprised”:

H(wn | w1…wn-1) = −log2 m(wn | w1…wn-1)

(the conditional entropy of wn given w1,n-1)

• Cross-entropy: suppose the actual distribution of the language is p(wn | w1,n-1); then our model is, on average, surprised by:

Ep[H(wn | w1,n-1)] = Σw p(wn=w | w1,n-1) H(wn=w | w1,n-1)
                   = Ep[−log2 m(wn | w1,n-1)]
Estimating the Cross-Entropy
How can we estimate Ep[H(wn|w1,n-1)] when we don’t (by definition) know p?
Assume:
• Stationarity: The language doesn’t change
• Ergodicity: The language never gets “stuck”
Then:
Ep[H(wn | w1,n-1)] = limn→∞ (1/n) Σk=1..n H(wk | w1,k-1)
Perplexity
Commonly used measure of “model fit”:
perplexity(w1,n, m) = 2^H(w1,n, m) = m(w1,n)^(−1/n)
How many “choices” for next word on average?
• Lower perplexity = better model
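The formula above can be computed in log space to avoid underflow. A minimal sketch, assuming `model(w)` returns the per-word probability (a unigram model, for simplicity; the slides allow any m):

```python
import math

def perplexity(words, model):
    """perplexity = 2**H = m(w_1..n)**(-1/n), computed in log space."""
    log_prob = sum(math.log2(model(w)) for w in words)
    return 2 ** (-log_prob / len(words))

# Sanity check: a uniform model over 10 word types is "choosing among
# 10 words" at every step, so its perplexity is exactly 10.
uniform = lambda w: 1 / 10
pp = perplexity(["a", "b", "a", "c", "a"], uniform)
```

This matches the "average number of choices" reading of perplexity on the slide.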
N-gram Models
• Assume a “limited horizon”:

P(wk | w1w2…wk-1) = P(wk | wk-n+1…wk-1)

– Each word depends only on the previous n-1 words
• Specific cases:
– Unigram model: P(wk) – words independent
– Bigram model: P(wk | wk-1)
• Learning task: estimate these probabilities from a given corpus
Using Bigrams
• Compute probability of a sentence:
W = The cat sat on the mat
P(W) = P(The|START) P(cat|The) P(sat|cat) P(on|sat) P(the|on) P(mat|the) P(END|mat)
• Generate a random text and examine for “reasonableness”
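A small sketch of the computation, with MLE bigram counts estimated from a toy two-sentence corpus (the corpus and the `<s>`/`</s>` marker names are assumptions for illustration):

```python
from collections import Counter

corpus = [["the", "cat", "sat"], ["the", "cat", "ran"]]  # toy corpus

bigrams, unigrams = Counter(), Counter()
for sent in corpus:
    padded = ["<s>"] + sent + ["</s>"]        # START/END markers as on the slide
    for w1, w2 in zip(padded, padded[1:]):
        bigrams[(w1, w2)] += 1
        unigrams[w1] += 1

def p_bigram(w2, w1):
    """MLE estimate P(w2 | w1) = C(w1 w2) / C(w1)."""
    return bigrams[(w1, w2)] / unigrams[w1]

def sentence_prob(sent):
    """Chain-rule product of bigram probabilities, as in the slide's example."""
    padded = ["<s>"] + sent + ["</s>"]
    prob = 1.0
    for w1, w2 in zip(padded, padded[1:]):
        prob *= p_bigram(w2, w1)
    return prob

p = sentence_prob(["the", "cat", "sat"])
```

Here "cat" is followed by "sat" in one of its two occurrences, so the sentence gets probability 1 · 1 · 1/2 · 1 = 0.5.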
Maximum Likelihood Estimation
• PMLE(w1…wn) = C(w1…wn) / N
• PMLE(wn | w1…wn-1) = C(w1…wn) / C(w1…wn-1)
Problem: Data Sparseness!!
• For the vast majority of possible n-grams, we get 0 probability, even in a very large corpus
• The larger the context, the greater the problem
• But there are always new cases not seen before!
Smoothing
• Idea: Take some probability away from seen events and assign it to unseen events
• Simple method (Laplace): give every event an a priori count of 1

PLap(X) = (C(X) + 1) / (N + B)

where X is any entity, N is the number of observed tokens, and B is the number of entity types
• Problem: assigns too much probability to new events
– The more event types there are, the worse this becomes
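A one-line sketch of the Laplace estimator; the toy counts and the choice B = 4 (i.e., two unseen types) are illustrative assumptions:

```python
def p_laplace(x, counts, N, B):
    """P_Lap(X) = (C(X)+1) / (N+B): every event type gets a prior count of 1.
    counts maps seen events to frequencies; B counts *all* event types,
    seen and unseen."""
    return (counts.get(x, 0) + 1) / (N + B)

counts = {"a": 2, "b": 1}   # N = 3 observed tokens
N, B = 3, 4                 # 4 event types total, so 2 are unseen
probs = [p_laplace(x, counts, N, B) for x in ["a", "b", "c", "d"]]
```

Note that the two unseen types together already receive 2/7 of the probability mass, which illustrates the slide's complaint.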
Interpolation
Lidstone:

PLid(X) = (C(X) + d) / (N + dB)    [d < 1]

Johnson showed this is an interpolation between MLE and the uniform distribution:

PLid(X) = μ PMLE(X) + (1 − μ)(1/B)

where μ = N/(N + dB)
• How to choose d?
• Doesn’t match low-frequency events well
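Both forms of the Lidstone estimator can be checked against each other in a few lines (toy counts are assumptions):

```python
def p_lidstone(x, counts, N, B, d=0.5):
    """P_Lid(X) = (C(X)+d) / (N+dB); d = 1 recovers Laplace."""
    return (counts.get(x, 0) + d) / (N + d * B)

def p_interpolated(x, counts, N, B, d=0.5):
    """Johnson's form: mu*P_MLE(X) + (1-mu)/B with mu = N/(N+dB)."""
    mu = N / (N + d * B)
    return mu * counts.get(x, 0) / N + (1 - mu) / B

counts, N, B = {"a": 2, "b": 1}, 3, 4
lid = p_lidstone("a", counts, N, B)
intp = p_interpolated("a", counts, N, B)
```

The two functions agree for every event, which is exactly Johnson's observation.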
Held-out Estimation
Idea: estimate the frequency of unseen events from separate “held-out” data

• Divide data into “training” and “held-out” subsets:
C1(X) = frequency of X in the training data
C2(X) = frequency of X in the held-out data
Nr = number of types X with C1(X) = r
Tr = Σ{X: C1(X)=r} C2(X)

Pho(X) = Tr / (Nr · N)    where r = C1(X) and N = number of held-out tokens
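The definitions above translate almost directly into code; the tiny token lists below stand in for real corpora (an assumption for illustration):

```python
from collections import Counter

def held_out_estimator(train, heldout):
    """P_ho(X) = T_r / (N_r * N), where r = C1(X), N_r = #types with C1 = r,
    T_r = total held-out count of those types, N = #held-out tokens."""
    c1, c2 = Counter(train), Counter(heldout)
    types = set(train) | set(heldout)
    N = len(heldout)
    Nr, Tr = Counter(), Counter()
    for x in types:
        Nr[c1[x]] += 1        # how many types have training count r
        Tr[c1[x]] += c2[x]    # held-out mass of those types
    def p(x):
        r = c1[x]
        return Tr[r] / (Nr[r] * N)
    return p

p = held_out_estimator(["a", "a", "b"], ["a", "b", "b", "c"])
total = p("a") + p("b") + p("c")
```

Every type with the same training count r gets the same probability: the held-out data decides how much mass count-r types deserve on average.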
Deleted Estimation

Generalize to use all the data:
• Divide the data into 2 subsets a ∈ {0, 1}:
N_r^a = number of entity types X s.t. C_a(X) = r
T_r^ab = Σ{X: C_a(X)=r} C_b(X)

P_del(X) = (T_r^01 + T_r^10) / (N (N_r^0 + N_r^1))    [C(X) = r]
• Needs a large data set
• Overestimates unseen data, underestimates infrequent data
Good-Turing

For observed items, discount the item count:

r* = (r + 1) E[Nr+1] / E[Nr]

• The idea: the chance of seeing the item one more time is about E[Nr+1] / E[Nr]

For unobserved items, the total probability mass is E[N1] / N
– So, if we assume a uniform distribution over the N0 unknown items, we have:

P(X) = E[N1] / (N0 · N)
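A sketch of the count adjustment, using the observed Nr as the estimate of E[Nr] (the fallback for empty N_{r+1} is one common choice, an assumption here):

```python
from collections import Counter

def good_turing_adjusted(counts):
    """r* = (r+1) * N_{r+1} / N_r, with observed Nr standing in for E[Nr].
    Types whose N_{r+1} = 0 keep their raw count (one common fallback)."""
    Nr = Counter(counts.values())     # Nr[r] = number of types seen r times
    adjusted = {}
    for x, r in counts.items():
        if Nr[r + 1] > 0:
            adjusted[x] = (r + 1) * Nr[r + 1] / Nr[r]
        else:
            adjusted[x] = float(r)    # no higher-count mass to estimate from
    return adjusted

counts = {"a": 1, "b": 1, "c": 1, "d": 2}   # N1 = 3, N2 = 1
adj = good_turing_adjusted(counts)
```

Here singletons are discounted from 1 to 2·(1/3) ≈ 0.67, freeing mass for unseen items; "d" hits the zero-N_{r+1} problem discussed on the next slide.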
Good-Turing Issues
• Has problems with high-frequency items (consider rmax: r*max = E[Nrmax+1] / E[Nrmax] = 0)
• Usual answers:
– Use only for low-frequency items (r < k)
– Smooth E[Nr] by a function S(r)
• How to divide probability among unseen items?
– Uniform distribution
– Estimate which seem more likely than others…
Back-off Models
• If the high-order n-gram has insufficient data, use a lower-order n-gram:

Pbo(wi | wi-n+1,i-1) =
  { (1 − d(wi-n+1,i-1)) P(wi | wi-n+1,i-1)    if enough data
  { α(wi-n+1,i-1) Pbo(wi | wi-n+2,i-1)        otherwise

• Note the recursive formulation
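A deliberately simplified bigram-to-unigram sketch: a real Katz back-off derives the discount d and weight α from Good-Turing counts so that the distribution normalizes, whereas here they are fixed constants purely for illustration:

```python
from collections import Counter

class SimpleBackoff:
    """Simplified back-off: discounted bigram if observed, else a scaled
    unigram. d and alpha are fixed constants (an assumption); Katz back-off
    would compute them so probabilities sum to one."""
    def __init__(self, corpus, d=0.1, alpha=0.4):
        self.bi, self.uni = Counter(), Counter()
        for sent in corpus:
            for w1, w2 in zip(sent, sent[1:]):
                self.bi[(w1, w2)] += 1
            self.uni.update(sent)
        self.N = sum(self.uni.values())
        self.d, self.alpha = d, alpha

    def p(self, w2, w1):
        if self.bi[(w1, w2)] > 0:                  # enough data: discounted bigram
            return (1 - self.d) * self.bi[(w1, w2)] / self.uni[w1]
        return self.alpha * self.uni[w2] / self.N  # otherwise: back off to unigram

lm = SimpleBackoff([["a", "b"], ["a", "c"]])
seen = lm.p("b", "a")      # bigram "a b" was observed
unseen = lm.p("b", "c")    # bigram "c b" was not: back off
```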
Linear Interpolation
More generally, we can interpolate:
Pint(wi | h) = Σk λk(h) Pk(wi | h)

• Interpolation between different orders
• Usually set weights by iterative training (gradient descent, or the EM algorithm)
• Partition histories h into equivalence classes
• Need to be responsive to the amount of data!
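A minimal sketch of interpolating a bigram with a unigram model. The weights are fixed here for illustration; as the slide notes, λk(h) is normally trained (e.g. by EM on held-out data) and may depend on the history's equivalence class:

```python
from collections import Counter

def make_interpolated(corpus, l1=0.7, l2=0.3):
    """P_int(w|h) = l1 * P_bigram(w|h) + l2 * P_unigram(w).
    Fixed weights l1 + l2 = 1 (an assumption; normally trained)."""
    bi, uni = Counter(), Counter()
    for sent in corpus:
        for w1, w2 in zip(sent, sent[1:]):
            bi[(w1, w2)] += 1
        uni.update(sent)
    N = sum(uni.values())
    def p(w2, w1):
        p_bi = bi[(w1, w2)] / uni[w1] if uni[w1] else 0.0
        return l1 * p_bi + l2 * uni[w2] / N
    return p

p = make_interpolated([["a", "b"], ["a", "b"]])
val = p("b", "a")    # 0.7 * P_bigram + 0.3 * P_unigram
```

Because the weights sum to one and each component distribution normalizes, the interpolated model normalizes as well.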