MaxEnt: Training, Smoothing, Tagging Advanced Statistical Methods in NLP Ling572 February 7, 2012 1

MaxEnt: Training, Smoothing, Tagging

Advanced Statistical Methods in NLPLing572

February 7, 2012

RoadmapMaxent:

Training

Smoothing

Case study: POS Tagging (redux)Beam search

Training

TrainingLearn λs from training data

Challenge: Usually can’t solve analyticallyEmploy numerical methods

TrainingLearn λs from training data

Challenge: Usually can’t solve analyticallyEmploy numerical methods

Main different techniques:Generalized Iterative Scaling (GIS, Darroch &Ratcliffe,

‘72)

Improved Iterative Scaling (IIS, Della Pietra et al, ‘95)

L-BFGS,…..

Generalized Iterative Scaling

GIS Setup:GIS required constraint: , where C is a constant

If not, then set

and add a correction feature function fk+1:

If not, then set

and add a correction feature function fk+1:

GIS also requires at least one active feature for any eventDefault feature functions solve this problem

GIS IterationCompute the empirical expectation

Initialization:λj(0) ; set to 0 or some value

Iterate until convergence for each j:

Iterate until convergence for each j:Compute p(y|x) under the current model

Compute model expectation under current model

Update model parameters by weighted ratio of empirical and model expectations

GIS IterationCompute

Iterate until convergence:Compute

Iterate until convergence:Compute p(n)(y|x)=

Compute

Update

ConvergenceMethods have convergence guarantees

However, full convergence may take very long time

However, full convergence may take very long timeFrequently use threshold

Calculating LL(p)LL = 0

For each sample x in the training dataLet y be the true label of xprob = p(y|x)LL += 1/N * prob

Running TimeFor each iteration the running time is:

Running TimeFor each iteration the running time is O(NPA),

where:

N: number of training instances

P: number of classes

A: Average number of active features for instance (x,y)

L-BFGSLimited-memory version of

Broyden–Fletcher–Goldfarb–Shanno (BFGS) method

Quasi-Newton method for unconstrained optimization

Good for optimization problems with many variables

“Algorithm of choice” for MaxEnt and related models

L-BFGSReferences:

Nocedal, J. (1980). "Updating Quasi-Newton Matrices with Limited Storage". Mathematics of Computation 35: 773–782

Liu, D. C.; Nocedal, J. (1989)"On the Limited Memory Method for Large Scale Optimization". Mathematical Programming B 45 (3): 503–528

L-BFGSReferences:

Nocedal, J. (1980). "Updating Quasi-Newton Matrices with Limited Storage". Mathematics of Computation 35: 773–782

Liu, D. C.; Nocedal, J. (1989)"On the Limited Memory Method for Large Scale Optimization". Mathematical Programming B 45 (3): 503–528

Implementations: Java, Matlab, Python via scipy, R, etcSee Wikipedia page

SmoothingBased on Klein & Manning, 2003; F. Xia

SmoothingProblems of scale:

Large numbers of featuresSome NLP problems in MaxEnt 1M features

Storage can be a problem

Sparseness problemsEase of overfitting

Optimization problemsFeatures can be near infinite, take long time to

converge

SmoothingConsider the coin flipping problem

Three empirical distributionsModels

From K&M ‘03

Need for Smoothing Two problems

From K&M ‘03

Optimization:Optimal value of λ?

∞Slow to optimize

From K&M ‘03

Optimization:Optimal value of λ?

∞Slow to optimize

No smoothingLearned distribution

just as spiky (K&M’03)

From K&M ‘03

Possible Solutions

Possible SolutionsEarly stopping

Feature selection

Regularization

Early StoppingPrior use of early stopping

Decision tree heuristics

Early StoppingPrior use of early stopping

Decision tree heuristics

Similarly hereStop training after a few iterations

λwill have increased

Guarantees bounded, finite training time

Feature SelectionApproaches:

Heuristic: Drop features based on fixed thresholdsi.e. number of occurrences

Wrapper methods:Add feature selection to training loop

Heuristic approaches: Simple, reduce features, but could harm

performance

RegularizationIn statistics and machine learning, regularization

is any method of preventing overfitting of data by a model.

From K&M ’03, F. Xia

Typical examples of regularization in statistical machine learning include ridge regression, lasso, and L2-normin support vector machines.

In this case, we change the objective function: log P(Y,λ|X) = log P(λ)+log P(Y|X,λ)

Prior Possible prior distributions: uniform, exponential

Gaussian prior:

Prior Possible prior distributions: uniform, exponential

Gaussian prior:

log P(Y,λ|X) = log P(λ)+log P(Y|X,λ)

Maximize P(Y|X,λ)

Maximize P(Y, λ|X)

In practice, μ=0; 2σ2=1

L1 and L2 Regularization

Smoothing: POS Example

Advantages of SmoothingSmooths distributions

Moves weight onto more informative features

Enables effective use of larger numbers of features

Can speed up convergence

Summary: TrainingMany training methods:

Generalized Iterative Scaling (GIS)

Smoothing:Early stopping, feature selection, regularization

Regularization:Change objective function – add priorCommon prior: Gaussian priorMaximizing posterior not equivalent to max ent

MaxEnt POS Tagging

Notation(Ratnaparkhi, 1996)

h: history xWord and tag history

t: tag y

POS Tagging ModelP(t1,…,tn|w1,…,wn)

where hi={wi,wi-1,wi-2,wi+1,wi+2,ti-1,ti-2}

MaxEnt Feature Set

Example

Feature for ‘about’

Exclude features seen < 10 times

TrainingGIS

Training time: O(NTA)N: training set sizeT: number of tagsA: average number of features active for event

24 hours on a ‘96 machine

Finding FeaturesIn training, where do features come from?

Where do features come from in testing?

w-1 w0 w-1w0 w+1 t-1 y

x1(Time)

<s> Time <s>Time flies BOS N

x2 (flies)

Time flies Time flies like N N

x3 (like)

flies like flies like an N V

Finding FeaturesIn training, where do features come from?

Where do features come from in testing?tag features come from classification of prior word

w-1 w0 w-1w0 w+1 t-1 y

x1(Time)

<s> Time <s>Time flies BOS N

x2 (flies)

Time flies Time flies like N N

x3 (like)

flies like flies like an N V

DecodingGoal: Identify highest probability tag sequence

Issues:Features include tags from previous words

Not immediately available

DecodingGoal: Identify highest probability tag sequence

Issues:Features include tags from previous words

Not immediately available

Uses tag historyJust knowing highest probability preceding tag

insufficient

Beam SearchIntuition:

Breadth-first search explores all pathsLots of paths are (pretty obviously) badWhy explore bad paths?Restrict to (apparently best) paths

Approach:Perform breadth-first search, but

Beam SearchIntuition:

Breadth-first search explores all pathsLots of paths are (pretty obviously) badWhy explore bad paths?Restrict to (apparently best) paths

Approach:Perform breadth-first search, butRetain only k ‘best’ paths thus fark: beam width

Beam Search, k=3 <s> time flies like an arrow

Beam Search, k=3

<s> time flies like an arrow

Beam Search, k=3

Beam SearchW={w1,w2,…,wn}: test sentence

sij: jth highest prob. sequence up to & inc. word wi

Generate tags for w1, keep top k, set s1j accordingly

for i=2 to n:

for i=2 to n:Extension: add tags for wi to each s(i-1)j

Beam selection: Sort sequences by probabilityKeep only top k sequences

Generate tags for w1, keep topN, set s1j accordingly

for i=2 to n: For each s(i-1)j

for wi form vector, keep topN tags for wi

Beam selection: Sort sequences by probabilityKeep only top sequences, using pruning on next slide

Return highest probability sequence sn1

Beam SearchPruning and storage:

W = beam widthFor each node, store:

Tag for wi

Probability of sequence so far, probi,j=

For each candidate j, si,j

Keep the node if probi,j in topK, and

probi,j is sufficiently high

e.g. lg(probi,j)+W>=lg(max_prob)

DecodingTag dictionary:

known word: returns tags seen with word in training

unknown word: returns all tags

Beam width = 5

Running time: O(NTAB)N,T,A as beforeB: beam width

POS TaggingOverall accuracy: 96.3+%

Unseen word accuracy: 86.2%

Comparable to HMM tagging accuracy or TBL

ProvidesProbabilistic frameworkBetter able to model different info sources

Topline accuracy 96-97%Consistency issues

Beam SearchBeam search decoding:

Variant of breadth first searchAt each layer, keep only top k sequences

Advantages:

Advantages:Efficient in practice: beam 3-5 near optimal

Empirically, beam 5-10% of search space; prunes 90-95%

Simple to implementJust extensions + sorting, no dynamic programming

Running time:

Variant of breadth first searchAt each layer, keep only top sequences

Empirically, beam 5-10% of search space; prunes 90-95%Simple to implement

Just extensions + sorting, no dynamic programming

Disadvantage: Not guaranteed optimal (or complete)

MaxEnt POS TaggingPart of speech tagging by classification:

Feature designword and tag context featuresorthographic features for rare words

Sequence classification problems:Tag features depend on prior classification

Beam search decodingEfficient, but inexact

Near optimal in practice

MaxEnt: Training, Smoothing, Tagging Advanced Statistical Methods in NLP Ling572 February 7, 2012 1

Documents

Lecture 8: Nelson Aalen Estimator and Smoothing Kernel Smoothing Smoothing Splines

SIG Maxent

Introduction to Maxent - UITSweb2.uconn.edu/cyberinfra/module3/Downloads/Day 4 - Maxent.pdf · Introduction to Maxent Day 1 of 2 . Outline

SIG Distribución Maxent

Intro to Maxent

Robust Smoothing: Smoothing Parameter Selection and ... · Robust Smoothing: Smoothing Parameter Selection and Applications to Fluorescence Spectroscopy Jong Soo Lee Carnegie Mellon

MaxEnt: Statistical Explanation (Elithet. al. 2011)bioff.forest.ku.ac.th/PDF_FILE/SEP_2012/MAXENT_HAND.pdf• The MaxEnt fitted function is usually defined over many features, meaning

Maxent Models and Discriminative Estimation

Advanced topics in language modeling: MaxEnt, features …allauzen/asr15/guest_lecture_1.pdf · MaxEnt, features and marginal distritubion constraints ... Backoff Inspired Features

Smoothing/Priors/ Regularization for Maxent Models

Maxent interface. Maximum Entropy (Maxent) Deterministic Precise mathematical definition Continuous and categorical environmental data Continuous output

MaxEnt’14, The 34th International Workshop on Bayesian ...djafari.free.fr › MaxEnt2014 › slides › MaxEnt2014_opening_slides_FB.… · MaxEnt’14, The 34th International Workshop

Maxent Grammars for the Metrics of Shakespeare and Miltonlinguistics.ucla.edu/people/hayes/papers/HayesWilsonShiskoMaxent... · Maxent Grammars for the Metrics . ... our testing indicates

Conditional Markov Models: MaxEnt Tagging and MEMMs

C5. Modelización espacial con MiraMon y MaxEnt

Introduction to SDM with Maxent JohannesS Signer

Maxent Models (II)

Exponential Smoothing - Wharton Statistics Departmentstine/insr260_2009/lectures/expo_smth.pdf · Overview Smoothing Exponential smoothing Model behind exponential smoothing Forecasts

Naïve Bayes Advanced Statistical Methods in NLP Ling572 January 19, 2012 1

Maxent Tutorial Slides