Adwait Ratnaparkhi
Yahoo! Labs
Introduction to maximum entropy models
- 2 - Yahoo! Confidential
Introduction
• This talk is geared for Technical Yahoos…
– Who are unfamiliar with machine learning or maximum entropy
– Who are familiar with machine learning but not maximum entropy
– Who want to see an example of machine learning on the grid
– Who are looking for an introductory talk
• Lab session is geared for Technical Yahoos…
– Who brought their laptop
– Who want to obtain more numerical intuition about the optimization algorithms
– Who don’t mind coding a little Perl
– Who have some patience in case things go awry!
Outline
• What is maximum entropy modeling?
• Example applications of the maximum entropy framework
• Lab session
– Look at the mapper and reducer of a maximum entropy parameter estimation algorithm
What is a maximum entropy model?
• A framework for combining evidence into a probability model
• The probability can then be used for classification, or as input to the next component
– Content match: P(click | page, ad, user)
– Part-of-speech tagging: P(part-of-speech-tag | word)
– Categorization: P(category | page)
• The principle of maximum entropy is used as the optimization criterion
– Relationship to maximum likelihood optimization
– Same model form as multi-class logistic regression
Machine learning approach
• Data with labels
– Automatic: ad click
• sponsored search, content match, graphical ads
– Manual: editorial process
• page categories, part-of-speech tags
• Objective
– Training phase
• Find a (large) training set
• Use the training set to construct a classifier or probability model
– Test phase
• Find an (honest) test set
• Use the model to assign labels to previously unseen data
Why machine learning? Why maxent?
• Why machine learning?
– Pervasive ambiguity
• Any task with natural language features
– For these tasks, it is easier to collect and annotate data than to hand-code an expert system
– For certain web tasks, annotation is free (e.g., clicks)
• Why maxent?
– Little restriction on the kinds of evidence
• No independence assumption
– Works well with sparse features
– Works well in parallel settings, like Hadoop
– Appeal of the maxent interpretation
Sentence Boundary Detection
• Which “.” denotes a sentence boundary?
Mr. B. Green from X.Y.Z. Corp. said that the U.S. budget deficit hit $1.42 trillion for the year that ended Sept. 30. The previous year’s deficit was $459 billion.
• Model: P({yes|no}| candidate boundary)
Part-of-speech tagging
• What is the part of speech of flies?
– Fruit flies like a banana.
– Time flies like an arrow.
• Model: P(tag | current word, surrounding words, …)
– End goal is a sequence of tags
Content match ad clicks
P(click|page, ad, user)
Ambiguity resolution: an artificial example
• Tagging unknown, or “out-of-vocabulary”, words:
– Given a word, predict its POS tag based on spelling features
• Assume all words are either:
– Proper Noun (NNP)
– Verb Gerund (VBG)
• Find a model: P(tag | word) where tag is in {NNP, VBG}
• Assume a training set
– Editorially derived (word, tag) pairs
– Training data tags are {NNP, VBG}
– Reminder: artificial example!
Ambiguity resolution (cont’d)
• Ask the data
– q1: Is the first letter of the word capitalized?
– q2: Does the word end in ing?
• Evidence for (q1, NNP)
– Clinton, Bush, Reagan as NNP
• Evidence for (q2, VBG)
– trading, offering, operating as VBG
• Rules based on q1, q2 can conflict
– Boeing
• How to choose in ambiguous cases?
Features represent evidence
• a = what we are predicting
• b = what we observe
• Terminology
– Here, the input to a feature is the (a, b) pair
– But in much of the ML literature, the input to a feature is just (b)
f_{y,q}(a,b) = 1 if a = y AND q(b) = true; 0 otherwise

f_1(a,b) = 1 if a = NNP and q1(b) = true; 0 otherwise
f_2(a,b) = 1 if a = VBG and q2(b) = true; 0 otherwise
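The indicator features above translate directly into code. A minimal sketch (the helper names `q1`, `q2`, `make_feature` are illustrative, not from the talk's codebase):

```python
# Indicator features: each pairs an outcome y with a question q over the observation.

def q1(word):
    # q1: is the first letter of the word capitalized?
    return word[:1].isupper()

def q2(word):
    # q2: does the word end in "ing"?
    return word.endswith("ing")

def make_feature(y, q):
    # f_{y,q}(a, b) = 1 if a == y and q(b) is true, else 0
    return lambda a, b: 1 if a == y and q(b) else 0

f1 = make_feature("NNP", q1)
f2 = make_feature("VBG", q2)

# "Boeing" is capitalized AND ends in -ing, so both questions fire,
# but each feature is active only for its own outcome:
print(f1("NNP", "Boeing"), f2("NNP", "Boeing"))  # 1 0
print(f1("VBG", "Boeing"), f2("VBG", "Boeing"))  # 0 1
```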
Combine the features: probability model
• Probability is product of weights of active features
• Why this form?
p(a|b) = (1/Z(b)) ∏_{j=1..k} α_j^{f_j(a,b)}

Z(b) = ∑_{a'} ∏_{j=1..k} α_j^{f_j(a',b)}

f_j(a,b) ∈ {0,1} : features
α_j > 0 : parameters
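The model form above can be sketched in a few lines. The weights below are made up for illustration, not estimated from data:

```python
from math import prod  # Python 3.8+

def f1(a, b):  # active for NNP on capitalized words
    return 1 if a == "NNP" and b[:1].isupper() else 0

def f2(a, b):  # active for VBG on words ending in -ing
    return 1 if a == "VBG" and b.endswith("ing") else 0

FEATURES = [f1, f2]
ALPHAS = [3.0, 2.0]        # alpha_j > 0, hypothetical values
OUTCOMES = ["NNP", "VBG"]

def unnormalized(a, b):
    # product of weights of active features: prod_j alpha_j ^ f_j(a, b)
    return prod(alpha ** f(a, b) for f, alpha in zip(FEATURES, ALPHAS))

def p(a, b):
    # p(a|b) = unnormalized(a, b) / Z(b),  Z(b) = sum over a' of unnormalized(a', b)
    z = sum(unnormalized(y, b) for y in OUTCOMES)
    return unnormalized(a, b) / z

# For "Boeing" both questions fire, so p(NNP|Boeing) = alpha1 / (alpha1 + alpha2)
print(p("NNP", "Boeing"))  # 0.6
```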
Probabilities for ambiguity resolution
• How do we find optimal parameter values?

p(NNP | Boeing) = (1/Z) α_1^{f_1(a,b)} α_2^{f_2(a,b)} = (1/Z) α_1

p(VBG | Boeing) = (1/Z) α_1^{f_1(a,b)} α_2^{f_2(a,b)} = (1/Z) α_2
Maximum likelihood estimation
Q = { p | p(a|b) = (1/Z(b)) ∏_{j=1..k} α_j^{f_j(a,b)} }

r(a,b) = { normalized frequency of (a,b) in the training set }

L(p) = ∑_{a,b} r(a,b) log p(a|b)

p_ML = argmax_{p ∈ Q} L(p)
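The log-likelihood objective L(p) is straightforward to compute. A minimal sketch; the tiny corpus and the stand-in uniform model are invented for illustration:

```python
from math import log

# L(p) = sum_{a,b} r(a,b) log p(a|b), where r is the normalized
# frequency of (a,b) in the training set.

train = [("NNP", "Boeing"), ("NNP", "Clinton"), ("VBG", "trading")]

def r(a, b):
    # normalized frequency of the event (a, b)
    return sum(1 for ev in train if ev == (a, b)) / len(train)

def p(a, b):
    # stand-in model: uniform over the two outcomes
    return 0.5

def log_likelihood(p):
    events = set(train)
    return sum(r(a, b) * log(p(a, b)) for a, b in events)

print(log_likelihood(p))  # log(0.5) ≈ -0.693
```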
Principle of maximum entropy (Jaynes, 1957)
• Use the probability model that is maximally uncertain w.r.t. the observed evidence
• Why? Anything else assumes a fact you have not observed.
P = { models consistent with the evidence }
H(p) = { entropy of p }

p_ME = argmax_{p ∈ P} H(p)
Maxent example
• Task: estimate a joint distribution p(A,B)
– A is in {x, y}
– B is in {0, 1}
• Define a feature f
– Assume some expected value over a training set
p(A,B)   B=0   B=1
A=x      ?     ?
A=y      ?     0.7

f(a,b) = 1 iff (a = y & b = 1), 0 otherwise
E[f] = 7/10 = 0.7
∑_{a,b} p(a,b) = 1
Maxent example (cont’d)
• Define entropy:

H(p) = −∑_{a,b} p(a,b) log p(a,b)

• One way to meet the constraints, H(p) = 1.25:

p(A,B)   B=0   B=1
A=x      0.05  0.2
A=y      0.05  0.7

• The maxent way to meet the constraints, H(p) = 1.35:

p(A,B)   B=0   B=1
A=x      0.1   0.1
A=y      0.1   0.7
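The two entropies can be checked numerically (base-2 log; the exact values are about 1.257 and 1.357 bits):

```python
from math import log2

def H(p):
    # H(p) = -sum p(a,b) log2 p(a,b), skipping zero cells
    return -sum(x * log2(x) for x in p if x > 0)

one_way = [0.05, 0.2, 0.05, 0.7]   # meets the constraints, but not maxent
maxent  = [0.1, 0.1, 0.1, 0.7]     # spreads the unconstrained mass evenly

print(H(one_way))  # about 1.26 bits
print(H(maxent))   # about 1.36 bits, higher as expected
```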
Conditional maximum entropy (Berger et al, 1995)
E_r[f_j] = { observed expectation of f_j } = ∑_{a,b} r(a,b) f_j(a,b)

E_p[f_j] = { model's expectation of f_j } = ∑_{a,b} r(b) p(a|b) f_j(a,b)

P = { p | E_p[f_j] = E_r[f_j], j = 1…k }

H(p) = −∑_{a,b} r(b) p(a|b) log p(a|b)

p_ME = argmax_{p ∈ P} H(p)
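The two expectations can be computed directly. A sketch with an invented corpus, feature, and stand-in uniform model, chosen so the two expectations disagree (which is what parameter estimation must fix):

```python
# (outcome, context) training events; duplicates are deliberate
train = [("yes", "Nov"), ("no", "Nov"), ("no", "Nov"), ("no", "Mr")]
outcomes = ["yes", "no"]

def f(a, b):
    return 1 if a == "no" and b == "Nov" else 0

def r(a, b):   # normalized frequency of the event (a, b)
    return sum(1 for ev in train if ev == (a, b)) / len(train)

def r_b(b):    # marginal frequency of the context b
    return sum(1 for _, ctx in train if ctx == b) / len(train)

def p(a, b):   # stand-in model: uniform over outcomes
    return 1 / len(outcomes)

contexts = set(b for _, b in train)

# E_r[f] = sum_{a,b} r(a,b) f(a,b)
E_r = sum(r(a, b) * f(a, b) for a in outcomes for b in contexts)

# E_p[f] = sum_{a,b} r(b) p(a|b) f(a,b)
E_p = sum(r_b(b) * p(a, b) * f(a, b) for a in outcomes for b in contexts)

print(E_r, E_p)  # 0.5 0.375 — the uniform model violates the constraint
```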
Duality of ML and ME
• Under ME it must be the case that:
• ML and ME solutions are the same
– p_ME = p_ML
– ML: the form is assumed without justification
– ME: the constraints are assumed, the form is derived
p_ME(a|b) = (1/Z(b)) ∏_{j=1..k} α_j^{f_j(a,b)}
Extensions: Minimum divergence modeling
• Kullback-Leibler divergence
– Measures the “distance” between two probability distributions
– Not symmetric!
– See (Cover and Thomas, Elements of Information Theory)
D(p,q) = ∑_{a,b} r(b) p(a|b) log [ p(a|b) / q(a|b) ]

D(p,q) ≥ 0
D(p,q) = 0 iff p = q
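A small numerical sketch of the conditional KL divergence above; the distributions are invented to illustrate the two properties (D ≥ 0, and D = 0 exactly when p = q):

```python
from math import log

contexts = {"b1": 0.6, "b2": 0.4}           # r(b)
p = {("a1", "b1"): 0.9, ("a2", "b1"): 0.1,
     ("a1", "b2"): 0.5, ("a2", "b2"): 0.5}  # p(a|b)
q = {("a1", "b1"): 0.5, ("a2", "b1"): 0.5,
     ("a1", "b2"): 0.5, ("a2", "b2"): 0.5}  # q(a|b)

def D(p, q):
    # D(p,q) = sum_{a,b} r(b) p(a|b) log [p(a|b)/q(a|b)]
    return sum(rb * p[a, b] * log(p[a, b] / q[a, b])
               for b, rb in contexts.items()
               for a in ("a1", "a2"))

print(D(p, q))  # > 0: p deviates from q in context b1
print(D(p, p))  # 0.0: a distribution has zero divergence from itself
```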
Extensions : Minimum divergence models (cont’d)
• Minimum divergence framework:
– Start with a prior model q
– From the set of consistent models P, minimize the KL divergence to q
• Parameters will reflect the deviation from the prior model
– Use case: prior model is static
• Same as maximizing entropy when q is uniform
• See (Della Pietra et al, 1992) for an example in language modeling
P = { p | E_p[f_j] = E_r[f_j], j = 1…k }

p_MD = argmin_{p ∈ P} D(p,q)

p_MD(a|b) = q(a|b) ∏_{j=1..k} α_j^{f_j(a,b)} / ∑_{a'} q(a'|b) ∏_{j=1..k} α_j^{f_j(a',b)}
Parameter estimation (an incomplete list)
• Generalized Iterative Scaling (Darroch & Ratcliff, 1972)
– Find correction feature and constant
– Iterative updates
• Improved iterative scaling (Della Pietra et al., 1997)
• Conjugate gradient
• Sequential conditional GIS (Goodman, 2002)
• Correction-free GIS (Curran and Clark, 2003)
Define the correction feature:

f_{k+1}(a,b) = C − ∑_{j=1..k} f_j(a,b)

GIS:

α_j^(0) = 1
α_j^(n) = α_j^(n−1) [ E_r[f_j] / E_{p^(n−1)}[f_j] ]^(1/C)
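To make the update concrete, here is a runnable toy GIS loop on the earlier NNP/VBG example. The training set, questions, and iteration count are invented for illustration; this is not the lab's implementation:

```python
def q_cap(w):
    return w[:1].isupper()      # q1: first letter capitalized?

def q_ing(w):
    return w.endswith("ing")    # q2: ends in -ing?

OUTCOMES = ["NNP", "VBG"]
BASE = [lambda a, b: 1 if a == "NNP" and q_cap(b) else 0,
        lambda a, b: 1 if a == "VBG" and q_ing(b) else 0]

TRAIN = [("NNP", "Clinton"), ("NNP", "Bush"), ("VBG", "trading"),
         ("VBG", "offering"), ("NNP", "Boeing"), ("VBG", "Boeing")]

# C = max number of active features; the correction feature pads up to C
C = max(sum(f(a, b) for f in BASE) for a in OUTCOMES for _, b in TRAIN)
FEATS = BASE + [lambda a, b: C - sum(f(a, b) for f in BASE)]

def unnorm(a, b, alphas):
    x = 1.0
    for f, al in zip(FEATS, alphas):
        x *= al ** f(a, b)
    return x

def p(a, b, alphas):
    return unnorm(a, b, alphas) / sum(unnorm(y, b, alphas) for y in OUTCOMES)

N = len(TRAIN)
obs = [sum(f(a, b) for a, b in TRAIN) / N for f in FEATS]   # E_r[f_j]

alphas = [1.0] * len(FEATS)                                 # alpha_j^(0) = 1
for _ in range(20):
    # E_p[f_j] under the current model
    model = [sum(p(a, b, alphas) * f(a, b) for _, b in TRAIN for a in OUTCOMES) / N
             for f in FEATS]
    # alpha_j <- alpha_j * (E_r[f_j] / E_p[f_j])^(1/C); never-observed features get 0
    alphas = [0.0 if o == 0 else al * (o / m) ** (1.0 / C)
              for al, o, m in zip(alphas, obs, model)]

print(round(p("NNP", "Clinton", alphas), 3))  # 1.0: unambiguous in training
print(round(p("NNP", "Boeing", alphas), 3))   # 0.5: both questions fire
```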
Comparisons
• Same model form as multi-class logistic regression
• Diverse forms of evidence
• Compared to decision trees:
– Advantage: no data fragmentation
– Disadvantage: no feature construction
• Compared to naive Bayes:
– No independence assumptions
• Scales well on sparse feature sets
– Parameter estimation (GIS) is O( [# of training samples] × [# of predictions] × [avg. # of features per training event] )
Disadvantages
• “Perfect” predictors cause parameters to diverge
– Suppose the word the only occurred with tag DT
– The estimation algorithm forces p(a|b) = 1 in order to meet the constraints
• The parameter for (the, DT) will diverge to infinity
• It may beat out other parameters estimated from many examples!
• A remedy
– Gaussian priors or “fuzzy maximum entropy” (Chen & Rosenfeld, 2000)
– Discount the observed expectations
How to specify a maxent model
• Outcomes
– What are we predicting?
• Questions
– What information is useful for predicting?
• Feature selection
– Candidate feature set consists of all (outcome, question) pairs
– Given the candidate feature set, what subset do we use?
Finding the part-of-speech
• Part of Speech (POS) Tagging
– Return a sequence of POS tags

Input: Fruit flies like a banana
Output: N N V D N

Input: Time flies like an arrow
Output: N V P D N

• Train maxent models from the POS tags of the Penn treebank (Marcus et al, 1993)
• Use a heavily pruned search procedure to find the highest probability tag sequence
Model for POS tagging (Ratnaparkhi, 1996)
• Outcomes
– 45 POS tags (Penn Treebank)
• Question patterns:
– common words: word identity
– rare words: presence of prefix, suffix, capitalization, hyphens, and numbers
– previous 2 tags
– surrounding 2 words
• Feature selection
– count cutoff of 10
A training event
• Example:
– …stories about well-heeled communities and …
–    NNS     IN   ???
• Outcome: JJ (adjective)
• Questions
– w[i-1]=about, w[i-2]=stories, w[i+1]=communities, w[i+2]=and,
t[i-1]=IN, t[i-2][i-1]=NNS IN,
pre[1]=w, pre[2]=we, pre[3]=wel, pre[4]=well,
suf[1]=d, suf[2]=ed, suf[3]=led, suf[4]=eled
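Extracting these questions for one event can be sketched as below; the function name and question-string format are illustrative:

```python
def extract_questions(words, tags, i):
    # context questions: surrounding words and previous tags
    w = words[i]
    qs = [f"w[i-1]={words[i-1]}", f"w[i-2]={words[i-2]}",
          f"w[i+1]={words[i+1]}", f"w[i+2]={words[i+2]}",
          f"t[i-1]={tags[i-1]}", f"t[i-2][i-1]={tags[i-2]} {tags[i-1]}"]
    # rare-word spelling questions: prefixes and suffixes up to length 4
    for n in range(1, min(5, len(w))):
        qs.append(f"pre[{n}]={w[:n]}")
        qs.append(f"suf[{n}]={w[-n:]}")
    return qs

words = ["stories", "about", "well-heeled", "communities", "and"]
tags = ["NNS", "IN", None, None, None]
print(extract_questions(words, tags, 2))
```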
Finding the best POS sequence
a_1* … a_n* = argmax_{a_1…a_n} ∏_{i=1..n} p(a_i | b_i)

• Find the maximum probability sequence of n tags
– Use "top K" breadth-first search
– Tag left-to-right, but maintain only the top K ranked hypotheses
• The best ranked hypothesis is not guaranteed to be optimal
• Alternative: Conditional random fields (Lafferty et al, 2001)
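The "top K" search can be sketched as a beam search. The per-position scorer below is an invented toy lookup table; a real tagger would plug in the maxent model's p(tag | context):

```python
def beam_search(words, score, tagset, K=3):
    # hypotheses: (running product of probabilities, tag sequence so far)
    hyps = [(1.0, [])]
    for w in words:
        expanded = [(prob * score(t, w, seq), seq + [t])
                    for prob, seq in hyps for t in tagset]
        hyps = sorted(expanded, key=lambda h: -h[0])[:K]  # keep top K only
    return hyps[0][1]

def toy_score(tag, word, prev_tags):
    # hypothetical lexical preferences for the running example
    table = {("Time", "N"): 0.9, ("Time", "V"): 0.1,
             ("flies", "V"): 0.6, ("flies", "N"): 0.4,
             ("like", "P"): 0.8, ("like", "V"): 0.2,
             ("an", "D"): 1.0, ("arrow", "N"): 1.0}
    return table.get((word, tag), 0.01)

print(beam_search(["Time", "flies", "like", "an", "arrow"],
                  toy_score, ["N", "V", "P", "D"]))  # ['N', 'V', 'P', 'D', 'N']
```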
Performance
Domain                                     Word accuracy   Unknown word accuracy   Sentence accuracy
English: Wall St. Journal                  96.6%           85.3%                   47.3%
Spanish: CRATER corpus (re-mapped tagset)  97.7%           83.3%                   60.4%
Summary
• Errors:
– are typically the words that are difficult to annotate
• that, about, more
• Architecture
– Can be ported easily to similar tasks with different tag sets, esp. named entity detection
• Name detector
– Tags = { begin_Name, continue_Name, other }
– Sequence probability can be used downstream
• Available for download
– MXPOST & MXTERMINATOR
Maxent model for Keystone content-match
• Yahoo’s Keystone content-match uses a click model
• Use features of the (page, ad, user) to predict a click
• Use cases:
– Select ads with the (page, ad) cosine score, use the click model to re-rank
– Select ads directly with the click model score
P(click | page, ad, user) = (1/Z(page, ad, user)) ∏_j α_j^{f_j(click, page, ad, user)}
Maxent model for Keystone content-match
• Outcomes: click (1) or no click (0)
• Questions:
– Unigrams, phrases, categories on the page side
– Unigrams, phrases, categories on the ad side
– The user’s BT category
• Feature selection
– Count cutoff, mutual information
Some recent work
• Recent work
– Using the (page, ad) cosine score (Opal) as a feature
– Using page domain → ad bid phrase mappings
– Using user BT → ad bid phrase mappings
– Using user age+gender → bid phrase mappings
• Contact
– Andy Hatch (aohatch)
– Abraham Bagherjeiran (abagher)
Maxent on the grid
• Several grid maxent implementations at Yahoo
• Correction-Free GIS (Curran and Clark, 2003)
– Implemented for verification and the lab session
– Not product-ready code!
• Input for each iteration
– Training data format: [label] [weight] [q1 … qN]
– Feature set format: [q] [label] [parameter]
• Output for each iteration
– New feature set format: [q] [label] [new parameter]
Maxent on the grid (cont’d)
• Map phase (parallelization across training data)
– Collect observed feature expectations
– Collect the model’s feature expectations w.r.t. the current model
– Use the feature name as the key
• Reduce phase (parallelization across model parameters)
– For each key (feature):
• Sum up the observed and model feature expectations
• Do the parameter update
– Write the new model
• Repeat for N iterations
Maxent on the grid (cont’d)
• maxent_mapper
– args: [file of model params]
– stdin: [training data, one instance per line]
– stdout: [feature name] [observed val] [current param] [model val]
• maxent_reducer
– args: [iteration] [correction constant]
– stdin: (input from maxent_mapper, sorted by key)
– stdout: [feature name] [new param]
• Uses Hadoop streaming; can also be run off the grid
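To illustrate the reducer's job, here is a sketch of the per-feature GIS update over the streaming line format described above. This is NOT the lab's maxent_reducer, just the logic it describes, assuming the feature name may itself contain a space (e.g. "[q] [label]"):

```python
import sys

def reduce_stream(lines, C, out=sys.stdout):
    # feature -> [observed_sum, current_param, model_sum]
    totals = {}
    for line in lines:
        parts = line.split()
        feat = " ".join(parts[:-3])            # feature name may contain spaces
        obs, param, model = map(float, parts[-3:])
        o, _, m = totals.setdefault(feat, [0.0, param, 0.0])
        totals[feat][0] = o + obs              # sum observed expectations
        totals[feat][2] = m + model            # sum model expectations
    for feat in sorted(totals):
        o, param, m = totals[feat]
        # GIS update: param * (observed / model)^(1/C)
        new_param = param * (o / m) ** (1.0 / C) if o > 0 and m > 0 else 0.0
        out.write(f"{feat} {new_param:.6g}\n")

# demo with two lines like the lab tables (C=3 is an arbitrary choice here):
reduce_stream(["prefix=$1 no 462 1 233", "prefix=$1 yes 4 1 233"], C=3)
# raises the "no" parameter, shrinks the "yes" parameter
```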
Lab session
• Login to a UNIX machine or Mac
• mkdir maxent_lab; cd maxent_lab
• svn checkout svn+ssh://svn.corp.yahoo.com/yahoo/adsciences/contextualadvertising/streaming_maxent/trunk/GIS .
• cd src; make clean all
• cd ../unit_test
• ./doloop
Lab Exercise: Sentence Boundary Detection
• Problem: given a “.” in free text, classify it:
– Yes, it is a sentence boundary
– No, it is not
• Not a super-hard problem, but not super-easy!
– A hand-coded baseline can get a high result
– With foreign languages, hand-coding is tougher
• cd data
• The Penn Treebank corpus: ptb.txt.gz
– gunzip -c ptb.txt.gz | head
– A newline indicates a sentence boundary
– The Penn treebank tokenizes text for NLP
• Tokenization was undone as much as possible for this exercise
Lab: data generation
• cd ../lab/data
• Look at mkData.sh
– creates train / development test / test sets
– Feature extraction:
• [true label] [weight] [q1] … [qN]

no 1.0 *default* prefix=Nov
yes 1.0 *default* prefix=29

• *default* is always on
– The model can estimate the prior probability of yes and no
• Prefix: the character sequence before “.” back to the previous space character
• Run ./mkData.sh
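The prefix question can be sketched in a few lines. The lab's extractfeatures.pl does the real extraction; this Python version (hypothetical function name) just illustrates the idea:

```python
def prefix_before_dot(text, dot_index):
    # characters before the "." back to (not including) the previous space
    start = text.rfind(" ", 0, dot_index) + 1
    return text[start:dot_index]

line = "said Sept. 30. The deficit"
first_dot = line.index(".")
print("prefix=" + prefix_before_dot(line, first_dot))  # prefix=Sept
```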
Lab: feature selection and training
• cd ../train
• Look at dotrain
– selectfeatures
• selects features with frequency >= cutoff
– train
• finds the correction constant
• iterates (each iteration is one map/reduce job)
– maxent_mapper: collect stats
– maxent_reducer: find the update
• Look at dotest
– accuracy.pl: classifies as “yes” if prob > 0.5
– Evaluation: # of correctly classified test instances
• Run ./dotrain
Lab: matching expectations
• GIS should bring model expectation closer to observed expectation
• After the 1st map (1.mapped):

Feature        observed  parameter   model
prefix=$1 no   462       1           233
prefix=$1 yes  4         1           233

• After the 9th map (9.mapped):

Feature        observed  parameter   model
prefix=$1 no   462       1.51773     461.574
prefix=$1 yes  4         0.00731677  4.42558
Lab: results
• Log-likelihood of training data must increase
• Accuracy:
– Train: 46670 correct out of 47771, or 97.6952544430722%
– Dev: 14940 correct out of 15579, or 95.8983246678221%
[Chart: log likelihood of the training data over iterations 1-9, rising from about −35000 toward 0]
Lab: your turn!!!
• Beat this result (on the development set only)!
• Things to try
– Feature extraction
• Look at the data and find other things that would be useful for sentence boundary detection
• data/extractfeatures.pl
– Suffix features
– Feature classes
– Feature selection
• Pay attention to the number of features
– Number of iterations
– Pay attention to train vs. test accuracy
Lab: Now let’s try the real test set
• Take your best model, and try it on the test set
./dotest N.features ../data/te
• Did you beat the baseline?
13937 correct out of 14586, or 95.5505279034691%
• Who has the highest result?