Adwait Ratnaparkhi
Yahoo! Labs
Introduction to maximum entropy models
- 2 - Yahoo! Confidential
Introduction
• This talk is geared for Technical Yahoos…
– Who are unfamiliar with machine learning or maximum entropy
– Who are familiar with machine learning but not maximum entropy
– Who want to see an example of machine learning on the grid
– Who are looking for an introductory talk
• Lab session is geared for Technical Yahoos…
– Who brought their laptop
– Who want to obtain more numerical intuition about the optimization algorithms
– Who don’t mind coding a little Perl
– Who have some patience in case things go awry!
Outline
• What is maximum entropy modeling?
• Example applications of the maximum entropy framework
• Lab session
– Look at the mapper and reducer of a maximum entropy parameter estimation algorithm
What is a maximum entropy model?
• A framework for combining evidence into a probability model
• The probability can then be used for classification, or as input to the next component
– Content match: P(click | page, ad, user)
– Part-of-speech tagging: P(part-of-speech-tag | word)
– Categorization: P(category | page)
• The principle of maximum entropy is used as the optimization criterion
– Relationship to maximum likelihood optimization
– Same model form as multi-class logistic regression
Machine learning approach
• Data with labels
– Automatic: ad click
• sponsored search, content match, graphical ads
– Manual: editorial process
• page categories, part-of-speech tags
• Objective
– Training phase
• Find a (large) training set
• Use the training set to construct a classifier or probability model
– Test phase
• Find an (honest) test set
• Use the model to assign labels to previously unseen data
Why machine learning? Why maxent?
• Why machine learning?
– Pervasive ambiguity
• Any task with natural language features
– For these tasks, it is easier to collect and annotate data than to hand-code an expert system
– For certain web tasks, annotation is free (e.g., clicks)
• Why maxent?
– Little restriction on the kinds of evidence
• No independence assumption
– Works well with sparse features
– Works well in parallel settings, like Hadoop
– Appeal of the maxent interpretation
Sentence Boundary Detection
• Which “.” denotes a sentence boundary?
Mr. B. Green from X.Y.Z. Corp. said that the U.S. budget deficit hit $1.42 trillion for the year that ended Sept. 30. The previous year’s deficit was $459 billion.
• Model: P({yes|no}| candidate boundary)
Part-of-speech tagging
• What is the part of speech of flies?
– Fruit flies like a banana.
– Time flies like an arrow.
• Model: P(tag | current word, surrounding words, …)
– End goal is a sequence of tags
Content match ad clicks
P(click|page, ad, user)
Ambiguity resolution: an artificial example
• Tagging unknown, or “out-of-vocabulary”, words:
– Given a word, predict its POS tag based on spelling features
• Assume all words are either:
– Proper Noun (NNP)
– Verb Gerund (VBG)
• Find a model: P(tag | word) where tag is in {NNP, VBG}
• Assume a training set
– Editorially derived (word, tag) pairs
– Training data tags are {NNP, VBG}
– Reminder: artificial example!
Ambiguity resolution (cont’d)
• Ask the data
– q1: Is the first letter of the word capitalized?
– q2: Does the word end in ing?
• Evidence for (q1, NNP)
– Clinton, Bush, Reagan as NNP
• Evidence for (q2, VBG)
– trading, offering, operating as VBG
• Rules based on q1, q2 can conflict
– Boeing
• How to choose in ambiguous cases?
Features represent evidence
• a = what we are predicting
• b = what we observe
• Terminology
– Here, the input to a feature is the (a, b) pair
– But in much of the ML literature, the input to a feature is just (b)
f_{y,q}(a,b) = 1 if a = y AND q(b) = true; 0 otherwise

f_1(a,b) = 1 if a = NNP and q1(b) = true; 0 otherwise
f_2(a,b) = 1 if a = VBG and q2(b) = true; 0 otherwise
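The indicator features above translate directly into code. A minimal sketch (the helper names `q1`, `q2`, `make_feature` are illustrative, not from the talk's codebase):

```python
# Indicator features: each pairs an outcome y with a question q over the observation.

def q1(word):
    # q1: is the first letter of the word capitalized?
    return word[:1].isupper()

def q2(word):
    # q2: does the word end in "ing"?
    return word.endswith("ing")

def make_feature(y, q):
    # f_{y,q}(a, b) = 1 if a == y and q(b) is true, else 0
    return lambda a, b: 1 if a == y and q(b) else 0

f1 = make_feature("NNP", q1)
f2 = make_feature("VBG", q2)

# "Boeing" is capitalized AND ends in -ing, so both questions fire,
# but each feature is active only for its own outcome:
print(f1("NNP", "Boeing"), f2("NNP", "Boeing"))  # 1 0
print(f1("VBG", "Boeing"), f2("VBG", "Boeing"))  # 0 1
```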
Combine the features: probability model
• Probability is product of weights of active features
• Why this form?
p(a|b) = (1/Z(b)) ∏_{j=1..k} α_j^{f_j(a,b)}

Z(b) = ∑_{a'} ∏_{j=1..k} α_j^{f_j(a',b)}

f_j(a,b) ∈ {0,1} : features
α_j > 0 : parameters
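The model form above can be sketched in a few lines. The weights below are made up for illustration, not estimated from data:

```python
from math import prod  # Python 3.8+

def f1(a, b):  # active for NNP on capitalized words
    return 1 if a == "NNP" and b[:1].isupper() else 0

def f2(a, b):  # active for VBG on words ending in -ing
    return 1 if a == "VBG" and b.endswith("ing") else 0

FEATURES = [f1, f2]
ALPHAS = [3.0, 2.0]        # alpha_j > 0, hypothetical values
OUTCOMES = ["NNP", "VBG"]

def unnormalized(a, b):
    # product of weights of active features: prod_j alpha_j ^ f_j(a, b)
    return prod(alpha ** f(a, b) for f, alpha in zip(FEATURES, ALPHAS))

def p(a, b):
    # p(a|b) = unnormalized(a, b) / Z(b),  Z(b) = sum over a' of unnormalized(a', b)
    z = sum(unnormalized(y, b) for y in OUTCOMES)
    return unnormalized(a, b) / z

# For "Boeing" both questions fire, so p(NNP|Boeing) = alpha1 / (alpha1 + alpha2)
print(p("NNP", "Boeing"))  # 0.6
```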
Probabilities for ambiguity resolution
• How do we find optimal parameter values?

p(NNP | Boeing) = (1/Z) α_1^{f_1(a,b)} α_2^{f_2(a,b)} = (1/Z) α_1

p(VBG | Boeing) = (1/Z) α_1^{f_1(a,b)} α_2^{f_2(a,b)} = (1/Z) α_2
Maximum likelihood estimation
Q = { p | p(a|b) = (1/Z(b)) ∏_{j=1..k} α_j^{f_j(a,b)} }

r(a,b) = { normalized frequency of (a,b) in the training set }

L(p) = ∑_{a,b} r(a,b) log p(a|b)

p_ML = argmax_{p ∈ Q} L(p)
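The log-likelihood objective L(p) is straightforward to compute. A minimal sketch; the tiny corpus and the stand-in uniform model are invented for illustration:

```python
from math import log

# L(p) = sum_{a,b} r(a,b) log p(a|b), where r is the normalized
# frequency of (a,b) in the training set.

train = [("NNP", "Boeing"), ("NNP", "Clinton"), ("VBG", "trading")]

def r(a, b):
    # normalized frequency of the event (a, b)
    return sum(1 for ev in train if ev == (a, b)) / len(train)

def p(a, b):
    # stand-in model: uniform over the two outcomes
    return 0.5

def log_likelihood(p):
    events = set(train)
    return sum(r(a, b) * log(p(a, b)) for a, b in events)

print(log_likelihood(p))  # log(0.5) ≈ -0.693
```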
Principle of maximum entropy (Jaynes, 1957)
• Use the probability model that is maximally uncertain w.r.t. the observed evidence
• Why? Anything else assumes a fact you have not observed.
P = { models consistent with the evidence }
H(p) = { entropy of p }

p_ME = argmax_{p ∈ P} H(p)
Maxent example
• Task: estimate a joint distribution p(A,B)
– A is in {x, y}
– B is in {0, 1}
• Define a feature f
– Assume some expected value over a training set
p(A,B)   B=0   B=1
A=x      ?     ?
A=y      ?     0.7

f(a,b) = 1 iff (a = y & b = 1), 0 otherwise
E[f] = 7/10 = 0.7
∑_{a,b} p(a,b) = 1
Maxent example (cont’d)
• Define entropy:

H(p) = −∑_{a,b} p(a,b) log p(a,b)

• One way to meet the constraints, H(p) = 1.25:

p(A,B)   B=0   B=1
A=x      0.05  0.2
A=y      0.05  0.7

• The maxent way to meet the constraints, H(p) = 1.35:

p(A,B)   B=0   B=1
A=x      0.1   0.1
A=y      0.1   0.7
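The two entropies can be checked numerically (base-2 log; the exact values are about 1.257 and 1.357 bits):

```python
from math import log2

def H(p):
    # H(p) = -sum p(a,b) log2 p(a,b), skipping zero cells
    return -sum(x * log2(x) for x in p if x > 0)

one_way = [0.05, 0.2, 0.05, 0.7]   # meets the constraints, but not maxent
maxent  = [0.1, 0.1, 0.1, 0.7]     # spreads the unconstrained mass evenly

print(H(one_way))  # about 1.26 bits
print(H(maxent))   # about 1.36 bits, higher as expected
```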
Conditional maximum entropy (Berger et al, 1995)
E_r[f_j] = { observed expectation of f_j } = ∑_{a,b} r(a,b) f_j(a,b)

E_p[f_j] = { model's expectation of f_j } = ∑_{a,b} r(b) p(a|b) f_j(a,b)

P = { p | E_p[f_j] = E_r[f_j], j = 1…k }

H(p) = −∑_{a,b} r(b) p(a|b) log p(a|b)

p_ME = argmax_{p ∈ P} H(p)
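The two expectations can be computed directly. A sketch with an invented corpus, feature, and stand-in uniform model, chosen so the two expectations disagree (which is what parameter estimation must fix):

```python
# (outcome, context) training events; duplicates are deliberate
train = [("yes", "Nov"), ("no", "Nov"), ("no", "Nov"), ("no", "Mr")]
outcomes = ["yes", "no"]

def f(a, b):
    return 1 if a == "no" and b == "Nov" else 0

def r(a, b):   # normalized frequency of the event (a, b)
    return sum(1 for ev in train if ev == (a, b)) / len(train)

def r_b(b):    # marginal frequency of the context b
    return sum(1 for _, ctx in train if ctx == b) / len(train)

def p(a, b):   # stand-in model: uniform over outcomes
    return 1 / len(outcomes)

contexts = set(b for _, b in train)

# E_r[f] = sum_{a,b} r(a,b) f(a,b)
E_r = sum(r(a, b) * f(a, b) for a in outcomes for b in contexts)

# E_p[f] = sum_{a,b} r(b) p(a|b) f(a,b)
E_p = sum(r_b(b) * p(a, b) * f(a, b) for a in outcomes for b in contexts)

print(E_r, E_p)  # 0.5 0.375 — the uniform model violates the constraint
```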
Duality of ML and ME
• Under ME it must be the case that:
• ML and ME solutions are the same
– p_ME = p_ML
– ML: the form is assumed without justification
– ME: the constraints are assumed, the form is derived
p_ME(a|b) = (1/Z(b)) ∏_{j=1..k} α_j^{f_j(a,b)}
Extensions: Minimum divergence modeling
• Kullback-Leibler divergence
– Measures the “distance” between two probability distributions
– Not symmetric!
– See (Cover and Thomas, Elements of Information Theory)
D(p,q) = ∑_{a,b} r(b) p(a|b) log [ p(a|b) / q(a|b) ]

D(p,q) ≥ 0
D(p,q) = 0 iff p = q
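A small numerical sketch of the conditional KL divergence above; the distributions are invented to illustrate the two properties (D ≥ 0, and D = 0 exactly when p = q):

```python
from math import log

contexts = {"b1": 0.6, "b2": 0.4}           # r(b)
p = {("a1", "b1"): 0.9, ("a2", "b1"): 0.1,
     ("a1", "b2"): 0.5, ("a2", "b2"): 0.5}  # p(a|b)
q = {("a1", "b1"): 0.5, ("a2", "b1"): 0.5,
     ("a1", "b2"): 0.5, ("a2", "b2"): 0.5}  # q(a|b)

def D(p, q):
    # D(p,q) = sum_{a,b} r(b) p(a|b) log [p(a|b)/q(a|b)]
    return sum(rb * p[a, b] * log(p[a, b] / q[a, b])
               for b, rb in contexts.items()
               for a in ("a1", "a2"))

print(D(p, q))  # > 0: p deviates from q in context b1
print(D(p, p))  # 0.0: a distribution has zero divergence from itself
```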
Extensions : Minimum divergence models (cont’d)
• Minimum divergence framework:
– Start with a prior model q
– From the set of consistent models P, minimize the KL divergence to q
• Parameters will reflect the deviation from the prior model
– Use case: prior model is static
• Same as maximizing entropy when q is uniform
• See (Della Pietra et al, 1992) for an example in language modeling
P = { p | E_p[f_j] = E_r[f_j], j = 1…k }

p_MD = argmin_{p ∈ P} D(p,q)

p_MD(a|b) = q(a|b) ∏_{j=1..k} α_j^{f_j(a,b)} / ∑_{a'} q(a'|b) ∏_{j=1..k} α_j^{f_j(a',b)}
Parameter estimation (an incomplete list)
• Generalized Iterative Scaling (Darroch & Ratcliff, 1972)
– Find correction feature and constant
– Iterative updates
• Improved iterative scaling (Della Pietra et al., 1997)
• Conjugate gradient
• Sequential conditional GIS (Goodman, 2002)
• Correction-free GIS (Curran and Clark, 2003)
Define the correction feature:

f_{k+1}(a,b) = C − ∑_{j=1..k} f_j(a,b)

GIS:

α_j^(0) = 1
α_j^(n) = α_j^(n−1) [ E_r[f_j] / E_{p^(n−1)}[f_j] ]^(1/C)
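To make the update concrete, here is a runnable toy GIS loop on the earlier NNP/VBG example. The training set, questions, and iteration count are invented for illustration; this is not the lab's implementation:

```python
def q_cap(w):
    return w[:1].isupper()      # q1: first letter capitalized?

def q_ing(w):
    return w.endswith("ing")    # q2: ends in -ing?

OUTCOMES = ["NNP", "VBG"]
BASE = [lambda a, b: 1 if a == "NNP" and q_cap(b) else 0,
        lambda a, b: 1 if a == "VBG" and q_ing(b) else 0]

TRAIN = [("NNP", "Clinton"), ("NNP", "Bush"), ("VBG", "trading"),
         ("VBG", "offering"), ("NNP", "Boeing"), ("VBG", "Boeing")]

# C = max number of active features; the correction feature pads up to C
C = max(sum(f(a, b) for f in BASE) for a in OUTCOMES for _, b in TRAIN)
FEATS = BASE + [lambda a, b: C - sum(f(a, b) for f in BASE)]

def unnorm(a, b, alphas):
    x = 1.0
    for f, al in zip(FEATS, alphas):
        x *= al ** f(a, b)
    return x

def p(a, b, alphas):
    return unnorm(a, b, alphas) / sum(unnorm(y, b, alphas) for y in OUTCOMES)

N = len(TRAIN)
obs = [sum(f(a, b) for a, b in TRAIN) / N for f in FEATS]   # E_r[f_j]

alphas = [1.0] * len(FEATS)                                 # alpha_j^(0) = 1
for _ in range(20):
    # E_p[f_j] under the current model
    model = [sum(p(a, b, alphas) * f(a, b) for _, b in TRAIN for a in OUTCOMES) / N
             for f in FEATS]
    # alpha_j <- alpha_j * (E_r[f_j] / E_p[f_j])^(1/C); never-observed features get 0
    alphas = [0.0 if o == 0 else al * (o / m) ** (1.0 / C)
              for al, o, m in zip(alphas, obs, model)]

print(round(p("NNP", "Clinton", alphas), 3))  # 1.0: unambiguous in training
print(round(p("NNP", "Boeing", alphas), 3))   # 0.5: both questions fire
```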
Comparisons
• Same model form as multi-class logistic regression
• Diverse forms of evidence
• Compared to decision trees:
– Advantage: no data fragmentation
– Disadvantage: no feature construction
• Compared to naive Bayes:
– No independence assumptions
• Scales well on sparse feature sets
– Parameter estimation (GIS) is O( [# of training samples] × [# of predictions] × [avg. # of features per training event] )
Disadvantages
• “Perfect” predictors cause parameters to diverge
– Suppose the word the only occurred with tag DT
– The estimation algorithm forces p(a|b) = 1 in order to meet the constraints
• The parameter for (the, DT) will diverge to infinity
• It may beat out other parameters estimated from many examples!
• A remedy
– Gaussian priors or “fuzzy maximum entropy” (Chen & Rosenfeld, 2000)
– Discount the observed expectations
How to specify a maxent model
• Outcomes
– What are we predicting?
• Questions
– What information is useful for predicting?
• Feature selection
– Candidate feature set consists of all (outcome, question) pairs
– Given the candidate feature set, what subset do we use?
Finding the part-of-speech
• Part of Speech (POS) Tagging
– Return a sequence of POS tags

Input: Fruit flies like a banana
Output: N N V D N

Input: Time flies like an arrow
Output: N V P D N

• Train maxent models from the POS tags of the Penn treebank (Marcus et al, 1993)
• Use a heavily pruned search procedure to find the highest probability tag sequence
Model for POS tagging (Ratnaparkhi, 1996)
• Outcomes
– 45 POS tags (Penn Treebank)
• Question patterns:
– common words: word identity
– rare words: presence of prefix, suffix, capitalization, hyphens, and numbers
– previous 2 tags
– surrounding 2 words
• Feature selection
– count cutoff of 10
A training event
• Example:
– …stories about well-heeled communities and …
–    NNS     IN   ???
• Outcome: JJ (adjective)
• Questions
– w[i-1]=about, w[i-2]=stories, w[i+1]=communities, w[i+2]=and,
t[i-1]=IN, t[i-2][i-1]=NNS IN,
pre[1]=w, pre[2]=we, pre[3]=wel, pre[4]=well,
suf[1]=d, suf[2]=ed, suf[3]=led, suf[4]=eled
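Extracting these questions for one event can be sketched as below; the function name and question-string format are illustrative:

```python
def extract_questions(words, tags, i):
    # context questions: surrounding words and previous tags
    w = words[i]
    qs = [f"w[i-1]={words[i-1]}", f"w[i-2]={words[i-2]}",
          f"w[i+1]={words[i+1]}", f"w[i+2]={words[i+2]}",
          f"t[i-1]={tags[i-1]}", f"t[i-2][i-1]={tags[i-2]} {tags[i-1]}"]
    # rare-word spelling questions: prefixes and suffixes up to length 4
    for n in range(1, min(5, len(w))):
        qs.append(f"pre[{n}]={w[:n]}")
        qs.append(f"suf[{n}]={w[-n:]}")
    return qs

words = ["stories", "about", "well-heeled", "communities", "and"]
tags = ["NNS", "IN", None, None, None]
print(extract_questions(words, tags, 2))
```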
Finding the best POS sequence
a_1* … a_n* = argmax_{a_1…a_n} ∏_{i=1..n} p(a_i | b_i)

• Find the maximum probability sequence of n tags
– Use "top K" breadth-first search
– Tag left-to-right, but maintain only the top K ranked hypotheses
• The best ranked hypothesis is not guaranteed to be optimal
• Alternative: Conditional random fields (Lafferty et al, 2001)
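The "top K" search can be sketched as a beam search. The per-position scorer below is an invented toy lookup table; a real tagger would plug in the maxent model's p(tag | context):

```python
def beam_search(words, score, tagset, K=3):
    # hypotheses: (running product of probabilities, tag sequence so far)
    hyps = [(1.0, [])]
    for w in words:
        expanded = [(prob * score(t, w, seq), seq + [t])
                    for prob, seq in hyps for t in tagset]
        hyps = sorted(expanded, key=lambda h: -h[0])[:K]  # keep top K only
    return hyps[0][1]

def toy_score(tag, word, prev_tags):
    # hypothetical lexical preferences for the running example
    table = {("Time", "N"): 0.9, ("Time", "V"): 0.1,
             ("flies", "V"): 0.6, ("flies", "N"): 0.4,
             ("like", "P"): 0.8, ("like", "V"): 0.2,
             ("an", "D"): 1.0, ("arrow", "N"): 1.0}
    return table.get((word, tag), 0.01)

print(beam_search(["Time", "flies", "like", "an", "arrow"],
                  toy_score, ["N", "V", "P", "D"]))  # ['N', 'V', 'P', 'D', 'N']
```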
Performance
Domain                                     Word accuracy   Unknown word accuracy   Sentence accuracy
English: Wall St. Journal                  96.6%           85.3%                   47.3%
Spanish: CRATER corpus (re-mapped tagset)  97.7%           83.3%                   60.4%
Summary
• Errors:
– are typically the words that are difficult to annotate
• that, about, more
• Architecture
– Can be ported easily to similar tasks with different tag sets, esp. named entity detection
• Name detector
– Tags = { begin_Name, continue_Name, other }
– Sequence probability can be used downstream
• Available for download
– MXPOST & MXTERMINATOR
Maxent model for Keystone content-match
• Yahoo’s Keystone content-match uses a click model
• Use features of the (page, ad, user) to predict a click
• Use cases:
– Select ads with the (page, ad) cosine score, use the click model to re-rank
– Select ads directly with the click model score
P(click | page, ad, user) = (1/Z(page, ad, user)) ∏_j α_j^{f_j(click, page, ad, user)}
Maxent model for Keystone content-match
• Outcomes: click (1) or no click (0)
• Questions:
– Unigrams, phrases, categories on the page side
– Unigrams, phrases, categories on the ad side
– The user’s BT category
• Feature selection
– Count cutoff, mutual information
Some recent work
• Recent work
– Using the (page, ad) cosine score (Opal) as a feature
– Using page domain → ad bid phrase mappings
– Using user BT → ad bid phrase mappings
– Using user age+gender → bid phrase mappings
• Contact
– Andy Hatch (aohatch)
– Abraham Bagherjeiran (abagher)
Maxent on the grid
• Several grid maxent implementations at Yahoo
• Correction-Free GIS (Curran and Clark, 2003)
– Implemented for verification and the lab session
– Not product-ready code!
• Input for each iteration
– Training data format: [label] [weight] [q1 … qN]
– Feature set format: [q] [label] [parameter]
• Output for each iteration
– New feature set format: [q] [label] [new parameter]
Maxent on the grid (cont’d)
• Map phase (parallelization across training data)
– Collect observed feature expectations
– Collect the model’s feature expectations w.r.t. the current model
– Use the feature name as the key
• Reduce phase (parallelization across model parameters)
– For each key (feature):
• Sum up the observed and model feature expectations
• Do the parameter update
– Write the new model
• Repeat for N iterations
Maxent on the grid (cont’d)
• maxent_mapper
– args: [file of model params]
– stdin: [training data, one instance per line]
– stdout: [feature name] [observed val] [current param] [model val]
• maxent_reducer
– args: [iteration] [correction constant]
– stdin: (input from maxent_mapper, sorted by key)
– stdout: [feature name] [new param]
• Uses Hadoop streaming; can also be run off the grid
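To illustrate the reducer's job, here is a sketch of the per-feature GIS update over the streaming line format described above. This is NOT the lab's maxent_reducer, just the logic it describes, assuming the feature name may itself contain a space (e.g. "[q] [label]"):

```python
import sys

def reduce_stream(lines, C, out=sys.stdout):
    # feature -> [observed_sum, current_param, model_sum]
    totals = {}
    for line in lines:
        parts = line.split()
        feat = " ".join(parts[:-3])            # feature name may contain spaces
        obs, param, model = map(float, parts[-3:])
        o, _, m = totals.setdefault(feat, [0.0, param, 0.0])
        totals[feat][0] = o + obs              # sum observed expectations
        totals[feat][2] = m + model            # sum model expectations
    for feat in sorted(totals):
        o, param, m = totals[feat]
        # GIS update: param * (observed / model)^(1/C)
        new_param = param * (o / m) ** (1.0 / C) if o > 0 and m > 0 else 0.0
        out.write(f"{feat} {new_param:.6g}\n")

# demo with two lines like the lab tables (C=3 is an arbitrary choice here):
reduce_stream(["prefix=$1 no 462 1 233", "prefix=$1 yes 4 1 233"], C=3)
# raises the "no" parameter, shrinks the "yes" parameter
```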
Lab session
• Login to a UNIX machine or Mac
• mkdir maxent_lab; cd maxent_lab
• svn checkout svn+ssh://svn.corp.yahoo.com/yahoo/adsciences/contextualadvertising/streaming_maxent/trunk/GIS .
• cd src; make clean all
• cd ../unit_test
• ./doloop
Lab Exercise: Sentence Boundary Detection
• Problem: given a “.” in free text, classify it:
– Yes, it is a sentence boundary
– No, it is not
• Not a super-hard problem, but not super-easy!
– A hand-coded baseline can get a high result
– With foreign languages, hand-coding is tougher
• cd data
• The Penn Treebank corpus: ptb.txt.gz
– gunzip -c ptb.txt.gz | head
– A newline indicates a sentence boundary
– The Penn treebank tokenizes text for NLP
• Tokenization was undone as much as possible for this exercise
Lab: data generation
• cd ../lab/data
• Look at mkData.sh
– creates train / development test / test sets
– Feature extraction:
• [true label] [weight] [q1] … [qN]

no 1.0 *default* prefix=Nov
yes 1.0 *default* prefix=29

• *default* is always on
– The model can estimate the prior probability of yes and no
• Prefix: the character sequence before “.” back to the previous space character
• Run ./mkData.sh
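The prefix question can be sketched in a few lines. The lab's extractfeatures.pl does the real extraction; this Python version (hypothetical function name) just illustrates the idea:

```python
def prefix_before_dot(text, dot_index):
    # characters before the "." back to (not including) the previous space
    start = text.rfind(" ", 0, dot_index) + 1
    return text[start:dot_index]

line = "said Sept. 30. The deficit"
first_dot = line.index(".")
print("prefix=" + prefix_before_dot(line, first_dot))  # prefix=Sept
```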
Lab: feature selection and training
• cd ../train
• Look at dotrain
– selectfeatures
• selects features with frequency >= cutoff
– train
• finds the correction constant
• iterates (each iteration is one map/reduce job)
– maxent_mapper: collect stats
– maxent_reducer: find the update
• Look at dotest
– accuracy.pl: classifies as “yes” if prob > 0.5
– Evaluation: # of correctly classified test instances
• Run ./dotrain
Lab: matching expectations
• GIS should bring model expectation closer to observed expectation
• After the 1st map (1.mapped):

Feature        observed  parameter   model
prefix=$1 no   462       1           233
prefix=$1 yes  4         1           233

• After the 9th map (9.mapped):

Feature        observed  parameter   model
prefix=$1 no   462       1.51773     461.574
prefix=$1 yes  4         0.00731677  4.42558
Lab: results
• Log-likelihood of training data must increase
• Accuracy:
– Train: 46670 correct out of 47771, or 97.6952544430722%
– Dev: 14940 correct out of 15579, or 95.8983246678221%
[Chart: log likelihood of the training data over iterations 1-9, rising from about −35000 toward 0]
Lab: your turn!!!
• Beat this result (on the development set only)!
• Things to try
– Feature extraction
• Look at the data and find other things that would be useful for sentence boundary detection
• data/extractfeatures.pl
– Suffix features
– Feature classes
– Feature selection
• Pay attention to the number of features
– Number of iterations
– Pay attention to train vs. test accuracy
Lab: Now let’s try the real test set
• Take your best model, and try it on the test set
./dotest N.features ../data/te
• Did you beat the baseline?
13937 correct out of 14586, or 95.5505279034691%
• Who has the highest result?