Combining Statistical Language Models via the Latent Maximum
Entropy Principle
Shaojun Wang, Dale Schuurmans, Fuchun Peng, Yunxin Zhao
Abstract
• Simultaneously incorporate various aspects of natural language: local word interaction, syntactic structure, and semantic document information
• Latent maximum entropy (LME) principle, which allows relationships over hidden features to be effectively captured in a unified model
– Local lexical models (N-gram models)
– Global document-level semantic models (PLSA)
Introduction
• There are various kinds of language models that can be used to capture different aspects of natural language regularity:
– Markov chain (N-gram) models effectively capture local lexical regularities in text
– Smoothed N-grams handle the estimation of rare events
– Increasing the order of an N-gram to capture longer-range dependencies in natural language runs into the curse of dimensionality (Rosenfeld)
Introduction
– Structural language model effectively exploits relevant syntactic regularities to improve the perplexity score of N-gram models
– Semantic language model exploits document-level semantic regularities to achieve similar improvements
• Although each of these language models outperforms simple N-grams, they each only capture specific linguistic phenomena
Introduction
• Several techniques for combining language models:
– Linear interpolation: each individual model is trained separately and then combined by a weighted linear combination (see the sketch below)
– Maximum entropy: it models distributions over explicitly observed features
• However, there is much hidden semantic and syntactic information in natural language
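As a rough illustration of the interpolation approach, here is a minimal Python sketch; the mixture weight lam and the two component probabilities are hypothetical placeholders, not values from the paper.

```python
def interpolate(p_ngram, p_plsa, lam=0.6):
    """Combine two component language-model probabilities by a weighted linear mixture.

    p_ngram, p_plsa: probabilities each separately trained component assigns to the next word.
    lam: interpolation weight, normally tuned on held-out data (e.g. by EM).
    """
    return lam * p_ngram + (1.0 - lam) * p_plsa

# Example: a word the trigram likes but the semantic component finds off-topic.
print(interpolate(p_ngram=0.012, p_plsa=0.001))  # 0.0076
```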
Introduction
• Latent maximum entropy (LME) principle extends ME to incorporate latent variables
• Let X denote the complete data, Y be the observed incomplete data, and Z be the missing data: X = (Y, Z)
• The goal of ME is to find a probability model that matches certain constraints in the observed data while otherwise maximizing entropy
Maximum Entropy
• Do not construct separate models; instead, build a single, combined model which attempts to capture all the information provided by the various knowledge sources
• The intersection of all the constraints, if not empty, contains a (typically infinite) set of probability functions, which are all consistent with the knowledge sources
• The second step in the ME approach is to choose, from among the functions in that set, that function which has the highest entropy (i.e. the “flattest” function)
An example
• Assume we wish to estimate P("BANK"|h); one estimate may be provided by a conventional bigram
Features
An example
• Consider one such equivalence class, say, the one where the history ends in “THE”. The bigram assigns the same probability estimate to all events in that class
$P(\mathrm{BANK} \mid \mathrm{THE}) \stackrel{\mathrm{def}}{=} K_{\{\mathrm{THE,BANK}\}} = \frac{C(\mathrm{THE,\,BANK})}{C(\mathrm{THE})}$   [1]

(Example histories ending in "THE": "... where is the", "... does the", "... the".)
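A minimal sketch of how the constraint target $K_{\{\mathrm{THE,BANK}\}}$ in [1] could be tallied from counts; the toy corpus is invented purely for illustration.

```python
from collections import Counter

corpus = "the bank said the loan was approved by the bank manager".split()

bigrams = Counter(zip(corpus, corpus[1:]))
unigrams = Counter(corpus)

# Empirical bigram estimate K_{THE,BANK} = C(THE, BANK) / C(THE)
k_the_bank = bigrams[("the", "bank")] / unigrams["the"]
print(k_the_bank)  # 2 occurrences of "the bank" out of 3 occurrences of "the" -> 0.666...
```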
An example
• Another estimate may be provided by a particular trigger pair, say (LOAN, BANK)
– Assume we want to capture the dependency of "BANK" on whether or not "LOAN" occurred before it in the same document. Thus a different partition of the event space will be added, as in Table IV
– Similarly to the bigram case, consider now one such equivalence class, say, the one where “LOAN” did occur in the history. The trigger component assigns the same probability estimate to all events in that class
$P(\mathrm{BANK} \mid \mathrm{LOAN} \in h) \stackrel{\mathrm{def}}{=} K_{\{\mathrm{BANK},\,\mathrm{LOAN} \in h\}} = \frac{C(\mathrm{BANK},\,\mathrm{LOAN} \in h)}{C(\mathrm{LOAN} \in h)}$

(Example histories containing "LOAN": "... loan ... the", "... loan ... of".)
Features
An example
• Consider the bigram: under ME, we no longer insist that P(BANK|h) always have the same value $K_{\{\mathrm{THE,BANK}\}}$ whenever the history ends in "THE"
• Rather, we only require that $P_{\mathrm{COMBINED}}(\mathrm{BANK} \mid h)$ be equal to $K_{\{\mathrm{THE,BANK}\}}$ on average in the training data
Equation [1] is replaced by

$E_{h \ \text{ends in "THE"}}\; P_{\mathrm{COMBINED}}(\mathrm{BANK} \mid h) = K_{\{\mathrm{THE,BANK}\}}$   [2]

where E stands for an expectation
An example
• Similarly, we require that $P_{\mathrm{COMBINED}}(\mathrm{BANK} \mid h)$ be equal to $K_{\{\mathrm{BANK},\,\mathrm{LOAN} \in h\}}$ on average over those histories that contain occurrences of "LOAN":

$E_{\text{"LOAN"} \in h}\; P_{\mathrm{COMBINED}}(\mathrm{BANK} \mid h) = K_{\{\mathrm{BANK},\,\mathrm{LOAN} \in h\}}$   [3]
Information sources as constraint functions
• We can view each information source as defining a subset (or many subsets) of the event space (h,w)
• We can define any subset S of the event space, and any desired expectation K, and impose the constraint:
• The subset S can be specified by an index function, also called selector function, fs :
$\sum_{(h,w) \in S} P(h, w) = K$   [4]

$f_s(h, w) \stackrel{\mathrm{def}}{=} \begin{cases} 1 & \text{if } (h, w) \in S \\ 0 & \text{otherwise} \end{cases}$
Information sources as constraint functions
• so Equation [4] becomes:
$\sum_{(h,w)} P(h, w)\, f_s(h, w) = K$   [5]
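A minimal sketch of a selector function and the model expectation on the left-hand side of [5]; the toy event space and the trigger-style feature are illustrative assumptions, not the paper's actual feature set.

```python
def f_trigger_loan_bank(h, w):
    """Selector function: 1 if "loan" occurred in the history h and the next word w is "bank"."""
    return 1.0 if ("loan" in h and w == "bank") else 0.0

def model_expectation(p, f):
    """Left-hand side of [5]: sum over (h, w) of P(h, w) * f(h, w)."""
    return sum(prob * f(h, w) for (h, w), prob in p.items())

# Toy joint distribution over (history, word) pairs.
p = {
    (("the", "loan"), "bank"): 0.2,
    (("the", "loan"), "rate"): 0.1,
    (("a", "river"), "bank"): 0.3,
    (("a", "river"), "flows"): 0.4,
}
print(model_expectation(p, f_trigger_loan_bank))  # 0.2
```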
Maximum Entropy
• The ME Principle (Jaynes, 1957; Kullback, 1959) can be stated as follows:
– 1. Reformulate the different information sources as constraints to be satisfied by the target (combined) estimate
– 2. Among all probability distributions that satisfy these constraints, choose the one that has the highest entropy
Maximum Entropy
• Given a general event space {x}, to derive a combined probability function P(x), each constraint i is associated with a constraint function fi(x) and a desired expectation Ki
• The constraint is then written as :
$E_P f_i \stackrel{\mathrm{def}}{=} \sum_{x} P(x)\, f_i(x) = K_i$   [6]
LME
• Given features f1,…,fN specifying the properties we would like to match in the data, select a joint model p* from the set of possible probability distributions that maximizes the entropy
$p^{*} = \arg\max_{p}\; H(p) = -\sum_{x} p(x) \log p(x)$

subject to $\sum_{x} p(x)\, f_i(x) = \sum_{y} \tilde{p}(y) \sum_{z} p(z \mid y)\, f_i(y, z), \quad i = 1, \ldots, N$

Here $\tilde{p}(y)$ is the empirical distribution of the observed components of the training data, and $p(z \mid y)$ encodes the hidden dependency structure into the statistical model.
LME
• Intuitively, the constraints specify that we require the expectations of fi(X) in the joint model to match their empirical expectations on the incomplete data Y– fi(X) = fi(Y, Z)
• When the features only depend on the observable data Y, the LME is equivalent to ME
Regularized LME (RLME)
• The ME principle is subject to errors due to the empirical data, especially in a very sparse domain
– Add a penalty term to the entropy of the joint model
$\max_{p,\, a}\; H(p) - U(a) = -\sum_{x} p(x) \log p(x) - U(a)$   (1)

subject to $\sum_{x} p(x)\, f_i(x) = \sum_{y} \tilde{p}(y) \sum_{z} p(z \mid y)\, f_i(y, z) + a_i, \quad i = 1, \ldots, N$   (2)

Here $a = (a_1, \ldots, a_N)$, $a_i$ is the error for each constraint, and $U: \mathbb{R}^{N} \to \mathbb{R}$ is a convex function which has its minimum at 0.
A Training Algorithm
• Assume we have already selected the features f1,…fN from the training data
• Restrict p(x) to be an exponential model
$p(x) = \frac{1}{\Phi} \exp\Big(\sum_{i} \lambda_i f_i(x)\Big)$, where $\Phi$ is a constant that ensures $\sum_{x \in X} p(x) = 1$
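A minimal sketch of this restricted exponential (log-linear) form, assuming a small finite domain so that the normalization constant Phi can be computed by direct summation; the feature functions and weights are toy placeholders.

```python
import math
from itertools import product

def loglinear_prob(x, features, lambdas, domain):
    """p(x) = exp(sum_i lambda_i * f_i(x)) / Phi, with Phi summed over the finite domain."""
    score = lambda v: math.exp(sum(l * f(v) for l, f in zip(lambdas, features)))
    phi = sum(score(v) for v in domain)  # normalization constant Phi
    return score(x) / phi

# Toy example: two binary features over pairs (v0, v1).
domain = list(product([0, 1], repeat=2))
features = [lambda v: v[0], lambda v: v[0] * v[1]]
lambdas = [0.5, 1.0]
print(loglinear_prob((1, 1), features, lambdas, domain))  # ~0.55
```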
A Training Algorithm
• Intimately related to finding locally maximum a posteriori (MAP) solutions:
– Given a penalty function U over errors a, an associated prior $U^{*}$ on $\lambda$ can be obtained by setting $U^{*}$ to the convex conjugate of U
– e.g., given a quadratic penalty $U(a) = \sum_{i=1}^{N} \frac{a_i^{2}}{2\sigma_i^{2}}$, the convex conjugate $U^{*}(\lambda) = \sum_{i=1}^{N} \frac{\sigma_i^{2} \lambda_i^{2}}{2}$ can be determined by setting $a_i = \sigma_i^{2} \lambda_i$, which specifies a Gaussian prior on $\lambda$
A Training Algorithm
– Then, given a prior $U^{*}$, the standard MAP estimate maximizes the penalized log-likelihood

$R(\Lambda) = \sum_{y} \tilde{p}(y) \log p(y; \Lambda) - U^{*}(\Lambda)$

• Our key result is that locally maximizing $R(\Lambda)$ is equivalent to satisfying the feasibility constraints (2) of the RLME principle
A Training Algorithm
• THEOREM 1.– Under the log-linear assumption, locally maximizing a
posterior probability of log-linear models on incomplete data is equivalent to satisfying the feasibility constraints of the RLME principle. That is, the only distinction between MAP and RLME in log-linear models is that, among local maxima (feasible solutions), RLME selects the model with the maximum entropy, whereas MAP selects the model with the maximum posterior probability
A Training Algorithm
• R-EM-IS, which employs an EM algorithm as an outer loop, but uses a nested GIS/IIS algorithm to perform the internal M step
• Decompose the penalized log-likelihood function $R(\Lambda)$:

$R(\Lambda) = \sum_{y} \tilde{p}(y) \log p(y; \Lambda) - U^{*}(\Lambda) = Q(\Lambda, \Lambda') + H(\Lambda, \Lambda')$

where $Q(\Lambda, \Lambda') = \sum_{y} \tilde{p}(y) \sum_{z} p(z \mid y; \Lambda') \log p(x; \Lambda) - U^{*}(\Lambda)$
and $H(\Lambda, \Lambda') = -\sum_{y} \tilde{p}(y) \sum_{z} p(z \mid y; \Lambda') \log p(z \mid y; \Lambda)$

This is a standard decomposition used for deriving EM.
A Training Algorithm
• For log-linear models:
$Q(\Lambda, \Lambda') = \sum_{i=1}^{N} \lambda_i \sum_{y} \tilde{p}(y) \sum_{z} p(z \mid y; \Lambda')\, f_i(x) - \log \Phi_{\Lambda} - U^{*}(\Lambda)$   (3)

since $\log p(x; \Lambda) = \sum_{i} \lambda_i f_i(x) - \log \Phi_{\Lambda}$ with $\Phi_{\Lambda} = \sum_{x} \exp\big(\sum_{i} \lambda_i f_i(x)\big)$
A Training Algorithm
• LEMMA 1.
Maximizing $Q(\Lambda, \Lambda^{(j)})$ as a function of $\Lambda$ for fixed $\Lambda^{(j)}$ is equivalent to solving

$\max_{p,\, a}\; H(p) - U(a) = -\sum_{x} p(x) \log p(x) - U(a)$   (4)

subject to $\sum_{x} p(x)\, f_i(x) = \sum_{y} \tilde{p}(y) \sum_{z} p(z \mid y; \Lambda^{(j)})\, f_i(y, z) + a_i, \quad i = 1, \ldots, N$   (5)
R-EM-IS algorithm
E step: Compute $\sum_{y} \tilde{p}(y) \sum_{z} p(z \mid y; \Lambda^{(s \cdot K)})\, f_i(y, z)$ for $i = 1, \ldots, N$.

M step: Perform K parallel updates of the parameter values $\lambda_i$ for $i = 1, \ldots, N$ by iterative scaling (GIS or IIS) as follows:

$\lambda_i^{(s \cdot K + j)} = \lambda_i^{(s \cdot K + j - 1)} + \gamma_i^{(s \cdot K + j - 1)}, \quad j = 1, \ldots, K$   (6)

where $\gamma_i^{(s \cdot K + j - 1)}$ satisfies

$\sum_{x} p_{\Lambda^{(s \cdot K + j - 1)}}(x)\, f_i(x)\, \exp\big(\gamma_i^{(s \cdot K + j - 1)} f^{\#}(x)\big) = \sum_{y} \tilde{p}(y) \sum_{z} p(z \mid y; \Lambda^{(s \cdot K)})\, f_i(y, z)$   (7)
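A schematic Python sketch of the R-EM-IS loop structure: an outer EM loop that recomputes the constraint targets, and a nested GIS-style inner loop that moves the model expectations toward them. The callables empirical_target and model_expectation and the constant f_sharp are hypothetical stand-ins for the model-specific computations above, not code from the paper.

```python
import numpy as np

def r_em_is(lambdas, empirical_target, model_expectation, f_sharp,
            n_em_iters=5, n_is_iters=20):
    """Outer EM loop with a nested GIS-style inner loop (schematic sketch).

    lambdas:            initial parameter vector, one weight per feature
    empirical_target:   lambdas -> E-step targets sum_y ptilde(y) sum_z p(z|y, lambdas) f_i(y, z)
    model_expectation:  lambdas -> model expectations sum_x p(x; lambdas) f_i(x)
    f_sharp:            GIS constant, an upper bound on sum_i f_i(x)
    """
    lambdas = np.asarray(lambdas, dtype=float)
    for _ in range(n_em_iters):
        # E step: fix the hidden-variable posterior and compute the constraint targets.
        target = empirical_target(lambdas)
        for _ in range(n_is_iters):
            # M step (inner loop): GIS-style update toward the fixed targets.
            expected = model_expectation(lambdas)
            lambdas = lambdas + (1.0 / f_sharp) * (np.log(target) - np.log(expected))
    return lambdas
```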
R-EM-IS algorithm
• THEOREM 2.
The R-EM-IS algorithm monotonically increases the penalized log-likelihood function $R(\Lambda)$, and all limit points of any R-EM-IS sequence $\{\Lambda^{(s \cdot K + j)}\}$ belong to the set

$\{\Lambda \in \mathbb{R}^{N} : \partial R(\Lambda) / \partial \Lambda = 0\}$   (8)

where $R(\Lambda) = \sum_{y} \tilde{p}(y) \log p(y; \Lambda) - U^{*}(\Lambda)$. Therefore, R-EM-IS asymptotically yields feasible solutions to the RLME principle for log-linear models.
R-ME-EM-IS algorithm
Initialization: Randomly choose initial guesses for $\Lambda$.
R-EM-IS: Run R-EM-IS to convergence, to obtain a feasible $\Lambda^{*}$.
Entropy calculation: Calculate the regularized entropy of $p_{\Lambda^{*}}$.
Model selection: Repeat the above steps several times to produce a set of distinct feasible candidates. Choose the feasible candidate that achieves the highest regularized entropy.
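A minimal sketch of the restart-and-select structure of R-ME-EM-IS; the helpers run_r_em_is (runs R-EM-IS to convergence from an initial parameter vector) and regularized_entropy are assumed to be supplied by the caller, e.g. along the lines of the earlier sketch.

```python
import numpy as np

def r_me_em_is(n_restarts, n_features, run_r_em_is, regularized_entropy, seed=0):
    """Run R-EM-IS from several random initializations and keep the feasible
    solution with the highest regularized entropy (the model-selection step)."""
    rng = np.random.default_rng(seed)
    candidates = []
    for _ in range(n_restarts):
        init = rng.normal(scale=0.1, size=n_features)  # random initial guess for the parameters
        feasible = run_r_em_is(init)                   # run R-EM-IS to convergence
        candidates.append((regularized_entropy(feasible), feasible))
    return max(candidates, key=lambda c: c[0])[1]      # highest regularized entropy wins
```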
Combining N-gram and PLSA Models
Define the complete data as $x = (W_{-2}, W_{-1}, W_0, T_{-2}, T_{-1}, T_0, D)$, where $W_0, W_{-1}, W_{-2}$ are the current and two previous words, $T_0, T_{-1}, T_{-2}$ are the hidden "topic" values associated with these words, and $D$ is a document. Thus, $y = (W_{-2}, W_{-1}, W_0, D)$ is the observed data and $z = (T_{-2}, T_{-1}, T_0)$ is the unobserved data.
Tri-gram portion
$p(x: W_0 = w_i) = \sum_{d} \tilde{p}(d)\, \tilde{p}(w_i \mid d)$

$p(x: W_0 = w_i, W_{-1} = w_j) = \sum_{d} \tilde{p}(d)\, \tilde{p}(w_j, w_i \mid d)$

$p(x: W_0 = w_i, W_{-1} = w_j, W_{-2} = w_k) = \sum_{d} \tilde{p}(d)\, \tilde{p}(w_k, w_j, w_i \mid d)$   (9)
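A minimal sketch of how the document-weighted trigram targets on the right-hand side of (9) could be tallied; the toy documents and the uniform choice of $\tilde{p}(d)$ are illustrative assumptions.

```python
from collections import Counter

docs = [
    "the bank approved the loan".split(),
    "the river bank was muddy".split(),
]

trigram_target = Counter()
p_doc = 1.0 / len(docs)                       # ptilde(d): here simply uniform over documents
for doc in docs:
    counts = Counter(zip(doc, doc[1:], doc[2:]))
    n_trigrams = len(doc) - 2
    for tri, c in counts.items():
        # ptilde(d) * ptilde(w_{-2}, w_{-1}, w_0 | d)
        trigram_target[tri] += p_doc * c / n_trigrams

print(trigram_target[("the", "bank", "approved")])  # 0.5 * (1/3) ~= 0.167
```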
PLSA portion
Constraints between document and topic:

$p(x: T_l = t, D = d) = \tilde{p}(d) \sum_{w} \tilde{p}(w \mid d)\, p(t \mid w, d), \quad l = -2, -1, 0$

Constraints between topic and word:

$p(x: T_l = t, W_l = w) = \sum_{d} \tilde{p}(d)\, \tilde{p}(w \mid d)\, p(t \mid w, d), \quad l = -2, -1, 0$   (10)
Efficient Feature Expectation and Inference
• ME model: $p_{\mathrm{ME}}(h, w) = \frac{1}{Z} \exp\big(\sum_{i} \lambda_i f_i(h, w)\big)$, with normalization constant $Z = \sum_{h, w} \exp\big(\sum_{i} \lambda_i f_i(h, w)\big)$

• Sum-product algorithm: both the feature expectations and the normalization constant can be computed efficiently by pushing the sums inside the product of potentials, e.g.

$p(x: W_0 = w_i, W_{-1} = w_j, W_{-2} = w_k) = \frac{1}{\Phi} \sum_{t_0, t_{-1}, t_{-2}, d} e^{\lambda_{w_i}}\, e^{\lambda_{w_j w_i}}\, e^{\lambda_{w_k w_j w_i}}\, e^{\lambda_{t_0 w_i}}\, e^{\lambda_{t_{-1} w_j}}\, e^{\lambda_{t_{-2} w_k}}\, e^{\lambda_{d t_0}}\, e^{\lambda_{d t_{-1}}}\, e^{\lambda_{d t_{-2}}}$   (12)

$\Phi = \sum_{w_0, t_0, w_{-1}, t_{-1}, w_{-2}, t_{-2}, d} e^{\lambda_{w_0}}\, e^{\lambda_{w_{-1} w_0}}\, e^{\lambda_{w_{-2} w_{-1} w_0}}\, e^{\lambda_{t_0 w_0}}\, e^{\lambda_{t_{-1} w_{-1}}}\, e^{\lambda_{t_{-2} w_{-2}}}\, e^{\lambda_{d t_0}}\, e^{\lambda_{d t_{-1}}}\, e^{\lambda_{d t_{-2}}}$   (13)
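A minimal sketch of the distributive-law idea behind (12) and (13): the brute-force sum over every joint configuration is compared against a factored computation that pushes the topic and document sums inward. The tiny vocabulary/topic/document sizes and the random weights are toy assumptions; the factor structure only schematically mirrors the trigram + PLSA model above.

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(0)
W, T, D = 3, 2, 2                                  # toy vocabulary, topic, and document sizes
lam_w   = rng.normal(size=W)                       # unigram weights
lam_ww  = rng.normal(size=(W, W))                  # bigram weights  (w_prev, w)
lam_www = rng.normal(size=(W, W, W))               # trigram weights (w_prev2, w_prev, w)
lam_tw  = rng.normal(size=(T, W))                  # topic-word weights
lam_dt  = rng.normal(size=(D, T))                  # document-topic weights

def potential(w2, w1, w0, t2, t1, t0, d):
    """Unnormalized product of exponential potentials for one complete configuration."""
    return np.exp(lam_w[w0] + lam_ww[w1, w0] + lam_www[w2, w1, w0]
                  + lam_tw[t0, w0] + lam_tw[t1, w1] + lam_tw[t2, w2]
                  + lam_dt[d, t0] + lam_dt[d, t1] + lam_dt[d, t2])

# Brute force: sum the product over every joint configuration.
phi_brute = sum(potential(*cfg) for cfg in product(range(W), range(W), range(W),
                                                   range(T), range(T), range(T), range(D)))

# Distributive law: push the topic sums inside, then sum over documents.
phi_fact = 0.0
for w2, w1, w0 in product(range(W), repeat=3):
    word_part = np.exp(lam_w[w0] + lam_ww[w1, w0] + lam_www[w2, w1, w0])
    doc_sum = sum(np.prod([np.sum(np.exp(lam_tw[:, w] + lam_dt[d, :]))
                           for w in (w0, w1, w2)]) for d in range(D))
    phi_fact += word_part * doc_sum

print(np.isclose(phi_brute, phi_fact))  # True: both computations give the same Phi
```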
Semantic Smoothing
• Add a node C as a word cluster between each topic node and word node (W - C - T)
– |C| = 1: maximum smoothing
– |C| = |V|: no smoothing (|V| is the vocabulary size)
• Add a node S as a document cluster between the topic node and document node (T - S - D)
– |S| = 1: over-smoothed
– |S| = |D|: no smoothing (|D| is the number of documents)
Computation in Testing
$p(w_{l+1} \mid w_1 \ldots w_l) = \sum_{D} \sum_{T_0, T_{-1}, T_{-2}} p(w_{l+1}, T_0, T_{-1}, T_{-2}, D \mid w_1 \ldots w_l)$

i.e., the hidden topic values and the document variable are summed out to obtain the predictive word probability.
Experimental Evaluation
• Training Data:
– NAB: 87,000 documents (1987-1989), 38M words
– Vocabulary size |V| is 20,000, consisting of the most frequent words of the training data
• Testing Data:
– 325,000 words, 1989
• Evaluation: perplexity

$\mathrm{Perplexity} = 2^{\mathrm{Entropy}}, \qquad \mathrm{Entropy} = -\frac{1}{L} \sum_{i=1}^{L} \log_2 p(w_i \mid w_1 \ldots w_{i-1})$
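A minimal sketch of the perplexity computation just defined; the conditional probabilities are made-up numbers.

```python
import math

def perplexity(cond_probs):
    """Perplexity = 2 ** entropy, with entropy = -(1/L) * sum_i log2 p(w_i | w_1 ... w_{i-1})."""
    L = len(cond_probs)
    entropy = -sum(math.log2(p) for p in cond_probs) / L
    return 2.0 ** entropy

print(perplexity([0.01, 0.2, 0.05, 0.1]))  # ~17.8
```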
Experimental design
• Baseline: tri-gram with Good-Turing (GT) smoothing; perplexity is 105
• R-EM-IS procedure: 5 EM iterations, with 20 IIS iterations in the inner loop
• Feasible solution: initialized the parameters to zero and executed a single run of R-EM-IS
• RLME and MAP: use 20 random starting points for $\Lambda$
Simple tri-gram
• Since there are no hidden variables, MAP, RLME, and a single run of R-EM-IS all reduce to the same standard ME principle
• Perplexity score is 107
Tri-gram + PLSA (|T| = 125): perplexity results 95, 91, 89
Add word cluster: perplexity results 89, 93, 84
Add topic cluster: perplexity results 90, 87
Add word and topic clusters: perplexity result 82
Experiment summary
Extensions
• LSA:
• Perplexity = 97
$p(w_l \mid w_1 \ldots w_{l-1}) = \frac{p(w_l \mid w_{l-1} w_{l-2})\; p_{\mathrm{LSA}}(d_{l-1} \mid w_l)}{\sum_{w_i} p(w_i \mid w_{l-1} w_{l-2})\; p_{\mathrm{LSA}}(d_{l-1} \mid w_i)}$   (13)

$= \frac{p(w_l \mid w_{l-1} w_{l-2})\; p_{\mathrm{LSA}}(w_l \mid d_{l-1}) / P(w_l)}{\sum_{w_i} p(w_i \mid w_{l-1} w_{l-2})\; p_{\mathrm{LSA}}(w_i \mid d_{l-1}) / P(w_i)}$   (14)
Extensions
• Raising the LSA portion's contribution to some power and renormalizing
• Perplexity = 82, equal to the best results using RLME
$p(w_l \mid w_1 \ldots w_{l-1}) = \frac{p(w_l \mid w_{l-1} w_{l-2})\; [p_{\mathrm{LSA}}(d_{l-1} \mid w_l)]^{0.7}}{\sum_{w_i} p(w_i \mid w_{l-1} w_{l-2})\; [p_{\mathrm{LSA}}(d_{l-1} \mid w_i)]^{0.7}}$   (15)
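A minimal sketch of the power-and-renormalize combination in (15); the probabilities are toy numbers and gamma plays the role of the 0.7 exponent applied to the LSA term.

```python
def combine_trigram_lsa(p_tri, p_lsa, gamma=0.7):
    """Scale each trigram probability by the LSA term raised to a power, then renormalize.

    p_tri: dict word -> trigram probability p(w | w_{l-1}, w_{l-2})
    p_lsa: dict word -> LSA-based probability for the same candidate words
    """
    scores = {w: p_tri[w] * (p_lsa[w] ** gamma) for w in p_tri}
    z = sum(scores.values())                       # renormalization constant
    return {w: s / z for w, s in scores.items()}

p = combine_trigram_lsa({"bank": 0.4, "loan": 0.35, "river": 0.25},
                        {"bank": 0.5, "loan": 0.4, "river": 0.1})
print(p["bank"])  # ~0.51, boosted relative to the plain trigram estimate of 0.4
```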