Combining Statistical Language Models via the Latent Maximum
Entropy Principle
Shaojun Wang, Dale Schuurmans, Fuchun Peng, Yunxin Zhao
Abstract
• Simultaneously incorporate various aspects of natural language: local word interaction, syntactic structure, and semantic document information
• Latent maximum entropy (LME) principle, which allows relationships over hidden features to be effectively captured in a unified model
– Local lexical models (N-gram models)
– Global document-level semantic models (PLSA)
Introduction
• There are various kinds of language models that can be used to capture different aspects of natural language regularity:
– Markov chain (N-gram) models effectively capture local lexical regularities in text
– Smoothed N-grams handle the estimation of rare events
– Increasing the order of an N-gram to capture longer-range dependencies in natural language runs into the curse of dimensionality (Rosenfeld)
Introduction
– Structural language model effectively exploits relevant syntactic regularities to improve the perplexity score of N-gram models
– Semantic language model exploits document-level semantic regularities to achieve similar improvements
• Although each of these language models outperforms simple N-grams, they each only capture specific linguistic phenomena
Introduction
• Several techniques for combining language models:
– Linear interpolation: each individual model is trained separately and then combined by a weighted linear combination (see the sketch below)
– Maximum entropy: it models distributions over explicitly observed features
• However, there is much hidden semantic and syntactic information in natural language
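As a rough illustration of the interpolation approach, here is a minimal Python sketch; the mixture weight lam and the two component probabilities are hypothetical placeholders, not values from the paper.

```python
def interpolate(p_ngram, p_plsa, lam=0.6):
    """Combine two component language-model probabilities by a weighted linear mixture.

    p_ngram, p_plsa: probabilities each separately trained component assigns to the next word.
    lam: interpolation weight, normally tuned on held-out data (e.g. by EM).
    """
    return lam * p_ngram + (1.0 - lam) * p_plsa

# Example: a word the trigram likes but the semantic component finds off-topic.
print(interpolate(p_ngram=0.012, p_plsa=0.001))  # 0.0076
```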
Introduction
• Latent maximum entropy (LME) principle extends ME to incorporate latent variables
• Let X denote the complete data, Y be the observed incomplete data, and Z be the missing data: X = (Y, Z)
• The goal of ME is to find a probability model that matches certain constraints in the observed data while otherwise maximizing entropy
Maximum Entropy
• Do not construct separate models; instead, build a single, combined model which attempts to capture all the information provided by the various knowledge sources
• The intersection of all the constraints, if not empty, contains a (typically infinite) set of probability functions, which are all consistent with the knowledge sources
• The second step in the ME approach is to choose, from among the functions in that set, that function which has the highest entropy (i.e. the “flattest” function)
An example
• Assume we wish to estimate P("BANK"|h); one estimate may be provided by a conventional bigram
Features
An example
• Consider one such equivalence class, say, the one where the history ends in “THE”. The bigram assigns the same probability estimate to all events in that class
$P(\mathrm{BANK} \mid \mathrm{THE}) \stackrel{\mathrm{def}}{=} K_{\{\mathrm{THE,BANK}\}} = \frac{C(\mathrm{THE,\,BANK})}{C(\mathrm{THE})}$   [1]

(Example histories ending in "THE": "... where is the", "... does the", "... the".)
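A minimal sketch of how the constraint target $K_{\{\mathrm{THE,BANK}\}}$ in [1] could be tallied from counts; the toy corpus is invented purely for illustration.

```python
from collections import Counter

corpus = "the bank said the loan was approved by the bank manager".split()

bigrams = Counter(zip(corpus, corpus[1:]))
unigrams = Counter(corpus)

# Empirical bigram estimate K_{THE,BANK} = C(THE, BANK) / C(THE)
k_the_bank = bigrams[("the", "bank")] / unigrams["the"]
print(k_the_bank)  # 2 occurrences of "the bank" out of 3 occurrences of "the" -> 0.666...
```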
An example
• Another estimate may be provided by a particular trigger pair, say (LOAN, BANK)
– Assume we want to capture the dependency of "BANK" on whether or not "LOAN" occurred before it in the same document. Thus a different partition of the event space will be added, as in Table IV
– Similarly to the bigram case, consider now one such equivalence class, say, the one where “LOAN” did occur in the history. The trigger component assigns the same probability estimate to all events in that class
$P(\mathrm{BANK} \mid \mathrm{LOAN} \in h) \stackrel{\mathrm{def}}{=} K_{\{\mathrm{BANK},\,\mathrm{LOAN} \in h\}} = \frac{C(\mathrm{BANK},\,\mathrm{LOAN} \in h)}{C(\mathrm{LOAN} \in h)}$

(Example histories containing "LOAN": "... loan ... the", "... loan ... of".)
Features
An example
• Consider the bigram: under ME, we no longer insist that P(BANK|h) always have the same value $K_{\{\mathrm{THE,BANK}\}}$ whenever the history ends in "THE"
• Rather, we only require that $P_{\mathrm{COMBINED}}(\mathrm{BANK} \mid h)$ be equal to $K_{\{\mathrm{THE,BANK}\}}$ on average in the training data
Equation [1] is replaced by

$E_{h \ \text{ends in "THE"}}\; P_{\mathrm{COMBINED}}(\mathrm{BANK} \mid h) = K_{\{\mathrm{THE,BANK}\}}$   [2]

where E stands for an expectation
An example
• Similarly, we require that $P_{\mathrm{COMBINED}}(\mathrm{BANK} \mid h)$ be equal to $K_{\{\mathrm{BANK},\,\mathrm{LOAN} \in h\}}$ on average over those histories that contain occurrences of "LOAN":

$E_{\text{"LOAN"} \in h}\; P_{\mathrm{COMBINED}}(\mathrm{BANK} \mid h) = K_{\{\mathrm{BANK},\,\mathrm{LOAN} \in h\}}$   [3]
Information sources as constraint functions
• We can view each information source as defining a subset (or many subsets) of the event space (h,w)
• We can define any subset S of the event space, and any desired expectation K, and impose the constraint:
• The subset S can be specified by an index function, also called selector function, fs :
$\sum_{(h,w) \in S} P(h, w) = K$   [4]

$f_s(h, w) \stackrel{\mathrm{def}}{=} \begin{cases} 1 & \text{if } (h, w) \in S \\ 0 & \text{otherwise} \end{cases}$
Information sources as constraint functions
• so Equation [4] becomes:
$\sum_{(h,w)} P(h, w)\, f_s(h, w) = K$   [5]
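A minimal sketch of a selector function and the model expectation on the left-hand side of [5]; the toy event space and the trigger-style feature are illustrative assumptions, not the paper's actual feature set.

```python
def f_trigger_loan_bank(h, w):
    """Selector function: 1 if "loan" occurred in the history h and the next word w is "bank"."""
    return 1.0 if ("loan" in h and w == "bank") else 0.0

def model_expectation(p, f):
    """Left-hand side of [5]: sum over (h, w) of P(h, w) * f(h, w)."""
    return sum(prob * f(h, w) for (h, w), prob in p.items())

# Toy joint distribution over (history, word) pairs.
p = {
    (("the", "loan"), "bank"): 0.2,
    (("the", "loan"), "rate"): 0.1,
    (("a", "river"), "bank"): 0.3,
    (("a", "river"), "flows"): 0.4,
}
print(model_expectation(p, f_trigger_loan_bank))  # 0.2
```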
Maximum Entropy
• The ME Principle (Jaynes, 1957; Kullback, 1959) can be stated as follows:
– 1. Reformulate the different information sources as constraints to be satisfied by the target (combined) estimate
– 2. Among all probability distributions that satisfy these constraints, choose the one that has the highest entropy
Maximum Entropy
• Given a general event space {x}, to derive a combined probability function P(x), each constraint i is associated with a constraint function fi(x) and a desired expectation Ki
• The constraint is then written as :
$E_P f_i \stackrel{\mathrm{def}}{=} \sum_{x} P(x)\, f_i(x) = K_i$   [6]
LME
• Given features f1,…,fN specifying the properties we would like to match in the data, select a joint model p* from the set of possible probability distributions that maximizes the entropy
$p^{*} = \arg\max_{p}\; H(p) = -\sum_{x} p(x) \log p(x)$

subject to $\sum_{x} p(x)\, f_i(x) = \sum_{y} \tilde{p}(y) \sum_{z} p(z \mid y)\, f_i(y, z), \quad i = 1, \ldots, N$

Here $\tilde{p}(y)$ is the empirical distribution of the observed components of the training data, and $p(z \mid y)$ encodes the hidden dependency structure into the statistical model.
LME
• Intuitively, the constraints specify that we require the expectations of fi(X) in the joint model to match their empirical expectations on the incomplete data Y– fi(X) = fi(Y, Z)
• When the features only depend on the observable data Y, the LME is equivalent to ME
Regularized LME (RLME)
• The ME principle is subject to errors due to the empirical data, especially in a very sparse domain
– Add a penalty term to the entropy of the joint model
$\max_{p,\, a}\; H(p) - U(a) = -\sum_{x} p(x) \log p(x) - U(a)$   (1)

subject to $\sum_{x} p(x)\, f_i(x) = \sum_{y} \tilde{p}(y) \sum_{z} p(z \mid y)\, f_i(y, z) + a_i, \quad i = 1, \ldots, N$   (2)

Here $a = (a_1, \ldots, a_N)$, $a_i$ is the error for each constraint, and $U: \mathbb{R}^{N} \to \mathbb{R}$ is a convex function which has its minimum at 0.
A Training Algorithm
• Assume we have already selected the features f1,…fN from the training data
• Restrict p(x) to be an exponential model
$p(x) = \frac{1}{\Phi} \exp\Big(\sum_{i} \lambda_i f_i(x)\Big)$, where $\Phi$ is a constant that ensures $\sum_{x \in X} p(x) = 1$
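A minimal sketch of this restricted exponential (log-linear) form, assuming a small finite domain so that the normalization constant Phi can be computed by direct summation; the feature functions and weights are toy placeholders.

```python
import math
from itertools import product

def loglinear_prob(x, features, lambdas, domain):
    """p(x) = exp(sum_i lambda_i * f_i(x)) / Phi, with Phi summed over the finite domain."""
    score = lambda v: math.exp(sum(l * f(v) for l, f in zip(lambdas, features)))
    phi = sum(score(v) for v in domain)  # normalization constant Phi
    return score(x) / phi

# Toy example: two binary features over pairs (v0, v1).
domain = list(product([0, 1], repeat=2))
features = [lambda v: v[0], lambda v: v[0] * v[1]]
lambdas = [0.5, 1.0]
print(loglinear_prob((1, 1), features, lambdas, domain))  # ~0.55
```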
A Training Algorithm
• Intimately related to finding locally maximum a posteriori (MAP) solutions:
– Given a penalty function U over errors a, an associated prior $U^{*}$ on $\lambda$ can be obtained by setting $U^{*}$ to the convex conjugate of U
– e.g., given a quadratic penalty $U(a) = \sum_{i=1}^{N} \frac{a_i^{2}}{2\sigma_i^{2}}$, the convex conjugate $U^{*}(\lambda) = \sum_{i=1}^{N} \frac{\sigma_i^{2} \lambda_i^{2}}{2}$ can be determined by setting $a_i = \sigma_i^{2} \lambda_i$, which specifies a Gaussian prior on $\lambda$
A Training Algorithm
– Then, given a prior $U^{*}$, the standard MAP estimate maximizes the penalized log-likelihood

$R(\Lambda) = \sum_{y} \tilde{p}(y) \log p(y; \Lambda) - U^{*}(\Lambda)$

• Our key result is that locally maximizing $R(\Lambda)$ is equivalent to satisfying the feasibility constraints (2) of the RLME principle
A Training Algorithm
• THEOREM 1.– Under the log-linear assumption, locally maximizing a
posterior probability of log-linear models on incomplete data is equivalent to satisfying the feasibility constraints of the RLME principle. That is, the only distinction between MAP and RLME in log-linear models is that, among local maxima (feasible solutions), RLME selects the model with the maximum entropy, whereas MAP selects the model with the maximum posterior probability
A Training Algorithm
• R-EM-IS, which employs an EM algorithm as an outer loop, but uses a nested GIS/IIS algorithm to perform the internal M step
• Decompose the penalized log-likelihood function $R(\Lambda)$:

$R(\Lambda) = \sum_{y} \tilde{p}(y) \log p(y; \Lambda) - U^{*}(\Lambda) = Q(\Lambda, \Lambda') + H(\Lambda, \Lambda')$

where $Q(\Lambda, \Lambda') = \sum_{y} \tilde{p}(y) \sum_{z} p(z \mid y; \Lambda') \log p(x; \Lambda) - U^{*}(\Lambda)$
and $H(\Lambda, \Lambda') = -\sum_{y} \tilde{p}(y) \sum_{z} p(z \mid y; \Lambda') \log p(z \mid y; \Lambda)$

This is a standard decomposition used for deriving EM.
A Training Algorithm
• For log-linear models:
$Q(\Lambda, \Lambda') = \sum_{i=1}^{N} \lambda_i \sum_{y} \tilde{p}(y) \sum_{z} p(z \mid y; \Lambda')\, f_i(x) - \log \Phi_{\Lambda} - U^{*}(\Lambda)$   (3)

since $\log p(x; \Lambda) = \sum_{i} \lambda_i f_i(x) - \log \Phi_{\Lambda}$ with $\Phi_{\Lambda} = \sum_{x} \exp\big(\sum_{i} \lambda_i f_i(x)\big)$
A Training Algorithm
• LEMMA 1.
Maximizing $Q(\Lambda, \Lambda^{(j)})$ as a function of $\Lambda$ for fixed $\Lambda^{(j)}$ is equivalent to solving

$\max_{p,\, a}\; H(p) - U(a) = -\sum_{x} p(x) \log p(x) - U(a)$   (4)

subject to $\sum_{x} p(x)\, f_i(x) = \sum_{y} \tilde{p}(y) \sum_{z} p(z \mid y; \Lambda^{(j)})\, f_i(y, z) + a_i, \quad i = 1, \ldots, N$   (5)
R-EM-IS algorithm
E step: Compute $\sum_{y} \tilde{p}(y) \sum_{z} p(z \mid y; \Lambda^{(s \cdot K)})\, f_i(y, z)$ for $i = 1, \ldots, N$.

M step: Perform K parallel updates of the parameter values $\lambda_i$ for $i = 1, \ldots, N$ by iterative scaling (GIS or IIS) as follows:

$\lambda_i^{(s \cdot K + j)} = \lambda_i^{(s \cdot K + j - 1)} + \gamma_i^{(s \cdot K + j - 1)}, \quad j = 1, \ldots, K$   (6)

where $\gamma_i^{(s \cdot K + j - 1)}$ satisfies

$\sum_{x} p_{\Lambda^{(s \cdot K + j - 1)}}(x)\, f_i(x)\, \exp\big(\gamma_i^{(s \cdot K + j - 1)} f^{\#}(x)\big) = \sum_{y} \tilde{p}(y) \sum_{z} p(z \mid y; \Lambda^{(s \cdot K)})\, f_i(y, z)$   (7)
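A schematic Python sketch of the R-EM-IS loop structure: an outer EM loop that recomputes the constraint targets, and a nested GIS-style inner loop that moves the model expectations toward them. The callables empirical_target and model_expectation and the constant f_sharp are hypothetical stand-ins for the model-specific computations above, not code from the paper.

```python
import numpy as np

def r_em_is(lambdas, empirical_target, model_expectation, f_sharp,
            n_em_iters=5, n_is_iters=20):
    """Outer EM loop with a nested GIS-style inner loop (schematic sketch).

    lambdas:            initial parameter vector, one weight per feature
    empirical_target:   lambdas -> E-step targets sum_y ptilde(y) sum_z p(z|y, lambdas) f_i(y, z)
    model_expectation:  lambdas -> model expectations sum_x p(x; lambdas) f_i(x)
    f_sharp:            GIS constant, an upper bound on sum_i f_i(x)
    """
    lambdas = np.asarray(lambdas, dtype=float)
    for _ in range(n_em_iters):
        # E step: fix the hidden-variable posterior and compute the constraint targets.
        target = empirical_target(lambdas)
        for _ in range(n_is_iters):
            # M step (inner loop): GIS-style update toward the fixed targets.
            expected = model_expectation(lambdas)
            lambdas = lambdas + (1.0 / f_sharp) * (np.log(target) - np.log(expected))
    return lambdas
```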
R-EM-IS algorithm
• THEOREM 2.
The R-EM-IS algorithm monotonically increases the penalized log-likelihood function $R(\Lambda)$, and all limit points of any R-EM-IS sequence $\{\Lambda^{(s \cdot K + j)}\}$ belong to the set

$\{\Lambda \in \mathbb{R}^{N} : \partial R(\Lambda) / \partial \Lambda = 0\}$   (8)

where $R(\Lambda) = \sum_{y} \tilde{p}(y) \log p(y; \Lambda) - U^{*}(\Lambda)$. Therefore, R-EM-IS asymptotically yields feasible solutions to the RLME principle for log-linear models.
R-ME-EM-IS algorithm
Initialization: Randomly choose initial guesses for $\Lambda$.
R-EM-IS: Run R-EM-IS to convergence, to obtain a feasible $\Lambda^{*}$.
Entropy calculation: Calculate the regularized entropy of $p_{\Lambda^{*}}$.
Model selection: Repeat the above steps several times to produce a set of distinct feasible candidates. Choose the feasible candidate that achieves the highest regularized entropy.
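A minimal sketch of the restart-and-select structure of R-ME-EM-IS; the helpers run_r_em_is (runs R-EM-IS to convergence from an initial parameter vector) and regularized_entropy are assumed to be supplied by the caller, e.g. along the lines of the earlier sketch.

```python
import numpy as np

def r_me_em_is(n_restarts, n_features, run_r_em_is, regularized_entropy, seed=0):
    """Run R-EM-IS from several random initializations and keep the feasible
    solution with the highest regularized entropy (the model-selection step)."""
    rng = np.random.default_rng(seed)
    candidates = []
    for _ in range(n_restarts):
        init = rng.normal(scale=0.1, size=n_features)  # random initial guess for the parameters
        feasible = run_r_em_is(init)                   # run R-EM-IS to convergence
        candidates.append((regularized_entropy(feasible), feasible))
    return max(candidates, key=lambda c: c[0])[1]      # highest regularized entropy wins
```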
Combining N-gram and PLSA Models
Define the complete data as $x = (W_{-2}, W_{-1}, W_0, T_{-2}, T_{-1}, T_0, D)$, where $W_0, W_{-1}, W_{-2}$ are the current and two previous words, $T_0, T_{-1}, T_{-2}$ are the hidden "topic" values associated with these words, and $D$ is a document. Thus, $y = (W_{-2}, W_{-1}, W_0, D)$ is the observed data and $z = (T_{-2}, T_{-1}, T_0)$ is the unobserved data.
Tri-gram portion
$p(x: W_0 = w_i) = \sum_{d} \tilde{p}(d)\, \tilde{p}(w_i \mid d)$

$p(x: W_0 = w_i, W_{-1} = w_j) = \sum_{d} \tilde{p}(d)\, \tilde{p}(w_j, w_i \mid d)$

$p(x: W_0 = w_i, W_{-1} = w_j, W_{-2} = w_k) = \sum_{d} \tilde{p}(d)\, \tilde{p}(w_k, w_j, w_i \mid d)$   (9)
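A minimal sketch of how the document-weighted trigram targets on the right-hand side of (9) could be tallied; the toy documents and the uniform choice of $\tilde{p}(d)$ are illustrative assumptions.

```python
from collections import Counter

docs = [
    "the bank approved the loan".split(),
    "the river bank was muddy".split(),
]

trigram_target = Counter()
p_doc = 1.0 / len(docs)                       # ptilde(d): here simply uniform over documents
for doc in docs:
    counts = Counter(zip(doc, doc[1:], doc[2:]))
    n_trigrams = len(doc) - 2
    for tri, c in counts.items():
        # ptilde(d) * ptilde(w_{-2}, w_{-1}, w_0 | d)
        trigram_target[tri] += p_doc * c / n_trigrams

print(trigram_target[("the", "bank", "approved")])  # 0.5 * (1/3) ~= 0.167
```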
PLSA portion
Constraints between document and topic:

$p(x: T_l = t, D = d) = \tilde{p}(d) \sum_{w} \tilde{p}(w \mid d)\, p(t \mid w, d), \quad l = -2, -1, 0$

Constraints between topic and word:

$p(x: T_l = t, W_l = w) = \sum_{d} \tilde{p}(d)\, \tilde{p}(w \mid d)\, p(t \mid w, d), \quad l = -2, -1, 0$   (10)
Efficient Feature Expectation and Inference
• ME model: $p_{\mathrm{ME}}(h, w) = \frac{1}{Z} \exp\big(\sum_{i} \lambda_i f_i(h, w)\big)$, with normalization constant $Z = \sum_{h, w} \exp\big(\sum_{i} \lambda_i f_i(h, w)\big)$

• Sum-product algorithm: both the feature expectations and the normalization constant can be computed efficiently by pushing the sums inside the product of potentials, e.g.

$p(x: W_0 = w_i, W_{-1} = w_j, W_{-2} = w_k) = \frac{1}{\Phi} \sum_{t_0, t_{-1}, t_{-2}, d} e^{\lambda_{w_i}}\, e^{\lambda_{w_j w_i}}\, e^{\lambda_{w_k w_j w_i}}\, e^{\lambda_{t_0 w_i}}\, e^{\lambda_{t_{-1} w_j}}\, e^{\lambda_{t_{-2} w_k}}\, e^{\lambda_{d t_0}}\, e^{\lambda_{d t_{-1}}}\, e^{\lambda_{d t_{-2}}}$   (12)

$\Phi = \sum_{w_0, t_0, w_{-1}, t_{-1}, w_{-2}, t_{-2}, d} e^{\lambda_{w_0}}\, e^{\lambda_{w_{-1} w_0}}\, e^{\lambda_{w_{-2} w_{-1} w_0}}\, e^{\lambda_{t_0 w_0}}\, e^{\lambda_{t_{-1} w_{-1}}}\, e^{\lambda_{t_{-2} w_{-2}}}\, e^{\lambda_{d t_0}}\, e^{\lambda_{d t_{-1}}}\, e^{\lambda_{d t_{-2}}}$   (13)
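A minimal sketch of the distributive-law idea behind (12) and (13): the brute-force sum over every joint configuration is compared against a factored computation that pushes the topic and document sums inward. The tiny vocabulary/topic/document sizes and the random weights are toy assumptions; the factor structure only schematically mirrors the trigram + PLSA model above.

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(0)
W, T, D = 3, 2, 2                                  # toy vocabulary, topic, and document sizes
lam_w   = rng.normal(size=W)                       # unigram weights
lam_ww  = rng.normal(size=(W, W))                  # bigram weights  (w_prev, w)
lam_www = rng.normal(size=(W, W, W))               # trigram weights (w_prev2, w_prev, w)
lam_tw  = rng.normal(size=(T, W))                  # topic-word weights
lam_dt  = rng.normal(size=(D, T))                  # document-topic weights

def potential(w2, w1, w0, t2, t1, t0, d):
    """Unnormalized product of exponential potentials for one complete configuration."""
    return np.exp(lam_w[w0] + lam_ww[w1, w0] + lam_www[w2, w1, w0]
                  + lam_tw[t0, w0] + lam_tw[t1, w1] + lam_tw[t2, w2]
                  + lam_dt[d, t0] + lam_dt[d, t1] + lam_dt[d, t2])

# Brute force: sum the product over every joint configuration.
phi_brute = sum(potential(*cfg) for cfg in product(range(W), range(W), range(W),
                                                   range(T), range(T), range(T), range(D)))

# Distributive law: push the topic sums inside, then sum over documents.
phi_fact = 0.0
for w2, w1, w0 in product(range(W), repeat=3):
    word_part = np.exp(lam_w[w0] + lam_ww[w1, w0] + lam_www[w2, w1, w0])
    doc_sum = sum(np.prod([np.sum(np.exp(lam_tw[:, w] + lam_dt[d, :]))
                           for w in (w0, w1, w2)]) for d in range(D))
    phi_fact += word_part * doc_sum

print(np.isclose(phi_brute, phi_fact))  # True: both computations give the same Phi
```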
Semantic Smoothing
• Add a node C as a word cluster between each topic node and word node (W - C - T)
– |C| = 1: maximum smoothing
– |C| = |V|: no smoothing (|V| is the vocabulary size)
• Add a node S as a document cluster between the topic node and document node (T - S - D)
– |S| = 1: over-smoothed
– |S| = |D|: no smoothing (|D| is the number of documents)
Computation in Testing
$p(w_{l+1} \mid w_1 \ldots w_l) = \sum_{D} \sum_{T_0, T_{-1}, T_{-2}} p(w_{l+1}, T_0, T_{-1}, T_{-2}, D \mid w_1 \ldots w_l)$

i.e., the hidden topic values and the document variable are summed out to obtain the predictive word probability.
Experimental Evaluation
• Training Data:
– NAB: 87,000 documents (1987-1989), 38M words
– Vocabulary size |V| is 20,000, consisting of the most frequent words of the training data
• Testing Data:
– 325,000 words, 1989
• Evaluation: perplexity

$\mathrm{Perplexity} = 2^{\mathrm{Entropy}}, \qquad \mathrm{Entropy} = -\frac{1}{L} \sum_{i=1}^{L} \log_2 p(w_i \mid w_1 \ldots w_{i-1})$
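A minimal sketch of the perplexity computation just defined; the conditional probabilities are made-up numbers.

```python
import math

def perplexity(cond_probs):
    """Perplexity = 2 ** entropy, with entropy = -(1/L) * sum_i log2 p(w_i | w_1 ... w_{i-1})."""
    L = len(cond_probs)
    entropy = -sum(math.log2(p) for p in cond_probs) / L
    return 2.0 ** entropy

print(perplexity([0.01, 0.2, 0.05, 0.1]))  # ~17.8
```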
Experimental design
• Baseline: tri-gram with Good-Turing (GT) smoothing; perplexity is 105
• R-EM-IS procedure: 5 EM iterations, with 20 IIS iterations in the inner loop
• Feasible solution: initialized the parameters to zero and executed a single run of R-EM-IS
• RLME and MAP: use 20 random starting points for $\Lambda$
Simple tri-gram
• Since there are no hidden variables, MAP, RLME, and a single run of R-EM-IS all reduce to the same standard ME principle
• Perplexity score is 107
Tri-gram + PLSA (|T| = 125): perplexity results 95, 91, 89
Add word cluster: perplexity results 89, 93, 84
Add topic cluster: perplexity results 90, 87
Add word and topic clusters: perplexity result 82
Experiment summary
Extensions
• LSA:
• Perplexity = 97
$p(w_l \mid w_1 \ldots w_{l-1}) = \frac{p(w_l \mid w_{l-1} w_{l-2})\; p_{\mathrm{LSA}}(d_{l-1} \mid w_l)}{\sum_{w_i} p(w_i \mid w_{l-1} w_{l-2})\; p_{\mathrm{LSA}}(d_{l-1} \mid w_i)}$   (13)

$= \frac{p(w_l \mid w_{l-1} w_{l-2})\; p_{\mathrm{LSA}}(w_l \mid d_{l-1}) / P(w_l)}{\sum_{w_i} p(w_i \mid w_{l-1} w_{l-2})\; p_{\mathrm{LSA}}(w_i \mid d_{l-1}) / P(w_i)}$   (14)
Extensions
• Raising the LSA portion's contribution to some power and renormalizing
• Perplexity = 82, equal to the best results using RLME
$p(w_l \mid w_1 \ldots w_{l-1}) = \frac{p(w_l \mid w_{l-1} w_{l-2})\; [p_{\mathrm{LSA}}(d_{l-1} \mid w_l)]^{0.7}}{\sum_{w_i} p(w_i \mid w_{l-1} w_{l-2})\; [p_{\mathrm{LSA}}(d_{l-1} \mid w_i)]^{0.7}}$   (15)
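A minimal sketch of the power-and-renormalize combination in (15); the probabilities are toy numbers and gamma plays the role of the 0.7 exponent applied to the LSA term.

```python
def combine_trigram_lsa(p_tri, p_lsa, gamma=0.7):
    """Scale each trigram probability by the LSA term raised to a power, then renormalize.

    p_tri: dict word -> trigram probability p(w | w_{l-1}, w_{l-2})
    p_lsa: dict word -> LSA-based probability for the same candidate words
    """
    scores = {w: p_tri[w] * (p_lsa[w] ** gamma) for w in p_tri}
    z = sum(scores.values())                       # renormalization constant
    return {w: s / z for w, s in scores.items()}

p = combine_trigram_lsa({"bank": 0.4, "loan": 0.35, "river": 0.25},
                        {"bank": 0.5, "loan": 0.4, "river": 0.1})
print(p["bank"])  # ~0.51, boosted relative to the plain trigram estimate of 0.4
```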