Combining Statistical Language Models via the Latent Maximum Entropy Principle
Shaojun Wang, Dale Schuurmans, Fuchun Peng, Yunxin Zhao


Page 1

Combining Statistical Language Models via the Latent Maximum Entropy Principle

Shaojun Wang, Dale Schuurmans, Fuchun Peng, Yunxin Zhao

Page 2

Abstract

• Simultaneously incorporate various aspects of natural language
– Local word interaction, syntactic structure, and semantic document information

• Latent maximum entropy (LME) principle
– Allows relationships over hidden features to be effectively captured in a unified model
– Local lexical models (N-gram models)
– Global document-level semantic models (PLSA)

Page 3

Introduction

• There are various kinds of language models that can be used to capture different aspects of natural language regularity
– Markov chain (N-gram) models effectively capture local lexical regularities in text
– Smoothed N-grams: estimating rare events
– Increasing the order of an N-gram to capture longer-range dependencies in natural language leads to the curse of dimensionality (Rosenfeld)

Page 4

Introduction

– Structural language models effectively exploit relevant syntactic regularities to improve the perplexity of N-gram models
– Semantic language models exploit document-level semantic regularities to achieve similar improvements

• Although each of these language models outperforms simple N-grams, each only captures specific linguistic phenomena

Page 5

Introduction

• Several techniques exist for combining language models:
– Linear interpolation: each individual model is trained separately and then combined by a weighted linear combination
– Maximum entropy: models distributions over explicitly observed features

• However, much semantic and syntactic information in natural language is hidden

Page 6

Introduction

• Latent maximum entropy (LME) principle extends ME to incorporate latent variables

• Let X denote the complete data, Y be the observed incomplete data, and Z be the missing data: X = (Y, Z)

• The goal of ME is to find a probability model that matches certain constraints in the observed data while otherwise maximizing entropy

Page 7

Maximum Entropy

• Do not construct separate models; build a single, combined model which attempts to capture all the information provided by the various knowledge sources

• The intersection of all the constraints, if not empty, contains a (typically infinite) set of probability functions, which are all consistent with the knowledge sources

• The second step in the ME approach is to choose, from among the functions in that set, the function which has the highest entropy (i.e. the “flattest” function)

Page 8

An example

• Assume we wish to estimate P(“BANK” | h); one estimate may be provided by a conventional bigram

(Features table: the event space partitioned by the last word of the history)

Page 9

An example

• Consider one such equivalence class, say, the one where the history ends in “THE”. The bigram assigns the same probability estimate to all events in that class

P_{BIGRAM}(BANK \mid THE) \stackrel{def}{=} K_{\{THE,\,BANK\}} = \frac{C(THE,\,BANK)}{C(THE)} \qquad [1]

(Example histories in this equivalence class: “… is the”, “… does the”, “… the”)
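As a side note, here is a minimal sketch (toy corpus and counts invented here, not taken from the paper) of how such a bigram target K_{THE,BANK} would be computed from data:

```python
from collections import Counter

# Toy corpus; in the paper the counts come from the NAB training text.
corpus = "the bank raised the rate the bank of the river".split()

bigram_counts = Counter(zip(corpus, corpus[1:]))
unigram_counts = Counter(corpus)

# K_{THE,BANK} = C(THE, BANK) / C(THE): the conventional bigram estimate that
# the combined ME model will later be asked to match only on average.
K_the_bank = bigram_counts[("the", "bank")] / unigram_counts["the"]
print(K_the_bank)  # 0.5 on this toy corpus
```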

Page 10

An example

• Another estimate may be provided by a particular trigger pair, say (LOAN, BANK)
– Assume we want to capture the dependency of “BANK” on whether or not “LOAN” occurred before it in the same document. Thus a different partition of the event space will be added, as in Table IV
– Similarly to the bigram case, consider now one such equivalence class, say, the one where “LOAN” did occur in the history. The trigger component assigns the same probability estimate to all events in that class

Page 11

P_{LOAN \rightarrow BANK}(BANK \mid LOAN \in h) \stackrel{def}{=} K_{\{BANK,\,LOAN \in h\}} = \frac{C(BANK,\,LOAN \in h)}{C(LOAN \in h)}

(Features table: example histories containing “LOAN”, e.g. “… loan … the”, “… loan … of”)

Page 12

An example

• Consider the bigram. Under ME, we no longer insist that P(BANK | h) always have the same value K_{\{THE,\,BANK\}} whenever the history ends in “THE”

• Rather, we only require that P_{COMBINED}(BANK | h) be equal to K_{\{THE,\,BANK\}} on average in the training data. Equation [1] is replaced by

E_{h \text{ ends in “THE”}} \left[ P_{COMBINED}(BANK \mid h) \right] = K_{\{THE,\,BANK\}} \qquad [2]

where E stands for an expectation

Page 13

An example

• Similarly, we require that P_{COMBINED}(BANK | h) be equal to K_{\{BANK,\,LOAN \in h\}} on average over those histories that contain occurrences of “LOAN”

E_{\text{“LOAN”} \in h} \left[ P_{COMBINED}(BANK \mid h) \right] = K_{\{BANK,\,LOAN \in h\}} \qquad [3]

Page 14

Information sources as constraint functions

• We can view each information source as defining a subset (or many subsets) of the event space (h,w)

• We can define any subset S of the event space, and any desired expectation K, and impose the constraint:

• The subset S can be specified by an index function, also called selector function, fs :

\sum_{(h,w) \in S} P(h,w) = K \qquad [4]

f_S(h,w) \stackrel{def}{=} \begin{cases} 1 & \text{if } (h,w) \in S \\ 0 & \text{otherwise} \end{cases}

Page 15

Information sources as constraint functions

• so Equation [4] becomes:

\sum_{(h,w)} P(h,w)\, f_S(h,w) = K \qquad [5]
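A small illustrative sketch of selector functions and the left-hand side of Equation [5] (the event space, probabilities, and helper names below are invented for illustration, not taken from the paper):

```python
# Events are (history, word) pairs; a selector function picks out the subset S.
def f_bigram_the_bank(h, w):
    # Bigram constraint: history ends in "the" and the next word is "bank".
    return 1.0 if h and h[-1] == "the" and w == "bank" else 0.0

def f_trigger_loan_bank(h, w):
    # Trigger constraint: "loan" occurred somewhere in the history.
    return 1.0 if "loan" in h and w == "bank" else 0.0

def constraint_lhs(p, f_s):
    # Left-hand side of Equation [5]: sum over (h, w) of P(h, w) * f_S(h, w).
    return sum(prob * f_s(h, w) for (h, w), prob in p.items())

p = {(("we", "paid", "the"), "bank"): 0.4,
     (("the", "loan", "from"), "bank"): 0.6}
print(constraint_lhs(p, f_trigger_loan_bank))  # 0.6
```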

Page 16

Maximum Entropy

• The ME principle (Jaynes, 1957; Kullback, 1959) can be stated as follows
– 1. Reformulate the different information sources as constraints to be satisfied by the target (combined) estimate
– 2. Among all probability distributions that satisfy these constraints, choose the one that has the highest entropy

Page 17

Maximum Entropy

• Given a general event space {x}, to derive a combined probability function P(x), each constraint i is associated with a constraint function fi(x) and a desired expectation Ki

• The constraint is then written as :

E_P[f_i] \stackrel{def}{=} \sum_x P(x)\, f_i(x) = K_i \qquad [6]

Page 18

LME

• Given features f1,…,fN specifying the properties we would like to match in the data, select a joint model p* from the set of possible probability distributions that maximizes the entropy

p^* = \arg\max_{p} H(p) = -\sum_x p(x) \log p(x)

subject to \quad \sum_x p(x)\, f_i(x) = \sum_y \tilde{p}(y) \sum_z p(z \mid y)\, f_i(y,z), \qquad i = 1, \ldots, N

Here \tilde{p}(y) is the empirical distribution on the set of observed components of the training data, and p(z \mid y) encodes the hidden dependency structure in the statistical model.
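To make the LME constraint concrete, here is a minimal sketch (a made-up two-variable model, not the paper's language model) that evaluates both sides of the constraint for one feature:

```python
import itertools

# Complete data x = (y, z): y observed, z hidden, both binary in this toy example.
Y, Z = [0, 1], [0, 1]

def f(y, z):
    # An example feature over the complete data.
    return 1.0 if y == 1 and z == 1 else 0.0

def constraint_sides(p_joint, p_tilde_y):
    """p_joint[(y, z)] is the model; p_tilde_y[y] is the empirical distribution of Y."""
    p_y = {y: sum(p_joint[(y, z)] for z in Z) for y in Y}                  # model marginal
    p_z_given_y = {(z, y): p_joint[(y, z)] / p_y[y] for y in Y for z in Z}
    lhs = sum(p_joint[(y, z)] * f(y, z) for y, z in itertools.product(Y, Z))
    rhs = sum(p_tilde_y[y] * sum(p_z_given_y[(z, y)] * f(y, z) for z in Z) for y in Y)
    return lhs, rhs   # the model is feasible for this feature when lhs == rhs

print(constraint_sides({(0, 0): 0.25, (0, 1): 0.25, (1, 0): 0.25, (1, 1): 0.25},
                       {0: 0.5, 1: 0.5}))   # (0.25, 0.25): constraint satisfied
```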

Page 19

LME

• Intuitively, the constraints specify that we require the expectations of fi(X) in the joint model to match their empirical expectations on the incomplete data Y– fi(X) = fi(Y, Z)

• When the features only depend on the observable data Y, the LME is equivalent to ME

Page 20

Regularized LME (RLME)

• The constraints in the ME principle are subject to errors due to the empirical data, especially in a very sparse domain
– Add a penalty to the entropy of the joint model

\max_{p,\, a} \; H(p) - U(a) = -\sum_x p(x) \log p(x) - U(a) \qquad (1)

subject to \quad \sum_x p(x)\, f_i(x) = \sum_y \tilde{p}(y) \sum_z p(z \mid y)\, f_i(y,z) + a_i, \qquad i = 1, \ldots, N \qquad (2)

Here a = (a_1, \ldots, a_N), a_i is the error for each constraint, and U: \mathbb{R}^N \rightarrow \mathbb{R} is a convex function which has its minimum at 0.

Page 21

A Training Algorithm

• Assume we have already selected the features f1,…fN from the training data

• Restrict p(x) to be an exponential model

p_\lambda(x) = \frac{1}{\Phi_\lambda} \exp\left( \sum_i \lambda_i f_i(x) \right)

where \Phi_\lambda is a constant that ensures \sum_{x \in X} p_\lambda(x) = 1.
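A minimal sketch of this exponential form over a small, fully enumerable space (the feature set and parameter values below are invented):

```python
import math

def log_linear(lmbda, features, X):
    """p_lambda(x) = exp(sum_i lambda_i * f_i(x)) / Phi_lambda over a finite space X."""
    scores = {x: math.exp(sum(l * f(x) for l, f in zip(lmbda, features))) for x in X}
    phi = sum(scores.values())                  # the normalizing constant Phi_lambda
    return {x: s / phi for x, s in scores.items()}

X = range(4)
features = [lambda x: float(x >= 2), lambda x: float(x % 2 == 0)]
p = log_linear([0.5, -1.0], features, X)
print(round(sum(p.values()), 10))               # 1.0: properly normalized
```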

Page 22

A Training Algorithm

• Intimately related to finding locally maximum a posteriori (MAP) solutions:
– Given a penalty function U over errors a, an associated prior U^* on \lambda can be obtained by setting U^* to the convex conjugate of U
– e.g., given a quadratic penalty U(a) = \sum_{i=1}^{N} \frac{a_i^2}{2\sigma_i^2}, the convex conjugate U^*(\lambda) = \sum_{i=1}^{N} \frac{\sigma_i^2 \lambda_i^2}{2} can be determined by setting a_i = \sigma_i^2 \lambda_i; this specifies a Gaussian prior on \lambda
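The quadratic-penalty / convex-conjugate correspondence can be checked numerically; a small sketch with arbitrary illustrative values:

```python
# U(a) = sum_i a_i^2 / (2 sigma_i^2); its convex conjugate is
# U*(lambda) = sum_i sigma_i^2 lambda_i^2 / 2, attained at a_i = sigma_i^2 lambda_i.
def U(a, sigma):
    return sum(ai ** 2 / (2 * si ** 2) for ai, si in zip(a, sigma))

def U_star(lmbda, sigma):
    return sum(si ** 2 * li ** 2 / 2 for li, si in zip(lmbda, sigma))

sigma = [1.0, 2.0]
lmbda = [0.3, -0.7]
a_star = [si ** 2 * li for li, si in zip(lmbda, sigma)]     # maximizer of <lambda, a> - U(a)
conjugate_value = sum(li * ai for li, ai in zip(lmbda, a_star)) - U(a_star, sigma)
print(abs(conjugate_value - U_star(lmbda, sigma)) < 1e-12)  # True
```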

Page 23

A Training Algorithm

– Then, given a prior U^*, the standard MAP estimate maximizes the penalized log-likelihood

R(\lambda) = \sum_y \tilde{p}(y) \log p_\lambda(y) - U^*(\lambda)

• Our key result is that locally maximizing R(\lambda) is equivalent to satisfying the feasibility constraints (2) of the RLME principle

Page 24

A Training Algorithm

• THEOREM 1.
– Under the log-linear assumption, locally maximizing the posterior probability of log-linear models on incomplete data is equivalent to satisfying the feasibility constraints of the RLME principle. That is, the only distinction between MAP and RLME in log-linear models is that, among local maxima (feasible solutions), RLME selects the model with the maximum entropy, whereas MAP selects the model with the maximum posterior probability

Page 25

A Training Algorithm

• R-EM-IS employs an EM algorithm as an outer loop, but uses a nested GIS/IIS algorithm to perform the internal M step

• Decompose the penalized log-likelihood function R(\lambda):

R(\lambda) = \sum_y \tilde{p}(y) \log p_\lambda(y) - U^*(\lambda) = Q(\lambda, \lambda') - H(\lambda, \lambda'), where

Q(\lambda, \lambda') = \sum_y \tilde{p}(y) \sum_z p_{\lambda'}(z \mid y) \log p_\lambda(x) - U^*(\lambda)

H(\lambda, \lambda') = \sum_y \tilde{p}(y) \sum_z p_{\lambda'}(z \mid y) \log p_\lambda(z \mid y)

This is a standard decomposition used for deriving EM

Page 26

A Training Algorithm

• For log-linear models:

Q(\lambda, \lambda') = \sum_{i=1}^{N} \lambda_i \sum_{y} \tilde{p}(y) \sum_{z} p_{\lambda'}(z \mid y)\, f_i(y,z) - \log \Phi_\lambda - U^*(\lambda) \qquad (3)

since, for the log-linear model p_\lambda(x) = \frac{1}{\Phi_\lambda} \exp\left( \sum_i \lambda_i f_i(x) \right),

\log p_\lambda(x) = \sum_i \lambda_i f_i(x) - \log \Phi_\lambda

Page 27

A Training Algorithm

• LEMMA 1.

Maximizing Q(\lambda, \lambda') as a function of \lambda for fixed \lambda' is equivalent to solving

\max_{p,\, a} \; H(p) - U(a) = -\sum_x p(x) \log p(x) - U(a) \qquad (4)

subject to \quad \sum_x p(x)\, f_i(x) = \sum_y \tilde{p}(y) \sum_z p_{\lambda'}(z \mid y)\, f_i(y,z) + a_i, \qquad i = 1, \ldots, N \qquad (5)

Page 28

R-EM-IS algorithm

E-step: Compute \sum_y \tilde{p}(y) \sum_z p_{\lambda^{(j)}}(z \mid y)\, f_i(y,z) for i = 1, \ldots, N.

M-step: Perform K parallel updates of the parameter values \lambda_i^{(j, s/K)} for i = 1, \ldots, N by iterative scaling (GIS or IIS) as follows:

\lambda_i^{(j, (s+1)/K)} = \lambda_i^{(j, s/K)} + \gamma_i^{(j, s/K)}, \qquad s = 0, \ldots, K-1 \qquad (6)

where \gamma_i^{(j, s/K)} satisfies

\sum_x p_{\lambda^{(j, s/K)}}(x)\, f_i(x)\, e^{\gamma_i^{(j, s/K)} f^{\#}(x)} + \sigma_i^2 \left( \lambda_i^{(j, s/K)} + \gamma_i^{(j, s/K)} \right) = \sum_y \tilde{p}(y) \sum_z p_{\lambda^{(j)}}(z \mid y)\, f_i(y,z) \qquad (7)
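A compact sketch of the R-EM-IS loop on a toy, fully enumerable problem. Everything here (the observed/hidden spaces, features, empirical distribution, and prior variances) is invented for illustration, and the nested GIS/IIS loop is replaced by a few plain gradient steps on Q(lambda, lambda') to keep the code short; it is not the authors' implementation.

```python
import math, itertools, random

# Toy complete-data space: observed y in {0, 1, 2}, hidden z in {0, 1}.
Y, Z = [0, 1, 2], [0, 1]
X = list(itertools.product(Y, Z))

# Two indicator features over the complete data x = (y, z).
features = [lambda y, z: float(y == z),
            lambda y, z: float(y == 2 and z == 1)]
N = len(features)
sigma2 = [0.1, 0.1]                          # sigma_i^2 in the quadratic penalty U(a)
p_tilde = {0: 0.5, 1: 0.3, 2: 0.2}           # empirical distribution of the observed data

def model(lmbda):
    # Log-linear model p_lambda(x) over the complete data.
    s = {x: math.exp(sum(l * f(*x) for l, f in zip(lmbda, features))) for x in X}
    phi = sum(s.values())
    return {x: v / phi for x, v in s.items()}

def e_step(lmbda):
    # Expected feature values: sum_y p~(y) sum_z p_lambda(z|y) f_i(y, z).
    p = model(lmbda)
    p_y = {y: sum(p[(y, z)] for z in Z) for y in Y}
    return [sum(p_tilde[y] * sum(p[(y, z)] / p_y[y] * f(y, z) for z in Z) for y in Y)
            for f in features]

def m_step(lmbda, target, inner_iters=100, lr=0.5):
    # Stand-in for the nested GIS/IIS loop: gradient ascent on Q(lambda, lambda').
    lmbda = list(lmbda)
    for _ in range(inner_iters):
        p = model(lmbda)
        model_expect = [sum(p[x] * f(*x) for x in X) for f in features]
        for i in range(N):
            # dQ/dlambda_i = target_i - E_p[f_i] - sigma_i^2 * lambda_i  (prior term)
            lmbda[i] += lr * (target[i] - model_expect[i] - sigma2[i] * lmbda[i])
    return lmbda

def r_em_is(lmbda, outer_iters=20):
    for _ in range(outer_iters):
        target = e_step(lmbda)          # E-step: fix the right-hand side of the constraints
        lmbda = m_step(lmbda, target)   # M-step: move lambda toward satisfying them
    return lmbda

random.seed(0)
print(r_em_is([random.uniform(-1, 1) for _ in range(N)]))
```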

Page 29

R-EM-IS algorithm

• THEOREM 2.
The R-EM-IS algorithm monotonically increases the penalized log-likelihood function R(\lambda), and all limit points of any R-EM-IS sequence \{\lambda^{(j, s/K)} : j = 0, 1, \ldots;\ s = 1, \ldots, K\} belong to the set

\Gamma = \{ \lambda \in \mathbb{R}^N : \partial R(\lambda) / \partial \lambda = 0 \} \qquad (8)

Therefore, R-EM-IS asymptotically yields feasible solutions to the RLME principle for log-linear models.

Page 30

(Derivation relating the penalized log-likelihood R(\lambda) = \sum_y \tilde{p}(y) \log p_\lambda(y) - U^*(\lambda) to the unpenalized log-likelihood l(\lambda) = \sum_y \tilde{p}(y) \log p_\lambda(y).)
Page 31

R-ME-EM-IS algorithm

Initialization: Randomly choose initial guesses for \lambda.

R-EM-IS: Run R-EM-IS to convergence, to obtain a feasible \lambda^*.

Entropy calculation: Calculate the regularized entropy of p_{\lambda^*}.

Model selection: Repeat the above steps several times to produce a set of distinct feasible candidates. Choose the feasible candidate that achieves the highest regularized entropy.
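Continuing the toy sketch above, the outer restart-and-select loop might look as follows (again only an illustrative sketch; `model`, `e_step`, `r_em_is`, `features`, `X`, `sigma2`, and `N` are the toy definitions from the previous code block):

```python
def regularized_entropy(lmbda):
    # H(p_lambda) - U(a), where a_i is the residual in constraint (2).
    p = model(lmbda)
    H = -sum(v * math.log(v) for v in p.values() if v > 0)
    model_expect = [sum(p[x] * f(*x) for x in X) for f in features]
    a = [m - t for m, t in zip(model_expect, e_step(lmbda))]
    return H - sum(ai ** 2 / (2 * s2) for ai, s2 in zip(a, sigma2))

def r_me_em_is(num_restarts=20):
    candidates = []
    for _ in range(num_restarts):
        start = [random.uniform(-1, 1) for _ in range(N)]    # Initialization
        lam = r_em_is(start)                                 # R-EM-IS to (near) convergence
        candidates.append((regularized_entropy(lam), lam))   # Entropy calculation
    return max(candidates)[1]                                # Model selection

best_lambda = r_me_em_is()
```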

Page 32

Combining N-gram and PLSA Models

Define the complete data as x = (W_0, W_{-1}, W_{-2}, D, T_0, T_{-1}, T_{-2}), where W_0, W_{-1}, W_{-2} are the current and two previous words, T_0, T_{-1}, T_{-2} are the hidden “topic” values associated with these words, and D is a document. Thus, y = (W_0, W_{-1}, W_{-2}, D) is the observed data and z = (T_0, T_{-1}, T_{-2}) is unobserved data.

Page 33

Tri-gram portion

\sum_x p(x)\, \delta(W_0 = w_i, W_{-1} = w_j, W_{-2} = w_k) = \sum_d \tilde{p}(d)\, \tilde{p}(w_i, w_j, w_k \mid d)

\sum_x p(x)\, \delta(W_0 = w_i, W_{-1} = w_j) = \sum_d \tilde{p}(d)\, \tilde{p}(w_i, w_j \mid d)

\sum_x p(x)\, \delta(W_0 = w_i) = \sum_d \tilde{p}(d)\, \tilde{p}(w_i \mid d) \qquad (9)

Page 34

PLSA portion

constraints between document and topic:

\sum_x p(x) \sum_{l=-2}^{0} \delta(T_l = t, D = d) = \tilde{p}(d) \sum_{l=-2}^{0} \tilde{p}(W_l \mid d)\, p(t \mid W_l, d)

constraints between topic and word:

\sum_x p(x) \sum_{l=-2}^{0} \delta(T_l = t, W_l = w_i) = \sum_d \tilde{p}(d)\, \tilde{p}(w_i \mid d) \sum_{l=-2}^{0} p(t \mid W_l = w_i, d) \qquad (10)
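As a concrete (hypothetical) illustration of this feature set, indicator features over the complete data x = (W_0, W_{-1}, W_{-2}, D, T_0, T_{-1}, T_{-2}) might be written as follows; the identifiers and example values are made up, and count-valued features are simplified to indicators:

```python
from typing import NamedTuple

class CompleteData(NamedTuple):
    w0: str    # current word
    w1: str    # previous word (W_{-1})
    w2: str    # word before that (W_{-2})
    d: str     # document identifier
    t0: int    # hidden topic of w0
    t1: int    # hidden topic of w1
    t2: int    # hidden topic of w2

# Tri-gram portion (observed words only): trigram, bigram, unigram indicators.
def f_trigram(wi, wj, wk):
    return lambda x: float((x.w0, x.w1, x.w2) == (wi, wj, wk))

def f_bigram(wi, wj):
    return lambda x: float((x.w0, x.w1) == (wi, wj))

def f_unigram(wi):
    return lambda x: float(x.w0 == wi)

# PLSA portion (involves the hidden topics): topic-document and topic-word pairs.
def f_topic_doc(t, d):
    return lambda x: float(x.d == d and t in (x.t0, x.t1, x.t2))

def f_topic_word(t, wi):
    return lambda x: float((x.t0, x.w0) == (t, wi) or
                           (x.t1, x.w1) == (t, wi) or
                           (x.t2, x.w2) == (t, wi))

x = CompleteData("bank", "the", "from", d="doc42", t0=3, t1=3, t2=7)
print(f_trigram("bank", "the", "from")(x), f_topic_word(3, "bank")(x))  # 1.0 1.0
```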

Page 35

Efficient Feature Expectation and Inference

• ME model:

p(h, w) = \frac{1}{Z} \exp\left( \sum_i \lambda_i f_i(h, w) \right)

Normalization constant: Z = \sum_{(h,w)} \exp\left( \sum_i \lambda_i f_i(h, w) \right)

Feature expectations: E_p[f_i] = \sum_{(h,w)} p(h, w)\, f_i(h, w)

• Sum-product algorithm: the normalization constant and the feature expectations can be computed efficiently by distributing the sums over the factored model:

p(W_0 = w_i, W_{-1} = w_j, W_{-2} = w_k) = \frac{1}{Z}\, e^{\lambda_{w_i w_j w_k}} e^{\lambda_{w_i w_j}} e^{\lambda_{w_j w_k}} e^{\lambda_{w_i}} e^{\lambda_{w_j}} e^{\lambda_{w_k}} \sum_{t_0} \sum_{t_{-1}} \sum_{t_{-2}} \sum_{d} e^{\lambda_{w_i t_0}} e^{\lambda_{w_j t_{-1}}} e^{\lambda_{w_k t_{-2}}} e^{\lambda_{t_0 d}} e^{\lambda_{t_{-1} d}} e^{\lambda_{t_{-2} d}} \qquad (12)

Z = \sum_{w_0} \sum_{t_0} \sum_{w_{-1}} \sum_{t_{-1}} \sum_{w_{-2}} \sum_{t_{-2}} \sum_{d} e^{\lambda_{w_0 w_{-1} w_{-2}}} e^{\lambda_{w_0 w_{-1}}} e^{\lambda_{w_{-1} w_{-2}}} e^{\lambda_{w_0}} e^{\lambda_{w_{-1}}} e^{\lambda_{w_{-2}}} e^{\lambda_{w_0 t_0}} e^{\lambda_{w_{-1} t_{-1}}} e^{\lambda_{w_{-2} t_{-2}}} e^{\lambda_{t_0 d}} e^{\lambda_{t_{-1} d}} e^{\lambda_{t_{-2} d}} \qquad (13)
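A sketch of the sum-product saving in (12): for fixed words and document, the sum over the three hidden topics factors into a product of three one-dimensional sums. The potential tables, sizes, and values below are invented, and the word-only factors and the outer sums are omitted for brevity.

```python
import math, itertools, random

# Toy sizes; the actual model uses |V| = 20000 words and |T| = 125 topics.
words, topics, docs = ["a", "b", "c"], [0, 1], ["d1", "d2"]
random.seed(0)
lam_tw = {(t, w): random.gauss(0, 0.1) for t in topics for w in words}  # topic-word weights
lam_td = {(t, d): random.gauss(0, 0.1) for t in topics for d in docs}   # topic-document weights

def topic_sum_factored(w0, w1, w2, d):
    # Product of three independent sums over t_0, t_{-1}, t_{-2}: O(3 * |T|) terms.
    return math.prod(sum(math.exp(lam_tw[(t, w)] + lam_td[(t, d)]) for t in topics)
                     for w in (w0, w1, w2))

def topic_sum_brute(w0, w1, w2, d):
    # Direct sum over all |T|^3 topic assignments, for comparison.
    return sum(math.exp(sum(lam_tw[(t, w)] + lam_td[(t, d)]
                            for t, w in zip(ts, (w0, w1, w2))))
               for ts in itertools.product(topics, repeat=3))

print(abs(topic_sum_factored("a", "b", "c", "d1")
          - topic_sum_brute("a", "b", "c", "d1")) < 1e-9)   # True: same value, fewer terms
```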

Page 36

Semantic Smoothing

• Add a node C as a word cluster between each topic node and word node
– |C| = 1: maximum smoothing
– |C| = |V|: no smoothing (|V| is the vocabulary size)

• Add a node S as a document cluster between the topic node and document node
– |S| = 1: over-smoothed
– |S| = |D|: no smoothing (|D| is the number of documents)

Chain structures: W - C - T and T - S - D

Page 37

Computation in Testing

p(w_1 w_2 \cdots w_L) = \prod_{l=1}^{L} p(w_l \mid w_1 \cdots w_{l-1})

= \prod_{l=1}^{L} \sum_{D,\, T_0,\, T_{-1},\, T_{-2}} p(w_l, D, T_0, T_{-1}, T_{-2} \mid w_1 \cdots w_{l-1})

\approx \prod_{l=1}^{L} \sum_{D,\, T_0,\, T_{-1},\, T_{-2}} p(w_l, D, T_0, T_{-1}, T_{-2} \mid w_{l-1}, w_{l-2})

Page 38

Experimental Evaluation

• Training Data:
– NAB: 87000 documents, 1987-1989, 38M words
• Vocabulary size |V| is 20000: the most frequent words of the training data
• Testing Data:
– 325000 words, 1989

• Evaluation:

Perplexity = 2^{\,Entropy}, \qquad Entropy = -\frac{1}{L} \sum_{i=1}^{L} \log_2 p(w_i \mid w_1 \cdots w_{i-1})
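A minimal sketch of this perplexity computation; `uniform_lm` below is an invented stand-in for any model exposing p(w | history):

```python
import math

def perplexity(lm, words):
    # Perplexity = 2 ** ( -(1/L) * sum_i log2 p(w_i | w_1 ... w_{i-1}) )
    log_sum = sum(math.log2(lm(words[:i], words[i])) for i in range(len(words)))
    return 2 ** (-log_sum / len(words))

# A uniform model over a 20000-word vocabulary has perplexity 20000.
uniform_lm = lambda history, word: 1.0 / 20000
print(round(perplexity(uniform_lm, ["the", "bank", "raised", "rates"])))  # 20000
```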

Page 39

Experimental design

• Baseline: Tri-gram with GT smoothing, perplexity is 105

• R-EM-IS procedure: EM iteration is 5, IIS loop iteration is 20

• Feasible solution: initialized the parameters to zero and executed a single run of R-EM-IS

• RLME and MAP: use 20 random starting points for R-EM-IS

Page 40

Simple tri-gram

• Since there are no hidden variables, MAP, RLME, and a single run of R-EM-IS all reduce to the same standard ME principle

• Perplexity score is 107

Page 41

Tri-gram + PLSA

• |T| = 125

(Results: perplexities 95, 91, 89)

Page 42

Page 43

Add word cluster

(Results: perplexities 89, 93, 84)

Page 44

Add topic cluster

(Results: perplexities 90, 87)

Page 45

Add word and topic clusters (1/2)

(Result: perplexity 82)

Page 46

Add word and topic clusters (2/2)

Page 47

Experiment summary

Page 48

Extensions

• LSA:

• Perplexity = 97

p(w_l \mid w_{l-1} w_{l-2} \ldots) = \frac{p(w_l \mid w_{l-1} w_{l-2})\, p_{LSA}(d_{l-1} \mid w_l)}{\sum_{w_i} p(w_i \mid w_{l-1} w_{l-2})\, p_{LSA}(d_{l-1} \mid w_i)} \qquad (13)

= \frac{p(w_l \mid w_{l-1} w_{l-2})\, p_{LSA}(w_l \mid d_{l-1}) / P(w_l)}{\sum_{w_i} p(w_i \mid w_{l-1} w_{l-2})\, p_{LSA}(w_i \mid d_{l-1}) / P(w_i)} \qquad (14)

Page 49

Extensions

• Raising LSA portion’s contribution to some power and renormalizing

• Perplexity = 82, equal to the best results using RLME

p(w_l \mid w_{l-1} w_{l-2} \ldots) = \frac{p(w_l \mid w_{l-1} w_{l-2})\, p_{LSA}(d_{l-1} \mid w_l)^{7}}{\sum_{w_i} p(w_i \mid w_{l-1} w_{l-2})\, p_{LSA}(d_{l-1} \mid w_i)^{7}} \qquad (15)
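A sketch of the renormalized combination in (13)-(15); `p_trigram` and `p_lsa` below are invented toy callables, and `gamma` plays the role of the power applied to the LSA portion:

```python
def combine(p_trigram, p_lsa, vocab, history, doc_history, gamma=1.0):
    """p(w | history) proportional to p_trigram(history, w) * p_lsa(doc_history, w) ** gamma,
    renormalized over the vocabulary; gamma > 1 boosts the LSA contribution."""
    scores = {w: p_trigram(history, w) * p_lsa(doc_history, w) ** gamma for w in vocab}
    z = sum(scores.values())
    return {w: s / z for w, s in scores.items()}

vocab = ["bank", "river", "rate"]
p_tri = lambda h, w: {"bank": 0.5, "river": 0.2, "rate": 0.3}[w]   # toy trigram probabilities
p_lsa = lambda d, w: {"bank": 0.6, "river": 0.1, "rate": 0.3}[w]   # toy p_LSA(d_{l-1} | w)
print(combine(p_tri, p_lsa, vocab, history=("paid", "the"), doc_history="d", gamma=7))
```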