A Note on the Derivation of the Variational Inference Updates for DILN [2]

Tomonari MASADA @ Nagasaki University

August 30, 2013

1

Let $M$, $N_m$, and $T$ denote the number of documents, the number of word tokens in the $m$th document, and the truncation level, respectively. $X_{mn}$ denotes the word appearing as the $n$th token of the $m$th document, and $C_{mn}$ denotes the latent topic of the $n$th token of the $m$th document. The definitions of the other symbols can be found in the original paper [2].

The joint distribution can be written as follows:

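$$
\begin{aligned}
& p(X, Z, C, w, \eta, V, \alpha, \beta, m, K) \\
&\quad = p(X|C,\eta)\,p(Z|V,w,\beta)\,p(C|Z)\,p(w|m,K)\,p(\eta)\,p(V|\alpha)\,p(\alpha)\,p(\beta)\,p(m)\,p(K).
\end{aligned}
\tag{1}
$$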

A lower bound of the log evidence can be obtained by using Jensen’s inequality as follows:

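$$
\begin{aligned}
\ln p(X) &= \ln \int \sum_{C} p(X, Z, C, w, \eta, V, \alpha, \beta, m, K)\, dZ\, dw\, d\eta\, dV\, d\alpha\, d\beta\, dm\, dK \\
&= \ln \int \sum_{C} q(Z)q(C)q(w)q(\eta)q(V)q(\alpha)q(\beta)q(m)q(K) \\
&\qquad \cdot \frac{p(X|C,\eta)\,p(Z|V,w,\beta)\,p(C|Z)\,p(w|m,K)\,p(\eta)\,p(V|\alpha)\,p(\alpha)\,p(\beta)\,p(m)\,p(K)}{q(Z)q(C)q(w)q(\eta)q(V)q(\alpha)q(\beta)q(m)q(K)}\, dZ\, dw\, d\eta\, dV\, d\alpha\, d\beta\, dm\, dK \\
&\geq \int \sum_{C} q(Z)q(C)q(w)q(\eta)q(V)q(\alpha)q(\beta)q(m)q(K) \\
&\qquad \cdot \ln \frac{p(X|C,\eta)\,p(Z|V,w,\beta)\,p(C|Z)\,p(w|m,K)\,p(\eta)\,p(V|\alpha)\,p(\alpha)\,p(\beta)\,p(m)\,p(K)}{q(Z)q(C)q(w)q(\eta)q(V)q(\alpha)q(\beta)q(m)q(K)}\, dZ\, dw\, d\eta\, dV\, d\alpha\, d\beta\, dm\, dK \\
&= \int \sum_{C} q(C)q(\eta) \ln p(X|C,\eta)\, d\eta + \int q(Z)q(V)q(w)q(\beta) \ln p(Z|V,w,\beta)\, dZ\, dV\, dw\, d\beta \\
&\quad + \int \sum_{C} q(C)q(Z) \ln p(C|Z)\, dZ + \int q(w)q(m)q(K) \ln p(w|m,K)\, dw\, dm\, dK \\
&\quad + \int q(\eta) \ln p(\eta)\, d\eta + \int q(V) \ln p(V|\alpha)\, dV + \int q(\alpha) \ln p(\alpha)\, d\alpha \\
&\quad + \int q(\beta) \ln p(\beta)\, d\beta + \int q(m) \ln p(m)\, dm + \int q(K) \ln p(K)\, dK \\
&\quad - \int q(Z) \ln q(Z)\, dZ - \sum_{C} q(C) \ln q(C) - \int q(w) \ln q(w)\, dw \\
&\quad - \int q(\eta) \ln q(\eta)\, d\eta - \int q(V) \ln q(V)\, dV - \int q(\alpha) \ln q(\alpha)\, d\alpha \\
&\quad - \int q(\beta) \ln q(\beta)\, d\beta - \int q(m) \ln q(m)\, dm - \int q(K) \ln q(K)\, dK.
\end{aligned}
\tag{2}
$$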


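Since $q(V) = \delta_V$, $q(m) = \delta_m$, $q(K) = \delta_K$, $q(\alpha) = \delta_\alpha$, and $q(\beta) = \delta_\beta$, we can rewrite the right-hand side of Eq. (2) as follows:

$$
\begin{aligned}
\ln p(X) &\geq \int \sum_{C} q(C)q(\eta) \ln p(X|C,\eta)\, d\eta + \int q(Z)q(w) \ln p(Z|V,w,\beta)\, dZ\, dw \\
&\quad + \int \sum_{C} q(C)q(Z) \ln p(C|Z)\, dZ + \int q(w) \ln p(w|m,K)\, dw + \int q(\eta) \ln p(\eta)\, d\eta + \ln p(V|\alpha) \\
&\quad + \ln p(\alpha) + \ln p(\beta) + \ln p(m) + \ln p(K) \\
&\quad - \int q(Z) \ln q(Z)\, dZ - \sum_{C} q(C) \ln q(C) - \int q(w) \ln q(w)\, dw - \int q(\eta) \ln q(\eta)\, d\eta.
\end{aligned}
\tag{3}
$$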

2

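We examine each term of the right-hand side of Eq. (3).

$$
\begin{aligned}
\int \sum_{C} q(C)q(\eta) \ln p(X|C,\eta)\, d\eta
&= \sum_{m=1}^{M} \sum_{n=1}^{N_m} \sum_{k=1}^{T} \phi_{mnk} \int \frac{\Gamma(\sum_{d} \gamma'_{kd})}{\prod_{d} \Gamma(\gamma'_{kd})} \prod_{d=1}^{D} \eta_{kd}^{\gamma'_{kd}-1} \ln \eta_{kX_{mn}}\, d\eta_k \\
&= \sum_{m=1}^{M} \sum_{n=1}^{N_m} \sum_{k=1}^{T} \phi_{mnk} \left\{ \psi(\gamma'_{kX_{mn}}) - \psi(\hat{\gamma}'_k) \right\},
\end{aligned}
\tag{4}
$$

where $\hat{\gamma}'_k \equiv \sum_{d=1}^{D} \gamma'_{kd}$.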

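$$
\begin{aligned}
& \int q(Z)q(w) \ln p(Z|V,w,\beta)\, dZ\, dw \\
&= \sum_{m} \sum_{k} \int q(Z_{mk})q(w_{mk}) \ln \left\{ \frac{(e^{-w_{mk}})^{\beta p_k}}{\Gamma(\beta p_k)} Z_{mk}^{\beta p_k - 1} e^{-e^{-w_{mk}} Z_{mk}} \right\} dZ_{mk}\, dw_{mk} \\
&= -\sum_{k} \beta p_k \sum_{m} \int q(w_{mk})\, w_{mk}\, dw_{mk} - M \sum_{k} \ln \Gamma(\beta p_k) \\
&\quad + \sum_{k} (\beta p_k - 1) \sum_{m} \int q(Z_{mk}) \ln Z_{mk}\, dZ_{mk} - \sum_{m} \sum_{k} \int q(Z_{mk})q(w_{mk})\, e^{-w_{mk}} Z_{mk}\, dZ_{mk}\, dw_{mk},
\end{aligned}
\tag{5}
$$

The factor $M$ on the $\ln \Gamma(\beta p_k)$ term arises because this normalizer appears once for each of the $M$ documents.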

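where

$$
\begin{aligned}
\int q(w_{mk})\, e^{-w_{mk}}\, dw_{mk}
&= \int \frac{1}{\sqrt{2\pi v_{mk}}} \exp\left\{ -\frac{(w_{mk} - \mu_{mk})^2}{2 v_{mk}} - w_{mk} \right\} dw_{mk} \\
&= \int \frac{1}{\sqrt{2\pi v_{mk}}} \exp\left( -\frac{w_{mk}^2 - 2\mu_{mk} w_{mk} + 2 v_{mk} w_{mk} + \mu_{mk}^2}{2 v_{mk}} \right) dw_{mk} \\
&= \int \frac{1}{\sqrt{2\pi v_{mk}}} \exp\left\{ -\frac{(w_{mk} - \mu_{mk} + v_{mk})^2}{2 v_{mk}} - \mu_{mk} + \frac{v_{mk}}{2} \right\} dw_{mk}
= \exp\left( -\mu_{mk} + \frac{v_{mk}}{2} \right).
\end{aligned}
\tag{6}
$$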
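The closed form in Eq. (6) can be checked numerically. The following is a minimal sketch, assuming $q(w_{mk})$ is a univariate Gaussian with mean `mu` and variance `v`; the function names and the midpoint-rule integrator are illustrative choices, not part of the original note.

```python
import math

def expected_exp_neg_w(mu, v):
    """Closed form of E_q[exp(-w)] for q(w) = Normal(mu, v), as in Eq. (6)."""
    return math.exp(-mu + v / 2.0)

def expected_exp_neg_w_numeric(mu, v, half_width=12.0, steps=20_000):
    """Midpoint-rule estimate of the same integral over mu +/- half_width std devs."""
    sd = math.sqrt(v)
    lo = mu - half_width * sd
    h = 2.0 * half_width * sd / steps
    total = 0.0
    for i in range(steps):
        w = lo + (i + 0.5) * h
        density = math.exp(-(w - mu) ** 2 / (2.0 * v)) / math.sqrt(2.0 * math.pi * v)
        total += density * math.exp(-w) * h
    return total
```

Comparing the two functions for a few values of `mu` and `v` gives a quick sanity check on the sign of the $v_{mk}/2$ term.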

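Note that $v_{mk}$ is a variance. Using $E[w_{mk}] = \mu_{mk}$, $E[\ln Z_{mk}] = \psi(a_{mk}) - \ln b_{mk}$, and $E[Z_{mk}] = a_{mk}/b_{mk}$, we consequently have

$$
\begin{aligned}
& \int q(Z)q(w) \ln p(Z|V,w,\beta)\, dZ\, dw \\
&= -\sum_{k} \beta p_k \sum_{m} \mu_{mk} - M \sum_{k} \ln \Gamma(\beta p_k) + \sum_{k} (\beta p_k - 1) \sum_{m} \left\{ \psi(a_{mk}) - \ln b_{mk} \right\} - \sum_{m} \sum_{k} \frac{a_{mk}}{b_{mk}} \exp\left( -\mu_{mk} + \frac{v_{mk}}{2} \right).
\end{aligned}
\tag{7}
$$

Note that $p_k \equiv V_k \prod_{j=1}^{k-1} (1 - V_j)$.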
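The stick-breaking weights $p_k$ appear in most of the updates that follow. A minimal sketch of their computation, assuming the truncated proportions $V_1, \ldots, V_T$ are given as a Python list (the function name is hypothetical):

```python
def stick_breaking_weights(V):
    """p_k = V_k * prod_{j<k} (1 - V_j) for a truncated list of stick proportions."""
    weights = []
    remaining = 1.0  # length of the stick not yet broken off
    for v_k in V:
        weights.append(v_k * remaining)
        remaining *= 1.0 - v_k
    return weights
```

The running product avoids recomputing $\prod_{j<k}(1 - V_j)$ for every $k$.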


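$$
\begin{aligned}
\int \sum_{C} q(C)q(Z) \ln p(C|Z)\, dZ
&= \sum_{m} \sum_{n} \int q(Z_m) \sum_{k} \phi_{mnk} \ln \frac{Z_{mk}}{\sum_{j=1}^{T} Z_{mj}}\, dZ_m \\
&= \sum_{m} \sum_{k} \Big( \sum_{n} \phi_{mnk} \Big) \int q(Z_{mk}) \ln Z_{mk}\, dZ_{mk} - \sum_{m} N_m \int q(Z_m) \ln \Big( \sum_{j=1}^{T} Z_{mj} \Big) dZ_m.
\end{aligned}
\tag{8}
$$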

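Since $\ln x \leq \frac{x}{\xi} - 1 + \ln \xi$ for any $\xi > 0$,

$$
\int q(Z_m) \ln \Big( \sum_{j=1}^{T} Z_{mj} \Big) dZ_m
\leq \int q(Z_m) \left( \frac{\sum_{j} Z_{mj}}{\xi_m} - 1 + \ln \xi_m \right) dZ_m
= \frac{1}{\xi_m} \sum_{k} \frac{a_{mk}}{b_{mk}} - 1 + \ln \xi_m.
\tag{9}
$$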
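The inequality used for Eq. (9) is the first-order Taylor bound of the concave logarithm at $\xi$, so it is tight when $\xi = x$. A small sketch that evaluates the right-hand side (the function name is an illustrative assumption):

```python
import math

def log_upper_bound(x, xi):
    """Right-hand side of ln x <= x / xi - 1 + ln xi (tangent of ln at xi)."""
    return x / xi - 1.0 + math.log(xi)
```

Evaluating `log_upper_bound(x, x)` returns `math.log(x)` up to rounding, which is why the later update sets $\xi_m$ to the current value of $\sum_k a_{mk}/b_{mk}$.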

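Therefore, substituting the bound of Eq. (9) into Eq. (8),

$$
\begin{aligned}
& \int \sum_{C} q(C)q(Z) \ln p(C|Z)\, dZ \\
&\geq \sum_{m} \sum_{k} \Big( \sum_{n} \phi_{mnk} \Big) \left\{ \psi(a_{mk}) - \ln b_{mk} \right\} - \sum_{m} \frac{N_m}{\xi_m} \sum_{k} \frac{a_{mk}}{b_{mk}} + \sum_{m} N_m - \sum_{m} N_m \ln \xi_m.
\end{aligned}
\tag{10}
$$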

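Since each $w_m$ is a $T$-dimensional Gaussian vector,

$$
\begin{aligned}
& \int q(w) \ln p(w|m,K)\, dw = \sum_{m} \int q(w_m) \ln p(w_m|m,K)\, dw_m \\
&= \sum_{m} \left[ -\frac{T}{2} \ln 2\pi - \frac{1}{2} \ln |K| - \frac{1}{2} \int q(w_m) (w_m - m)^{\top} K^{-1} (w_m - m)\, dw_m \right] \\
&= -\frac{MT \ln 2\pi}{2} - \frac{M \ln |K|}{2} - \frac{1}{2} \sum_{m} \Big\{ \sum_{k} (\mu_{mk}^2 + v_{mk}) K^{-1}_{k:k} - 2 \sum_{k} m_k \mu_{mk} K^{-1}_{k:k} + \sum_{k} m_k^2 K^{-1}_{k:k} \\
&\qquad\qquad + \sum_{k} \sum_{j \neq k} (\mu_{mk}\mu_{mj} - 2\mu_{mk} m_j + m_k m_j) K^{-1}_{k:j} \Big\} \\
&= -\frac{MT \ln 2\pi}{2} - \frac{M \ln |K|}{2} - \frac{1}{2} \sum_{m} \Big\{ \sum_{k} v_{mk} K^{-1}_{k:k} + \sum_{k} \sum_{j} (\mu_{mk} - m_k)(\mu_{mj} - m_j) K^{-1}_{k:j} \Big\}
\end{aligned}
\tag{11}
$$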

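$$
\begin{aligned}
\int q(\eta) \ln p(\eta)\, d\eta
&= \sum_{k} \int \frac{\Gamma(\sum_{d} \gamma'_{kd})}{\prod_{d} \Gamma(\gamma'_{kd})} \prod_{d=1}^{D} \eta_{kd}^{\gamma'_{kd}-1} \Big\{ \ln \Gamma(D\gamma) - D \ln \Gamma(\gamma) + \sum_{d} (\gamma - 1) \ln \eta_{kd} \Big\} d\eta_k \\
&= T \ln \Gamma(D\gamma) - TD \ln \Gamma(\gamma) + (\gamma - 1) \sum_{k} \sum_{d} \left\{ \psi(\gamma'_{kd}) - \psi(\hat{\gamma}'_k) \right\}
\end{aligned}
\tag{12}
$$

$$
\ln p(V|\alpha) = T \ln \Gamma(\alpha + 1) - T \ln \Gamma(\alpha) + (\alpha - 1) \sum_{k} \ln(1 - V_k)
\tag{13}
$$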

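$$
\int q(Z) \ln q(Z)\, dZ = -\sum_{m} \sum_{k} \left\{ \ln \Gamma(a_{mk}) - (a_{mk} - 1)\psi(a_{mk}) - \ln b_{mk} + a_{mk} \right\}
\tag{14}
$$

$$
\sum_{C} q(C) \ln q(C) = \sum_{m} \sum_{n} \sum_{k} \phi_{mnk} \ln \phi_{mnk}
\tag{15}
$$

$$
\int q(w) \ln q(w)\, dw = -\frac{MT(1 + \ln 2\pi)}{2} - \sum_{m} \sum_{k} \frac{\ln v_{mk}}{2}
\tag{16}
$$

$$
\int q(\eta) \ln q(\eta)\, d\eta = \sum_{k} \Big[ \sum_{d} (\gamma'_{kd} - 1) \left\{ \psi(\gamma'_{kd}) - \psi(\hat{\gamma}'_k) \right\} + \ln \Gamma(\hat{\gamma}'_k) - \sum_{d} \ln \Gamma(\gamma'_{kd}) \Big]
\tag{17}
$$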


Consequently, we obtain a lower bound of the log evidence as follows:

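$$
\begin{aligned}
\ln p(X) \geq{} & \sum_{m=1}^{M} \sum_{n=1}^{N_m} \sum_{k=1}^{T} \phi_{mnk} \left\{ \psi(\gamma'_{kX_{mn}}) - \psi(\hat{\gamma}'_k) \right\} \\
& - \sum_{k=1}^{T} \beta p_k \sum_{m=1}^{M} \mu_{mk} - M \sum_{k=1}^{T} \ln \Gamma(\beta p_k) \\
& + \sum_{k=1}^{T} (\beta p_k - 1) \sum_{m=1}^{M} \left\{ \psi(a_{mk}) - \ln b_{mk} \right\} - \sum_{m=1}^{M} \sum_{k=1}^{T} \frac{a_{mk}}{b_{mk}} \exp\left( -\mu_{mk} + \frac{v_{mk}}{2} \right) \\
& + \sum_{m=1}^{M} \sum_{k=1}^{T} \Big( \sum_{n=1}^{N_m} \phi_{mnk} \Big) \left\{ \psi(a_{mk}) - \ln b_{mk} \right\} - \sum_{m=1}^{M} \frac{N_m}{\xi_m} \sum_{k=1}^{T} \frac{a_{mk}}{b_{mk}} + \sum_{m=1}^{M} N_m - \sum_{m=1}^{M} N_m \ln \xi_m \\
& - \frac{MT \ln 2\pi}{2} - \frac{M \ln |K|}{2} - \frac{1}{2} \sum_{m=1}^{M} \Big\{ \sum_{k=1}^{T} v_{mk} K^{-1}_{k:k} + \sum_{k=1}^{T} \sum_{j=1}^{T} (\mu_{mk} - m_k)(\mu_{mj} - m_j) K^{-1}_{k:j} \Big\} \\
& + T \ln \Gamma(D\gamma) - TD \ln \Gamma(\gamma) + (\gamma - 1) \sum_{k=1}^{T} \sum_{d=1}^{D} \left\{ \psi(\gamma'_{kd}) - \psi(\hat{\gamma}'_k) \right\} \\
& + T \ln \Gamma(\alpha + 1) - T \ln \Gamma(\alpha) + (\alpha - 1) \sum_{k=1}^{T} \ln(1 - V_k) \\
& + \sum_{m=1}^{M} \sum_{k=1}^{T} \left\{ \ln \Gamma(a_{mk}) - (a_{mk} - 1)\psi(a_{mk}) - \ln b_{mk} + a_{mk} \right\} - \sum_{m=1}^{M} \sum_{n=1}^{N_m} \sum_{k=1}^{T} \phi_{mnk} \ln \phi_{mnk} \\
& + \frac{MT(1 + \ln 2\pi)}{2} + \sum_{m=1}^{M} \sum_{k=1}^{T} \frac{\ln v_{mk}}{2} \\
& - \sum_{k=1}^{T} \Big[ \sum_{d=1}^{D} (\gamma'_{kd} - 1) \left\{ \psi(\gamma'_{kd}) - \psi(\hat{\gamma}'_k) \right\} + \ln \Gamma(\hat{\gamma}'_k) - \sum_{d=1}^{D} \ln \Gamma(\gamma'_{kd}) \Big] \\
& + \ln p(\alpha) + \ln p(\beta) + \ln p(m) + \ln p(K),
\end{aligned}
\tag{18}
$$

where $p_k = V_k \prod_{j=1}^{k-1} (1 - V_j)$.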

We assume that $p(m)$ and $p(K)$ are uniform distributions, and that $p(\alpha)$ and $p(\beta)$ are Gamma distributions.

3 Inference Algorithm

3.1 Update q(Cmn)

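Let $L$ denote the right-hand side of Eq. (18).

$$
\begin{aligned}
\frac{\partial L}{\partial \phi_{mnk}} &= \psi(\gamma'_{kX_{mn}}) - \psi(\hat{\gamma}'_k) + \psi(a_{mk}) - \ln b_{mk} - \ln \phi_{mnk} - 1 \\
\therefore\ \phi_{mnk} &\propto \exp\left\{ \psi(\gamma'_{kX_{mn}}) - \psi(\hat{\gamma}'_k) + \psi(a_{mk}) - \ln b_{mk} \right\}
\end{aligned}
\tag{19}
$$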
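The proportionality in Eq. (19) is resolved by normalizing over the $T$ topics. A minimal sketch, assuming the exponent of Eq. (19) has already been evaluated for each topic and is passed in as `scores` (a hypothetical name); subtracting the maximum before exponentiating is a standard numerical safeguard and is not part of the original note.

```python
import math

def update_phi(scores):
    """Normalize exp(scores) over topics, shifting by the maximum for stability."""
    top = max(scores)
    unnorm = [math.exp(s - top) for s in scores]
    total = sum(unnorm)
    return [u / total for u in unnorm]
```

The shift by `top` cancels in the ratio, so the result equals $\exp(s_k) / \sum_j \exp(s_j)$ while avoiding overflow for large scores.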

3.2 Update q(Zmk)

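$$
\frac{\partial L}{\partial \xi_m} = \frac{N_m}{\xi_m^2} \sum_{k} \frac{a_{mk}}{b_{mk}} - \frac{N_m}{\xi_m},
\qquad \therefore\ \xi_m = \sum_{k} \frac{a_{mk}}{b_{mk}}.
\tag{20}
$$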

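$$
\frac{\partial L}{\partial b_{mk}} = -(\beta p_k - 1) \frac{1}{b_{mk}} + \frac{a_{mk}}{b_{mk}^2} \exp\left( -\mu_{mk} + \frac{v_{mk}}{2} \right) - \Big( \sum_{n=1}^{N_m} \phi_{mnk} \Big) \frac{1}{b_{mk}} + \frac{N_m}{\xi_m} \frac{a_{mk}}{b_{mk}^2} - \frac{1}{b_{mk}}
\tag{21}
$$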


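$\frac{\partial L}{\partial b_{mk}} = 0$ gives

$$
0 = -b_{mk} \Big\{ \beta p_k + \sum_{n=1}^{N_m} \phi_{mnk} \Big\} + a_{mk} \left\{ \exp\left( -\mu_{mk} + \frac{v_{mk}}{2} \right) + \frac{N_m}{\xi_m} \right\}.
\tag{22}
$$

Therefore,

$$
b_{mk} = a_{mk} \cdot \frac{\exp\left( -\mu_{mk} + \frac{v_{mk}}{2} \right) + \frac{N_m}{\xi_m}}{\beta p_k + \sum_{n=1}^{N_m} \phi_{mnk}}.
\tag{23}
$$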

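$$
\begin{aligned}
\frac{\partial L}{\partial a_{mk}}
&= (\beta p_k - 1) \psi'(a_{mk}) - \frac{1}{b_{mk}} \exp\left( -\mu_{mk} + \frac{v_{mk}}{2} \right) + \Big( \sum_{n=1}^{N_m} \phi_{mnk} \Big) \psi'(a_{mk}) - \frac{N_m}{\xi_m} \frac{1}{b_{mk}} \\
&\quad - (a_{mk} - 1) \psi'(a_{mk}) + 1 \\
&= \Big\{ \beta p_k + \sum_{n=1}^{N_m} \phi_{mnk} - a_{mk} \Big\} \psi'(a_{mk}) - \frac{1}{b_{mk}} \left\{ \exp\left( -\mu_{mk} + \frac{v_{mk}}{2} \right) + \frac{N_m}{\xi_m} \right\} + 1
\end{aligned}
\tag{24}
$$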

By using the result for bmk, we obtain

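$$
\begin{aligned}
\frac{\partial L}{\partial a_{mk}}
&= \Big\{ \beta p_k + \sum_{n=1}^{N_m} \phi_{mnk} - a_{mk} \Big\} \psi'(a_{mk}) - \frac{\beta p_k + \sum_{n=1}^{N_m} \phi_{mnk}}{a_{mk}} + 1 \\
&= \Big\{ \beta p_k + \sum_{n=1}^{N_m} \phi_{mnk} - a_{mk} \Big\} \left\{ \psi'(a_{mk}) - \frac{1}{a_{mk}} \right\} \\
\therefore\ a_{mk} &= \beta p_k + \sum_{n=1}^{N_m} \phi_{mnk},
\qquad b_{mk} = \exp\left( -\mu_{mk} + \frac{v_{mk}}{2} \right) + \frac{N_m}{\xi_m}.
\end{aligned}
\tag{25}
$$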
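The updates of Eqs. (20) and (25) can be sketched for a single document as follows. This is only an illustration under stated assumptions: the function and argument names are hypothetical, `phi_sums[k]` stands for $\sum_n \phi_{mnk}$, and alternating the $b_{mk}$ and $\xi_m$ updates for a fixed number of sweeps is a choice made here, since the note does not prescribe an ordering.

```python
import math

def update_gamma_params(beta, p, phi_sums, mu, v, n_tokens, xi, sweeps=50):
    """Coordinate updates of (a_mk, b_mk, xi_m) for one document m."""
    # Eq. (25): a_mk depends only on beta * p_k and the topic responsibilities.
    a = [beta * p_k + s_k for p_k, s_k in zip(p, phi_sums)]
    b = []
    for _ in range(sweeps):
        # Eq. (25): b_mk given the current auxiliary variable xi_m.
        b = [math.exp(-mu_k + v_k / 2.0) + n_tokens / xi for mu_k, v_k in zip(mu, v)]
        # Eq. (20): xi_m given the current a_mk and b_mk.
        xi = sum(a_k / b_k for a_k, b_k in zip(a, b))
    return a, b, xi
```

For example, `update_gamma_params(2.0, [0.5, 0.25], [3.0, 1.0], [0.0, 0.0], [1.0, 1.0], 4, 1.0)` returns `a = [4.0, 1.5]` together with positive rates `b` and the matching `xi`.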

3.3 Update q(wmk)

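$$
\frac{\partial L}{\partial \mu_{mk}} = \frac{a_{mk}}{b_{mk}} \exp\left( -\mu_{mk} + \frac{v_{mk}}{2} \right) - \beta p_k - \sum_{j=1}^{T} (\mu_{mj} - m_j) K^{-1}_{k:j}
\tag{26}
$$

$$
\frac{\partial L}{\partial v_{mk}} = \frac{1}{2} \left\{ -\frac{a_{mk}}{b_{mk}} \exp\left( -\mu_{mk} + \frac{v_{mk}}{2} \right) - K^{-1}_{k:k} + \frac{1}{v_{mk}} \right\}
\tag{27}
$$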

The plus and minus signs on the right-hand side of the second line of Eq. (22) in the original paper differ from those given above. Since these gradients have no closed-form root, we may use a gradient-based optimizer such as L-BFGS to update $\mu_{mk}$ and $v_{mk}$.

3.4 Update q(ηk)

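$$
\begin{aligned}
\frac{\partial L}{\partial \gamma'_{kd}}
&= \sum_{m} \sum_{n} I(X_{mn} = d)\, \phi_{mnk}\, \psi'(\gamma'_{kd}) - \sum_{m} \sum_{n} \phi_{mnk}\, \psi'(\hat{\gamma}'_k) + (\gamma - 1) \psi'(\gamma'_{kd}) - (\gamma - 1) \sum_{d'} \psi'(\hat{\gamma}'_k) \\
&\quad - \psi(\gamma'_{kd}) + \psi(\hat{\gamma}'_k) - (\gamma'_{kd} - 1) \psi'(\gamma'_{kd}) + \sum_{d'} (\gamma'_{kd'} - 1) \psi'(\hat{\gamma}'_k) - \psi(\hat{\gamma}'_k) + \psi(\gamma'_{kd}) \\
&= \sum_{m} \sum_{n} I(X_{mn} = d)\, \phi_{mnk}\, \psi'(\gamma'_{kd}) - \sum_{m} \sum_{n} \phi_{mnk}\, \psi'(\hat{\gamma}'_k) + (\gamma - \gamma'_{kd}) \psi'(\gamma'_{kd}) - \sum_{d'} (\gamma - \gamma'_{kd'}) \psi'(\hat{\gamma}'_k) \\
&= \psi'(\gamma'_{kd}) \Big\{ \sum_{m} \sum_{n} I(X_{mn} = d)\, \phi_{mnk} + \gamma - \gamma'_{kd} \Big\} - \psi'(\hat{\gamma}'_k) \sum_{d'} \Big\{ \sum_{m} \sum_{n} I(X_{mn} = d')\, \phi_{mnk} + \gamma - \gamma'_{kd'} \Big\} \\
\therefore\ \gamma'_{kd} &= \gamma + \sum_{m} \sum_{n} I(X_{mn} = d)\, \phi_{mnk}
\end{aligned}
\tag{28}
$$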
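The update of Eq. (28) adds expected word counts to the prior parameter $\gamma$. A minimal sketch, assuming word ids are 0-based integers and the responsibilities are stored as nested lists `phi[m][n][k]`; both the data layout and the function name are assumptions made for illustration.

```python
def update_gamma_prime(gamma, phi, docs, vocab_size):
    """gamma'_kd = gamma + sum_m sum_n I(X_mn = d) * phi_mnk.

    docs[m][n] is the word id X_mn; phi[m][n][k] is the responsibility phi_mnk.
    """
    num_topics = len(phi[0][0])
    gamma_prime = [[gamma] * vocab_size for _ in range(num_topics)]
    for doc, doc_phi in zip(docs, phi):
        for word, token_phi in zip(doc, doc_phi):
            # Each token contributes its responsibility to the entry of its own word.
            for k, weight in enumerate(token_phi):
                gamma_prime[k][word] += weight
    return gamma_prime
```

Looping over tokens rather than over the vocabulary avoids evaluating the indicator $I(X_{mn} = d)$ for every $d$.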


3.5 Update q(Vk)

Using $\partial p_k / \partial V_k = p_k / V_k = \prod_{j=1}^{k-1} (1 - V_j)$ and $\partial p_{\hat{k}} / \partial V_k = -p_{\hat{k}} / (1 - V_k)$ for $\hat{k} > k$, where $p_k = V_k \prod_{j=1}^{k-1} (1 - V_j)$, we obtain

$$
\begin{aligned}
\frac{\partial L}{\partial V_k}
&= -\frac{\alpha - 1}{1 - V_k} - \beta \prod_{j=1}^{k-1} (1 - V_j) \Big[ \sum_{m=1}^{M} \left\{ \mu_{mk} - \psi(a_{mk}) + \ln b_{mk} \right\} + M \psi(\beta p_k) \Big] \\
&\quad + \frac{1}{1 - V_k} \sum_{\hat{k}=k+1}^{T} \beta p_{\hat{k}} \Big[ \sum_{m=1}^{M} \left\{ \mu_{m\hat{k}} - \psi(a_{m\hat{k}}) + \ln b_{m\hat{k}} \right\} + M \psi(\beta p_{\hat{k}}) \Big] \\
&= -\frac{\alpha - 1}{1 - V_k} - \frac{\beta p_k}{V_k} \Big[ \sum_{m=1}^{M} \left\{ \mu_{mk} - \psi(a_{mk}) + \ln b_{mk} \right\} + M \psi(\beta p_k) \Big] \\
&\quad + \sum_{j=k+1}^{T} \frac{\beta p_j}{1 - V_k} \Big[ \sum_{m=1}^{M} \left\{ \mu_{mj} - \psi(a_{mj}) + \ln b_{mj} \right\} + M \psi(\beta p_j) \Big]
\end{aligned}
\tag{29}
$$

The terms with $j > k$ carry a plus sign because increasing $V_k$ shrinks every later weight $p_j$.

I think that Vk on the second line of Eq. (24) in the original paper is not required.
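The gradient with respect to $V_k$ rests on the derivatives of the stick-breaking weights, $\partial p_k / \partial V_k = p_k / V_k$ and $\partial p_j / \partial V_k = -p_j / (1 - V_k)$ for $j > k$. The sketch below compares these expressions with central finite differences; indices are 0-based and all function names are illustrative assumptions.

```python
def stick_weights(V):
    """p_k = V_k * prod_{j<k} (1 - V_j)."""
    weights, remaining = [], 1.0
    for v in V:
        weights.append(v * remaining)
        remaining *= 1.0 - v
    return weights

def dp_dV(V, j, k):
    """Analytic derivative of p_j with respect to V_k (0-based indices)."""
    p = stick_weights(V)
    if j < k:
        return 0.0                 # earlier weights do not depend on V_k
    if j == k:
        return p[k] / V[k]         # p_k is linear in V_k
    return -p[j] / (1.0 - V[k])    # later weights contain the factor (1 - V_k)

def dp_dV_numeric(V, j, k, eps=1e-6):
    """Central finite-difference estimate of the same derivative."""
    up, down = list(V), list(V)
    up[k] += eps
    down[k] -= eps
    return (stick_weights(up)[j] - stick_weights(down)[j]) / (2.0 * eps)
```

Because each $p_j$ is linear in any single $V_k$, the finite difference agrees with the analytic value up to rounding, which makes this a convenient check on the signs of the individual terms in the gradient.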

3.6 Update q(K)

With respect to K, we maximize the following function:

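$$
L(K) = -\frac{M}{2} \ln |K| - \frac{1}{2} \sum_{m=1}^{M} \sum_{k=1}^{T} v_{mk} K^{-1}_{k:k} - \frac{1}{2} \sum_{m=1}^{M} (\mu_m - m)^{\top} K^{-1} (\mu_m - m),
\tag{30}
$$

where the last term is equal to $-\frac{1}{2} \sum_{m=1}^{M} \sum_{k=1}^{T} \sum_{j=1}^{T} (\mu_{mk} - m_k)(\mu_{mj} - m_j) K^{-1}_{k:j}$.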

The derivative of the first term of the right-hand side of Eq. (30) is obtained from the following identity (cf. Eq. (51) of The Matrix Cookbook¹):

$$
\frac{\partial \ln |K|}{\partial K} = K^{-1}.
\tag{31}
$$

For the second term of the right-hand side of Eq. (30), it holds that $\sum_{k} v_{mk} K^{-1}_{k:k} = \mathrm{Tr}[K^{-1} \mathrm{diag}(v_m)]$, where $\mathrm{diag}(v_m)$ is a diagonal matrix whose $k$th diagonal entry is $v_{mk}$. By using the following identity (cf. Eq. (16) in Old and New Matrix Algebra Useful for Statistics²):

$$
\frac{\partial\, \mathrm{Tr}[A \Sigma^{-1} B]}{\partial \Sigma} = -\Sigma^{-1} B A \Sigma^{-1},
\tag{32}
$$

¹ http://orion.uwaterloo.ca/ hwolkowi/matrixcookbook.pdf
² http://research.microsoft.com/en-us/um/people/minka/papers/matrix/minka-matrix.pdf


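we obtain $\frac{\partial \sum_{m} \sum_{k} v_{mk} K^{-1}_{k:k}}{\partial K} = -K^{-1} \left\{ \sum_{m} \mathrm{diag}(v_m) \right\} K^{-1}$.

For the last term in Eq. (30), it holds that

$$
(\mu_m - m)^{\top} K^{-1} (\mu_m - m) = \mathrm{Tr}\left[ (\mu_m - m)^{\top} K^{-1} (\mu_m - m) \right].
\tag{33}
$$

Therefore, by using Eq. (32), we obtain $\frac{\partial (\mu_m - m)^{\top} K^{-1} (\mu_m - m)}{\partial K} = -K^{-1} (\mu_m - m)(\mu_m - m)^{\top} K^{-1}$.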

Consequently, we have

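$$
\frac{\partial L(K)}{\partial K} = -\frac{M}{2} K^{-1} + \frac{1}{2} K^{-1} \Big\{ \sum_{m} \mathrm{diag}(v_m) \Big\} K^{-1} + \frac{1}{2} K^{-1} \sum_{m} \left\{ (\mu_m - m)(\mu_m - m)^{\top} \right\} K^{-1}.
\tag{34}
$$

$\frac{\partial L(K)}{\partial K} = 0$ holds when

$$
K^{-1} = \frac{1}{M} K^{-1} \sum_{m} \left\{ \mathrm{diag}(v_m) + (\mu_m - m)(\mu_m - m)^{\top} \right\} K^{-1}.
\tag{35}
$$

By multiplying both sides of the above equation by $K$ from the left and from the right, we obtain

$$
K = \frac{1}{M} \sum_{m} \left\{ \mathrm{diag}(v_m) + (\mu_m - m)(\mu_m - m)^{\top} \right\}.
\tag{36}
$$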

This derivation is identical to that of CTM [1].

3.7 Update q(m)

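The mean vector enters $L$ only through the quadratic term of Eq. (11), which sums over all $M$ documents. Hence

$$
\frac{\partial L}{\partial m_k} = \sum_{m=1}^{M} \sum_{j=1}^{T} (\mu_{mj} - m_j) K^{-1}_{k:j},
\qquad \therefore\ m_k = \frac{1}{M} \sum_{m=1}^{M} \mu_{mk},
\tag{37}
$$

where the sums over $m$ run over the documents and $m_k$ denotes the $k$th entry of the mean vector.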
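The closed-form updates of the Gaussian mean and covariance can be sketched together. This sketch assumes that the mean is the average of the variational means $\mu_m$ over the $M$ documents and that $K$ follows Eq. (36); the function name and the nested-list layout `mu[m][k]`, `v[m][k]` are illustrative assumptions.

```python
def update_mean_and_cov(mu, v):
    """Return (mean, K) from variational means mu[m][k] and variances v[m][k]."""
    M, T = len(mu), len(mu[0])
    # Mean vector: average of the variational means over documents.
    mean = [sum(mu[m][k] for m in range(M)) / M for k in range(T)]
    # Eq. (36): K = (1/M) * sum_m { diag(v_m) + (mu_m - mean)(mu_m - mean)^T }.
    K = [[0.0] * T for _ in range(T)]
    for m in range(M):
        diff = [mu[m][k] - mean[k] for k in range(T)]
        for k in range(T):
            K[k][k] += v[m][k] / M
            for j in range(T):
                K[k][j] += diff[k] * diff[j] / M
    return mean, K
```

With `mu = [[0.0, 2.0], [2.0, 0.0]]` and unit variances, the function returns the mean `[1.0, 1.0]` and `K = [[2.0, -1.0], [-1.0, 2.0]]`.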

3.8 Update q(α)

With respect to α, we maximize the following function:

$$
L(\alpha) = T \ln \Gamma(\alpha + 1) - T \ln \Gamma(\alpha) + (\alpha - 1) \sum_{k=1}^{T} \ln(1 - V_k)
\tag{38}
$$

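We use the following bound, which holds with equality at $x = \hat{x}$ (cf. Eqs. (120), (121), and (122) in Estimating a Dirichlet distribution³):

$$
\frac{\Gamma(n + x)}{\Gamma(x)} \geq c\, x^{a} \quad \text{if } n \geq 1
\tag{39}
$$

$$
a = \left\{ \psi(n + \hat{x}) - \psi(\hat{x}) \right\} \hat{x}
\tag{40}
$$

$$
c = \frac{\Gamma(n + \hat{x})}{\Gamma(\hat{x})}\, \hat{x}^{-a}
\tag{41}
$$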

Then we obtain:

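$$
L(\alpha) \geq T \left\{ \psi(\hat{\alpha} + 1) - \psi(\hat{\alpha}) \right\} \hat{\alpha} \ln \alpha + (\alpha - 1) \sum_{k=1}^{T} \ln(1 - V_k) + \mathrm{const}.
\tag{42}
$$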

We maximize this lower bound, which we denote as L(α).

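$$
\frac{\partial L(\alpha)}{\partial \alpha} = \frac{1}{\alpha}\, T \left\{ \psi(\hat{\alpha} + 1) - \psi(\hat{\alpha}) \right\} \hat{\alpha} + \sum_{k=1}^{T} \ln(1 - V_k)
\tag{43}
$$

$$
\therefore\ \alpha = \hat{\alpha} \cdot \frac{T \left\{ \psi(\hat{\alpha} + 1) - \psi(\hat{\alpha}) \right\}}{-\sum_{k=1}^{T} \ln(1 - V_k)}
\tag{44}
$$

Here $\hat{\alpha}$ is the current value of $\alpha$.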

³ http://research.microsoft.com/en-us/um/people/minka/papers/dirichlet/


This is a multiplicative update.

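When we apply a Gamma prior $p(\alpha) = \frac{b_0^{a_0}}{\Gamma(a_0)} \alpha^{a_0 - 1} e^{-b_0 \alpha}$ to $\alpha$, we have the following result:

$$
\frac{\partial L(\alpha)}{\partial \alpha} = \frac{1}{\alpha}\, T \left\{ \psi(\hat{\alpha} + 1) - \psi(\hat{\alpha}) \right\} \hat{\alpha} + \sum_{k=1}^{T} \ln(1 - V_k) + (a_0 - 1) \frac{1}{\alpha} - b_0
\tag{45}
$$

$$
\therefore\ \alpha = \frac{a_0 - 1 + \hat{\alpha}\, T \left\{ \psi(\hat{\alpha} + 1) - \psi(\hat{\alpha}) \right\}}{b_0 - \sum_{k=1}^{T} \ln(1 - V_k)}
\tag{46}
$$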
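Because the digamma function satisfies the recurrence $\psi(x + 1) - \psi(x) = 1/x$, the factor $\hat{\alpha} \{ \psi(\hat{\alpha} + 1) - \psi(\hat{\alpha}) \}$ equals 1, so the stationary point of the $\alpha$ objective can be written without reference to the current value $\hat{\alpha}$. The sketch below uses this simplification; the function name is hypothetical, and it assumes every $V_k < 1$ so that the logarithms are finite.

```python
import math

def update_alpha(V, a0=None, b0=None):
    """Stationary point of the alpha objective, using psi(x + 1) - psi(x) = 1/x.

    Without a prior: alpha = T / (-sum_k ln(1 - V_k)).
    With a Gamma(a0, b0) prior: alpha = (a0 - 1 + T) / (b0 - sum_k ln(1 - V_k)).
    """
    T = len(V)
    neg_log_rest = -sum(math.log(1.0 - v_k) for v_k in V)
    if a0 is None or b0 is None:
        return T / neg_log_rest
    return (a0 - 1.0 + T) / (b0 + neg_log_rest)
```

The same value follows directly from $\Gamma(\alpha + 1)/\Gamma(\alpha) = \alpha$, which turns the first two terms of $L(\alpha)$ into $T \ln \alpha$.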

3.9 Update q(β)

With respect to β, we maximize the following function L(β):

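$$
\begin{aligned}
L(\beta) &= -\sum_{k=1}^{T} \Big\{ \beta V_k \prod_{j=1}^{k-1} (1 - V_j) \Big\} \sum_{m=1}^{M} \mu_{mk} - M \sum_{k=1}^{T} \ln \Gamma\Big( \beta V_k \prod_{j=1}^{k-1} (1 - V_j) \Big) \\
&\quad + \sum_{k=1}^{T} \Big\{ \beta V_k \prod_{j=1}^{k-1} (1 - V_j) \Big\} \sum_{m=1}^{M} \left\{ \psi(a_{mk}) - \ln b_{mk} \right\} \\
&= -\sum_{k=1}^{T} \beta p_k \sum_{m=1}^{M} \mu_{mk} - M \sum_{k=1}^{T} \ln \Gamma(\beta p_k) + \sum_{k=1}^{T} \beta p_k \sum_{m=1}^{M} \left\{ \psi(a_{mk}) - \ln b_{mk} \right\}
\end{aligned}
\tag{47}
$$

The factor $M$ appears because $\ln \Gamma(\beta p_k)$ enters the bound once per document.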

The first and the second derivatives are obtained as follows:

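$$
\begin{aligned}
\frac{\partial L(\beta)}{\partial \beta} &= -\sum_{k=1}^{T} p_k \Big[ M \psi(\beta p_k) + \sum_{m=1}^{M} \left\{ \mu_{mk} - \psi(a_{mk}) + \ln b_{mk} \right\} \Big] \\
\frac{\partial^2 L(\beta)}{\partial \beta^2} &= -M \sum_{k=1}^{T} p_k^2\, \psi'(\beta p_k)
\end{aligned}
\tag{48}
$$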

We can use Newton’s method to update β.
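A Newton iteration for $\beta$ can be sketched with the standard library alone. Everything here is an illustration under stated assumptions: `digamma` and `trigamma` are hand-written approximations (recurrence plus asymptotic series), `c[k]` stands for the precomputed sum $\sum_m \{ \mu_{mk} - \psi(a_{mk}) + \ln b_{mk} \}$, the $\psi$ term is assumed to carry a factor $M$ (`num_docs`) because $\ln \Gamma(\beta p_k)$ appears once per document, and the step halving that keeps $\beta$ positive is a safeguard added here rather than part of the note.

```python
import math

def digamma(x):
    """psi(x) for x > 0 via psi(x) = psi(x + 1) - 1/x and an asymptotic series."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    return result + math.log(x) - 0.5 / x - inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252))

def trigamma(x):
    """psi'(x) for x > 0 via psi'(x) = psi'(x + 1) + 1/x^2 and an asymptotic series."""
    result = 0.0
    while x < 6.0:
        result += 1.0 / (x * x)
        x += 1.0
    inv2 = 1.0 / (x * x)
    return result + 1.0 / x + 0.5 * inv2 + (inv2 / x) * (1.0 / 6 - inv2 * (1.0 / 30 - inv2 / 42))

def beta_gradient(beta, p, c, num_docs):
    """First derivative of L(beta); c[k] = sum_m {mu_mk - psi(a_mk) + ln b_mk}."""
    return -sum(p_k * (num_docs * digamma(beta * p_k) + c_k) for p_k, c_k in zip(p, c))

def newton_beta(p, c, num_docs, beta=1.0, iters=100, tol=1e-12):
    """Maximize L(beta) by Newton's method, halving steps that would leave beta > 0."""
    for _ in range(iters):
        grad = beta_gradient(beta, p, c, num_docs)
        hess = -num_docs * sum(p_k * p_k * trigamma(beta * p_k) for p_k in p)
        step = grad / hess
        new_beta = beta - step
        while new_beta <= 0.0:  # keep beta strictly positive
            step *= 0.5
            new_beta = beta - step
        if abs(new_beta - beta) < tol:
            return new_beta
        beta = new_beta
    return beta
```

Since $L(\beta)$ is concave in $\beta$ (the second derivative is negative for all $\beta > 0$), the iteration has a unique stationary point to converge to; a Gamma prior would add $(c_0 - 1)/\beta - d_0$ to the gradient and $-(c_0 - 1)/\beta^2$ to the second derivative.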

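When we apply a Gamma prior $p(\beta) = \frac{d_0^{c_0}}{\Gamma(c_0)} \beta^{c_0 - 1} e^{-d_0 \beta}$ to $\beta$, we have the following result:

$$
\begin{aligned}
\frac{\partial L(\beta)}{\partial \beta} &= -\sum_{k=1}^{T} p_k \Big[ M \psi(\beta p_k) + \sum_{m=1}^{M} \left\{ \mu_{mk} - \psi(a_{mk}) + \ln b_{mk} \right\} \Big] + (c_0 - 1) \frac{1}{\beta} - d_0 \\
\frac{\partial^2 L(\beta)}{\partial \beta^2} &= -M \sum_{k=1}^{T} p_k^2\, \psi'(\beta p_k) - (c_0 - 1) \frac{1}{\beta^2}
\end{aligned}
\tag{49}
$$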

References

[1] David M. Blei and John D. Lafferty. Correlated topic models. In NIPS, 2005.

[2] John Paisley, Chong Wang, and David Blei. The discrete infinite logistic normal distribution for mixed-membership modeling. In AISTATS, 2011.
