A Note on the Derivation of the Variational Inference Updates for
DILN [2]
Tomonari MASADA @ Nagasaki University
August 30, 2013
1
Let $M$, $N_m$, and $T$ be the number of documents, the number of word tokens appearing in the $m$th document, and the truncation level, respectively. $X_{mn}$ denotes the word appearing as the $n$th token of the $m$th document, and $C_{mn}$ denotes the latent topic for the $n$th token of the $m$th document. The definitions of other symbols can be found in the original paper [2].
The joint distribution can be written as follows:
$$
p(X,Z,C,w,\eta,V,\alpha,\beta,\boldsymbol{m},K)
= p(X|C,\eta)\,p(Z|V,w,\beta)\,p(C|Z)\,p(w|\boldsymbol{m},K)\,p(\eta)\,p(V|\alpha)\,p(\alpha)\,p(\beta)\,p(\boldsymbol{m})\,p(K). \tag{1}
$$
A lower bound of the log evidence can be obtained by using Jensen’s inequality as follows:
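$$
\begin{aligned}
\ln p(X)
&= \ln \int \sum_C p(X,Z,C,w,\eta,V,\alpha,\beta,\boldsymbol{m},K)\,dZ\,dw\,d\eta\,dV\,d\alpha\,d\beta\,d\boldsymbol{m}\,dK \\
&= \ln \int \sum_C q(Z)q(C)q(w)q(\eta)q(V)q(\alpha)q(\beta)q(\boldsymbol{m})q(K) \\
&\qquad \cdot \frac{p(X|C,\eta)p(Z|V,w,\beta)p(C|Z)p(w|\boldsymbol{m},K)p(\eta)p(V|\alpha)p(\alpha)p(\beta)p(\boldsymbol{m})p(K)}{q(Z)q(C)q(w)q(\eta)q(V)q(\alpha)q(\beta)q(\boldsymbol{m})q(K)}\,dZ\,dw\,d\eta\,dV\,d\alpha\,d\beta\,d\boldsymbol{m}\,dK \\
&\geq \int \sum_C q(Z)q(C)q(w)q(\eta)q(V)q(\alpha)q(\beta)q(\boldsymbol{m})q(K) \\
&\qquad \cdot \ln \frac{p(X|C,\eta)p(Z|V,w,\beta)p(C|Z)p(w|\boldsymbol{m},K)p(\eta)p(V|\alpha)p(\alpha)p(\beta)p(\boldsymbol{m})p(K)}{q(Z)q(C)q(w)q(\eta)q(V)q(\alpha)q(\beta)q(\boldsymbol{m})q(K)}\,dZ\,dw\,d\eta\,dV\,d\alpha\,d\beta\,d\boldsymbol{m}\,dK \\
&= \int \sum_C q(C)q(\eta)\ln p(X|C,\eta)\,d\eta + \int q(Z)q(V)q(w)q(\beta)\ln p(Z|V,w,\beta)\,dZ\,dV\,dw\,d\beta \\
&\quad + \int \sum_C q(C)q(Z)\ln p(C|Z)\,dZ + \int q(w)q(\boldsymbol{m})q(K)\ln p(w|\boldsymbol{m},K)\,dw\,d\boldsymbol{m}\,dK \\
&\quad + \int q(\eta)\ln p(\eta)\,d\eta + \int q(V)\ln p(V|\alpha)\,dV + \int q(\alpha)\ln p(\alpha)\,d\alpha \\
&\quad + \int q(\beta)\ln p(\beta)\,d\beta + \int q(\boldsymbol{m})\ln p(\boldsymbol{m})\,d\boldsymbol{m} + \int q(K)\ln p(K)\,dK \\
&\quad - \int q(Z)\ln q(Z)\,dZ - \sum_C q(C)\ln q(C) - \int q(w)\ln q(w)\,dw \\
&\quad - \int q(\eta)\ln q(\eta)\,d\eta - \int q(V)\ln q(V)\,dV - \int q(\alpha)\ln q(\alpha)\,d\alpha \\
&\quad - \int q(\beta)\ln q(\beta)\,d\beta - \int q(\boldsymbol{m})\ln q(\boldsymbol{m})\,d\boldsymbol{m} - \int q(K)\ln q(K)\,dK.
\end{aligned}
\tag{2}
$$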
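Since $q(V) = \delta_V$, $q(\boldsymbol{m}) = \delta_{\boldsymbol{m}}$, $q(K) = \delta_K$, $q(\alpha) = \delta_\alpha$, and $q(\beta) = \delta_\beta$, we can rewrite the right hand side of Eq. (2) as follows:
$$
\begin{aligned}
\ln p(X) \geq{}& \int \sum_C q(C)q(\eta)\ln p(X|C,\eta)\,d\eta + \int q(Z)q(w)\ln p(Z|V,w,\beta)\,dZ\,dw \\
& + \int \sum_C q(C)q(Z)\ln p(C|Z)\,dZ + \int q(w)\ln p(w|\boldsymbol{m},K)\,dw + \int q(\eta)\ln p(\eta)\,d\eta + \ln p(V|\alpha) \\
& + \ln p(\alpha) + \ln p(\beta) + \ln p(\boldsymbol{m}) + \ln p(K) \\
& - \int q(Z)\ln q(Z)\,dZ - \sum_C q(C)\ln q(C) - \int q(w)\ln q(w)\,dw - \int q(\eta)\ln q(\eta)\,d\eta.
\end{aligned}
\tag{3}
$$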
2
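We examine each term of the right hand side of Eq. (3).
$$
\begin{aligned}
\int \sum_C q(C)q(\eta)\ln p(X|C,\eta)\,d\eta
&= \sum_{m=1}^{M}\sum_{n=1}^{N_m}\sum_{k=1}^{T}\phi_{mnk}\int \frac{\Gamma(\sum_d \gamma'_{kd})}{\prod_d \Gamma(\gamma'_{kd})}\prod_{d=1}^{D}\eta_{kd}^{\gamma'_{kd}-1}\ln\eta_{kX_{mn}}\,d\eta_k \\
&= \sum_{m=1}^{M}\sum_{n=1}^{N_m}\sum_{k=1}^{T}\phi_{mnk}\bigl\{\psi(\gamma'_{kX_{mn}}) - \psi(\hat{\gamma}'_k)\bigr\},
\end{aligned}
\tag{4}
$$
where $\hat{\gamma}'_k \equiv \sum_{d=1}^{D}\gamma'_{kd}$.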
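$$
\begin{aligned}
&\int q(Z)q(w)\ln p(Z|V,w,\beta)\,dZ\,dw \\
&= \sum_m\sum_k \int q(Z_{mk})q(w_{mk})\ln\left\{\frac{(e^{-w_{mk}})^{\beta p_k}}{\Gamma(\beta p_k)}Z_{mk}^{\beta p_k-1}e^{-e^{-w_{mk}}Z_{mk}}\right\}dZ_{mk}\,dw_{mk} \\
&= -\sum_k \beta p_k\sum_m \int q(w_{mk})\,w_{mk}\,dw_{mk} - \sum_k \ln\Gamma(\beta p_k) \\
&\quad + \sum_k(\beta p_k - 1)\sum_m \int q(Z_{mk})\ln Z_{mk}\,dZ_{mk} - \sum_m\sum_k \int q(Z_{mk})q(w_{mk})\,e^{-w_{mk}}Z_{mk}\,dZ_{mk}\,dw_{mk},
\end{aligned}
\tag{5}
$$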
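where
$$
\begin{aligned}
\int q(w_{mk})\,e^{-w_{mk}}\,dw_{mk}
&= \int \frac{1}{\sqrt{2\pi v_{mk}}}\exp\left\{-\frac{(w_{mk}-\mu_{mk})^2}{2v_{mk}} - w_{mk}\right\}dw_{mk} \\
&= \int \frac{1}{\sqrt{2\pi v_{mk}}}\exp\left(-\frac{w_{mk}^2 - 2\mu_{mk}w_{mk} + 2v_{mk}w_{mk} + \mu_{mk}^2}{2v_{mk}}\right)dw_{mk} \\
&= \int \frac{1}{\sqrt{2\pi v_{mk}}}\exp\left\{-\frac{(w_{mk}-\mu_{mk}+v_{mk})^2}{2v_{mk}} - \mu_{mk} + \frac{v_{mk}}{2}\right\}dw_{mk}
= \exp\left(-\mu_{mk} + \frac{v_{mk}}{2}\right).
\end{aligned}
\tag{6}
$$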
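The closed form in Eq. (6) can be checked numerically. The following is a minimal sketch, assuming a plain trapezoid rule over a wide interval around the mean; the function name `expected_exp_neg_w` and the grid settings are illustrative choices and not part of the original derivation.

```python
import math

def expected_exp_neg_w(mu, v, n=20001, width=12.0):
    """Trapezoid estimate of E[exp(-w)] for w ~ N(mu, v), where v is a variance."""
    sd = math.sqrt(v)
    lo, hi = mu - width * sd, mu + width * sd
    h = (hi - lo) / (n - 1)
    total = 0.0
    for i in range(n):
        w = lo + i * h
        density = math.exp(-(w - mu) ** 2 / (2.0 * v)) / math.sqrt(2.0 * math.pi * v)
        weight = 0.5 if i in (0, n - 1) else 1.0  # trapezoid end-point weights
        total += weight * density * math.exp(-w)
    return total * h

def closed_form(mu, v):
    """Right hand side of Eq. (6): exp(-mu + v / 2)."""
    return math.exp(-mu + 0.5 * v)
```

For moderate values of the mean and variance the two functions should agree to many digits, which gives a quick sanity check on the completed square.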
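Note that $v_{mk}$ is a variance. With $q(Z_{mk})$ a Gamma distribution with shape $a_{mk}$ and rate $b_{mk}$, we have $\int q(Z_{mk})\ln Z_{mk}\,dZ_{mk} = \psi(a_{mk}) - \ln b_{mk}$ and $\int q(Z_{mk})Z_{mk}\,dZ_{mk} = a_{mk}/b_{mk}$. Consequently, we have
$$
\begin{aligned}
\int q(Z)q(w)\ln p(Z|V,w,\beta)\,dZ\,dw
={}& -\sum_k \beta p_k\sum_m \mu_{mk} - \sum_k \ln\Gamma(\beta p_k) + \sum_k(\beta p_k - 1)\sum_m\bigl\{\psi(a_{mk}) - \ln b_{mk}\bigr\} \\
& - \sum_m\sum_k \frac{a_{mk}}{b_{mk}}\exp\left(-\mu_{mk} + \frac{v_{mk}}{2}\right).
\end{aligned}
\tag{7}
$$
Note that $p_k \equiv V_k\prod_{j=1}^{k-1}(1 - V_j)$.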
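$$
\begin{aligned}
\int \sum_C q(C)q(Z)\ln p(C|Z)\,dZ
&= \sum_m\sum_n \int q(Z_m)\sum_k \phi_{mnk}\ln\frac{Z_{mk}}{\sum_{j=1}^{T}Z_{mj}}\,dZ_m \\
&= \sum_m\sum_k\Bigl(\sum_n \phi_{mnk}\Bigr)\int q(Z_{mk})\ln Z_{mk}\,dZ_{mk} - \sum_m N_m \int q(Z_m)\ln\Bigl(\sum_{j=1}^{T}Z_{mj}\Bigr)dZ_m.
\end{aligned}
\tag{8}
$$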
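Since $\ln x \leq \frac{x}{\xi} - 1 + \ln\xi$ for any $\xi > 0$,
$$
\int q(Z_m)\ln\Bigl(\sum_{j=1}^{T}Z_{mj}\Bigr)dZ_m
\leq \int q(Z_m)\Bigl(\frac{\sum_j Z_{mj}}{\xi_m} - 1 + \ln\xi_m\Bigr)dZ_m
= \frac{1}{\xi_m}\sum_k \frac{a_{mk}}{b_{mk}} - 1 + \ln\xi_m.
\tag{9}
$$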
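Therefore, substituting the upper bound of Eq. (9) into the last term of Eq. (8) gives the lower bound
$$
\begin{aligned}
\int \sum_C q(C)q(Z)\ln p(C|Z)\,dZ
\geq{}& \sum_m\sum_k\Bigl(\sum_n \phi_{mnk}\Bigr)\bigl\{\psi(a_{mk}) - \ln b_{mk}\bigr\} - \sum_m \frac{N_m}{\xi_m}\sum_k \frac{a_{mk}}{b_{mk}} \\
& + \sum_m N_m - \sum_m N_m\ln\xi_m.
\end{aligned}
\tag{10}
$$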
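$$
\begin{aligned}
\int q(w)\ln p(w|\boldsymbol{m},K)\,dw
&= \sum_m \int q(\boldsymbol{w}_m)\ln p(\boldsymbol{w}_m|\boldsymbol{m},K)\,d\boldsymbol{w}_m \\
&= \sum_m\Bigl[-\frac{D}{2}\ln 2\pi - \frac{1}{2}\ln|K| - \frac{1}{2}\int q(\boldsymbol{w}_m)(\boldsymbol{w}_m - \boldsymbol{m})^T K^{-1}(\boldsymbol{w}_m - \boldsymbol{m})\,d\boldsymbol{w}_m\Bigr] \\
&= -\frac{MD\ln 2\pi}{2} - \frac{M\ln|K|}{2} - \frac{1}{2}\sum_m\Bigl\{\sum_k(\mu_{mk}^2 + v_{mk})K^{-1}_{k:k} - 2\sum_k m_k\mu_{mk}K^{-1}_{k:k} + \sum_k m_k^2 K^{-1}_{k:k} \\
&\qquad + \sum_k\sum_{j\neq k}(\mu_{mk}\mu_{mj} - 2\mu_{mk}m_j + m_k m_j)K^{-1}_{k:j}\Bigr\} \\
&= -\frac{MD\ln 2\pi}{2} - \frac{M\ln|K|}{2} - \frac{1}{2}\sum_m\Bigl\{\sum_k v_{mk}K^{-1}_{k:k} + \sum_k\sum_j(\mu_{mk} - m_k)(\mu_{mj} - m_j)K^{-1}_{k:j}\Bigr\}
\end{aligned}
\tag{11}
$$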
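$$
\begin{aligned}
\int q(\eta)\ln p(\eta)\,d\eta
&= \sum_k \int \frac{\Gamma(\sum_d \gamma'_{kd})}{\prod_d \Gamma(\gamma'_{kd})}\prod_{d=1}^{D}\eta_{kd}^{\gamma'_{kd}-1}\Bigl\{\ln\Gamma(D\gamma) - D\ln\Gamma(\gamma) + \sum_{d'}(\gamma - 1)\ln\eta_{kd'}\Bigr\}d\eta_k \\
&= T\ln\Gamma(D\gamma) - TD\ln\Gamma(\gamma) + (\gamma - 1)\sum_k\sum_d\bigl\{\psi(\gamma'_{kd}) - \psi(\hat{\gamma}'_k)\bigr\}
\end{aligned}
\tag{12}
$$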
$$
\ln p(V|\alpha) = T\ln\Gamma(\alpha + 1) - T\ln\Gamma(\alpha) + (\alpha - 1)\sum_k \ln(1 - V_k) \tag{13}
$$
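$$
\int q(Z)\ln q(Z)\,dZ = -\sum_m\sum_k\bigl\{\ln\Gamma(a_{mk}) - (a_{mk} - 1)\psi(a_{mk}) - \ln b_{mk} + a_{mk}\bigr\} \tag{14}
$$
$$
\sum_C q(C)\ln q(C) = \sum_m\sum_n\sum_k \phi_{mnk}\ln\phi_{mnk} \tag{15}
$$
$$
\int q(w)\ln q(w)\,dw = -\frac{MT(1 + \ln 2\pi)}{2} - \sum_m\sum_k \frac{\ln v_{mk}}{2} \tag{16}
$$
$$
\int q(\eta)\ln q(\eta)\,d\eta = \sum_k\Bigl[\sum_d(\gamma'_{kd} - 1)\bigl\{\psi(\gamma'_{kd}) - \psi(\hat{\gamma}'_k)\bigr\} + \ln\Gamma(\hat{\gamma}'_k) - \sum_d \ln\Gamma(\gamma'_{kd})\Bigr] \tag{17}
$$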
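Consequently, we obtain a lower bound of the log evidence as follows:
$$
\begin{aligned}
\ln p(X) \geq{}& \sum_{m=1}^{M}\sum_{n=1}^{N_m}\sum_{k=1}^{T}\phi_{mnk}\bigl\{\psi(\gamma'_{kX_{mn}}) - \psi(\hat{\gamma}'_k)\bigr\} \\
& - \sum_{k=1}^{T}\Bigl\{\beta V_k\prod_{j=1}^{k-1}(1 - V_j)\Bigr\}\sum_{m=1}^{M}\mu_{mk} - \sum_{k=1}^{T}\ln\Gamma\Bigl(\beta V_k\prod_{j=1}^{k-1}(1 - V_j)\Bigr) \\
& + \sum_{k=1}^{T}\Bigl\{\beta V_k\prod_{j=1}^{k-1}(1 - V_j) - 1\Bigr\}\sum_{m=1}^{M}\bigl\{\psi(a_{mk}) - \ln b_{mk}\bigr\} - \sum_{m=1}^{M}\sum_{k=1}^{T}\frac{a_{mk}}{b_{mk}}\exp\Bigl(-\mu_{mk} + \frac{v_{mk}}{2}\Bigr) \\
& + \sum_{m=1}^{M}\sum_{k=1}^{T}\Bigl(\sum_{n=1}^{N_m}\phi_{mnk}\Bigr)\bigl\{\psi(a_{mk}) - \ln b_{mk}\bigr\} - \sum_{m=1}^{M}\frac{N_m}{\xi_m}\sum_{k=1}^{T}\frac{a_{mk}}{b_{mk}} + \sum_{m=1}^{M}N_m - \sum_{m=1}^{M}N_m\ln\xi_m \\
& - \frac{MD\ln 2\pi}{2} - \frac{M\ln|K|}{2} - \frac{1}{2}\sum_{m=1}^{M}\Bigl\{\sum_{k=1}^{T}v_{mk}K^{-1}_{k:k} + \sum_{k=1}^{T}\sum_{j=1}^{T}(\mu_{mk} - m_k)(\mu_{mj} - m_j)K^{-1}_{k:j}\Bigr\} \\
& + T\ln\Gamma(D\gamma) - TD\ln\Gamma(\gamma) + (\gamma - 1)\sum_{k=1}^{T}\sum_{d=1}^{D}\bigl\{\psi(\gamma'_{kd}) - \psi(\hat{\gamma}'_k)\bigr\} \\
& + T\ln\Gamma(\alpha + 1) - T\ln\Gamma(\alpha) + (\alpha - 1)\sum_{k=1}^{T}\ln(1 - V_k) \\
& + \sum_{m=1}^{M}\sum_{k=1}^{T}\bigl\{\ln\Gamma(a_{mk}) - (a_{mk} - 1)\psi(a_{mk}) - \ln b_{mk} + a_{mk}\bigr\} - \sum_{m=1}^{M}\sum_{n=1}^{N_m}\sum_{k=1}^{T}\phi_{mnk}\ln\phi_{mnk} \\
& + \frac{MT(1 + \ln 2\pi)}{2} + \sum_{m=1}^{M}\sum_{k=1}^{T}\frac{\ln v_{mk}}{2} \\
& - \sum_{k=1}^{T}\Bigl[\sum_{d=1}^{D}(\gamma'_{kd} - 1)\bigl\{\psi(\gamma'_{kd}) - \psi(\hat{\gamma}'_k)\bigr\} + \ln\Gamma(\hat{\gamma}'_k) - \sum_{d=1}^{D}\ln\Gamma(\gamma'_{kd})\Bigr] \\
& + \ln p(\alpha) + \ln p(\beta) + \ln p(\boldsymbol{m}) + \ln p(K).
\end{aligned}
\tag{18}
$$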
We assume that $p(\boldsymbol{m})$ and $p(K)$ are uniform distributions, and that $p(\alpha)$ and $p(\beta)$ are Gamma distributions.
3 Inference Algorithm
3.1 Update q(Cmn)
Let $L$ denote the right hand side of Eq. (18).
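$$
\frac{\partial L}{\partial\phi_{mnk}} = \psi(\gamma'_{kX_{mn}}) - \psi(\hat{\gamma}'_k) + \psi(a_{mk}) - \ln b_{mk} - \ln\phi_{mnk} - 1
$$
$$
\therefore\ \phi_{mnk} \propto \exp\bigl\{\psi(\gamma'_{kX_{mn}}) - \psi(\hat{\gamma}'_k) + \psi(a_{mk}) - \ln b_{mk}\bigr\} \tag{19}
$$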
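The update in Eq. (19) amounts to a softmax over topics for each token. Below is a minimal sketch in plain Python, assuming list-based storage; the helper `digamma` is a hand-written approximation (recurrence plus asymptotic series), and the names `update_phi`, `gamma_prime`, `a_m`, and `b_m` are illustrative and not taken from the original paper.

```python
import math

def digamma(x):
    """Approximate psi(x) for x > 0 via recurrence and an asymptotic series."""
    result = 0.0
    while x < 6.0:  # psi(x) = psi(x + 1) - 1/x
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    series = inv2 * (1.0 / 12.0 - inv2 * (1.0 / 120.0 - inv2 / 252.0))
    return result + math.log(x) - 0.5 / x - series

def update_phi(word, gamma_prime, a_m, b_m):
    """Return phi_{mn.} for one token of document m, following Eq. (19).

    word: vocabulary index X_mn; gamma_prime[k][d] = gamma'_{kd};
    a_m[k], b_m[k]: Gamma parameters of q(Z_mk).
    """
    scores = []
    for k in range(len(a_m)):
        gamma_hat = sum(gamma_prime[k])
        scores.append(digamma(gamma_prime[k][word]) - digamma(gamma_hat)
                      + digamma(a_m[k]) - math.log(b_m[k]))
    top = max(scores)  # subtract the maximum before exponentiating for stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]
```

Subtracting the largest score before exponentiating does not change the normalized result and avoids overflow when the scores are large in magnitude.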
3.2 Update q(Zmk)
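$$
\frac{\partial L}{\partial\xi_m} = \frac{N_m}{\xi_m^2}\sum_k \frac{a_{mk}}{b_{mk}} - \frac{N_m}{\xi_m}, \qquad \therefore\ \xi_m = \sum_k \frac{a_{mk}}{b_{mk}}. \tag{20}
$$
$$
\begin{aligned}
\frac{\partial L}{\partial b_{mk}} ={}& -\Bigl\{\beta V_k\prod_{j=1}^{k-1}(1 - V_j) - 1\Bigr\}\frac{1}{b_{mk}} + \frac{a_{mk}}{b_{mk}^2}\exp\Bigl(-\mu_{mk} + \frac{v_{mk}}{2}\Bigr) \\
& - \Bigl(\sum_{n=1}^{N_m}\phi_{mnk}\Bigr)\frac{1}{b_{mk}} + \frac{N_m}{\xi_m}\frac{a_{mk}}{b_{mk}^2} - \frac{1}{b_{mk}}
\end{aligned}
\tag{21}
$$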
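$\frac{\partial L}{\partial b_{mk}} = 0$ gives
$$
0 = -b_{mk}\Bigl\{\beta V_k\prod_{j=1}^{k-1}(1 - V_j) + \sum_{n=1}^{N_m}\phi_{mnk}\Bigr\} + a_{mk}\Bigl\{\exp\Bigl(-\mu_{mk} + \frac{v_{mk}}{2}\Bigr) + \frac{N_m}{\xi_m}\Bigr\}. \tag{22}
$$
Therefore,
$$
b_{mk} = a_{mk}\cdot\frac{\exp\bigl(-\mu_{mk} + \frac{v_{mk}}{2}\bigr) + \frac{N_m}{\xi_m}}{\beta V_k\prod_{j=1}^{k-1}(1 - V_j) + \sum_{n=1}^{N_m}\phi_{mnk}}. \tag{23}
$$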
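$$
\begin{aligned}
\frac{\partial L}{\partial a_{mk}}
={}& \Bigl\{\beta V_k\prod_{j=1}^{k-1}(1 - V_j) - 1\Bigr\}\psi'(a_{mk}) - \frac{1}{b_{mk}}\exp\Bigl(-\mu_{mk} + \frac{v_{mk}}{2}\Bigr) + \Bigl(\sum_{n=1}^{N_m}\phi_{mnk}\Bigr)\psi'(a_{mk}) - \frac{N_m}{\xi_m}\frac{1}{b_{mk}} \\
& - (a_{mk} - 1)\psi'(a_{mk}) + 1 \\
={}& \Bigl\{\beta V_k\prod_{j=1}^{k-1}(1 - V_j) + \sum_{n=1}^{N_m}\phi_{mnk} - a_{mk}\Bigr\}\psi'(a_{mk}) - \frac{1}{b_{mk}}\Bigl\{\exp\Bigl(-\mu_{mk} + \frac{v_{mk}}{2}\Bigr) + \frac{N_m}{\xi_m}\Bigr\} + 1
\end{aligned}
\tag{24}
$$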
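By using the result for $b_{mk}$, we obtain
$$
\begin{aligned}
\frac{\partial L}{\partial a_{mk}}
&= \Bigl\{\beta V_k\prod_{j=1}^{k-1}(1 - V_j) + \sum_{n=1}^{N_m}\phi_{mnk} - a_{mk}\Bigr\}\psi'(a_{mk}) - \frac{\beta V_k\prod_{j=1}^{k-1}(1 - V_j) + \sum_{n=1}^{N_m}\phi_{mnk}}{a_{mk}} + 1 \\
&= \Bigl\{\beta V_k\prod_{j=1}^{k-1}(1 - V_j) + \sum_{n=1}^{N_m}\phi_{mnk} - a_{mk}\Bigr\}\Bigl\{\psi'(a_{mk}) - \frac{1}{a_{mk}}\Bigr\}.
\end{aligned}
$$
Since $\psi'(a) > 1/a$ for every $a > 0$, the second factor never vanishes, and the derivative is zero only when the first factor is zero.
$$
\therefore\ a_{mk} = \beta V_k\prod_{j=1}^{k-1}(1 - V_j) + \sum_{n=1}^{N_m}\phi_{mnk}, \qquad b_{mk} = \exp\Bigl(-\mu_{mk} + \frac{v_{mk}}{2}\Bigr) + \frac{N_m}{\xi_m}. \tag{25}
$$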
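The closed-form updates in Eqs. (20) and (25) can be sketched for a single document as follows. This is an illustration under assumed list-based storage; the function names `stick_weights` and `update_gamma_params` and the argument layout are hypothetical and not taken from the original paper.

```python
import math

def stick_weights(V):
    """p_k = V_k * prod_{j<k} (1 - V_j) for k = 1..T."""
    weights, remaining = [], 1.0
    for v_k in V:
        weights.append(v_k * remaining)
        remaining *= 1.0 - v_k
    return weights

def update_gamma_params(phi_m, mu_m, v_m, p, beta, xi_m):
    """One pass of Eq. (25) followed by Eq. (20) for document m.

    phi_m[n][k]: token-topic responsibilities; mu_m[k], v_m[k]: mean and
    variance of q(w_mk); p[k]: stick weights; xi_m: current auxiliary value.
    Returns (a_m, b_m, new_xi_m).
    """
    n_tokens = len(phi_m)  # N_m
    T = len(p)
    a_m = [beta * p[k] + sum(row[k] for row in phi_m) for k in range(T)]
    b_m = [math.exp(-mu_m[k] + 0.5 * v_m[k]) + n_tokens / xi_m for k in range(T)]
    new_xi_m = sum(a / b for a, b in zip(a_m, b_m))
    return a_m, b_m, new_xi_m
```

Because $b_{mk}$ depends on $\xi_m$ and $\xi_m$ depends on $a_{mk}/b_{mk}$, one would typically alternate these two steps a few times per document; the number of alternations is a design choice that the note does not specify.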
3.3 Update q(wmk)
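$$
\frac{\partial L}{\partial\mu_{mk}} = \frac{a_{mk}}{b_{mk}}\exp\Bigl(-\mu_{mk} + \frac{v_{mk}}{2}\Bigr) - \Bigl\{\beta V_k\prod_{j=1}^{k-1}(1 - V_j)\Bigr\} - \sum_{j=1}^{T}(\mu_{mj} - m_j)K^{-1}_{k:j} \tag{26}
$$
$$
\frac{\partial L}{\partial v_{mk}} = \frac{1}{2}\Bigl\{-\frac{a_{mk}}{b_{mk}}\exp\Bigl(-\mu_{mk} + \frac{v_{mk}}{2}\Bigr) - K^{-1}_{k:k} + \frac{1}{v_{mk}}\Bigr\} \tag{27}
$$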
The plus and minus signs on the right hand side of the second line of Eq. (22) in the original paper [2] differ from those given above. We may use L-BFGS for updating $\mu_{mk}$ and $v_{mk}$.
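Before handing Eqs. (26) and (27) to a quasi-Newton optimizer, it can help to compare them against finite differences of the terms of Eq. (18) that depend on $\mu_{mk}$ and $v_{mk}$. The sketch below does this for a single document; the function names, the argument `beta_p` (holding $\beta p_k$), and the list-of-lists layout of the inverse covariance `K_inv` (assumed symmetric) are illustrative assumptions.

```python
import math

def objective(mu, v, a, b, beta_p, mean, K_inv):
    """Terms of Eq. (18) that depend on mu_m and v_m for one document."""
    T = len(mu)
    value = 0.0
    for k in range(T):
        value -= beta_p[k] * mu[k]
        value -= a[k] / b[k] * math.exp(-mu[k] + 0.5 * v[k])
        value -= 0.5 * v[k] * K_inv[k][k]
        value += 0.5 * math.log(v[k])
        for j in range(T):
            value -= 0.5 * (mu[k] - mean[k]) * (mu[j] - mean[j]) * K_inv[k][j]
    return value

def gradients(mu, v, a, b, beta_p, mean, K_inv):
    """Gradients from Eq. (26) (w.r.t. mu) and Eq. (27) (w.r.t. v)."""
    T = len(mu)
    grad_mu, grad_v = [], []
    for k in range(T):
        expect = a[k] / b[k] * math.exp(-mu[k] + 0.5 * v[k])
        quad = sum((mu[j] - mean[j]) * K_inv[k][j] for j in range(T))
        grad_mu.append(expect - beta_p[k] - quad)
        grad_v.append(0.5 * (-expect - K_inv[k][k] + 1.0 / v[k]))
    return grad_mu, grad_v
```

Since $v_{mk}$ must stay positive, one option (an assumption here, not something stated in the note) is to optimize over $\ln v_{mk}$ and apply the chain rule to the gradient from Eq. (27).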
3.4 Update q(ηk)
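$$
\begin{aligned}
\frac{\partial L}{\partial\gamma'_{kd}}
={}& \sum_m\sum_n I(X_{mn} = d)\phi_{mnk}\psi'(\gamma'_{kd}) - \sum_m\sum_n \phi_{mnk}\psi'(\hat{\gamma}'_k) + (\gamma - 1)\psi'(\gamma'_{kd}) - (\gamma - 1)\sum_{d'}\psi'(\hat{\gamma}'_k) \\
& - \psi(\gamma'_{kd}) + \psi(\hat{\gamma}'_k) - (\gamma'_{kd} - 1)\psi'(\gamma'_{kd}) + \sum_{d'}(\gamma'_{kd'} - 1)\psi'(\hat{\gamma}'_k) - \psi(\hat{\gamma}'_k) + \psi(\gamma'_{kd}) \\
={}& \sum_m\sum_n I(X_{mn} = d)\phi_{mnk}\psi'(\gamma'_{kd}) - \sum_m\sum_n \phi_{mnk}\psi'(\hat{\gamma}'_k) + (\gamma - \gamma'_{kd})\psi'(\gamma'_{kd}) - \sum_{d'}(\gamma - \gamma'_{kd'})\psi'(\hat{\gamma}'_k) \\
={}& \psi'(\gamma'_{kd})\Bigl\{\sum_m\sum_n I(X_{mn} = d)\phi_{mnk} + \gamma - \gamma'_{kd}\Bigr\} - \psi'(\hat{\gamma}'_k)\sum_{d'}\Bigl\{\sum_m\sum_n I(X_{mn} = d')\phi_{mnk} + \gamma - \gamma'_{kd'}\Bigr\}
\end{aligned}
$$
$$
\therefore\ \gamma'_{kd} = \gamma + \sum_m\sum_n I(X_{mn} = d)\phi_{mnk} \tag{28}
$$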
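Eq. (28) is a count accumulation: each token adds its responsibility $\phi_{mnk}$ to the entry of topic $k$ at the token's word index. A minimal sketch, assuming the corpus is stored as lists of word indices (`docs`) with matching responsibilities (`phi`); these names and the layout are illustrative only.

```python
def update_gamma_prime(docs, phi, gamma, T, D):
    """Return gamma'[k][d] = gamma + sum_m sum_n I(X_mn = d) * phi_mnk (Eq. (28)).

    docs[m][n]: word index X_mn in range(D); phi[m][n][k]: responsibilities;
    gamma: symmetric Dirichlet hyperparameter; T: truncation level; D: vocabulary size.
    """
    gamma_prime = [[gamma] * D for _ in range(T)]
    for words, phi_m in zip(docs, phi):
        for word, phi_mn in zip(words, phi_m):
            for k in range(T):
                gamma_prime[k][word] += phi_mn[k]
    return gamma_prime
```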
3.5 Update q(Vk)
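Note that $\partial p_k/\partial V_k = p_k/V_k$ and that $\partial p_{\hat{k}}/\partial V_k = -p_{\hat{k}}/(1 - V_k)$ for $\hat{k} > k$, so the terms with $\hat{k} > k$ enter with a plus sign.
$$
\begin{aligned}
\frac{\partial L}{\partial V_k}
={}& -\frac{\alpha - 1}{1 - V_k} - \beta\prod_{j=1}^{k-1}(1 - V_j)\sum_{m=1}^{M}\bigl\{\mu_{mk} - \psi(a_{mk}) + \ln b_{mk}\bigr\} \\
& + \frac{1}{1 - V_k}\sum_{\hat{k}=k+1}^{T}\Bigl\{\beta V_{\hat{k}}\prod_{j=1}^{\hat{k}-1}(1 - V_j)\Bigr\}\sum_{m=1}^{M}\bigl\{\mu_{m\hat{k}} - \psi(a_{m\hat{k}}) + \ln b_{m\hat{k}}\bigr\} \\
& - \beta\prod_{j=1}^{k-1}(1 - V_j)\,\psi\Bigl(\beta V_k\prod_{j=1}^{k-1}(1 - V_j)\Bigr) + \sum_{\hat{k}=k+1}^{T}\frac{1}{1 - V_k}\beta V_{\hat{k}}\prod_{j=1}^{\hat{k}-1}(1 - V_j)\,\psi\Bigl(\beta V_{\hat{k}}\prod_{j=1}^{\hat{k}-1}(1 - V_j)\Bigr) \\
={}& -\frac{\alpha - 1}{1 - V_k} - \beta\prod_{j=1}^{k-1}(1 - V_j)\sum_{m=1}^{M}\bigl\{\mu_{mk} - \psi(a_{mk}) + \ln b_{mk}\bigr\} \\
& + \beta\prod_{j=1}^{k-1}(1 - V_j)\sum_{\hat{k}=k+1}^{T}\Bigl\{V_{\hat{k}}\prod_{j=k+1}^{\hat{k}-1}(1 - V_j)\Bigr\}\sum_{m=1}^{M}\bigl\{\mu_{m\hat{k}} - \psi(a_{m\hat{k}}) + \ln b_{m\hat{k}}\bigr\} \\
& - \beta\prod_{j=1}^{k-1}(1 - V_j)\,\psi\Bigl(\beta V_k\prod_{j=1}^{k-1}(1 - V_j)\Bigr) + \beta\prod_{j=1}^{k-1}(1 - V_j)\sum_{\hat{k}=k+1}^{T}\Bigl\{V_{\hat{k}}\prod_{j=k+1}^{\hat{k}-1}(1 - V_j)\Bigr\}\psi\Bigl(\beta V_{\hat{k}}\prod_{j=1}^{\hat{k}-1}(1 - V_j)\Bigr) \\
={}& -\frac{\alpha - 1}{1 - V_k} - \beta\prod_{j=1}^{k-1}(1 - V_j)\Bigl[\sum_{m=1}^{M}\bigl\{\mu_{mk} - \psi(a_{mk}) + \ln b_{mk}\bigr\} + \psi\Bigl(\beta V_k\prod_{j=1}^{k-1}(1 - V_j)\Bigr)\Bigr] \\
& + \beta\prod_{j=1}^{k-1}(1 - V_j)\sum_{\hat{k}=k+1}^{T}\Bigl\{V_{\hat{k}}\prod_{j=k+1}^{\hat{k}-1}(1 - V_j)\Bigr\}\Bigl[\sum_{m=1}^{M}\bigl\{\mu_{m\hat{k}} - \psi(a_{m\hat{k}}) + \ln b_{m\hat{k}}\bigr\} + \psi\Bigl(\beta V_{\hat{k}}\prod_{j=1}^{\hat{k}-1}(1 - V_j)\Bigr)\Bigr] \\
={}& -\frac{\alpha - 1}{1 - V_k} - \frac{\beta p_k}{V_k}\Bigl[\sum_{m=1}^{M}\bigl\{\mu_{mk} - \psi(a_{mk}) + \ln b_{mk}\bigr\} + \psi(\beta p_k)\Bigr] \\
& + \sum_{j=k+1}^{T}\frac{\beta p_j}{1 - V_k}\Bigl[\sum_{m=1}^{M}\bigl\{\mu_{mj} - \psi(a_{mj}) + \ln b_{mj}\bigr\} + \psi(\beta p_j)\Bigr]
\end{aligned}
\tag{29}
$$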
I think that $V_k$ on the second line of Eq. (24) in the original paper [2] is not required.
3.6 Update q(K)
With respect to K, we maximize the following function:
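$$
L(K) = -\frac{M}{2}\ln|K| - \frac{1}{2}\sum_{m=1}^{M}\sum_{k=1}^{T}v_{mk}K^{-1}_{k:k} - \frac{1}{2}\sum_{m=1}^{M}(\boldsymbol{\mu}_m - \boldsymbol{m})^T K^{-1}(\boldsymbol{\mu}_m - \boldsymbol{m}), \tag{30}
$$
where the last term is equal to $-\frac{1}{2}\sum_{m=1}^{M}\sum_{k=1}^{T}\sum_{j=1}^{T}(\mu_{mk} - m_k)(\mu_{mj} - m_j)K^{-1}_{k:j}$.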
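The derivative of the first term on the right hand side of Eq. (30) is obtained from the following identity (Cf. Eq. (51) of The Matrix Cookbook¹):
$$
\frac{\partial\ln|K|}{\partial K} = K^{-1}. \tag{31}
$$
For the second term on the right hand side of Eq. (30), it holds that $\sum_k v_{mk}K^{-1}_{k:k} = \mathrm{Tr}[K^{-1}\mathrm{diag}(\boldsymbol{v}_m)]$, where $\mathrm{diag}(\boldsymbol{v}_m)$ is a diagonal matrix whose $k$th diagonal entry is $v_{mk}$. By using the following identity (Cf. Eq. (16) in Old and New Matrix Algebra Useful for Statistics²):
$$
\frac{\partial\,\mathrm{Tr}[A\Sigma^{-1}B]}{\partial\Sigma} = -\Sigma^{-1}BA\Sigma^{-1}, \tag{32}
$$
we obtain $\frac{\partial\sum_m\sum_k v_{mk}K^{-1}_{k:k}}{\partial K} = -K^{-1}\bigl\{\sum_m \mathrm{diag}(\boldsymbol{v}_m)\bigr\}K^{-1}$.
For the last term in Eq. (30), it holds that
$$
(\boldsymbol{\mu}_m - \boldsymbol{m})^T K^{-1}(\boldsymbol{\mu}_m - \boldsymbol{m}) = \mathrm{Tr}\bigl[(\boldsymbol{\mu}_m - \boldsymbol{m})^T K^{-1}(\boldsymbol{\mu}_m - \boldsymbol{m})\bigr]. \tag{33}
$$
Therefore, by using Eq. (32), we obtain $\frac{\partial(\boldsymbol{\mu}_m - \boldsymbol{m})^T K^{-1}(\boldsymbol{\mu}_m - \boldsymbol{m})}{\partial K} = -K^{-1}(\boldsymbol{\mu}_m - \boldsymbol{m})(\boldsymbol{\mu}_m - \boldsymbol{m})^T K^{-1}$.

¹ http://orion.uwaterloo.ca/~hwolkowi/matrixcookbook.pdf
² http://research.microsoft.com/en-us/um/people/minka/papers/matrix/minka-matrix.pdf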
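Consequently, we have
$$
\frac{\partial L(K)}{\partial K} = -\frac{M}{2}K^{-1} + \frac{1}{2}K^{-1}\Bigl\{\sum_m \mathrm{diag}(\boldsymbol{v}_m)\Bigr\}K^{-1} + \frac{1}{2}K^{-1}\sum_m\bigl\{(\boldsymbol{\mu}_m - \boldsymbol{m})(\boldsymbol{\mu}_m - \boldsymbol{m})^T\bigr\}K^{-1}. \tag{34}
$$
$\frac{\partial L(K)}{\partial K} = 0$ holds when
$$
K^{-1} = \frac{1}{M}K^{-1}\sum_m\bigl\{\mathrm{diag}(\boldsymbol{v}_m) + (\boldsymbol{\mu}_m - \boldsymbol{m})(\boldsymbol{\mu}_m - \boldsymbol{m})^T\bigr\}K^{-1}. \tag{35}
$$
By multiplying both sides of the above equation by $K$ from the left and from the right, we obtain
$$
K = \frac{1}{M}\sum_m\bigl\{\mathrm{diag}(\boldsymbol{v}_m) + (\boldsymbol{\mu}_m - \boldsymbol{m})(\boldsymbol{\mu}_m - \boldsymbol{m})^T\bigr\}. \tag{36}
$$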
This derivation is identical to that for CTM [1].
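The closed-form update in Eq. (36) can be sketched with nested lists as follows. The function name `update_K` and the argument layout are illustrative assumptions, and the mean vector is taken as given.

```python
def update_K(mu, v, mean):
    """K = (1/M) * sum_m { diag(v_m) + (mu_m - mean)(mu_m - mean)^T } (Eq. (36)).

    mu[m][k], v[m][k]: variational means and variances of q(w_mk);
    mean[k]: current value of the mean vector.
    """
    M, T = len(mu), len(mean)
    K = [[0.0] * T for _ in range(T)]
    for mu_m, v_m in zip(mu, v):
        diff = [mu_m[k] - mean[k] for k in range(T)]
        for k in range(T):
            K[k][k] += v_m[k] / M  # diagonal contribution from the variances
            for j in range(T):
                K[k][j] += diff[k] * diff[j] / M  # outer-product contribution
    return K
```

The result is symmetric by construction, and the diagonal variance terms keep it positive definite whenever every $v_{mk}$ is positive.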
3.7 Update q(m)
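$$
\frac{\partial L}{\partial m_k} = \sum_{m=1}^{M}\sum_{j=1}^{T}(\mu_{mj} - m_j)K^{-1}_{k:j}, \qquad \therefore\ m_k = \frac{1}{M}\sum_{m=1}^{M}\mu_{mk} \tag{37}
$$
In vector form the gradient is $K^{-1}\sum_{m=1}^{M}(\boldsymbol{\mu}_m - \boldsymbol{m})$, which vanishes only when $\sum_{m=1}^{M}(\boldsymbol{\mu}_m - \boldsymbol{m}) = 0$ because $K^{-1}$ is invertible.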
3.8 Update q(α)
With respect to α, we maximize the following function:
$$
L(\alpha) = T\ln\Gamma(\alpha + 1) - T\ln\Gamma(\alpha) + (\alpha - 1)\sum_{k=1}^{T}\ln(1 - V_k) \tag{38}
$$
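We use the following inequality (Cf. Eqs. (120), (121), and (122) in Estimating a Dirichlet distribution³), which holds with equality at $x = \hat{x}$:
$$
\frac{\Gamma(n + x)}{\Gamma(x)} \geq c\,x^{a} \quad \text{if } n \geq 1 \tag{39}
$$
$$
a = \bigl\{\psi(n + \hat{x}) - \psi(\hat{x})\bigr\}\hat{x} \tag{40}
$$
$$
c = \frac{\Gamma(n + \hat{x})}{\Gamma(\hat{x})}\hat{x}^{-a} \tag{41}
$$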
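Then, with $\hat{\alpha}$ denoting the current value of $\alpha$, we obtain:
$$
L(\alpha) \geq T\bigl\{\psi(\hat{\alpha} + 1) - \psi(\hat{\alpha})\bigr\}\hat{\alpha}\ln\alpha + (\alpha - 1)\sum_{k=1}^{T}\ln(1 - V_k) + \mathrm{const.} \tag{42}
$$
We maximize this lower bound, which we denote as $L(\alpha)$ with a slight abuse of notation.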
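$$
\frac{\partial L(\alpha)}{\partial\alpha} = \frac{1}{\alpha}T\bigl\{\psi(\hat{\alpha} + 1) - \psi(\hat{\alpha})\bigr\}\hat{\alpha} + \sum_{k=1}^{T}\ln(1 - V_k) \tag{43}
$$
$$
\therefore\ \alpha = \hat{\alpha}\cdot\frac{T\bigl\{\psi(\hat{\alpha} + 1) - \psi(\hat{\alpha})\bigr\}}{-\sum_{k=1}^{T}\ln(1 - V_k)} \tag{44}
$$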
³ http://research.microsoft.com/en-us/um/people/minka/papers/dirichlet/
This is a multiplicative update.
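A minimal sketch of the update in Eq. (44) follows. The helper `digamma` is a hand-written approximation, the name `update_alpha` is illustrative, and every $V_k$ passed in is assumed to lie strictly between 0 and 1 so that the logarithms are finite.

```python
import math

def digamma(x):
    """Approximate psi(x) for x > 0 via recurrence and an asymptotic series."""
    result = 0.0
    while x < 6.0:  # psi(x) = psi(x + 1) - 1/x
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    series = inv2 * (1.0 / 12.0 - inv2 * (1.0 / 120.0 - inv2 / 252.0))
    return result + math.log(x) - 0.5 / x - series

def update_alpha(alpha_hat, V):
    """One multiplicative step of Eq. (44), starting from the current value alpha_hat."""
    T = len(V)
    log_sum = sum(math.log(1.0 - v_k) for v_k in V)  # negative for V_k in (0, 1)
    ratio = T * (digamma(alpha_hat + 1.0) - digamma(alpha_hat)) / (-log_sum)
    return alpha_hat * ratio
```

Because $\psi(x + 1) - \psi(x) = 1/x$, the step returns $-T/\sum_{k}\ln(1 - V_k)$ whatever the starting value, so in this prior-free case a single step already reaches the fixed point.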
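When we apply a Gamma prior $p(\alpha) = \frac{b_0^{a_0}}{\Gamma(a_0)}\alpha^{a_0 - 1}e^{-b_0\alpha}$ to $\alpha$, we have the following result:
$$
\frac{\partial L(\alpha)}{\partial\alpha} = \frac{1}{\alpha}T\bigl\{\psi(\hat{\alpha} + 1) - \psi(\hat{\alpha})\bigr\}\hat{\alpha} + \sum_{k=1}^{T}\ln(1 - V_k) + (a_0 - 1)\frac{1}{\alpha} - b_0 \tag{45}
$$
Setting this derivative to zero and solving for $\alpha$ gives
$$
\therefore\ \alpha = \frac{a_0 - 1 + \hat{\alpha}\,T\bigl\{\psi(\hat{\alpha} + 1) - \psi(\hat{\alpha})\bigr\}}{b_0 - \sum_{k=1}^{T}\ln(1 - V_k)} \tag{46}
$$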
3.9 Update q(β)
With respect to β, we maximize the following function L(β):
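$$
\begin{aligned}
L(\beta) ={}& -\sum_{k=1}^{T}\Bigl\{\beta V_k\prod_{j=1}^{k-1}(1 - V_j)\Bigr\}\sum_{m=1}^{M}\mu_{mk} - \sum_{k=1}^{T}\ln\Gamma\Bigl(\beta V_k\prod_{j=1}^{k-1}(1 - V_j)\Bigr) \\
& + \sum_{k=1}^{T}\Bigl\{\beta V_k\prod_{j=1}^{k-1}(1 - V_j)\Bigr\}\sum_{m=1}^{M}\bigl\{\psi(a_{mk}) - \ln b_{mk}\bigr\} \\
={}& -\sum_{k=1}^{T}\beta p_k\sum_{m=1}^{M}\mu_{mk} - \sum_{k=1}^{T}\ln\Gamma(\beta p_k) + \sum_{k=1}^{T}\beta p_k\sum_{m=1}^{M}\bigl\{\psi(a_{mk}) - \ln b_{mk}\bigr\}
\end{aligned}
\tag{47}
$$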
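The first and the second derivatives are obtained as follows:
$$
\begin{aligned}
\frac{\partial L(\beta)}{\partial\beta} &= -\sum_{k=1}^{T}p_k\Bigl[\psi(\beta p_k) + \sum_{m=1}^{M}\bigl\{\mu_{mk} - \psi(a_{mk}) + \ln b_{mk}\bigr\}\Bigr] \\
\frac{\partial^2 L(\beta)}{\partial\beta^2} &= -\sum_{k=1}^{T}p_k^2\,\psi'(\beta p_k)
\end{aligned}
\tag{48}
$$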
We can use Newton’s method to update β.
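When we apply a Gamma prior $p(\beta) = \frac{d_0^{c_0}}{\Gamma(c_0)}\beta^{c_0 - 1}e^{-d_0\beta}$ to $\beta$, we have the following result:
$$
\begin{aligned}
\frac{\partial L(\beta)}{\partial\beta} &= -\sum_{k=1}^{T}p_k\Bigl[\psi(\beta p_k) + \sum_{m=1}^{M}\bigl\{\mu_{mk} - \psi(a_{mk}) + \ln b_{mk}\bigr\}\Bigr] + (c_0 - 1)\frac{1}{\beta} - d_0 \\
\frac{\partial^2 L(\beta)}{\partial\beta^2} &= -\sum_{k=1}^{T}p_k^2\,\psi'(\beta p_k) - (c_0 - 1)\frac{1}{\beta^2}
\end{aligned}
\tag{49}
$$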
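A Newton iteration built from the derivatives in Eq. (49) might look like the sketch below; with `c0 = 1` and `d0 = 0` it reduces to Eq. (48). The helpers `digamma` and `trigamma` are hand-written approximations, the argument `s` is assumed to hold the precomputed sums $\sum_m\{\mu_{mk} - \psi(a_{mk}) + \ln b_{mk}\}$, and the step-halving safeguard that keeps $\beta$ positive is an added assumption rather than part of the note.

```python
import math

def digamma(x):
    """Approximate psi(x) for x > 0 via recurrence and an asymptotic series."""
    result = 0.0
    while x < 6.0:  # psi(x) = psi(x + 1) - 1/x
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    series = inv2 * (1.0 / 12.0 - inv2 * (1.0 / 120.0 - inv2 / 252.0))
    return result + math.log(x) - 0.5 / x - series

def trigamma(x):
    """Approximate psi'(x) for x > 0 via recurrence and an asymptotic series."""
    result = 0.0
    while x < 6.0:  # psi'(x) = psi'(x + 1) + 1/x^2
        result += 1.0 / (x * x)
        x += 1.0
    inv, inv2 = 1.0 / x, 1.0 / (x * x)
    series = inv * (1.0 + inv * (0.5 + inv * (1.0 / 6.0 - inv2 * (1.0 / 30.0 - inv2 / 42.0))))
    return result + series

def newton_beta(beta, p, s, c0=1.0, d0=0.0, iters=100, tol=1e-12):
    """Maximize L(beta) with Newton steps. p[k]: stick weights; s[k]: per-topic sums."""
    for _ in range(iters):
        grad = -sum(pk * (digamma(beta * pk) + sk) for pk, sk in zip(p, s))
        grad += (c0 - 1.0) / beta - d0
        hess = -sum(pk * pk * trigamma(beta * pk) for pk in p)
        hess -= (c0 - 1.0) / (beta * beta)
        step = grad / hess
        while beta - step <= 0.0:  # safeguard: keep beta strictly positive
            step *= 0.5
        beta -= step
        if abs(step) < tol:
            break
    return beta
```

For $c_0 \geq 1$ the second derivative is negative for every $\beta > 0$, so the objective is concave and the stationary point found by the iteration is the maximizer.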
References
[1] David M. Blei and John D. Lafferty. Correlated topic models. In NIPS, 2005.
[2] John Paisley, Chong Wang, and David Blei. The discrete infinite logistic normal distribution for mixed-membership modeling. In AISTATS, 2011.