
A Note on Over-replicated Softmax Model



Derivation of equations for over-replicated softmax model

Tomonari MASADA @ Nagasaki University

May 31, 2013

1 Joint probability distribution

• We define constants as follows:

– D : the number of documents

– N : the length of the document

– W : the dictionary size, i.e., the number of different words

– J : the number of hidden units in the first hidden layer

– M : the number of hidden units in the second hidden layer

• Let $V$ denote the set of visible binary units, with $v_{nw} = 1$ if the $w$th word appears as the $n$th token.

• Let $h^{(1)}$ denote the set of hidden binary units in the first hidden layer.

• Let $H^{(2)}$ denote the set of hidden binary units in the second hidden layer. This is an $M \times W$ matrix with $h^{(2)}_{mw} = 1$ if the $m$th hidden softmax unit takes on the $w$th value.

The energy of the joint configuration $\{V, h^{(1)}, H^{(2)}\}$ is defined as:

\begin{align}
E(V, h^{(1)}, H^{(2)}; \theta)
&= -\sum_{n=1}^{N}\sum_{j=1}^{J}\sum_{w=1}^{W} W^{(1)}_{njw} h^{(1)}_j v_{nw}
 - \sum_{m=1}^{M}\sum_{j=1}^{J}\sum_{w=1}^{W} W^{(2)}_{mjw} h^{(1)}_j h^{(2)}_{mw} \nonumber\\
&\quad - \sum_{n=1}^{N}\sum_{w=1}^{W} v_{nw} b^{(1)}_{nw}
 - (M+N)\sum_{j=1}^{J} h^{(1)}_j a_j
 - \sum_{m=1}^{M}\sum_{w=1}^{W} h^{(2)}_{mw} b^{(2)}_{mw} \tag{1}
\end{align}

where $\theta = \{W^{(1)}, W^{(2)}, a, b^{(1)}, b^{(2)}\}$ are the model parameters.

We ignore the order of the word tokens by letting $W^{(1)}_{njw}$ take the same value for all $n$. In a similar manner, we let $W^{(2)}_{mjw}$ take the same value for all $m$. Further, we tie the first- and second-layer weights. Consequently, we have $W^{(1)}_{njw} = W^{(2)}_{mjw} = W_{jw}$ and $b^{(1)}_{nw} = b^{(2)}_{mw} = b_w$, and the energy simplifies to:

\begin{align}
E(V, h^{(1)}, H^{(2)}; \theta)
&= -\sum_{n=1}^{N}\sum_{j=1}^{J}\sum_{w=1}^{W} W_{jw} h^{(1)}_j v_{nw}
 - \sum_{m=1}^{M}\sum_{j=1}^{J}\sum_{w=1}^{W} W_{jw} h^{(1)}_j h^{(2)}_{mw}
 - \sum_{n=1}^{N}\sum_{w=1}^{W} v_{nw} b_w
 - (M+N)\sum_{j=1}^{J} h^{(1)}_j a_j
 - \sum_{m=1}^{M}\sum_{w=1}^{W} h^{(2)}_{mw} b_w \nonumber\\
&= -\sum_{j=1}^{J}\sum_{w=1}^{W} W_{jw} h^{(1)}_j (v_w + h^{(2)}_w)
 - \sum_{w=1}^{W} (v_w + h^{(2)}_w) b_w
 - (M+N)\sum_{j=1}^{J} h^{(1)}_j a_j, \tag{2}
\end{align}

where $v_w = \sum_n v_{nw}$ and $h^{(2)}_w = \sum_m h^{(2)}_{mw}$.

The joint probability distribution is defined as:

$$
p(V, h^{(1)}, H^{(2)}; \theta) = \frac{\exp\bigl(-E(V, h^{(1)}, H^{(2)}; \theta)\bigr)}{Z(\theta, N)}, \tag{3}
$$

where $Z(\theta, N) = \sum_V \sum_{h^{(1)}} \sum_{H^{(2)}} \exp\bigl(-E(V, h^{(1)}, H^{(2)}; \theta)\bigr)$.
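As a concrete check of Eq. (2), the following is a minimal NumPy sketch of the simplified energy of one document; the function and variable names (energy, v_counts, h1, H2, and so on) are illustrative assumptions, not part of the original note.

```python
import numpy as np

def energy(v_counts, h1, H2, W, a, b, N):
    """Simplified energy of Eq. (2) for one document.

    v_counts : (W_dict,) word counts v_w = sum_n v_nw
    h1       : (J,) binary states of the first hidden layer
    H2       : (M, W_dict) one-hot rows of the second-layer softmax units
    W        : (J, W_dict) tied weights W_jw
    a        : (J,) biases of the first hidden layer
    b        : (W_dict,) tied visible / second-layer biases
    N        : document length, i.e. v_counts.sum()
    """
    M = H2.shape[0]
    h2_counts = H2.sum(axis=0)          # h^(2)_w = sum_m h^(2)_mw
    s = v_counts + h2_counts            # v_w + h^(2)_w
    return -(h1 @ W @ s) - (s @ b) - (M + N) * (h1 @ a)
```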


2 Conditional distributions over hidden and visible units

The conditional distribution over a visible unit is

\begin{align}
p(v_n \mid V_{\setminus n}, h^{(1)}, H^{(2)}; \theta)
&= \frac{p(V, h^{(1)}, H^{(2)}; \theta)}{\sum_{v_n \in \{e_1,\ldots,e_W\}} p(V, h^{(1)}, H^{(2)}; \theta)}
= \frac{\exp\bigl(-E(V, h^{(1)}, H^{(2)}; \theta)\bigr)}{\sum_{v_n} \exp\bigl(-E(V, h^{(1)}, H^{(2)}; \theta)\bigr)} \nonumber\\
&= \frac{\prod_j \prod_w \exp(W_{jw} h^{(1)}_j v_w) \exp(W_{jw} h^{(1)}_j h^{(2)}_w) \cdot \prod_w \exp(v_w b_w) \exp(h^{(2)}_w b_w) \cdot \prod_j \exp(h^{(1)}_j a_j)^{M+N}}{\sum_{v_n} \bigl\{ \prod_j \prod_w \exp(W_{jw} h^{(1)}_j v_w) \exp(W_{jw} h^{(1)}_j h^{(2)}_w) \cdot \prod_w \exp(v_w b_w) \exp(h^{(2)}_w b_w) \cdot \prod_j \exp(h^{(1)}_j a_j)^{M+N} \bigr\}} \nonumber\\
&= \frac{\prod_j \prod_w \exp(W_{jw} h^{(1)}_j v_{nw}) \cdot \prod_w \exp(v_{nw} b_w)}{\sum_{v_n} \bigl\{ \prod_j \prod_w \exp(W_{jw} h^{(1)}_j v_{nw}) \cdot \prod_w \exp(v_{nw} b_w) \bigr\}} \tag{4}
\end{align}

This result shows that $p(v_n \mid V_{\setminus n}, h^{(1)}, H^{(2)}; \theta) = p(v_n \mid h^{(1)}; \theta)$. For $v_{nw} = 1$, we obtain

$$
p(v_{nw} = 1 \mid h^{(1)}; \theta)
= \frac{\prod_j \exp(W_{jw} h^{(1)}_j) \cdot \exp(b_w)}{\sum_{w'} \bigl\{ \prod_j \exp(W_{jw'} h^{(1)}_j) \cdot \exp(b_{w'}) \bigr\}} \tag{5}
$$

The conditional distribution over a hidden unit of the first hidden layer is

$$
p(h^{(1)}_j \mid V, h^{(1)}_{\setminus j}, H^{(2)}; \theta)
= \frac{\prod_j \prod_w \exp(W_{jw} h^{(1)}_j v_w) \exp(W_{jw} h^{(1)}_j h^{(2)}_w) \cdot \prod_j \exp(h^{(1)}_j a_j)^{M+N}}{\sum_{h^{(1)}_j \in \{0,1\}} \bigl\{ \prod_j \prod_w \exp(W_{jw} h^{(1)}_j v_w) \exp(W_{jw} h^{(1)}_j h^{(2)}_w) \cdot \prod_j \exp(h^{(1)}_j a_j)^{M+N} \bigr\}} \tag{6}
$$

This result shows that $p(h^{(1)}_j \mid V, h^{(1)}_{\setminus j}, H^{(2)}; \theta) = p(h^{(1)}_j \mid V, H^{(2)}; \theta)$. For $h^{(1)}_j = 1$, we obtain

\begin{align}
p(h^{(1)}_j = 1 \mid V, H^{(2)}; \theta)
&= \frac{\prod_w \exp(W_{jw} v_w) \exp(W_{jw} h^{(2)}_w) \cdot \exp(a_j)^{M+N}}{1 + \prod_w \exp(W_{jw} v_w) \exp(W_{jw} h^{(2)}_w) \cdot \exp(a_j)^{M+N}} \nonumber\\
&= \sigma\Bigl( \sum_w W_{jw} (v_w + h^{(2)}_w) + (M+N) a_j \Bigr). \tag{7}
\end{align}

The conditional distribution over a hidden unit of the second hidden layer is

$$
p(h^{(2)}_m \mid V, h^{(1)}, H^{(2)}_{\setminus m}; \theta)
= \frac{\prod_j \prod_w \exp(W_{jw} h^{(1)}_j h^{(2)}_w) \cdot \prod_w \exp(h^{(2)}_w b_w)}{\sum_{h^{(2)}_m} \bigl\{ \prod_j \prod_w \exp(W_{jw} h^{(1)}_j h^{(2)}_w) \cdot \prod_w \exp(h^{(2)}_w b_w) \bigr\}} \tag{8}
$$

This result shows that $p(h^{(2)}_m \mid V, h^{(1)}, H^{(2)}_{\setminus m}; \theta) = p(h^{(2)}_m \mid h^{(1)}; \theta)$. For $h^{(2)}_{mw} = 1$, we obtain

$$
p(h^{(2)}_{mw} = 1 \mid h^{(1)}; \theta)
= \frac{\prod_j \exp(W_{jw} h^{(1)}_j) \cdot \exp(b_w)}{\sum_{w'} \bigl\{ \prod_j \exp(W_{jw'} h^{(1)}_j) \cdot \exp(b_{w'}) \bigr\}} \tag{9}
$$

The above distributions can be used for sampling $V$, $h^{(1)}$, and $H^{(2)}$.
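For illustration, here is a minimal NumPy sketch of one Gibbs sweep that samples $h^{(1)}$, $V$, and $H^{(2)}$ in turn from Eqs. (7), (5), and (9); the names and the sampling order are assumptions made for this sketch, not a procedure prescribed by the note.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def gibbs_sweep(v_counts, h1, H2, W, a, b, N, rng):
    """One Gibbs sweep; array shapes as in the energy sketch of Section 1."""
    M, W_dict = H2.shape
    # Eq. (7): p(h1_j = 1 | V, H2) = sigma(sum_w W_jw (v_w + h2_w) + (M+N) a_j)
    h2_counts = H2.sum(axis=0)
    p_h1 = sigmoid(W @ (v_counts + h2_counts) + (M + N) * a)
    h1 = (rng.random(p_h1.shape) < p_h1).astype(float)
    # Eq. (5): every visible token is drawn from the same softmax over words
    p_word = softmax(W.T @ h1 + b)
    tokens = rng.choice(W_dict, size=N, p=p_word)
    v_counts = np.bincount(tokens, minlength=W_dict).astype(float)
    # Eq. (9): every second-layer softmax unit uses the same probabilities
    H2 = np.eye(W_dict)[rng.choice(W_dict, size=M, p=p_word)]
    return v_counts, h1, H2
```

Here rng is assumed to be a numpy.random.Generator, e.g. numpy.random.default_rng().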

When we set $h^{(2)}_{mw} = \frac{\sum_n v_{nw}}{\sum_n \sum_w v_{nw}} = \frac{v_w}{N}$ for all $m$ [1],

\begin{align}
p(h^{(1)}_j = 1 \mid V, H^{(2)}; \theta)
&= \sigma\Bigl( \sum_w W_{jw} \bigl( v_w + \tfrac{M v_w}{N} \bigr) + (M+N) a_j \Bigr) \nonumber\\
&= \sigma\Bigl( \bigl(1 + \tfrac{M}{N}\bigr) \sum_w W_{jw} v_w + (M+N) a_j \Bigr). \tag{10}
\end{align}

This can be used in pretraining.

[1] cf. Sec. 2.2 of Fast Inference and Learning for Modeling Documents with a Deep Boltzmann Machine by Nitish Srivastava, Ruslan Salakhutdinov, and Geoffrey Hinton.
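Eq. (10) reduces to scaling the word counts before the usual sigmoid, as in this small sketch (hypothetical names, NumPy assumed):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def pretrain_hidden_probs(v_counts, W, a, M, N):
    """Eq. (10): p(h1_j = 1 | V, H2) when every second-layer softmax unit
    is fixed to the empirical word distribution v_w / N of the document."""
    return sigmoid((1.0 + M / N) * (W @ v_counts) + (M + N) * a)
```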


3 Derivatives of log-likelihood

When we have $D$ documents $V_1, \ldots, V_D$, the log-likelihood can be written as

$$
\ln \prod_{d=1}^{D} P(V_d; \theta)
= \sum_d \ln \sum_{h^{(1)}} \sum_{H^{(2)}} \exp\bigl(-E(V_d, h^{(1)}, H^{(2)}; \theta)\bigr)
- D \ln \sum_V \sum_{h^{(1)}} \sum_{H^{(2)}} \exp\bigl(-E(V, h^{(1)}, H^{(2)}; \theta)\bigr). \tag{11}
$$

\begin{align}
\frac{\partial \ln \prod_{d=1}^{D} P(V_d; \theta)}{\partial W_{jw}}
&= \sum_d \frac{\partial}{\partial W_{jw}} \ln \sum_{h^{(1)}} \sum_{H^{(2)}} \exp\bigl(-E(V_d, h^{(1)}, H^{(2)}; \theta)\bigr)
 - D \frac{\partial}{\partial W_{jw}} \ln \sum_V \sum_{h^{(1)}} \sum_{H^{(2)}} \exp\bigl(-E(V, h^{(1)}, H^{(2)}; \theta)\bigr) \nonumber\\
&= \sum_d \frac{\frac{\partial}{\partial W_{jw}} \sum_{h^{(1)}} \sum_{H^{(2)}} \exp\bigl(-E(V_d, h^{(1)}, H^{(2)}; \theta)\bigr)}{\sum_{h^{(1)}} \sum_{H^{(2)}} \exp\bigl(-E(V_d, h^{(1)}, H^{(2)}; \theta)\bigr)}
 - D \frac{\frac{\partial}{\partial W_{jw}} \sum_V \sum_{h^{(1)}} \sum_{H^{(2)}} \exp\bigl(-E(V, h^{(1)}, H^{(2)}; \theta)\bigr)}{\sum_V \sum_{h^{(1)}} \sum_{H^{(2)}} \exp\bigl(-E(V, h^{(1)}, H^{(2)}; \theta)\bigr)} \tag{12}
\end{align}

\begin{align}
\frac{\partial}{\partial W_{jw}} \sum_{h^{(1)}} \sum_{H^{(2)}} \exp\bigl(-E(V_d, h^{(1)}, H^{(2)}; \theta)\bigr)
&= \frac{\partial}{\partial W_{jw}} \sum_{h^{(1)}} \sum_{H^{(2)}} \Bigl\{ \prod_j \prod_w \exp(W_{jw} h^{(1)}_j v_{dw}) \exp(W_{jw} h^{(1)}_j h^{(2)}_w) \cdot \prod_w \exp(v_{dw} b_w) \exp(h^{(2)}_w b_w) \cdot \prod_j \exp(h^{(1)}_j a_j)^{M+N} \Bigr\} \nonumber\\
&= \sum_{h^{(1)}} \sum_{H^{(2)}} (h^{(1)}_j v_{dw} + h^{(1)}_j h^{(2)}_w) \Bigl\{ \prod_j \prod_w \exp(W_{jw} h^{(1)}_j v_{dw}) \exp(W_{jw} h^{(1)}_j h^{(2)}_w) \cdot \prod_w \exp(v_{dw} b_w) \exp(h^{(2)}_w b_w) \cdot \prod_j \exp(h^{(1)}_j a_j)^{M+N} \Bigr\} \nonumber\\
&= \sum_{h^{(1)}} \sum_{H^{(2)}} (h^{(1)}_j v_{dw} + h^{(1)}_j h^{(2)}_w) \exp\bigl(-E(V_d, h^{(1)}, H^{(2)}; \theta)\bigr) \tag{13}
\end{align}

In a similar manner, we obtain

$$
\frac{\partial}{\partial W_{jw}} \sum_V \sum_{h^{(1)}} \sum_{H^{(2)}} \exp\bigl(-E(V, h^{(1)}, H^{(2)}; \theta)\bigr)
= \sum_V \sum_{h^{(1)}} \sum_{H^{(2)}} (h^{(1)}_j v_w + h^{(1)}_j h^{(2)}_w) \exp\bigl(-E(V, h^{(1)}, H^{(2)}; \theta)\bigr). \tag{14}
$$


Therefore,

\begin{align}
\frac{\partial \ln \prod_{d=1}^{D} P(V_d; \theta)}{\partial W_{jw}}
&= \sum_d \frac{\sum_{h^{(1)}} \sum_{H^{(2)}} (h^{(1)}_j v_{dw} + h^{(1)}_j h^{(2)}_w) \exp\bigl(-E(V_d, h^{(1)}, H^{(2)}; \theta)\bigr)}{\sum_{h^{(1)}} \sum_{H^{(2)}} \exp\bigl(-E(V_d, h^{(1)}, H^{(2)}; \theta)\bigr)} \nonumber\\
&\quad - D \frac{\sum_V \sum_{h^{(1)}} \sum_{H^{(2)}} (h^{(1)}_j v_w + h^{(1)}_j h^{(2)}_w) \exp\bigl(-E(V, h^{(1)}, H^{(2)}; \theta)\bigr)}{\sum_V \sum_{h^{(1)}} \sum_{H^{(2)}} \exp\bigl(-E(V, h^{(1)}, H^{(2)}; \theta)\bigr)} \nonumber\\
&= \sum_d \sum_{h^{(1)}} \sum_{H^{(2)}} (h^{(1)}_j v_{dw} + h^{(1)}_j h^{(2)}_w)\, p(h^{(1)}, H^{(2)} \mid V_d; \theta)
 - D \sum_V \sum_{h^{(1)}} \sum_{H^{(2)}} (h^{(1)}_j v_w + h^{(1)}_j h^{(2)}_w)\, p(V, h^{(1)}, H^{(2)}; \theta), \tag{15}
\end{align}

where the second term on the right-hand side can be approximated by Gibbs sampling (cf. Eqs. (7) and (9)), and the first term by the variational inference described in the next section.
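As one way to turn Eq. (15) into an update, the following sketch approximates the first (data-dependent) term with the variational parameters $\mu_d$, $\nu_d$ of Section 4 and the second (model) term with Gibbs samples; the single-sample averaging and all names are assumptions of this sketch, not the note's prescribed procedure.

```python
import numpy as np

def grad_W(docs_counts, mus, nus, model_samples, M):
    """Approximate gradient of Eq. (15) with respect to the weight matrix W.

    docs_counts   : list of (W_dict,) count vectors v_dw, one per document
    mus           : list of (J,) variational parameters mu_d
    nus           : list of (W_dict,) variational parameters nu_d
    model_samples : list of (h1, v_counts, h2_counts) triples from Gibbs sampling
    """
    # Data-dependent term: E_q[h1_j (v_dw + h2_w)] = mu_j (v_dw + M nu_w)
    pos = sum(np.outer(mu, v + M * nu)
              for v, mu, nu in zip(docs_counts, mus, nus))
    # Model term: Monte Carlo average over Gibbs samples, scaled by D
    neg = sum(np.outer(h1, v + h2) for h1, v, h2 in model_samples)
    neg *= len(docs_counts) / len(model_samples)
    return pos - neg
```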

4 Variational inference

For a particular document $V$, we have the following based on Jensen's inequality:

\begin{align}
\ln p(V; \theta)
&= \ln \sum_{h^{(1)}} \sum_{H^{(2)}} p(V, h^{(1)}, H^{(2)}; \theta)
= \ln \sum_{h^{(1)}} \sum_{H^{(2)}} \Bigl\{ q(h^{(1)}, H^{(2)} \mid V) \cdot \frac{p(V, h^{(1)}, H^{(2)}; \theta)}{q(h^{(1)}, H^{(2)} \mid V)} \Bigr\} \nonumber\\
&\geq \sum_{h^{(1)}} \sum_{H^{(2)}} q(h^{(1)}, H^{(2)} \mid V) \ln \frac{p(V, h^{(1)}, H^{(2)}; \theta)}{q(h^{(1)}, H^{(2)} \mid V)} \nonumber\\
&= \sum_{h^{(1)}} \sum_{H^{(2)}} q(h^{(1)}, H^{(2)} \mid V) \ln p(V, h^{(1)}, H^{(2)}; \theta)
 - \sum_{h^{(1)}} \sum_{H^{(2)}} q(h^{(1)}, H^{(2)} \mid V) \ln q(h^{(1)}, H^{(2)} \mid V) \nonumber\\
&= \sum_{h^{(1)}} \sum_{H^{(2)}} q(h^{(1)}, H^{(2)} \mid V) \ln \bigl\{ p(h^{(1)}, H^{(2)} \mid V; \theta)\, p(V; \theta) \bigr\}
 - \sum_{h^{(1)}} \sum_{H^{(2)}} q(h^{(1)}, H^{(2)} \mid V) \ln q(h^{(1)}, H^{(2)} \mid V) \nonumber\\
&= \ln p(V; \theta) + \sum_{h^{(1)}} \sum_{H^{(2)}} q(h^{(1)}, H^{(2)} \mid V) \ln p(h^{(1)}, H^{(2)} \mid V; \theta)
 - \sum_{h^{(1)}} \sum_{H^{(2)}} q(h^{(1)}, H^{(2)} \mid V) \ln q(h^{(1)}, H^{(2)} \mid V) \tag{16}
\end{align}

We denote the lower bound in Eq. (16) by $\mathcal{L}$. We assume

$$
q(h^{(1)}, H^{(2)} \mid V_d) = q(h^{(1)} \mid V_d)\, q(H^{(2)} \mid V_d)
= \prod_j q_d(h^{(1)}_j) \cdot \prod_m \prod_w q_d(h^{(2)}_{mw}) \tag{17}
$$

with $q_d(h^{(1)}_j = 1) = \mu_{dj}$ and $q_d(h^{(2)}_{mw} = 1) = \nu_{dw}$ for the $d$th document. Note that $\sum_w \nu_{dw} = 1$ holds, because $\sum_w h^{(2)}_{mw} = 1$. We omit the subscript $d$ from now on.

The first term of the lower bound in Eq. (16), i.e., $\ln p(V; \theta)$, can be regarded as a constant with respect to the variational parameters.


The second term of the lower bound in Eq. (16) can be rewritten as follows:

\begin{align}
&\sum_{h^{(1)}} \sum_{H^{(2)}} q(h^{(1)}, H^{(2)} \mid V) \ln p(h^{(1)}, H^{(2)} \mid V; \theta) \nonumber\\
&= \sum_{h^{(1)}} \sum_{H^{(2)}} q(h^{(1)}, H^{(2)} \mid V) \ln \frac{\exp\bigl(-E(V, h^{(1)}, H^{(2)}; \theta)\bigr)}{\sum_{h^{(1)}} \sum_{H^{(2)}} \exp\bigl(-E(V, h^{(1)}, H^{(2)}; \theta)\bigr)} \nonumber\\
&= -\sum_{h^{(1)}} \sum_{H^{(2)}} q(h^{(1)}, H^{(2)} \mid V)\, E(V, h^{(1)}, H^{(2)}; \theta)
 - \sum_{h^{(1)}} \sum_{H^{(2)}} q(h^{(1)}, H^{(2)} \mid V) \ln \sum_{h^{(1)}} \sum_{H^{(2)}} \exp\bigl(-E(V, h^{(1)}, H^{(2)}; \theta)\bigr) \nonumber\\
&= -\sum_{h^{(1)}} \sum_{H^{(2)}} q(h^{(1)}, H^{(2)} \mid V)\, E(V, h^{(1)}, H^{(2)}; \theta)
 - \ln \sum_{h^{(1)}} \sum_{H^{(2)}} \exp\bigl(-E(V, h^{(1)}, H^{(2)}; \theta)\bigr) \tag{18}
\end{align}

The second term on the right-hand side of Eq. (18) does not depend on the variational parameters. The first term can be rewritten as follows:

\begin{align}
&-\sum_{h^{(1)}} \sum_{H^{(2)}} q(h^{(1)}, H^{(2)} \mid V)\, E(V, h^{(1)}, H^{(2)}; \theta) \nonumber\\
&= \sum_{h^{(1)}} \sum_{H^{(2)}} \Bigl\{ \prod_j q(h^{(1)}_j) \cdot \prod_m \prod_w q(h^{(2)}_{mw}) \Bigr\}
\Bigl( \sum_{j=1}^{J} \sum_{w=1}^{W} W_{jw} h^{(1)}_j (v_w + h^{(2)}_w) + \sum_{w=1}^{W} h^{(2)}_w b_w + (M+N) \sum_{j=1}^{J} h^{(1)}_j a_j \Bigr) \nonumber\\
&= \sum_{h^{(1)}} \sum_{H^{(2)}} \Bigl\{ \prod_j q(h^{(1)}_j) \cdot \prod_m \prod_w q(h^{(2)}_{mw}) \Bigr\} \Bigl( \sum_{j=1}^{J} \sum_{w=1}^{W} W_{jw} h^{(1)}_j h^{(2)}_w \Bigr)
+ \sum_{h^{(1)}} \Bigl\{ \prod_j q(h^{(1)}_j) \Bigr\} \Bigl\{ \sum_{j=1}^{J} \sum_{w=1}^{W} W_{jw} h^{(1)}_j v_w + (M+N) \sum_{j=1}^{J} h^{(1)}_j a_j \Bigr\}
+ \sum_{H^{(2)}} \Bigl\{ \prod_m \prod_w q(h^{(2)}_{mw}) \Bigr\} \Bigl( \sum_{w=1}^{W} h^{(2)}_w b_w \Bigr) \nonumber\\
&= \sum_j \sum_w M \mu_j \nu_w W_{jw} + \sum_j \mu_j \Bigl\{ \sum_w W_{jw} v_w + (M+N) a_j \Bigr\} + M \sum_w \nu_w b_w \tag{19}
\end{align}

Here the term $\sum_w v_w b_w$, which involves no hidden units and hence no variational parameters, has been dropped.
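As a quick check of the last line of Eq. (19), the expected value of $-E$ under the factorized $q$ can be computed directly from $\mu$ and $\nu$; names are hypothetical, and the constant $\sum_w v_w b_w$ is omitted as above.

```python
import numpy as np

def expected_neg_energy(v_counts, mu, nu, W, a, b, M, N):
    """Last line of Eq. (19): E_q[-E(V, h1, H2)], up to the constant sum_w v_w b_w."""
    return (M * mu @ W @ nu                       # sum_jw M mu_j nu_w W_jw
            + mu @ (W @ v_counts + (M + N) * a)   # sum_j mu_j (sum_w W_jw v_w + (M+N) a_j)
            + M * (nu @ b))                       # M sum_w nu_w b_w
```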

Therefore, differentiating Eq. (19) with respect to $\mu_j$ and $\nu_w$,

\begin{align}
\frac{\partial}{\partial \mu_j} \sum_{h^{(1)}} \sum_{H^{(2)}} q(h^{(1)}, H^{(2)} \mid V) \ln p(h^{(1)}, H^{(2)} \mid V; \theta)
&= \sum_w M \nu_w W_{jw} + \sum_w W_{jw} v_w + (M+N) a_j \nonumber\\
\frac{\partial}{\partial \nu_w} \sum_{h^{(1)}} \sum_{H^{(2)}} q(h^{(1)}, H^{(2)} \mid V) \ln p(h^{(1)}, H^{(2)} \mid V; \theta)
&= \sum_j M \mu_j W_{jw} + M b_w \tag{20}
\end{align}

The third term of the lower bound in Eq. (16) is

$$
-\sum_{h^{(1)}} \sum_{H^{(2)}} q(h^{(1)}, H^{(2)} \mid V) \ln q(h^{(1)}, H^{(2)} \mid V)
= -\sum_j \bigl\{ \mu_j \ln \mu_j + (1-\mu_j) \ln(1-\mu_j) \bigr\} - M \sum_w \nu_w \ln \nu_w. \tag{21}
$$

Therefore,

\begin{align}
-\frac{\partial}{\partial \mu_j} \sum_{h^{(1)}} \sum_{H^{(2)}} q(h^{(1)}, H^{(2)} \mid V) \ln q(h^{(1)}, H^{(2)} \mid V) &= -\ln \mu_j + \ln(1-\mu_j) \nonumber\\
-\frac{\partial}{\partial \nu_w} \sum_{h^{(1)}} \sum_{H^{(2)}} q(h^{(1)}, H^{(2)} \mid V) \ln q(h^{(1)}, H^{(2)} \mid V) &= -M \ln \nu_w - M \tag{22}
\end{align}

Consequently,

\begin{align}
\frac{\partial \mathcal{L}}{\partial \mu_j} &= \sum_w M \nu_w W_{jw} + \sum_w W_{jw} v_w + (M+N) a_j - \ln \mu_j + \ln(1-\mu_j) \nonumber\\
\frac{\partial \mathcal{L}}{\partial \nu_w} &= \sum_j M \mu_j W_{jw} + M b_w - M \ln \nu_w - M \tag{23}
\end{align}


By solving $\frac{\partial \mathcal{L}}{\partial \mu_j} = 0$, we obtain the following:

\begin{align}
\ln \mu_j - \ln(1-\mu_j) &= \sum_w M \nu_w W_{jw} + \sum_w W_{jw} v_w + (M+N) a_j \nonumber\\
\frac{\mu_j}{1-\mu_j} &= \exp\Bigl( \sum_w M \nu_w W_{jw} + \sum_w W_{jw} v_w + (M+N) a_j \Bigr) \nonumber\\
\therefore\ \mu_j &= \sigma\Bigl( \sum_w M \nu_w W_{jw} + \sum_w W_{jw} v_w + (M+N) a_j \Bigr). \tag{24}
\end{align}

By solving $\frac{\partial \mathcal{L}}{\partial \nu_w} = 0$ together with the constraint $\sum_w \nu_w = 1$, we obtain the following:

\begin{align}
M \ln \nu_w + M &= \sum_j M \mu_j W_{jw} + M b_w \nonumber\\
\nu_w &\propto \exp\Bigl( \sum_j \mu_j W_{jw} + b_w \Bigr) \nonumber\\
\nu_w &= \frac{\exp\bigl( \sum_j \mu_j W_{jw} + b_w \bigr)}{\sum_{w'} \exp\bigl( \sum_j \mu_j W_{jw'} + b_{w'} \bigr)} \tag{25}
\end{align}

We can use Eqs. (24) and (25) for updating the variational posterior parameters µ and ν.
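A minimal sketch of how Eqs. (24) and (25) can be iterated as mean-field fixed-point updates for one document; the initialization, the fixed number of iterations, and all names are assumptions of this sketch.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def mean_field(v_counts, W, a, b, M, N, n_iters=25):
    """Alternate the updates of Eqs. (24) and (25) for one document."""
    J, W_dict = W.shape
    mu = np.full(J, 0.5)                 # q(h1_j = 1) = mu_j
    nu = np.full(W_dict, 1.0 / W_dict)   # q(h2_mw = 1) = nu_w
    for _ in range(n_iters):
        mu = sigmoid(M * (W @ nu) + W @ v_counts + (M + N) * a)  # Eq. (24)
        nu = softmax(W.T @ mu + b)                               # Eq. (25)
    return mu, nu
```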

5 Learning procedure

Please refer to the following paper: Nitish Srivastava, Ruslan Salakhutdinov, and Geoffrey Hinton. Fast Inference and Learning for Modeling Documents with a Deep Boltzmann Machine.
