
Deep Learning

Restricted Boltzmann Machines (RBM)

Ali Ghodsi

University of Waterloo

December 15, 2015

Slides are partially based on the book in preparation, Deep Learning, by Yoshua Bengio, Ian Goodfellow, and Aaron Courville, 2015.


Restricted Boltzmann Machines

Restricted Boltzmann machines are some of the most common building blocks of deep probabilistic models. They are undirected probabilistic graphical models containing a layer of observable variables and a single layer of latent variables.


Restricted Boltzmann Machines

p(v, h) = \frac{1}{Z} \exp\{-E(v, h)\},

where E(v, h) is the energy function

E(v, h) = -b^T v - c^T h - v^T W h,

and Z is the normalizing constant (partition function):

Z = \sum_{v} \sum_{h} \exp\{-E(v, h)\}.


Restricted Boltzmann Machine (RBM)

Energy function:

E(v, h) = -b^T v - c^T h - v^T W h
        = -\sum_{k} b_k v_k - \sum_{j} c_j h_j - \sum_{j} \sum_{k} W_{jk} h_j v_k

Distribution: p(v, h) = \frac{1}{Z} \exp\{-E(v, h)\}

Partition function: Z = \sum_{v} \sum_{h} \exp\{-E(v, h)\}

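As a concrete illustration (not part of the original slides), the following is a minimal NumPy sketch that evaluates the energy and computes Z by brute-force enumeration for a toy binary RBM. The sizes d = 4 and n = 3, the (d, n) layout of W, and all variable names are illustrative assumptions; enumerating all 2^(d+n) configurations is only feasible for models this small.

import itertools
import numpy as np

rng = np.random.default_rng(0)
d, n = 4, 3                                  # visible and hidden units (toy sizes)
W = rng.normal(scale=0.1, size=(d, n))       # W[k, j] connects v_k and h_j
b = np.zeros(d)                              # visible biases
c = np.zeros(n)                              # hidden biases

def energy(v, h):
    # E(v, h) = -b^T v - c^T h - v^T W h
    return -(b @ v) - (c @ h) - (v @ W @ h)

def partition_function():
    # Z = sum over all 2^(d+n) binary configurations of exp{-E(v, h)}
    Z = 0.0
    for v in itertools.product([0, 1], repeat=d):
        for h in itertools.product([0, 1], repeat=n):
            Z += np.exp(-energy(np.array(v), np.array(h)))
    return Z

Z = partition_function()
v = np.array([1, 0, 1, 1]); h = np.array([0, 1, 0])
print("p(v, h) =", np.exp(-energy(v, h)) / Z)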

Conditional Distributions

The partition function Z is intractable.

Therefore the joint probability distribution is also intractable.

But P(h|v) is simple to compute and sample from.


Deriving the conditional distributions from the joint distribution

p(h | v) = \frac{p(h, v)}{p(v)}

         = \frac{1}{p(v)} \frac{1}{Z} \exp\{b^T v + c^T h + v^T W h\}

         = \frac{1}{Z'} \exp\{c^T h + v^T W h\}

         = \frac{1}{Z'} \exp\Big\{ \sum_{j=1}^{n} c_j h_j + \sum_{j=1}^{n} v^T W_{:,j} h_j \Big\}

         = \frac{1}{Z'} \prod_{j=1}^{n} \exp\{c_j h_j + v^T W_{:,j} h_j\}



The distributions over the individual binary h_j

P(h_j = 1 | v) = \frac{P(h_j = 1, v)}{P(h_j = 0, v) + P(h_j = 1, v)}
              = \frac{\exp\{c_j + v^T W_{:,j}\}}{\exp\{0\} + \exp\{c_j + v^T W_{:,j}\}}
              = sigmoid(c_j + v^T W_{:,j})

P(h | v) = \prod_{j=1}^{n} P(h_j | v),  with P(h_j = 1 | v) = sigmoid(c_j + v^T W_{:,j})

P(v | h) = \prod_{i=1}^{d} P(v_i | h),  with P(v_i = 1 | h) = sigmoid(b_i + W_{i,:} h)
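A minimal sketch of these factorized conditionals in NumPy, assuming as before a weight matrix W of shape (d, n) so that column W[:, j] holds the weights of hidden unit j; all names and shapes are illustrative assumptions.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def p_h_given_v(v, W, c):
    # Vector of P(h_j = 1 | v) = sigmoid(c_j + v^T W[:, j]) for all j at once
    return sigmoid(c + v @ W)

def p_v_given_h(h, W, b):
    # Vector of P(v_i = 1 | h) = sigmoid(b_i + W[i, :] h) for all i at once
    return sigmoid(b + W @ h)

# Example: with all-zero parameters every unit is on with probability 0.5.
W = np.zeros((4, 3)); b = np.zeros(4); c = np.zeros(3)
print(p_h_given_v(np.array([1.0, 0.0, 1.0, 1.0]), W, c))   # -> [0.5 0.5 0.5]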


RBM Gibbs Sampling

Step 1: Sample h^{(l)} ~ P(h | v^{(l)}).

We can simultaneously and independently sample all the elements of h^{(l)} given v^{(l)}.

Step 2: Sample v^{(l+1)} ~ P(v | h^{(l)}).

We can simultaneously and independently sample all the elements of v^{(l+1)} given h^{(l)}.
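A sketch of one such block Gibbs sweep, under the same assumed shapes as in the earlier snippets; each block of units is sampled in parallel from its factorized conditional.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gibbs_step(v, W, b, c, rng):
    # Step 1: sample all elements of h simultaneously from P(h | v)
    p_h = sigmoid(c + v @ W)
    h = (rng.random(p_h.shape) < p_h).astype(float)
    # Step 2: sample all elements of the new v simultaneously from P(v | h)
    p_v = sigmoid(b + W @ h)
    v_new = (rng.random(p_v.shape) < p_v).astype(float)
    return v_new, h

rng = np.random.default_rng(1)
W = np.zeros((4, 3)); b = np.zeros(4); c = np.zeros(3)
v = np.array([1.0, 0.0, 1.0, 1.0])
for _ in range(5):                       # run a short Gibbs chain
    v, h = gibbs_step(v, W, b, c, rng)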


Training Restricted Boltzmann Machines

The log-likelihood is given by:

\ell(W, b, c) = \sum_{t=1}^{n} \log P(v^{(t)})

             = \sum_{t=1}^{n} \log \sum_{h} P(v^{(t)}, h)

             = \sum_{t=1}^{n} \log \sum_{h} \exp\{-E(v^{(t)}, h)\} - n \log Z

             = \sum_{t=1}^{n} \log \sum_{h} \exp\{-E(v^{(t)}, h)\} - n \log \sum_{v,h} \exp\{-E(v, h)\}
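The sums over h and over (v, h) can only be carried out exactly for very small models. The sketch below (an illustration, not from the slides; toy sizes assumed) computes this exact log-likelihood for a tiny RBM by enumeration, which makes the 2^n and 2^(d+n) terms behind the intractability explicit.

import itertools
import numpy as np

def log_likelihood(V, W, b, c):
    # V holds one binary visible vector v^(t) per row.
    d, n = W.shape
    energy = lambda v, h: -(b @ v) - (c @ h) - (v @ W @ h)
    all_h = [np.array(h) for h in itertools.product([0, 1], repeat=n)]   # 2^n terms
    all_v = [np.array(v) for v in itertools.product([0, 1], repeat=d)]   # 2^d terms
    log_Z = np.log(sum(np.exp(-energy(v, h)) for v in all_v for h in all_h))
    return sum(np.log(sum(np.exp(-energy(v_t, h)) for h in all_h)) - log_Z for v_t in V)

# With all-zero parameters the model is uniform, so each v has probability 2^-d.
W = np.zeros((4, 3)); b = np.zeros(4); c = np.zeros(3)
V = np.array([[1, 0, 1, 1], [0, 0, 1, 0]], dtype=float)
print(log_likelihood(V, W, b, c))        # = 2 * log(1/16)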



Maximizing the likelihood

\theta = \{b, c, W\}:

\ell(\theta) = \sum_{t=1}^{n} \log \sum_{h} \exp\{-E(v^{(t)}, h)\} - n \log \sum_{v,h} \exp\{-E(v, h)\}

\nabla_\theta \ell(\theta) = \nabla_\theta \sum_{t=1}^{n} \log \sum_{h} \exp\{-E(v^{(t)}, h)\} - n \nabla_\theta \log \sum_{v,h} \exp\{-E(v, h)\}

= \sum_{t=1}^{n} \frac{\sum_{h} \exp\{-E(v^{(t)}, h)\} \nabla_\theta (-E(v^{(t)}, h))}{\sum_{h} \exp\{-E(v^{(t)}, h)\}} - n \frac{\sum_{v,h} \exp\{-E(v, h)\} \nabla_\theta (-E(v, h))}{\sum_{v,h} \exp\{-E(v, h)\}}

= \sum_{t=1}^{n} E_{P(h|v^{(t)})}[\nabla_\theta (-E(v^{(t)}, h))] - n \, E_{P(h,v)}[\nabla_\theta (-E(v, h))]



The gradient of the negative energy function

\nabla_W (-E(v, h)) = \frac{\partial}{\partial W} (b^T v + c^T h + v^T W h) = h v^T

\nabla_b (-E(v, h)) = \frac{\partial}{\partial b} (b^T v + c^T h + v^T W h) = v

\nabla_c (-E(v, h)) = \frac{\partial}{\partial c} (b^T v + c^T h + v^T W h) = h


\nabla_\theta \ell(\theta) = \sum_{t=1}^{n} E_{P(h|v^{(t)})}[\nabla_\theta (-E(v^{(t)}, h))] - n \, E_{P(h,v)}[\nabla_\theta (-E(v, h))]

\nabla_W \ell(W, b, c) = \sum_{t=1}^{n} \hat{h}^{(t)} v^{(t)T} - n \, E_{P(v,h)}[h v^T]

\nabla_b \ell(W, b, c) = \sum_{t=1}^{n} v^{(t)} - n \, E_{P(v,h)}[v]

\nabla_c \ell(W, b, c) = \sum_{t=1}^{n} \hat{h}^{(t)} - n \, E_{P(v,h)}[h]

where \hat{h}^{(t)} = E_{P(h|v^{(t)})}[h] = sigmoid(c + v^{(t)T} W).

The expectations under P(v, h) involve the intractable partition function, so it is impractical to compute the exact log-likelihood gradient.
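For illustration, here is a sketch (not from the slides) of how these gradient formulas look in NumPy when the model expectation is approximated with a batch of samples V_model assumed to be drawn from the model; the following slides obtain such samples with Gibbs sampling. All names and the (d, n) layout of W are assumptions, and the arrays are arranged so that the W gradient comes out with W's own shape.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def log_likelihood_gradients(V_data, V_model, W, b, c):
    # Positive phase: expectations under P(h | v^(t)) at the training data.
    H_data = sigmoid(c + V_data @ W)          # rows are hat{h}^(t) = sigmoid(c + v^(t)T W)
    # Negative phase: E_P(v,h)[.] replaced by an average over model samples.
    H_model = sigmoid(c + V_model @ W)
    n_data, m = V_data.shape[0], V_model.shape[0]
    # grad_W has the (d, n) shape of W, i.e. the transpose of the slide's h v^T.
    grad_W = V_data.T @ H_data - (n_data / m) * (V_model.T @ H_model)
    grad_b = V_data.sum(axis=0) - (n_data / m) * V_model.sum(axis=0)
    grad_c = H_data.sum(axis=0) - (n_data / m) * H_model.sum(axis=0)
    return grad_W, grad_b, grad_c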

Contrastive Divergence

Idea:

1. replace the expectation with a point estimate at \tilde{v}

2. obtain the point \tilde{v} by Gibbs sampling

3. start the sampling chain at v^{(t)}

E_{P(h,v)}[\nabla_\theta (-E(v, h))] \approx \nabla_\theta (-E(v, h)) \big|_{v = \tilde{v},\, h = \tilde{h}}


Set ε, the step size, to a small positive number.
Set k, the number of Gibbs steps, high enough to allow a Markov chain of p(v; θ) to mix when initialized from p_data; perhaps 1 to 20 to train an RBM on a small image patch.

while not converged do
    Sample a minibatch of m examples {v^{(1)}, ..., v^{(m)}} from the training set.
    \hat{h}^{(t)} ← sigmoid(c + v^{(t)T} W) for t = 1, ..., m
    ∇W ← (1/m) \sum_{t=1}^{m} \hat{h}^{(t)} v^{(t)T}
    ∇b ← (1/m) \sum_{t=1}^{m} v^{(t)}
    ∇c ← (1/m) \sum_{t=1}^{m} \hat{h}^{(t)}
    for t = 1 to m do
        \tilde{v}^{(t)} ← v^{(t)}
    end for
    for l = 1 to k do
        for t = 1 to m do
            \tilde{h}^{(t)} sampled from \prod_{j=1}^{n} sigmoid(c_j + \tilde{v}^{(t)T} W_{:,j})
            \tilde{v}^{(t)} sampled from \prod_{i=1}^{d} sigmoid(b_i + W_{i,:} \tilde{h}^{(t)})
        end for
    end for
    \tilde{h}^{(t)} ← sigmoid(c + \tilde{v}^{(t)T} W) for t = 1, ..., m
    ∇W ← ∇W − (1/m) \sum_{t=1}^{m} \tilde{h}^{(t)} \tilde{v}^{(t)T}
    ∇b ← ∇b − (1/m) \sum_{t=1}^{m} \tilde{v}^{(t)}
    ∇c ← ∇c − (1/m) \sum_{t=1}^{m} \tilde{h}^{(t)}
    W ← W + ε ∇W
    b ← b + ε ∇b
    c ← c + ε ∇c
end while

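A minimal NumPy sketch of one CD-k minibatch update corresponding to the algorithm above. The function name cd_k_update, the (d, n) layout of W, and the default hyperparameters are illustrative assumptions, not a reference implementation from the slides.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd_k_update(V, W, b, c, eps=0.05, k=1, rng=None):
    # One contrastive-divergence (CD-k) step on a minibatch V of shape (m, d).
    rng = rng or np.random.default_rng()
    m = V.shape[0]

    # Positive phase: statistics of the data under P(h | v).
    H_hat = sigmoid(c + V @ W)                       # hat{h}^(t) for every row of V
    gW = V.T @ H_hat / m                             # (d, n), transpose of the slide's h v^T
    gb = V.mean(axis=0)
    gc = H_hat.mean(axis=0)

    # k steps of block Gibbs sampling, chains initialized at the data.
    V_neg = V.copy()
    for _ in range(k):
        H_neg = (rng.random((m, c.size)) < sigmoid(c + V_neg @ W)).astype(float)
        V_neg = (rng.random((m, b.size)) < sigmoid(b + H_neg @ W.T)).astype(float)
    H_neg_hat = sigmoid(c + V_neg @ W)

    # Negative phase: subtract the same statistics measured on the negative samples.
    gW -= V_neg.T @ H_neg_hat / m
    gb -= V_neg.mean(axis=0)
    gc -= H_neg_hat.mean(axis=0)

    # Gradient ascent on the log-likelihood with step size eps.
    return W + eps * gW, b + eps * gb, c + eps * gc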

Pseudo code

1. For each training example v^{(t)}:

   i. generate a negative sample \tilde{v} using k steps of Gibbs sampling, starting at v^{(t)}

   ii. update the parameters:

       W ⇐ W + α ( \hat{h}(v^{(t)}) v^{(t)T} − \hat{h}(\tilde{v}) \tilde{v}^T )
       b ⇐ b + α ( v^{(t)} − \tilde{v} )
       c ⇐ c + α ( \hat{h}(v^{(t)}) − \hat{h}(\tilde{v}) )

   where \hat{h}(v) = sigmoid(c + v^T W) and α is the learning rate.

2. Go back to 1 until the stopping criterion is met.
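A hypothetical usage sketch of the cd_k_update function from the previous snippet on toy binary data, with a fixed epoch budget standing in for a real stopping criterion; the data, sizes, and hyperparameters are all assumptions for illustration.

import numpy as np

rng = np.random.default_rng(0)
d, n = 4, 3
data = (rng.random((200, d)) < 0.5).astype(float)    # toy binary "training set"

W = 0.01 * rng.normal(size=(d, n))
b = np.zeros(d)
c = np.zeros(n)

for epoch in range(10):                   # fixed epoch budget as the stopping criterion
    rng.shuffle(data)                     # visit examples in a fresh order each epoch
    for batch in np.split(data, 10):      # minibatches of 20 examples
        W, b, c = cd_k_update(batch, W, b, c, eps=0.05, k=1, rng=rng)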


Example

Figure: samples from the MNIST digit recognition data set. Here, a black pixel corresponds to an input value of 0 and a white pixel corresponds to 1 (the inputs are scaled between 0 and 1).


Example

Figure: the input weights of a random subset of the hidden units. The activation of a unit in the first hidden layer is obtained by a dot product of such a weight "image" with the input image. In these images, a black pixel corresponds to a weight smaller than −3 and a white pixel to a weight larger than 3, with the different shades of gray corresponding to weight values uniformly between −3 and 3 (Larochelle et al., JMLR 2009).

