Classification using Discriminative Restricted Boltzmann Machines
Hugo Larochelle and Yoshua Bengio
Presented by: Bhargav Mangipudi
December 2, 2016
IE598 - Inference in Graphical Models
Outline
1 Introduction
2 Modeling Classification using RBMs
3 Discriminative Restricted Boltzmann Machines (DRBM)
4 Hybrid Discriminative Restricted Boltzmann Machines (HDRBM)
5 Semi-Supervised Learning
6 Experiments
   Character Recognition
   Document Classification
7 Further Work
Restricted Boltzmann Machines
Restricted Boltzmann Machines are generative stochastic models that can model a probability distribution over their inputs using a set of hidden (or latent) units.
Figure 1: Restricted Boltzmann Machine
They are represented as a bipartite graphical model where the visible layer is the observed data and the hidden layer models latent features.
RBM as feature extractors
RBMs were predominantly used in an unsupervised setting to extract better features from the input data.
By trying to reconstruct the input vector, the RBM learns a different representation of the input data. This representation is then used in a downstream classification algorithm.
Extracting features with an RBM and feeding them to a downstream task introduces a large number of hyper-parameters: the parameters of the RBM/DBN plus the parameters of the downstream classification procedure.
Using RBM for NN/DBN initialization
RBMs have been used as a building block in Deep Belief Networks.
Pre-training: the stacked RBMs are trained layer by layer in a greedy manner, fixing the previously computed layer weights. See the sketch after this list.
Discriminative fine-tuning: when used for a classification task, the weights are fine-tuned by adding a final layer of output nodes and propagating the gradients of the loss.
This was one of the earliest methods to tractably train deep neural network architectures.
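As an illustration only, a greedy layer-wise pre-training loop might look like the following sketch; `train_rbm` is a hypothetical helper (not an API from the paper) that fits one RBM and returns its weights and biases:

```python
import numpy as np

def sigm(a):
    return 1.0 / (1.0 + np.exp(-a))

def pretrain_stack(X, layer_sizes, train_rbm):
    """Greedy layer-wise pre-training: each RBM is trained on the hidden
    representations of the previous one, whose weights are then frozen."""
    reps, stack = X, []
    for n_hidden in layer_sizes:
        W, b, c = train_rbm(reps, n_hidden)   # hypothetical CD-based fit
        stack.append((W, b, c))
        reps = sigm(reps @ W.T + c)           # propagate data upward
    return stack
```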
Main Ideas
RBMs can be used as a self-contained framework for non-linear classification tasks.
The paper introduces an objective function for discriminative training of RBMs.
It extends the model to the semi-supervised learning setting.
Experiments show results competitive with other techniques.
Modeling Classification using RBMs
Assume that the training set is represented as $D_{\mathrm{train}} = \{(x_i, y_i)\}$, where $x_i$ represents the input vector of the $i$-th example and $y_i \in \{1, \dots, C\}$ is the corresponding target class.
Each label $y_i$ is encoded as a one-hot vector $\vec{y}$ over the set of $C$ classes and treated as observed.
Thus, we have
$$P(y, x, h) \propto e^{-E(y, x, h)}$$
where
$$E(y, x, h) = -h^T W x - b^T x - c^T h - d^T \vec{y} - h^T U \vec{y}$$
with parameters $\Theta = (W, b, c, d, U)$.
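As a concrete illustration (not from the paper), here is a minimal numpy sketch of this energy, assuming $W$ has shape (n_hidden, n_visible), $U$ has shape (n_hidden, C), and labels are integers so that indexing replaces the one-hot products:

```python
import numpy as np

def energy(y, x, h, W, b, c, d, U):
    """E(y, x, h) for the joint RBM over (input, label, hidden) units.

    y is an integer class in {0, ..., C-1}; d[y] and U[:, y] implement
    the one-hot products d^T y and h^T U y by indexing.
    """
    return -(h @ W @ x + b @ x + c @ h + d[y] + h @ U[:, y])
```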
Modeling Classification using RBMs - Continued
From the bipartite structure of the model, the conditional distributions factorize as follows:
$$P(x \mid h) = \prod_i P(x_i \mid h), \qquad P(x_i = 1 \mid h) = \mathrm{sigm}\Big(b_i + \sum_j W_{j,i} h_j\Big)$$
$$P(y \mid h) = \frac{e^{d_y + \sum_j U_{j,y} h_j}}{\sum_{y^*} e^{d_{y^*} + \sum_j U_{j,y^*} h_j}}$$
Similarly, for the hidden units we get:
$$P(h \mid y, x) = \prod_j P(h_j \mid y, x), \qquad P(h_j = 1 \mid y, x) = \mathrm{sigm}\Big(c_j + U_{j,y} + \sum_i W_{j,i} x_i\Big)$$
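A direct numpy transcription of these conditionals, as a sketch under the same shape conventions as the energy sketch above:

```python
import numpy as np

def sigm(a):
    return 1.0 / (1.0 + np.exp(-a))

def p_h_given_yx(y, x, W, c, U):
    # P(h_j = 1 | y, x) = sigm(c_j + U_{j,y} + sum_i W_{j,i} x_i)
    return sigm(c + U[:, y] + W @ x)

def p_x_given_h(h, W, b):
    # P(x_i = 1 | h) = sigm(b_i + sum_j W_{j,i} h_j)
    return sigm(b + W.T @ h)

def p_y_given_h(h, d, U):
    # P(y | h) proportional to exp(d_y + sum_j U_{j,y} h_j)
    logits = d + h @ U
    e = np.exp(logits - logits.max())   # subtract max for stability
    return e / e.sum()
```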
Training RBMs
To train a generative model on the training set, we minimize the negative log-likelihood:
$$\mathcal{L}_{\mathrm{gen}}(D_{\mathrm{train}}) = -\sum_{i=1}^{|D_{\mathrm{train}}|} \log P(y_i, x_i)$$
Calculating the exact gradient of $\log P(y_i, x_i)$ for any parameter $\theta \in \Theta$, we get:
$$\frac{\partial \log P(y_i, x_i)}{\partial \theta} = -\mathbb{E}_{h \mid y_i, x_i}\left[\frac{\partial}{\partial \theta} E(y_i, x_i, h)\right] + \mathbb{E}_{y, x, h}\left[\frac{\partial}{\partial \theta} E(y, x, h)\right]$$
The second expectation is intractable, so the gradient is estimated using Contrastive Divergence.
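As an illustrative sketch (not the paper's exact recipe), one CD-1 update on a single $(x, y)$ pair, assuming binary units, the conditionals above, and a mean-field reconstruction of $x$:

```python
import numpy as np
rng = np.random.default_rng(0)

def sigm(a):
    return 1.0 / (1.0 + np.exp(-a))

def cd1_step(x, y, W, b, c, d, U, lr=0.01):
    """One contrastive-divergence (CD-1) update for the generative objective."""
    C = d.shape[0]
    # Positive phase: hidden expectations with (x, y) clamped to the data.
    h0 = sigm(c + U[:, y] + W @ x)
    # Negative phase: one Gibbs step h -> (x', y') -> h'.
    hs = (rng.random(h0.shape) < h0).astype(float)   # sample hidden units
    x1 = sigm(b + W.T @ hs)                          # mean-field reconstruction
    logits = d + hs @ U
    py = np.exp(logits - logits.max()); py /= py.sum()
    y1 = rng.choice(C, p=py)                         # sample a label
    h1 = sigm(c + U[:, y1] + W @ x1)
    # Updates: data statistics minus reconstruction statistics.
    yv, yv1 = np.eye(C)[y], np.eye(C)[y1]
    W += lr * (np.outer(h0, x) - np.outer(h1, x1))
    b += lr * (x - x1)
    c += lr * (h0 - h1)
    d += lr * (yv - yv1)
    U += lr * (np.outer(h0, yv) - np.outer(h1, yv1))
```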
Training RBMs - Continued
After learning the model's parameters iteratively by stochastic gradient descent with a suitable learning rate, we can perform classification by estimating:
$$p(y \mid x) = \frac{e^{d_y} \prod_{j=1}^{n} \left(1 + e^{c_j + U_{j,y} + \sum_i W_{j,i} x_i}\right)}{\sum_{y^*} e^{d_{y^*}} \prod_{j=1}^{n} \left(1 + e^{c_j + U_{j,y^*} + \sum_i W_{j,i} x_i}\right)}$$
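In the log domain the products become sums of softplus terms and the normalization a log-sum-exp, so this can be computed exactly and stably; a minimal sketch:

```python
import numpy as np

def log_p_y_given_x(x, W, c, d, U):
    """log P(y | x) for all classes, via the closed form above."""
    o = c[:, None] + (W @ x)[:, None] + U          # o[j, y] = c_j + W_j x + U_{j,y}
    logits = d + np.logaddexp(0.0, o).sum(axis=0)  # log(1 + e^o) = softplus
    return logits - np.logaddexp.reduce(logits)    # log-softmax over classes

# Prediction: predicted_class = np.argmax(log_p_y_given_x(x, W, c, d, U))
```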
When the model is trained in this unsupervised/generative way, the feature representation learned by the hidden layer is not guaranteed to be useful for the classification task.
Discriminative Restricted Boltzmann Machines
For the discriminative setting, we directly optimize $P(y \mid x)$ instead of the joint distribution $p(y, x)$, giving the following negative log-likelihood:
$$\mathcal{L}_{\mathrm{disc}}(D_{\mathrm{train}}) = -\sum_{i=1}^{|D_{\mathrm{train}}|} \log P(y_i \mid x_i)$$
Training procedure:
$$\frac{\partial \log P(y_i \mid x_i)}{\partial \theta} = \sum_j \mathrm{sigm}(o_{y_i,j}(x_i)) \frac{\partial o_{y_i,j}(x_i)}{\partial \theta} - \sum_{j,y^*} \mathrm{sigm}(o_{y^*,j}(x_i))\, P(y^* \mid x_i) \frac{\partial o_{y^*,j}(x_i)}{\partial \theta}$$
where $o_{y,j}(x) = c_j + \sum_k W_{j,k} x_k + U_{j,y}$.
This gradient can be computed exactly and efficiently, so the objective can be optimized by stochastic gradient descent.
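Assembling this formula for every parameter gives exact per-example gradients; a sketch under the earlier shape conventions (these are gradients of $\log P(y \mid x)$, to be ascended):

```python
import numpy as np

def disc_gradients(x, y, W, c, d, U):
    """Exact gradients of log P(y | x) w.r.t. W, c, d, U for one example."""
    C = d.shape[0]
    o = c[:, None] + (W @ x)[:, None] + U             # o[j, y*] as defined above
    s = 1.0 / (1.0 + np.exp(-o))                      # sigm(o_{y*,j}(x))
    logits = d + np.logaddexp(0.0, o).sum(axis=0)
    p = np.exp(logits - np.logaddexp.reduce(logits))  # P(y* | x)
    grad_c = s[:, y] - s @ p     # positive term minus P(y*|x)-weighted term
    grad_W = np.outer(grad_c, x) # o depends on x only through W_j x
    grad_d = np.eye(C)[y] - p
    grad_U = -s * p              # negative phase, every class column
    grad_U[:, y] += s[:, y]      # positive phase, target column only
    return grad_W, grad_c, grad_d, grad_U
```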
Hybrid Discriminative Restricted Boltzmann Machines
Observation: smaller training sets tend to favor generative learning, and bigger training sets favor discriminative learning.
We adopt a hybrid generative/discriminative approach by combining the respective objective terms:
$$\mathcal{L}_{\mathrm{hybrid}}(D_{\mathrm{train}}) = \mathcal{L}_{\mathrm{disc}}(D_{\mathrm{train}}) + \alpha \mathcal{L}_{\mathrm{gen}}(D_{\mathrm{train}})$$
$\alpha$ is a hyper-parameter controlling the weight of the generative criterion.
$\mathcal{L}_{\mathrm{disc}}(D_{\mathrm{train}})$ and $\mathcal{L}_{\mathrm{gen}}(D_{\mathrm{train}})$ are calculated using the techniques described above.
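Schematically, a hybrid SGD step just adds the two gradient estimates with weight $\alpha$; `disc_gradients` is the sketch above, and `cd_gradients` stands in for a hypothetical CD-based estimator of the generative gradients returned in the same parameter order:

```python
def hybrid_step(x, y, params, cd_gradients, lr=0.01, alpha=0.01):
    """One hybrid update on params = (W, c, d, U):
    ascend log P(y|x) + alpha * log P(y, x)."""
    g_disc = disc_gradients(x, y, *params)    # exact discriminative gradients
    g_gen = cd_gradients(x, y, *params)       # hypothetical CD estimate
    for p, gd, gg in zip(params, g_disc, g_gen):
        p += lr * (gd + alpha * gg)           # in-place update of each array
```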
Extension for Semi-Supervised Learning
Motivation: Labeled/supervised data is not readily available.
Assume $D_{\mathrm{unlab}}$ represents the set of unlabelled data.
$$\mathcal{L}_{\mathrm{unsup}}(D_{\mathrm{unlab}}) = -\sum_{i=1}^{|D_{\mathrm{unlab}}|} \log P(x_i)$$
Training:
$$\frac{\partial \log P(x_i)}{\partial \theta} = -\mathbb{E}_{h, y \mid x_i}\left[\frac{\partial}{\partial \theta} E(y, x_i, h)\right] + \mathbb{E}_{y, x, h}\left[\frac{\partial}{\partial \theta} E(y, x, h)\right]$$
The first term differs from the $\mathcal{L}_{\mathrm{gen}}$ gradient estimate, since the label is now unobserved. It can be calculated efficiently by taking the expectation, with respect to $P(y \mid x_i)$, of the corresponding term from $\mathcal{L}_{\mathrm{gen}}$, or approximated by sampling a single $y_i$ and using a point estimate.
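A sketch of the expectation variant for the first term: the positive-phase hidden statistics are averaged over $P(y \mid x_i)$ rather than clamped to an observed label (only the hidden-unit statistic is shown; the other statistics are weighted the same way):

```python
import numpy as np

def expected_positive_hidden(x, W, c, d, U):
    """E_{y,h|x}[h] = sum_y P(y|x) * sigm(c + U[:, y] + W @ x)."""
    o = c[:, None] + (W @ x)[:, None] + U              # per-class activations
    logits = d + np.logaddexp(0.0, o).sum(axis=0)
    p = np.exp(logits - np.logaddexp.reduce(logits))   # P(y | x)
    return (1.0 / (1.0 + np.exp(-o))) @ p              # average over classes
```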
Extension for Semi-Supervised Learning - Continued
Semi-supervised setting:
$$\mathcal{L}_{\mathrm{semi\text{-}sup}}(D_{\mathrm{train}}, D_{\mathrm{unlab}}) = \mathcal{L}_{\mathrm{TYPE}}(D_{\mathrm{train}}) + \beta \mathcal{L}_{\mathrm{unsup}}(D_{\mathrm{unlab}})$$
where $\mathrm{TYPE} \in \{\mathrm{gen}, \mathrm{disc}, \mathrm{hybrid}\}$.
Thus, online training by stochastic gradient descent corresponds to applying two gradient updates: one for the objective $\mathcal{L}_{\mathrm{TYPE}}$ and one for the unlabelled-data objective $\mathcal{L}_{\mathrm{unsup}}$.
Character (Digit) Recognition - MNIST Dataset
MNIST Dataset for classifying images of digits.
(In the results table: $\lambda$ = learning rate and $n$ = number of training iterations.)
Likelihood objective for the HDRBM (after hyper-parameter tuning):
$$\mathcal{L}_{\mathrm{hybrid}}(D_{\mathrm{train}}) = \mathcal{L}_{\mathrm{disc}}(D_{\mathrm{train}}) + 0.01\, \mathcal{L}_{\mathrm{gen}}(D_{\mathrm{train}})$$
Sparse HDRBM: a small value $\delta$ is subtracted from the hidden-layer biases $c$ after each update.
Document Classification - 20-newsgroup Dataset
Task: Classify newsgroup documents into 20 target classes.
Features used: the 5000 most common words across the data set, as binary input features.
Document Classification - 20-newsgroup Dataset -HDRBM analysis
Figure 2: Similarity matrix for the weights $U_{\cdot,y_j}$
Semi-supervised learning results
The semi-supervised setting is compared against other non-parametric semi-supervised learning algorithms.
Figure 3: Percentage error across different datasets for semi-supervised learning
Further Work
Louradour, Jerome and Larochelle, Hugo. Classification of Sets using Restricted Boltzmann Machines. UAI 2011.
Larochelle, Hugo; Mandel, Michael; Pascanu, Razvan and Bengio, Yoshua. Learning Algorithms for the Classification Restricted Boltzmann Machine. JMLR 2012.
van der Maaten, Laurens. Discriminative Restricted Boltzmann Machines are Universal Approximators for Discrete Data. Technical Report EWI-PRB TR 2011001, Delft University of Technology.
Classification of Sets using RBMs
In this scenario, each input instance $x_i$ is represented by a set of vectors $x_i^{(0)}, x_i^{(1)}, \dots, x_i^{(s)}$. For example, each input instance could be different snippets of a document (e.g. an email) or different regions of an image. The problem is still modelled as a multi-class classification task.
One common approach to this problem is the multiple-instance learning (MIL) model, where the instance segments are classified individually and the final label is assigned by majority vote or a more conservative voting mechanism.
The drawback of MIL is that it treats each instance segment equally, and this implicit assumption might not always hold: if each segment can only provide partial class information, MIL will not capture that correctly. This paper tackles the problem by adding redundant hidden units (layers) to the RBM architecture.
ClassSetRBM - Mutually exclusive hidden units (XOR)
Figure 4: ClassSetRBMXOR architecture
$X$ represents the set of instance segments for each data instance, i.e. $|X|$ = number of segments per training instance.
ClassSetRBM - Mutually exclusive hidden units (XOR) -Continued
In this scenario, all hidden units are mutually exclusive across different segments, i.e.
$$\sum_{s=1}^{|X|} h_j^{(s)} \in \{0, 1\} \quad \forall j = 1, \dots, H$$
The other distributions are slightly modified to accommodate the redundant hidden layers:
$$P(y = y_i \mid X) = \frac{\exp(-F_{\mathrm{XOR}}(X, y_i))}{\sum_{y^*} \exp(-F_{\mathrm{XOR}}(X, y^*))}$$
where the free energy is
$$F_{\mathrm{XOR}}(X, y) = -d^T \vec{y} - \sum_{j=1}^{H} \mathrm{softplus}(\mathrm{softmax}_j(X) + U_{j,y})$$
and
$$\mathrm{softmax}_j(X) = \log \sum_{s=1}^{|X|} \exp\big(c_j + W_j x^{(s)}\big)$$
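A numpy sketch of this conditional, with the segments stacked as an $S \times n_{\mathrm{visible}}$ matrix `X` and the same parameter shapes as before (note that $\mathrm{softmax}_j$ here is the log-sum-exp pooling defined above, not the usual softmax):

```python
import numpy as np

def p_y_given_set(X, W, c, d, U):
    """P(y | X) for the ClassSetRBM-XOR free energy."""
    act = c + X @ W.T                            # (S, H) per-segment activations
    pooled = np.logaddexp.reduce(act, axis=0)    # softmax_j(X): pool over segments
    # -F_XOR(X, y) = d_y + sum_j softplus(pooled_j + U_{j,y})
    neg_F = d + np.logaddexp(0.0, pooled[:, None] + U).sum(axis=0)
    return np.exp(neg_F - np.logaddexp.reduce(neg_F))   # softmax over classes
```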
ClassSetRBM - Redundant Evidence (OR)
Figure 5: ClassSetRBMOR architecture.
In the paper, they argue that the assumption of mutual exclusivity over hidden units may be too strong, and relax it by adding additional "hidden" layer copies $G$ connected only to $y$ and by removing the direct connection from $H$ to $y$.
Thank You!
Questions?