Scaling Up Spike-and-Slab Models for Unsupervised Feature Learning

Ian J. Goodfellow, Aaron Courville, and Yoshua Bengio

Abstract—We describe the use of two spike-and-slab models for modeling real-valued data, with an emphasis on their applications to object recognition. The first model, which we call spike-and-slab sparse coding (S3C), is a preexisting model for which we introduce a faster approximate inference algorithm. We introduce a deep variant of S3C, which we call the partially directed deep Boltzmann machine (PD-DBM), and extend our S3C inference algorithm for use on this model. We describe learning procedures for each. We demonstrate that our inference procedure for S3C enables scaling the model to unprecedentedly large problem sizes, and demonstrate that using S3C as a feature extractor results in very good object recognition performance, particularly when the number of labeled examples is low. We show that the PD-DBM generates better samples than its shallow counterpart, and that, unlike DBMs or DBNs, the PD-DBM may be trained successfully without greedy layerwise training.

Index Terms—Neural nets, pattern recognition, computer vision


1 INTRODUCTION

It is difficult to overstate the importance of the quality of the input features to supervised learning algorithms. A supervised learning algorithm is given a set of examples $V = \{v^{(1)}, \dots, v^{(m)}\}$ and associated labels $\{y^{(1)}, \dots, y^{(m)}\}$, from which it learns a mapping from $v$ to $y$ that can predict the labels $y$ of new unlabeled examples $v$. The difficulty of this task is strongly influenced by the choice of representation, or the feature set used to encode the input examples $v$. The premise of unsupervised feature discovery is that by learning the structure of $V$, we can discover a feature mapping $\phi(v)$ that renders standard supervised learning algorithms, such as the support vector machine, more effective. Because $\phi(v)$ can be learned from unlabeled data, unsupervised feature discovery can be used for semi-supervised learning (where many more unlabeled examples than labeled examples are available) or transfer learning (where the classifier will be evaluated on only a subset of the categories present in the training data).

When adopting a deep learning (Bengio [1]) approach, the feature learning algorithm should discover a $\phi$ that consists of the composition of several simple feature mappings, each of which transforms the output of the earlier mappings to incrementally disentangle the factors of variation present in the data. Deep learning methods are typically created by repeatedly composing together shallow unsupervised feature learners. Examples of shallow models applied to feature discovery include sparse coding (SC) (Raina et al. [30]); restricted Boltzmann machines (RBMs) (Hinton et al. [14], Courville et al. [6]); various autoencoder-based models (Bengio et al. [2], Vincent et al. [38]); and hybrids of autoencoders and SC (Kavukcuoglu et al. [17]).

In this paper, we describe how to use a model which we call spike-and-slab sparse coding (S3C) as an efficient feature learning algorithm. We also demonstrate how to construct a new deep model, the partially directed deep Boltzmann machine (PD-DBM), with S3C as its first layer. Both are models of real-valued data and as such are well suited to modeling images, or image-like data, such as audio that has been preprocessed into an image-like space (Deng et al. [8]). In this paper, we focus on applying these models to object recognition.

Single-layer convolutional models based on simple thresholded linear feature extractors are currently among the state-of-the-art performers on the CIFAR-10 object recognition dataset (Coates and Ng [4], Jia and Huang [16]). However, the CIFAR-10 dataset contains 5,000 labels per class, and this amount of labeled data can be inconvenient or expensive to obtain for applications requiring more than 10 classes. Previous work has shown that the performance of a simple thresholded linear feature set degrades sharply in accuracy as the number of labeled examples decreases (Coates and Ng [4]).

We introduce the use of the S3C model as a feature extractor to make features more robust to this degradation. This is motivated by the observation that SC performs relatively well when the number of labeled examples is low (Coates and Ng [4]). SC inference invokes a competition among the features to explain the data and therefore, relative to simple thresholded linear feature extractors, acts as a more regularized feature extraction scheme. We speculate that this additional regularization is responsible for its improved performance in the low-labeled-data regime. S3C can be considered as employing an alternative regularization for feature extraction where, unlike SC, the sparsity prior is decoupled from the magnitude of the nonzero, real-valued feature values.


The authors are with the Département d'Informatique et de Recherche Opérationnelle, Université de Montréal, Montréal, QC H3C 3J7, Canada. E-mail: [email protected].




The S3C generative model can be viewed as a hybrid of SC and the recently introduced spike-and-slab RBM (ssRBM) (Courville et al. [7]). Like the ssRBM, S3C possesses a layer of hidden units composed of real-valued slab variables and binary spike variables. The binary spike variables are well suited as inputs to subsequent layers in a deep model. However, like SC and unlike the ssRBM, S3C can be interpreted as a directed graphical model, implying that features in S3C compete with each other to explain the input. As we show, S3C can be derived either from SC by replacing the factorial Laplace prior with a factorial spike-and-slab prior, or from the ssRBM by simply adding a term to its energy function that causes the hidden units to compete with each other.

We hypothesize that S3C features have a stronger regularizing effect than SC features due to the greater sparsity in the spike-and-slab prior relative to the Laplace prior. We validate this hypothesis by showing that S3C has superior performance when labeled data are scarce. We present results on the CIFAR-10 and CIFAR-100 object classification datasets. We also describe how we used S3C to win a transfer learning challenge.

The major technical challenge in using S3C is that exact inference over the posterior of the latent layer is intractable. We derive an efficient structured variational approximation to the posterior distribution and use it to perform approximate inference, as well as learning, as part of a variational expectation maximization (EM) procedure (Saul and Jordan [32]). Our inference algorithm allows us to scale inference and learning in the spike-and-slab coding model to the large problem sizes required for state-of-the-art object recognition.

Our use of a variational approximation for inference distinguishes S3C from standard SC schemes, where maximum a posteriori (MAP) inference is typically used. It also allows us to naturally incorporate S3C as a module of a deeper model. We introduce learning rules for the resulting PD-DBM, describe some of its interesting theoretical properties, and demonstrate how this model can be trained jointly by a single algorithm, rather than requiring the traditional greedy learning algorithm that consists of composing individually trained components (Salakhutdinov and Hinton [31]). The ability to jointly train deep models in a single unified learning stage has the advantage that it allows the units in higher layers to influence the entire learning process at the lower layers. We anticipate that this property may become essential in the future as the size of the models increases. Consider an extremely large deep model, with size sufficient that it requires sparse connections. When this model is trained jointly, the feedback from the units in higher layers will cause units in lower layers to naturally group themselves so that each higher layer unit receives all of the information it needs in its sparse receptive field. Even in small, densely connected models, greedy training may get caught in local optima that joint training can avoid.

2 MODELS

We now describe the models considered in this paper. We first study a model we call the S3C model. This model has appeared previously in the literature in a variety of different domains (Lucke and Sheikh [24], Garrigues and Olshausen [11], Mohamed et al. [26], Zhou et al. [43], Titsias and Lazaro-Gredilla [36]). Next, we describe a way to incorporate S3C into a deeper model, with the primary goal of obtaining a better generative model.

2.1 The S3C Model

The S3C model consists of latent binary spike variables $h \in \{0, 1\}^N$, latent real-valued slab variables $s \in \mathbb{R}^N$, and a real-valued visible vector $v \in \mathbb{R}^D$ generated according to this process:

$$
\forall i \in \{1, \dots, N\},\; d \in \{1, \dots, D\}: \quad
p(h_i = 1) = \sigma(b_i), \quad
p(s_i \mid h_i) = \mathcal{N}\!\left(s_i \mid h_i \mu_i,\, \alpha_{ii}^{-1}\right), \quad
p(v_d \mid s, h) = \mathcal{N}\!\left(v_d \mid W_{d:}(h \circ s),\, \beta_{dd}^{-1}\right),
\tag{1}
$$

where $\sigma$ is the logistic sigmoid function, $b$ is a set of biases on $h$, $\mu$ and $W$ govern the linear dependence of $s$ on $h$ and $v$ on $s$, respectively, $\alpha$ and $\beta$ are diagonal precision matrices of their respective conditionals, and $h \circ s$ denotes the element-wise product of $h$ and $s$.

To avoid overparameterizing the distribution, we constrain the columns of $W$ to have unit norm, as in SC. We restrict $\alpha$ to being a diagonal matrix and $\beta$ to being a diagonal matrix or a scalar. We refer to the variables $h_i$ and $s_i$ as jointly defining the $i$th hidden unit, so that there are a total of $N$ rather than $2N$ hidden units. The state of a hidden unit is best understood as $h_i s_i$, that is, the spike variables gate the slab variables (footnote 1).
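For concreteness, the following NumPy sketch performs ancestral sampling from the generative process in (1); the parameter values in the usage example are arbitrary and purely illustrative, and the helper is not part of the paper's implementation.

```python
import numpy as np

def sample_s3c(W, b, mu, alpha, beta, n_samples, rng=None):
    """Ancestral sampling from the S3C generative process in (1).

    W:     (D, N) weight matrix with unit-norm columns
    b:     (N,)   spike biases
    mu:    (N,)   slab means
    alpha: (N,)   diagonal slab precisions
    beta:  (D,)   diagonal visible precisions
    """
    rng = np.random.default_rng() if rng is None else rng
    D, N = W.shape
    p_h = 1.0 / (1.0 + np.exp(-b))                      # p(h_i = 1) = sigmoid(b_i)
    h = (rng.random((n_samples, N)) < p_h).astype(float)
    s = rng.normal(h * mu, 1.0 / np.sqrt(alpha))        # p(s_i | h_i) = N(h_i mu_i, 1/alpha_ii)
    noise = rng.normal(0.0, 1.0 / np.sqrt(beta), (n_samples, D))
    v = (h * s) @ W.T + noise                           # p(v | s, h) = N(W (h o s), beta^{-1})
    return v, s, h

# Illustrative usage with small, randomly chosen parameters.
rng = np.random.default_rng(0)
D, N = 36, 16
W = rng.normal(size=(D, N))
W /= np.linalg.norm(W, axis=0)                          # unit-norm columns, as in the text
v, s, h = sample_s3c(W, b=-2.0 * np.ones(N), mu=np.ones(N),
                     alpha=np.ones(N), beta=10.0 * np.ones(D),
                     n_samples=5, rng=rng)
```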

2.2 The PD-DBM Model

As described above, the S3C prior is factorial over the hidden units ($h_i s_i$ pairs). Distributions such as the distribution over natural images are rarely well described by simple independent factor models, and so we expect that S3C will likely be a poor generative model for the kinds of data that we wish to consider. We now show one way of incorporating S3C into a deeper model, with the primary goal of obtaining a better generative model. If we assume that $\alpha$ becomes large relative to $\beta$, then the primary structure we need to model is in $h$. We therefore propose placing a DBM prior, rather than a factorial prior, on $h$. The resulting model can be viewed as a deep Boltzmann machine with directed connections at the bottom layer. We call the resulting model a PD-DBM.

The PD-DBM model consists of an observed input vector $v \in \mathbb{R}^D$, a vector of slab variables $s \in \mathbb{R}^{N_0}$, and a set of binary vectors $\boldsymbol{h} = \{h^{(0)}, \dots, h^{(L)}\}$, where $h^{(l)} \in \{0, 1\}^{N_l}$ and $L$ is the number of layers added on top of the S3C model.

The model is parameterized by $\mu$, $\alpha$, and $\beta$, which play the same roles as in S3C. The parameters $W^{(l)}$ and $b^{(l)}$, $l \in \{0, \dots, L\}$, provide the weights and biases of both the S3C model and the DBM prior attached to it.

Together, the complete model implements the following probability distribution:

$$P_{\text{PD-DBM}}(v, s, \boldsymbol{h}) = P_{\text{S3C}}\!\left(v, s \mid h^{(0)}\right) P_{\text{DBM}}(\boldsymbol{h}),$$


1. We can essentially recover $h_i$ and $s_i$ from $h_i s_i$ because $s_i = 0$ has zero measure.


where

$$P_{\text{DBM}}(\boldsymbol{h}) \propto \exp\!\left( \sum_{l=0}^{L} b^{(l)T} h^{(l)} + \sum_{l=1}^{L} h^{(l-1)T} W^{(l)} h^{(l)} \right).$$

A version of the model with three hidden layers ($L = 2$) is depicted graphically in Fig. 1.
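As a rough illustration of this factorization, the sketch below evaluates the unnormalized log-probability of a configuration $(v, s, \boldsymbol{h})$ by summing the tractable S3C conditionals and the unnormalized DBM prior terms. The shapes, argument names, and sign conventions are assumptions based on the formulas above, and the intractable DBM partition function is simply omitted.

```python
import numpy as np

def pd_dbm_unnorm_logprob(v, s, hs, W0, b0, mu, alpha, beta, Ws, bs):
    """Unnormalized log P(v, s, h) for the PD-DBM factorization above.

    hs = [h0, ..., hL] are the binary layers; W0 is the S3C weight matrix
    (D, N0); Ws = [W1, ..., WL] and bs = [b1, ..., bL] are the DBM weights
    (N_{l-1}, N_l) and biases above the first layer.  The DBM partition
    function is omitted, so values are comparable only for fixed parameters.
    """
    h0 = hs[0]
    # log P(s | h^(0)): diagonal Gaussian with mean h0 * mu and precision alpha
    log_s = np.sum(-0.5 * alpha * (s - h0 * mu) ** 2
                   + 0.5 * np.log(alpha / (2.0 * np.pi)))
    # log P(v | s, h^(0)): diagonal Gaussian with mean W0 (h0 * s) and precision beta
    resid = v - W0 @ (h0 * s)
    log_v = np.sum(-0.5 * beta * resid ** 2 + 0.5 * np.log(beta / (2.0 * np.pi)))
    # Unnormalized log P_DBM(h): bias terms plus pairwise layer couplings
    log_h = b0 @ h0
    for l in range(1, len(hs)):
        log_h += hs[l - 1] @ Ws[l - 1] @ hs[l] + bs[l - 1] @ hs[l]
    return log_v + log_s + log_h
```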

Besides admitting a straightforward learning algorithm, the PD-DBM has several useful properties:

• The partition function exists for all parameter settings. This is not true of the ssRBM, which is a very good generative model of natural images (Courville et al. [7]).

• The model family is a universal approximator. The DBM portion, which is a universal approximator of binary distributions (Le Roux and Bengio [22]), can implement a one-hot prior on $h^{(0)}$, thus turning the overall model into a mixture of Gaussians, which is a universal approximator of real-valued distributions (Titterington et al. [37]).

• Inference of the posterior involves feedforward, feedback, and lateral connections. This increases the biological plausibility of the model, and enables it to learn and exploit several rich kinds of interactions between features. The lateral interactions make the lower level features compete to explain the input, and the top-down influences help to obtain the correct representations of ambiguous input.

3 LEARNING PROCEDURES

Maximum likelihood learning is intractable for both models. S3C suffers from an intractable posterior distribution over the latent variables. In addition to an intractable posterior distribution, the PD-DBM suffers from an intractable partition function.

We follow the variational learning approach used by Salakhutdinov and Hinton [31] to train DBMs: Rather than maximizing the log likelihood, we maximize a variational lower bound on the log likelihood. In the case of the PD-DBM, we must do so using a stochastic approximation of the gradient.

The basic strategy of variational learning is to approximate the true posterior $P(h, s \mid v)$ with a simpler distribution $Q(h, s)$. The choice of $Q$ induces a lower bound on the log likelihood called the negative variational free energy. The term of the negative variational free energy that depends on the model parameters is

$$\mathbb{E}_{s, h \sim Q}[\log P(v, s, h)] = \mathbb{E}_{s, h \sim Q}\!\left[\log P\!\left(v \mid s, h^{(0)}\right) + \log P(s \mid h) + \log P(h)\right].$$

In the case of S3C, this bound is tractable and can be optimized in a straightforward manner. It is even possible to use variational EM (Saul and Jordan [32]) to make large, closed-form jumps in parameter space. However, we find gradient ascent learning to be preferable in practice due to the computational expense of the closed-form solution, which involves estimating and inverting the covariance matrix of all of the hidden units.

In the case of the PD-DBM, the objective function is not tractable because the partition function of the DBM portion of the model is not tractable. We can use contrastive divergence (Hinton [12]) or stochastic maximum likelihood (Younes [40], Tieleman [35]) to make a sampling-based approximation to the DBM partition function's contribution to the gradient. Thus, unlike S3C, we must do gradient-based learning rather than closed-form parameter updates. However, the PD-DBM model still has some nice properties in that only a subset of the variables must be sampled during training. The factors of the partition function originating from the S3C portion of the model are still tractable. In particular, training does not ever require sampling real-valued variables. This is a nice property because it means that the gradient estimates are bounded for fixed parameters and data. When sampling real-valued variables, it is possible for the sampling procedure to make gradient estimates arbitrarily large.

We found using the “true gradient” (Douglas et al. [10]) method to be useful for learning with the norm constraint on $W$. We also found that using momentum (Hinton [13]) is very important for learning PD-DBMs.
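The paper uses the true-gradient method of Douglas et al. [10] for the norm constraint; the sketch below substitutes a much simpler alternative, a momentum step followed by renormalizing the columns of $W$, purely to illustrate where the constraint enters the learning loop. It should not be read as the method actually used.

```python
import numpy as np

def momentum_step_unit_norm(W, grad_W, velocity, lr=1e-3, momentum=0.9):
    """One momentum step on W, then project columns back to unit norm.

    Note: simple renormalization is a stand-in for the true-gradient
    method of Douglas et al. [10] referenced in the text.
    """
    velocity = momentum * velocity - lr * grad_W
    W = W + velocity
    W = W / np.linalg.norm(W, axis=0, keepdims=True)   # enforce unit-norm columns
    return W, velocity
```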

3.1 Avoiding Greedy Pretraining

Deep models are commonly pretrained in a greedy layerwise fashion. For example, a DBM is usually initialized from a stack of RBMs, with one RBM trained on the data and each of the other RBMs trained on samples of the previous RBM's hidden layer.

Any greedy training procedure can obviously get stuck in a local minimum. Avoiding the need for greedy training could thus result in better models. For example, when pretraining with an RBM, the lack of explaining away in the posterior prevents the first layer from learning nearly parallel weight vectors, because these would result in similar activations (up to the bias term, which could simply make one unit always less active than the other). Even though the deeper layers of the DBM could implement the explaining away needed for these weight vectors to function correctly (i.e., to have the one that resembles the input the most activate, and inhibit the other unit), the greedy learning procedure does not have the opportunity to learn such weight vectors.


Fig. 1. A graphical model depicting an example PD-DBM.


Previous efforts at jointly training even two-layer DBMs on MNIST have failed (Salakhutdinov and Hinton [31], Desjardins et al. [9], Montavon and Müller [27]). Typically, the jointly trained DBM does not make good use of the second layer, either because the second layer weights are very small or because they contain several duplicate weights focused on a small subset of first layer units that became active early during training. We hypothesize that this is because the second layer hidden units in a DBM must both learn to model correlations in the first layer induced by the data and counteract correlations in the first layer induced by the model family. When the second layer weights are set to 0, the DBM prior acts to correlate hidden units that have similar weight vectors (see Section 5.2).

The PD-DBM model avoids this problem. When the second layer weights are set to 0, the first layer hidden units are independent in the PD-DBM prior (essentially the S3C prior). The second layer thus has only one task: to model the correlations between first layer units induced by the data. As we will show, this hypothesis is supported by the fact that we are able to successfully train a two-layer PD-DBM without greedy pretraining.

4 INFERENCE PROCEDURES

The goal of variational inference is to maximize the lower bound on the log likelihood with respect to the approximate distribution $Q$ over the unobserved variables. This is accomplished by selecting the $Q$ that minimizes the Kullback-Leibler divergence:

$$D_{\mathrm{KL}}\bigl(Q(h, s) \,\|\, P(h, s \mid v)\bigr), \tag{2}$$

where $Q(h, s)$ is drawn from a restricted family of distributions. This family can be chosen to ensure that learning and inference with $Q$ is tractable.

Variational inference can be seen as analogous to the encoding step of the traditional SC algorithm. The key difference is that while SC approximates the true posterior with a MAP point estimate of the latent variables, variational inference approximates the true posterior everywhere with the distribution $Q$.

4.1 Variational Inference for S3C

When working with S3C, we constrain $Q$ to be drawn from the family $Q(h, s) = \prod_i Q(h_i, s_i)$. This is a richer approximation than the fully factorized family used in the mean field approximation. It allows us to capture the tight correlation between each spike variable and its corresponding slab variable while still allowing simple and efficient inference in the approximating distribution. It also avoids a pathological condition in the mean field distribution, where $Q(s_i)$ can never be updated if $Q(h_i) = 0$.

Observing that (2) is an instance of the Euler-Lagrange equation, we find that the solution must take the form

$$Q(h_i = 1) = \hat{h}_i, \qquad Q(s_i \mid h_i) = \mathcal{N}\!\left(s_i \,\middle|\, h_i \hat{s}_i,\; \left(\alpha_{ii} + h_i W_i^T \beta W_i\right)^{-1}\right), \tag{3}$$

where $\hat{h}_i$ and $\hat{s}_i$ must be found by an iterative process. In a typical application of variational inference, the iterative process consists of sequentially applying fixed point equations that give the optimal value of the parameters $\hat{h}_i$ and $\hat{s}_i$ for one factor $Q(h_i, s_i)$ given the values of all of the other factors' parameters. This is, for example, the approach taken by Titsias and Lazaro-Gredilla [36], who independently developed a variational inference procedure for the same problem. This process is only guaranteed to decrease the KL divergence if applied to each factor sequentially, i.e., first updating $\hat{h}_1$ and $\hat{s}_1$ to optimize $Q(h_1, s_1)$, then updating $\hat{h}_2$ and $\hat{s}_2$ to optimize $Q(h_2, s_2)$, and so on. In a typical application of variational inference, the optimal values for each update are simply given by the solutions to the Euler-Lagrange equations. For S3C, we make three deviations from this standard approach.

Because we apply S3C to very large-scale problems, we need an algorithm that can fully exploit the benefits of parallel hardware such as GPUs. Sequential updates across all $N$ factors require far too much runtime to be competitive in this regime.

We have considered two different methods that enable parallel updates to all units. In the first method, we start each iteration by partially minimizing the KL divergence with respect to $s$. The terms of the KL divergence that depend on $s$ make up a quadratic function, so this can be minimized via conjugate gradient descent. We implement conjugate gradient descent efficiently by using the R-operator (Pearlmutter [29]) to perform Hessian-vector products rather than computing the entire Hessian explicitly (Schraudolph [33]). This step is guaranteed to improve the KL divergence on each iteration. We next update $h$ in parallel, shrinking the update by a damping coefficient. This approach is not guaranteed to decrease the KL divergence on each iteration, but it is a widely applied approach that works well in practice (Koller and Friedman [18]).
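The sketch below illustrates the general Hessian-free conjugate gradient technique referenced here: minimizing a quadratic $\frac{1}{2} x^T A x - c^T x$ using only matrix-vector products $A v$ supplied by a callback. The specific objective over $s$ and the R-operator implementation from the paper are not reproduced; `hessian_vec` is a placeholder for them.

```python
import numpy as np

def cg_minimize_quadratic(hessian_vec, c, x0, n_steps=20, tol=1e-8):
    """Minimize 0.5 * x^T A x - c^T x given only A @ v products.

    hessian_vec(v) must return A @ v; A is never formed explicitly,
    mirroring the Hessian-vector-product strategy described in the text.
    """
    x = x0.copy()
    r = c - hessian_vec(x)          # residual = negative gradient
    p = r.copy()
    rs_old = r @ r
    for _ in range(n_steps):
        Ap = hessian_vec(p)
        step = rs_old / (p @ Ap)
        x = x + step * p
        r = r - step * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    return x
```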

With the second method (Algorithm 1), we find in practice that we obtain faster convergence, reaching equally good solutions, by replacing the conjugate gradient update to $s$ with a more heuristic approach. We use a parallel damped update on $s$ much like what we do for $h$. In this case, we make an additional heuristic modification to the update rule, which is made necessary by the unbounded nature of $s$. We clip the update to $s$ so that if $s^{\text{new}}$ has the opposite sign from $s$, its magnitude is at most $\rho |s|$. In all of our experiments, we used $\rho = 0.5$, but any value in $[0, 1]$ is sensible. This prevents a case where multiple mutually inhibitory $s$ units inhibit each other so strongly that, rather than being driven to 0, they change sign and actually increase in magnitude. This case is a failure mode of the parallel updates that can result in $s$ amplifying without bound if clipping is not used.

Algorithm 1. Fixed-Point Inference

Initialize $\hat{h}(0) = \sigma(b)$, $\hat{s}(0) = \mu$, and $k = 0$.

while not converged do

  Compute the individually optimal value $s^*_i$ for each $i$ simultaneously:

  $$s^*_i = \frac{\mu_i \alpha_{ii} + v^T \beta W_i - W_i^T \beta \left( \sum_{j \neq i} W_j \hat{h}_j \hat{s}_j(k) \right)}{\alpha_{ii} + W_i^T \beta W_i}$$

  Clip reflections by assigning

  $$c_i = \rho\, \mathrm{sign}(s^*_i)\, |\hat{s}_i(k)|$$

  for all $i$ such that $\mathrm{sign}(s^*_i) \neq \mathrm{sign}(\hat{s}_i(k))$ and $|s^*_i| > \rho |\hat{s}_i(k)|$, and assigning $c_i = s^*_i$ for all other $i$.

  Damp the updates by assigning

  $$\hat{s}(k+1) = \eta_s c + (1 - \eta_s)\, \hat{s}(k),$$

  where $\eta_s \in (0, 1]$.

  Compute the individually optimal values for $h$:

  $$z_i = \left( v - \sum_{j \neq i} W_j \hat{s}_j(k+1) \hat{h}_j(k) - \frac{1}{2} W_i \hat{s}_i(k+1) \right)^{\!T} \beta\, W_i \hat{s}_i(k+1) + b_i - \frac{1}{2} \alpha_{ii} \bigl( \hat{s}_i(k+1) - \mu_i \bigr)^2 - \frac{1}{2} \log\!\left( \alpha_{ii} + W_i^T \beta W_i \right) + \frac{1}{2} \log(\alpha_{ii})$$

  $$h^* = \sigma(z)$$

  Damp the update to $h$:

  $$\hat{h}(k+1) = \eta_h h^* + (1 - \eta_h)\, \hat{h}(k)$$

  $k \leftarrow k + 1$

end while

Note that Algorithm 1 does not specify a convergence criterion. Many convergence criteria are possible: the convergence criterion could be based on the norm of the gradient of the KL divergence with respect to the variational parameters, the amount that the KL divergence has decreased in the last iteration, or the amount that the variational parameters have changed in the final iteration. Salakhutdinov and Hinton [31] use the third approach when training deep Boltzmann machines, and we find that it works well for S3C and the PD-DBM as well.
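The following NumPy transcription of Algorithm 1 is a minimal sketch, assuming diagonal $\alpha$ and $\beta$ stored as vectors and using the parameter-change convergence test mentioned above. It is written for a single example and is not the paper's batched GPU implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def s3c_fixed_point_inference(v, W, b, mu, alpha, beta,
                              eta_s=0.5, eta_h=0.5, rho=0.5,
                              max_iters=100, tol=1e-4):
    """Damped parallel fixed-point inference (Algorithm 1) for one example."""
    h = sigmoid(b)                              # h-hat(0) = sigmoid(b)
    s = mu.copy()                               # s-hat(0) = mu
    Wtb = W * beta[:, None]                     # beta-weighted columns, (D, N)
    wtbw = np.sum(W * Wtb, axis=0)              # W_i^T beta W_i for each i, (N,)
    for _ in range(max_iters):
        # Individually optimal slab values, computed for all i in parallel.
        recon = W @ (h * s)
        other = v[:, None] - (recon[:, None] - W * (h * s))   # v - sum_{j!=i} W_j h_j s_j
        s_star = (mu * alpha + np.sum(Wtb * other, axis=0)) / (alpha + wtbw)
        # Clip reflections, then apply the damped update.
        reflect = (np.sign(s_star) != np.sign(s)) & (np.abs(s_star) > rho * np.abs(s))
        s_star = np.where(reflect, rho * np.sign(s_star) * np.abs(s), s_star)
        s_new = eta_s * s_star + (1.0 - eta_s) * s
        # Individually optimal spike values, then the damped update.
        recon = W @ (h * s_new)
        other = v[:, None] - (recon[:, None] - W * (h * s_new)) - 0.5 * W * s_new
        z = (np.sum(Wtb * other, axis=0) * s_new + b
             - 0.5 * alpha * (s_new - mu) ** 2
             - 0.5 * np.log(alpha + wtbw) + 0.5 * np.log(alpha))
        h_new = eta_h * sigmoid(z) + (1.0 - eta_h) * h
        # Stop when the variational parameters stop changing.
        if max(np.max(np.abs(h_new - h)), np.max(np.abs(s_new - s))) < tol:
            h, s = h_new, s_new
            break
        h, s = h_new, s_new
    return h, s
```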

We include some visualizations that demonstrate the effect of our inference procedure. Fig. 2 shows that it produces a sparse representation. Fig. 3 shows that the explaining-away effect incrementally makes the representation more sparse. Fig. 4 shows that the inference procedure increases the negative variational free energy.

4.2 Variational Inference for the PD-DBM

Inference in the PD-DBM is very similar to inference in S3C. We use the variational family

$$Q(s, \boldsymbol{h}) = \prod_{i=1}^{N_0} Q\!\left(s_i, h^{(0)}_i\right) \prod_{l=1}^{L} \prod_{i=1}^{N_l} Q\!\left(h^{(l)}_i\right),$$

whose solutions take the form

$$Q\!\left(h^{(l)}_i = 1\right) = \hat{h}^{(l)}_i, \qquad Q\!\left(s_i \mid h^{(0)}_i\right) = \mathcal{N}\!\left(s_i \,\middle|\, h^{(0)}_i \hat{s}_i,\; \left(\alpha_{ii} + h^{(0)}_i W_i^T \beta W_i\right)^{-1}\right).$$

We apply more or less the same inference procedure as in S3C. On each update step, we update either $s$ or $h^{(l)}$ for some value of $l$. The update to $s$ is exactly the same as in S3C. The update to $h^{(0)}$ changes slightly to incorporate top-down influence from $h^{(1)}$. When computing the individually optimal values of the elements of $h^{(0)}$, we use the following fixed-point formula:

$$h^{(0)*} = \sigma\!\left(z + W^{(1)} h^{(1)}\right).$$

The update to $h^{(l)}$ for $l > 0$ is simple; it is the same as the mean-field update in the DBM. No damping is necessary for this update. The conditional independence properties of the DBM guarantee that the optimal values of the elements of $h^{(l)}$ do not depend on each other, so the individually optimal values are globally optimal (for a given $h^{(l-1)}$ and $h^{(l+1)}$). The update is given by

$$h^{(l)*} = \sigma\!\left(b^{(l)} + h^{(l-1)T} W^{(l)} + W^{(l+1)} h^{(l+1)}\right),$$

where the term for layer $l + 1$ is dropped if $l + 1 > L$.


Fig. 2. This example histogram of $\mathbb{E}_Q[h_i s_i]$ shows that $Q$ is a sparse distribution. For this 6,000 hidden unit S3C model trained on $6 \times 6$ image patches, $Q(h_i = 1) < 0.01$ 99.7 percent of the time.

Fig. 3. The explaining-away effect makes the S3C representation become more sparse with each damped iteration of the variational inference fixed point equations.

Fig. 4. The negative variational free energy of a batch of 5,000 image patches increases during the course of variational inference.


5 COMPARISON TO OTHER FEATURE ENCODING METHODS

Here, we compare S3C as a feature discovery algorithm to other popular approaches. We describe how S3C occupies a middle ground between two of these methods, SC and the ssRBM, while avoiding many of their respective disadvantages when applied as feature discovery algorithms.

5.1 Comparison to SC

SC (Olshausen and Field [28]) has been widely used to discover features for classification (Raina et al. [30]). Recently, Coates and Ng [4] showed that this approach achieves excellent performance on the CIFAR-10 object recognition dataset. SC refers to a class of generative models where the observed data $v$ are normally distributed given a set of continuous latent variables $s$ and a dictionary matrix $W$: $v \sim \mathcal{N}(Ws, \sigma^2 I)$. SC places a factorial and heavy-tailed prior distribution over $s$ (e.g., a Cauchy or Laplace distribution) chosen to encourage the mode of the posterior $p(s \mid v)$ to be sparse. One can derive the S3C model from SC by replacing the factorial Cauchy or Laplace prior with a spike-and-slab prior.

One drawback of SC is that the latent variables are not merely encouraged to be sparse; they are encouraged to remain close to 0, even when they are active. This kind of regularization is not necessarily undesirable, but in the case of simple but popular priors such as the Laplace prior (corresponding to an $L^1$ penalty on the latent variables $s$), the degree of regularization on active units is confounded with the degree of sparsity. There is little reason to believe that in realistic settings, these two types of complexity control should be so tightly bound together. The S3C model avoids this issue by controlling the sparsity of units via the $b$ parameter, which determines how likely each spike unit is to be active, while separately controlling the magnitude of active units via the $\mu$ and $\alpha$ parameters that govern the distribution over $s$. SC has no parameter analogous to $\mu$ and cannot control these aspects of the posterior independently.
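To make the contrast explicit, the two priors can be written side by side (the Laplace scale $\lambda$ below is the standard parameterization corresponding to the $L^1$ penalty weight; it is introduced here only for illustration):

$$p_{\text{SC}}(s_i) = \frac{\lambda}{2} e^{-\lambda |s_i|}, \qquad p_{\text{S3C}}(h_i, s_i) = \sigma(b_i)^{h_i} \bigl(1 - \sigma(b_i)\bigr)^{1 - h_i}\, \mathcal{N}\!\left(s_i \mid h_i \mu_i,\, \alpha_{ii}^{-1}\right).$$

Increasing $\lambda$ simultaneously makes the code sparser and shrinks active values toward zero, whereas in S3C $b_i$ controls sparsity while $\mu_i$ and $\alpha_{ii}$ control the magnitude of active units.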

Another drawback of SC is that the factors are not actually sparse in the generative distribution. Indeed, each factor is zero with probability zero. The features extracted by SC are only sparse because they are obtained via MAP inference. In the S3C model, the spike variables ensure that each factor is zero with nonzero probability in the generative distribution. Since this places a greater restriction on the code variables, we hypothesize that S3C features provide more of a regularizing effect when solving classification problems.

SC is also difficult to integrate into a deep generative model of data such as natural images. While Yu et al. [41] and Zeiler et al. [42] have recently shown some success at learning hierarchical SC, our goal is to integrate the feature extraction scheme into a proven generative model framework such as the deep Boltzmann machine (Salakhutdinov and Hinton [31]). Existing inference schemes known to work well in the DBM setting are all either sample based or based on variational approximations to the model posteriors, while SC schemes typically employ MAP inference. Our use of variational inference makes the S3C framework well suited to integration into the known successful strategies for learning and inference in DBM models. In fact, the compatibility of the S3C and DBM inference procedures is confirmed by the success of the PD-DBM inference procedure. It is not obvious how one can employ a variational inference strategy in standard SC with the goal of achieving sparse feature encoding.

SC models can be learned efficiently by alternately running MAP inference for several examples and then making large, closed-form updates to the parameters. The same approach is also possible with S3C, and is in fact more principled because it is based on maximizing a variational lower bound rather than the MAP approximation. We do not explore this learning method for S3C in this paper.

5.2 Comparison to RBMs

The S3C model also resembles another class of models commonly used for feature discovery: the RBM. An RBM (Smolensky [34]) is a model defined through an energy function that describes the interactions between the observed data variables and a set of latent variables. It is possible to interpret S3C as an energy-based model by rearranging $p(v, s, h)$ to take the form $\exp\{-E(v, s, h)\}/Z$, with the following energy function:

$$E(v, s, h) = \frac{1}{2} \left( v - \sum_i W_i s_i h_i \right)^{\!T} \beta \left( v - \sum_i W_i s_i h_i \right) + \frac{1}{2} \sum_{i=1}^{N} \alpha_i (s_i - \mu_i h_i)^2 - \sum_{i=1}^{N} b_i h_i. \tag{4}$$

The ssRBM model family is a good starting point for S3C because it has demonstrated both reasonable performance as a feature discovery scheme and remarkable performance as a generative model (Courville et al. [7]). Within the ssRBM family, S3C's closest relative is a variant of the $\mu$-ssRBM, defined by the following energy function:

$$E(v, s, h) = -\sum_{i=1}^{N} v^T \beta W_i s_i h_i + \frac{1}{2} v^T \beta v + \frac{1}{2} \sum_{i=1}^{N} \alpha_i (s_i - \mu_i h_i)^2 - \sum_{i=1}^{N} b_i h_i, \tag{5}$$

where the variables and parameters are defined identically to those in S3C. Comparison of (4) and (5) reveals that the simple addition of a latent factor interaction term $\frac{1}{2} (h \circ s)^T W^T \beta W (h \circ s)$ to the ssRBM energy function turns the ssRBM into the S3C model. With the inclusion of this term, S3C moves from an undirected ssRBM model to the directed graphical model described in (1). This change from undirected modeling to directed modeling has three important effects, which we describe in the following paragraphs.

The effect on the partition function. The most immediate consequence of the transition to directed modeling is that the partition function becomes tractable. Because the RBM partition function is intractable, most training algorithms for the RBM require making stochastic approximations to the partition function, the same as our learning procedure for the PD-DBM does. Since the S3C partition function is tractable, we can follow its true gradient, which provides one advantage over the RBM. The partition function of S3C is also guaranteed to exist for all possible settings of the model parameters, which is not true of the ssRBM. In the ssRBM, for some parameter values it is possible for $p(s, v \mid h)$ to take the form of a normal distribution whose covariance matrix is not positive definite. Courville et al. [7] have explored resolving this issue by constraining the parameters, but this was found to hurt classification performance.

The effect on the posterior. RBMs have a factorial posterior, but S3C and SC have a complicated posterior due to the “explaining away” effect. For this reason, RBMs can use exact inference and maximum-likelihood estimation. Models with an intractable posterior, such as S3C and DBMs, must use approximate inference and are often trained with a variational lower bound on the likelihood.

The RBM's factorial posterior means that features defined by similar basis functions will have similar activations, while in directed models, similar features will compete so that only the most relevant features remain significantly active. As shown by Coates and Ng [4], the sparse Gaussian RBM is not a very good feature extractor: the set of basis functions $W$ learned by the RBM actually works better for supervised learning when these parameters are plugged into an SC model than when the RBM itself is used for feature extraction. We think this is due to the factorial posterior. In the vastly overcomplete setting, being able to selectively activate a small set of features that cooperate to explain the input likely provides S3C with a major advantage in discriminative capability.

Considerations of biological plausibility also motivate the use of a model with a complicated posterior. As described in Hyvarinen et al. [15], a phenomenon called “end stopping,” similar to explaining away, has been observed in V1 simple cells. End stopping occurs when an edge detector is inhibited by stimulation of retinal cells near the ends of the edge it detects. The inhibition occurs due to lateral interactions with other simple cells, and is a major motivation for the lateral interactions present in the SC posterior.

The effect on the prior. The addition of the interaction term causes S3C to have a factorial prior. This probably makes it a poor generative model, but this is not a problem for the purpose of feature discovery. Moreover, the quality of the generative model can be improved by incorporating S3C into a deeper architecture, as we will show.

RBMs were designed with a nonfactorial prior because factor models with factorial priors are generally known to result in poor generative models. However, in the case of real-valued data, typical RBM priors are not especially useful. For example, the ssRBM variant described in (5) has the following prior:

$$p(s, h) \propto \exp\!\left\{ \frac{1}{2} \left( \sum_{i=1}^{N} W_i s_i h_i \right)^{\!T} \beta \left( \sum_{i=1}^{N} W_i s_i h_i \right) - \frac{1}{2} \sum_{i=1}^{N} \alpha_i (s_i - \mu_i h_i)^2 + \sum_{i=1}^{N} b_i h_i \right\}.$$

It is readily apparent from the first term (all other terms factorize across hidden units) that this prior acts to correlate units that have similar basis vectors, which is almost certainly not a desirable property for feature extraction tasks. Indeed, it is this nature of the RBM prior that causes both the desirable (easy computation) and undesirable (no explaining away) properties of the posterior.

5.3 Other Related Work

The notion of a spike-and-slab prior was established in statistics by Mitchell and Beauchamp [25]. Outside the context of unsupervised feature discovery for supervised learning, the basic form of the S3C model (i.e., a spike-and-slab latent factor model) has appeared a number of times in different domains (Lucke and Sheikh [24]; Garrigues and Olshausen [11]; Mohamed et al. [26]; Zhou et al. [43]; Titsias and Lazaro-Gredilla [36]). To this literature, we contribute an inference scheme that scales to the kinds of object classification tasks that we consider. We outline this inference scheme next.

6 RUNTIME RESULTS

Our inference scheme achieves very good computational performance, both in terms of memory consumption and in terms of runtime. The computational bottleneck in our classification pipeline is SVM training, not feature learning or feature extraction.

Comparing the computational cost of our inference scheme to others is a difficult task because it could be confounded by differences in implementation and because it is not clear exactly what SC problem is equivalent to a given S3C problem. However, we observed informally during our supervised learning experiments that feature extraction using S3C took roughly the same amount of time as feature extraction using SC.

In Fig. 5, we show that our improvements to spike-and-slab inference performance allow us to scale spike-and-slab modeling to the problem sizes needed for object recognition tasks. Previous work on spike-and-slab modeling was not able to use comparable numbers of hidden units or training examples.

Fig. 5. Our inference scheme enables us to extend spike-and-slab modeling from small problems to the scale needed for object recognition. Previous object recognition work is from Coates and Ng [4] and Courville et al. [7]. Previous spike-and-slab work is from Mohamed et al. [26], Zhou et al. [43], Garrigues and Olshausen [11], Lucke and Sheikh [24], and Titsias and Lazaro-Gredilla [36].

As a large-scale test of our inference scheme's ability, we trained over 8,000 densely connected filters on full $32 \times 32$ color images. A visualization of the learned filters is shown in Fig. 6. This test demonstrated that our approach scales well to large (over 3,000-dimensional) inputs, though it is not yet known how to use such features for classification as effectively as patch-based features that can be incorporated into a convolutional architecture with pooling. For comparison, to our knowledge the largest image patches used in previous spike-and-slab models with lateral interactions were $16 \times 16$ (Garrigues and Olshausen [11]).

Finally, we performed a series of experiments to compare our heuristic method of updating $s$ with the conjugate gradient method of updating $s$. The conjugate gradient method is guaranteed to reduce the KL divergence on each update to $s$. The heuristic method has no such guarantee. These experiments provide an empirical justification for the use of the heuristic method.

We considered three different models, each on a different dataset. We used MNIST (LeCun et al. [23]), CIFAR-100 (Krizhevsky and Hinton [19]), and whitened $6 \times 6$ patches drawn from CIFAR-100 as the three datasets.

Because we wish to compare different inference algorithms and inference affects learning, we did not want to compare the algorithms on models whose parameters were the result of learning. Instead, we obtained the value of $W$ by drawing randomly selected patches ranging in size from $6 \times 6$ to the full image size for each dataset. This provides a data-driven version of $W$ with some of the same properties, like local support, that learned filters tend to have. None of the examples used to initialize $W$ were used in the later timing experiments. We initialized $b$, $\mu$, $\alpha$, and $\beta$ randomly. We used 400 hidden units for some experiments and 1,600 units for others to investigate the effect of overcompleteness on runtime.

For each inference scheme considered, we found the fastest possible variant obtainable via a two-dimensional grid search over $\eta_h$ and either $\eta_s$, in the case of the heuristic method, or the number of conjugate gradient steps to apply per $s$ update, in the case of the conjugate gradient method. We used the same value of these parameters on every pair of update steps. It may be possible to obtain faster results by varying the parameters throughout the course of inference.

For these timing experiments, it is necessary to make sure that each algorithm is not able to appear faster by converging early to an incorrect solution. We thus replace the standard convergence criterion based on the size of the change in the variational parameters with a requirement that the KL divergence come within 0.05 on average of our best estimate of the true minimum value of the KL divergence found by batch gradient descent.

All experiments were performed on an NVIDIA GeForce GTX 580.

The results are summarized in Fig. 7.

7 CLASSIFICATION RESULTS

Because S3C forms the basis of all further model development in this line of research, we concentrate on validating its value as a feature discovery algorithm. We conducted experiments to evaluate the usefulness of S3C features for supervised learning on the CIFAR-10 and CIFAR-100 (Krizhevsky and Hinton [19]) datasets. Both datasets consist of color images of objects such as animals and vehicles. Each contains 50,000 training and 10,000 test examples. CIFAR-10 contains 10 classes, while CIFAR-100 contains 100 classes, so there are fewer labeled examples per class in the case of CIFAR-100.

For all experiments, we used the same overall procedure as Coates and Ng [4] except for feature learning. CIFAR-10 consists of $32 \times 32$ images. We train our feature extractor on $6 \times 6$ contrast-normalized and ZCA-whitened patches from the training set (this preprocessing step is not necessary to obtain good performance with S3C; we included it primarily to facilitate comparison with other work). At test time, we extract features from all $6 \times 6$ patches on an image, then average-pool them. The average-pooling regions are arranged on a nonoverlapping grid. Finally, we train an L2-SVM with a linear kernel on the pooled features.
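The following sketch summarizes this extraction-and-pooling pipeline; `encode` stands in for the S3C feature map $\mathbb{E}_Q[h]$ (or any other patch encoder), and the helper name and looping structure are illustrative rather than the actual implementation.

```python
import numpy as np

def pooled_features(image, encode, patch_size=6, pool_grid=3):
    """Extract features for every patch of an image and average-pool on a grid.

    image:  (H, W, C) preprocessed (contrast-normalized, ZCA-whitened) image
    encode: function mapping a flattened patch to a feature vector, e.g.,
            E_Q[h] from S3C variational inference (assumed to be given).
    """
    H, W, _ = image.shape
    rows = H - patch_size + 1                   # e.g., 32 - 6 + 1 = 27
    cols = W - patch_size + 1
    feats = np.stack([
        np.stack([encode(image[r:r + patch_size, c:c + patch_size].ravel())
                  for c in range(cols)])
        for r in range(rows)])                  # (rows, cols, N)
    pooled = []
    r_edges = np.linspace(0, rows, pool_grid + 1).astype(int)
    c_edges = np.linspace(0, cols, pool_grid + 1).astype(int)
    for i in range(pool_grid):
        for j in range(pool_grid):
            block = feats[r_edges[i]:r_edges[i + 1], c_edges[j]:c_edges[j + 1]]
            pooled.append(block.mean(axis=(0, 1)))
    return np.concatenate(pooled)               # pool_grid**2 * N features
```

With `patch_size=6` and `pool_grid=3` on a $32 \times 32$ image this yields nine pooled regions of $N$ features each; the resulting vector is what the linear L2-SVM is trained on.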

7.1 CIFAR-10

We use CIFAR-10 to evaluate our hypothesis that S3C is similar to a more regularized version of SC.

Fig. 6. Example filters from a dictionary of over 8,000 learned on full $32 \times 32$ images.

Fig. 7. The inference speed for each method was computed based on the inference time for the same set of 100 examples from each dataset. The heuristic method is consistently faster than the conjugate gradient method. The conjugate gradient method is slowed more by problem size than the heuristic method is, as shown by the conjugate gradient method's low speed on the CIFAR-100 full image task. The heuristic method has a very low cost per iteration but is strongly affected by the strength of explaining-away interactions: moving from CIFAR-100 full images to CIFAR-100 patches actually slows it down because the degree of overcompleteness increases.

Coates and Ng [4] used 1,600 basis vectors in all of their SC experiments. They postprocessed the SC feature vectors by splitting them into their positive and negative parts for a total of 3,200 features per average-pooling region. They average-pool on a $2 \times 2$ grid for a total of 12,800 features per image (i.e., each element of the $2 \times 2$ grid averages over a block with sides $\lceil (32 - 6 + 1)/2 \rceil$ or $\lfloor (32 - 6 + 1)/2 \rfloor$). We used $\mathbb{E}_Q[h]$ as our feature vector. Unlike the output of SC, this does not have a negative part, so with a $2 \times 2$ grid we would have only 6,400 features. To compare with similar sizes of feature vectors, we used a $3 \times 3$ pooling grid for a total of 14,400 features (i.e., each element of the $3 \times 3$ grid averages over $9 \times 9$ locations) when evaluating S3C. To ensure this is a fair means of comparison, we confirmed that running SC with a $3 \times 3$ grid and absolute value rectification performs worse than SC with a $2 \times 2$ grid and sign splitting (76.8 versus 77.9 percent on the validation set).
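For reference, these feature counts follow directly from the patch and pooling geometry:

$$32 - 6 + 1 = 27 \text{ locations per side}, \qquad \underbrace{(1600 \times 2)}_{\text{sign-split SC}} \times (2 \times 2) = 12{,}800, \qquad 1600 \times (3 \times 3) = 14{,}400,$$

with each cell of the $3 \times 3$ grid covering $27/3 = 9$ locations per side, and each cell of the $2 \times 2$ grid covering $\lceil 27/2 \rceil = 14$ or $\lfloor 27/2 \rfloor = 13$.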

We tested the regularizing effect of S3C by training the SVM on small subsets of the CIFAR-10 training set, but using features that were learned on patches drawn from the entire CIFAR-10 training set. The results, summarized in Fig. 8, show that S3C has an advantage over both thresholding and SC for a wide range of amounts of labeled data. (In the extreme low-data limit, the confidence interval becomes too large to distinguish SC from S3C.)

On the full dataset, S3C achieves a test set accuracy of $78.3 \pm 0.9\%$ with 95 percent confidence. Coates and Ng [4] do not report test set accuracy for SC with “natural encoding” (i.e., extracting features in a model whose parameters are all the same as in the model used for training), but SC with different parameters for feature extraction than training achieves an accuracy of $78.8 \pm 0.9\%$ (Coates and Ng [4]). Since we have not enhanced our performance by modifying parameters at feature extraction time, these results seem to indicate that S3C is roughly equivalent to SC for this classification task. S3C also outperforms ssRBMs, which require 4,096 basis vectors per patch and a $3 \times 3$ pooling grid to achieve $76.7 \pm 0.9\%$ accuracy. All of these approaches are close to the best result, using the pipeline from Coates and Ng [4], of 81.5 percent, achieved using thresholding of linear features learned with OMP-1. These results show that S3C is a useful feature extractor that performs comparably to the best approaches when large amounts of labeled data are available.

7.2 CIFAR-100

Having verified that S3C features help to regularize a classifier, we proceed to use them to improve performance on the CIFAR-100 dataset, which has 10 times as many classes and 10 times fewer labeled examples per class. We compare S3C to two other feature extraction methods: OMP-1 with thresholding, which Coates and Ng [4] found to be the best feature extractor on CIFAR-10, and SC, which is known to perform well when less labeled data are available. We evaluated only a single set of hyperparameters for S3C. For SC and OMP-1, we searched over the same set of hyperparameters as Coates and Ng [4] did: $\{0.5, 0.75, 1.0, 1.25, 1.5\}$ for the SC penalty and $\{0.1, 0.25, 0.5, 1.0\}$ for the thresholding value. To use a comparable amount of computational resources in all cases, we used at most 1,600 hidden units and a $3 \times 3$ pooling grid for all three methods. For S3C, this was the only feature encoding we evaluated. For SC and OMP-1, which double their number of features via sign splitting, we also evaluated $2 \times 2$ pooling with 1,600 latent variables and $3 \times 3$ pooling with 800 latent variables to be sure the models do not suffer from overfitting caused by the larger feature set. These results are summarized in Fig. 9.

The best result to our knowledge on CIFAR-100 is $54.8 \pm 1\%$ (Jia and Huang [16]), achieved using a learned pooling structure on top of “triangle code” features from a dictionary learned using k-means. This feature extractor is very similar to thresholded OMP-1 features and is known to perform slightly worse on CIFAR-10. The validation set results in Fig. 9, which all use the same control pooling layer, show that S3C is the best known detector layer on CIFAR-100. Using a pooling strategy of concatenating $1 \times 1$, $2 \times 2$, and $3 \times 3$ pooled features, we achieve a test set accuracy of $53.7 \pm 1\%$.

7.3 Transfer Learning Challenge

For the NIPS 2011 Workshop on Challenges in Learning Hierarchical Models (Le et al. [21]), the organizers proposed a transfer learning competition. This competition used a dataset consisting of $32 \times 32$ color images, including 100,000 unlabeled examples, 50,000 labeled examples of 100 object classes not present in the test set, and 120 labeled examples of 10 object classes present in the test set. The test set was not made public until after the competition. We recognized this contest as a chance to demonstrate S3C's ability to perform well with extremely small amounts of labeled data. We chose to disregard the 50,000 labels and treat this as a semi-supervised learning task.

Fig. 8. Semi-supervised classification accuracy on subsets of CIFAR-10. Thresholding, the best feature extractor on the full dataset, performs worse than SC when few labels are available. S3C improves upon SC's advantage.

Fig. 9. CIFAR-100 classification accuracy for various models. As expected, S3C outperforms SC and OMP-1. S3C with spatial pyramid pooling is near the state-of-the-art method, which uses a learned pooling structure.

We applied the same approach as on the CIFAR datasets, albeit with a small modification to the SVM training procedure. Due to the small labeled dataset size, we used leave-one-out cross validation rather than fivefold cross validation.

We won the competition, with a test set accuracy of 48.6 percent. We do not have any information about the competing entries, other than that we outperformed them. Our test set accuracy was tied with a method run by the contest organizers based on a combination of methods (Coates et al. [5], Le et al. [20]). Since these methods do not use transfer learning either, this suggests that the contest primarily provides evidence that S3C is a powerful semi-supervised learning tool.

7.4 Ablative Analysis

To better understand which aspects of our S3C object classification method are most important to obtaining good performance, we conducted a series of ablative analysis experiments. For these experiments, we trained on 5,000 labels of the STL-10 dataset (Coates et al. [5]). Previous work on the STL-10 dataset is based on training on 1,000 label subsets of the training set, so the performance numbers in this section should only be compared to each other, not to previous work. The results are presented in Fig. 10.

Our best-performing method uses $\mathbb{E}_Q[h]$ as features. This allows us to abstract out the $s$ variables so that they achieve a form of per-component brightness invariance. Our experiments show that including the $s$ variables or using MAP inference in $Q$ rather than an expectation hurts classification performance. We experimented with fixing $\mu$ to 0 so that $s$ is regularized to be small as well as sparse, as in SC. We found that this hurts performance even more. Last, we experimented with replacing S3C learning by simply assigning $W$ to be a set of randomly selected patches from the training set. We call this approach S3C-RP. We found that this does not impair performance much, so learning is not very important compared to our inference algorithm. This is consistent with Coates and Ng's [4] observation that the feature extractor matters more than the learning algorithm and that learning matters less for large numbers of hidden units.

8 SAMPLING RESULTS

To demonstrate the improvements in generative modeling capability conferred by adding a DBM prior on $h$, we trained an S3C model and a PD-DBM model on the MNIST dataset. We chose to use MNIST for this portion of the experiments because it is easy for a human observer to qualitatively judge whether samples come from the same distribution as this dataset.

For the PD-DBM, we used $L = 1$ for a total of two hidden layers. We did not use greedy, layerwise pretraining; the entire model was learned jointly. Such joint learning without greedy pretraining has never been accomplished with similar deep models such as DBMs or DBNs.

The S3C samples and basis vectors are shown in Fig. 11. The samples do not resemble digits, suggesting that S3C has failed to model the data. However, inspection of the S3C filters shows that S3C has learned a good basis set for representing MNIST digits using digit templates, pen strokes, and so on. It simply does not have the correct prior on these bases and as a result activates subsets of them that do not correspond to MNIST digits. The PD-DBM samples clearly resemble digits, as shown in Fig. 12. For comparison, Fig. 12 also shows samples from two DBMs.
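The failure mode is easy to see from the generative process itself. The sketch below draws an ancestral sample under the factorial spike-and-slab prior (a simplified restatement of the model, with `W`, `b`, `mu`, `alpha`, and `beta` denoting the learned parameters); because each spike is switched on independently, the active filters need not combine into a coherent digit:

```python
# Minimal sketch of ancestral sampling from S3C: independent Bernoulli
# spikes h, Gaussian slabs s, and Gaussian visible units centered on
# W (h * s). The factorial prior on h is what allows incoherent subsets
# of templates to be active at once.
import numpy as np

def sample_s3c(W, b, mu, alpha, beta, rng=np.random):
    """W: (num_visible, num_hidden); b, mu, alpha: per-hidden-unit
    parameters; beta: visible precisions."""
    p_active = 1.0 / (1.0 + np.exp(-b))           # spike probabilities
    h = rng.uniform(size=b.shape) < p_active       # independent spikes
    s = rng.normal(mu * h, 1.0 / np.sqrt(alpha))   # slab values
    mean_v = W @ (h * s)                           # mix the active filters
    return rng.normal(mean_v, 1.0 / np.sqrt(beta))
```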


Fig. 10. Performance of several limited variants of S3C.

Fig. 11. Left: Samples drawn from an S3C model trained on MNIST. Right: The filters used by this S3C model.


In all cases, we display the expected value of the visible units given the hidden units. The first DBM was trained by running the demo code that accompanies Salakhutdinov and Hinton [31]. We used the same number of units in each layer to make these models comparable (500 in the first layer and 1,000 in the second). This means that the PD-DBM has a slightly greater number of parameters than the DBM because the first layer units of the PD-DBM have both mean and precision parameters, while the first layer units of the DBM have only a bias parameter. Note that the DBM operates on a binarized version of MNIST, while S3C and the PD-DBM regard MNIST as real valued. Additionally, the DBM demo code uses the MNIST labels during generative training, while the PD-DBM and S3C were not trained with the benefit of the labels. The DBM demo code is hardcoded to pretrain the first layer for 100 epochs, the second layer for 200 epochs, and then jointly train the DBM for 300 epochs. We trained the PD-DBM starting from a random initialization for 350 epochs.

The second DBM was trained using two modifications from the demo code to train it in as similar a fashion to our PD-DBM model as possible: First, it was trained without access to labels, and second, it did not receive any pretraining. This model was trained for only 230 epochs because it had already converged to a bad local optimum by this time. This DBM is included to provide an example of how DBM training fails when greedy layerwise pretraining is not used. DBM training can fail in a variety of ways and no example should be considered representative of all of them.
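For quick reference, the three training regimes compared above can be restated as a small configuration summary; the values below simply repeat the settings given in the text:

```python
# Summary of the three training setups compared in this section.
training_setups = {
    "PD-DBM (joint training only)": {
        "layer_sizes": (500, 1000),
        "data": "real-valued MNIST",
        "uses_labels": False,
        "pretrain_epochs": (0, 0),
        "joint_epochs": 350,
    },
    "DBM (demo code: greedy + joint)": {
        "layer_sizes": (500, 1000),
        "data": "binarized MNIST",
        "uses_labels": True,           # demo code trains generatively with labels
        "pretrain_epochs": (100, 200),
        "joint_epochs": 300,
    },
    "DBM (joint training only)": {
        "layer_sizes": (500, 1000),
        "data": "binarized MNIST",
        "uses_labels": False,
        "pretrain_epochs": (0, 0),
        "joint_epochs": 230,           # stopped after converging to a poor optimum
    },
}
```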


Fig. 12. Left: Samples drawn from a PD-DBM model trained on MNIST using joint training only. Center: Samples drawn from a DBM model of the same size, trained using greedy layerwise pretraining followed by joint training. Right: Samples drawn from a DBM trained using joint training only.

Fig. 13. Each panel shows a visualization of the weights for a different model. Each row represents a different second layer hidden unit. We show 10 units for each model, corresponding to those with the largest weight vector norm. Within each row, we plot the weight vectors for the 10 most strongly connected first layer units. Black corresponds to inhibition, white to excitation, and gray to zero weight. This figure is best viewed in color: units plotted with a yellow border have excitatory second layer weights, while units plotted with a magenta border have inhibitory second layer weights. Left: PD-DBM model trained jointly. Note that each row contains many similar filters. This is how the second layer weights achieve invariance to some transformations such as image translation. This is one way that deep architectures are able to disentangle factors of variation. One can also see how the second layer helps implement the correct prior for the generative task. For example, the unit plotted in the first row excites filters used to draw 7s and inhibits filters used to draw 1s. Also, observe that the first layer filters are much more localized and contain fewer templates than those in Fig. 11 (right). This suggests that joint training has a significant effect on the quality of the first layer weights; greedy pretraining would have attempted to solve the generative task with more templates due to S3C's independent prior. Center: DBM model with greedy pretraining followed by joint training. These weights show the same disentangling and invariance properties as those of the PD-DBM. Note that the filters have more black areas. This is because the RBM must use inhibitory weights to limit hidden unit activities, while S3C accomplishes the same purpose via the explaining-away effect. Right: DBM with joint training only. Note that many of the second layer weight vectors are duplicates of each other. This is because the second layer has a pathological tendency to focus on modeling a handful of first-layer units that learn interesting responses earliest in learning.
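The selection rule in this caption translates directly into code. The sketch below is a hypothetical reimplementation of the plotting procedure (not the original script), assuming `W1` stores the first-layer filters as columns and `W2` the weights between the first and second hidden layers:

```python
# Hypothetical sketch of the Fig. 13 visualization: rows are the second
# layer units with the largest weight-vector norms; each row shows the
# filters of its most strongly connected first layer units, with the
# border color indicating the sign of the connection.
import numpy as np
import matplotlib.pyplot as plt

def plot_layer_interaction(W1, W2, patch_shape, n_rows=10, n_cols=10):
    """W1: (num_pixels, num_h1) first-layer filters as columns.
    W2: (num_h1, num_h2) second-layer weights."""
    top_units = np.argsort(-np.linalg.norm(W2, axis=0))[:n_rows]
    fig, axes = plt.subplots(n_rows, n_cols, figsize=(n_cols, n_rows))
    for r, j in enumerate(top_units):
        strongest = np.argsort(-np.abs(W2[:, j]))[:n_cols]
        for c, i in enumerate(strongest):
            ax = axes[r, c]
            ax.imshow(W1[:, i].reshape(patch_shape), cmap="gray")
            ax.set_xticks([]); ax.set_yticks([])
            # Yellow border for excitatory, magenta for inhibitory weights.
            color = "yellow" if W2[i, j] > 0 else "magenta"
            for spine in ax.spines.values():
                spine.set_edgecolor(color)
                spine.set_linewidth(2)
    return fig
```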


To analyze the differences between these models, we display a visualization of the weights of the models that shows how the layers interact in Fig. 13.

9 CONCLUSION

We have motivated the use of the S3C model for unsupervised feature discovery. We have described a variational approximation scheme that makes it feasible to perform learning and inference in large-scale S3C and PD-DBM models. We have demonstrated that S3C is an effective feature discovery algorithm for both supervised and semi-supervised learning with small amounts of labeled data. This work addresses two scaling problems: the computational problem of scaling spike-and-slab SC to the problem sizes used in object recognition, and the problem of scaling object recognition techniques to work with more classes. We demonstrate that this work can be extended to a deep architecture using a similar inference procedure, and show that the deeper architecture is better able to model the input distribution. Remarkably, this deep architecture does not require greedy training, unlike its DBM predecessor.

ACKNOWLEDGMENTS

This work was supported by the US Defense Advanced Research Projects Agency (DARPA) and NSERC. The authors would like to thank Pascal Vincent for helpful discussions. The computation done for this work was conducted in part on computers of RESMIQ, Clumeq, and SharcNet. The authors would like to thank the developers of Theano (Bergstra et al. [3]) and pylearn2 (Warde-Farley et al. [39]).

REFERENCES

[1] Y. Bengio, "Learning Deep Architectures for AI," Foundations and Trends in Machine Learning, vol. 2, no. 1, pp. 1-127, 2009.

[2] Y. Bengio, P. Lamblin, D. Popovici, and H. Larochelle, "Greedy Layer-Wise Training of Deep Networks," Proc. Advances in Neural Information Processing Systems 19, B. Scholkopf, J. Platt, and T. Hoffman, eds., pp. 153-160, 2007.

[3] J. Bergstra, O. Breuleux, F. Bastien, P. Lamblin, R. Pascanu, G. Desjardins, J. Turian, D. Warde-Farley, and Y. Bengio, "Theano: A CPU and GPU Math Expression Compiler," Proc. Python for Scientific Computing Conf., 2010.

[4] A. Coates and A.Y. Ng, "The Importance of Encoding versus Training with Sparse Coding and Vector Quantization," Proc. Int'l Conf. Machine Learning, 2011.

[5] A. Coates, H. Lee, and A.Y. Ng, "An Analysis of Single-Layer Networks in Unsupervised Feature Learning," Proc. 13th Int'l Conf. Artificial Intelligence and Statistics, 2011.

[6] A. Courville, J. Bergstra, and Y. Bengio, "A Spike and Slab Restricted Boltzmann Machine," Proc. 13th Int'l Conf. Artificial Intelligence and Statistics, 2011.

[7] A. Courville, J. Bergstra, and Y. Bengio, "Unsupervised Models of Images by Spike-and-Slab RBMs," Proc. 28th Int'l Conf. Machine Learning, 2011.

[8] L. Deng, M. Seltzer, D. Yu, A. Acero, A. Mohamed, and G. Hinton, "Binary Coding of Speech Spectrograms Using a Deep Auto-Encoder," Proc. Interspeech '10, 2010.

[9] G. Desjardins, A.C. Courville, and Y. Bengio, "On Training Deep Boltzmann Machines," CoRR, abs/1203.4416, 2012.

[10] S. Douglas, S.-I. Amari, and S.-Y. Kung, "On Gradient Adaptation with Unit-Norm Constraints," IEEE Trans. Signal Processing, vol. 48, no. 6, pp. 1843-1847, June 2000.

[11] P. Garrigues and B. Olshausen, "Learning Horizontal Connections in a Sparse Coding Model of Natural Images," Proc. Neural Information Processing Systems, pp. 505-512, 2008.

[12] G.E. Hinton, "Training Products of Experts by Minimizing Contrastive Divergence," Technical Report GCNU TR 2000-004, Gatsby Unit, Univ. College London, 2000.

[13] G.E. Hinton, "A Practical Guide to Training Restricted Boltzmann Machines," Technical Report UTML TR 2010-003, Dept. of Computer Science, Univ. of Toronto, 2010.

[14] G.E. Hinton, S. Osindero, and Y. Teh, "A Fast Learning Algorithm for Deep Belief Nets," Neural Computation, vol. 18, pp. 1527-1554, 2006.

[15] A. Hyvarinen, J. Hurri, and P.O. Hoyer, Natural Image Statistics: A Probabilistic Approach to Early Computational Vision. Springer-Verlag, 2009.

[16] Y. Jia and C. Huang, "Beyond Spatial Pyramids: Receptive Field Learning for Pooled Image Features," Proc. Neural Information Processing Systems Workshop Deep Learning and Unsupervised Feature Learning, 2011.

[17] K. Kavukcuoglu, P. Sermanet, Y.-L. Boureau, K. Gregor, M. Mathieu, and Y. LeCun, "Learning Convolutional Feature Hierarchies for Visual Recognition," Proc. Neural Information Processing Systems, 2010.

[18] D. Koller and N. Friedman, Probabilistic Graphical Models: Principles and Techniques. MIT Press, 2009.

[19] A. Krizhevsky and G. Hinton, "Learning Multiple Layers of Features from Tiny Images," technical report, Univ. of Toronto, 2009.

[20] Q.V. Le, A. Karpenko, J. Ngiam, and A.Y. Ng, "ICA with Reconstruction Cost for Efficient Overcomplete Feature Learning," Proc. Advances in Neural Information Processing Systems 24, J. Shawe-Taylor, R. Zemel, P. Bartlett, F. Pereira, and K. Weinberger, eds., pp. 1017-1025, 2011.

[21] Q.V. Le, M. Ranzato, R. Salakhutdinov, A. Ng, and J. Tenenbaum, Proc. Neural Information Processing Systems Workshop Challenges in Learning Hierarchical Models: Transfer Learning and Optimization, https://sites.google.com/site/nips2011workshop, 2011.

[22] N. Le Roux and Y. Bengio, "Representational Power of Restricted Boltzmann Machines and Deep Belief Networks," Neural Computation, vol. 20, no. 6, pp. 1631-1649, 2008.

[23] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-Based Learning Applied to Document Recognition," Proc. IEEE, vol. 86, no. 11, pp. 2278-2324, Nov. 1998.

[24] J. Lucke and A.-S. Sheikh, "A Closed-Form EM Algorithm for Sparse Coding," arXiv:1105.2493, 2011.

[25] T.J. Mitchell and J.J. Beauchamp, "Bayesian Variable Selection in Linear Regression," J. Am. Statistical Assoc., vol. 83, no. 404, pp. 1023-1032, 1988.

[26] S. Mohamed, K. Heller, and Z. Ghahramani, "Bayesian and L1 Approaches to Sparse Unsupervised Learning," Proc. Int'l Conf. Machine Learning, 2012.

[27] G. Montavon and K.-R. Muller, "Learning Feature Hierarchies with Centered Deep Boltzmann Machines," CoRR, abs/1203.4416, 2012.

[28] B.A. Olshausen and D.J. Field, "Sparse Coding with an Overcomplete Basis Set: A Strategy Employed by V1?" Vision Research, vol. 37, pp. 3311-3325, 1997.

[29] B. Pearlmutter, "Fast Exact Multiplication by the Hessian," Neural Computation, vol. 6, no. 1, pp. 147-160, 1994.

[30] R. Raina, A. Battle, H. Lee, B. Packer, and A.Y. Ng, "Self-Taught Learning: Transfer Learning from Unlabeled Data," Proc. Int'l Conf. Machine Learning, Z. Ghahramani, ed., pp. 759-766, 2007.

[31] R. Salakhutdinov and G. Hinton, "Deep Boltzmann Machines," Proc. Int'l Conf. Artificial Intelligence and Statistics, 2009.

[32] L.K. Saul and M.I. Jordan, "Exploiting Tractable Substructures in Intractable Networks," Proc. Advances in Neural Information Processing Systems, 1996.

[33] N.N. Schraudolph, "Fast Curvature Matrix-Vector Products for Second-Order Gradient Descent," Neural Computation, vol. 14, no. 7, pp. 1723-1738, 2002.

[34] P. Smolensky, "Information Processing in Dynamical Systems: Foundations of Harmony Theory," Parallel Distributed Processing, vol. 1, chapter 6, D.E. Rumelhart and J.L. McClelland, eds., pp. 194-281, MIT Press, 1986.

[35] T. Tieleman, "Training Restricted Boltzmann Machines Using Approximations to the Likelihood Gradient," Proc. 25th Int'l Conf. Machine Learning, W.W. Cohen, A. McCallum, and S.T. Roweis, eds., pp. 1064-1071, 2008.

[36] M.K. Titsias and M. Lazaro-Gredilla, "Spike and Slab Variational Inference for Multi-Task and Multiple Kernel Learning," Proc. Advances in Neural Information Processing Systems, 2011.


[37] D. Titterington, A. Smith, and U. Makov, Statistical Analysis of Finite Mixture Distributions. Wiley, 1985.

[38] P. Vincent, H. Larochelle, Y. Bengio, and P.-A. Manzagol, "Extracting and Composing Robust Features with Denoising Autoencoders," Proc. Int'l Conf. Machine Learning, 2008.

[39] D. Warde-Farley, I. Goodfellow, P. Lamblin, G. Desjardins, F. Bastien, and Y. Bengio, "Pylearn2," http://deeplearning.net/software/pylearn2, 2011.

[40] L. Younes, "On the Convergence of Markovian Stochastic Algorithms with Rapidly Decreasing Ergodicity Rates," Stochastics and Stochastic Models, pp. 177-228, 1998.

[41] K. Yu, Y. Lin, and J. Lafferty, "Learning Image Representations from the Pixel Level via Hierarchical Sparse Coding," Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2011.

[42] M. Zeiler, G. Taylor, and R. Fergus, "Adaptive Deconvolutional Networks for Mid and High Level Feature Learning," Proc. Int'l Conf. Machine Learning, 2011.

[43] M. Zhou, H. Chen, J.W. Paisley, L. Ren, G. Sapiro, and L. Carin, "Non-Parametric Bayesian Dictionary Learning for Sparse Image Representations," Proc. Advances in Neural Information Processing Systems, pp. 2295-2303, 2009.

Ian J. Goodfellow received the BS and MS degrees in computer science from Stanford University in 2009 and is currently working toward the PhD degree at the Université de Montréal. His research interests include machine learning and computer vision, with specific interests in probabilistic modeling and deep learning.

Aaron Courville is an assistant professor in the Department of Computer Science and Operations Research at the University of Montreal. His recent research interests have focused on the development of deep learning models and methods. He is particularly interested in developing probabilistic models and novel inference methods.

Yoshua Bengio is a full professor in the Department of Computer Science and Operations Research and the head of the Machine Learning Laboratory (LISA) at the University of Montreal, a CIFAR Fellow in the Neural Computation and Adaptive Perception program, Canada Research Chair in Statistical Learning Algorithms, and he also holds the NSERC-Ubisoft Industrial Chair. His primary research ambition is to understand principles of learning that yield intelligence.

