
Page 1

CSC2535: 2011, Lecture 5b

Object Recognition and Information Retrieval with Deep Belief Nets

Geoffrey Hinton

Page 2

Training a deep network

• First train a layer of features that receive input directly from the binary pixels.

• Then treat the activations of the trained features as if they were pixels and learn features of features in a second hidden layer.

• It can be proved that each time we add another layer of features we get a better model of the set of training images (if we do it right).
  – The proof uses variational free energy.
  – The proof doesn't apply if the layers get smaller. It also requires the weights in each RBM to be initialized properly and to be trained by properly following the gradient of the likelihood.

Page 3

Modeling real-valued data

• For images of digits it is possible to represent intermediate intensities as if they were probabilities by using "mean-field" logistic units.
  – We can treat intermediate values as the probability that the pixel is inked.

• This will not work for real images.
  – In a real image, the intensity of a pixel is almost always almost exactly the average of the neighboring pixels.
  – Mean-field logistic units cannot represent precise intermediate values.

Page 4

A standard type of real-valued visible unit

• We can model pixels as Gaussian variables. Alternating Gibbs sampling is still easy, though learning needs to be much slower.

$$E(\mathbf{v},\mathbf{h}) \;=\; \sum_{i \in \text{vis}} \frac{(v_i - b_i)^2}{2\sigma_i^2} \;-\; \sum_{j \in \text{hid}} b_j h_j \;-\; \sum_{i,j} \frac{v_i}{\sigma_i}\, h_j\, w_{ij}$$

The first term is a parabolic containment function for each visible unit; the last term is the energy-gradient produced by the total input to a visible unit.

Welling et al. (2005) show how to extend RBMs to the exponential family. See also Bengio et al. (2007).

Page 5

The two conditional distributions for a Gaussian-Bernoulli RBM

$$p(h_j = 1 \mid \mathbf{v}) \;=\; \text{logistic}\Big(b_j + \sum_{i \in \text{vis}} \frac{v_i}{\sigma_i}\, w_{ij}\Big)$$

$$p(v_i \mid \mathbf{h}) \;=\; \mathcal{N}\Big(b_i + \sigma_i \sum_{j \in \text{hid}} h_j w_{ij},\;\; \sigma_i^2\Big)$$
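As a concrete illustration (my addition, not from the slides), here is a minimal NumPy sketch of one alternating Gibbs step using these two conditional distributions; the layer sizes are made up and a single shared sigma is a simplification:

```python
import numpy as np

rng = np.random.default_rng(0)

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

def gibbs_step(v, W, b_vis, b_hid, sigma=1.0):
    """One alternating Gibbs step for a Gaussian-Bernoulli RBM.
    v: (num_vis,) real-valued visibles; W: (num_vis, num_hid) weights.
    sigma is a shared noise standard deviation (a simplification)."""
    # Up pass: binary hiddens ~ logistic(b_j + sum_i (v_i / sigma) w_ij).
    p_h = logistic(b_hid + (v / sigma) @ W)
    h = (rng.random(p_h.shape) < p_h).astype(float)
    # Down pass: visibles ~ N(b_i + sigma * sum_j h_j w_ij, sigma^2).
    v_new = rng.normal(b_vis + sigma * (W @ h), sigma)
    return v_new, h

# Toy example: 4 visible units, 3 hidden units.
W = rng.normal(0.0, 0.1, size=(4, 3))
v, h = gibbs_step(rng.normal(size=4), W, np.zeros(4), np.zeros(3))
```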

Page 6

A random sample of 10,000 binary filters learned by Alex Krizhevsky on a million 32x32 color images.

Page 7

3D Object Recognition: The NORB dataset

Stereo-pairs of grayscale images of toy objects.

- 6 lighting conditions, 162 viewpoints
- Five object instances per class in the training set
- A different set of five instances per class in the test set
- 24,300 training cases, 24,300 test cases

The five classes: animals, humans, planes, trucks, cars.

Normalized-uniform version of NORB

Page 8

NORB Pre-processing

Each image in a stereo pair is 96x96, so a training case is 18432-dimensional.

We reduce the dimensionality to 8976 by blurring the borders of an image. Only the central 64x64 region has full resolution.

We also make the data zero-mean and scale each dimension to have unit variance.

We then use an RBM with Gaussian visible units to learn a layer of 4000 binary features from the unlabelled data. These features are not fine-tuned.
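For concreteness, a minimal sketch of the per-dimension normalization step (my illustration, not the lecture's code; the epsilon guard is an assumption):

```python
import numpy as np

def normalize(X):
    """Zero-mean, unit-variance scaling of each input dimension.
    X: (num_cases, num_dims) array of training vectors."""
    mu = X.mean(axis=0)
    sd = X.std(axis=0) + 1e-8   # guard against constant dimensions (assumed)
    return (X - mu) / sd
```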

Page 9

The model contains ~116 million parameters and is trained with only 24,300 labeled images.

[Figure: five competing generative models. 8976 Gaussian units feed a shared layer of 4000 binary units; on top of this, each of the five classes has its own top-level model of 4000 binary units.]

Each class-specific model is trained generatively on data from its own class.

All five models are also trained discriminatively to make the right model have the lowest free energy.

Page 10

Results (Nair & Hinton, NIPS 2009)

Error rates:

Support Vector Machines: 11.6%
Convolutional Neural Networks (built-in geometric knowledge): 6.0%
Single top-level RBM with label variable, generative + discriminative training: 10.4%
Competing top-level RBMs, generative + discriminative training: 6.5%
Competing top-level RBMs, generative + discriminative training, plus extra unlabeled translated data: 5.3%

Page 11

An embarrassing problem

• The front end was a Restricted Boltzmann Machine with Gaussian visible units and binary hidden units.

• We tried learning the residual noise model for each Gaussian visible unit and it did not work.
  – So we just fixed the residual noise variance to be 1, which is much too big.

Page 12

Why it's hard to learn an appropriate noise level for each visible unit

• Suppose the model of the residual noise in a visible unit has a standard deviation of 0.1. The bottom-up weight is then effectively multiplied by 10 and the top-down weight is effectively divided by 10.
  – The resulting bottom-up effects are much too strong. They saturate the hidden units.
  – The resulting top-down effects are much too weak. They cannot move the visible units far enough away from their mean.

[Figure: visible unit i and hidden unit j; the bottom-up connection has effective weight w_ij/σ_i and the top-down connection has effective weight σ_i w_ij.]

Page 13

How to implement an integer-valued variable (Teh and Hinton, 2001)

• One way to model an integer-valued variable is to make N identical copies of a Bernoulli unit. All copies have the same probability $p = \text{logistic}(x)$ of being "on".
  – The total number of "on" copies (the "firing rate") has a binomial distribution with variance $Np(1-p)$.
  – Provided p is small, the total number of "on" copies is approximately Poisson-distributed with variance $Np$.

• It is very hard to stay in the Poisson regime because the "firing rate" grows exponentially with the input to the unit.
  – So learning is very unstable.
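A quick simulation (my addition, not from the slides) of the binomial and Poisson behaviour of N shared-probability copies:

```python
import numpy as np

rng = np.random.default_rng(0)

# N identical copies of a Bernoulli unit with shared p = logistic(x):
# the number of "on" copies is Binomial(N, p), close to Poisson(N*p)
# when p is small.
N, x = 100, -3.0
p = 1.0 / (1.0 + np.exp(-x))             # logistic(x), about 0.047
counts = rng.binomial(N, p, size=100_000)
# Mean ~ N*p; variance ~ N*p*(1-p), which approaches N*p as p shrinks.
print(counts.mean(), counts.var())
```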

Page 14

How to fix the problem with an infinite model

• Make infinitely many copies of each binary hidden unit.

• All copies have the same weights and the same adaptive bias, b, but they have different fixed offsets to the bias:

$$b - 0.5,\quad b - 1.5,\quad b - 2.5,\quad b - 3.5,\;\ldots$$

• Now if the bottom-up effects become 10 times stronger, 10 times as many hidden units will turn on, producing a top-down effect that is 10 times as big.

Page 15

A fast approximation

• Contrastive divergence learning works well for the sum of binary units with offset biases.

• It also works for rectified linear units. These are much faster to compute than the sum of many logistic units.

output = max(0, x + randn*sqrt(logistic(x)) )

$$\sum_{n=1}^{\infty} \text{logistic}(x - n + 0.5) \;\approx\; \log(1 + e^{x})$$
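Both formulas are easy to check numerically. A small sketch (my illustration) of the smooth softplus approximation to the sum and the noisy rectified linear sample from the slide:

```python
import numpy as np

rng = np.random.default_rng(0)

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

def softplus(x):
    """log(1 + e^x): the smooth approximation to the infinite sum above."""
    return np.log1p(np.exp(x))

def nrelu_sample(x):
    """The noisy rectified linear unit from the slide:
    max(0, x + Gaussian noise whose variance is logistic(x))."""
    noise = rng.normal(size=np.shape(x)) * np.sqrt(logistic(x))
    return np.maximum(0.0, x + noise)

x = np.linspace(-5.0, 5.0, 11)
truncated = sum(logistic(x - n + 0.5) for n in range(1, 1000))  # finite sum
print(np.max(np.abs(truncated - softplus(x))))  # small discrepancy
print(nrelu_sample(x))                          # one stochastic sample
```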

Page 16

A harder version of the NORB dataset

• The objects are jittered in position, size, and intensity.

• A distractor object is placed off-center in the image.

• A cluttered background is added to the image (but it's at a different depth, so stereo can be quite effective at ignoring it).

• To make experiments quicker the 96x96 images were reduced to 32x32 images in the simplest possible way.

Page 17

[Placeholder slide: get examples from the ReLU paper.]

Page 18

Results on jittered-cluttered NORB (Nair & Hinton, ICML 2010)

Error rates:

Multinomial logistic regression on the pixels: 49.9%
Support Vector Machines: 43.3%
Convolutional Neural Networks (built-in geometric knowledge): 7.2%
Multinomial logistic regression on second hidden layer, binary units: 18.8%
Multinomial logistic regression on second hidden layer, RELU units: 15.2%

Page 19

Equivariance

• Simple convolutional neural nets give translational equivariance, not translational invariance.

• A small amount of translational invariance is then achieved at each layer by using local averaging or maxing.

[Figure: an image maps to a representation; a translated image maps to the correspondingly translated representation.]
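A tiny 1-D illustration (my addition) of the distinction: convolution shifts its output when the input shifts, and local maxing then buys a little invariance:

```python
import numpy as np

# Translation equivariance of convolution, in 1-D: shifting the input
# shifts the feature map by the same amount (single-impulse toy example).
x = np.zeros(16)
x[5] = 1.0
k = np.array([1.0, 2.0, 1.0])
y = np.convolve(x, k, mode="full")
y_shifted = np.convolve(np.roll(x, 3), k, mode="full")
assert np.allclose(np.roll(y, 3), y_shifted)

# Local maxing on top then gives a small amount of translation invariance.
pooled = y.reshape(-1, 2).max(axis=1)
```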

Page 20

Intensity equivariance

• Consider a net with multiple layers of rectified linear units that all have a bias of zero.
  – Ignore the sampling noise for now.

• If we scale all the intensities in the input image by S, exactly the same set of hidden units will be above threshold in every hidden layer, but all of their activities will be scaled up by S.
  – It's a linear model.
  – If the ratios of intensities do not change, it will not switch to a different linear model.
  – So we have a very non-linear model (globally) that respects the linear property $\text{rep}(S\mathbf{x}) = S\,\text{rep}(\mathbf{x})$.
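This scaling property is easy to verify. A minimal sketch (my illustration, with random weights standing in for a trained net):

```python
import numpy as np

rng = np.random.default_rng(0)

def relu_net(x, weights):
    """A stack of bias-free rectified linear layers."""
    for W in weights:
        x = np.maximum(0.0, W @ x)
    return x

weights = [rng.normal(size=(8, 8)) for _ in range(3)]
x = rng.normal(size=8)
S = 7.3
# Positive rescaling commutes with bias-free ReLU layers: rep(S x) == S rep(x).
assert np.allclose(relu_net(S * x, weights), S * relu_net(x, weights))
```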

Page 21

Intensity invariance

• Equivariance ensures that if we can represent an image well, we can also represent it just as well when all the intensities are scaled up.

• Given equivariance to intensity as the basic property of the model, we can then achieve invariance by a little bit of local normalization at each layer.
  – Local divisive normalization does for intensity what local averaging or maxing does for translation in a convolutional net.
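A sketch of what such a normalization might look like (my illustration; the window size and epsilon are assumptions, and this is one plausible form of divisive normalization, not the lecture's exact recipe):

```python
import numpy as np

def local_divisive_norm(a, radius=2, eps=1e-8):
    """Divide each activity by the average activity in a local window,
    so multiplying all intensities by S leaves the output (almost) unchanged."""
    kernel = np.ones(2 * radius + 1) / (2 * radius + 1)
    local_mean = np.convolve(a, kernel, mode="same")
    return a / (local_mean + eps)

a = np.abs(np.random.default_rng(0).normal(size=10))
assert np.allclose(local_divisive_norm(a), local_divisive_norm(5.0 * a), atol=1e-5)
```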

Page 22

Test examples from the CIFAR-10 dataset: plane, car, bird, cat, deer, dog, frog, horse, ship, truck.

Page 23

The CIFAR-10 labeled subset of the MIT tiny images dataset

• There are 5000 32x32 carefully labelled training images and 1000 32x32 carefully labelled testing images for each of 10 different classes (so it's very like MNIST!).

• There are 80 million very badly labelled images that are found by trawling the web using about 50,000 search words or phrases taken from "Wordnet".
  – Less than 10% of these images have a dominant object that fits the search term.
  – For example, about 90% of images found using "cat" do not contain a cat!

• The badly labelled images are very useful for unsupervised pre-training.

Page 24

Some ways to make convolutional neural nets work really well (Alex Krizhevsky)

• Use RELU units instead of binary units in the first hidden layer.

• Use overlapping pools for "sub-sampling".
  – This retains more information about the positions of features, but codes this information in units that have large receptive fields.

• In addition to the usual convolutional units that have local receptive fields with weight-sharing, use a separate set of global units.
  – The global units learn to take care of edge effects that would otherwise mess up the local units.

Page 25

Local filters learned by the 64 convolutional kernels in the first hidden layer of RELUs.

[Figure: the filters after unsupervised pre-training and after supervised fine-tuning.]

Page 26

Results of Alex Krizhevsky’s experiments

Error rates:

Pre-train DBN with 2 hidden layers of fully connected binary units, add softmax layer and fine-tune with backprop: ~35%
Convolutional neural net (2 hidden layers with pooling): ~29%
Convolutional RELU net + global filters (1 hidden layer, no pooling): ~29%
Convolutional RELU net + global filters (1 hidden layer, overlapping pooling): ~25%
Convolutional RELU net + global filters (2 hidden layers, overlapping pooling): ~21%

Page 27

Mixtures of linear models

• Mixtures of linear models are easy to fit and work surprisingly well on many simple problems.

• Given a set of images of digits that have been approximately normalized for size and position, we can get quite a good model by using a mixture of factor analysers.
  – Each factor analyser models one class or subclass.
  – The factors typically model small local translations of parts of the digit.

• This type of model looks very different from a multilayer neural net.

Page 28

Some problems with mixtures of linear models

• For data with many different things going on at once we may need exponentially many components in the mixture.
  – Consider modeling pairs of digits.

• If there are too many components in the mixture they lose statistical power.
  – Each linear model only gets trained on a very small fraction of the data.
  – A mixture cannot benefit from the fact that the top of a 2 is like the top of a 3.

• It's hard to stitch together the latent representations of different factor analysers because they all use different coordinate systems.
  – This makes it hard to associate actions with the latent representations.

Page 29

An old idea for learning a mixture of exponentially many linear models (Sallans, Hinton & Ghahramani, 1997)

• First generate a hierarchical binary state by running a sigmoid belief net.

• Each unit in the SBN can switch in a corresponding linear unit.
  – So the SBN chooses which factors to use in a hierarchical linear model.

• We get combinatorially many linear models that share factors,
  – but inference is hard.

Page 30

A better version of the old idea

• Combine the binary switch and the linear variable into a single rectified linear unit.

– Rectification is equivalent to the rule: If the total input is negative, exclude that unit from the linear model.

– This has the nice property that there is no discontinuity at the point at which the unit is switched in, because its value is 0.

• Use an undirected model with only one hidden layer so that inference is easy.

– We can infer the value of a hidden unit without knowing which other units are switched in.

Page 31

BREAK?

Page 32

Overview

• An efficient way to train a multilayer neural network to extract a low-dimensional representation.

• Document retrieval (published work with Russ Salakhutdinov)

  – How to model a bag of words with an RBM
  – How to learn binary codes
  – Semantic hashing: retrieval in no time

• Image retrieval (unpublished work with Alex Krizhevsky)
  – How good are 256-bit codes for retrieval of small color images?
  – Ways to use the speed of semantic hashing for higher-quality image retrieval (work in progress).

Page 33

Deep Autoencoders (with Ruslan Salakhutdinov)

• They always looked like a really nice way to do non-linear dimensionality reduction:
  – But it is very difficult to optimize deep autoencoders using backpropagation.

• We now have a much better way to optimize them:
  – First train a stack of 4 RBMs.
  – Then "unroll" them.
  – Then fine-tune with backprop.

[Figure: the encoder takes a 28x28 image through layers of 1000, 500, and 250 neurons (weights W1, W2, W3) to a 30-dimensional code (W4); the decoder mirrors this with the transposed weights W4^T, W3^T, W2^T, W1^T to reconstruct the 28x28 image.]
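To make the recipe concrete, here is a minimal NumPy sketch (my own illustration, with toy layer sizes instead of 784-1000-500-250-30 and with biases omitted) of greedy RBM pre-training followed by unrolling into an autoencoder with tied, transposed decoder weights:

```python
import numpy as np

rng = np.random.default_rng(0)

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_rbm(data, num_hid, epochs=5, lr=0.1):
    """One-step contrastive divergence (CD-1) for a binary RBM.
    Biases are omitted for brevity; real implementations include them."""
    num_vis = data.shape[1]
    W = rng.normal(0.0, 0.01, size=(num_vis, num_hid))
    for _ in range(epochs):
        for v0 in data:
            p_h0 = logistic(v0 @ W)
            h0 = (rng.random(num_hid) < p_h0).astype(float)
            v1 = logistic(W @ h0)            # mean-field reconstruction
            p_h1 = logistic(v1 @ W)
            W += lr * (np.outer(v0, p_h0) - np.outer(v1, p_h1))
    return W

# Greedy layer-wise pre-training: each RBM's hidden probabilities become
# the "data" for the next RBM.
data = (rng.random((200, 64)) < 0.3).astype(float)
stack, X = [], data
for num_hid in [32, 16, 8]:
    W = train_rbm(X, num_hid)
    stack.append(W)
    X = logistic(X @ W)

def autoencode(v):
    """The 'unrolled' net: encoder weights, then their transposes."""
    for W in stack:
        v = logistic(v @ W)
    for W in reversed(stack):
        v = logistic(v @ W.T)
    return v

recon = autoencode(data[0])   # ready for fine-tuning with backprop
```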

Page 34

A comparison of methods for compressing digit images to 30 real numbers.

[Figure: digit reconstructions from the real data, a 30-D deep autoencoder, 30-D logistic PCA, and 30-D PCA.]

Page 35

Compressing a document count vector to 2 numbers

• We train the autoencoder to reproduce its input vector as its output.

• This forces it to compress as much information as possible into the 2 real numbers in the central bottleneck.

• These 2 numbers are then a good way to visualize documents.

[Figure: 2000 word counts → 500 neurons → 250 neurons → 2 real-valued code units → 250 neurons → 500 neurons → 2000 reconstructed counts. We need a special type of RBM to model the word counts at the input.]

Page 36

First compress all documents to 2 numbers using a type of PCA. Then use different colors for different document categories.

Yuk!

Page 37

First compress all documents to 2 numbers. Then use different colors for different document categories.

Page 38

The replicated softmax model: How to modify an RBM to model word count vectors

• Modification 1: Keep the binary hidden units but use "softmax" visible units that represent 1-of-N.

• Modification 2: Make each hidden unit use the same weights for all the visible softmax units.

All the models in this family have 5 hidden units.

This model is for 8-word documents.

Page 39

The replicated softmax model: How to modify an RBM to model word count vectors

• Modification 3: Use as many softmax visible units as there are non-stop words in the document.
  – So it's actually a family of different-sized RBMs that share weights. It is not a single generative model.

• Modification 4: Multiply each hidden bias by the number of words in the document.

• The replicated softmax model gives much higher probability to held-out bags of words than LDA topic models (in NIPS 2009).
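For concreteness, a sketch (my illustration; the variable names and initialization are made up) of the up and down passes implied by these modifications:

```python
import numpy as np

rng = np.random.default_rng(0)

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

def hidden_probs(counts, W, a):
    """Up pass: all softmax visible units share the weights (Modification 2),
    and each hidden bias is multiplied by the document length (Modification 4)."""
    M = counts.sum()                    # number of non-stop words in the document
    return logistic(M * a + counts @ W)

def word_probs(h, W, b):
    """Down pass: a single softmax over the vocabulary, shared by all tokens;
    sampling M words from it reconstructs a bag of words."""
    scores = b + W @ h
    e = np.exp(scores - scores.max())
    return e / e.sum()

vocab, hid = 2000, 128
W = rng.normal(0.0, 0.01, size=(vocab, hid))
a, b = np.zeros(hid), np.zeros(vocab)
counts = rng.poisson(0.01, size=vocab).astype(float)   # toy word-count vector
p_h = hidden_probs(counts, W, a)
```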

Page 40

The relative negative log probs assigned to bags of words by LDA and replicated softmax RBM

Page 41

Finding real-valued codes for retrieval

• Train an autoencoder using 10 real-valued units in the code layer.

• Compare with Latent Semantic Analysis, which uses PCA on the transformed count vector.

• Non-linear codes are much better.

[Figure: 2000 word counts → 500 neurons → 250 neurons → 10-unit code → 250 neurons → 500 neurons → 2000 reconstructed counts.]

Page 42

Retrieval performance on 400,000 Reuters business news stories

Page 43

Finding binary codes for documents

• Train an autoencoder using 30 logistic units for the code layer.

• During the fine-tuning stage, add noise to the inputs to the code units.
  – The "noise" vector for each training case is fixed, so we still get a deterministic gradient.
  – The noise forces the code units' activities to become bimodal in order to resist its effects.
  – Then we simply threshold the activities of the 30 code units to get a binary code.

[Figure: 2000 word counts → 500 neurons → 250 neurons → 30 code units plus noise → 250 neurons → 500 neurons → 2000 reconstructed counts.]
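A toy sketch of the noise-then-threshold idea (entirely illustrative: the noise scale is an assumption, and real training would backpropagate through these activities):

```python
import numpy as np

rng = np.random.default_rng(0)

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

# Each training case gets a noise vector that is drawn once and then frozen,
# so the gradient stays deterministic.
num_cases, num_code = 4, 30
code_inputs = rng.normal(size=(num_cases, num_code))            # from the layer below
fixed_noise = rng.normal(0.0, 4.0, size=(num_cases, num_code))  # frozen per case
activities = logistic(code_inputs + fixed_noise)                # pushed toward 0 or 1
binary_codes = (activities > 0.5).astype(np.uint8)              # 30-bit code per case
```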

Page 44

Using a deep autoencoder as a hash-function for finding approximate matches

[Figure: a deep autoencoder maps a document to a binary code used as a memory address (the "hash function"); semantically similar documents land at nearby addresses, so retrieval becomes a "supermarket search" of the neighborhood.]

Page 45

Another view of semantic hashing

• Fast retrieval methods typically work by intersecting stored lists that are associated with cues extracted from the query.

• Computers have special hardware that can intersect 32 very long lists in one instruction.
  – Each bit in a 32-bit binary code specifies a list of half the addresses in the memory.

• Semantic hashing uses machine learning to map the retrieval problem onto the type of list intersection the computer is good at.
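A minimal sketch (my illustration; the table layout is an assumption) of semantic-hashing lookup via a small Hamming ball around the query's code:

```python
from itertools import combinations

# Documents are stored in a table keyed by their binary codes; a query
# retrieves everything whose code lies in a small Hamming ball around it.
table = {}  # code (int) -> list of document ids

def store(code, doc_id):
    table.setdefault(code, []).append(doc_id)

def hamming_ball(code, radius, bits=32):
    """All codes within the given Hamming distance of `code`."""
    yield code
    for r in range(1, radius + 1):
        for flips in combinations(range(bits), r):
            yield code ^ sum(1 << b for b in flips)

def query(code, radius=2):
    hits = []
    for c in hamming_ball(code, radius):
        hits.extend(table.get(c, []))
    return hits

store(0b1011, "doc-A")
store(0b1010, "doc-B")
print(query(0b1011, radius=1))   # ['doc-A', 'doc-B']
```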

Page 46

How good is a shortlist found this way?

• Russ has only implemented it for a million documents with 20-bit codes --- but what could possibly go wrong?
  – A 20-D hypercube allows us to capture enough of the similarity structure of our document set.

• The shortlist found using binary codes actually improves the precision-recall curves of TF-IDF.
  – Locality-sensitive hashing (the fastest other method) is much slower and has worse precision-recall curves.

Page 47

Semantic hashing for image retrieval

• Currently, image retrieval is typically done by using the captions. Why not use the images too?
  – Pixels are not like words: individual pixels do not tell us much about the content.
  – Extracting object classes from images is hard.

• Maybe we should extract a real-valued vector that has information about the content?
  – Matching real-valued vectors in a big database is slow and requires a lot of storage.

• Short binary codes are easy to store and match.

Page 48

A two-stage method

• First, use semantic hashing with 28-bit binary codes to get a long "shortlist" of promising images.

• Then use 256-bit binary codes to do a serial search for good matches.
  – This only requires a few words of storage per image, and the serial search can be done using fast bit-operations.

• But how good are the 256-bit binary codes?
  – Do they find images that we think are similar?
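A sketch (my illustration; the data layout is an assumption) of the serial 256-bit stage: Hamming distance computed with XOR and a population count over four 64-bit words per image:

```python
import numpy as np

rng = np.random.default_rng(0)

# Each image's 256-bit code is stored as four 64-bit words.
db = rng.integers(0, 2**63, size=(100_000, 4), dtype=np.uint64)  # stored codes
q = rng.integers(0, 2**63, size=4, dtype=np.uint64)              # query code

def hamming(db, q):
    """Popcount of XOR, summed over the four 64-bit words per code."""
    x = db ^ q
    return np.unpackbits(x.view(np.uint8), axis=1).sum(axis=1)

best = np.argsort(hamming(db, q))[:10]   # indices of the ten nearest codes
```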

Page 49

Some depressing competition

• Inspired by the speed of semantic hashing, Weiss, Fergus and Torralba (NIPS 2008) used a very fast spectral method to assign binary codes to images.
  – This eliminates the long learning times required by deep autoencoders.

• They claimed that their spectral method gave better retrieval results than training a deep autoencoder using RBMs.
  – But they could not get RBMs to work well for extracting features from RGB pixels, so they started from 384 GIST features.
  – This is too much dimensionality reduction too soon.

Page 50

A comparison of deep autoencoders and the spectral method using 256-bit codes (Alex Krizhevsky)

• Train autoencoders "properly":
  – Use Gaussian visible units with fixed variance. Do not add noise to the reconstructions.
  – Use a cluster machine or a big GPU board.
  – Use a lot of hidden units in the early layers.

• Then compare with the spectral method.
  – The spectral method has no free parameters.

• Also compare with Euclidean match in pixel space.

Page 51

Krizhevsky’s deep autoencoder

[Figure: three input blocks of 1024 units feed layers of 8192, 4096, 2048, 1024, and 512 units, ending in a 256-bit binary code.]

The encoder has about 67,000,000 parameters. It takes a few GTX 285 GPU days to train on two million images. There is no theory to justify this architecture.

Page 52

Reconstructions produced by 256-bit codes

Page 53

[Figure: retrieval examples; rows labelled A, S, and E.]

Page 54

[Figure: retrieval examples; rows labelled A, S, and E.]

Page 55

[Figure: retrieval examples; rows labelled A, S, and E.]

Page 56

[Figure: retrieval examples; rows labelled A, S, and E.]

Page 57

[Figure: retrieval examples; rows labelled A, S, and E.]

Page 58

Retrieval results

Page 59

An obvious extension

• Use a multimedia autoencoder that represents captions and images in a single code.
  – The captions should help it extract more meaningful image features such as "contains an animal" or "indoor image".

• RBMs already work much better than standard LDA topic models for modeling bags of words.
  – So the multimedia autoencoder should be a win (for images), a win (for captions), and a win (for the interaction during training).

Page 60

A less obvious extension

• Semantic hashing gives incredibly fast retrieval, but it's hard to go much beyond 32 bits.

• We can afford to use semantic hashing several times with variations of the query and merge the shortlists.
  – It's easy to enumerate the Hamming ball around a query image address in ascending address order, so merging is linear time (see the sketch after this list).

• Apply many transformations to the query image to get transformation-independent retrieval.
  – Image translations are an obvious candidate.
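A small sketch (my illustration; the shortlists are toy data) of merging the sorted shortlists:

```python
import heapq
from collections import Counter

# Each query variant yields a shortlist of image ids in ascending order,
# so a k-way merge over the already-sorted lists is cheap.
shortlists = [[3, 17, 42], [5, 17, 99], [3, 42, 77]]
merged = heapq.merge(*shortlists)

# Count how many variants retrieved each image id.
scores = Counter(merged)
print(scores.most_common(3))   # ids retrieved by the most variants first
```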

Page 61

A more interesting extension

• Computer vision uses images of uniform resolution.
  – Multi-resolution images still keep all the high-resolution pixels.

• Even on 32x32 images, people use a lot of eye movements to attend to different parts of the image.
  – Human vision copes with big translations by moving the fixation point.
  – It only samples a tiny fraction of the image at high resolution. The "post-retinal" image has resolution that falls off rapidly outside the fovea.
  – With fewer "neurons", intelligent sampling becomes even more important.

Page 62

How we extract multiple images with fewer pixels per image

• This variable-resolution "retina" is applied at 81 different locations in the image.

• Each of the 81 "post-retinal" images is used for semantic hashing.
  – We also use the whole image.

Page 63

A more human metric for image similarity

• Two images are similar if fixating at point X in one image and point Y in the other image gives similar post-retinal images.

• So use semantic hashing on post-retinal images.
  – The address space is used for post-retinal images, and each address points to the whole image that the post-retinal image came from.
  – So we can accumulate similarity over multiple fixations.

• The whole-image addresses found after each fixation have to be sorted to allow merging.

Page 64

Summary

• Restricted Boltzmann Machines provide an efficient way to learn a layer of features without any supervision.
  – Many layers of representation can be learned by treating the hidden states of one RBM as the data for the next.

• This allows us to learn very deep nets that extract short binary codes for unlabeled images or documents.
  – Using 32-bit codes as addresses allows us to get approximate matches at the speed of hashing.

• Semantic hashing is fast enough to allow many retrieval cycles for a single query image.
  – So we can try multiple transformations of the query.