Yoshua Bengio Statistical Machine Learning Chair, U. Montreal
ICML 2011 Workshop on Unsupervised and Transfer Learning, July 2nd, 2011, Bellevue, WA
Slide 2
How to Beat the Curse of Many Factors of Variation?
Compositionality: exponential gain in representational power
Distributed representations
Deep architecture
Slide 3
Distributed Representations
Many neurons active simultaneously
Input represented by the activation of a set of features that are not mutually exclusive
Can be exponentially more efficient than local representations
Slide 4
Local vs Distributed
Slide 5
RBM Hidden Units Carve Input Space
(figure: hidden units h1, h2, h3 partition the input space over x1, x2)
Slide 6
Unsupervised Deep Feature Learning
Classical: pre-process data with PCA = leading factors
New: learning multiple levels of features/factors, often over-complete
Greedy layer-wise strategy: stack unsupervised feature-learning layers on the raw input x, then supervised fine-tuning towards P(y|x) (Hinton et al 2006, Bengio et al 2007, Ranzato et al 2007); see the sketch below
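A minimal NumPy sketch of the greedy layer-wise strategy above (not the original code; `pretrain_layer` is a hypothetical stand-in for any unsupervised learner such as an RBM or auto-encoder):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def pretrain_layer(H, n_hidden, seed=0):
    """Hypothetical stand-in: fit one unsupervised layer (e.g. an RBM or
    auto-encoder) on H and return weights/biases mapping inputs to features."""
    rng = np.random.RandomState(seed)
    W = 0.01 * rng.randn(H.shape[1], n_hidden)
    b = np.zeros(n_hidden)
    # ... unsupervised parameter updates would go here ...
    return W, b

def greedy_layerwise_pretrain(X, layer_sizes):
    """Each layer is trained on the features produced by the previous one."""
    params, H = [], X
    for n_hidden in layer_sizes:
        W, b = pretrain_layer(H, n_hidden)
        params.append((W, b))
        H = sigmoid(H @ W + b)   # these features feed the next layer
    return params, H             # the whole stack is then fine-tuned on P(y|x)
```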
Slide 7
Why Deep Learning?
Hypothesis 1: a deep hierarchy of features is needed to efficiently represent and learn the complex abstractions required for AI and mammal intelligence. Computational & statistical efficiency.
Hypothesis 2: unsupervised learning of representations is a crucial component of the solution. Optimization & regularization.
Theoretical and ML-experimental support for both.
Cortex: deep architecture, the same learning rule everywhere
Slide 8
Deep Motivations
Brains have a deep architecture
Human ideas & artifacts are composed from simpler ones
Insufficient depth can be exponentially inefficient
Distributed (possibly sparse) representations are necessary for non-local generalization; exponentially more efficient than 1-of-N enumeration of latent variable values
Multiple levels of latent variables allow combinatorial sharing of statistical strength
(figure: raw input x feeding shared intermediate representations used by tasks 1, 2 and 3)
Slide 9
Deep Architectures are More Expressive
Theoretical arguments: 2 layers of logic gates / formal neurons / RBF units = universal approximator
Theorems for all 3 (Håstad et al 1986 & 1991, Bengio et al 2007): functions compactly represented with k layers may require exponential size with k-1 layers
Slide 10
Sharing Components in a Deep Architecture
Polynomial expressed with shared components: the advantage of depth may grow exponentially (Bengio & Delalleau, Learning Workshop 2011)
(figure: sum-product network)
Slide 11
Parts Are Composed to Form Objects
Layer 1: edges; Layer 2: parts; Layer 3: objects (Lee et al., ICML 2009)
Slide 12
Deep Architectures and Sharing Statistical Strength, Multi-Task Learning
Generalizing better to new tasks is crucial to approach AI
Deep architectures learn good intermediate representations that can be shared across tasks
Good representations make sense for many tasks
(figure: raw input x -> shared intermediate representation h -> task outputs y1, y2, y3)
Slide 13
Feature and Sub-Feature Sharing
Different tasks can share the same high-level features
Different high-level features can be built from the same set of lower-level features
More levels = up to exponential gain in representational efficiency
(figure: sharing vs. not sharing intermediate features between task 1 ... task N, via high-level and low-level feature layers)
Slide 14
Representations as Coordinate Systems
PCA: removing low-variance directions is easy, but what if the signal has low variance?
We would like to disentangle factors of variation, keeping them all.
Overcomplete representations: richer, even if the underlying distribution concentrates near a low-dimensional manifold.
Sparse/saturated features: allow for variable-dimension manifolds.
A different small set of sensitive features at each x = local chart coordinate system.
Slide 15
Effect of Unsupervised Pre-training (AISTATS 2009 + JMLR 2010, with Erhan, Courville, Manzagol, Vincent, S. Bengio)
Slide 16
Effect of Depth (figure: without vs. with pre-training)
Slide 17
Unsupervised Feature Learning: a Regularizer to Find Better Local Minima of Generalization Error
Unsupervised pre-training acts like a regularizer
Helps to initialize in the basin of attraction of local minima with better generalization error
Slide 18
Non-Linear Unsupervised Feature Extraction Algorithms
CD for RBMs
SML (PCD) for RBMs
Sampling beyond Gibbs (e.g. tempered MCMC)
Mean-field + SML for DBMs
Sparse auto-encoders
Predictive Sparse Decomposition
Denoising Auto-Encoders
Score Matching / Ratio Matching
Noise-Contrastive Estimation
Pseudo-Likelihood
Contractive Auto-Encoders
See my book / review paper (F&TML 2009): Learning Deep Architectures for AI
Slide 19
Auto-Encoders
Reconstruction = decoder(encoder(input))
Probable inputs have small reconstruction error
Linear decoder/encoder = PCA up to rotation
Can be stacked to form highly non-linear representations, increasing disentangling (Goodfellow et al, NIPS 2009)
(figure: input -> encoder -> code = latent features -> decoder -> reconstruction)
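A minimal sketch of one auto-encoder pass in NumPy (tied weights assumed; names are mine, not from the slides); with a linear encoder/decoder and squared error this recovers the PCA subspace, as noted above:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def autoencoder_cost(x, W, b_h, b_r):
    """Reconstruction = decoder(encoder(x)), with tied weights W."""
    h = sigmoid(W @ x + b_h)        # encoder: code = latent features
    r = sigmoid(W.T @ h + b_r)      # decoder: reconstruction of the input
    return np.sum((r - x) ** 2), h  # probable inputs get small reconstruction error
```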
Slide 20
Restricted Boltzmann Machine (RBM)
The most popular building block for deep architectures
Bipartite undirected graphical model over observed x and hidden h
Inference is trivial: P(h|x) & P(x|h) factorize
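For a binary RBM with weights W, visible biases b and hidden biases c (notation mine), the factorized conditionals are the usual logistic units:

P(h_j = 1 \mid x) = \mathrm{sigmoid}\Big(c_j + \sum_i W_{ij}\, x_i\Big), \qquad P(x_i = 1 \mid h) = \mathrm{sigmoid}\Big(b_i + \sum_j W_{ij}\, h_j\Big)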
Slide 21
Gibbs Sampling in RBMs
P(h|x) and P(x|h) factorize: easy inference, convenient Gibbs sampling
Alternate: h1 ~ P(h|x1), x2 ~ P(x|h1), h2 ~ P(h|x2), x3 ~ P(x|h2), h3 ~ P(h|x3), ...
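A minimal NumPy sketch of this block Gibbs chain (assumed shapes: x is a binary vector, W has one column per hidden unit, b and c are the visible/hidden biases; n_steps >= 1):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gibbs_chain(x1, W, b, c, n_steps, rng=np.random):
    """Alternate h_t ~ P(h|x_t) and x_{t+1} ~ P(x|h_t) in a binary RBM."""
    x = x1
    for _ in range(n_steps):
        p_h = sigmoid(c + x @ W)                      # P(h|x) factorizes over units
        h = (rng.uniform(size=p_h.shape) < p_h) * 1.0
        p_x = sigmoid(b + h @ W.T)                    # P(x|h) factorizes over units
        x = (rng.uniform(size=p_x.shape) < p_x) * 1.0
    return x, h
```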
Slide 22
RBMs are Universal Approximators
Adding one hidden unit (with proper choice of parameters) guarantees increasing the likelihood
With enough hidden units, can perfectly model any discrete distribution
RBMs with a variable number of hidden units = non-parametric
(Le Roux & Bengio 2008, Neural Computation)
Slide 23
Denoising Auto-Encoder
Learns a vector field pointing towards higher-probability regions
Minimizes a variational lower bound on a generative model
Similar to pseudo-likelihood; a form of regularized score matching
(figure: corrupted input -> reconstruction)
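A minimal sketch of the denoising criterion (masking noise, tied weights, squared error; my choices, not necessarily those of the original papers):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def denoising_cost(x, W, b_h, b_r, corruption=0.3, rng=np.random):
    """Corrupt the input, encode/decode it, but score against the clean x."""
    keep = rng.uniform(size=x.shape) >= corruption   # drop each dim with prob. `corruption`
    x_tilde = x * keep                               # masking noise
    h = sigmoid(W @ x_tilde + b_h)                   # encode the corrupted input
    r = sigmoid(W.T @ h + b_r)                       # reconstruction
    return np.sum((r - x) ** 2)                      # error w.r.t. the uncorrupted input
```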
Slide 24
Stacked Denoising Auto-Encoders
No partition function; the training criterion can be measured directly
Encoder & decoder: any parametrization
Performs as well as or better than stacking RBMs for unsupervised pre-training
(figure: results on Infinite MNIST)
Slide 25
Contractive Auto-Encoders
Contractive Auto-Encoders: Explicit Invariance During Feature Extraction, Rifai, Vincent, Muller, Glorot & Bengio, ICML 2011.
Higher Order Contractive Auto-Encoders, Rifai, Mesnil, Vincent, Muller, Bengio, Dauphin, Glorot, ECML 2011.
Part of the winning toolbox in the final phase of the Unsupervised & Transfer Learning Challenge 2011
Slide 26
Training criterion: wants contraction in all directions, but cannot afford contraction in the manifold directions
Few active units represent the active subspace (local chart)
The Jacobian's spectrum is peaked = local low-dimensional representation / relevant factors
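A minimal sketch of the contractive penalty for a sigmoid encoder: the squared Frobenius norm of the encoder Jacobian dh/dx, added (with some weight) to the reconstruction error (names mine):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def contractive_penalty(x, W, b_h):
    """||dh/dx||_F^2 for h = sigmoid(W x + b_h): dh_j/dx_i = h_j (1 - h_j) W_ji."""
    h = sigmoid(W @ x + b_h)
    return np.dot((h * (1 - h)) ** 2, (W ** 2).sum(axis=1))
```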
Slide 27
Unsupervised and Transfer Learning Challenge: 1st Place in Final Phase
(figure: results with raw data vs. 1, 2, 3 and 4 layers)
Slide 28
Transductive Representation Learning
Validation and test sets have different classes than the training set, hence very different input distributions
Directions that matter to distinguish them might have small variance under the training set
Solution: perform the last level of unsupervised feature learning (PCA) on the validation / test set input data.
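A minimal sketch of that last transductive step, assuming H_test holds the deep features of the (unlabeled) validation/test inputs:

```python
import numpy as np

def transductive_pca(H_test, n_components):
    """Fit the last-level PCA on the evaluation-set representations themselves,
    so directions with small training-set variance can still be retained."""
    H = H_test - H_test.mean(axis=0)
    U, S, Vt = np.linalg.svd(H, full_matrices=False)
    return H @ Vt[:n_components].T   # coordinates along the top principal directions
```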
Slide 29
Domain Adaptation (ICML 2011)
Small (4-domain) Amazon benchmark: we beat the state-of-the-art handsomely
The sparse rectifier SDA finds more features that tend to be useful for predicting either domain or sentiment, but not both
Slide 30
Sentiment Analysis: Transfer Learning
25 Amazon.com domains: toys, software, video, books, music, beauty, ...
Unsupervised pre-training of the input space on all domains
Supervised SVM on 1 domain, generalize out-of-domain
Baseline: bag-of-words + SVM
Slide 31
Sentiment Analysis: Large-Scale Out-of-Domain Generalization
Relative loss when going from in-domain to out-of-domain testing
340k examples from Amazon, from 56 (tools) to 124k (music)
Slide 32
Representing Sparse High-Dimensional Stuff
Deep Sparse Rectifier Neural Networks, Glorot, Bordes & Bengio, AISTATS 2011.
Sampled Reconstruction for Large-Scale Learning of Embeddings, Dauphin, Glorot & Bengio, ICML 2011.
(figure: sparse input (cheap) -> code = latent features -> dense output probabilities (expensive))
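A rough sketch of the sampled-reconstruction idea as I read it (names and details are mine): reconstruct the non-zero input dimensions plus a small, importance-weighted random sample of the zeros, instead of the full dense output:

```python
import numpy as np

def sampled_reconstruction_dims(x, n_sampled, rng=np.random):
    """Choose which output dimensions to reconstruct for a sparse input x."""
    nonzero = np.flatnonzero(x)
    zero = np.flatnonzero(x == 0)
    sampled = rng.choice(zero, size=min(n_sampled, zero.size), replace=False)
    dims = np.concatenate([nonzero, sampled])
    weights = np.ones(dims.size)
    if sampled.size:                          # reweight sampled zeros to stay unbiased
        weights[nonzero.size:] = zero.size / float(sampled.size)
    return dims, weights
```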
Slide 33
Speedup from Sampled Reconstruction
Slide 34
Deep Self-Taught Learning for Handwritten Character Recognition
Y. Bengio & 16 others (IFT6266 class project & AISTATS 2011 paper)
Discriminate 62 character classes (upper, lower, digits), 800k to 80M examples
Deep learners beat the state-of-the-art on NIST and reach human-level performance
Deep learners benefit more from perturbed (out-of-distribution) data
Deep learners benefit more from the multi-task setting
Slide 35
Slide 36
Improvement due to training on perturbed data (figure panels: Deep vs. Shallow)
{SDA,MLP}1: trained only on data with distortions (thickness, slant, affine, elastic, pinch)
{SDA,MLP}2: all perturbations
Slide 37
Improvement due to the multi-task setting (figure panels: Deep vs. Shallow)
Multi-task: train on all categories, test on target categories (shared representations)
Not multi-task: train and test on target categories only (no sharing, separate models)
Slide 38
Comparing against Humans and SOA
All 62 character classes; deep learners reach human performance
[1] Granger et al., 2007; [2] Pérez-Cortes et al., 2000 (nearest neighbor); [3] Oliveira et al., 2002b (MLP); [4] Milgram et al., 2005 (SVM)
Slide 39
Tips & Tricks
Don't be scared by the many hyper-parameters: use random sampling (not grid search) & clusters / GPUs (see the sketch below)
The learning rate is the most important, along with the top-level dimension
Make sure the selected hyper-parameter value is not on the border of its interval
Early stopping
Using NLDR for visualization
Simulation of the final evaluation scenario
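A minimal sketch of random hyper-parameter sampling (ranges are illustrative, not prescriptions):

```python
import numpy as np

def sample_hyperparams(rng):
    """Random search instead of grid search: draw each hyper-parameter
    independently, log-uniformly for scale parameters like the learning rate."""
    return {
        "learning_rate": 10 ** rng.uniform(-4, -1),    # usually the most important
        "top_layer_size": int(2 ** rng.uniform(7, 12)),
        "n_layers": rng.randint(1, 5),
        "corruption": rng.uniform(0.0, 0.5),
    }

# Draw many configurations (run them on a cluster / GPUs), keep the best on the
# validation set, and check the winner is not at the border of any interval.
configs = [sample_hyperparams(np.random.RandomState(i)) for i in range(50)]
```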
Slide 40
Conclusions
Deep Learning: powerful arguments & generalization principles
Unsupervised Feature Learning is crucial: many new algorithms and applications in recent years
DL is particularly suited for multi-task learning, transfer learning, domain adaptation, self-taught learning, and semi-supervised learning with few labels