Deep Learning I
Junhong Kim
School of Industrial Management Engineering
Korea University
2016. 02. 01.
Deep Learning I
Deep learning is research on learning models with multilayer representations
1. Multilayer (feed-forward) Neural Network
2. Multilayer graphical model (Deep Belief Network, Deep Boltzmann machine)
Each layer corresponds to a “distributed representation”
1. Units in a layer are not mutually exclusive
(1) Each unit is a separate feature of the input
(2) Two units can be “active” at the same time
2. They do not correspond to a partitioning (clustering) of the input
(1) In clustering, an input can only belong to a single cluster
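As a small, purely illustrative sketch (the input and weight values below are assumed, not from the slides), a sigmoid hidden layer gives a distributed representation in which several units are active at once, whereas a clustering-style partition would assign the input to a single cluster:

```python
import numpy as np

x = np.array([1.0, 0.0, 1.0])              # an input vector (assumed values)
W = np.array([[ 0.8, -0.4,  0.6],          # hidden-layer weights (assumed values)
              [ 0.5,  0.9, -0.2],
              [-0.3,  0.7,  0.4]])
h = 1.0 / (1.0 + np.exp(-(W @ x)))          # sigmoid activations of the 3 hidden units
print(h)            # e.g. [0.80 0.57 0.52]: several units are "active" at the same time
print(np.argmax(h)) # a clustering/partitioning view would keep only this single unit
```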
Figure 1. Multilayer feed-forward neural network: the output layer chooses the decision class based on each class probability.
Figure 2. Inspiration from the visual cortex: comparison of the multilayer FFNN with the visual cortex.
Success Stories in the Deep Learning Area
Figure 3. Image Reconstruction
Figure 4. Image Question Answering
Figure 5. Application for Baduk (the game of Go)
Figure 6. Image Identification
Figure 7. Speech Recognition
And so on..
Facebook.com/AIkorea (Facebook group)
Why training is hard?
Figure 8. Weight update path in the FFNN
[One] Optimization is harder (underfitting)
1. Vanishing gradient problem
2. Saturated units block gradient propagation
In the saturated regions of the activation function the rate of change is low and converges to zero; only near the origin is the rate of change high. This is a well-known problem in RNNs as well.
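A small numerical sketch of this point (the pre-activation values are assumed for illustration): the sigmoid's derivative is at most 0.25 and is nearly zero for saturated units, so the gradient shrinks quickly as it is propagated back through many layers.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def dsigmoid(z):
    s = sigmoid(z)
    return s * (1.0 - s)                 # rate of change of the sigmoid

print(dsigmoid(0.0))                     # 0.25   -> high rate of change near the origin
print(dsigmoid(6.0))                     # ~0.002 -> saturated unit, gradient nearly blocked

# gradient magnitude reaching the first layer after passing back through 5 sigmoid units
grad = 1.0
for pre_activation in [0.0, 2.0, 4.0, 6.0, 8.0]:   # increasingly saturated units (assumed)
    grad *= dsigmoid(pre_activation)
print(grad)                              # a tiny number: the vanishing gradient problem
```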
Why training is hard?
[Two] Overfitting
1. We are exploring a space of complex functions
2. Deep nets usually have lots of parameters
We already know that a neural network can end up in a high-variance / low-bias situation
Figure 9. Visualization of Variance and Bias Problem
Why training is hard?
Solution for [One] (Underfitting): use better optimization methods
-> this is an active area of research
Solution for [Two] (Overfitting): unsupervised learning (unsupervised pre-training) and stochastic ‘dropout’ training
Figure 10. Visualization of pre-training
Figure 11. Visualization of Dropout learning
Unsupervised Pre-training
Initialize the hidden layers using unsupervised learning (RBM, autoencoder)
We will use a greedy, layer-wise procedure
-> train one layer at a time, from first to last, with an unsupervised criterion
-> fix the parameters of the previous hidden layers
-> the previous layers are viewed as feature extraction
Figure 9. Visualization of pre-training. It resembles other feature-extraction methods (PCA, MDS, and so on).
Number of units in present layer | Number of units in next layer
10 | 10 (same)
15 | 10 (smaller than the present layer)
10 | 15 (bigger than the present layer)
Unsupervised Pre-training
We call this procedure unsupervised pre-training
First layer: find hidden unit features that are more common in training inputs than in random inputs
Second layer: find combinations of hidden unit features that are more common than random hidden unit features
Third layer: find combinations of combinations of combinations of …
Advantage: pre-training initializes the parameters in a region where the nearby local optima overfit the data less
Fine Tuning
Once all layers are pre-trained
-> add the output layer
-> train the whole network using supervised learning
Supervised learning is performed as in a regular feed-forward network
-> forward/back propagation and update
We call this last phase fine-tuning
-> all parameters are tuned for the supervised task at hand
Figure 10. Multilayer FFNN
Q. When we run fine-tuning, if we perform a very large number of update iterations before computing the error rate, could the vanishing gradient problem occur again?
Pseudo code
For l = 1 to L (L = number of layers - 1):    (pre-training)
-> build an unsupervised training set from the previous layer's outputs (with $h^{(0)}(x) = x$)
-> train a “greedy module” (RBM, autoencoder) on that set
-> use the hidden layer weights and biases of the greedy module to initialize the deep network parameters $W^{(l)}, b^{(l)}$
Initialize $W^{(L+1)}, b^{(L+1)}$ randomly (as usual)    (fine-tuning)
Train the whole neural network using (supervised) stochastic gradient descent (with backprop)
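A minimal NumPy sketch of this pseudocode, assuming sigmoid autoencoders as the greedy modules; the toy data, layer sizes, learning rates and epoch counts are illustrative assumptions, not values from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def train_autoencoder(H, n_hidden, lr=0.1, epochs=50):
    """One greedy module: learn to reconstruct H through a sigmoid hidden layer."""
    n_in = H.shape[1]
    W,  b  = rng.normal(0, 0.1, (n_in, n_hidden)), np.zeros(n_hidden)
    W2, b2 = rng.normal(0, 0.1, (n_hidden, n_in)), np.zeros(n_in)
    for _ in range(epochs):
        h = sigmoid(H @ W + b)                      # encode
        r = sigmoid(h @ W2 + b2)                    # decode (reconstruction)
        d_r = (r - H) * r * (1 - r)                 # squared-error gradient at the output
        d_h = (d_r @ W2.T) * h * (1 - h)            # backprop into the hidden layer
        W2 -= lr * h.T @ d_r / len(H);  b2 -= lr * d_r.mean(0)
        W  -= lr * H.T @ d_h / len(H);  b  -= lr * d_h.mean(0)
    return W, b                                     # keep only the encoder parameters

# ---- pre-training: for l = 1..L, train a greedy module on the previous layer's features
X = rng.random((256, 20))                           # toy unlabeled inputs, h^(0)(x) = x
layer_sizes, params, H = [15, 10], [], X
for n_hidden in layer_sizes:
    W, b = train_autoencoder(H, n_hidden)           # greedy module for layer l
    params.append((W, b))                           # initializes W^(l), b^(l)
    H = sigmoid(H @ W + b)                          # fixed features for the next module

# ---- fine-tuning: add a random output layer, then train the whole network with backprop
y = (X.sum(axis=1) > X.sum(axis=1).mean()).astype(float)   # toy binary labels
W_out, b_out, lr = rng.normal(0, 0.1, (layer_sizes[-1], 1)), np.zeros(1), 0.1
for _ in range(100):
    acts = [X]
    for W, b in params:                             # forward pass through the pre-trained stack
        acts.append(sigmoid(acts[-1] @ W + b))
    p = sigmoid(acts[-1] @ W_out + b_out)
    delta = (p - y[:, None]) * p * (1 - p)          # squared-error gradient at the output
    delta_hidden = (delta @ W_out.T) * acts[-1] * (1 - acts[-1])
    W_out -= lr * acts[-1].T @ delta / len(X);  b_out -= lr * delta.mean(0)
    delta = delta_hidden
    for l in range(len(params) - 1, -1, -1):        # all parameters are tuned
        W, b = params[l]
        params[l] = (W - lr * acts[l].T @ delta / len(X), b - lr * delta.mean(0))
        if l > 0:
            delta = (delta @ W.T) * acts[l] * (1 - acts[l])
```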
Kinds of stacked unsupervised learning methods
Stacked restricted Boltzmann machines:
‣ Hinton, Osindero and Teh suggested this procedure with RBMs
- A Fast Learning Algorithm for Deep Belief Nets. Hinton, Osindero and Teh, 2006.
- To Recognize Shapes, First Learn to Generate Images. Hinton, 2006.
Stacked autoencoders:
‣ Bengio, Lamblin, Popovici and Larochelle studied and generalized the procedure to autoencoders
- Greedy Layer-Wise Training of Deep Networks. Bengio, Lamblin, Popovici and Larochelle, 2007.
‣ Ranzato, Poultney, Chopra and LeCun also generalized it to sparse autoencoders
- Efficient Learning of Sparse Representations with an Energy-Based Model. Ranzato, Poultney, Chopra and LeCun, 2007.
Stacked denoising autoencoders:
‣ proposed by Vincent, Larochelle, Bengio and Manzagol
- Extracting and Composing Robust Features with Denoising Autoencoders. Vincent, Larochelle, Bengio and Manzagol, 2008.
And more:
‣ stacked semi-supervised embeddings
- Deep Learning via Semi-Supervised Embedding. Weston, Ratle and Collobert, 2008.
‣ stacked kernel PCA
- Kernel Methods for Deep Learning. Cho and Saul, 2009.
‣ stacked independent subspace analysis
- Learning Hierarchical Invariant Spatio-Temporal Features for Action Recognition with Independent Subspace Analysis. Le, Zou, Yeung and Ng, 2011.
Impact of initialization
Figure 11. Impact of initialization: classification error versus the number of hidden layers, without pre-training, with pre-training using a regular autoencoder, and with pre-training using a restricted Boltzmann machine, on the original MNIST problem and on a problem harder than the original MNIST.
An Empirical Evaluation of Deep Architectures on Problems with Many Factors of Variation. Larochelle, Erhan, Courville, 2007.
Impact of initialization
Why Does Unsupervised Pre-training Help Deep Learning? Erhan, Bengio, Courville, Manzagol, Vincent and Bengio, 2010. http://www.jmlr.org/papers/volume11/erhan10a/erhan10a.pdf
1. When the number of hidden layers is increased for classification:
(1) Without pre-training -> the test error rate increases
(2) With pre-training -> the test error rate decreases
This tendency is clearer when the hidden layers are comparatively small rather than very large.
2. When the number of hidden units in the 1st, 2nd and 3rd hidden layers is increased:
(1) Without pre-training -> at some point the error rate starts to increase
(2) With pre-training -> it keeps decreasing
3. Beyond a certain point, the pre-training method is better than training without pre-training
All results are based on the best hyperparameters
Choice of hidden layer size
Results on MNIST and Rotation MNIST.
Summary: except in some cases, the test error rate decreases as the total number of hidden units increases.
Performance on different datasets
Extracting and Composing Robust Features with Denoising Autoencoders, Vincent, Larochelle, Bengio and Manzagol, 2008.
In numerous instances (but not always):
1. The stacked denoising autoencoder achieves the best result
2. A neural network with pre-training is better than an SVM with a radial basis function kernel
Dropout
Dropout: A Simple Way to Prevent Neural Networks from Overfitting https://www.cs.toronto.edu/~hinton/absps/JMLRdropout.pdf
Idea: ‘cripple’ the neural network by removing hidden units stochastically
-> each hidden unit is set to 0 with probability 0.5
-> hidden units cannot co-adapt to other units
A different dropout probability could be used, but 0.5 usually works well
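A minimal sketch of the dropout forward pass (the hidden activations below are assumed toy values):

```python
import numpy as np

rng = np.random.default_rng(0)
h = np.array([0.33, 0.52, 0.47, 0.61])              # hidden-unit activations (assumed values)
mask = (rng.random(h.shape) < 0.5).astype(float)    # keep each unit with probability 0.5
h_dropped = h * mask                                # dropped units are set to 0
print(mask, h_dropped)                              # surviving units cannot rely on the others
```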
Dropout back-propagation
Dropout back-propagation is simply the general back-propagation gradient descent method with a mask-vector constraint added: masked hidden units contribute neither to the forward pass nor to the gradient.
General back propagation simple example
[Network diagram: input units 1, 2, 3; hidden units 4, 5; output unit 6]
x1  x2  x3  w14  w15   w24  w25  w34   w35  w46   w56   θ4    θ5   θ6
1   0   1   0.2  -0.3  0.4  0.1  -0.5  0.2  -0.3  -0.2  -0.4  0.2  0.1
General back propagation simple example
1. Learning rate $l = 0.9$
2. $\Delta\theta_j = l \cdot Err_j$, $\theta_j = \theta_j + \Delta\theta_j$
3. $\Delta w_{ij} = l \cdot Err_j \cdot O_i$, $w_{ij} = w_{ij} + \Delta w_{ij}$
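The sketch below runs one forward and backward pass of the rules above on the numbers from the table; the target output t = 1 for unit 6 is an assumption, since the slide does not state it:

```python
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
l, t = 0.9, 1.0                                    # learning rate; target (assumed) for unit 6
x1, x2, x3 = 1.0, 0.0, 1.0
w14, w15, w24, w25, w34, w35 = 0.2, -0.3, 0.4, 0.1, -0.5, 0.2
w46, w56, th4, th5, th6 = -0.3, -0.2, -0.4, 0.2, 0.1

# forward pass
o4 = sigmoid(w14*x1 + w24*x2 + w34*x3 + th4)       # ~0.332
o5 = sigmoid(w15*x1 + w25*x2 + w35*x3 + th5)       # ~0.525
o6 = sigmoid(w46*o4 + w56*o5 + th6)                # ~0.474

# backward pass: Err_j uses the sigmoid derivative o_j(1 - o_j)
err6 = o6 * (1 - o6) * (t - o6)                    # ~0.131
err5 = o5 * (1 - o5) * err6 * w56                  # ~-0.007
err4 = o4 * (1 - o4) * err6 * w46                  # ~-0.009

# updates: delta w_ij = l * Err_j * O_i, delta theta_j = l * Err_j
w46 += l * err6 * o4;  w56 += l * err6 * o5;  th6 += l * err6
w14 += l * err4 * x1;  w24 += l * err4 * x2;  w34 += l * err4 * x3;  th4 += l * err4
w15 += l * err5 * x1;  w25 += l * err5 * x2;  w35 += l * err5 * x3;  th5 += l * err5
print(round(w46, 3), round(w56, 3), round(th6, 3)) # -0.261 -0.138 0.218
```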
Dropout back propagation simple example
[Network diagram: input units 1, 2, 3; hidden units 4, 5 (unit 5 dropped); output unit 6]
x1  x2  x3  w14  w24  w34   w46   w56   θ4    θ5   θ6
1   0   1   0.2  0.4  -0.5  -0.3  -0.2  -0.4  0.2  0.1
Dropout back propagation simple example
1. Learning rate $l = 0.9$
2. $\Delta\theta_j = l \cdot Err_j$, $\theta_j = \theta_j + \Delta\theta_j$
3. $\Delta w_{ij} = l \cdot Err_j \cdot O_i$, $w_{ij} = w_{ij} + \Delta w_{ij}$
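The same update rules with the dropout mask of this example: hidden unit 5 is dropped, so its output and error are forced to 0 and only the remaining weights from the table above receive effective updates (the target t = 1 is again an assumption):

```python
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
l, t = 0.9, 1.0                                    # learning rate; target (assumed) for unit 6
x1, x2, x3 = 1.0, 0.0, 1.0
w14, w24, w34, w46, w56 = 0.2, 0.4, -0.5, -0.3, -0.2
th4, th5, th6 = -0.4, 0.2, 0.1

# forward pass with the mask applied: unit 5 is dropped, so its output is 0
o4 = sigmoid(w14*x1 + w24*x2 + w34*x3 + th4)
o5 = 0.0
o6 = sigmoid(w46*o4 + w56*o5 + th6)

# backward pass: no error flows through the dropped unit
err6 = o6 * (1 - o6) * (t - o6)
err4 = o4 * (1 - o4) * err6 * w46
err5 = 0.0

# updates: w56 and theta_5 are effectively unchanged because o5 = 0 and err5 = 0
w46 += l * err6 * o4;  w56 += l * err6 * o5;  th6 += l * err6
w14 += l * err4 * x1;  w24 += l * err4 * x2;  w34 += l * err4 * x3;  th4 += l * err4
th5 += l * err5
```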
Conclusion, test-time classification
Regardless of whether the general back-propagation or the dropout back-propagation method is used, the weights selected by the dropout method receive the same values in the first weight update.
In conclusion, the dropout method has diversity: the result depends on which hidden units are selected at each update.
How to combine results?
At test time, we replace the masks by their expectation
-> this is simply the constant vector 0.5 if the dropout probability is 0.5
-> for a single hidden layer, one can show this is equivalent to taking the geometric average of all neural networks with all possible binary masks
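A minimal sketch of the test-time rule (the activations below are assumed toy values): instead of sampling a mask, multiply the hidden activations by the expected mask, i.e. the dropout probability 0.5:

```python
import numpy as np

h = np.array([0.33, 0.52, 0.47, 0.61])   # hidden activations at test time (assumed values)
p_keep = 0.5                             # the dropout probability used during training
h_test = h * p_keep                      # expected value of (h * Bernoulli(0.5) mask)
print(h_test)
```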
Thank you