What is the Best Multi-Stage Architecture for Object Recognition?
Ruiwen Wu
[1] Jarrett, Kevin, et al. "What is the best multi-stage architecture for object recognition?" Computer Vision, 2009 IEEE 12th International Conference on. IEEE, 2009. (Cited by 396 as of 2014-11-12.)
Usual architecture of neural networks
Each component of the neural network
The concept of unsupervised learning
Experiments
Contribution of this paper
[Bar chart: Papers about Neural Networks per year — 2010: 64,000; 2011: 75,000; 2012: 78,000; 2013: 80,000; 2014: 78,900]
[Bar chart: Papers about Unsupervised Pre-Training per year — 2010: 110; 2011: 160; 2012: 256; 2013: 252; 2014: 552]
[Bar chart: Citations of [1] per year — 2009: 4; 2010: 32; 2011: 62; 2012: 64; 2013: 115; 2014: 113]
Deep Learning Methods
Deep learning methods aim at learning feature hierarchies, with features at higher levels of the hierarchy formed by composing lower-level features [2].
Neural networks with many hidden layers
Graphical models with many levels of hidden variables
Other methods
[2] Erhan, D., Bengio, Y., Courville, A., Manzagol, P. A., Vincent, P., & Bengio, S. Why Does Unsupervised Pre-training Help Deep Discriminant Learning?
Usual architecture of neural networks
Non-linear operation: quantization, winner-take-all, sparsification, normalization, S-function
Pooling operation: max, average, histogramming operator
Classifier: Neural Network (NN), k-Nearest Neighbor (KNN), Support Vector Machine (SVM), Logistic Regression (LR)
This paper addresses three questions:
How do the non-linearities that follow the filter banks influence recognition accuracy?
Does learning the filter banks in an unsupervised or supervised manner improve performance over random or hardwired filters?
Is there any advantage to using an architecture with two stages of feature extraction, rather than one?
Questions to address
To address these three questions, they experimented with various combinations of architectures:
One stage or two stages of feature extraction
Different types of non-linearities
Different types of filters
Different filter learning methods (random, unsupervised, and supervised)
Test datasets: Caltech-101, NORB, and MNIST
Experimental Architecture
Filter Bank Layer (F_CSG)
Local Contrast Normalization Layer (N)
Pooling and Subsampling Layer (P_A or P_M)
Model Architecture
Filter Bank Layer (F_CSG)
The module computes:
$y_i = g_i \tanh\big(\sum_j k_{ij} * x_j\big)$
where $*$ is the convolution operator, $\tanh$ is the hyperbolic tangent non-linearity, $g_i$ is a trainable scalar coefficient, and $k_{ij}$ are the trainable kernels.
Output size: if each input map is n1 × n2 and each kernel is l1 × l2, then each output map y is (n1 − l1 + 1) × (n2 − l2 + 1).
The kernels here can be either trained with supervision or pre-trained without supervision.
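A minimal NumPy sketch of this module may help; the function name, the shapes, and the random example are my own illustration (assuming SciPy's convolve2d for the 'valid' convolution), not code from the paper.

```python
import numpy as np
from scipy.signal import convolve2d

def filter_bank_layer(x, kernels, gains):
    """F_CSG sketch: y_i = g_i * tanh(sum_j k_ij * x_j).

    x: input maps, shape (n_in, n1, n2)
    kernels: shape (n_out, n_in, l1, l2); gains: shape (n_out,)
    Returns y with shape (n_out, n1 - l1 + 1, n2 - l2 + 1).
    """
    n_out = kernels.shape[0]
    y = []
    for i in range(n_out):
        # Sum the 'valid' convolutions over all input maps, then squash
        # with tanh and scale by the trainable per-map gain g_i.
        s = sum(convolve2d(x[j], kernels[i, j], mode="valid")
                for j in range(x.shape[0]))
        y.append(gains[i] * np.tanh(s))
    return np.stack(y)

# Example: 3 input maps of 32x32 with 5x5 kernels -> 8 maps of 28x28.
x = np.random.randn(3, 32, 32)
k = 0.1 * np.random.randn(8, 3, 5, 5)
print(filter_bank_layer(x, k, np.ones(8)).shape)  # (8, 28, 28)
```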
Local Contrast Normalization Layer (N)
Subtractive normalization: $v_{ijk} = x_{ijk} - \sum_{ipq} w_{pq} \, x_{i,j+p,k+q}$
Divisive normalization: $y_{ijk} = v_{ijk} / \max(c, \sigma_{jk})$, with $\sigma_{jk} = \big( \sum_{ipq} w_{pq} \, v_{i,j+p,k+q}^{2} \big)^{1/2}$
Here $c$ is the mean of $\sigma_{jk}$ over the whole map, and $w_{pq}$ is a Gaussian weighting window normalized so that $\sum_{ipq} w_{pq} = 1$.
(I do not quite understand this part.)
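The two steps can be sketched in NumPy as follows; the 9×9 window size and its sigma are my assumptions (the paper uses a Gaussian window, but these exact values are illustrative), and borders are zero-padded here.

```python
import numpy as np
from scipy.signal import convolve2d

def gaussian_window(size=9, sigma=2.0):
    ax = np.arange(size) - size // 2
    g = np.exp(-(ax[:, None] ** 2 + ax[None, :] ** 2) / (2 * sigma ** 2))
    return g / g.sum()  # weights sum to 1 within one map

def local_contrast_norm(x, w=None):
    """x: feature maps, shape (n_maps, h, w); returns maps of the same shape."""
    if w is None:
        w = gaussian_window()
    n = x.shape[0]
    # Subtractive step: remove the Gaussian-weighted local mean, pooled
    # over ALL feature maps (dividing by n makes the weights sum to 1
    # across maps, matching the normalization of w_pq above).
    mean = sum(convolve2d(x[i], w, mode="same") for i in range(n)) / n
    v = x - mean[None, :, :]
    # Divisive step: divide by the local standard deviation sigma_jk,
    # floored at its mean c so low-contrast regions are not blown up.
    sigma = np.sqrt(sum(convolve2d(v[i] ** 2, w, mode="same")
                        for i in range(n)) / n)
    c = sigma.mean()
    return v / np.maximum(c, sigma)
```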
Local Contrast Normalization Layer (N)
The result of this module:
[Figure: an example image shown before and after local contrast normalization]
The module appears to perform edge extraction.
Pooling and Subsampling Layer (P_A or P_M)
For each small neighborhood:
$y_{ijk} = \sum_{pq} w_{pq} \, x_{i,j+p,k+q}$
where $w_{pq}$ is a uniform weighting window (average pooling, P_A) or a max operator (max pooling, P_M). Each output feature map is then subsampled spatially by a factor S horizontally and vertically.
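A small NumPy sketch of both pooling variants; using non-overlapping S×S windows (so the pooling window and the subsampling factor coincide) is my simplification.

```python
import numpy as np

def pool(x, S=2, mode="avg"):
    """P_A / P_M sketch on one feature map x of shape (h, w);
    h and w are assumed divisible by S."""
    h, w = x.shape
    blocks = x.reshape(h // S, S, w // S, S)
    if mode == "avg":
        return blocks.mean(axis=(1, 3))  # uniform weighting window (P_A)
    return blocks.max(axis=(1, 3))       # max weighting window (P_M)

x = np.arange(16.0).reshape(4, 4)
print(pool(x, 2, "avg"))  # 2x2 map of block averages
print(pool(x, 2, "max"))  # 2x2 map of block maxima
```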
Combine Modules
There are three types of architecture for this network (a composition sketch follows the list):
F_CSG – P_A
F_CSG – N – P_A
F_CSG – P_M
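As a rough illustration of how the modules compose, the sketch below chains the functions from the earlier sketches into one F_CSG – N – P_A stage (so those functions are assumed to be in scope); a two-stage network simply applies two such stages in sequence.

```python
import numpy as np

def stage(x, kernels, gains, S=2, normalize=True):
    """One feature-extraction stage: F_CSG -> (optional N) -> P_A."""
    y = filter_bank_layer(x, kernels, gains)          # F_CSG
    if normalize:
        y = local_contrast_norm(y)                    # N
    return np.stack([pool(m, S, "avg") for m in y])   # P_A

# One stage: 3x32x32 -> F_CSG (8 maps, 5x5) -> N -> P_A (S=2) -> 8x14x14.
x = np.random.randn(3, 32, 32)
out = stage(x, 0.1 * np.random.randn(8, 3, 5, 5), np.ones(8))
print(out.shape)  # (8, 14, 14)
```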
Training Protocol
Random Features and Supervised Classifier – R and RR
Unsupervised Features, Supervised Classifier – U and UU
Random Features, Global Supervised Refinement – R+ and R+R+
Unsupervised Features, Global Supervised Refinement – U+ and U+U+
Unsupervised Training of Filter Banks
For a given input X and a matrix W whose columns are the dictionary elements, the feature vector $Z^*$ is obtained by minimizing the following energy function:
$E_{OF}(X, Z, W) = \|X - WZ\|_2^2 + \lambda \|Z\|_1$
$Z^* = \arg\min_Z E_{OF}(X, Z, W)$
where $\lambda$ is a sparsity hyper-parameter.
For any input X, one needs to run a rather expensive optimization algorithm to find $Z^*$. To alleviate this problem, the Predictive Sparse Decomposition (PSD) method is introduced.
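To make the cost of this inference step concrete, here is a minimal ISTA-style sketch that iteratively minimizes $E_{OF}$ for a single input; the step size, iteration count, and names are my own illustrative choices, not the paper's optimizer.

```python
import numpy as np

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def sparse_code(X, W, lam=0.1, n_iter=200):
    """Minimize ||X - W Z||_2^2 + lam * ||Z||_1 over Z (ISTA)."""
    L = 2 * np.linalg.norm(W, 2) ** 2   # Lipschitz constant of the gradient
    Z = np.zeros(W.shape[1])
    for _ in range(n_iter):
        grad = 2 * W.T @ (W @ Z - X)    # gradient of the quadratic term
        Z = soft_threshold(Z - grad / L, lam / L)
    return Z

# Many such iterations are needed for EVERY input, which is what makes
# direct sparse coding expensive at recognition time.
```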
Predictive Sparse Decomposition(PSD)[3]
[3] Kavukcuoglu, Koray, Marc'Aurelio Ranzato, and Yann LeCun. "Fast inference in sparse coding algorithms with applications to object recognition." arXiv preprint arXiv:1010.3467 (2010).(cited by 94)
$E_{PSD}(X, Z, W, K) = \|X - WZ\|_2^2 + \lambda \|Z\|_1 + \|Z - C(X, K)\|_2^2$
$Z^* = \arg\min_Z E_{PSD}(X, Z, W, K)$
$K = \{G, S, D\}, \quad C(X, K) = C(X; G, S, D) = G \tanh(SX + D)$
where $S \in \mathbb{R}^{m \times n}$ is a filter matrix, $D \in \mathbb{R}^{m}$ is a vector of biases, and $G$ is a diagonal matrix of gains.
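The feed-forward predictor itself is a single cheap operation. Below is a sketch; storing the diagonal gain matrix G as a vector is my simplification.

```python
import numpy as np

def psd_predictor(X, G, S, D):
    """C(X; G, S, D) = G * tanh(S X + D).

    X in R^n, S in R^{m x n}, D in R^m, G: per-component gains in R^m.
    Returns a fast approximation of the optimal sparse code Z*.
    """
    return G * np.tanh(S @ X + D)

# Training alternates between minimizing E_PSD over Z and updating
# (W, G, S, D); at test time only this one forward pass is needed,
# avoiding the iterative optimization entirely.
```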
Results
Why does Unsupervised Pre-training Help Deep Discriminant Learning?[2]
[Graph from [2]; its sources are listed in the reference slide at the end.]
Non-convex function
In deep learning, the objective function is usually a highly non-convex function of the parameters, so the model parameter space typically contains many local minima.
Supervised learning uses a fixed or random point as the initialization, so in many situations it converges to a poor local minimum.
[Diagram: loss landscape with local minima, contrasting random initialization with unsupervised pre-training]
Reason
There are a few reasonable hypotheses for why pre-training might work.
One possibility is that unsupervised pre-training acts as a kind of regularizer, putting the parameter values in the appropriate range for discriminant training.
Another possibility is that pre-training initializes the model to a point in parameter space that somehow renders the optimization process more effective, in the sense of achieving a lower minimum of the empirical cost function.
Conclusion
This work helps with such understanding via extensive simulations, and puts forward and confirms a hypothesis explaining the mechanisms behind the effect of unsupervised pre-training on the final discriminant learning task.
Future work should clarify this hypothesis.
Understanding and improving deep architectures remains a challenge.
Reference
[1] Jarrett, Kevin, et al. "What is the best multi-stage architecture for object recognition?" Computer Vision, 2009 IEEE 12th International Conference on. IEEE, 2009. (Cited by 396 as of 2014-11-12.)
[2] Erhan, D., Bengio, Y., Courville, A., Manzagol, P. A., Vincent, P., & Bengio, S. Why Does Unsupervised Pre-training Help Deep Discriminant Learning?
[3] Hinton, G. E., Osindero, S., & Teh, Y. (2006). A fast learning algorithm for deep belief nets. Neural Computation, 18, 1527–1554.
[4] Zhu, L., Chen, Y., & Yuille, A. (2009). Unsupervised learning of probabilistic grammar-markov models for object categories. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31, 114–128.
[5] Weston, J., Ratle, F., & Collobert, R. (2008). Deep learning via semi-supervised embedding. Proceedings of the Twenty-fifth International Conference on Machine Learning (ICML'08), pp. 1168–1175. New York, NY, USA: ACM.
[6] LeCun, Yann, et al. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11), 2278–2324.
[7] Hinton, G. E., Osindero, S., & Teh, Y. (2006). A fast learning algorithm for deep belief nets. Neural Computation, 18, 1527–1554.
Thank You!