DEEP MACHINE LEARNING “A Shallow Introduction”
IAT 813, Instructor Steve DiPaola Guest Lecturer: Graeme McCaig
March 12, 2015
• Deep Learning (DL) is a complex topic
• Authors often employ heavy statistics, machine learning terminology
• This lecture: overview the field and de-mystify key terms, concepts
• I hope to save you time/struggle getting started if you pursue DL in your work
• Topics not covered much: Recurrent Nets, Autoencoders
Hazards of the Deep
OVERVIEW
1. Deep learning – believe the hype?
  • DL in the news
  • “Depth” definition and benefits
2. What has changed? Is this just NNets?
  • DL recent history timeline
3. Types of Deep Learning network and training
  • Restricted Boltzmann Machines & Deep Belief Networks
  • Convolutional Networks
4. Practical advice
  • Useful libraries; GPU
  • Further reading
TERMS & CONCEPTS:
• Minibatch
• Probabilistic/Stochastic
• Undirected, Energy-based
• Pre-training, Fine-tuning
• Convolution
• Dropout, ReLU
DEEP LEARNING IN THE NEWS
• Visual Object ___
  • Recognition
  • Detection
  • Captioning
• Object recognition task
• Recent State-of-Art results
• He et al. (2015) Microsoft Research (arXiv preprint)
http://arxiv-web3.library.cornell.edu/pdf/1502.01852v1.pdf
http://googleresearch.blogspot.ca/2013/06/improving-photo-search-step-across.html
http://cs.stanford.edu/people/karpathy/deepimagesent/ Andrej Karpathy, Li Fei-Fei (2014) Stanford
Vinyals et al. (2014) Google Research Post http://googleresearch.blogspot.ca/2014/11/a-picture-is-worth-thousand-coherent.html
Andrej Karpathy, Li Fei-Fei (2014) Stanford http://cs.stanford.edu/people/karpathy/deepimagesent/
GoogLeNet Detection Model (2014) http://googleresearch.blogspot.ca/2014/09/building-deeper-understanding-of-images.html
DEEP LEARNING IN THE NEWS
• Applications
  • Self-driving cars
  • Biomedical imaging
  • Predicting DNA disease mapping
  • Drug discovery / virtual screening
  • Smartphone Apps
NVIDIA Drive PX http://www.nvidia.ca/object/drive-px.html
Cireşan et al. (2013). Mitosis detection in breast cancer histology images with deep neural networks. In Medical Image Computing and Computer-Assisted Intervention.
Scyfer (U Amsterdam spinoff) http://scyfer.nl/case-3d-mri-brain-scan-analysis/
DEEP LEARNING IN THE NEWS
• Audio applications
  • Music recommendation
  • Speech recognition
Recommending music on Spotify with Deep Learning – Sander Dieleman Blog Post (2014) http://benanne.github.io/2014/08/05/spotify-cnns.html
Baidu Research (2015) http://usa.baidu.com/deep-speech-accurate-speech-recognition-with-gpu-accelerated-deep-learning/
DEEP LEARNING IN THE NEWS
• Game-playing AI
  • Deep Reinforcement Learning
Two-dimensional t-SNE embedding of the representations in the last hidden layer assigned by DQN to game states experienced while playing Space Invaders.
Learning to play Atari 2600 games with Deep Reinforcement Learning - Mnih et al. (2015) Nature doi:10.1038/nature14236
A visualization of the learned value function on the game Breakout.
DEEP LEARNING IN THE NEWS
• GPU Enabling Technology
  • Mass-market hardware
  • CUDA libraries
http://devblogs.nvidia.com/parallelforall/accelerate-machine-learning-cudnn-deep-neural-network-library/
http://blogs.nvidia.com/blog/2014/03/26/gpus-neural-cheap/
http://nl.hardware.info/reviews/2641/nvidia-geforce-gtx-680-quad-sli-review-english-version
“DEPTH” DEFINITION AND BENEFITS
WHAT IS “DEPTH”
• It's deep if it has more than one stage of non-linear feature transformation (LeCun & Ranzato 2013)
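To make the definition concrete, here is a minimal sketch in plain Python of a net with one versus two stages of non-linear feature transformation. The weights, layer sizes and the choice of tanh as the non-linearity are made up for illustration and are not taken from LeCun & Ranzato.

```python
import math

def layer(inputs, weights, biases):
    """One stage of non-linear feature transformation: affine map, then tanh."""
    return [
        math.tanh(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]

def forward(inputs, stages):
    """Apply each (weights, biases) stage in turn; depth = number of stages."""
    activations = inputs
    for weights, biases in stages:
        activations = layer(activations, weights, biases)
    return activations

# Hypothetical hand-picked weights (illustration only).
shallow = [([[0.5, -0.2]], [0.1])]                    # 2 inputs -> 1 output
deep = [
    ([[0.5, -0.2], [0.3, 0.8], [-0.6, 0.4]], [0.0, 0.1, -0.1]),  # 2 -> 3 features
    ([[1.0, -1.0, 0.5]], [0.0]),                                  # 3 -> 1 output
]

print("stages:", len(shallow), "output:", forward([1.0, 2.0], shallow))
print("stages:", len(deep), "output:", forward([1.0, 2.0], deep))
```

By the definition above, only the second net counts as "deep": its output is computed from features that are themselves non-linear functions of the input.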
Figures from Bengio (2009)
Deep Feedforward Neural Net
Slide from LeCun & Ranzato (2013)
Slide from LeCun & Ranzato (2013)
BENEFITS OF DEPTH
• Replaces feature engineering “by hand”
• More compact (fewer nodes than equivalent shallow net)
• Theoretical arguments suggest improved training, generalization (Bengio et al., various papers)
• Appears to be how the brain works
• …Because it works (now giving state-of-art results on many tasks)
Visualization of nearest-neighbors in top network layer code [Krizhevsky et al 2012]
Semantic class separation, visualized with t-SNE [Donahue et al 2014]
WHAT HAS CHANGED?
• Is Deep Learning anything different from previous Neural Nets research?
• In fact, both “yes” and “no”
• And trends have flip-flopped in the short period from 2006 - present
New Concepts
• Build a better representation via Unsupervised Learning
  • Can then transfer to Supervised tasks
  • Leverage massive amounts of unlabelled data
• Probabilistic, generative network types
  • Restricted Boltzmann Machine
• New training algorithms
  • Greedy, layer-wise pre-training
  • Stochastic sampling-based estimation
More of the Same; Minor Tweaks
• More computing power
  • GPU
  • Cloud, cluster
• Big data
  • Crowd-sourced labels
• “Good old” feed-forward multi-layer NN’s
  • Supervised learning
  • Backpropagation (SGD)
• New (and old*) tricks
  • Convolution*, dropout, rectified linear units…
OR
• Hinton’s perspective on Deep Learning circa 2006-2010
• Geoff Hinton “The Next Generation of Neural Networks”, for GoogleTechTalks, 2007 https://www.youtube.com/watch?v=AyzOUbkUf3M
• E.g. 4:20
DEEP LEARNING HISTORY
Excerpts from Schmidhuber, J. (2015). Deep learning in neural networks: An overview. Neural Networks, 61, 85-117.
• 1969: book (Minsky & Papert, 1969) on the limitations of simple linear perceptrons with a single layer discouraged some researchers from further studying NNs.
• 1979: the Neocognitron (Fukushima, 1979, 1980, 2013a) was perhaps the first artificial NN that deserved the attribute deep, and the first to incorporate […] neurophysiological insights
• 1986: a paper significantly contributed to the popularization of BP for NNs (Rumelhart, Hinton, & Williams, 1986), experimentally demonstrating the emergence of useful internal representations
• 1989: backpropagation (Section 5.5) was applied (LeCun et al., 1989; LeCun, Boser, et al., 1990; LeCun, Bottou, Bengio, & Haffner, 1998) to Neocognitron-like, weight-sharing, convolutional neural layers with adaptive connections.
• 1991: by the late 1980s, experiments had indicated that traditional deep feedforward or recurrent networks are hard to train by backpropagation (BP). Hochreiter’s (1991, thesis) work formally identified a major reason: Typical deep NNs suffer from the now famous problem of vanishing or exploding gradients.
• ~1995-2005: In the decade around 2000, many practical and commercial pattern recognition applications were dominated by non-neural machine learning methods such as Support Vector Machines (SVMs) (Schölkopf et al., 1998; Vapnik, 1995).
• 2006: While learning networks with numerous non-linear layers date back at least to 1965 and explicit DL research results have been published at least since 1991, the expression Deep Learning was actually coined around 2006, when unsupervised pre-training of deep FNNs helped to accelerate subsequent SL through BP (Hinton, Osindero, & Teh, 2006; Hinton & Salakhutdinov, 2006). […] A DBN fine-tuned by BP achieved 1.2% error rate (Hinton & Salakhutdinov, 2006) on the MNIST handwritten digits (Sections […]). This result helped to arouse interest in DBNs. DBNs also achieved good results on phoneme recognition, with an error rate of 26.7% on the TIMIT core test set (Mohamed & Hinton, 2010).
• 2012: an ensemble of (supervised) GPU-based Max-Pooling Convolutional Neural Nets achieved best results on the ImageNet classification benchmark (Krizhevsky, Sutskever, & Hinton, 2012), which is popular in the computer vision community.
• Also in 2012: the biggest NN so far (10^9 free parameters) was trained in unsupervised mode on unlabeled data (Le et al., 2012), then applied to ImageNet. The codes across its top layer were used to train a simple supervised classifier, which achieved best results so far on 20,000 classes. Instead of relying on efficient GPU programming, this was done by brute force on 1000 standard machines with 16,000 cores.
• ~2015 (present day): Most competition-winning or benchmark record-setting Deep Learners actually use one of two supervised techniques: (a) recurrent Long Short-Term Memory (LSTM) (1997) trained by Connectionist Temporal Classification (CTC) (2006), or (b) feedforward GPU-based Max-Pooling Convolutional Neural Nets (2011) based on CNNs (1979) plus MP (1992), trained through Backpropagation (1989–2007).
• Y. LeCun in IEEE Spectrum interview: “A lot of us involved in the resurgence of Deep Learning in the mid-2000s, including Geoff Hinton, Yoshua Bengio, and myself—the so-called “Deep Learning conspiracy”—as well as Andrew Ng, started with the idea of using unsupervised learning more than supervised learning. Unsupervised learning could help “pre-train” very deep networks. We had quite a bit of success with this, but in the end, what ended up actually working in practice was good old supervised learning, but combined with convolutional nets, which we had over 20 years ago. But from a research point of view, what we’ve been interested in is how to do unsupervised learning properly. We now have unsupervised techniques that actually work. The problem is that you can beat them by just collecting more data, and then using supervised learning. This is why in industry, the applications of Deep Learning are currently all supervised. But it won’t be that way in the future.”
http://spectrum.ieee.org/automaton/robotics/artificial-intelligence/facebook-ai-director-yann-lecun-on-deep-learning
[Figure: Shallow vs. Deep, adapted from LeCun & Ranzato (2013)]
RESTRICTED BOLTZMANN MACHINE (RBM)
• Unsupervised, Probabilistic, Energy-based model
• Shallow building block for Deep Belief Network (DBN) and Deep Boltzmann Machine (DBM)
[Figure: Restricted Boltzmann Machine with a Hidden Layer and a Visible Data Layer (or “v”)]
OBJECTIVE FUNCTION FOR UNSUPERVISED LEARNING
• For supervised learning, minimize Training Error {difference between model’s P(y|x) and true (y,x) data}
• Equivalent for Unsupervised, Generative model? Maximize Likelihood of Train/Test sets under the model, i.e. model’s P(x) where x is training data
[Figure: Training Distribution vs. Learned Model, from http://imonad.com/rbm/restricted-boltzmann-machine/]
• Energy of the network:
• Likelihood of a datapoint P(v) is hard to find directly: probability is only known relative to all possible states of the net!! (the normalizing sum over all states is called the Partition Function Z)
• Block Gibbs Sampling: a “back and forth” technique
  • Use to find a hidden-layer “representation” for known visible vector (inference)
  • Use to generate a sample from the model’s probability distribution
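For reference, the energy, the partition function and the two conditional distributions used by block Gibbs sampling are usually written as follows for a binary RBM. The symbols W (weights), a (visible biases), b (hidden biases) and σ (logistic sigmoid) are notation assumed here, not taken from the slides.

```latex
\begin{align}
E(v, h) &= -a^\top v - b^\top h - v^\top W h \\
P(v, h) &= \frac{e^{-E(v, h)}}{Z}, \qquad Z = \sum_{v', h'} e^{-E(v', h')} \\
P(v) &= \frac{1}{Z} \sum_{h} e^{-E(v, h)} \\
P(h_j = 1 \mid v) &= \sigma\Big(b_j + \sum_i v_i W_{ij}\Big), \qquad
P(v_i = 1 \mid h) = \sigma\Big(a_i + \sum_j W_{ij} h_j\Big)
\end{align}
```

Z sums over every joint configuration of visible and hidden units, which is why P(v) is intractable to compute exactly for nets of realistic size. The two conditionals factorize over units, so a whole layer can be sampled at once given the other layer: that is the "back and forth" of block Gibbs sampling.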
RESTRICTED BOLTZMANN MACHINE (RBM)
• Binary, probabilistic neurons (on / off)
• Propagation of activation:
Contrastive Divergence learning
• Uses 1 or a few passes of Gibbs Sampling
• Updates done on Minibatches (e.g. 100 to 1k input vectors at once)
  • Good for convergence
  • Efficient on GPU (matrix multiply)
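The following is a minimal sketch of Contrastive Divergence with a single Gibbs pass (CD-1) on minibatches, written in plain Python with loops rather than GPU matrix multiplies. The toy data, layer sizes, learning rate and minibatch size of 4 are all assumptions chosen to keep the example small. It is not the Matlab code discussed later in the lecture.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sample(probs, rng):
    """Binary stochastic neurons: each unit turns on with its own probability."""
    return [1 if rng.random() < p else 0 for p in probs]

def hidden_probs(v, W, b):
    # P(h_j = 1 | v) = sigmoid(b_j + sum_i v_i * W[i][j])
    return [sigmoid(b[j] + sum(v[i] * W[i][j] for i in range(len(v))))
            for j in range(len(b))]

def visible_probs(h, W, a):
    # P(v_i = 1 | h) = sigmoid(a_i + sum_j W[i][j] * h_j)
    return [sigmoid(a[i] + sum(W[i][j] * h[j] for j in range(len(h))))
            for i in range(len(a))]

def cd1_update(batch, W, a, b, rng, lr=0.1):
    """One CD-1 step on a minibatch; updates W, a and b in place."""
    n_v, n_h = len(a), len(b)
    dW = [[0.0] * n_h for _ in range(n_v)]
    da = [0.0] * n_v
    db = [0.0] * n_h
    for v0 in batch:
        ph0 = hidden_probs(v0, W, b)               # "forth": data -> hidden
        h0 = sample(ph0, rng)
        v1 = sample(visible_probs(h0, W, a), rng)  # "back": hidden -> reconstruction
        ph1 = hidden_probs(v1, W, b)               # "forth" again
        for i in range(n_v):
            da[i] += v0[i] - v1[i]
            for j in range(n_h):
                # data-driven statistics minus reconstruction-driven statistics
                dW[i][j] += v0[i] * ph0[j] - v1[i] * ph1[j]
        for j in range(n_h):
            db[j] += ph0[j] - ph1[j]
    scale = lr / len(batch)                        # average over the minibatch
    for i in range(n_v):
        a[i] += scale * da[i]
        for j in range(n_h):
            W[i][j] += scale * dW[i][j]
    for j in range(n_h):
        b[j] += scale * db[j]

rng = random.Random(0)
n_visible, n_hidden = 6, 3
W = [[rng.gauss(0.0, 0.01) for _ in range(n_hidden)] for _ in range(n_visible)]
a = [0.0] * n_visible
b = [0.0] * n_hidden

# Toy data: two binary patterns, repeated (made up for illustration).
data = [[1, 1, 1, 0, 0, 0], [0, 0, 0, 1, 1, 1]] * 10
for epoch in range(200):
    rng.shuffle(data)
    for start in range(0, len(data), 4):           # minibatches of 4 vectors
        cd1_update(data[start:start + 4], W, a, b, rng)

print(hidden_probs([1, 1, 1, 0, 0, 0], W, b))
print(hidden_probs([0, 0, 0, 1, 1, 1], W, b))
```

In a GPU implementation the per-vector loops collapse into a few matrix products over the whole minibatch, which is where the efficiency noted above comes from.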
Stochastic Gradient Descent (pink) vs. Batch Gradient Descent (red) [http://www.holehouse.org/mlclass/17_Large_Scale_Machine_Learning.html]
http://journal.frontiersin.org/article/10.3389/fnins.2013.00272/full
Contrastive Divergence for DBNs (Bengio et al. 2009)
STACKING RBMS TO FORM A DEEP BELIEF NET (DBN)
• Now comes the “deep” part…
• Greedy, Layer-wise, Unsupervised learning
  • Hold low-layer weights constant and “stack” a new RBM on top of net, train that using Contrastive Divergence again
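The stacking procedure can be sketched as below, again in plain Python. The helper names (`train_rbm`, `train_dbn`), layer sizes and toy data are hypothetical. For brevity the RBM is trained one vector at a time with CD-1, and each upper RBM receives the lower layer's hidden probabilities (rather than binary samples) as its input, which is a simplification assumed here.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def up(v, W, b):
    """Hidden probabilities given a visible vector."""
    return [sigmoid(b[j] + sum(v[i] * W[i][j] for i in range(len(v))))
            for j in range(len(b))]

def down(h, W, a):
    """Visible probabilities given a hidden vector."""
    return [sigmoid(a[i] + sum(W[i][j] * h[j] for j in range(len(h))))
            for i in range(len(a))]

def train_rbm(data, n_hidden, rng, epochs=50, lr=0.1):
    """Train a single RBM with CD-1, one training vector per update."""
    n_visible = len(data[0])
    W = [[rng.gauss(0.0, 0.01) for _ in range(n_hidden)]
         for _ in range(n_visible)]
    a, b = [0.0] * n_visible, [0.0] * n_hidden
    for _ in range(epochs):
        for v0 in data:
            ph0 = up(v0, W, b)
            h0 = [1 if rng.random() < p else 0 for p in ph0]
            v1 = down(h0, W, a)        # reconstruction (probabilities)
            ph1 = up(v1, W, b)
            for i in range(n_visible):
                a[i] += lr * (v0[i] - v1[i])
                for j in range(n_hidden):
                    W[i][j] += lr * (v0[i] * ph0[j] - v1[i] * ph1[j])
            for j in range(n_hidden):
                b[j] += lr * (ph0[j] - ph1[j])
    return W, a, b

def train_dbn(data, layer_sizes, rng):
    """Greedy layer-wise training: each new RBM sees the frozen layers' output."""
    layers, inputs = [], data
    for n_hidden in layer_sizes:
        W, a, b = train_rbm(inputs, n_hidden, rng)
        layers.append((W, a, b))                  # held constant from here on
        inputs = [up(v, W, b) for v in inputs]    # propagate the data upward
    return layers

rng = random.Random(0)
data = [[1, 1, 1, 0, 0, 0], [0, 0, 0, 1, 1, 1]] * 5   # made-up toy patterns
dbn = train_dbn(data, [4, 2], rng)
print([(len(W), len(W[0])) for W, _, _ in dbn])
```

"Greedy" here means that each RBM is trained to completion before the next one is added, and nothing below it is revisited during that stage.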
http://www.iro.umontreal.ca/~lisa/twiki/bin/view.cgi/Public/DeepBeliefNetworks
Lower layers now operate like a straight feedforward (or feedback) net!
Greedy, layer-wise stacking for DBNs (Bengio et al. 2009)
• To use a DBN for classification, supply class label data along with bottom-up node data when training the top layer
• Another method is to add a logistic regression layer (or other classifier) on the top and train it from the top-layer representation of data
http://www.iro.umontreal.ca/~lisa/twiki/bin/view.cgi/Public/DeepVsShallowComparisonICML2007
• “Fine-tuning” techniques adjust the whole network’s parameters simultaneously
  • For supervised learning, can use Backpropagation!
  • Unsupervised learning algorithms also exist…
• “Mean field” algorithms propagate real-number probability values as activations instead of stochastically sampling binary values
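The difference between a mean-field pass and a stochastic pass through one layer can be sketched as follows. The weights and input vector are hypothetical values picked for illustration.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def hidden_probabilities(v, W, b):
    """Mean-field pass: the real-valued probabilities are the activations."""
    return [sigmoid(b[j] + sum(v[i] * W[i][j] for i in range(len(v))))
            for j in range(len(b))]

def hidden_sample(v, W, b, rng):
    """Stochastic pass: each unit is switched on with its probability."""
    return [1 if rng.random() < p else 0
            for p in hidden_probabilities(v, W, b)]

# Hypothetical weights for 3 visible and 2 hidden units.
W = [[0.8, -0.4], [0.2, 0.6], [-0.5, 0.3]]
b = [0.1, -0.2]
v = [1, 0, 1]

rng = random.Random(0)
mean_field = hidden_probabilities(v, W, b)
samples = [hidden_sample(v, W, b, rng) for _ in range(5000)]
averages = [sum(s[j] for s in samples) / len(samples) for j in range(len(b))]

print(mean_field)   # deterministic values strictly between 0 and 1
print(samples[0])   # one stochastic draw: every entry is 0 or 1
print(averages)     # averaging many draws approaches the mean-field values
```

The mean-field pass gives the same answer every time and is cheaper, at the cost of ignoring the sampling noise that the stochastic pass models.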
RBM/DBN - MISC. IMPORTANT CONCEPTS
DEEP BOLTZMANN MACHINES (DBM)
• Unlike DBN, the DBM retains true bidirectional connections at all layers, even once stacked
• Potentially better for generative use
• More complicated to train
• Slower to run
From Salakhutdinov & Hinton (2009 AISTATS)
NON-BINARY NODES FOR RBM
• Useful for e.g. image data
• Gaussian-Bernoulli nodes: handles real values at input layer, binary values at hidden layer
• Spike-and-Slab nodes (Courville et al. 2011, AISTATS)
(from Hinton 2012)
GENERATING SAMPLES FROM THE MODEL
Digit images generated from a DBN with digit labels clamped (per row) [Hinton et al 2006]
TESTING GENERATIVE (IMAGE) DL WITH DATA VISUALIZATION METHODS (MOSTLY QUALITATIVE)
• Display generated samples in paper; very common
• Look for novel re-combination of factors
• Generate image completions
Shape Boltzmann Machine [Eslami et al 2012]
HD-DBM [Salakhutdinov et al 2013]
TssRBM, TGaussRBM [Luo et al 2012]
(Courville et al 2013)
Spike-and-Slab RBM Samples
Nearest Pixel-wise Training Samples
IMPLEMENTING RBM, DBN IN CODE
• Matlab code example (Salakhutdinov DBM)
• In essence, a lot of “Back and Forth” propagation for sampling
• Handled well on a GPU: Matrix multiplication (whole mini-batch at a time)
Let’s look at Matlab code from http://www.utstat.toronto.edu/~rsalakhu/DBM.html ...
RBM Training (1/2)
RBM Training (2/2)
CONVOLUTIONAL NETWORKS (CNN)
• Most commonly for image processing (also audio, etc.)
• Receptive fields & Tied weights create “Feature Maps” similar to convolution kernel in image proc.
Tiled CNNs (Le et al. NIPS 2010)
http://www.deeplearning.net/tutorial/lenet.html
Technique for Visualizing CNN Features (Zeiler & Fergus, ECCV 2014)
CNN STRUCTURE
• local receptive fields - nodes which only connect to a limited subset of lower-layer nodes, based on topology
• shared weights - separate connections in the neural net which are constrained to have the same weight
• sub-sampling - combining spatially adjacent low-level features into one high-level feature
A convolution layer implements receptive fields and shared weights. It is divided into multiple “feature maps”, which can be visualized as 2d grids (“planes”). Each node in a given grid learns the same set of weights, but is connected to a different spatial patch (receptive field) in its input layer. This is conceptually equivalent to a filter kernel, or feature detector, being swept over each location of (i.e. convolved with) the input. A pooling layer performs subsampling by having planes of half the length and width of the lower layer, in which each node integrates inputs from a 2x2 receptive field.
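As a rough illustration of the two layer types just described, the sketch below sweeps one shared 2x2 kernel over a toy image to produce a feature map, then halves its length and width with a 2x2 pooling step. The image, the kernel and the use of max as the pooling function are assumptions for this example. LeCun et al. (1998) describe their own sub-sampling layer, which is not reproduced here.

```python
def convolve2d(image, kernel):
    """Slide one shared kernel over every location ('valid' positions only).

    Like most CNN code, this computes cross-correlation: the kernel is not
    flipped before it is applied.
    """
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(kernel[a][b] * image[r + a][c + b]
                 for a in range(kh) for b in range(kw))
             for c in range(out_w)]
            for r in range(out_h)]

def max_pool_2x2(fmap):
    """Sub-sampling: each output node integrates one 2x2 receptive field."""
    return [[max(fmap[r][c], fmap[r][c + 1],
                 fmap[r + 1][c], fmap[r + 1][c + 1])
             for c in range(0, len(fmap[0]) - 1, 2)]
            for r in range(0, len(fmap) - 1, 2)]

# 5x5 toy image: dark on the left, bright on the right.
image = [
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
]
# Hypothetical feature detector for dark-to-bright vertical edges.
kernel = [[-1, 1],
          [-1, 1]]

feature_map = convolve2d(image, kernel)   # 4x4 plane, same weights everywhere
pooled = max_pool_2x2(feature_map)        # 2x2 plane

print(feature_map)   # [[0, 2, 0, 0], [0, 2, 0, 0], [0, 2, 0, 0], [0, 2, 0, 0]]
print(pooled)        # [[2, 0], [2, 0]]
```

Every node of `feature_map` uses the same four weights but looks at a different patch, so the edge is detected wherever it occurs. After pooling, the response survives at lower spatial resolution.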
LeCun et al. (1998)
Alex Krizhevsky, Ilya Sutskever, Geoffrey E Hinton (2012) ImageNet Classification with Deep Convolutional Neural Networks, NIPS.
CONVOLUTIONAL NETS FOR IMAGE RECOGNITION
• “Supervised training using stochastic gradient descent and the backpropagation algorithm (just repeated application of the chain rule)” from Krizhevsky et al. (2012) LSVRC slides
• New tricks: Dropout, ReLUs, Data augmentation
Slide from LeCun & Ranzato (2013)
ImageNet classification examples from [Krizhevsky et al 2012]. “Each meaningful concept in WordNet, possibly described by multiple words or word phrases, is called a "synonym set" or "synset". There are more than 100,000 synsets in WordNet, majority of them are nouns (80,000+). In ImageNet, we aim to provide on average 1000 images to illustrate each synset. Images of each concept are quality-controlled and human-annotated.” [http://www.image-net.org/about-overview]
RECTIFIED LINEAR UNITS (RELU)
from Krizhevsky et al. (2012) LSVRC slides
DROPOUT
from Krizhevsky et al. (2012) LSVRC slides
• Reduces over-fitting due to co-adaptation of units
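Both tricks are small enough to sketch in a few lines. The version of dropout below is the "inverted" variant, which rescales the surviving units during training so that nothing changes at test time. That variant, the drop probability of 0.5 and the example values are assumptions for this sketch, not a reproduction of the Krizhevsky et al. implementation.

```python
import random

def relu(x):
    """Rectified linear unit: pass positive inputs, clamp negatives to zero."""
    return max(0.0, x)

def dropout(activations, rng, p_drop=0.5, training=True):
    """Inverted dropout: zero each unit with probability p_drop while training."""
    if not training:
        return list(activations)          # test time: use every unit unchanged
    keep = 1.0 - p_drop
    # Survivors are scaled by 1/keep so the expected activation is unchanged.
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

rng = random.Random(0)
pre_activations = [-2.0, -0.5, 0.0, 0.5, 2.0]
hidden = [relu(x) for x in pre_activations]

train_out = dropout(hidden, rng)                   # a random subset is zeroed
test_out = dropout(hidden, rng, training=False)    # identical to hidden

print(hidden)      # [0.0, 0.0, 0.0, 0.5, 2.0]
print(train_out)
print(test_out)
```

Because a different random subset of units is silenced on every training pass, no unit can rely on a specific partner being present, which is the co-adaptation effect noted above.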
DL LIBRARIES
• THEANO
  • Python
  • Includes RBM, DBN types; can experiment with custom algorithms
  • Helpful tutorials
  • Emphasis on automatic symbolic differentiation can be confusing
• PYLEARN2: built on Theano, offers some newer models (DBM, S3C), also complicated
• CUDA-CONVNET
  • Base library for high GPU performance
  • Feedforward (convolutional) only
  • C++
  • I have not tried it
  • Apparently can be wrapped inside Theano
• Matlab Code, e.g. http://www.cs.toronto.edu/~rsalakhu/DBM.html
  • See also https://github.com/rasmusbergpalm/DeepLearnToolbox (I have not tried it)
• CAFFE
  • Feedforward (convolutional) only
  • C++
  • Optional Python wrapper: http://nbviewer.ipython.org/github/BVLC/caffe/blob/master/examples/filter_visualization.ipynb
• NVIDIA CuDNN
  • Fast primitive operations (e.g. propagate activation thru sigmoid)
  • Incorporated into Caffe
• TORCH7
  • Apparently like Theano but w/ Lua
FURTHER READING
• My favorite comprehensive review papers are:
• Bengio, Y., Courville, A., & Vincent, P. (2013). Representation learning: A review and new perspectives. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 35(8), 1798–1828.
• Schmidhuber, J. (2015). Deep learning in neural networks: An overview. Neural Networks, 61, 85-117.
• Suggested readings at http://deeplearning.net/reading-list/
• Google Group on Deep Learning for current events
• Hinton, LeCun, Schmidhuber have done recent Reddit A.M.A.s
• Yoshua Bengio’s web site at U. Montreal is good reading
• FastML Blog can be interesting www.fastml.com
• ICML, NIPS (also AISTATS) conferences have many of the top papers
THE END – CLOSING THOUGHTS
• It’s fun to observe from the midst of a purported “revolution”
• Relevant to SIAT work?
  • As consumers
  • As researchers
• Thanks
• QUESTIONS WELCOME
  • Also can e-mail me for questions (or grab a coffee)