DEEP MACHINE LEARNING “A Shallow Introduction”

IAT 813, Instructor Steve DiPaola Guest Lecturer: Graeme McCaig

March 12, 2015

Presenter

Presentation Notes

Welcome to the Slide Presentation and Question/Answer session for my PhD Comprehensive Exam. Thanks to you all for making the time to attend.

• Deep Learning (DL) is a complex topic

• Authors often employ heavy statistics, machine learning terminology

• This lecture: overview the field and de-mystify key terms, concepts
• I hope to save you time/struggle getting started if you pursue DL in your work
• Topics not covered much: Recurrent Nets, Autoencoders


Hazards of the Deep

OVERVIEW

1. Deep learning – believe the hype?
  • DL in the news
  • “Depth” definition and benefits

2. What has changed? Is this just NNets?
  • DL recent history timeline

3. Types of Deep Learning network and training
  • Restricted Boltzmann Machines & Deep Belief Networks
  • Convolutional Networks

4. Practical advice
  • Useful libraries; GPU
  • Further reading

TERMS & CONCEPTS: Minibatch; Probabilistic/Stochastic; Undirected, Energy-based; Pre-training, Fine-tuning; Convolution; Dropout, ReLU

DEEP LEARNING IN THE NEWS



• Visual Object ___
  • Recognition
  • Detection
  • Captioning


• Object recognition task

• Recent State-of-Art results

• He et al. (2015) Microsoft Research (arXiv preprint)

http://arxiv-web3.library.cornell.edu/pdf/1502.01852v1.pdf


http://googleresearch.blogspot.ca/2013/06/improving-photo-search-step-across.html


http://cs.stanford.edu/people/karpathy/deepimagesent/ Andrej Karpathy, Li Fei-Fei (2014) Stanford



Vinyals et al. (2014) Google Research Post http://googleresearch.blogspot.ca/2014/11/a-picture-is-worth-thousand-coherent.html
















Andrej Karpathy, Li Fei-Fei (2014) Stanford http://cs.stanford.edu/people/karpathy/deepimagesent/

GoogLeNet Detection Model (2014) http://googleresearch.blogspot.ca/2014/09/building-deeper-understanding-of-images.html






• Applications
  • Self-driving cars
  • Biomedical imaging
  • Predicting DNA disease mapping
  • Drug discovery / virtual screening
  • Smartphone Apps


NVIDIA Drive PX http://www.nvidia.ca/object/drive-px.html



Cireşan et al. (2013). Mitosis detection in breast cancer histology images with deep neural networks. In Medical Image Computing and Computer-Assisted Intervention.


Scyfer (U Amsterdam spinoff) http://scyfer.nl/case-3d-mri-brain-scan-analysis/



“Beautiful Me” App http://btfl.me/



• Audio applications
  • Music recommendation
  • Speech recognition


Recommending music on Spotify with Deep Learning – Sander Dieleman Blog Post (2014) http://benanne.github.io/2014/08/05/spotify-cnns.html



Baidu Research (2015) http://usa.baidu.com/deep-speech-accurate-speech-recognition-with-gpu-accelerated-deep-learning/























• Game-playing AI
  • Deep Reinforcement Learning


Two-dimensional t-SNE embedding of the representations in the last hidden layer assigned by DQN to game states experienced while playing Space Invaders.

Learning to play Atari 2600 games with Deep Reinforcement Learning - Mnih et al. (2015) Nature doi:10.1038/nature14236

A visualization of the learned value function on the game Breakout.


• GPU Enabling Technology
  • Mass-market hardware
  • CUDA libraries


http://devblogs.nvidia.com/parallelforall/accelerate-machine-learning-cudnn-deep-neural-network-library/







http://blogs.nvidia.com/blog/2014/03/26/gpus-neural-cheap/

http://nl.hardware.info/reviews/2641/nvidia-geforce-gtx-680-quad-sli-review-english-version





“DEPTH” DEFINITION AND BENEFITS


WHAT IS “DEPTH”

• It's deep if it has more than one stage of non-linear feature transformation (LeCun & Ranzato 2013)
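As a rough illustration of this definition, the sketch below stacks several non-linear stages in NumPy. The layer widths, the tanh non-linearity and the random weights are assumptions chosen for the example, not anything taken from LeCun & Ranzato.

```python
import numpy as np

def layer(x, W, b):
    """One stage of non-linear feature transformation: affine map, then tanh."""
    return np.tanh(x @ W + b)

def forward(x, params):
    """Apply each (W, b) stage in turn; len(params) is the number of stages."""
    for W, b in params:
        x = layer(x, W, b)
    return x

rng = np.random.default_rng(0)
sizes = [8, 16, 16, 4]  # hypothetical widths: input, two hidden layers, output
params = [(rng.normal(0.0, 0.5, (m, n)), np.zeros(n))
          for m, n in zip(sizes[:-1], sizes[1:])]

x = rng.normal(size=(5, sizes[0]))  # a minibatch of 5 input vectors
features = forward(x, params)       # three non-linear stages, so "deep"
print(features.shape)
```

With a single (W, b) pair the same code would be a shallow model; the only thing that changes with depth is how many stages the input passes through.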


Figures from Bengio (2009)

Deep Feedforward Neural Net


Slide from LeCun & Ranzato (2013)



BENEFITS OF DEPTH

• Replaces feature engineering “by hand”

• More compact (fewer nodes than equivalent shallow net)

• Theoretical arguments suggest improved training, generalization
  • (Bengio et al. various papers)

• Appears to be how the brain works

• …Because it works (now giving state-of-art results on many tasks)


Visualization of nearest-neighbors in top network layer code [Krizhevsky et al 2012]

Semantic class separation, visualized with t-SNE [Donahue et al 2014]

WHAT HAS CHANGED?

• Is Deep Learning anything different from previous Neural Nets research?

• In fact, both “yes” and “no”
• And trends have flip-flopped in the short period from 2006 - present


New Concepts

• Build a better representation via Unsupervised Learning
  • Can then transfer to Supervised tasks
  • Leverage massive amounts of unlabelled data

• Probabilistic, generative network types
  • Restricted Boltzmann Machine

• New training algorithms
  • Greedy, layer-wise pre-training
  • Stochastic sampling-based estimation

OR

More of the Same; Minor Tweaks

• More computing power
  • GPU
  • Cloud, cluster

• Big data
  • Crowd-sourced labels

• “Good old” feed-forward multi-layer NNs
  • Supervised learning
  • Backpropagation (SGD)

• New (and old*) tricks
  • Convolution*, dropout, rectified linear units…

• Hinton’s perspective on Deep Learning circa 2006-2010

• Geoff Hinton “The Next Generation of Neural Networks”, for GoogleTechTalks, 2007 https://www.youtube.com/watch?v=AyzOUbkUf3M

• E.g. 4:20


DEEP LEARNING HISTORY

Excerpts from Schmidhuber, J. (2015). Deep learning in neural networks: An overview. Neural Networks, 61, 85-117.

• 1969: book (Minsky & Papert, 1969) on the limitations of simple linear perceptrons with a single layer discouraged some researchers from further studying NNs.

• 1979: the Neocognitron (Fukushima, 1979, 1980, 2013a) was perhaps the first artificial NN that deserved the attribute deep, and the first to incorporate […] neurophysiological insights

• 1986: a paper significantly contributed to the popularization of BP for NNs (Rumelhart, Hinton, & Williams, 1986), experimentally demonstrating the emergence of useful internal representations

• 1989: backpropagation (Section 5.5) was applied (LeCun et al., 1989; LeCun, Boser, et al., 1990; LeCun, Bottou, Bengio, & Haffner, 1998) to Neocognitron-like, weight-sharing, convolutional neural layers with adaptive connections.

• 1991: by the late 1980s, experiments had indicated that traditional deep feedforward or recurrent networks are hard to train by backpropagation (BP). Hochreiter’s (1991, thesis) work formally identified a major reason: Typical deep NNs suffer from the now famous problem of vanishing or exploding gradients.


• ~1995-2005: In the decade around 2000, many practical and commercial pattern recognition applications were dominated by non-neural machine learning methods such as Support Vector Machines (SVMs) (Schölkopf et al., 1998; Vapnik, 1995).

• 2006: While learning networks with numerous non-linear layers date back at least to 1965 and explicit DL research results have been published at least since 1991, the expression Deep Learning was actually coined around 2006, when unsupervised pre-training of deep FNNs helped to accelerate subsequent SL through BP (Hinton, Osindero, & Teh, 2006; Hinton & Salakhutdinov, 2006). A DBN fine-tuned by BP achieved 1.2% error rate (Hinton & Salakhutdinov, 2006) on the MNIST handwritten digits. This result helped to arouse interest in DBNs. DBNs also achieved good results on phoneme recognition, with an error rate of 26.7% on the TIMIT core test set (Mohamed & Hinton, 2010).


• 2012: an ensemble of (supervised) GPU-based Max-Pooling Convolutional Neural Nets achieved best results on the ImageNet classification benchmark (Krizhevsky, Sutskever, & Hinton, 2012), which is popular in the computer vision community.

• Also in 2012: the biggest NN so far (10^9 free parameters) was trained in unsupervised mode on unlabeled data (Le et al., 2012), then applied to ImageNet. The codes across its top layer were used to train a simple supervised classifier, which achieved best results so far on 20,000 classes. Instead of relying on efficient GPU programming, this was done by brute force on 1000 standard machines with 16,000 cores.


• ~2015 (present day): Most competition-winning or benchmark record-setting Deep Learners actually use one of two supervised techniques: (a) recurrent Long Short-Term Memory (LSTM) (1997) trained by Connectionist Temporal Classification (CTC) (2006), or (b) feedforward GPU-based Max-Pooling Convolutional Neural Nets (2011) based on CNNs (1979) plus MP (1992), trained through Backpropagation (1989–2007).


• Y. LeCun in IEEE Spectrum interview: “A lot of us involved in the resurgence of Deep Learning in the mid-2000s, including Geoff Hinton, Yoshua Bengio, and myself—the so-called “Deep Learning conspiracy”—as well as Andrew Ng, started with the idea of using unsupervised learning more than supervised learning. Unsupervised learning could help “pre-train” very deep networks. We had quite a bit of success with this, but in the end, what ended up actually working in practice was good old supervised learning, but combined with convolutional nets, which we had over 20 years ago. But from a research point of view, what we’ve been interested in is how to do unsupervised learning properly. We now have unsupervised techniques that actually work. The problem is that you can beat them by just collecting more data, and then using supervised learning. This is why in industry, the applications of Deep Learning are currently all supervised. But it won’t be that way in the future.”


http://spectrum.ieee.org/automaton/robotics/artificial-intelligence/facebook-ai-director-yann-lecun-on-deep-learning





Shallow Deep

Adapted from LeCun & Ranzato (2013)


RESTRICTED BOLTZMANN MACHINE (RBM)

• Unsupervised, Probabilistic, Energy-based model
• Shallow building block for Deep Belief Network (DBN) and Deep Boltzmann Machine (DBM)


[Figure: Restricted Boltzmann Machine, showing the Hidden Layer and the Visible Data Layer (or “v”)]

OBJECTIVE FUNCTION FOR UNSUPERVISED LEARNING


• For supervised learning, minimize Training Error {difference between model’s P(y|x) and true (y,x) data}

• Equivalent for Unsupervised, Generative model? Maximize Likelihood of Train/Test sets under the model, i.e. model’s P(x) where x is training data

[Figure: Training Distribution vs. Learned Model. From http://imonad.com/rbm/restricted-boltzmann-machine/]

• Energy of the network: [equation image not recovered]
• Likelihood of a datapoint P(v) is hard to find directly: its probability is only known relative to all possible states of the net! (The sum of exp(−energy) over all states is called the Partition Function Z.)
• Block Gibbs Sampling: a “back and forth” technique
  • Use to find a hidden-layer “representation” for a known visible vector (inference)
  • Use to generate a sample from the model’s probability distribution
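To make the partition function concrete, here is a minimal sketch that computes P(v) exactly for a toy RBM by enumerating every state. It assumes the standard binary-RBM energy E(v, h) = −a·v − b·h − v·W·h, with a and b the visible and hidden biases; the names and the tiny sizes are illustrative, and the enumeration is only feasible because the net is so small.

```python
import itertools
import numpy as np

def energy(v, h, W, a, b):
    """E(v, h) = -a.v - b.h - v.W.h for binary vectors v and h (assumed form)."""
    return -(a @ v) - (b @ h) - (v @ W @ h)

def all_binary(n):
    """Every binary vector of length n (there are 2**n of them)."""
    return [np.array(bits, dtype=float)
            for bits in itertools.product([0, 1], repeat=n)]

def exact_visible_probs(W, a, b):
    """Return ({visible tuple: P(v)}, Z) by brute-force enumeration."""
    n_v, n_h = W.shape
    unnorm = {}
    for v in all_binary(n_v):
        # Sum exp(-E) over every hidden configuration for this visible vector.
        key = tuple(int(x) for x in v)
        unnorm[key] = sum(np.exp(-energy(v, h, W, a, b))
                          for h in all_binary(n_h))
    Z = sum(unnorm.values())  # partition function: sum over ALL joint states
    return {v: p / Z for v, p in unnorm.items()}, Z

rng = np.random.default_rng(0)
W = rng.normal(0.0, 1.0, (3, 2))  # 3 visible units, 2 hidden units
a, b = np.zeros(3), np.zeros(2)
probs, Z = exact_visible_probs(W, a, b)
print(len(probs), sum(probs.values()))  # 8 visible states; total is 1 up to rounding
```

The cost doubles with every added unit, which is why realistic RBMs rely on sampling (Gibbs, Contrastive Divergence) instead of computing Z.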

RESTRICTED BOLTZMANN MACHINE (RBM)


Binary, probabilistic neurons (on / off)

Propagation of activation:
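The slide's own formula did not survive extraction. For a binary RBM the propagation rule is usually written as below; this is the textbook form, not necessarily the slide's notation. Here w_{ij} connects visible unit i to hidden unit j, and a_i, b_j are the visible and hidden biases (symbol names are assumptions).

```latex
\begin{aligned}
p(h_j = 1 \mid v) &= \sigma\Big(b_j + \sum_i v_i w_{ij}\Big) \\
p(v_i = 1 \mid h) &= \sigma\Big(a_i + \sum_j w_{ij} h_j\Big) \\
\sigma(x) &= \frac{1}{1 + e^{-x}}
\end{aligned}
```

Each unit is switched on with the given probability, so one "up" pass followed by one "down" pass is a single step of block Gibbs sampling.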

Contrastive Divergence learning
• Uses 1 or a few passes of Gibbs Sampling
• Updates done on Minibatches (e.g. 100 to 1k input vectors at once)
  • Good for convergence
  • Efficient on GPU (matrix multiply)
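A minimal sketch of one Contrastive Divergence step (CD-1) on a minibatch, assuming a binary RBM with sigmoid units. The function name, learning rate, toy data and layer sizes are all made up for illustration; this is not the Matlab code discussed later in the lecture.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(V, W, a, b, lr, rng):
    """One CD-1 step on minibatch V (rows are binary visible vectors).

    Updates W, a, b in place; returns the mean squared reconstruction error.
    """
    # Positive phase: hidden probabilities and a binary sample, given the data.
    ph_data = sigmoid(V @ W + b)
    h_sample = (rng.random(ph_data.shape) < ph_data).astype(float)
    # Negative phase: one "back and forth" Gibbs pass.
    pv_recon = sigmoid(h_sample @ W.T + a)
    v_recon = (rng.random(pv_recon.shape) < pv_recon).astype(float)
    ph_recon = sigmoid(v_recon @ W + b)
    # Whole-minibatch statistics are plain matrix multiplies.
    n = V.shape[0]
    W += lr * (V.T @ ph_data - v_recon.T @ ph_recon) / n
    a += lr * (V - v_recon).mean(axis=0)
    b += lr * (ph_data - ph_recon).mean(axis=0)
    return float(np.mean((V - pv_recon) ** 2))

rng = np.random.default_rng(0)
patterns = np.array([[1, 1, 1, 0, 0, 0], [0, 0, 0, 1, 1, 1]], dtype=float)
data = patterns[rng.integers(0, 2, size=100)]  # 100 toy training vectors
W = rng.normal(0.0, 0.01, (6, 4))              # 6 visible, 4 hidden units
a, b = np.zeros(6), np.zeros(4)

errors = []
for epoch in range(100):
    batch_errors = [cd1_update(data[i:i + 10], W, a, b, 0.1, rng)
                    for i in range(0, len(data), 10)]
    errors.append(np.mean(batch_errors))
print(errors[0], errors[-1])  # reconstruction error, first vs. last epoch
```

Reconstruction error is only a rough progress indicator, not the likelihood itself, but it is cheap to track while training.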


Stochastic Gradient Descent (pink) vs. Batch Gradient Descent (red) [http://www.holehouse.org/mlclass/17_Large_Scale_Machine_Learning.html]

http://journal.frontiersin.org/article/10.3389/fnins.2013.00272/full


Contrastive Divergence for DBNs (Bengio et al. 2009)

STACKING RBMS TO FORM A DEEP BELIEF NET (DBN)

• Now comes the “deep” part…
• Greedy, Layer-wise, Unsupervised learning

• Hold low-layer weights constant and “stack” a new RBM on top of net, train that using Contrastive Divergence again
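The stacking procedure can be sketched as follows. This is an illustrative NumPy version, not the algorithm as printed in Bengio et al.: each RBM is trained with a simplified CD-1 (mean-field reconstruction), and the hidden probabilities of the frozen lower layer are fed to the next RBM as if they were data. All function names and hyperparameters are assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_rbm(data, n_hidden, rng, epochs=20, lr=0.1, batch=10):
    """Train one RBM with CD-1; returns (W, visible_bias, hidden_bias)."""
    n_visible = data.shape[1]
    W = rng.normal(0.0, 0.01, (n_visible, n_hidden))
    a, b = np.zeros(n_visible), np.zeros(n_hidden)
    for _ in range(epochs):
        for start in range(0, len(data), batch):
            V = data[start:start + batch]
            ph = sigmoid(V @ W + b)
            h = (rng.random(ph.shape) < ph).astype(float)
            v_neg = sigmoid(h @ W.T + a)      # mean-field reconstruction
            ph_neg = sigmoid(v_neg @ W + b)
            W += lr * (V.T @ ph - v_neg.T @ ph_neg) / len(V)
            a += lr * (V - v_neg).mean(axis=0)
            b += lr * (ph - ph_neg).mean(axis=0)
    return W, a, b

def train_dbn(data, hidden_sizes, rng):
    """Greedy layer-wise training: each new RBM sees the frozen layers' output."""
    layers, x = [], data
    for n_hidden in hidden_sizes:
        W, a, b = train_rbm(x, n_hidden, rng)
        layers.append((W, a, b))
        x = sigmoid(x @ W + b)  # lower weights held constant; feed forward
    return layers

def propagate_up(data, layers):
    """Deterministic bottom-up pass through the stacked layers."""
    x = data
    for W, _, b in layers:
        x = sigmoid(x @ W + b)
    return x

rng = np.random.default_rng(0)
toy = (rng.random((40, 12)) < 0.5).astype(float)  # placeholder binary data
dbn = train_dbn(toy, [8, 4], rng)
print(propagate_up(toy, dbn).shape)
```

Once a layer is trained its weights are never revisited during stacking, which is what makes the procedure "greedy"; fine-tuning the whole stack is a separate step.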


http://www.iro.umontreal.ca/~lisa/twiki/bin/view.cgi/Public/DeepBeliefNetworks

Lower layers now operate like a straight feedforward (or feedback) net!




Greedy, layer-wise stacking for DBNs (Bengio et al. 2009)

• To use a DBN for classification, supply class label data along with bottom-up node data when training the top layer

• Another method is to add a logistic regression layer (or other classifier) on the top and train it from the top-layer representation of data


http://www.iro.umontreal.ca/~lisa/twiki/bin/view.cgi/Public/DeepVsShallowComparisonICML2007




• “Fine-tuning” techniques adjust the whole network’s parameters simultaneously

  • For supervised learning, can use Backpropagation!
  • Unsupervised learning algorithms also exist…

• “Mean field” algorithms propagate real-number probability values as activations instead of stochastically sampling binary values
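The difference between the two propagation styles is small in code. The sketch below assumes sigmoid units and a list of hypothetical (W, b) pairs; it only shows the upward pass, not a full mean-field learning algorithm.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def up_pass_sampled(v, weights, rng):
    """Propagate upward, sampling a binary state at every layer."""
    x = v
    for W, b in weights:
        p = sigmoid(x @ W + b)
        x = (rng.random(p.shape) < p).astype(float)
    return x

def up_pass_mean_field(v, weights):
    """Propagate upward, passing real-valued probabilities as activations."""
    x = v
    for W, b in weights:
        x = sigmoid(x @ W + b)
    return x

rng = np.random.default_rng(0)
weights = [(rng.normal(size=(6, 5)), np.zeros(5)),
           (rng.normal(size=(5, 3)), np.zeros(3))]
v = (rng.random((4, 6)) < 0.5).astype(float)
print(up_pass_sampled(v, weights, rng))  # entries are 0 or 1, vary with the seed
print(up_pass_mean_field(v, weights))    # entries strictly between 0 and 1
```

The mean-field pass is deterministic, which is one reason it is convenient once a stacked net is treated as an ordinary feedforward network.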


RBM/DBN - MISC. IMPORTANT CONCEPTS

DEEP BOLTZMANN MACHINES (DBM)

• Unlike DBN, the DBM retains true bidirectional connections at all layers, even once stacked

• Potentially better for generative use
• More complicated to train
• Slower to run


From Salakhutdinov & Hinton (2009 AISTATS)

NON-BINARY NODES FOR RBM

• Useful for e.g. image data
• Gaussian-Bernoulli nodes: handle real values at input layer, binary values at hidden layer

• Spike-and-Slab nodes (Courville et al. 2011, AISTATS)
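For the Gaussian-Bernoulli case, a minimal sketch of the two conditional sampling steps is given below. It assumes the common simplification that each visible unit has unit variance (i.e. the inputs have been standardized); the function names and sizes are illustrative, and Spike-and-Slab units are not covered.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample_hidden(v, W, b, rng):
    """Binary hidden units given real-valued visibles (unit variance assumed)."""
    p = sigmoid(v @ W + b)
    return (rng.random(p.shape) < p).astype(float), p

def sample_visible(h, W, a, rng):
    """Real-valued visibles: Gaussian with mean a + h W^T and unit variance."""
    mean = a + h @ W.T
    return mean + rng.normal(size=mean.shape), mean

rng = np.random.default_rng(0)
W = rng.normal(0.0, 0.1, (5, 3))  # 5 real-valued visible units, 3 binary hidden
a, b = np.zeros(5), np.zeros(3)
v = rng.normal(size=(2, 5))       # e.g. standardized pixel intensities
h, _ = sample_hidden(v, W, b, rng)
v_new, _ = sample_visible(h, W, a, rng)
print(h.shape, v_new.shape)
```

Only the visible-side sampling changes relative to the binary RBM; the hidden layer is still on/off.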


(from Hinton 2012)

GENERATING SAMPLES FROM THE MODEL


Digit images generated from a DBN with digit labels clamped (per row) [Hinton et al 2006]

TESTING GENERATIVE (IMAGE) DL WITH DATA VISUALIZATION METHODS (MOSTLY QUALITATIVE)

• Display generated samples in paper; very common

• Look for novel re-combination of factors

• Generate image completions

Shape Boltzmann Machine [Eslami et al 2012]

HD-DBM [Salakhutdinov et al 2013]

TssRBM, TGaussRBM [Luo et al 2012]


(Courville et al 2013)

Spike-and-Slab RBM Samples

Nearest Pixel-wise Training Samples


IMPLEMENTING RBM, DBN IN CODE

• Matlab code example (Salakhutdinov DBM)
• In essence, a lot of “Back and Forth” propagation for sampling
• Handled well on a GPU: Matrix multiplication (whole mini-batch at a time)


Let’s look at Matlab code from http://www.utstat.toronto.edu/~rsalakhu/DBM.html ...


RBM Training (1/2)

RBM Training (2/2)

CONVOLUTIONAL NETWORKS (CNN)

• Most commonly for image processing (also audio, etc.)
• Receptive fields & Tied weights create “Feature Maps” similar to convolution kernel in image proc.


Tiled CNNs (Le et al. NIPS 2010)

http://www.deeplearning.net/tutorial/lenet.html




Technique for Visualizing CNN Features (Zeiler & Fergus, ECCV 2014)

CNN STRUCTURE

• local receptive fields - nodes which only connect to a limited subset of lower-layer nodes, based on topology

• shared weights - separate connections in the neural net which are constrained to have the same weight

• sub-sampling - combining spatially adjacent low-level features into one high-level feature

A convolution layer implements receptive fields and shared weights. It is divided into multiple “feature maps”, which can be visualized as 2d grids (“planes”). Each node in a given grid learns the same set of weights, but is connected to a different spatial patch (receptive field) in its input layer. This is conceptually equivalent to a filter kernel, or feature detector, being swept over each location of (i.e. convolved with) the input. A pooling layer performs subsampling by having planes of half the length and width of the lower layer, in which each node integrates inputs from a 2x2 receptive field.

LeCun et al. (1998)
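The convolution and pooling layers described above can be sketched in plain NumPy. This is a deliberately slow, loop-based illustration with one feature map; the tanh non-linearity, the 28x28 input and the 5x5 kernel are assumptions for the example. "Integrates" is read here as averaging; max-pooling is a common alternative. Like most CNN code, it slides the kernel without flipping it (strictly a cross-correlation).

```python
import numpy as np

def feature_map(image, kernel, bias=0.0):
    """Sweep one shared kernel over every valid location of a 2D image."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i:i + kh, j:j + kw]           # this node's receptive field
            out[i, j] = np.sum(patch * kernel) + bias   # same weights at every node
    return np.tanh(out)

def pool_2x2(fmap):
    """Average each non-overlapping 2x2 block, halving height and width."""
    h, w = fmap.shape[0] // 2 * 2, fmap.shape[1] // 2 * 2
    blocks = fmap[:h, :w].reshape(h // 2, 2, w // 2, 2)
    return blocks.mean(axis=(1, 3))

rng = np.random.default_rng(0)
image = rng.random((28, 28))           # placeholder for a grayscale input
kernel = rng.normal(0.0, 0.1, (5, 5))  # one learned filter (random here)
fmap = feature_map(image, kernel)
pooled = pool_2x2(fmap)
print(fmap.shape, pooled.shape)        # (24, 24) (12, 12)
```

A real convolution layer holds many such kernels, one per feature map, and libraries replace the Python loops with batched matrix operations.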


Alex Krizhevsky, Ilya Sutskever, Geoffrey E Hinton (2012) ImageNet Classification with Deep Convolutional Neural Networks, NIPS.

CONVOLUTIONAL NETS FOR IMAGE RECOGNITION

• “Supervised training using stochastic gradient descent and the backpropagation algorithm (just repeated application of the chain rule)” from Krizhevsky et al. (2012) LSVRC slides
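To illustrate "just repeated application of the chain rule", here is a small sketch of backpropagation and SGD for a two-layer classifier. It is a fully connected toy network, not the convolutional architecture of Krizhevsky et al.; the layer sizes, tanh hidden units, softmax output and learning rate are assumptions for the example.

```python
import numpy as np

def init_params(rng, n_in, n_hidden, n_out):
    return [rng.normal(0.0, 0.5, (n_in, n_hidden)), np.zeros(n_hidden),
            rng.normal(0.0, 0.5, (n_hidden, n_out)), np.zeros(n_out)]

def forward(params, x):
    W1, b1, W2, b2 = params
    h = np.tanh(x @ W1 + b1)
    logits = h @ W2 + b2
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)
    return h, probs

def loss(params, x, y):
    """Mean cross-entropy of the softmax output against integer labels y."""
    _, probs = forward(params, x)
    return -np.mean(np.log(probs[np.arange(len(y)), y]))

def gradients(params, x, y):
    """Chain rule, applied layer by layer from the output back to the input."""
    W1, b1, W2, b2 = params
    h, probs = forward(params, x)
    d_logits = probs.copy()
    d_logits[np.arange(len(y)), y] -= 1.0  # d(loss)/d(logits) for softmax + CE
    d_logits /= len(y)
    dW2 = h.T @ d_logits
    db2 = d_logits.sum(axis=0)
    d_h = d_logits @ W2.T                  # back through the output weights
    d_pre = d_h * (1.0 - h ** 2)           # back through tanh: 1 - tanh(z)^2
    dW1 = x.T @ d_pre
    db1 = d_pre.sum(axis=0)
    return [dW1, db1, dW2, db2]

def sgd_step(params, x, y, lr=0.1):
    """One gradient step on a minibatch; returns new parameter arrays."""
    return [p - lr * g for p, g in zip(params, gradients(params, x, y))]

rng = np.random.default_rng(0)
params = init_params(rng, 4, 5, 3)
x = rng.normal(size=(16, 4))     # a minibatch of 16 toy inputs
y = rng.integers(0, 3, size=16)  # toy class labels
before = loss(params, x, y)
for _ in range(50):
    params = sgd_step(params, x, y)
print(before, loss(params, x, y))
```

In practice the minibatch changes at every step; reusing one fixed batch here just keeps the example short.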

• New tricks: Dropout, ReLUs, Data augmentation



ImageNet classification examples from [Krizhevsky et al 2012]. “Each meaningful concept in WordNet, possibly described by multiple words or word phrases, is called a "synonym set" or "synset". There are more than 100,000 synsets in WordNet, majority of them are nouns (80,000+). In ImageNet, we aim to provide on average 1000 images to illustrate each synset. Images of each concept are quality-controlled and human-annotated.” [http://www.image-net.org/about-overview]


RECTIFIED LINEAR UNITS (RELU)


from Krizhevsky et al. (2012) LSVRC slides

DROPOUT

from Krizhevsky et al. (2012) LSVRC slides

• Reduces over-fitting due to co-adaptation of units
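Both tricks are short in code. The sketch below shows a ReLU and one common dropout variant, "inverted" dropout, which rescales the surviving units during training so nothing needs to change at test time. That rescaling choice and the names are assumptions for illustration; the slides referenced above may describe a different scaling convention.

```python
import numpy as np

def relu(x):
    """Rectified linear unit: pass positive values, zero out the rest."""
    return np.maximum(0.0, x)

def dropout(activations, drop_prob, rng, train=True):
    """Inverted dropout: zero units at random and rescale the survivors."""
    if not train or drop_prob == 0.0:
        return activations
    keep = rng.random(activations.shape) >= drop_prob
    return activations * keep / (1.0 - drop_prob)

rng = np.random.default_rng(0)
pre_activation = rng.normal(size=(4, 6))
hidden = relu(pre_activation)
print(dropout(hidden, 0.5, rng))               # about half the units zeroed
print(dropout(hidden, 0.5, rng, train=False))  # unchanged at test time
```

Because a different random subset of units is silenced on every minibatch, no unit can rely on a specific partner being present, which is the co-adaptation argument in the bullet above.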

DL LIBRARIES

• THEANO
  • Python
  • Includes RBM, DBN types; can experiment with custom algorithms
  • Helpful tutorials
  • Emphasis on automatic symbolic differentiation can be confusing

• PYLEARN2: built on Theano, offers some newer models (DBM, S3C), also complicated

• CUDA-CONVNET
  • Base library for high GPU performance
  • Feedforward (convolutional) only
  • C++
  • I have not tried it
  • Apparently can be wrapped inside Theano

• Matlab Code, e.g. http://www.cs.toronto.edu/~rsalakhu/DBM.html
  • See also https://github.com/rasmusbergpalm/DeepLearnToolbox (I have not tried it)

• CAFFE
  • Feedforward (convolutional) only
  • C++
  • Optional Python wrapper: http://nbviewer.ipython.org/github/BVLC/caffe/blob/master/examples/filter_visualization.ipynb

• NVIDIA CuDNN
  • Fast primitive operations (e.g. propagate activation thru sigmoid)
  • Incorporated into Caffe

• TORCH7
  • Apparently like Theano but w/ Lua


FURTHER READING

• My favorite comprehensive review papers are:

• Bengio, Y., Courville, A., & Vincent, P. (2013). Representation learning: A review and new perspectives. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 35(8), 1798–1828.

• Schmidhuber, J. (2015). Deep learning in neural networks: An overview. Neural Networks, 61, 85-117.

• Suggested readings at http://deeplearning.net/reading-list/
• Google Group on Deep Learning for current events
• Hinton, LeCun, Schmidhuber have done recent Reddit A.M.A.s
• Yoshua Bengio’s web site at U. Montreal is good reading
• FastML Blog can be interesting www.fastml.com
• ICML, NIPS (also AISTATS) conferences have many of the top papers


THE END – CLOSING THOUGHTS

• It’s fun to observe from the midst of a purported “revolution”
• Relevant to SIAT work?
  • As consumers
  • As researchers
• Thanks
• QUESTIONS WELCOME
  • Also can e-mail me for questions (or grab a coffee)

