INTRODUCTION TO DEEP LEARNING - cw.fel.cvut.cz · INTRODUCTION TO DEEP LEARNING Dmytro Mishkin Czech Technical University in Prague Clear Research Corporation [email protected]

INTRODUCTION TO DEEP LEARNING

Dmytro Mishkin

Czech Technical University in Prague

Clear Research Corporation

[email protected]

MY BACKGROUND

CTO of Clear Research. Using deep learning at work since 2014.

PhD student at Czech Technical University in Prague. Beat deep learning approaches at the VPRiCE Challenge 2015 with classical methods.

Now working fully in DL; recent paper “All you need is a good init” was added to the Stanford CS231n course.

Kaggler. 9th out of 1049 teams at the National Data Science Bowl.

AGENDA

Why deep learning (DL)? Some applications and motivations

What is the core idea behind DL?

Basics of convolutional networks (CNN)

Practical recommendations for CNN-based image classification. State-of-the-art approaches

Deep Learning libraries overview

How to apply CNNs to different tasks

EC2 hands-on experience on Cats-vs-Dogs competition. Homework

XKCD. NOT TRUE ANYMORE

DEEP LEARNING APPLICATIONS

Alpha Go :)

Image recognition

Speech Recognition. Cortana, Siri

Translation

Anomaly detection

Fraud detection

Video recognition

Robotics

Recommendation systems

DNA, biology, and more..

ALPHAGO

Mastering the game of Go with deep neural networks and tree search. Silver et al., 2016

IMAGE CLASSIFICATION

Select all dogs. Our assignment…almost :)

State of the art since 2012. Krizhevsky et al., 2012

Superhuman level on ImageNet classification since 2015. He et al., 2015; Szegedy et al., 2015

OBJECT DETECTION

SPEECH RECOGNITION

Cortana

Siri

OK, Google

Figure from Huang et al., 2015.

ANOMALY DETECTION

VIDEO CAPTIONING

Translating Videos to Natural Language Using Deep Recurrent Neural Networks. Venugopalan et al., 2015

TEXT TRANSLATION

From [Bahdanau et al., 2015] slides at ICLR 2015.

DEEP LEARNING FRAMEWORKS FOR REGULATORY GENOMICS AND EPIGENOMICS

https://www.youtube.com/watch?v=2vpKB3j-OY0

ROBOTICS: NAVIGATION

https://www.youtube.com/watch?v=umRdt3zGgpU

FRAUD DETECTION

As simple classification

http://www.slideshare.net/0xdata/paypal-fraud-detection-with-deep-learning-in-h2o-presentationh2oworld2014

AGENDA

Why deep learning (DL)? Some applications and motivations

DL IS NOT THE BEST CHOICE WHEN

You have a small number of heterogeneous (enumerated) features.

E.g. almost all Kaggle competitions:

Given browser, session id, gender, determine if customer wants revenge :)

Given some anonymized features, predict stock price

Given gender, profession, age, etc., predict insurance risk

How to farm rating on Habr (sorry :)

WHAT IS COMMON IN DL-FRIENDLY TASKS?

Extremely hard to write an explicit algorithm

Even if features are obvious – how to extract them?

Lots of structured homogeneous data (image, speech, text).

You can and have to transform the input. Could you transform a browser version?

DEEP LEARNING IS HIERARCHICAL REPRESENTATION LEARNING

Quoc V. Le et al., 2011. Building high-level features using large scale unsupervised learning

DEPTH IS ESSENTIAL IN DEEP LEARNING





CONVOLUTIONS? WHY NOT JUST MLP?

http://cs231n.github.io/convolutional-networks/

NN

CNN

WHAT IS CONVOLUTION

https://developer.apple.com/library/ios/documentation/Performance/Conceptual/vImage/ConvolutionOperations/ConvolutionOperations.html

A classical NN applied to an image is a convolution with an image-sized kernel
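
A minimal NumPy sketch of this equivalence, assuming a toy 8x8 grayscale image and a single neuron without bias (both made up for illustration): one fully connected neuron gives the same number as a "valid" convolution whose kernel is as large as the image.

```python
import numpy as np

def correlate2d_valid(image, kernel):
    """Naive 'valid' cross-correlation (what DL libraries call convolution)."""
    H, W = image.shape
    kh, kw = kernel.shape
    out = np.empty((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

rng = np.random.default_rng(0)
image = rng.normal(size=(8, 8))     # toy grayscale image (illustrative size)
weights = rng.normal(size=(8, 8))   # one fully connected neuron: one weight per pixel

fc_output = float(image.ravel() @ weights.ravel())  # classical NN neuron, no bias
conv_output = correlate2d_valid(image, weights)     # kernel as large as the image

print(conv_output.shape)  # (1, 1)
```

The kernel fits the image in exactly one position, so the output is a single value equal to `fc_output`.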


Let's look at the filters:

Local

Most values are close to the mean (non-informative)

Wasted computation and memory!!!

Also lots of parameters and little data -> overfitting

http://www.cs.toronto.edu/~ranzato/research/projects.html


Krizhevsky et al. 2012. conv1 filters 11x11x3. Much less wasted space!
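
A rough parameter count makes the saving concrete. The 11x11x3 filter size is from the slide; the 96 filters and the 227x227 input are assumptions recalled from the AlexNet setup, and the fully connected layer with 96 units is a hypothetical counterpart used only for comparison.

```python
def conv_params(kernel_h, kernel_w, in_channels, n_filters):
    """Weights plus one bias per filter; independent of the image size."""
    return kernel_h * kernel_w * in_channels * n_filters + n_filters

def fc_params(in_h, in_w, in_channels, n_units):
    """Weights plus one bias per unit; every unit sees every input value."""
    return in_h * in_w * in_channels * n_units + n_units

conv1 = conv_params(11, 11, 3, 96)   # assumed: 96 filters
dense = fc_params(227, 227, 3, 96)   # hypothetical dense layer with 96 units

print(conv1)  # 34944
print(dense)  # 14840448
```

Under these assumptions the dense layer needs over 400 times more parameters for the same number of output maps/units.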

DO WE NEED MATH?

If yes, go to the whiteboard

POOLING


MAX POOLING
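
A small sketch of max pooling, assuming the common 2x2 window with stride 2 (the slide shows only a figure, so the window size is an assumption): each output value is the maximum of one window of the feature map.

```python
import numpy as np

def max_pool2d(x, size=2, stride=2):
    """Max pooling over a single 2-D feature map."""
    H, W = x.shape
    out_h = (H - size) // stride + 1
    out_w = (W - size) // stride + 1
    out = np.empty((out_h, out_w), dtype=x.dtype)
    for i in range(out_h):
        for j in range(out_w):
            window = x[i * stride:i * stride + size, j * stride:j * stride + size]
            out[i, j] = window.max()
    return out

feature_map = np.array([[1, 3, 2, 4],
                        [5, 6, 1, 2],
                        [7, 2, 9, 0],
                        [1, 8, 3, 4]])
pooled = max_pool2d(feature_map)
print(pooled)
# [[6 4]
#  [8 9]]
```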


TYPICAL CNN STRUCTURE (LENET-5)

http://eblearn.sourceforge.net/lib/exe/lenet5.png

• (Conv-ReLU-Pool)xN Softmax. Simple

• (Conv-ReLU)xN-Pool-(Conv-ReLU)x2N-Pool….Softmax. Popular.

• Some Inception arch. Have fun :)

NON-LINEARITIES

NON-LINEARITIES BENCHMARK

https://github.com/ducha-aiki/caffenet-benchmark/blob/master/Activations.md

NON-LINEARITIES

AGENDA

Practical recommendations for CNN-based image classification. State-of-the-art approaches





IMAGE PREPROCESSING

Subtract mean pixel (training set), divide by std.

RGB is the best colorspace for CNN

Do nothing more…

…unless you have a specific dataset.

Subtract local mean pixel (B. Graham, 2015, Kaggle Diabetic Retinopathy Competition report)
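
The first recommendation can be sketched as follows, assuming images stored as a float array of shape (N, H, W, 3) in RGB; the random arrays stand in for a real dataset. The statistics come from the training set only and are reused for test images.

```python
import numpy as np

def fit_pixel_stats(train_images):
    """Per-channel mean and std of the training set; input shape (N, H, W, 3)."""
    mean = train_images.mean(axis=(0, 1, 2))
    std = train_images.std(axis=(0, 1, 2))
    return mean, std

def preprocess(images, mean, std, eps=1e-8):
    """Subtract the mean pixel and divide by the std (broadcast over channels)."""
    return (images - mean) / (std + eps)

rng = np.random.default_rng(0)
train = rng.uniform(0, 255, size=(16, 32, 32, 3))   # stand-in training images
test = rng.uniform(0, 255, size=(4, 32, 32, 3))     # stand-in test images

mean, std = fit_pixel_stats(train)
train_norm = preprocess(train, mean, std)
test_norm = preprocess(test, mean, std)   # reuse the training statistics
```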

TRAINING. SOLVERS AND REGULARIZATION

Use SGD with momentum.

Try learning rates 0.01, 0.005, 0.001

Momentums: 0, 0.5, 0.9, 0.95

Try L2 weight decay 0.0005, 0.0001. It prevents overconfidence.

Fancy solvers (Adam, RMSProp, AdaDelta) sometimes work better, sometimes not.
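
A minimal sketch of one SGD-with-momentum update with L2 weight decay, using values from the ranges above. The least-squares toy problem and the exact update form (classical momentum, decay added to the gradient) are assumptions for illustration; libraries differ in details.

```python
import numpy as np

def sgd_momentum_step(w, velocity, grad, lr=0.01, momentum=0.9, weight_decay=0.0005):
    """One update: L2 weight decay is added to the gradient, then classical momentum."""
    grad = grad + weight_decay * w
    velocity = momentum * velocity - lr * grad
    return w + velocity, velocity

# Toy least-squares problem (hypothetical data) to exercise the update rule
rng = np.random.default_rng(0)
X = rng.normal(size=(256, 5))
true_w = np.array([1.0, -2.0, 0.5, 3.0, 0.0])
y = X @ true_w

w = np.zeros(5)
v = np.zeros(5)
for step in range(500):
    idx = rng.integers(0, len(X), size=32)        # random mini-batch
    xb, yb = X[idx], y[idx]
    grad = xb.T @ (xb @ w - yb) / len(idx)        # gradient of 0.5 * mean squared error
    w, v = sgd_momentum_step(w, v, grad)
```

With weight decay the solution is pulled slightly toward zero, so `w` ends close to, but not exactly at, `true_w`.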

ARCHITECTURE

Use filters as small as possible

3x3 + ReLU + 3x3 + ReLU > 5x5 + ReLU.

Exception: 1st layer. Too computationally inefficient to use 3x3 there.

Convolutional Neural Networks at Constrained Time Cost. He and Sun, 2015
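
One way to see why two 3x3 layers compare well with one 5x5 layer: they cover the same receptive field with fewer weights and one extra non-linearity. A small count, assuming stride 1 and a hypothetical 64 input and output channels (biases ignored):

```python
def conv_weights(kernel, channels_in, channels_out):
    """Number of weights in one convolution layer, biases ignored."""
    return kernel * kernel * channels_in * channels_out

def stacked_receptive_field(kernel, n_layers):
    """Receptive field of n stacked stride-1 convolutions: each adds (kernel - 1) pixels."""
    return 1 + n_layers * (kernel - 1)

C = 64  # hypothetical channel count
two_3x3 = 2 * conv_weights(3, C, C)
one_5x5 = conv_weights(5, C, C)

print(stacked_receptive_field(3, 2), stacked_receptive_field(5, 1))  # 5 5
print(two_3x3, one_5x5)  # 73728 102400
```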

WEIGHTS INITIALIZATION

Preserve variance = 1 at the output of every layer.

How? There are lots of papers with variants.

Mishkin and Matas. All you need is a good init. ICLR, 2016

WEIGHTS INITIALIZATION

Gaussian noise with some coefficient:

Xavier:

He (Xavier adjusted for ReLU: twice the variance, since ReLU zeroes about half of the activations)

Orthonormal (Saxe et al., 2013)

Data-dependent: LSUV

Mishkin and Matas. All you need is a good init. ICLR, 2016
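
A hedged NumPy sketch of these options for fully connected layers. The Xavier variant shown is the fan-in form (variance 1/fan_in) and He uses variance 2/fan_in, i.e. twice that, to compensate for ReLU; the LSUV part follows the idea of the paper (orthonormal pre-initialization, then rescaling each layer until its output variance on a data batch is about 1) but the function names, layer sizes, and tolerance are assumptions, not the authors' code.

```python
import numpy as np

rng = np.random.default_rng(0)

def xavier_init(fan_in, fan_out):
    # Fan-in form of Xavier: Var(w) = 1 / fan_in
    return rng.normal(0.0, np.sqrt(1.0 / fan_in), size=(fan_in, fan_out))

def he_init(fan_in, fan_out):
    # He: Var(w) = 2 / fan_in
    return rng.normal(0.0, np.sqrt(2.0 / fan_in), size=(fan_in, fan_out))

def orthonormal_init(fan_in, fan_out):
    # QR of a Gaussian matrix gives orthonormal columns (needs fan_in >= fan_out)
    q, _ = np.linalg.qr(rng.normal(size=(fan_in, fan_out)))
    return q

def lsuv_init(layer_sizes, batch, tol=0.01, max_iter=10):
    """Orthonormal pre-init, then rescale each layer until Var(output) is about 1."""
    weights = []
    x = batch
    for fan_in, fan_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        w = orthonormal_init(fan_in, fan_out)
        for _ in range(max_iter):
            var = (x @ w).var()
            if abs(var - 1.0) < tol:
                break
            w = w / np.sqrt(var)
        weights.append(w)
        x = np.maximum(x @ w, 0.0)   # ReLU output feeds the next layer
    return weights

batch = rng.normal(size=(256, 64))                 # stand-in data batch
weights = lsuv_init([64, 64, 64, 64], batch)

# Check: pre-activation variance of every layer on the same batch
x = batch
layer_vars = []
for w in weights:
    pre = x @ w
    layer_vars.append(float(pre.var()))
    x = np.maximum(pre, 0.0)
```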

BATCH NORMALIZATION

Ioffe et al., 2015

BATCH NORMALIZATION
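
A minimal sketch of the training-time forward pass of batch normalization for a (batch, features) input: normalize every feature over the mini-batch, then apply a learned scale `gamma` and shift `beta`. The array sizes are made up; at test time running averages of the batch statistics are used instead, which this sketch omits.

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """x: (batch, features). Normalize each feature over the batch, then scale and shift."""
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=3.0, size=(128, 4))   # badly scaled activations
out = batch_norm_forward(x, gamma=np.ones(4), beta=np.zeros(4))
```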

DROPOUT

DROPOUT

Play with rates. 0.5 is rarely the optimal choice (but often a good one)

DROPOUT

Dropout_rate * width = constant – doesn't work!
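
To play with rates, a sketch of dropout in its common "inverted" form (an assumption; the slides do not say which variant is meant): units are zeroed with probability `rate` during training and the survivors are rescaled, so nothing changes at test time.

```python
import numpy as np

def dropout(x, rate, rng, training=True):
    """Inverted dropout: zero units with probability `rate`, rescale the survivors."""
    if not training or rate == 0.0:
        return x
    keep_prob = 1.0 - rate
    mask = rng.random(x.shape) < keep_prob
    return x * mask / keep_prob

rng = np.random.default_rng(0)
activations = np.ones((1000, 100))
for rate in (0.2, 0.5, 0.8):
    dropped = dropout(activations, rate, rng)
    # fraction of zeroed units is close to `rate`; the mean activation stays close to 1
    print(rate, float((dropped == 0).mean()), float(dropped.mean()))
```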

DATA AUGMENTATION

Common (helps in 99% of cases):

Random crop: e.g., 227x227 from 256x256 px (AlexNet)

Horizontal mirror

Dataset dependent:

Random rotation

Affine transform

Random scale

Color augmentation

Noise input

Thin plate deformation

Unleash your imagination
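
The two common augmentations can be sketched in a few lines, assuming an image stored as an (H, W, C) array; the random image stands in for a real one and the 0.5 mirror probability is the usual choice, not something stated on the slide.

```python
import numpy as np

def random_crop_and_mirror(image, crop_size, rng):
    """image: (H, W, C). Random crop_size x crop_size crop, mirrored with probability 0.5."""
    H, W, _ = image.shape
    top = rng.integers(0, H - crop_size + 1)
    left = rng.integers(0, W - crop_size + 1)
    crop = image[top:top + crop_size, left:left + crop_size]
    if rng.random() < 0.5:
        crop = crop[:, ::-1]   # horizontal mirror: flip the width axis
    return crop

rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(256, 256, 3), dtype=np.uint8)  # stand-in image
augmented = random_crop_and_mirror(image, 227, rng)
print(augmented.shape)  # (227, 227, 3)
```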

PADDING. VALID AND SAME CONVOLUTION

http://www.johnloomis.org/ece563/notes/filter/conv/convolution.html

Same = padding with zeros by ½ kernel size.

The most common choice
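
The standard output-size formula shows what "valid" costs. A sketch assuming stride 1, odd kernels, and a hypothetical 32x32 input passed through five 3x3 convolutions:

```python
def conv_output_size(input_size, kernel_size, padding=0, stride=1):
    """Spatial output size of a convolution along one dimension."""
    return (input_size + 2 * padding - kernel_size) // stride + 1

def same_padding(kernel_size):
    """Half the kernel size, rounded down; preserves size for odd kernels at stride 1."""
    return kernel_size // 2

size_valid = 32
size_same = 32
for _ in range(5):   # five stacked 3x3 convolutions
    size_valid = conv_output_size(size_valid, 3)
    size_same = conv_output_size(size_same, 3, padding=same_padding(3))

print(size_valid, size_same)  # 22 32
```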

PADDING

Padding:

Preserves spatial size, does not “wash out” information

Dropout-like augmentation by zeros

Caffenet128

with conv padding: 47% top-1 acc.

w/o conv padding: 41% top-1 acc.

This is a huge difference

SUMMARY FROM CS231N

AGENDA

Deep Learning libraries overview. Why Caffe.




DEEP LEARNING TOOLBOXES

Caffe

Torch

Theano

TensorFlow

MXNet

…

Nervana

DeepLearning4j

ConvnetJS

CNTK

Veles

H2O... sorry, guys :(

MAIN DEEP LEARNING TOOLBOXES

SPEED BENCHMARK ALEXNET

Library                  Class                      Time (ms)  forward (ms)  backward (ms)
CuDNN[R4]-fp16 (Torch)   cudnn.SpatialConvolution          71            25             46
Nervana-neon-fp16        ConvLayer                         78            25             52
CuDNN[R4]-fp32 (Torch)   cudnn.SpatialConvolution          81            27             53
Nervana-neon-fp32        ConvLayer                         87            28             58
fbfft (Torch)            fbnn.SpatialConvolution          104            31             72
TensorFlow               conv2d                           151            34            117
Chainer                  Convolution2D                    177            40            136
cudaconvnet2*            ConvLayer                        177            42            135
CuDNN[R2] *              cudnn.SpatialConvolution         231            70            161
Caffe (native)           ConvolutionLayer                 324           121            203
Torch-7 (native)         SpatialConvolutionMM             342           132            210
CL-nn (Torch)            SpatialConvolutionMM             963           388            574
Caffe-CLGreenTea         ConvolutionLayer                1442           210           1232

https://github.com/soumith/convnet-benchmarks

https://github.com/soumith/cudnn.torch/blob/master/SpatialConvolution.lua

https://github.com/soumith/convnet-benchmarks/blob/master/nervana/README.md



https://github.com/facebook/fbcunn/tree/master/src/fft

https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/ops/nn.py

https://github.com/pfnet/chainer/blob/master/chainer/links/connection/convolution_2d.py

https://github.com/soumith/cuda-convnet2.torch/blob/master/cudaconv3/src/filter_acts.cu


https://github.com/BVLC/caffe/blob/master/src/caffe/layers/conv_layer.cu

https://github.com/torch/cunn/blob/master/SpatialConvolutionMM.cu

https://github.com/hughperkins/clnn/blob/master/SpatialConvolutionMM.cl

https://github.com/naibaf7/caffe

SPEED BENCHMARK. GOOGLENET

Library                  Class                      Time (ms)  forward (ms)  backward (ms)
Nervana-neon-fp16        ConvLayer                        230            72            157
Nervana-neon-fp32        ConvLayer                        270            84            186
CuDNN[R4]-fp16 (Torch)   cudnn.SpatialConvolution         462           112            349
CuDNN[R4]-fp32 (Torch)   cudnn.SpatialConvolution         470           130            340
Chainer                  Convolution2D                    687           189            497
TensorFlow               conv2d                           905           187            718
Caffe                    ConvolutionLayer                1935           786           1148
CL-nn (Torch)            SpatialConvolutionMM            7016          3027           3988
Caffe-CLGreenTea         ConvolutionLayer                9462           746           8716

https://github.com/soumith/convnet-benchmarks





https://github.com/pfnet/chainer/blob/master/chainer/links/connection/convolution_2d.py

https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/ops/nn.py

https://github.com/BVLC/caffe/blob/master/src/caffe/layers/conv_layer.cu

https://github.com/hughperkins/clnn/blob/master/SpatialConvolutionMM.cl

https://github.com/naibaf7/caffe

CAFFE

CAFFE

AGENDA

How to apply CNNs to various tasks

HOW TO DO – LET'S GO TO THE WHITEBOARD

Image retrieval: Babenko et al., 2014. http://arxiv.org/abs/1404.1777

Person identification: Chopra et al., 2006. http://yann.lecun.com/exdb/publis/pdf/chopra-05.pdf

Ranking: Wang et al., 2014

Playing games: Atari (2013), Go (2016). http://go.nature.com/Q5fiiy

Text generation: https://github.com/karpathy/char-rnn

Image generation: Radford et al., 2016

Action recognition: Simonyan et al., 2014

Anomaly detection: https://www.youtube.com/watch?v=ds73ULGjnpc&feature=youtu.be

Translation: Cho et al., 2014

Fraud detection at PayPal: http://university.h2o.ai/cds-lp/cds02.html


IMAGE RETRIEVAL

Figure from Babenko et al., 2014

1. Pass image through ImageNet-pretrained CNN.

2. Use some layer activations as a descriptor.

3. L2-normalize (must!)

4. Put into some fast nearest-neighbor (NN) search structure, like a kd-tree.
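
Steps 3 and 4 can be sketched as follows. Random vectors stand in for the CNN activations of steps 1 and 2 (no real network is run), and a brute-force search replaces the kd-tree to keep the example short; a kd-tree implementation such as `scipy.spatial.cKDTree` could take its place.

```python
import numpy as np

def l2_normalize(descriptors, eps=1e-12):
    """Scale every row to unit Euclidean length."""
    norms = np.linalg.norm(descriptors, axis=1, keepdims=True)
    return descriptors / np.maximum(norms, eps)

def nearest_neighbors(query, database, k=3):
    """Brute-force search. For unit-length vectors, ranking by Euclidean distance
    is the same as ranking by cosine similarity."""
    dists = np.sum((database - query) ** 2, axis=1)
    return np.argsort(dists)[:k]

rng = np.random.default_rng(0)
database = l2_normalize(rng.normal(size=(1000, 128)))   # stand-in for CNN descriptors

# Query: a slightly perturbed copy of database item 42
query = l2_normalize(database[42:43] + 0.01 * rng.normal(size=(1, 128)))[0]
top = nearest_neighbors(query, database)
print(top[0])  # 42
```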

EMBEDDINGS WITH SIAMESE NETWORKS

1. Put 2 images through copies of the same network.

2. L2 distance < 1 if same person, > 1 if different.

https://www.cs.nyu.edu/~sumit/research/research.html
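
One loss that produces this behaviour is a contrastive loss with margin 1. The sketch below uses the form popularized by Hadsell et al. (2006); that choice is an assumption, since the slide does not name the loss, and the toy embeddings are made up.

```python
import numpy as np

def contrastive_loss(emb_a, emb_b, same, margin=1.0):
    """emb_a, emb_b: (N, D) embeddings from the shared network; same: (N,), 1 if same person."""
    d = np.linalg.norm(emb_a - emb_b, axis=1)
    pull = same * d ** 2                                   # same person: shrink the distance
    push = (1 - same) * np.maximum(margin - d, 0.0) ** 2   # different: push beyond the margin
    return float(np.mean(pull + push))

a = np.array([[0.0, 0.0], [0.0, 0.0]])
b = np.array([[0.1, 0.0], [2.0, 0.0]])   # pair 1 is close, pair 2 is far apart

good_labels = np.array([1, 0])   # close pair = same person, far pair = different
bad_labels = np.array([0, 1])    # the opposite assignment
print(contrastive_loss(a, b, good_labels), contrastive_loss(a, b, bad_labels))
```

The loss is small when same-person pairs are close and different-person pairs are farther apart than the margin, and large in the opposite case.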

WHAT ABOUT 3 COPIES? TRIPLETS


1. Put 3 images (x, x+, x-) through copies of the same network

2. D(x, x+) < D(x, x-)

Drawbacks: 1) Slow training :(

2) Have to select hard triplets. Random ones easily satisfy the inequality above.
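
A sketch of a margin-based triplet loss and of what "hard" means here. The margin of 0.2 and the toy embeddings are hypothetical: a triplet contributes to training only if D(x, x+) + margin > D(x, x-), so triplets that already satisfy the inequality by a wide gap give zero loss.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Per-triplet hinge loss on the distance gap; inputs are (N, D) embeddings."""
    d_pos = np.linalg.norm(anchor - positive, axis=1)
    d_neg = np.linalg.norm(anchor - negative, axis=1)
    return np.maximum(d_pos - d_neg + margin, 0.0)

anchor = np.array([[0.0, 0.0], [0.0, 0.0]])
positive = np.array([[0.1, 0.0], [0.5, 0.0]])
negative = np.array([[3.0, 0.0], [0.6, 0.0]])   # first negative is far, second is close

losses = triplet_loss(anchor, positive, negative)
hard = losses > 0   # only these triplets produce a gradient
print(losses, hard)
```

Here the first triplet is "easy" (zero loss) and the second is "hard", which is why random triplets make training slow.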

GENERATING IMAGES WITH GANS

http://soumith.ch/eyescream/


• Generator tries to generate an image indistinguishable from a natural one.

• Discriminator tries to distinguish them.

• Both learn simultaneously
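
The two opposing objectives can be written down without any network. In this sketch the discriminator outputs are hand-picked probabilities rather than results of a trained model, and the generator uses the non-saturating loss -log D(G(z)), which is one common choice and an assumption here.

```python
import numpy as np

def discriminator_loss(d_real, d_fake, eps=1e-12):
    """d_real, d_fake: probabilities the discriminator assigns to 'this image is real'."""
    return float(-np.mean(np.log(d_real + eps) + np.log(1.0 - d_fake + eps)))

def generator_loss(d_fake, eps=1e-12):
    """Non-saturating form: the generator wants D(G(z)) close to 1."""
    return float(-np.mean(np.log(d_fake + eps)))

d_real = np.array([0.9, 0.8])   # hand-picked outputs on real images
fooled = np.array([0.7, 0.6])   # discriminator believes the generated images
caught = np.array([0.1, 0.2])   # discriminator spots the generated images

print(generator_loss(fooled), generator_loss(caught))
print(discriminator_loss(d_real, fooled), discriminator_loss(d_real, caught))
```

What lowers one loss raises the other, which is the sense in which both networks learn against each other.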

AUTO-ENCODERS

http://deeplearning4j.org/deepautoencoder.html

DE-NOISING AUTO-ENCODERS

Clean Input Corrupted input (what net sees) Reconstructed

If we compare the input and the reconstruction, we can detect anomalies

http://www.cs.toronto.edu/~ranzato/research/projects.html
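
A toy sketch of anomaly detection by reconstruction error. To stay short it does not train a de-noising auto-encoder: a PCA projection is swapped in as a linear stand-in for the encoder and decoder, and the data, the 2-D code size, and the 99th-percentile threshold are all assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# "Normal" data lies close to a 2-D subspace of a 10-D space (synthetic)
basis = rng.normal(size=(2, 10))
train = rng.normal(size=(500, 2)) @ basis + 0.01 * rng.normal(size=(500, 10))

# Stand-in for a trained auto-encoder: project onto the top-2 principal directions
mean = train.mean(axis=0)
_, _, vt = np.linalg.svd(train - mean, full_matrices=False)
components = vt[:2]   # encoder weights; the decoder is the transpose

def reconstruct(x):
    code = (x - mean) @ components.T    # encode
    return code @ components + mean     # decode

def anomaly_score(x):
    """Reconstruction error per sample: large when the input is unlike the training data."""
    return np.linalg.norm(x - reconstruct(x), axis=1)

normal_sample = rng.normal(size=(1, 2)) @ basis
anomaly = 3.0 * rng.normal(size=(1, 10))   # ignores the structure of the training data
threshold = np.percentile(anomaly_score(train), 99)

print(anomaly_score(normal_sample)[0], anomaly_score(anomaly)[0], threshold)
```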

QUESTIONS?
