Deep Learning - Neural networks have come back! (akabe.github.io/pub/HistoryOfNeuralNetworksAndDeepLearning.pdf)

Deep Learning

Neural networks have come back!

Akinori ABE (M1)

Sumii Laboratory
Graduate School of Information Science

Tohoku University

Dec 8, 2014

Deep learning and neural network

(Artificial) Neural Network (ANN, NN)

• An information processing model imitating the biological nervous system

• A network constructed of

  • units corresponding to neurons and
  • layers containing them

[Figure: a four-layer network: inputs (input layer), 1st layer, 2nd layer, 3rd layer, 4th layer (output layer)]

Deep Learning

• A set of algorithms for deeply structured NNs (of about 7, 8 or more layers)

• A recent research trend in machine learning

• Actively used for image processing, speech recognition, natural language processing, etc.

• Also applied to analysis of programming languages

2 / 81

History of NNs

1943 Threshold logic units [McCulloch and Pitts]
1949 Hebbian learning rule [Hebb]

1950–1960 The first golden age
  1957 Perceptron [Rosenblatt]
  1969 Limitations of perceptron [Minsky and Papert]

1970s The first “quiet years”

1980s The second golden age
  1986 Backpropagation [Rumelhart, Hinton and Williams]

mid 1990s–early 2000s The second “quiet years”

2006– The third golden age (Deep Learning)
  2006 Pretraining by restricted Boltzmann machine [Hinton and Salakhutdinov]
  2010 Rectified linear unit (ReLU) [Nair and Hinton]
  2012 Dropout [Hinton, et al.]

3 / 81

Outline

1 Fundamentals of machine learning: classification

2 Perceptron

3 Multilayer feed-forward NN and backpropagation

4 Deep Learning

5 Application of deep learning

4 / 81


Is this apple tasty (for you)?

The apple is produced in Aomori, and # of pips is five.

6 / 81

Empirical estimation

• The apples you have eaten:

  x1 (# of pips)   x2 (Aomori or not)   y (tasty)
  3                +1                   +1
  4                +1                   +1
  3                −1                   −1
  4                −1                   −1
  3                −1                   −1
  6                −1                   +1

• Is the unknown apple tasty?

  x1 (# of pips)   x2 (Aomori or not)   y (tasty)
  5                +1                   ?

7 / 81

Consider the coordinate space

[Figure: the apples plotted with x1 (# of pips, 2 to 6) and x2 (Aomori or not, −1 or +1) as axes; markers: tasty, not tasty, unknown]

The unknown apple is probably tasty.

8 / 81

Decision boundary

[Figure: the same plot of x1 (# of pips) and x2 (Aomori or not) with the decision boundary and its normal vector drawn in; markers: tasty, not tasty, unknown]

$F(x) = w_0 + w_1 x_1 + w_2 x_2 = w^\top x$, where

• the weight vector is $w = (-3, 1, 2)^\top$ and

• the feature vector is $x = (1, x_1, x_2)^\top$.

• $y = +1$ if $F(x) > 0$

• $y = -1$ if $F(x) < 0$
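The decision rule can be checked numerically. The following is a minimal sketch, assuming the apple data as read from the "Empirical estimation" slide; the names `decision_value`, `classify` and `APPLES` are made up for this illustration.

```python
# Weight vector from the slide: w = (w0, w1, w2) = (-3, 1, 2).
W = (-3.0, 1.0, 2.0)

def decision_value(w, x1, x2):
    """F(x) = w0 + w1*x1 + w2*x2, i.e. w^T x with the constant 1 prepended to x."""
    x = (1.0, x1, x2)
    return sum(wi * xi for wi, xi in zip(w, x))

def classify(w, x1, x2):
    """Return +1 if F(x) > 0 and -1 if F(x) < 0 (0 marks the undecided case F(x) == 0)."""
    f = decision_value(w, x1, x2)
    if f > 0:
        return +1
    if f < 0:
        return -1
    return 0

# (x1, x2, y) rows; assumed to match the table of eaten apples.
APPLES = [(3, +1, +1), (4, +1, +1), (3, -1, -1), (4, -1, -1), (3, -1, -1), (6, -1, +1)]

if __name__ == "__main__":
    for x1, x2, y in APPLES:
        print(x1, x2, "target", y, "predicted", classify(W, x1, x2))
    print("unknown apple (5, +1):", classify(W, 5, +1))
```

With these weights every listed apple falls on the correct side of the boundary, and the unknown apple (5 pips, from Aomori) gets F(x) = 4 > 0.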

9 / 81

Classification

• We have a (large) training set $S \subseteq \mathbb{R}^{D+1} \times \{+1, -1\}$, i.e., pairs of

  • an input vector $x$ and
  • a target value $t$ (the desired value of $y$).

• We classify an input vector $x$ by

  $$y = \begin{cases} +1 & \text{if } F(x) > 0 \\ -1 & \text{if } F(x) < 0 \end{cases} \quad\text{where } F(x) = w^\top x.$$

• We want to compute $w$ such that $t\,F(x) = t\,w^\top x > 0$ for all $(x, t) \in S$.

Expectation: if $w$ classifies the training data correctly, it should classify unknown data correctly as well.

10 / 81

Outline

2 Perceptron
  • Introduction to perceptron
  • Single-layer NNs and threshold logic units
  • The first “quiet years” of NNs

11 / 81


Perceptron

Perceptron [Rosenblatt, 1957]

• A fundamental classification approach

• A single-layer NN (described later)

Algorithm: Training of Perceptron

  Initialize $w$.
  repeat
    Get $(x, t) \in S$ with replacement.
    if $t\,w^\top x < 0$ then
      $w \leftarrow w + t x$
    end if
  until converged

The algorithm updates w if a prediction is wrong.
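The training loop can be sketched in a few lines of Python. This is an illustration, not the original implementation: the stopping test, the step limit and the zero initialization are assumptions, and the update condition uses `<=` instead of `<` so that an all-zero initial `w` still gets updated.

```python
import random

def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def train_perceptron(samples, max_steps=100000, seed=0):
    """samples: list of (x, t), where x includes the leading 1 and t is +1 or -1."""
    rng = random.Random(seed)
    w = [0.0] * len(samples[0][0])
    for _ in range(max_steps):
        # "Converged" is taken to mean: every sample satisfies t * w^T x > 0.
        if all(t * dot(w, x) > 0 for x, t in samples):
            break
        x, t = rng.choice(samples)  # sampling with replacement
        # Assumption: "<=" rather than "<", otherwise w = 0 would never change.
        if t * dot(w, x) <= 0:
            w = [wi + t * xi for wi, xi in zip(w, x)]
    return w

# Apple data as (x, t) with x = (1, x1, x2); assumed from the earlier table.
APPLES = [((1, 3, +1), +1), ((1, 4, +1), +1), ((1, 3, -1), -1),
          ((1, 4, -1), -1), ((1, 6, -1), +1)]

w = train_perceptron(APPLES)
```

Because the apple data are linearly separable, the loop stops with a `w` that classifies every training sample correctly; the learned `w` need not equal (−3, 1, 2).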

13 / 81

Updating a decision boundary

Updating by $w \leftarrow w + t x$ (if $t\,w^\top x < 0$, i.e., the prediction is wrong):

(a) If false negative ($F(x) < 0$ but $t = +1$), $w$ is updated to $w + x$.

(b) If false positive ($F(x) > 0$ but $t = -1$), $w$ is updated to $w - x$.

[Figure: the decision boundary before and after updating, for (a) a false negative and (b) a false positive]

14 / 81

Why can a perceptron learn?

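(a) False negative ($F(x) < 0$ but $t = +1$)

• The updated weight vector: $w' := w + x$
• Classifying $x$ again: $w'^\top x = (w + x)^\top x = w^\top x + x^\top x$

  • $w^\top x$: the misprediction (negative)
  • $x^\top x = \|x\|^2$: the squared L2-norm (always positive, since $x$ contains the constant 1)

Thus $w'^\top x > w^\top x$, i.e., the value moves toward the sign of $t$.

(b) False positive ($F(x) > 0$ but $t = -1$)

• The updated weight vector: $w' := w - x$
• Classifying $x$ again: $w'^\top x = (w - x)^\top x = w^\top x - x^\top x$

  • $w^\top x$: the misprediction (positive)
  • $x^\top x = \|x\|^2$: the squared L2-norm (always positive)

Thus $w'^\top x < w^\top x$, i.e., the value moves toward the sign of $t$.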

15 / 81

Perceptron convergence theorem

Theorem (perceptron convergence) [Rosenblatt, 1962]

If the training set is linearly separable, a perceptron always converges and returns $w$ such that

$t\,w^\top x > 0$ for all $(x, t) \in S$.

Definition (linear separability)

In binary classification, a data set is linearly separable if all points of the two classes in the set can be separated by some hyperplane.

However, a perceptron converges very slowly for real data.

16 / 81


Artificial neural network

(Artificial) Neural Network (ANN, NN)

• An information processing model imitating the biological nervous system

  • An NN is not a strict model of the biological nervous system, but the analogy is frequently used as an intuitive explanation.

• A network constructed of

  • units corresponding to neurons and
  • layers containing them

[Figure: a four-layer network: inputs (input layer), 1st layer, 2nd layer, 3rd layer, 4th layer (output layer)]

18 / 81

Artificial neurons

Threshold logic units (TLU) [McCulloch and Pitts, 1943]

An artificial neuron (i.e., a unit)

• takes input signals $x_1, \dots, x_D$,

• is activated if $\sum_{i=1}^{D} w_i x_i > \theta$, and

• outputs $y = +1$ if activated (otherwise $y = 0$).

Hebbian learning rule [Hebb, 1949]

• Learning is achieved by changing the efficiency of signal propagation between two biological neurons in the long term.
• It corresponds to updating $w_i$.

19 / 81

Units

• $x = (1, x_1, \dots, x_D) \in \mathbb{R}^{D+1}$ is an input vector.

• $w = (w_0, w_1, \dots, w_D) \in \mathbb{R}^{D+1}$ is a weight vector.

• $y$ is an output given by

  $y = h(a)$ where $a = w^\top x$.

  • $a$ is an activation.
  • $h$ is a (nonlinear) activation function.

20 / 81

Examples of activation function

[Figure: plots of H(a), sigm(a) and tanh(a) for a from −6 to 6]

Step function:

$$H(a) = \begin{cases} +1 & a > 0, \\ -1 & a < 0. \end{cases}$$

Logistic sigmoid function:

$$\mathrm{sigm}(a) = \frac{1}{1 + \exp(-a)}.$$

Hyperbolic tangent:

$$\tanh(a) = \frac{\exp(a) - \exp(-a)}{\exp(a) + \exp(-a)}.$$
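The three activation functions and a single unit $y = h(w^\top x)$ can be written directly from the formulas above. This is a small sketch with made-up function names; the value returned by `step` at exactly `a == 0` is an assumption, since the slide leaves that case undefined.

```python
import math

def step(a):
    """H(a): +1 for a > 0, -1 for a < 0; returns 0 at a == 0 (assumption)."""
    if a > 0:
        return 1.0
    if a < 0:
        return -1.0
    return 0.0

def sigm(a):
    """Logistic sigmoid 1 / (1 + exp(-a)); may overflow for very negative a."""
    return 1.0 / (1.0 + math.exp(-a))

def tanh(a):
    """Hyperbolic tangent written out as on the slide."""
    return (math.exp(a) - math.exp(-a)) / (math.exp(a) + math.exp(-a))

def unit(w, x, h):
    """Output y = h(a) with activation a = w^T x; x must include the leading 1."""
    a = sum(wi * xi for wi, xi in zip(w, x))
    return h(a)

if __name__ == "__main__":
    for a in (-2.0, 0.0, 2.0):
        print(a, step(a), sigm(a), tanh(a))
```

The two smooth functions are related by tanh(a) = 2 sigm(2a) − 1, so tanh is a rescaled and shifted sigmoid.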

21 / 81

Single-layer NN

[Figure: a single-layer NN; the input signals feed the 1st layer (output layer), which emits the output signals]

• A single-layer NN is an array of units.

• Usable for multiclass classification (e.g., classification of tasty, sweet or bitter apples).

• The above NN is also called two-layer in many papers.

• In this talk, we call it single-layer, following PRML.

22 / 81


The first “quiet years” of NNs

Perceptrons: an introduction to computational geometry [Minsky and Papert, 1969]

• In this book, the authors proved that single-layer NNs cannot solve linearly non-separable problems.

• People misinterpreted the proof: they believed that it generalized to multilayer NNs, i.e., that multilayer NNs could not solve such problems either.

• Actually, multilayer NNs (described later) can solve them.

Unfortunately, research budgets for NNs were reduced until the mid-1980s.

24 / 81

Outline

2 Perceptron

3 Multilayer feed-forward NN and backpropagation
  • Multilayer feed-forward NN
  • Backpropagation
  • Problem: over-fitting
  • Problem: vanishing gradient
  • The second “quiet years” of NNs

4 Deep Learning

5 Application of deep learning

25 / 81

Multilayer feed-forward NN

[Figure: layers #1, #2, ..., #N-1, #N stacked between the inputs and the outputs; the figure labels the hidden layers and the output layer]

An N-layer feed-forward NN is constructed of N single-layer NNs.

• The top is called an output layer.

• Others are hidden layers.

• Input & output signals are observable.

• Outputs of hidden layers are unobservable.

Feed-forward Propagation

• Signals are propagated from lower to upper layers.

• Skipping is allowed. (skip-layer connection)

• Connection can be sparse.

27 / 81

Units of multilayer NN

An N -layer NN:

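• $G = (\mathcal{V}, \mathcal{E})$ is a directed graph of a multilayer NN.

  • $\mathcal{V}$ is a set of units.
  • $\mathcal{E} \subseteq \mathcal{V} \times \mathcal{V}$ is a set of edges. ($(i, j) \in \mathcal{E}$ is the edge to $i$ from $j$.)

• $w_{i,j}$ is the weight of $(i, j) \in \mathcal{E}$.

• $s_i$ is the output signal of unit $i \in \mathcal{V}$, given by

  $$s_i = h_i(a_i) \quad\text{where}\quad a_i = \sum_{j \text{ s.t. } (i,j) \in \mathcal{E}} w_{i,j}\, s_j.$$

  • $a_i$ is an activation.
  • $h_i$ is a (nonlinear) activation function (e.g., sigm, tanh, etc.).

For biases, $s_0 = 1$ and $(i, 0) \in \mathcal{E}$ for all $i \in \mathcal{V}$.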

28 / 81

Outputs of a multilayer feed-forward NN (without skip-layer connection)

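• Input signals: $x_1, x_2, \dots, x_D$

• The output signals of the 1st layer:

  $$s_i = h_i\Bigl(\sum_j w_{i,j}\, x_j\Bigr) \quad\text{for all } i \in \mathcal{V} \text{ in the 1st layer}$$

• The output signals of the 2nd layer:

  $$s_k = h_k\Bigl(\sum_i w_{k,i}\, h_i\Bigl(\sum_j w_{i,j}\, x_j\Bigr)\Bigr) \quad\text{for all } k \in \mathcal{V} \text{ in the 2nd layer}$$

• . . .

• Output signals: $y_1, y_2, \dots, y_d$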
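The nested sums above amount to applying one single-layer NN after another. Below is a minimal sketch for the case without skip-layer connections; the weight layout (index 0 of each row holds the bias weight for $s_0 = 1$) and the example network `LAYERS` are assumptions made for illustration.

```python
import math

def sigm(a):
    return 1.0 / (1.0 + math.exp(-a))

def layer_forward(weights, inputs, h):
    """One single-layer NN: weights[i][0] is the bias weight of unit i (s_0 = 1)."""
    signals = [1.0] + list(inputs)
    return [h(sum(w * s for w, s in zip(row, signals))) for row in weights]

def forward(layers, x):
    """layers: list of (weights, activation) pairs, ordered from the 1st layer upward."""
    s = list(x)
    for weights, h in layers:
        s = layer_forward(weights, s, h)
    return s

# Hypothetical 2-layer network: 2 inputs -> 2 sigmoid hidden units -> 1 linear output.
LAYERS = [
    ([[0.0, 1.0, -1.0], [0.0, -1.0, 1.0]], sigm),
    ([[0.5, 1.0, 1.0]], lambda a: a),
]

y = forward(LAYERS, [0.0, 0.0])
```

For the input (0, 0) both hidden units output sigm(0) = 0.5, so the linear output unit returns 0.5 + 0.5 + 0.5 = 1.5.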

29 / 81

Ability of a multilayer feed-forward NN

Theorem (universal approximation) [Cybenko, 1989]

A two-layer feed-forward NN with linear output can approximate any continuous function on a compact domain to arbitrary accuracy if it has enough hidden units.

[Figure: approximations of f(x) = x^2, f(x) = sin(x) and f(x) = |x|]

30 / 81

Training for a multilayer feed-forward NN

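[Figure: layers #1 to #N between the inputs and the outputs; the outputs are compared with the targets]

We have a training set $S \subseteq \mathbb{R}^{D} \times \mathbb{R}^{d}$, i.e., pairs of

• an input vector $x = (x_1, x_2, \dots, x_D)^\top$ and

• a target vector $t = (t_1, t_2, \dots, t_d)^\top$.

Purpose: to compute $w$ such that

$t \approx y$ for all $(x, t) \in S$

where $y = f(x, w)$ is the output vector of the NN.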

31 / 81

The purpose of training for a multilayer NN

The optimum weight vector $w^*$ minimizes the error function $E(w)$:

$$w^* = \arg\min_{w} E(w) \quad\text{where}\quad E(w) = \frac{1}{2} \sum_{(x,t) \in S} \|f(x, w) - t\|^2.$$

However, this is a nonlinear least-squares problem since $f$ is nonlinear. Finding the global optimum is generally very difficult, so we try to find a local optimum instead.

33 / 81

Gradient descent method

Gradient descent method

Finding a local optimum of $f : \mathbb{R} \to \mathbb{R}$ can be achieved by the iteration

$$x^{(\tau+1)} = x^{(\tau)} - \eta f'\bigl(x^{(\tau)}\bigr)$$

where

• $x^{(\tau)}$ is the value of $x$ at the $\tau$-th iteration and

• $\eta > 0$ is a learning rate.
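The iteration is short enough to run as written. A minimal sketch, assuming the example function $f(x) = (x - 3)^2$ and a fixed number of steps (both chosen only for illustration):

```python
def gradient_descent(df, x0, eta=0.1, steps=100):
    """Iterate x <- x - eta * f'(x), starting from x0."""
    x = x0
    for _ in range(steps):
        x = x - eta * df(x)
    return x

# Example: f(x) = (x - 3)^2 has f'(x) = 2 (x - 3) and its minimum at x = 3.
x_min = gradient_descent(lambda x: 2.0 * (x - 3.0), x0=0.0)

if __name__ == "__main__":
    print(x_min)
```

With η = 0.1 the distance to the minimum shrinks by a factor of 0.8 per step; a learning rate that is too large (here, η > 1) would make the iteration diverge.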

34 / 81

Backpropagation

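We find a locally optimal weight vector by

$$w^{(\tau+1)} = w^{(\tau)} - \eta \nabla E\bigl(w^{(\tau)}\bigr).$$

The gradient $\nabla E\bigl(w^{(\tau)}\bigr)$ is computed by backpropagation [Rumelhart et al., 1986]:

• The basic idea is propagation of the error $\delta_i = \partial E / \partial a_i$ from upper to lower layers.

[Figure: units with signals z_i and z_j connected by weights w_ji and w_kj; the errors δ_1, ..., δ_k of the upper units are propagated back to δ_j]

• (The concrete algorithm is omitted here because it is complex; see the appendix.)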

35 / 81

Over-fitting

• Over-fitting: obtaining a $w$ that does not generalize to real data, i.e.,

  • the fitting error for the training set is very small, but
  • the error for real data is very large.

Cause:

• a multilayer NN is sensitive to errors in the training data;
• backpropagation tends to fall into a (bad) local optimum.

37 / 81

Over-fitting

[Figure: an over-fitted decision boundary on a two-class data set (Class 1, Class 2), both axes from −4 to 4; # of layers = 2, # of hidden units = 10, all activation functions are sigmoid.]

38 / 81

Over-fitting and errors

[Figure: E_RMS (0.1 to 0.45) versus # of iterations (0 to 5000) for the training set and the test set]

Evaluation of the root-mean-square (RMS) error

$$E_{\mathrm{RMS}} = \sqrt{2E(w)/N}$$

for

• a training set and

• a test set,

where $N$ is the number of samples in the evaluated set (as in PRML).

• The training error decreases,

• but the test error increases.

39 / 81

Regularization

Regularization (a.k.a. weight decay) is an approach to prevent over-fitting by minimizing, e.g.,

$$E(w) + \frac{\lambda}{2}\|w\|^2$$

where $\lambda > 0$ is a regularization coefficient. ($\frac{\lambda}{2}\|w\|^2$ is a penalty.)

Why can regularization prevent over-fitting?

• Empirically, $\|w\|$ gets large when over-fitting.

• Probabilistically, it corresponds to a slightly more advanced estimation method.

  • $\min E(w)$ maximizes the likelihood $p(S \mid w)$, assuming $p(w)$ is a uniform distribution (maximum likelihood estimation).

  • $\min\bigl(E(w) + \frac{\lambda}{2}\|w\|^2\bigr)$ maximizes the posterior $p(w \mid S)$, assuming $p(w)$ is a zero-mean Gaussian distribution (MAP estimation).
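In a gradient-descent update, the penalty only adds the term $\lambda w$ to the gradient, which is where the name "weight decay" comes from. A minimal sketch with hypothetical helper names:

```python
def regularized_error(error, w, lam):
    """E(w) + (lam / 2) * ||w||^2, given the unregularized error value."""
    return error + 0.5 * lam * sum(wi * wi for wi in w)

def regularized_gradient(grad, w, lam):
    """The gradient of E(w) + (lam / 2) * ||w||^2 is grad E(w) + lam * w."""
    return [g + lam * wi for g, wi in zip(grad, w)]

def decay_step(w, grad, eta, lam):
    """One gradient-descent step on the regularized error."""
    reg = regularized_gradient(grad, w, lam)
    return [wi - eta * gi for wi, gi in zip(w, reg)]

if __name__ == "__main__":
    # With a zero data gradient, every weight shrinks by the factor (1 - eta * lam).
    print(decay_step([1.0, -2.0], [0.0, 0.0], eta=0.1, lam=0.5))
```

Even when the data term has no gradient, each step multiplies the weights by (1 − ηλ), pulling them toward zero.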

40 / 81

Vanishing gradient problem

Vanishing gradient problem

• Errors $\delta_i$ are not propagated to the lower layers if a NN has a large # of layers.

• Most of the error is absorbed by the top few layers.

Cause:

• a multilayer NN is powerful enough to solve a problem without its lower layers (cf. universal approximation).

42 / 81

Vanishing gradient

[Figure: ‖δ‖2 of each layer (1st to 10th) versus # of iterations (0 to 10000), on a logarithmic scale from 1e-12 to 1. Each layer contains 10 units, and all activation functions are sigmoid.]

43 / 81

The second “quiet years” of NNs

The first paragraph of a paper [Simard et al., ICDAR 2003]:

After being extremely popular in the early 1990s, neural networks have fallen out of favor in research in the last 5 years. In 2000, it was even pointed out by the organizers of the Neural Information Processing System (NIPS) conference that the term “neural networks” in the submission title was negatively correlated with acceptance. In contrast, positive correlations were made with support vector machines (SVMs), Bayesian networks, and variational methods.

45 / 81

http://ce.sharif.edu/courses/85-86/2/ce667/resources/root/15%20-%20Convolutional%20N.%20N./fugu9.pdf

Outline

2 Perceptron

4 Deep Learning
  • Unsupervised pretraining by restricted Boltzmann machine
  • Rectified linear units (ReLU) and Maxout
  • Dropout

46 / 81

Deep Learning

Deep Learning (2006–)

• A set of algorithms for deeply structured NNs (of about 7, 8 or more layers)

• A recent research trend in machine learning

• Actively used for image processing, speech recognition, natural language processing, etc.

• Also applied to analysis of programming languages

47 / 81


Supervised & unsupervised training

Supervised training

• Training using a training set containing inputs and targets

• To obtain $f : X \to Y$ from $S \subseteq X \times Y$ ($f(x) \approx t$ for most $(x, t) \in S$)

• Typical problems:

  • Classification: giving labels for inputs
  • Regression: finding the relationship between two continuous variables

Unsupervised training

• Training using a training set containing only inputs

• To obtain $f : X \to Y$ from $S \subseteq X$

• Typical problems:

  • Clustering: finding a grouping of inputs
  • Dimensionality reduction: converting high-dimensional vectors into low-dimensional ones (preserving the original information)

49 / 81

Unsupervised pretraining by restricted Boltzmann machine
Hinton and Salakhutdinov, Reducing the Dimensionality of Data with Neural Networks, Science, Vol. 313, No. 5786, pp. 504–507, 2006.

Autoencoder network

• A NN for (nonlinear) dimensionality reduction

• Difficulty of finding appropriate weights

• The initial weights must be close to a good solution.

Hinton’s approach

• Process:

  1. Pretraining: layer-wise computation of good initial weights by restricted Boltzmann machine (RBM)
  2. Unrolling: construction of an autoencoder network using the weights
  3. Fine-tuning: adjustment of the whole network by backpropagation

• Actually, the initial weights are suitable for other problems as well.

• Empirically, weights that yield a lower-dimensional representation of the inputs make good initial weights.

50 / 81

http://www.cs.toronto.edu/~hinton/science.pdf

Autoencoder

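[Figure: an autoencoder; the input signals x pass through the hidden layer h (weights W) to the output layer y (weights W), which emits the output signals]

• A two-layer NN that reconstructs its inputs ($x \approx y$)

• Bottleneck: # of hidden units < # of inputs

• Computation:

  • Encoder: $h = f(Wx + b)$
  • Decoder: $y = f(W^\top h + b')$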
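The encoder and decoder share one weight matrix, used once as $W$ and once as $W^\top$. The sketch below illustrates that computation with plain lists; choosing the sigmoid for $f$ and the squared reconstruction error are assumptions, and training is not shown.

```python
import math

def sigm(a):
    return 1.0 / (1.0 + math.exp(-a))

def encode(W, b, x):
    """h = f(W x + b); W has one row per hidden unit."""
    return [sigm(sum(wij * xj for wij, xj in zip(row, x)) + bi)
            for row, bi in zip(W, b)]

def decode(W, b_prime, h):
    """y = f(W^T h + b'); the decoder reuses W transposed (tied weights)."""
    n_inputs = len(W[0])
    return [sigm(sum(W[i][j] * h[i] for i in range(len(W))) + b_prime[j])
            for j in range(n_inputs)]

def reconstruction_error(x, y):
    """Squared error between the input and its reconstruction (assumed error measure)."""
    return 0.5 * sum((yi - xi) ** 2 for xi, yi in zip(x, y))

# Hypothetical bottleneck: 3 inputs -> 2 hidden units -> 3 outputs.
W = [[0.5, -0.5, 0.0], [0.0, 0.5, -0.5]]
x = [1.0, 0.0, 1.0]
h = encode(W, [0.0, 0.0], x)
y = decode(W, [0.0, 0.0, 0.0], h)
```

Fine-tuning would adjust `W`, `b` and `b'` so that `reconstruction_error(x, y)` becomes small over the training set.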

51 / 81

Restricted Boltzmann machine (RBM)

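[Figure: an RBM with visible units v_1, ..., v_m (biases b_1, ..., b_m), hidden units h_1, ..., h_n (biases c_1, ..., c_n) and weights w_ij between the two layers]

$v = (v_1, \dots, v_m) \in \{0, 1\}^m$

$h = (h_1, \dots, h_n) \in \{0, 1\}^n$

$b = (b_1, \dots, b_m) \in \mathbb{R}^m$

$c = (c_1, \dots, c_n) \in \mathbb{R}^n$

$W = \{w_{ij}\} \in \mathbb{R}^{m \times n}$

$\theta = (b, c, W)$ (parameters)

An RBM is an undirected graph constructed of

• a visible layer (for an input vector $v$) and

• a hidden layer (as a low-dimensional representation $h$).

Estimation of $\theta$: maximizing the likelihood $p(v \mid \theta)$ where

• $p(v, h \mid \theta) = \exp(-E(v, h)) / Z$ (with $Z$ the normalizing constant) and

• $E(v, h) = -b^\top v - c^\top h - v^\top W h$

by Contrastive Divergence learning (an approximation).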
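For a very small RBM, the energy and the joint distribution can be computed by brute force, which makes the definitions concrete. The sketch below is an illustration only: it enumerates all states instead of using Contrastive Divergence, and the parameter values are made up. The conditional $p(h_j = 1 \mid v) = \mathrm{sigm}(c_j + \sum_i v_i w_{ij})$ follows from the energy function.

```python
import itertools
import math

def energy(v, h, b, c, W):
    """E(v, h) = -b^T v - c^T h - v^T W h, where W[i][j] links v_i and h_j."""
    bv = sum(bi * vi for bi, vi in zip(b, v))
    ch = sum(cj * hj for cj, hj in zip(c, h))
    vWh = sum(v[i] * W[i][j] * h[j] for i in range(len(v)) for j in range(len(h)))
    return -bv - ch - vWh

def joint_probabilities(b, c, W):
    """Brute-force p(v, h) = exp(-E(v, h)) / Z; only feasible for tiny m and n."""
    states = [(v, h)
              for v in itertools.product((0, 1), repeat=len(b))
              for h in itertools.product((0, 1), repeat=len(c))]
    unnormalized = {s: math.exp(-energy(s[0], s[1], b, c, W)) for s in states}
    Z = sum(unnormalized.values())
    return {s: u / Z for s, u in unnormalized.items()}

def p_hidden_on(j, v, c, W):
    """p(h_j = 1 | v) = sigm(c_j + sum_i v_i * W[i][j])."""
    a = c[j] + sum(v[i] * W[i][j] for i in range(len(v)))
    return 1.0 / (1.0 + math.exp(-a))

# Hypothetical parameters: m = 2 visible units, n = 1 hidden unit.
B, C, WEIGHTS = [0.1, -0.2], [0.3], [[0.5], [-1.0]]
P = joint_probabilities(B, C, WEIGHTS)
```

The number of states grows as 2^(m+n), which is why the normalizing constant Z cannot be computed exactly for realistic sizes and an approximation is used instead.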

52 / 81

Process of pretraining, unrolling and fine-tuning

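[Figure: (1) pretraining: a stack of three RBMs with weights W1, W2 and W3 is trained layer by layer, starting from the input vector; (2) unrolling: the weights are copied into an encoder (W1, W2, W3), a code layer and a decoder (W3, W2, W1); (3) fine-tuning: the six layers (1 to 6) of the unrolled network are adjusted together]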

53 / 81

Dimensionality reduction for retrieved documents

2000-500-250-125-2 autoencoder for 804,414 newswire stories

54 / 81


The problem of logistic sigmoid function

[Figure: plot of sigm(a) for a from −6 to 6, with values between 0 and 1]

• The output of a unit is given by $f(w^\top x)$ where

  $$f(a) = \mathrm{sigm}(a) = \frac{1}{1 + \exp(-a)}.$$

• The gradient of sigm shrinks as training progresses (the function saturates when $|a|$ is large).

• This makes training slow.

High precision requires long training!

56 / 81

Rectified linear units (ReLU)
Nair and Hinton, Rectified Linear Units Improve Restricted Boltzmann Machines, ICML 2010.

[Figure: plots of ReLU and softplus for a from −4 to 4]

Softplus function: $f(a) = \log(1 + e^a)$

• More biologically plausible than sigmoid (rarely saturated)

Rectified linear units (ReLU): a practical approximation of softplus

• ReLU: $f(a) = \max(0, a)$

• NReLU (Noisy ReLU): $f(a) = \max(0, a + \mathcal{N}(0, \sigma(a)))$

Features of ReLUs:

• Fast convergence

• High precision for real data
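Softplus and ReLU are easy to compare numerically. A minimal sketch (the noisy variant is left out):

```python
import math

def softplus(a):
    """f(a) = log(1 + e^a); may overflow for very large a."""
    return math.log(1.0 + math.exp(a))

def relu(a):
    """f(a) = max(0, a)."""
    return max(0.0, a)

if __name__ == "__main__":
    for a in (-4.0, -1.0, 0.0, 1.0, 4.0):
        print(a, softplus(a), relu(a))
```

Softplus is always slightly above ReLU, and the gap vanishes for large |a|. Unlike the sigmoid, the slope of ReLU stays at 1 for every positive input, so it does not saturate there.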

57 / 81

http://machinelearning.wustl.edu/mlpapers/paper_files/icml2010_NairH10.pdf

Maxout
Goodfellow et al., Maxout Networks, ICML 2013.

[Figure: an example of maxout output, plotted for inputs from −4 to 4]

Maxout: f(x) = max(x1, x2, . . . , xn)

• An activation function that outputs the max of inputs

• Equal to a piecewise linear function

• Higher precision for real data
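To see why maxout equals a piecewise linear function, consider a one-dimensional input where each candidate passed to the max is an affine function of x. That affine form follows the Maxout paper and is an assumption relative to the slide; the piece lists below are made-up examples.

```python
def maxout(z):
    """f(z) = max(z_1, ..., z_n)."""
    return max(z)

def maxout_unit(pieces, x):
    """pieces: list of (slope, intercept); output is max_k (slope_k * x + intercept_k)."""
    return maxout([slope * x + intercept for slope, intercept in pieces])

ABS_PIECES = [(1.0, 0.0), (-1.0, 0.0)]   # max(x, -x) = |x|
RELU_PIECES = [(1.0, 0.0), (0.0, 0.0)]   # max(x, 0) = ReLU(x)

if __name__ == "__main__":
    for x in (-2.0, -0.5, 0.0, 1.5):
        print(x, maxout_unit(ABS_PIECES, x), maxout_unit(RELU_PIECES, x))
```

With two pieces the unit can represent |x| or ReLU exactly; more pieces give a finer convex piecewise linear shape, and the slopes and intercepts are learned.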

[Figure: MNIST classification error versus network depth (# of layers from 1 to 7, error from 0.00 to 0.16); curves: maxout test error, rectifier test error, maxout train error, rectifier train error]

58 / 81

http://www-etud.iro.umontreal.ca/~goodfeli/maxout.pdf


Dropout
Hinton, et al. Improving neural networks by preventing co-adaptation of feature detectors. Technical report, arXiv:1207.0580, 2012.

[Figure: dropout probabilities (frequently used values): 20% for the inputs and 50% for each of the three hidden layers; the output layer is marked 100%]

• An important approach to prevent over-fitting

• The performance was better than other regularization methods.

• It randomly drops inputs & hidden units during training.

[Figure: during training, a unit is present with probability p and its weight is w; at test time, the unit is always present and its weight is scaled to pw]
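The difference between training and test can be sketched for a single weighted sum. The helper names are hypothetical; `p` is the probability that a unit is present, as in the figure.

```python
import random

def dropout_mask(n, p, rng):
    """Each unit is present (1) with probability p and dropped (0) otherwise."""
    return [1 if rng.random() < p else 0 for _ in range(n)]

def train_activation(w, s, mask):
    """Weighted sum during training: dropped units contribute nothing."""
    return sum(wi * si * mi for wi, si, mi in zip(w, s, mask))

def inference_activation(w, s, p):
    """Weighted sum at test time: all units are present, weights are scaled to p * w."""
    return sum(p * wi * si for wi, si in zip(w, s))

if __name__ == "__main__":
    rng = random.Random(0)
    w, s, p = [1.0, 2.0, 3.0], [1.0, 1.0, 1.0], 0.5
    mask = dropout_mask(len(w), p, rng)
    print(mask, train_activation(w, s, mask), inference_activation(w, s, p))
```

Scaling the weights by p makes the test-time sum equal to the expected value of the training-time sum over all random masks.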

60 / 81

http://arxiv.org/pdf/1207.0580.pdf

Dropout is like bagging

Bagging

[Figure: training: training sets #1 to #N are drawn from the training set by random sampling with replacement, and classifier #k is trained on training set #k; test (classification): the input x is given to every classifier, and the output y is the majority vote over y1, y2, ..., yN]

Dropout

• Each neuron is trained using different data.
• It is similar to using a combination of many NNs.

61 / 81

Effect of dropout of inputs

MNIST test set (handwritten digits) / classification / 784-{800-800, 1200-1200, 2000-2000, 1200-1200-1200}-10

62 / 81

Effect of dropout

TIMIT core test set (English speech recognition) / classification / 4 fully-connected hidden layers × 4000 units

63 / 81

Outline

2 Perceptron

4 Deep Learning

5 Application of deep learning
  • Image processing
  • Other fields
  • Programming language analysis

64 / 81


Image processing and feature extraction

Conventional approaches (shallow learning)

[Figure: raw data (RGB pixels) → feature extraction → features (SIFT, SURF, etc.) → classifier → label(s), e.g. "Animal"]

• Features needed to be designed by hand (SIFT, SURF, etc.).

• Craftsmanship was required.

Deep learning

[Figure: raw data (RGB pixels) → features → label(s), e.g. "Animal"]

• Information corresponding to features is automatically obtained by a deep CNN.

In addition, accuracy is higher than with the traditional approaches!

66 / 81

ILSVRC 2012
Krizhevsky, Sutskever & Hinton, ImageNet Classification with Deep Convolutional Neural Networks, pp. 1097–1105, NIPS 2012.

• Large Scale Visual Recognition Challenge 2012 (ILSVRC 2012)

  • A competition for object recognition in photos (1000 labels)
  • Hinton’s team (Univ. of Toronto) outclassed the other teams.

  Team name            Error
  SuperVision          0.15315
  ISI                  0.26172
  OXFORD VGG           0.26979
  XRCE/INRIA           0.27058
  Univ. of Amsterdam   0.29576
  etc.

67 / 81

http://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf

http://image-net.org/challenges/LSVRC/2012/results.html

Google’s grandmother neurons
Le et al., Building High-level Features Using Large Scale Unsupervised Learning, ICML 2012.

Pretraining by 10 million 200x200 pixel images

• 9-layer autoencoder

• 1 billion connections

Neurons responding to a specific stimulus (grandmother neurons) are automatically obtained.

[Figure: visualizations of neurons responding to human faces, cat faces and human bodies]

Image recognition accuracy is improved by training after the pretraining.

68 / 81

http://static.googleusercontent.com/media/research.google.com/ja//archive/unsupervised_icml2012.pdf


Deep Q-learning
Mnih et al., Playing Atari with Deep Reinforcement Learning, CoRR abs/1312.5602, 2013.

• Reinforcement learning of Atari 2600 games by a deep CNN (input: raw images of the game screen)

• Beating an expert human player at Breakout, Enduro and Pong

• However, losing at Q*bert, Seaquest and Space Invaders

  • A long-term strategy is required.

[Figure: screenshots of Pong, Breakout, Space Invaders, Seaquest and Beam Rider]

70 / 81

http://www.cs.toronto.edu/~vmnih/docs/dqn.pdf


Learning to execute
Zaremba & Sutskever, Learning to Execute, CoRR abs/1410.4615, 2014.

Can NNs predict the execution output of short programs written in an unknown language?

Input:
  j=8584
  for x in range(8):j+=920
  b=(1500+j)
  print((b+7564))

Target: 25008.
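The target of such a sample is simply what a Python interpreter prints for the program. The sketch below computes it by executing the source; `run_program` is a hypothetical helper, and the trailing "." is assumed to be the end-of-output marker used in the targets.

```python
import contextlib
import io

PROGRAM = """\
j=8584
for x in range(8):j+=920
b=(1500+j)
print((b+7564))
"""

def run_program(source):
    """Execute a short program and return its printed output followed by '.'."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        exec(source, {})  # fresh namespace; only run trusted snippets this way
    return buffer.getvalue().strip() + "."

target = run_program(PROGRAM)

if __name__ == "__main__":
    print(target)
```

Here j becomes 8584 + 8 × 920 = 15944, b becomes 17444, and the printed value is 17444 + 7564 = 25008. The model has to produce this string from the characters of the program alone.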

72 / 81

http://cs.nyu.edu/~zaremba/docs/Learning%20to%20Execute.pdf

http://arxiv.org/pdf/1410.4615v1.pdf

Problem setting

Model: recurrent neural network (RNN) with long short-term memory (LSTM)

• 2 layers × 400 units

• A program is given as a character stream.

• The model does not know the syntax and semantics of a given program.

Short programs

• The Python syntax

• Containing addition, multiplication, variable assignment, if-statements, and for-loops

  • Double loops are forbidden.
  • One of the operands of a multiplication, and the range of a for-loop, is constant.

• Complexity parameters of programs:

  • length: the maximum # of digits in the integers of a program
  • nesting: the depth of the parse tree

73 / 81

Examples of short programs

length = 4, nesting = 3:

Input:
  j=8584
  for x in range(8):j+=920
  b=(1500+j)
  print((b+7564))
Target: 25008.

Input:
  i=8827
  c=(i-5347)
  print((c+8704) if 2641<8500 else 5308)
Target: 12184.

An example with scrambled characters:

Input:
  vqppknsqdvfljmncy2vxdddsepnimcbvubkomhrpliibtwztbljipcc
Target: hkhpg

74 / 81

Exact prediction examples

Input:
  f=(8794 if 8887<9713 else (3*8334))
  print((f+574))
Target: 9368.
Model prediction: 9368.

Input:
  c=445
  d=(c-4223)
  for x in range(1):d+=5272
  print((8942 if d<3749 else 2951))
Target: 8942.
Model prediction: 8942.

75 / 81

Misprediction examples


Input:
  j=8584
  for x in range(8):j+=920
  b=(1500+j)
  print((b+7567))
Target: 25011.
Model prediction: 23011.

Input:
  a=1027
  for x in range(2):a+=(402 if 6358>8211 else 2158)
  print(a)
Target: 5343.
Model prediction: 5293.

76 / 81

Training strategies

• Baseline

  • Learning with the target distribution (with length = a, nesting = b)

• Naive (naive curriculum learning) [Bengio et al., 2009]

  • Gradually increasing the “difficulty level” of training samples
  • Giving them from (length, nesting) = (1, 1) to (a, b)

• Mix (mixed strategy)

  • Mix of all levels of hardness
  • Picking length ∈ [1, a] and nesting ∈ [1, b] independently for every sample

• Combined (combined strategy)

  • Combination of mix with naive curriculum learning
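The difference between the baseline and the mixed strategy lies only in how the complexity parameters of each training sample are chosen. A minimal sketch with hypothetical function names; drawing the values uniformly is an assumption, and the naive and combined schedules are not shown.

```python
import random

def sample_baseline(a, b, rng):
    """Baseline: every sample uses the target complexity (length = a, nesting = b)."""
    return a, b

def sample_mix(a, b, rng):
    """Mix: length in [1, a] and nesting in [1, b], picked independently per sample."""
    return rng.randint(1, a), rng.randint(1, b)

rng = random.Random(0)
levels = [sample_mix(4, 3, rng) for _ in range(5)]

if __name__ == "__main__":
    print(levels)
```

Each returned pair would then be handed to a program generator that produces a training program of that length and nesting.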

77 / 81

Absolute prediction accuracy (baseline & combined)

78 / 81

Relative prediction accuracy (naive, mix & combined)

Naive Mix Combined

79 / 81

Conclusion

• Naive curriculum learning is sometimes worse than the baseline.

  • Naive: giving training samples from (length, nesting) = (1, 1) to (a, b)
  • The model has to reorganize its memory to handle larger numbers (e.g., 5 digits → 6 digits).
  • This reorganization of memory patterns might be difficult.

• The authors said “we don’t know how much our networks understand the meaning of programs.”

80 / 81

Summary

• Fundamentals of machine learning

  • Feature vectors
  • Classification
  • Dimensionality reduction

• History of NNs

  • Perceptron and single-layer NNs
  • Multi-layer NNs, gradient descent method and backpropagation
  • Deep learning

    • Pretraining by RBM
    • Rectified linear units (ReLU) and Maxout
    • Dropout

• Applications

  • Image processing (ILSVRC 2012, Google’s grandmother neurons)
  • Programming language analysis (Learning to execute)

81 / 81

APPENDIX

82 / 81

Outline

6 Derivation of backpropagation

83 / 81

Derivation of backpropagation (1)

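The update of weight $w_{i,j}$ is defined by

$$w_{i,j}^{(\tau+1)} = w_{i,j}^{(\tau)} - \eta \frac{\partial E}{\partial w_{i,j}^{(\tau)}}.$$

The derivative is

$$\frac{\partial E}{\partial w_{i,j}^{(\tau)}} = \sum_{(x,t) \in S} \frac{\partial E}{\partial a_i^{(\tau)}(x)} \frac{\partial a_i^{(\tau)}(x)}{\partial w_{i,j}^{(\tau)}} = \sum_{(x,t) \in S} \delta_i^{(\tau)}(x)\, s_j^{(\tau)}(x)$$

where

$$\delta_i^{(\tau)}(x) \equiv \frac{\partial E}{\partial a_i^{(\tau)}(x)}.$$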

84 / 81

Derivation of backpropagation (2)

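• If $i \in \mathcal{V}$ is in the output layer, then

$$\delta_i^{(\tau)}(x) \equiv \frac{\partial E}{\partial a_i^{(\tau)}(x)} = \frac{\partial E}{\partial f_i^{(\tau)}(x)} \frac{\partial f_i^{(\tau)}(x)}{\partial a_i^{(\tau)}(x)} = E'\bigl(f_i^{(\tau)}(x)\bigr)\, h_i'\bigl(a_i^{(\tau)}(x)\bigr).$$

• Otherwise we obtain

$$\delta_i^{(\tau)}(x) = \sum_{k \text{ s.t. } (k,i) \in \mathcal{E}} \frac{\partial E}{\partial a_k^{(\tau)}(x)} \frac{\partial a_k^{(\tau)}(x)}{\partial a_i^{(\tau)}(x)} = \sum_{k \text{ s.t. } (k,i) \in \mathcal{E}} \frac{\partial E}{\partial a_k^{(\tau)}(x)} \frac{\partial a_k^{(\tau)}(x)}{\partial s_i^{(\tau)}(x)} \frac{\partial s_i^{(\tau)}(x)}{\partial a_i^{(\tau)}(x)} = \Bigl(\sum_{k \text{ s.t. } (k,i) \in \mathcal{E}} \delta_k^{(\tau)}(x)\, w_{k,i}\Bigr) h_i'\bigl(a_i^{(\tau)}(x)\bigr).$$

(The sums run over the edges $(k, i) \in \mathcal{E}$, i.e., over the units $k$ that receive the signal of unit $i$.)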

85 / 81

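Algorithm: Backpropagation (batch mode)

  Initialize $w^{(0)}$ randomly.
  for $\tau = 0, 1, 2, \dots$ do
    Compute signals $s_i^{(\tau)}(x)$ for all $i \in \mathcal{V}$, $(x, t) \in S$ from lower to upper layers.
    Update weights for all $(i, j) \in \mathcal{E}$ from upper to lower layers by

      $$w_{i,j}^{(\tau+1)} \leftarrow w_{i,j}^{(\tau)} - \eta\, \Delta w_{i,j}^{(\tau)}$$

    where

      $$\Delta w_{i,j}^{(\tau)} = \sum_{(x,t) \in S} \delta_i^{(\tau)}(x)\, s_j^{(\tau)}(x)$$

    and

      $$\delta_i^{(\tau)}(x) = \begin{cases} \bigl(y_i^{(\tau)}(x) - t_i\bigr)\, h_i'\bigl(a_i^{(\tau)}(x)\bigr) & \text{if } i \in \mathcal{V} \text{ is in the output layer,} \\ \Bigl(\sum_{k \text{ s.t. } (k,i) \in \mathcal{E}} \delta_k^{(\tau)}(x)\, w_{k,i}\Bigr) h_i'\bigl(a_i^{(\tau)}(x)\bigr) & \text{otherwise.} \end{cases}$$

  end for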
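The algorithm can be sketched for the smallest interesting case: a two-layer NN with sigmoid hidden units and linear outputs, trained on the squared error. The network shape, the XOR data, the initial weight range and the learning rate are assumptions chosen for illustration; index 0 of every weight row holds the bias weight ($s_0 = 1$).

```python
import math
import random

def sigm(a):
    return 1.0 / (1.0 + math.exp(-a))

def forward(W1, W2, x):
    """Return (inputs incl. bias, hidden signals incl. bias, outputs)."""
    xs = [1.0] + list(x)
    hidden = [1.0] + [sigm(sum(w * s for w, s in zip(row, xs))) for row in W1]
    out = [sum(w * s for w, s in zip(row, hidden)) for row in W2]
    return xs, hidden, out

def error(W1, W2, data):
    """E(w) = 1/2 * sum over (x, t) of ||f(x, w) - t||^2."""
    total = 0.0
    for x, t in data:
        _, _, y = forward(W1, W2, x)
        total += 0.5 * sum((yi - ti) ** 2 for yi, ti in zip(y, t))
    return total

def gradients(W1, W2, data):
    """Batch-mode backpropagation: returns (dE/dW1, dE/dW2)."""
    g1 = [[0.0] * len(row) for row in W1]
    g2 = [[0.0] * len(row) for row in W2]
    for x, t in data:
        xs, hidden, y = forward(W1, W2, x)
        # Output layer: delta_k = (y_k - t_k) * h'(a_k), and h' = 1 for linear outputs.
        d_out = [yk - tk for yk, tk in zip(y, t)]
        # Hidden layer: delta_i = (sum_k delta_k * w_{k,i}) * h'(a_i), sigm' = s * (1 - s).
        d_hid = []
        for i in range(len(W1)):
            s_i = hidden[i + 1]  # index 0 of hidden is the bias signal
            back = sum(d_out[k] * W2[k][i + 1] for k in range(len(W2)))
            d_hid.append(back * s_i * (1.0 - s_i))
        for k in range(len(W2)):
            for j in range(len(hidden)):
                g2[k][j] += d_out[k] * hidden[j]
        for i in range(len(W1)):
            for j in range(len(xs)):
                g1[i][j] += d_hid[i] * xs[j]
    return g1, g2

def train(W1, W2, data, eta=0.05, iterations=2000):
    """Repeat w <- w - eta * (sum of delta_i * s_j) for all weights."""
    for _ in range(iterations):
        g1, g2 = gradients(W1, W2, data)
        W1 = [[w - eta * g for w, g in zip(rw, rg)] for rw, rg in zip(W1, g1)]
        W2 = [[w - eta * g for w, g in zip(rw, rg)] for rw, rg in zip(W2, g2)]
    return W1, W2

rng = random.Random(0)
W1 = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(3)]  # 2 inputs + bias -> 3 hidden
W2 = [[rng.uniform(-1, 1) for _ in range(4)]]                    # 3 hidden + bias -> 1 output
XOR = [((0, 0), (0,)), ((0, 1), (1,)), ((1, 0), (1,)), ((1, 1), (0,))]
```

A common sanity check for such code is to compare one entry of the backpropagated gradient with a finite-difference estimate of the error; calling `train(W1, W2, XOR)` then runs the batch updates.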

86 / 81
