
Deep Learning

Neural networks have come back!

Akinori ABE (M1)

Sumii Laboratory, Graduate School of Information Science

Tohoku University

Dec 8, 2014

Slides: akabe.github.io/pub/HistoryOfNeuralNetworksAndDeepLearning.pdf


Deep learning and neural network

(Artificial) Neural Network (ANN, NN)

• An information processing model imitating the biological nervous system

• A network constructed of

   • units corresponding to neurons and
   • layers containing them

[Figure: a four-layer network; the inputs feed the 1st (input) layer, signals pass through the 2nd and 3rd layers, and the 4th layer is the output layer.]

Deep Learning

• A set of algorithms for deeply structured NNs (of about 7, 8 or more layers)

• A trend in recent machine-learning research

• Actively used for image processing, speech recognition, natural language processing, etc.

• Also applied to analysis of programming languages


History of NNs

1943  Threshold logic units [McCulloch and Pitts]
1949  Hebbian learning rule [Hebb]
1950–1960  The first golden age
1957  Perceptron [Rosenblatt]
1969  Limitations of perceptron [Minsky and Papert]
1970s  The first “quiet years”
1980s  The second golden age
1986  Backpropagation [Rumelhart, Hinton and Williams]
mid 1990s–early 2000s  The second “quiet years”
2006–  The third golden age (Deep Learning)
2006  Pretraining by restricted Boltzmann machine [Hinton and Salakhutdinov]
2010  Rectified linear unit (ReLU) [Nair and Hinton]
2012  Dropout [Hinton, et al.]


Outline

1 Fundamentals of machine learning: classification

2 Perceptron

3 Multilayer feed-forward NN and backpropagation

4 Deep Learning

5 Application of deep learning


Outline

1 Fundamentals of machine learning: classification

2 Perceptron

3 Multilayer feed-forward NN and backpropagation

4 Deep Learning

5 Application of deep learning


Is this apple tasty (for you)?

The apple was produced in Aomori, and the # of pips is five.


Empirical estimation

• The apples you have eaten.

x1 (# of pips)   x2 (Aomori or not)   y (tasty)
3                +1                   +1
4                +1                   +1
3                −1                   −1
4                −1                   −1
3                −1                   −1
6                −1                   +1

• Is the unknown apple tasty?

x1 (# of pips)   x2 (Aomori or not)   y (tasty)
5                +1                   ?


Consider the coordinate space

[Figure: the apples plotted in the plane of x1 (# of pips) versus x2 (Aomori or not); the tasty (+1) and not tasty (−1) apples fall in different regions, and the unknown apple lies among the tasty ones.]

The unknown apple is probably tasty.


Decision boundary

[Figure: the same plot with a decision boundary separating the tasty (+1) from the not tasty (−1) apples; the weight vector is a normal vector of the boundary.]

F(x) = w0 + w1 x1 + w2 x2 = wᵀx where

• a weight vector w = (−3, 1, 2)ᵀ and

• a feature vector x = (1, x1, x2).

• y = +1 if F(x) > 0

• y = −1 if F(x) < 0


Classification

• We have a (large) training set S ⊆ R^(D+1) × {+1, −1}, i.e., pairs of

   • an input vector x and
   • a target value t (of y).

• We classify an input vector x by

   y = +1 if F(x) > 0,   y = −1 if F(x) < 0,   where F(x) = wᵀx.

• We want to compute w s.t.

   t F(x) = t wᵀx > 0   for all (x, t) ∈ S.

Expectation: if training data can be classified, unknown data can be done as well.
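To make the decision rule concrete, here is a minimal NumPy sketch (not part of the original slides) of y = sign(F(x)) and the check t·wᵀx > 0, using the toy apple data and the illustrative weight vector w = (−3, 1, 2)ᵀ from the previous slide:

import numpy as np

# Illustrative weight vector from the decision-boundary slide: w = (-3, 1, 2)^T.
w = np.array([-3.0, 1.0, 2.0])

def classify(w, x1, x2):
    # Return y = +1 if F(x) = w^T (1, x1, x2) > 0, else -1.
    x = np.array([1.0, x1, x2])               # feature vector with the bias input 1
    return +1 if w @ x > 0 else -1

# Toy apple training set: (# of pips, Aomori or not, tasty).
S = [(3, +1, +1), (4, +1, +1), (3, -1, -1),
     (4, -1, -1), (3, -1, -1), (6, -1, +1)]

# w classifies S correctly iff t * w^T x > 0 for every (x, t) in S.
print(all(t * (w @ np.array([1.0, x1, x2])) > 0 for x1, x2, t in S))   # True
print(classify(w, 5, +1))                     # the unknown apple: +1 (tasty)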


Outline

1 Fundamentals of machine learning: classification

2 Perceptron
   Introduction to perceptron
   Single-layer NNs and threshold logic units
   The first “quiet years” of NNs

3 Multilayer feed-forward NN and backpropagation

4 Deep Learning

5 Application of deep learning


Outline

1 Fundamentals of machine learning: classification

2 Perceptron
   Introduction to perceptron
   Single-layer NNs and threshold logic units
   The first “quiet years” of NNs

3 Multilayer feed-forward NN and backpropagation

4 Deep Learning

5 Application of deep learning


Perceptron

Perceptron [Rosenblatt, 1957]

• A fundamental classification approach

• A single-layer NN (described later)

Algorithm: Training of Perceptron

   Initialize w.
   repeat
      Get (x, t) ∈ S with replacement.
      if t wᵀx < 0 then
         w ← w + tx
      end if
   until converged

The algorithm updates w if a prediction is wrong.
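A minimal runnable sketch of this training loop (NumPy assumed; the fixed iteration budget and the toy apple data are illustrative simplifications, and the update here also fires on the boundary t·wᵀx = 0 so that an all-zero initial w gets updated):

import numpy as np

def train_perceptron(S, n_iter=1000, rng=np.random.default_rng(0)):
    # S is a list of (x, t) pairs with x already including the bias input 1.
    w = np.zeros(len(S[0][0]))                # initialize w
    for _ in range(n_iter):                   # "until converged", simplified to a budget
        x, t = S[rng.integers(len(S))]        # get (x, t) from S with replacement
        if t * (w @ x) <= 0:                  # prediction is wrong (or on the boundary)
            w = w + t * x                     # w <- w + t x
    return w

# Toy apple data: x = (1, # of pips, Aomori or not), t = tasty (+1) or not (-1).
S = [(np.array([1.0, 3, +1]), +1), (np.array([1.0, 4, +1]), +1),
     (np.array([1.0, 3, -1]), -1), (np.array([1.0, 4, -1]), -1),
     (np.array([1.0, 3, -1]), -1), (np.array([1.0, 6, -1]), +1)]
w = train_perceptron(S)
print(w, all(t * (w @ x) > 0 for x, t in S))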


Updating a decision boundary

Updating by w ← w + tx (if t wᵀx < 0, i.e., the prediction is wrong):

(a) If false negative (F (x) < 0 but t = +1), w is updated to w + x.

(b) If false positive (F (x) > 0 but t = −1), w is updated to w − x.

[Figure: the decision boundary before and after updating, for (a) a false negative and (b) a false positive.]


Why can a perceptron learn?

(a) False negative (F(x) < 0 but t = +1)

   • The updated weight vector: w′ := w + x
   • Trying to classify x again: w′ᵀx = (w + x)ᵀx = wᵀx + xᵀx

      • wᵀx: the misprediction (negative)
      • xᵀx = ‖x‖²: the squared L2 norm (always positive)

   Thus wᵀx gets closer to t.

(b) False positive (F(x) > 0 but t = −1)

   • The updated weight vector: w′ := w − x
   • Trying to classify x again: w′ᵀx = (w − x)ᵀx = wᵀx − xᵀx

      • wᵀx: the misprediction (positive)
      • xᵀx = ‖x‖²: the squared L2 norm (always positive)

   Thus wᵀx gets closer to t.


Perceptron convergence theorem

Theorem (perceptron convergence) [Rosenblatt, 1962]

A perceptron always converges and returns w s.t.

t wᵀx > 0   for all (x, t) ∈ S

if a training set is linearly separable.

Definition (linear separability)

In binary classification, a data set is linearly separable if all points of the two classes in the set can be discriminated by some hyperplane.

However, a perceptron converges very slowly for real data.


Outline

1 Fundamentals of machine learning: classification

2 Perceptron
   Introduction to perceptron
   Single-layer NNs and threshold logic units
   The first “quiet years” of NNs

3 Multilayer feed-forward NN and backpropagation

4 Deep Learning

5 Application of deep learning


Artificial neural network

(Artificial) Neural Network (ANN, NN)

• An information processing model imitating the biological nervous system

   • An NN is not a strict model of the biological nervous system, but the analogy is frequently used as an intuitive explanation.

• A network constructed of

   • units corresponding to neurons and
   • layers containing them

[Figure: a four-layer network; the inputs feed the 1st (input) layer, signals pass through the 2nd and 3rd layers, and the 4th layer is the output layer.]


Artificial neurons

Threshold logic units (TLU) [McCulloch and Pitts, 1943]

An artificial neuron (i.e., a unit)

• takes input signals x1, . . . , xD,

• is activated if

   Σ_{i=1}^{D} wi xi > θ,   and

• outputs y = +1 if activated (otherwise y = 0).

Hebbian learning rule [Hebb, 1949]

• Learning is achieved by changing the efficiency of signal propagation between two biological neurons in the long term.

• This corresponds to updating wi.


Units

• x = (1, x1, . . . , xD) ∈ R^(D+1) is an input vector.

• w = (w0, w1, . . . , wD) ∈ R^(D+1) is a weight vector.

• y is an output given by

   y = h(a)   where   a = wᵀx.

   • a is an activation.
   • h is a (nonlinear) activation function.


Examples of activation function

[Figure: plots of H(a), sigm(a), and tanh(a) over a ∈ [−6, 6].]

Step function:

   H(a) = +1 if a > 0,   −1 if a < 0.

Logistic sigmoid function:

   sigm(a) = 1 / (1 + exp(−a)).

Hyperbolic tangent:

   tanh(a) = (exp(a) − exp(−a)) / (exp(a) + exp(−a)).
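The three activation functions as a small NumPy sketch (not from the slides; H(a) is set to −1 at a = 0 only to keep the function total):

import numpy as np

def step(a):                       # H(a): +1 for a > 0, -1 otherwise
    return np.where(a > 0, 1.0, -1.0)

def sigm(a):                       # logistic sigmoid: 1 / (1 + exp(-a))
    return 1.0 / (1.0 + np.exp(-a))

def tanh(a):                       # hyperbolic tangent
    return np.tanh(a)

a = np.linspace(-6, 6, 5)
print(step(a), sigm(a), tanh(a), sep="\n")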


Single-layer NN

[Figure: a single-layer NN; the 1st layer (output layer) maps the input signals to the output signals.]

• A single-layer NN is an array of units.

• Usable for multiclass classification (e.g., classifying apples as tasty, sweet, or bitter).

• The above NN is also called two-layer in many papers.

• In this talk, we call it single-layer, following PRML.


Outline

1 Fundamentals of machine learning: classification

2 Perceptron
   Introduction to perceptron
   Single-layer NNs and threshold logic units
   The first “quiet years” of NNs

3 Multilayer feed-forward NN and backpropagation

4 Deep Learning

5 Application of deep learning


The first “quiet years” of NNs

Perceptrons: An Introduction to Computational Geometry [Minsky and Papert, 1969]

• In this book, the authors proved that single-layer NNs cannot solve linearly non-separable problems.

• People misinterpreted the proof as if it generalized to multilayer NNs, i.e., as if they could not solve such problems either!

• Actually, multilayer NNs (described later) can solve them.

Unfortunately, research budgets for NNs were reduced until the mid-1980s.


Outline

1 Fundamentals of machine learning: classification

2 Perceptron

3 Multilayer feed-forward NN and backpropagation
   Multilayer feed-forward NN
   Backpropagation
   Problem: over-fitting
   Problem: vanishing gradient
   The second “quiet years” of NNs

4 Deep Learning

5 Application of deep learning


Outline

1 Fundamentals of machine learning: classification

2 Perceptron

3 Multilayer feed-forward NN and backpropagation
   Multilayer feed-forward NN
   Backpropagation
   Problem: over-fitting
   Problem: vanishing gradient
   The second “quiet years” of NNs

4 Deep Learning

5 Application of deep learning


Multilayer feed-forward NN

[Figure: an N-layer feed-forward NN; layers #1 to #N−1 are hidden layers and layer #N is the output layer, with the inputs entering at the bottom and the outputs leaving at the top.]

An N-layer feed-forward NN is constructed of N single-layer NNs.

• The top is called an output layer.

• Others are hidden layers.

• Input & output signals are observable.

• Outputs of hidden layers are unobservable.

Feed-forward Propagation

• Signals are propagated from lower to upper layers.

• Skipping layers is allowed (skip-layer connections).

• Connections can be sparse.


Units of multilayer NN

An N -layer NN:

• G = (V, E) is a directed graph of a multilayer NN.

   • V is a set of units.
   • E ⊆ V × V is a set of edges ((i, j) ∈ E is the edge to i from j).

• wi,j is a weight of (i, j) ∈ E.

• si is an output signal of unit i ∈ V, given by

   si = hi(ai)   where   ai = Σ_{j s.t. (i,j)∈E} wi,j sj.

   • ai is an activation.
   • hi is a (nonlinear) activation function (e.g., sigm, tanh, etc.).

s0 = 1 and (i, 0) ∈ E for all i ∈ V, for the biases.


Outputs of a multilayer feed-forward NN (without skip-layer connection)

• Input signals: x1, x2, . . . , xD

• The output signals of the 1st layer:

   si = hi( Σ_j wi,j xj )   for all i ∈ V in the 1st layer

• The output signals of the 2nd layer:

   sk = hk( Σ_i wk,i hi( Σ_j wi,j xj ) )   for all k ∈ V in the 2nd layer

• . . .

• Output signals: y1, y2, . . . , yd
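A minimal NumPy sketch of this feed-forward computation for a network without skip-layer connections (the layer sizes, random weights, and the shared sigmoid activation are illustrative assumptions):

import numpy as np

def sigm(a):
    return 1.0 / (1.0 + np.exp(-a))

def feed_forward(x, weights, h=sigm):
    # weights[l] has one row per unit of layer l+1; column 0 holds the bias weight,
    # matched with the constant signal s0 = 1 prepended at every layer.
    s = x
    for W in weights:
        a = W @ np.concatenate(([1.0], s))    # activations a_i = sum_j w_ij s_j
        s = h(a)                              # output signals s_i = h_i(a_i)
    return s

rng = np.random.default_rng(0)
W1 = rng.normal(size=(3, 1 + 2))   # 1st layer: 2 inputs -> 3 units
W2 = rng.normal(size=(1, 1 + 3))   # 2nd (output) layer: 3 units -> 1 output
print(feed_forward(np.array([0.5, -1.0]), [W1, W2]))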


Ability of a multilayer feed-forward NN

Theorem (universal approximation) [Cybenko, 1989]

A two-layer feed-forward NN with linear output can approximate any continuous function if it has a sufficient # of hidden units.

[Figure: NN approximations of f(x) = x², f(x) = sin(x), and f(x) = |x|.]


Training for a multilayer feed-forward NN

[Figure: an N-layer feed-forward NN mapping inputs to outputs; the outputs are compared with the targets.]

We have a training set S ⊆ R^D × R^d, i.e., pairs of

• an input vector x = (x1, x2, . . . , xD)ᵀ and

• a target vector t = (t1, t2, . . . , td)ᵀ.

Purpose: to compute w s.t.

   t ≈ y   for all (x, t) ∈ S

where y = f(x, w) is the output vector of the NN.


Outline

1 Fundamentals of machine learning: classification

2 Perceptron

3 Multilayer feed-forward NN and backpropagation
   Multilayer feed-forward NN
   Backpropagation
   Problem: over-fitting
   Problem: vanishing gradient
   The second “quiet years” of NNs

4 Deep Learning

5 Application of deep learning


The purpose of training for a multilayer NN

The optimum weight vector w minimizes the error function E(w):

   w = arg min_w E(w)

where

   E(w) = (1/2) Σ_{(x,t)∈S} ‖f(x, w) − t‖².

However, this is a nonlinear least-squares problem since f is nonlinear.
Generally, finding w is very difficult, so we try to find a local optimum.


Gradient descent method

Gradient descent method

Finding a local optimum of f : R → R can be achieved by the iteration

   x^(τ+1) = x^(τ) − η f′(x^(τ))

where

• x^(τ) is a value of x at the τ-th iteration and

• η > 0 is a learning rate.
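A minimal sketch of this iteration on a toy one-dimensional function (the function and step count are illustrative):

def gradient_descent(f_prime, x0, eta=0.1, n_iter=100):
    # Iterate x <- x - eta * f'(x) and return the final x.
    x = x0
    for _ in range(n_iter):
        x = x - eta * f_prime(x)
    return x

# Toy example: f(x) = (x - 3)^2, so f'(x) = 2(x - 3); the minimum is at x = 3.
print(gradient_descent(lambda x: 2 * (x - 3), x0=0.0))   # approximately 3.0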


Backpropagation

We find a locally optimal weight by

   w^(τ+1) = w^(τ) − η ∇E(w^(τ)).

The gradient ∇E(w^(τ)) is computed by backpropagation [Rumelhart et al., 1986]:

• The basic idea is the propagation of the error δi = ∂E/∂ai from upper to lower layers.

[Figure: the error δ of a unit is computed from the errors δk of the units above it, through the weights connecting them.]

• (The concrete algorithm is omitted here because it is complex; see the appendix for the derivation.)


Outline

1 Fundamentals of machine learning: classification

2 Perceptron

3 Multilayer feed-forward NN and backpropagation
   Multilayer feed-forward NN
   Backpropagation
   Problem: over-fitting
   Problem: vanishing gradient
   The second “quiet years” of NNs

4 Deep Learning

5 Application of deep learning


Over-fitting

• Over-fitting: obtaining a w that does not work on real data, i.e.,

   • the fitting error on the training set is very small, but
   • the error on real data is very large.

Cause:

   • a multilayer NN is sensitive to errors in the training data;
   • backpropagation tends to fall into a (bad) local optimum.


Over-fitting

[Figure: training points of Class 1 and Class 2 with an over-fitted decision boundary.]

(# of layers = 2, # of hidden units = 10, all activation functions are sigmoid.)


Over-fitting and errors

[Figure: RMS error versus # of iterations for the training and test sets.]

Evaluation of the root-mean-square (RMS) error

   ERMS = √(2 E(w) / N)

for

• a training set and

• a test set.

• The training error decreases,

• but the test error increases.


Regularization

Regularization (a.k.a. weight decay) is an approach to prevent over-fitting by minimizing, e.g.,

   E(w) + (λ/2) ‖w‖²

where λ > 0 is a regularization coefficient ((λ/2) ‖w‖² is a penalty term).

Why can regularization prevent over-fitting?

• Empirically, ‖w‖ gets large when over-fitting occurs.

• Statistically, it is a slightly more advanced estimation method:

   • min E(w) maximizes the likelihood p(S | w), assuming p(w) is a uniform distribution (maximum likelihood estimation).
   • min (E(w) + (λ/2) ‖w‖²) maximizes the posterior p(w | S), assuming p(w) is a zero-mean Gaussian distribution (MAP estimation).
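In gradient-based training, the penalty only adds a λw term to the gradient; a minimal sketch (NumPy assumed, and grad_E is a hypothetical stand-in for the backpropagated gradient of E(w)):

import numpy as np

def regularized_step(w, grad_E, eta=0.1, lam=0.01):
    # One gradient step on E(w) + (lambda/2) ||w||^2; the penalty contributes lam * w.
    return w - eta * (grad_E(w) + lam * w)

grad_E = lambda w: w - np.array([1.0, -2.0])   # dummy data-term gradient (optimum at (1, -2))
w = np.zeros(2)
for _ in range(200):
    w = regularized_step(w, grad_E)
print(w)   # slightly shrunk toward zero compared with the unregularized optimum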


Outline

1 Fundamentals of machine learning: classification

2 Perceptron

3 Multilayer feed-forward NN and backpropagation
   Multilayer feed-forward NN
   Backpropagation
   Problem: over-fitting
   Problem: vanishing gradient
   The second “quiet years” of NNs

4 Deep Learning

5 Application of deep learning


Vanishing gradient problem

Vanishing gradient problem

• Errors δi are not propagated down to the lower layers if an NN has a large # of layers.

• Most of the error is absorbed in the top few layers.

Cause:

• a multilayer NN can solve a problem without its lower layers because it is expressive enough (cf. universal approximation).


Vanishing gradient

[Figure: ‖δ‖² of each layer (1st–10th) versus # of iterations; the errors in the lower layers are orders of magnitude smaller than in the upper layers.]

(Each layer contains 10 units, and all activation functions are sigmoid.)


Outline

1 Fundamentals of machine learning: classification

2 Perceptron

3 Multilayer feed-forward NN and backpropagation
   Multilayer feed-forward NN
   Backpropagation
   Problem: over-fitting
   Problem: vanishing gradient
   The second “quiet years” of NNs

4 Deep Learning

5 Application of deep learning


The second “quiet years” of NNs

The first paragraph of a paper [Simard et al., ICDAR 2003]:

After being extremely popular in the early 1990s, neural networks have fallen out of favor in research in the last 5 years. In 2000, it was even pointed out by the organizers of the Neural Information Processing System (NIPS) conference that the term “neural networks” in the submission title was negatively correlated with acceptance. In contrast, positive correlations were made with support vector machines (SVMs), Bayesian networks, and variational methods.


Outline

1 Fundamentals of machine learning: classification

2 Perceptron

3 Multilayer feed-forward NN and backpropagation

4 Deep Learning
   Unsupervised pretraining by restricted Boltzmann machine
   Rectified linear units (ReLU) and Maxout
   Dropout

5 Application of deep learning


Deep Learning

Deep Learning (2006–)

• A set of algorithms for deeply structured NNs (of about 7, 8 or more layers)

• A trend in recent machine-learning research

• Actively used for image processing, speech recognition, natural language processing, etc.

• Also applied to analysis of programming languages


Outline

1 Fundamentals of machine learning: classification

2 Perceptron

3 Multilayer feed-forward NN and backpropagation

4 Deep Learning
   Unsupervised pretraining by restricted Boltzmann machine
   Rectified linear units (ReLU) and Maxout
   Dropout

5 Application of deep learning


Supervised & unsupervised training

Supervised training

• Training using a training set containing inputs and targets

• To obtain f : X → Y from S ⊆ X × Y (f(x) ≈ t for most (x, t) ∈ S)

• Typical problems:

   • Classification: giving labels to inputs
   • Regression: finding the relationship between two continuous variables

Unsupervised training

• Training using a training set containing only inputs

• To obtain f : X → Y from S ⊆ X

• Typical problems:

   • Clustering: finding a grouping of the inputs
   • Dimensionality reduction: converting high-dimensional vectors into low-dimensional ones (preserving the original information)


Unsupervised pretraining by restricted Boltzmann machine

Hinton and Salakhutdinov, Reducing the Dimensionality of Data with Neural Networks, Science, Vol. 313, No. 5786, pp. 504–507, 2006.

Autoencoder network

• An NN for (nonlinear) dimensionality reduction

• Difficulty in finding appropriate weights:

   • The initial weights must be close to a good solution.

Hinton’s approach

• Process:

   1. Pretraining: layer-wise computation of good initial weights by restricted Boltzmann machines (RBMs)
   2. Unrolling: construction of an autoencoder network using those weights
   3. Fine-tuning: adjustment of the whole network by backpropagation

• Actually, the initial weights are also suitable for other problems.

• Empirically, a lower-dimensional representation of the inputs is a good initial weight.


Autoencoder

[Figure: an autoencoder; the input signals x pass through a hidden layer h (weights W) to the output signals y (weights Wᵀ).]

• A two-layer NN that reconstructs its inputs (x ≈ y)

• Bottleneck: # of hidden units < # of inputs

• Computation:

   • Encoder: h = f(Wx + b)
   • Decoder: y = f(Wᵀh + b′)
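A minimal NumPy sketch of this tied-weight encoder/decoder pass (the sizes, random weights, and sigmoid f are illustrative assumptions; no training is performed here):

import numpy as np

def sigm(a):
    return 1.0 / (1.0 + np.exp(-a))

def autoencode(x, W, b, b_dec, f=sigm):
    # Encoder: h = f(Wx + b); decoder: y = f(W^T h + b').
    h = f(W @ x + b)            # bottleneck: len(h) < len(x)
    y = f(W.T @ h + b_dec)      # reconstruction of x
    return h, y

rng = np.random.default_rng(0)
D, H = 8, 3                     # 8 inputs, 3 hidden units (the bottleneck)
W, b, b_dec = rng.normal(size=(H, D)), np.zeros(H), np.zeros(D)
h, y = autoencode(rng.random(D), W, b, b_dec)
print(h.shape, y.shape)         # (3,) (8,)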


Restricted Boltzmann machine (RBM)

[Figure: an RBM; visible units v1, . . . , vm with biases b, hidden units h1, . . . , hn with biases c, connected by weights wij.]

   v = (v1, . . . , vm) ∈ {0, 1}^m
   h = (h1, . . . , hn) ∈ {0, 1}^n
   b = (b1, . . . , bm) ∈ R^m
   c = (c1, . . . , cn) ∈ R^n
   W = {wij} ∈ R^(m×n)
   θ = (b, c, W) (parameters)

An RBM is an undirected graph constructed of

• a visible layer (for an input vector v) and

• a hidden layer (as a low-dimensional representation h).

Estimation of θ: maximizing the likelihood p(v | θ), where

• p(v, h | θ) = exp(−E(v, h)) / Z and

• E(v, h) = −bᵀv − cᵀh − vᵀWh,

by contrastive divergence learning (an approximation).
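A minimal sketch of one CD-1 (contrastive divergence) update for a binary RBM with the energy above (NumPy assumed; the learning rate, sampling details, and sizes are illustrative simplifications):

import numpy as np

rng = np.random.default_rng(0)
sigm = lambda a: 1.0 / (1.0 + np.exp(-a))
sample = lambda p: (rng.random(p.shape) < p).astype(float)   # Bernoulli sampling

def cd1_update(v0, W, b, c, eta=0.1):
    # One CD-1 step for the energy E(v, h) = -b^T v - c^T h - v^T W h.
    ph0 = sigm(c + v0 @ W)          # p(h_j = 1 | v0)
    h0 = sample(ph0)
    pv1 = sigm(b + h0 @ W.T)        # p(v_i = 1 | h0): reconstruction
    v1 = sample(pv1)
    ph1 = sigm(c + v1 @ W)
    # Approximate likelihood gradient: positive phase minus negative phase.
    W += eta * (np.outer(v0, ph0) - np.outer(v1, ph1))
    b += eta * (v0 - v1)
    c += eta * (ph0 - ph1)
    return W, b, c

m, n = 6, 3                                      # visible / hidden sizes
W, b, c = 0.01 * rng.normal(size=(m, n)), np.zeros(m), np.zeros(n)
v = sample(np.full(m, 0.5))                      # a random binary input vector
W, b, c = cd1_update(v, W, b, c)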


Process of pretraining, unrolling and fine-tuning

[Figure: (1) pretraining: a stack of RBMs is trained layer by layer on the input vectors, giving weights W1, W2, W3; (2) unrolling: the learned weights are copied into an encoder–decoder network with a code layer in the middle, the decoder reusing W3, W2, W1; (3) fine-tuning: the whole unrolled network is adjusted by backpropagation.]


Dimensionality reduction for retrieved documents

2000-500-250-125-2 autoencoder for 804,414 newswire stories


Outline

1 Fundamentals of machine learning: classification

2 Perceptron

3 Multilayer feed-forward NN and backpropagation

4 Deep Learning
   Unsupervised pretraining by restricted Boltzmann machine
   Rectified linear units (ReLU) and Maxout
   Dropout

5 Application of deep learning


The problem of logistic sigmoid function

[Figure: plot of sigm(a) over a ∈ [−6, 6]; the curve saturates for large |a|.]

• The output of a unit is given by f(wᵀx) where

   f(a) = sigm(a) = 1 / (1 + exp(−a)).

• The gradient of sigm decreases as training progresses.

• This makes training slow.

High precision requires long training!


Rectified linear units (ReLU)

Nair and Hinton, Rectified Linear Units Improve Restricted Boltzmann Machines, ICML 2010.

[Figure: plots of the ReLU and softplus functions over a ∈ [−4, 4].]

Softplus function: f(a) = log(1 + e^a)

• More biologically plausible than the sigmoid (rarely saturated)

Rectified linear units (ReLU): a practical approximation of softplus

• ReLU: f(a) = max(0, a)

• NReLU (noisy ReLU): f(a) = max(0, a + N(0, σ(a)))

Features of ReLUs:

• Fast convergence

• High precision on real data
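A small sketch of softplus, ReLU, and the noisy ReLU (NumPy assumed; σ(a) is treated here as the variance of the added Gaussian noise):

import numpy as np

rng = np.random.default_rng(0)
softplus = lambda a: np.log1p(np.exp(a))          # log(1 + e^a)
relu = lambda a: np.maximum(0.0, a)               # max(0, a)
sigm = lambda a: 1.0 / (1.0 + np.exp(-a))

def nrelu(a):
    # Noisy ReLU: max(0, a + N(0, sigma(a))), sampling noise with variance sigm(a).
    return np.maximum(0.0, a + rng.normal(scale=np.sqrt(sigm(a))))

a = np.linspace(-4, 4, 5)
print(relu(a), softplus(a), nrelu(a), sep="\n")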


Maxout

Goodfellow et al., Maxout Networks, ICML 2013.

[Figure: an example of a maxout output (a piecewise linear curve).]

Maxout: f(x) = max(x1, x2, . . . , xn)

• An activation function that outputs the max of its inputs

• Equal to a piecewise linear function

• Higher precision on real data (see the sketch below)
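In a maxout network, the inputs x1, . . . , xn to the max are themselves linear functions of the layer input, so a single maxout unit can be sketched as follows (weights and sizes are illustrative; NumPy assumed):

import numpy as np

def maxout(x, W, b):
    # Maxout unit: output max_i (w_i^T x + b_i) over k linear pieces,
    # which is a (convex) piecewise linear function of x.
    return np.max(W @ x + b)

rng = np.random.default_rng(0)
k, D = 4, 3                               # 4 linear pieces over a 3-dimensional input
W, b = rng.normal(size=(k, D)), rng.normal(size=k)
print(maxout(rng.random(D), W, b))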

[Figure: MNIST classification error versus network depth (1–7 layers), showing maxout and rectifier train/test error.]


Outline

1 Fundamentals of machine learning: classification

2 Perceptron

3 Multilayer feed-forward NN and backpropagation

4 Deep Learning
   Unsupervised pretraining by restricted Boltzmann machine
   Rectified linear units (ReLU) and Maxout
   Dropout

5 Application of deep learning


Dropout

Hinton et al., Improving neural networks by preventing co-adaptation of feature detectors. Technical report, arXiv:1207.0580, 2012.

[Figure: frequently used dropout probabilities: 20% for the inputs and 50% for each hidden layer; the output layer is always kept (100%).]

• An important approach to prevent over-fitting

• Its performance was better than other regularization methods.

• Randomly drops inputs & hidden units during training

[Figure: during training a unit is present with probability p and has weight w; at test time the unit is always present and the weight is scaled to pw.]
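A minimal sketch of the train/test behaviour in the figure (NumPy assumed): during training each signal is kept with probability p via a Bernoulli mask, and at test time every unit is present but the outgoing weights are scaled to p·w:

import numpy as np

rng = np.random.default_rng(0)

def dropout_train(s, p):
    # Training: each signal is present with probability p (zeroed otherwise).
    mask = rng.random(s.shape) < p
    return s * mask

def dropout_test_weights(W, p):
    # Test: keep every unit but scale the outgoing weights w to p * w.
    return p * W

s = np.ones(10)
print(dropout_train(s, p=0.5))            # roughly half of the signals are dropped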


Dropout is like bagging

Bagging

[Figure: random sampling with replacement from the training set produces training sets #1, . . . , #N; a classifier is trained on each; at test time an input x is fed to all N classifiers and the output y is decided by majority vote over y1, . . . , yN.]

Dropout

• Each neuron is trained using different data.

• It is similar to using a combination of many NNs.


Effect of dropout of inputs

MNIST test set (handwritten digits) / classification / 784-{800-800, 1200-1200, 2000-2000, 1200-1200-1200}-10


Effect of dropout

TIMIT core test set (English speech recognition) / classification / 4 fully-connected hidden layers × 4000 units


Outline

1 Fundamentals of machine learning: classification

2 Perceptron

3 Multilayer feed-forward NN and backpropagation

4 Deep Learning

5 Application of deep learning
   Image processing
   Other fields
   Programming language analysis


Outline

1 Fundamentals of machine learning: classification

2 Perceptron

3 Multilayer feed-forward NN and backpropagation

4 Deep Learning

5 Application of deep learning
   Image processing
   Other fields
   Programming language analysis


Image processing and feature extraction

Conventional approaches (shallow learning)

   Raw data (RGB pixels) → feature extraction → features (SIFT, SURF, etc.) → classifier → label(s) (e.g., “Animal”)

• Features needed to be designed by hand (SIFT, SURF, etc.).

• Craftsmanship was required.

Deep learning

   Raw data (RGB pixels) → (features obtained inside the network) → label(s) (e.g., “Animal”)

• Information corresponding to features is automatically obtained by a deep CNN.

In addition, accuracy is higher than with the traditional approaches!


ILSVRC 2012

Krizhevsky, Sutskever & Hinton, ImageNet Classification with Deep Convolutional Neural Networks, pp. 1097–1105, NIPS 2012.

• Large Scale Visual Recognition Challenge 2012 (ILSVRC 2012)

• A competition for object recognition in photos (1000 labels)

• Hinton’s team (Univ. of Toronto) outclassed the other entrants:

Team name            Error
SuperVision          0.15315
ISI                  0.26172
OXFORD VGG           0.26979
XRCE/INRIA           0.27058
Univ. of Amsterdam   0.29576
etc.


Google’s grandmother neurons

Le et al., Building High-level Features Using Large Scale Unsupervised Learning, ICML 2012.

Pretraining on 10 million 200×200-pixel images

• 9-layer autoencoder

• 1 billion connections

Neurons responding to specific stimuli (“grandmother neurons”) are obtained automatically.

[Figure: learned detectors for human faces, cat faces, and human bodies.]

Image recognition accuracy is improved by training after the pretraining.


Outline

1 Fundamentals of machine learning: classification

2 Perceptron

3 Multilayer feed-forward NN and backpropagation

4 Deep Learning

5 Application of deep learning
   Image processing
   Other fields
   Programming language analysis


Deep Q-learning

Mnih et al., Playing Atari with Deep Reinforcement Learning, CoRR abs/1312.5602, 2013.

• Reinforcement learning of Atari 2600 games by a deep CNN (input: raw images of the game screen)

• Beats an expert human player on Breakout, Enduro, and Pong

• However, loses on Q*bert, Seaquest, and Space Invaders

   • These games require long-term strategy.

[Figure: game screens of Pong, Breakout, Space Invaders, Seaquest, and Beam Rider.]


Outline

1 Fundamentals of machine learning: classification

2 Perceptron

3 Multilayer feed-forward NN and backpropagation

4 Deep Learning

5 Application of deep learning
   Image processing
   Other fields
   Programming language analysis


Learning to execute

Zaremba & Sutskever, Learning to Execute, CoRR abs/1410.4615, 2014.

Can NNs predict execution output of short programs written in an unknown language?

Input:
j=8584
for x in range(8):j+=920
b=(1500+j)
print((b+7564))

Target: 1218.


Problem setting

Model: a recurrent neural network (RNN) with long short-term memory (LSTM)

• 2 layers × 400 units

• A program is given as a character stream.

• The model does not know the syntax and semantics of a given program.

Short programs

• Python syntax

• Containing addition, multiplication, variable assignment, if-statements, and for-loops

   • Double loops are forbidden.
   • One of the operands of a multiplication, and the range of a for-loop, must be constant.

• Complexity parameters of programs:

   • length: the maximum # of digits in the integers in a program
   • nesting: the depth of the parse tree


Examples of short programs

length = 4, nesting = 3:

Input:
j=8584
for x in range(8):j+=920
b=(1500+j)
print((b+7564))
Target: 1218.

Input:
i=8827
c=(i-5347)
print((c+8704) if 2641<8500 else 5308)
Target: 1218.

An example with scrambled characters:

Input:
vqppknsqdvfljmncy2vxdddsepnimcbvubkomhrpliibtwztbljipcc
Target: hkhpg


Exact prediction examples

Input:
f=(8794 if 8887<9713 else (3*8334))
print((f+574))
Target: 9368.
Model prediction: 9368.

Input:
c=445
d=(c-4223)
for x in range(1):d+=5272
print((8942 if d<3749 else 2951))
Target: 8942.
Model prediction: 8942.


Misprediction examples

Input:
j=8584
for x in range(8):j+=920
b=(1500+j)
print((b+7567))
Target: 25011.
Model prediction: 23011.

Input:
a=1027
for x in range(2):a+=(402 if 6358>8211 else 2158)
print(a)
Target: 5343.
Model prediction: 5293.


Training strategies

• Baseline

   • Learning with the target distribution (length = a, nesting = b)

• Naive (naive curriculum learning) [Bengio et al., 2009]

   • Gradually increasing the “difficulty level” of the training samples
   • Giving them from (length, nesting) = (1, 1) up to (a, b)

• Mix (mixed strategy)

   • A mix of all levels of hardness
   • Picking length ∈ [1, a] and nesting ∈ [1, b] independently for every sample

• Combined (combined strategy)

   • Combination of mix with naive curriculum learning


Absolute prediction accuracy (baseline & combined)


Relative prediction accuracy (naive, mix & combined)

Naive Mix Combined


Conclusion

• Naive curriculum learning is sometimes worse than the baseline.

   • Naive: giving training samples from (length, nesting) = (1, 1) up to (a, b)
   • The model has to reconstruct its memory patterns to handle larger numbers (e.g., 5 digits → 6 digits).
   • This memory-pattern reconstruction might be difficult.

• The authors say “we don’t know how much our networks understand the meaning of programs.”


Summary

• Fundamentals of machine learning

   • Feature vectors
   • Classification
   • Dimensionality reduction

• History of NNs

   • Perceptron and single-layer NNs
   • Multilayer NNs, the gradient descent method and backpropagation
   • Deep learning

      • Pretraining by RBM
      • Rectified linear units (ReLU) and Maxout
      • Dropout

• Applications

   • Image processing (ILSVRC 2012, Google’s grandmother neurons)
   • Programming language analysis (Learning to Execute)


APPENDIX


Outline

6 Derivation of backpropagation


Derivation of backpropagation (1)

The update of the weight wi,j is defined by

   wi,j^(τ+1) = wi,j^(τ) − η ∂E/∂wi,j^(τ).

The derivative is

   ∂E/∂wi,j^(τ) = Σ_{(x,t)∈S} (∂E/∂ai^(τ)(x)) (∂ai^(τ)(x)/∂wi,j^(τ)) = Σ_{(x,t)∈S} δi^(τ)(x) sj^(τ)(x)

where

   δi^(τ)(x) ≡ ∂E/∂ai^(τ)(x).


Derivation of backpropagation (2)

• If i ∈ V is in the output layer, then

   δi^(τ)(x) ≡ ∂E/∂ai^(τ)(x) = (∂E/∂fi^(τ)(x)) (∂fi^(τ)(x)/∂ai^(τ)(x)) = E′(fi^(τ)(x)) hi′(ai^(τ)(x)).

• Otherwise we obtain

   δi^(τ)(x) = Σ_{k s.t. (k,i)∈E} (∂E/∂ak^(τ)(x)) (∂ak^(τ)(x)/∂ai^(τ)(x))
             = Σ_{k s.t. (k,i)∈E} (∂E/∂ak^(τ)(x)) (∂ak^(τ)(x)/∂si^(τ)(x)) (∂si^(τ)(x)/∂ai^(τ)(x))
             = ( Σ_{k s.t. (k,i)∈E} δk^(τ)(x) wk,i ) hi′(ai^(τ)(x)).


Algorithm: Backpropagation (batch mode)

Initialize w^(0) randomly.
for τ = 0, 1, 2, . . . do
   Compute the signals si^(τ)(x) for all i ∈ V, (x, t) ∈ S from lower to upper layers.
   Update the weights for all (i, j) ∈ E from upper to lower layers by wi,j^(τ+1) ← wi,j^(τ) − η Δwi,j^(τ), where

      Δwi,j^(τ) = Σ_{(x,t)∈S} δi^(τ)(x) sj^(τ)(x)

   and

      δi^(τ)(x) = (yi^(τ)(x) − ti) hi′(ai^(τ)(x))                              if i ∈ V is in the output layer,
      δi^(τ)(x) = ( Σ_{k s.t. (k,i)∈E} δk^(τ)(x) wk,i ) hi′(ai^(τ)(x))          otherwise.
end for
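A minimal NumPy sketch of this batch algorithm for a fully connected two-layer network with sigmoid hidden units and linear output units (so hi′ = 1 at the output); the matrix form, the division by N, and the toy x² regression task are illustrative assumptions, not part of the original slides:

import numpy as np

rng = np.random.default_rng(0)
sigm = lambda a: 1.0 / (1.0 + np.exp(-a))

def backprop_batch(X, T, H=8, eta=0.5, n_iter=2000):
    # X: (N, D) inputs, T: (N, d) targets; biases are handled by a constant 1 signal.
    N, D = X.shape
    d = T.shape[1]
    W1 = rng.normal(scale=0.5, size=(H, D + 1))        # hidden-layer weights (incl. bias)
    W2 = rng.normal(scale=0.5, size=(d, H + 1))        # output-layer weights (incl. bias)
    X1 = np.hstack([np.ones((N, 1)), X])               # s0 = 1
    for _ in range(n_iter):
        A1 = X1 @ W1.T                                 # hidden activations a_i
        S1 = np.hstack([np.ones((N, 1)), sigm(A1)])    # hidden signals s_i (plus bias signal)
        Y = S1 @ W2.T                                  # linear outputs y_i
        D2 = Y - T                                     # output deltas: (y_i - t_i) * 1
        D1 = (D2 @ W2[:, 1:]) * sigm(A1) * (1 - sigm(A1))   # hidden deltas
        W2 -= eta * D2.T @ S1 / N                      # sum of delta_i * s_j, averaged over the batch
        W1 -= eta * D1.T @ X1 / N
    return W1, W2

# Toy regression: learn f(x) = x^2 on [-1, 1] (cf. the universal-approximation slide).
X = np.linspace(-1, 1, 50).reshape(-1, 1)
T = X ** 2
W1, W2 = backprop_batch(X, T)
X1 = np.hstack([np.ones((len(X), 1)), X])
Y = np.hstack([np.ones((len(X), 1)), sigm(X1 @ W1.T)]) @ W2.T
print("train RMS error:", np.sqrt(np.mean((Y - T) ** 2)))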
