Deep Learning
Neural networks have come back!
Akinori ABE (M1)
Sumii Laboratory, Graduate School of Information Science
Tohoku University
Dec 8, 2014
Deep learning and neural network
(Artificial) Neural Network (ANN, NN)
• An information processing model imitating the biological nervous system
• A network constructed of
  • units corresponding to neurons and
  • layers containing them
[Figure: a four-layer network; the inputs (input layer) feed the 1st, 2nd and 3rd layers and then the 4th layer (output layer).]
Deep Learning
• A set of algorithms for deeply structured NNs (of about 7, 8 or more layers)
• A trend in recent machine learning research
• Actively used for image processing, speech recognition, natural language processing, etc.
• Also applied to analysis of programming languages
History of NNs
1943 Threshold logic units [McCulloch and Pitts]
1949 Hebbian learning rule [Hebb]
1950–1960 The first golden age
  1957 Perceptron [Rosenblatt]
  1969 Limitations of perceptron [Minsky and Papert]
1970s The first “quiet years”
1980s The second golden age
  1986 Backpropagation [Rumelhart, Hinton and Williams]
mid 1990s–early 2000s The second “quiet years”
2006– The third golden age (Deep Learning)
  2006 Pretraining by restricted Boltzmann machine [Hinton and Salakhutdinov]
  2010 Rectified linear unit (ReLU) [Nair and Hinton]
  2012 Dropout [Hinton, et al.]
Outline
1 Fundamentals of machine learning: classification
2 Perceptron
3 Multilayer feed-forward NN and backpropagation
4 Deep Learning
5 Application of deep learning
Is this apple tasty (for you)?
The apple was produced in Aomori and has five pips.
Empirical estimation
• The apples you have eaten.
x1 (# of pips)   x2 (Aomori or not)   y (tasty)
3                +1                   +1
4                +1                   +1
3                −1                   −1
4                −1                   −1
3                −1                   −1
6                −1                   +1
• Is the unknown apple tasty?
x1 (# of pips)   x2 (Aomori or not)   y (tasty)
5                +1                   ?
Consider the coordinate space
[Figure: scatter plot of the apples; horizontal axis: # of pips (2 to 6); vertical axis: Aomori or not (+1 / −1); markers: tasty, not tasty, unknown.]
The unknown apple is probably tasty.
Decision boundary
[Figure: the scatter plot of the apples with a straight-line decision boundary separating tasty from not tasty apples, and the normal vector of the boundary.]
$F(x) = w_0 + w_1 x_1 + w_2 x_2 = w^\top x$ where
• a weight vector $w = (-3, 1, 2)^\top$ and
• a feature vector $x = (1, x_1, x_2)^\top$.
• $y = +1$ if $F(x) > 0$
• $y = -1$ if $F(x) < 0$
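The decision rule above can be checked on the unknown apple with a few lines of Python. This is a minimal sketch: the function names `decision` and `classify` are illustrative, and mapping the undefined case F(x) = 0 to −1 is an assumption of the sketch, not something the slide states.

```python
# Decision function F(x) = w^T x for the apple example.
w = (-3.0, 1.0, 2.0)  # (w0, w1, w2) as given on the slide


def decision(x1, x2):
    """Return F(x) for the feature vector x = (1, x1, x2)."""
    x = (1.0, x1, x2)
    return sum(wi * xi for wi, xi in zip(w, x))


def classify(x1, x2):
    """Return +1 (tasty) if F(x) > 0, otherwise -1 (F(x) = 0 is mapped to -1 here)."""
    return 1 if decision(x1, x2) > 0 else -1


# The unknown apple: five pips, produced in Aomori (x2 = +1).
unknown_score = decision(5, +1)   # -3 + 5 + 2 = 4
unknown_label = classify(5, +1)   # +1, i.e., probably tasty
```

The same weights also classify all six apples in the table correctly, which is what makes this line a valid decision boundary for the training data.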
Classification
• We have a (large) training set $S \subseteq \mathbb{R}^{D+1} \times \{+1, -1\}$, i.e., pairs of
  • an input vector $x$ and
  • a target value $t$ (of $y$).
• We classify an input vector $x$ by
  $y = +1$ if $F(x) > 0$ and $y = -1$ if $F(x) < 0$, where $F(x) = w^\top x$.
• We want to compute $w$ s.t. $t F(x) = t\, w^\top x > 0$ for all $(x, t) \in S$.
Expectation: if $w$ classifies the training data correctly, it should classify unknown data correctly as well.
Outline
1 Fundamentals of machine learning: classification
2 Perceptron
  Introduction to perceptron
  Single-layer NNs and threshold logic units
  The first “quiet years” of NNs
3 Multilayer feed-forward NN and backpropagation
4 Deep Learning
5 Application of deep learning
Perceptron
Perceptron [Rosenblatt, 1957]
• A fundamental classification approach
• A single-layer NN (described later)
Algorithm: Training of Perceptron
  Initialize $w$.
  repeat
    Get $(x, t) \in S$ with replacement.
    if $t\, w^\top x < 0$ then
      $w \leftarrow w + t x$
    end if
  until converged
The algorithm updates w if a prediction is wrong.
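The training loop can be sketched in Python on the six apples from the earlier table. Two details here are assumptions of the sketch rather than part of the slide: samples are visited in a fixed sweep instead of being drawn at random, and the test is `<= 0` instead of `< 0` so that the all-zero initial weight vector also triggers updates. The names `train_perceptron` and `max_epochs` are illustrative.

```python
# Training apples as ((1, x1, x2), t) pairs; the leading 1 is the bias input.
S = [
    ((1, 3, +1), +1), ((1, 4, +1), +1), ((1, 3, -1), -1),
    ((1, 4, -1), -1), ((1, 3, -1), -1), ((1, 6, -1), +1),
]


def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def train_perceptron(samples, max_epochs=10000):
    """Return w with t * w^T x > 0 for every sample, or None if not converged."""
    w = [0.0] * len(samples[0][0])
    for _ in range(max_epochs):
        updated = False
        for x, t in samples:              # fixed sweep (assumption of this sketch)
            if t * dot(w, x) <= 0:        # wrong (or undecided) prediction
                w = [wi + t * xi for wi, xi in zip(w, x)]
                updated = True
        if not updated:                   # a full pass without mistakes: converged
            return w
    return None


w = train_perceptron(S)
```

Because this data set is linearly separable, the loop stops after finitely many updates; on non-separable data it would run until `max_epochs` and return `None`.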
Updating a decision boundary
Updating by $w \leftarrow w + t x$ (if $t\, w^\top x < 0$, i.e., the prediction is wrong):
(a) If false negative ($F(x) < 0$ but $t = +1$), $w$ is updated to $w + x$.
(b) If false positive ($F(x) > 0$ but $t = -1$), $w$ is updated to $w - x$.
[Figure: the decision boundary before and after updating, for (a) a false negative and (b) a false positive.]
Why can a perceptron learn?
(a) False negative ($F(x) < 0$ but $t = +1$)
• An updated weight vector: $w' := w + x$
• Trying to classify $x$ again: $w'^\top x = (w + x)^\top x = w^\top x + x^\top x$
  • $w^\top x$: misprediction (negative)
  • $x^\top x = \|x\|^2$: squared L2-norm (always positive)
Thus $w'^\top x$ is larger than $w^\top x$, i.e., it moves toward the sign of $t$.
(b) False positive ($F(x) > 0$ but $t = -1$)
• An updated weight vector: $w' := w - x$
• Trying to classify $x$ again: $w'^\top x = (w - x)^\top x = w^\top x - x^\top x$
  • $w^\top x$: misprediction (positive)
  • $x^\top x = \|x\|^2$: squared L2-norm (always positive)
Thus $w'^\top x$ is smaller than $w^\top x$, i.e., it moves toward the sign of $t$.
Perceptron convergence theorem
Theorem (perceptron convergence) [Rosenblatt, 1962]
A perceptron always converges and returns $w$ s.t.
$t\, w^\top x > 0$ for all $(x, t) \in S$
if the training set is linearly separable.
Definition (linear separability)
In binary classification, a data set is linearly separable if all points of the two classes in the set can be discriminated by some hyperplane.
However, a perceptron converges very slowly on real data.
Artificial neural network
(Artificial) Neural Network (ANN, NN)
• An information processing model imitating the biological nervous system
  • An NN is not a strict model of the biological nervous system, but the analogy is frequently used as an intuitive explanation.
• A network constructed of
  • units corresponding to neurons and
  • layers containing them
[Figure: a four-layer network; the inputs (input layer) feed the 1st, 2nd and 3rd layers and then the 4th layer (output layer).]
Artificial neurons
Threshold logic units (TLU) [McCulloch and Pitts, 1943]
An artificial neuron (i.e., a unit)
• takes input signals $x_1, \ldots, x_D$,
• is activated if $\sum_{i=1}^{D} w_i x_i > \theta$ and
• outputs $y = +1$ if activated (otherwise $y = 0$).
Hebbian learning rule [Hebb, 1949]
• Learning is achieved by changing the efficiency of signal propagation between two biological neurons in the long term.
• It corresponds to updating $w_i$.
Units
• $x = (1, x_1, \ldots, x_D) \in \mathbb{R}^{D+1}$ is an input vector.
• $w = (w_0, w_1, \ldots, w_D) \in \mathbb{R}^{D+1}$ is a weight vector.
• $y$ is an output given by $y = h(a)$ where $a = w^\top x$.
  • $a$ is an activation.
  • $h$ is a (nonlinear) activation function.
Examples of activation function
[Figure: plots of $H(a)$, $\mathrm{sigm}(a)$ and $\tanh(a)$ for $a$ from −6 to 6.]
Step function:
$H(a) = \begin{cases} +1 & a > 0, \\ -1 & a < 0. \end{cases}$
Logistic sigmoid function:
$\mathrm{sigm}(a) = \frac{1}{1 + \exp(-a)}.$
Hyperbolic tangent:
$\tanh(a) = \frac{\exp(a) - \exp(-a)}{\exp(a) + \exp(-a)}.$
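For reference, the three activation functions can be written directly from their definitions. This is a small sketch using only the standard library; mapping H(0), which the definition leaves open, to −1 is an assumption made here for convenience.

```python
import math


def step(a):
    """Step function H(a): +1 for a > 0, -1 for a < 0 (a = 0 is mapped to -1 here)."""
    return 1.0 if a > 0 else -1.0


def sigm(a):
    """Logistic sigmoid 1 / (1 + exp(-a)); its output lies in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-a))


def tanh(a):
    """Hyperbolic tangent written out from its definition; output lies in (-1, 1)."""
    return (math.exp(a) - math.exp(-a)) / (math.exp(a) + math.exp(-a))
```

The two smooth functions are closely related: tanh(a) = 2 sigm(2a) − 1, so tanh is a rescaled and shifted sigmoid.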
Single-layer NN
[Figure: input signals feeding the 1st layer (output layer), which produces the output signals.]
• A single-layer NN is an array of units.
• Usable for multiclass classification (e.g., classification of tasty, sweet or bitter apples).
• The above NN is also called two-layer in many papers.
• In this talk, we call it single-layer, following PRML.
The first “quiet years” of NNs
Perceptrons: an introduction to computational geometry [Minsky and Papert, 1969]
• In this book, the authors proved that single-layer NNs cannot solve linearly non-separable problems.
• People misinterpreted the proof as generalizing to multilayer NNs, i.e., as showing that multilayer NNs could not solve such problems either.
• Actually, multilayer NNs (described later) can solve them.
Unfortunately, research budgets for NNs were reduced until the mid-1980s.
Outline
1 Fundamentals of machine learning: classification
2 Perceptron
3 Multilayer feed-forward NN and backpropagation
  Multilayer feed-forward NN
  Backpropagation
  Problem: over-fitting
  Problem: vanishing gradient
  The second “quiet years” of NNs
4 Deep Learning
5 Application of deep learning
Multilayer feed-forward NN
[Figure: an N-layer network; the inputs enter layer #1, layers #1 to #N−1 are hidden layers, and layer #N is the output layer producing the outputs.]
An $N$-layer feed-forward NN is constructed of $N$ single-layer NNs.
• The top is called an output layer.
• Others are hidden layers.
• Input & output signals are observable.
• Outputs of hidden layers are unobservable.
Feed-forward Propagation
• Signals are propagated from lower to upper layers.
• Skipping is allowed. (skip-layer connection)
• Connection can be sparse.
Units of multilayer NN
An $N$-layer NN:
• $G = (\mathcal{V}, \mathcal{E})$ is a directed graph of a multilayer NN.
  • $\mathcal{V}$ is a set of units.
  • $\mathcal{E} \subseteq \mathcal{V} \times \mathcal{V}$ is a set of edges. ($(i, j) \in \mathcal{E}$ is the edge to $i$ from $j$.)
• $w_{i,j}$ is a weight of $(i, j) \in \mathcal{E}$.
• $s_i$ is an output signal of unit $i \in \mathcal{V}$, given by
  $s_i = h_i(a_i)$ where $a_i = \sum_{j \text{ s.t. } (i,j) \in \mathcal{E}} w_{i,j} s_j$.
  • $a_i$ is an activation.
  • $h_i$ is a (nonlinear) activation function (e.g., sigm, tanh, etc.).
$s_0 = 1$ and $(i, 0) \in \mathcal{E}$ for all $i \in \mathcal{V}$ for biases.
Outputs of a multilayer feed-forward NN (without skip-layer connection)
• Input signals: $x_1, x_2, \ldots, x_D$
• The output signals of the 1st layer:
  $s_i = h_i\left(\sum_j w_{i,j} x_j\right)$ for all $i \in \mathcal{V}$ in the 1st layer
• The output signals of the 2nd layer:
  $s_k = h_k\left(\sum_i w_{k,i}\, h_i\left(\sum_j w_{i,j} x_j\right)\right)$ for all $k \in \mathcal{V}$ in the 2nd layer
• . . .
• Output signals: $y_1, y_2, \ldots, y_d$
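The nested sums above amount to applying one layer after another. The following sketch assumes fully connected layers without skip-layer connections, the same activation function for all units, and a bias stored as the first weight of each unit; the names `layer` and `forward` and the 2-2-1 example weights are illustrative only.

```python
import math


def sigm(a):
    """Logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-a))


def layer(W, inputs, h):
    """Compute s_i = h(sum_j W[i][j] * inputs[j]) for every unit i of one layer."""
    return [h(sum(w * s for w, s in zip(row, inputs))) for row in W]


def forward(weights, x, h=sigm):
    """Propagate x from lower to upper layers; a constant 1 is prepended as bias input."""
    s = list(x)
    for W in weights:
        s = layer(W, [1.0] + s, h)
    return s


# Hypothetical 2-2-1 network; the numbers are arbitrary illustration values.
W1 = [[0.0, 1.0, -1.0], [0.5, -0.5, 0.5]]   # 2 hidden units, each (bias, w1, w2)
W2 = [[0.0, 1.0, 1.0]]                       # 1 output unit, (bias, w1, w2)
y = forward([W1, W2], [1.0, 1.0])
```

For the input (1, 1) the two hidden units receive activations 0 and 0.5, so the single output equals sigm(sigm(0) + sigm(0.5)).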
Ability of a multilayer feed-forward NN
Theorem (universal approximation) [Cybenko, 1989]
A two-layer feed-forward NN with linear output can approximate any continuous function on a compact domain to arbitrary accuracy if it has enough hidden units.
[Figure: approximations of $f(x) = x^2$, $f(x) = \sin(x)$ and $f(x) = |x|$.]
Training for a multilayer feed-forward NN
[Figure: an N-layer network mapping the inputs to outputs, which are compared with the targets.]
We have a training set $S \subseteq \mathbb{R}^D \times \mathbb{R}^d$, i.e., pairs of
• an input vector $x = (x_1, x_2, \ldots, x_D)^\top$ and
• a target vector $t = (t_1, t_2, \ldots, t_d)^\top$.
Purpose: to compute $w$ s.t.
$t \approx y$ for all $(x, t) \in S$
where $y = f(x, w)$ is an output vector of the NN.
The purpose of training for a multilayer NN
The optimum weight vector $w$ minimizes the error function $E(w)$:
$w = \arg\min_w E(w)$
where
$E(w) = \frac{1}{2} \sum_{(x,t) \in S} \|f(x, w) - t\|^2.$
However, this is a nonlinear least squares problem since $f$ is nonlinear.
Generally, finding the global optimum is very difficult, so we try to find a local optimum instead.
Gradient descent method
Gradient descent method
Finding a local minimum of $f : \mathbb{R} \to \mathbb{R}$ can be achieved by the iteration
$x^{(\tau+1)} = x^{(\tau)} - \eta f'\left(x^{(\tau)}\right)$
where
• $x^{(\tau)}$ is the value of $x$ at the $\tau$-th iteration and
• $\eta > 0$ is a learning rate.
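The iteration is short enough to try out directly. The sketch below applies it to the hypothetical function f(x) = (x − 3)², whose derivative is 2(x − 3); the function name, the learning rate 0.1 and the 100 steps are illustrative choices, not values from the slides.

```python
def gradient_descent(df, x0, eta=0.1, steps=100):
    """Iterate x <- x - eta * f'(x) for a fixed number of steps and return x."""
    x = x0
    for _ in range(steps):
        x = x - eta * df(x)
    return x


# f(x) = (x - 3)^2 has derivative f'(x) = 2 (x - 3) and its minimum at x = 3.
x_min = gradient_descent(lambda x: 2.0 * (x - 3.0), x0=0.0)
```

With η = 0.1 the distance to the minimum shrinks by a factor 0.8 per step here; a learning rate that is too large (η > 1 for this f) would make the iteration diverge instead.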
Backpropagation
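We find a locally optimal weight vector by
$w^{(\tau+1)} = w^{(\tau)} - \eta \nabla E\left(w^{(\tau)}\right).$
The gradient $\nabla E\left(w^{(\tau)}\right)$ is computed by backpropagation [Rumelhart et al., 1986]:
• The basic idea is propagation of the error $\delta_i = \partial E / \partial a_i$ from upper to lower layers.
[Figure: units $z_i$ and $z_j$ connected by weights $w_{ji}$ and $w_{kj}$; the errors $\delta_1, \ldots, \delta_k$ of the upper layer are propagated back to $\delta_j$.]
• (The concrete algorithm is omitted here because it is complex.)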
Over-fitting
• Over-fitting: obtaining $w$ that does not generalize to real data, i.e.,
  • the fitting error for the training set is very small, but
  • the error for real data is very large.
Cause:
• a multilayer NN is sensitive to errors in the training data;
• backpropagation tends to fall into a (bad) local optimum.
Over-fitting
[Figure: a decision boundary over-fitted to two-dimensional samples of Class 1 and Class 2; both axes range from −4 to 4.]
(# of layers = 2, # of hidden units = 10, all activation functions are sigmoid.)
Over-fitting and errors
[Figure: RMS error $E_{\mathrm{RMS}}$ versus # of iterations (0 to 5000) for the training set and the test set.]
Evaluation of the root-mean-square (RMS) error
$E_{\mathrm{RMS}} = \sqrt{2E(w)/N}$
(where $N$ is the # of samples in the evaluated set) for
• a training set and
• a test set.
Result:
• The training error decreases,
• but the test error increases.
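The RMS error follows directly from the sum-of-squares error E(w). The sketch below takes precomputed network outputs and targets as lists of vectors and assumes that N is the number of samples in the evaluated set; the function names and the three-sample data are illustrative.

```python
import math


def sum_of_squares_error(outputs, targets):
    """E(w) = 1/2 * sum over samples of ||f(x, w) - t||^2."""
    return 0.5 * sum(
        sum((y - t) ** 2 for y, t in zip(ys, ts))
        for ys, ts in zip(outputs, targets)
    )


def rms_error(outputs, targets):
    """E_RMS = sqrt(2 E(w) / N), with N taken as the number of samples."""
    n = len(targets)
    return math.sqrt(2.0 * sum_of_squares_error(outputs, targets) / n)


# Hypothetical one-dimensional outputs and targets for three samples.
outputs = [[0.0], [1.0], [3.0]]
targets = [[0.0], [0.0], [1.0]]
e_rms = rms_error(outputs, targets)   # sqrt((0 + 1 + 4) / 3)
```

Dividing by N makes errors on sets of different sizes comparable, which is what allows the training and test curves to share one plot.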
Regularization
Regularization (a.k.a. weight decay) is an approach to prevent over-fitting by minimizing, e.g.,
$E(w) + \frac{\lambda}{2} \|w\|^2$
where $\lambda > 0$ is a regularization coefficient. ($\frac{\lambda}{2} \|w\|^2$ is a penalty.)
Why can regularization prevent over-fitting?
• Empirically, $\|w\|$ gets large when over-fitting.
• Probabilistically, it is a slightly more advanced estimation method.
  • $\min E(w)$ maximizes the likelihood $p(S \mid w)$, assuming $p(w)$ is a uniform distribution (maximum likelihood estimation).
  • $\min \left(E(w) + \frac{\lambda}{2} \|w\|^2\right)$ maximizes the posterior $p(w \mid S)$, assuming $p(w)$ is a zero-mean Gaussian distribution (MAP estimation).
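The name "weight decay" becomes visible when the penalty is plugged into a gradient descent step: the gradient of (λ/2)‖w‖² is λw, so each step also shrinks w by the factor (1 − ηλ). The sketch below assumes the gradient of E is already available as a list; `weight_decay_step` and the numbers are illustrative.

```python
def weight_decay_step(w, grad_E, eta, lam):
    """One gradient descent step on E(w) + (lam / 2) * ||w||^2.

    The penalty contributes lam * w to the gradient, so the update is
    w <- w - eta * (grad_E + lam * w) = (1 - eta * lam) * w - eta * grad_E.
    """
    return [wi - eta * (gi + lam * wi) for wi, gi in zip(w, grad_E)]


# With a zero data gradient, the weights simply decay toward zero.
w_new = weight_decay_step([1.0, -2.0], [0.0, 0.0], eta=0.1, lam=0.5)
```

In this example both weights are multiplied by 1 − 0.1 · 0.5 = 0.95, which counteracts the growth of ‖w‖ observed when over-fitting.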
Vanishing gradient problem
Vanishing gradient problem
• Errors $\delta_i$ are not propagated to lower layers if an NN has a large # of layers.
• Most of the error is absorbed in the top few layers.
Cause:
• a multilayer NN can solve a problem without its lower layers because even a few layers are expressive enough (cf. universal approximation).
Vanishing gradient
[Figure: magnitude of the error $\delta$ of each layer (1st to 10th; log scale from 1e-12 to 1) versus # of iterations (0 to 10000).]
(Each layer contains 10 units, and all activation functions are sigmoid.)
The second “quiet years” of NNs
The first paragraph of a paper [Simard et al., ICDAR 2003]:
After being extremely popular in the early 1990s, neural networks have fallen out of favor in research in the last 5 years. In 2000, it was even pointed out by the organizers of the Neural Information Processing System (NIPS) conference that the term “neural networks” in the submission title was negatively correlated with acceptance. In contrast, positive correlations were made with support vector machines (SVMs), Bayesian networks, and variational methods.
Outline
1 Fundamentals of machine learning: classification
2 Perceptron
3 Multilayer feed-forward NN and backpropagation
4 Deep Learning
  Unsupervised pretraining by restricted Boltzmann machine
  Rectified linear units (ReLU) and Maxout
  Dropout
5 Application of deep learning
Deep Learning
Deep Learning (2006–)
• A set of algorithms for deeply structured NNs (of about 7, 8 or more layers)
• A trend in recent machine learning research
• Actively used for image processing, speech recognition, natural language processing, etc.
• Also applied to analysis of programming languages
Supervised & unsupervised training
Supervised training
• Training using a training set containing inputs and targets
• To obtain $f : X \to Y$ from $S \subseteq X \times Y$ ($f(x) \approx t$ for most $(x, t) \in S$)
• Typical problems:
  • Classification: giving labels for inputs
  • Regression: finding the relationship between two continuous variables
Unsupervised training
• Training using a training set only containing inputs
• To obtain $f : X \to Y$ from $S \subseteq X$
• Typical problems:
  • Clustering: finding a grouping of the inputs
  • Dimensionality reduction: converting high-dimensional vectors into low-dimensional ones (preserving the original information)
Unsupervised pretraining by restricted Boltzmann machine
Hinton and Salakhutdinov, Reducing the Dimensionality of Data with Neural Networks, Science, Vol. 313, No. 5786, pp. 504–507, 2006.
Autoencoder network
• An NN for (nonlinear) dimensionality reduction
• Difficulty of finding appropriate weights
• The initial weights must be close to a good solution.
Hinton’s approach
• Process:
  1. Pretraining: layer-wise computation of good initial weights by restricted Boltzmann machine (RBM)
  2. Unrolling: construction of an autoencoder network using the weights
  3. Fine-tuning: adjustment of the whole network by backpropagation
• Actually, the initial weights are also suitable for other problems.
• Empirically, a lower-dimensional representation of the inputs gives good initial weights.
Autoencoder
[Figure: input signals $x$ enter the hidden layer $h$ and the output layer produces the output signals $y$; both connections are labeled $W$.]
• A two-layer NN that reconstructs its inputs ($x \approx y$)
• Bottleneck: # of hidden units < # of inputs
• Computation:
  • Encoder: $h = f(Wx + b)$
  • Decoder: $y = f(W^\top h + b')$
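The encoder and decoder share one weight matrix, used once as W and once as its transpose. The sketch below computes both steps with plain lists; the choice of the sigmoid for f, the helper names and the 3-input, 2-hidden-unit example are assumptions for illustration.

```python
import math


def sigm(a):
    """Logistic sigmoid, used here as the activation f (an assumption)."""
    return 1.0 / (1.0 + math.exp(-a))


def matvec(M, v):
    """Matrix-vector product for a matrix given as a list of rows."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def transpose(M):
    """Transpose of a matrix given as a list of rows."""
    return [list(col) for col in zip(*M)]


def autoencode(W, b, b_prime, x, f=sigm):
    """Return the code h = f(W x + b) and the reconstruction y = f(W^T h + b')."""
    h = [f(a + bi) for a, bi in zip(matvec(W, x), b)]
    y = [f(a + bi) for a, bi in zip(matvec(transpose(W), h), b_prime)]
    return h, y


# Bottleneck: 3 inputs, 2 hidden units; the weights are arbitrary illustration values.
W = [[0.1, -0.2, 0.3], [0.0, 0.4, -0.1]]
h, y = autoencode(W, [0.0, 0.0], [0.0, 0.0, 0.0], [1.0, 0.0, 1.0])
```

Training would adjust W, b and b' so that y approaches x; with untrained weights as above the reconstruction is of course poor.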
Restricted Boltzmann machine (RBM)
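[Figure: a bipartite graph with visible units $v_1, \ldots, v_m$ (biases $b_1, \ldots, b_m$) and hidden units $h_1, \ldots, h_n$ (biases $c_1, \ldots, c_n$), connected by weights $w_{ij}$.]
$v = (v_1, \ldots, v_m) \in \{0, 1\}^m$
$h = (h_1, \ldots, h_n) \in \{0, 1\}^n$
$b = (b_1, \ldots, b_m) \in \mathbb{R}^m$
$c = (c_1, \ldots, c_n) \in \mathbb{R}^n$
$W = \{w_{ij}\} \in \mathbb{R}^{m \times n}$
$\theta = (b, c, W)$ (parameters)
An RBM is an undirected graph constructed of
• a visible layer (for an input vector $v$) and
• a hidden layer (as a low-dimensional representation $h$).
Estimation of $\theta$: maximizing the likelihood $p(v \mid \theta)$ where
• $p(v, h \mid \theta) = \exp(-E(v, h)) / Z$ and
• $E(v, h) = -b^\top v - c^\top h - v^\top W h$
by Contrastive Divergence learning (approximation).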
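For a very small RBM the energy and the joint probability can be computed exactly, which helps to see what Z is. The sketch below enumerates all binary states to obtain Z; this brute-force normalization is only feasible for tiny m and n and is not the Contrastive Divergence procedure. The function names and parameter values are illustrative.

```python
import itertools
import math


def energy(v, h, b, c, W):
    """E(v, h) = -b^T v - c^T h - v^T W h for binary vectors v and h."""
    bv = sum(bi * vi for bi, vi in zip(b, v))
    ch = sum(cj * hj for cj, hj in zip(c, h))
    vWh = sum(v[i] * W[i][j] * h[j] for i in range(len(v)) for j in range(len(h)))
    return -bv - ch - vWh


def joint_probability(v, h, b, c, W):
    """p(v, h) = exp(-E(v, h)) / Z, with Z summed over all 2^(m+n) states."""
    m, n = len(b), len(c)
    Z = sum(
        math.exp(-energy(vv, hh, b, c, W))
        for vv in itertools.product((0, 1), repeat=m)
        for hh in itertools.product((0, 1), repeat=n)
    )
    return math.exp(-energy(v, h, b, c, W)) / Z


# Hypothetical parameters for m = 2 visible units and n = 1 hidden unit.
b, c, W = [0.1, -0.2], [0.3], [[0.5], [-0.5]]
e = energy((1, 0), (1,), b, c, W)   # -0.1 - 0.3 - 0.5 = -0.9
```

Since the number of states grows as 2^(m+n), Z cannot be computed this way for realistic sizes, which is why an approximation such as Contrastive Divergence is used for training.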
Process of pretraining, unrolling and fine-tuning
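[Figure: (1) pretraining: a stack of three RBMs on top of the input vector learns W1, W2 and W3 layer by layer; (2) unrolling: the weights are copied into an encoder (W1, W2, W3) and a decoder (W3, W2, W1) around the code layer; (3) fine-tuning: the whole six-layer network (layers 1 to 6) is adjusted.]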
Dimensionality reduction for retrieved documents
2000-500-250-125-2 autoencoder for 804,414 newswire stories
The problem of logistic sigmoid function
[Figure: plot of $\mathrm{sigm}(a)$ for $a$ from −6 to 6; the output ranges from 0 to 1.]
• The output of a unit is given by $f(w^\top x)$ where
  $f(a) = \mathrm{sigm}(a) = \frac{1}{1 + \exp(-a)}.$
• The gradient of sigm decreases as training progresses.
• This makes training slow.
High precision requires long training!
Rectified linear units (ReLU)
Nair and Hinton, Rectified Linear Units Improve Restricted Boltzmann Machines, ICML 2010.
[Figure: plots of ReLU and softplus for $a$ from −4 to 4.]
Softplus function: $f(a) = \log(1 + e^a)$
• More biologically plausible than sigmoid (rarely saturated)
Rectified linear units (ReLU): practical approx. of softplus
• ReLU: $f(a) = \max(0, a)$
• NReLU (Noisy ReLU): $f(a) = \max(0, a + \mathcal{N}(0, \sigma(a)))$
Features of ReLUs:
• Fast convergence
• High precision for real data
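How closely ReLU approximates softplus can be checked numerically. The sketch below implements both functions from their definitions; the noisy variant is left out because it would require a concrete choice of the noise scale.

```python
import math


def softplus(a):
    """Softplus log(1 + e^a), a smooth function that is always positive."""
    return math.log(1.0 + math.exp(a))


def relu(a):
    """Rectified linear unit max(0, a)."""
    return max(0.0, a)


# Gap between the two functions at a few points; it is largest at a = 0.
gaps = {a: softplus(a) - relu(a) for a in (-10.0, 0.0, 10.0)}
```

The gap is log 2 at a = 0 and falls below 1e-4 at |a| = 10, so the two functions differ mainly near the origin.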
Maxout
Goodfellow et al., Maxout Networks, ICML 2013.
[Figure: an example of maxout output for inputs from −4 to 4.]
Maxout: f(x) = max(x1, x2, . . . , xn)
• An activation function that outputs the max of inputs
• Equal to a piecewise linear function
• Higher precision for real data
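That the maximum of a few linear pieces is a piecewise linear function can be seen on two familiar special cases. In the sketch below the pieces are fixed by hand for illustration, whereas in a maxout network they would be learned; the helper names are hypothetical.

```python
def maxout(pieces):
    """Maxout activation: the maximum of its inputs."""
    return max(pieces)


def abs_via_maxout(a):
    """|a| as the maximum of the two linear pieces a and -a."""
    return maxout([a, -a])


def relu_via_maxout(a):
    """ReLU as the maximum of the constant piece 0 and the linear piece a."""
    return maxout([0.0, a])
```

For example, `abs_via_maxout(-3.0)` gives 3.0 and `relu_via_maxout(-1.0)` gives 0.0, so both the absolute value and ReLU are instances of maxout with two pieces.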
[Figure: MNIST classification error versus network depth (# of layers, 1 to 7), showing train and test error for maxout and rectifier networks.]
Dropout
Hinton, et al. Improving neural networks by preventing co-adaptation of feature detectors. Technical report, arXiv:1207.0580, 2012.
[Figure: dropout probability (frequently used values): 20% for inputs and 50% for each hidden layer; the output layer is marked 100%.]
• An important approach to prevent over-fitting
• The performance was better than otherregularization methods.
• To randomly drop inputs & hidden units during training
[Figure: at training time a unit is present with probability $p$ and has weight $w$; at test time the unit is always present and its weight is $pw$.]
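The training/test asymmetry can be sketched as two small functions acting on the output signals of one layer. This is a simplified illustration: scaling a unit's signal by p at test time is used here as an equivalent of scaling its outgoing weights to pw, and the function names and the seeded random generator are assumptions.

```python
import random


def dropout_train(signals, p, rng):
    """Training: keep each signal with probability p; dropped signals become 0."""
    return [s if rng.random() < p else 0.0 for s in signals]


def dropout_test(signals, p):
    """Test: every unit is present; scaling by p matches the expected training signal."""
    return [p * s for s in signals]


rng = random.Random(0)                               # seeded for reproducibility
train_signals = dropout_train([1.0] * 1000, 0.5, rng)
test_signals = dropout_test([2.0, 4.0], 0.5)         # [1.0, 2.0]
```

On average about half of the 1000 training signals survive, so the test-time scaling by p keeps the expected input to the next layer the same in both modes.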
Dropout is like bagging
Bagging
[Figure: training: training sets #1 to #N are drawn from the training set by random sampling with replacement, and classifiers #1 to #N are trained on them; test (classification): an input $x$ is given to all classifiers, and the output $y$ is the majority vote over $y_1, \ldots, y_N$.]
Dropout
• Each neuron is trained using different data.
• It is similar to using a combination of many NNs.
Effect of dropout of inputs
MNIST test set (handwritten digits) / classification / 784-{800-800, 1200-1200, 2000-2000, 1200-1200-1200}-10
Effect of dropout
TIMIT core test set (English speech recognition) / classification / 4 fully-connected hidden layers × 4000 units
Outline
1 Fundamentals of machine learning: classification
2 Perceptron
3 Multilayer feed-forward NN and backpropagation
4 Deep Learning
5 Application of deep learning
  Image processing
  Other fields
  Programming language analysis
Image processing and feature extraction
Conventional approaches (shallow learning)
[Figure: raw data (RGB pixels) → feature extraction → features (SIFT, SURF, etc.) → classifier → label(s), e.g., "Animal".]
• Features needed to be designed by hand (SIFT, SURF, etc.).
• Craftsmanship was required.
Deep learning
[Figure: raw data (RGB pixels) → features → label(s), e.g., "Animal".]
• Information corresponding to features is automatically obtained by a deep CNN.
In addition, the accuracy is higher than with the traditional approaches!
ILSVRC 2012
Krizhevsky, Sutskever & Hinton, ImageNet Classification with Deep Convolutional Neural Networks, pp. 1097–1105, NIPS 2012.
• Large Scale Visual Recognition Challenge 2012 (ILSVRC 2012)
  • A competition for object recognition in photos (1000 labels)
  • Hinton’s team (Univ. of Toronto) outclassed the other teams.
Team name            Error
SuperVision          0.15315
ISI                  0.26172
OXFORD VGG           0.26979
XRCE/INRIA           0.27058
Univ. of Amsterdam   0.29576
etc.
Google’s grandmother neurons
Le et al., Building High-level Features Using Large Scale Unsupervised Learning, ICML 2012.
Pretraining on 10 million 200x200 pixel images
• 9-layer autoencoder
• 1 billion connections
Neurons responding to specific stimuli (grandmother neurons) are automatically obtained.
[Figure: images that activate such neurons: human faces, cat faces, human bodies.]
Image recognition accuracy is improved by training after the pretraining.
Deep Q-learning
Mnih et al., Playing Atari with Deep Reinforcement Learning, CoRR abs/1312.5602, 2013.
• Reinforcement learning of Atari 2600 games by a deep CNN (input: raw images of the game screen)
• Beating an expert human player at Breakout, Enduro and Pong
• However, losing at Q*bert, Seaquest and Space Invaders
  • These games require a long-term strategy.
[Figure: screenshots of Pong, Breakout, Space Invaders, Seaquest and Beam Rider.]
Learning to execute
Zaremba & Sutskever, Learning to Execute, CoRR abs/1410.4615, 2014.
Can NNs predict execution output of short programs written in an unknown language?
Input :
  j=8584
  for x in range(8):
    j+=920
  b=(1500+j)
  print((b+7564))
Target : 25008.
Problem setting
Model: recurrent neural network (RNN) with long short-term memory (LSTM)
• 2 layers × 400 units
• A program is given as a character stream.
• The model does not know the syntax and semantics of a given program.
Short programs
• The Python syntax
• Containing addition, multiplication, variable assignment, if-statements, and for-loops
  • Double loops are forbidden.
  • One of the operands of multiplication and the range of for-loops is constant.
• Complexity parameters of programs:
  • length: the maximum # of digits in integers in a program
  • nesting: the depth of the parse tree
Examples of short programs
length = 4, nesting = 3:
Input :
  j=8584
  for x in range(8):
    j+=920
  b=(1500+j)
  print((b+7564))
Target : 25008.
Input :
  i=8827
  c=(i-5347)
  print((c+8704) if 2641<8500 else 5308)
Target : 12184.
An example with scrambled characters:
Input :
  vqppknsqdvfljmncy2vxdddsepnimcbvubkomhrpliibtwztbljipcc
Target : hkhpg
Exact prediction examples
Input :
  f=(8794 if 8887<9713 else (3*8334))
  print((f+574))
Target : 9368.
Model prediction : 9368.
Input :
  c=445
  d=(c-4223)
  for x in range(1):
    d+=5272
  print((8942 if d<3749 else 2951))
Target : 8942.
Model prediction : 8942.
Misprediction examples
Input :
  j=8584
  for x in range(8):
    j+=920
  b=(1500+j)
  print((b+7567))
Target : 25011.
Model prediction : 23011.
Input :
  a=1027
  for x in range(2):
    a+=(402 if 6358>8211 else 2158)
  print(a)
Target : 5343.
Model prediction : 5293.
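The targets are simply what a Python interpreter prints for each program, so they can be reproduced by executing the inputs. The sketch below does this for the two mispredicted programs; the helper `run_program` and the indentation of the loop bodies are reconstructions for illustration, and the trailing "." of the slide's targets (taken here to be an end marker) is not part of the printed output.

```python
import contextlib
import io


def run_program(source):
    """Execute a short Python program and return what it prints, stripped."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        exec(source, {})          # fresh namespace for every program
    return buffer.getvalue().strip()


program_1 = (
    "j=8584\n"
    "for x in range(8):\n"
    "  j+=920\n"
    "b=(1500+j)\n"
    "print((b+7567))\n"
)
program_2 = (
    "a=1027\n"
    "for x in range(2):\n"
    "  a+=(402 if 6358>8211 else 2158)\n"
    "print(a)\n"
)
target_1 = run_program(program_1)   # "25011"
target_2 = run_program(program_2)   # "5343"
```

Comparing with the model predictions above, both mispredictions differ from the true output in only one or two digits, which shows how close a character-level model can get without actually executing anything.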
Training strategies
• Baseline
  • Learning with the target distribution (with length = a, nesting = b)
• Naive (naive curriculum learning) [Bengio et al., 2009]
  • Gradually increasing the “difficulty level” of training samples
  • Giving them from (length, nesting) = (1, 1) to (a, b)
• Mix (mixed strategy)
  • Mix of all levels of hardness
  • Picking length ∈ [1, a] and nesting ∈ [1, b] independently for every sample
• Combined (combined strategy)
  • Combination of mix with naive curriculum learning
Absolute prediction accuracy (baseline & combined)
Relative prediction accuracy (naive, mix & combined)
[Figure: three panels showing the relative prediction accuracy of Naive, Mix and Combined.]
Conclusion
• Naive curriculum learning is sometimes worse than the baseline.
  • Naive: giving training samples from (length, nesting) = (1, 1) to (a, b)
  • The model reconstructs its memory to take larger numbers (e.g., 5 digits → 6 digits).
  • The memory pattern reconstruction might be difficult.
• The authors said “we don’t know how much our networks understand the meaning of programs.”
Summary
• Fundamentals of machine learning
  • Feature vectors
  • Classification
  • Dimensionality reduction
• History of NNs
  • Perceptron and single-layer NNs
  • Multi-layer NNs, gradient descent method and backpropagation
  • Deep learning
    • Pretraining by RBM
    • Rectified linear units (ReLU) and Maxout
    • Dropout
• Applications
  • Image processing (ILSVRC 2012, Google’s grandmother neurons)
  • Programming language analysis (Learning to execute)
APPENDIX
Outline
6 Derivation of back propagation
Derivation of backpropagation (1)
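The update of the weight $w_{i,j}$ is defined by
$w_{i,j}^{(\tau+1)} = w_{i,j}^{(\tau)} - \eta \frac{\partial E}{\partial w_{i,j}^{(\tau)}}.$
The derivative is
$\frac{\partial E}{\partial w_{i,j}^{(\tau)}} = \sum_{(x,t) \in S} \frac{\partial E}{\partial a_i^{(\tau)}(x)} \frac{\partial a_i^{(\tau)}(x)}{\partial w_{i,j}^{(\tau)}} = \sum_{(x,t) \in S} \delta_i^{(\tau)}(x)\, s_j^{(\tau)}(x)$
where
$\delta_i^{(\tau)}(x) \equiv \frac{\partial E}{\partial a_i^{(\tau)}(x)}.$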
Derivation of backpropagation (2)
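• If $i \in \mathcal{V}$ is in the output layer, then
  $\delta_i^{(\tau)}(x) \equiv \frac{\partial E}{\partial a_i^{(\tau)}(x)} = \frac{\partial E}{\partial f_i^{(\tau)}(x)} \frac{\partial f_i^{(\tau)}(x)}{\partial a_i^{(\tau)}(x)} = E'\left(f_i^{(\tau)}(x)\right) h_i'\left(a_i^{(\tau)}(x)\right).$
• Otherwise we obtain
  $\delta_i^{(\tau)}(x) = \sum_{k \text{ s.t. } (k,i) \in \mathcal{E}} \frac{\partial E}{\partial a_k^{(\tau)}(x)} \frac{\partial a_k^{(\tau)}(x)}{\partial a_i^{(\tau)}(x)}$
  $= \sum_{k \text{ s.t. } (k,i) \in \mathcal{E}} \frac{\partial E}{\partial a_k^{(\tau)}(x)} \frac{\partial a_k^{(\tau)}(x)}{\partial s_i^{(\tau)}(x)} \frac{\partial s_i^{(\tau)}(x)}{\partial a_i^{(\tau)}(x)}$
  $= \left(\sum_{k \text{ s.t. } (k,i) \in \mathcal{E}} \delta_k^{(\tau)}(x)\, w_{k,i}\right) h_i'\left(a_i^{(\tau)}(x)\right).$
  (The sums run over the edges $(k, i) \in \mathcal{E}$ leaving unit $i$, not over the training set $S$.)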
Algorithm Backpropagation (batch mode)
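Initialize $w^{(0)}$ randomly.
for $\tau = 0, 1, 2, \ldots$ do
  Compute signals $s_i^{(\tau)}(x)$ for all $i \in \mathcal{V}$, $(x, t) \in S$ from lower to upper layers.
  Update weights for all $(i, j) \in \mathcal{E}$ from upper to lower layers by
    $w_{i,j}^{(\tau+1)} \leftarrow w_{i,j}^{(\tau)} - \eta \Delta w_{i,j}^{(\tau)}$
  where
    $\Delta w_{i,j}^{(\tau)} = \sum_{(x,t) \in S} \delta_i^{(\tau)}(x)\, s_j^{(\tau)}(x)$
  and
    $\delta_i^{(\tau)}(x) = \begin{cases} \left(y_i^{(\tau)}(x) - t_i\right) h_i'\left(a_i^{(\tau)}(x)\right) & \text{if } i \in \mathcal{V} \text{ is in the output layer,} \\ \left(\sum_{k \text{ s.t. } (k,i) \in \mathcal{E}} \delta_k^{(\tau)}(x)\, w_{k,i}\right) h_i'\left(a_i^{(\tau)}(x)\right) & \text{otherwise.} \end{cases}$
end for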
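The two cases of δ can be sketched for a small fully connected two-layer network. The code below computes the gradient contribution of a single sample (the batch algorithm sums these over S); the sigmoid activation in both layers, the function names and the example weights are assumptions for illustration.

```python
import math


def sigm(a):
    """Logistic sigmoid; its derivative is sigm(a) * (1 - sigm(a))."""
    return 1.0 / (1.0 + math.exp(-a))


def forward(W1, W2, x):
    """Return inputs with bias, hidden signals with bias, and outputs."""
    x1 = [1.0] + list(x)
    s = [1.0] + [sigm(sum(w * v for w, v in zip(row, x1))) for row in W1]
    y = [sigm(sum(w * v for w, v in zip(row, s))) for row in W2]
    return x1, s, y


def error(W1, W2, x, t):
    """E = 1/2 * ||y - t||^2 for one sample."""
    _, _, y = forward(W1, W2, x)
    return 0.5 * sum((yi - ti) ** 2 for yi, ti in zip(y, t))


def backprop(W1, W2, x, t):
    """Return dE/dW1 and dE/dW2 for one sample (x, t)."""
    x1, s, y = forward(W1, W2, x)
    # Output layer: delta_k = (y_k - t_k) * h'(a_k).
    d2 = [(yk - tk) * yk * (1.0 - yk) for yk, tk in zip(y, t)]
    # Hidden layer: delta_i = (sum_k delta_k * w_{k,i}) * h'(a_i); s[0] is the bias.
    d1 = [
        sum(d2[k] * W2[k][i] for k in range(len(W2))) * s[i] * (1.0 - s[i])
        for i in range(1, len(s))
    ]
    g2 = [[dk * sj for sj in s] for dk in d2]      # dE/dw_{k,j} = delta_k * s_j
    g1 = [[di * xj for xj in x1] for di in d1]     # dE/dw_{i,j} = delta_i * x_j
    return g1, g2


# Hypothetical 2-2-1 network and one training sample.
W1 = [[0.1, -0.2, 0.3], [0.0, 0.4, -0.1]]
W2 = [[0.2, -0.3, 0.5]]
x, t = [1.0, -1.0], [1.0]
g1, g2 = backprop(W1, W2, x, t)
```

A common sanity check is to compare one entry of `g1` or `g2` with a finite-difference estimate of the same derivative obtained from `error`; the two should agree to several digits.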