Introduction to neural networks

Stefano Rovetta

20/23-Jul-2016

Back to optimization


Stochastic optimization

• Optimize a cost that is a random variable

• Types of randomness:

- Measurement plus noise: $R + \nu$

- Multiple effects mixed together (we might use a mixture model)

- Unknown statistical properties


Monte Carlo integration

• Expectation of a random variable X:

$$E\{X\} = \int_E \xi \, p_x(\xi) \, d\xi$$

(over the whole data space E)

• ...but only a sample $\{x_1, \dots, x_n\}$ is given (the training set)

• Empirical distribution:

$$P_x(\xi) = \frac{1}{n} \sum_{l=1}^{n} \delta(\xi - x_l)$$

• Approximate (empirical) expectation of X:

$$\hat{E}\{X\} = \int_E \xi \, P_x(\xi) \, d\xi = \frac{1}{n} \sum_{l=1}^{n} x_l$$

• This is a Monte Carlo integral
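As a minimal sketch in Python (NumPy; the shifted normal distribution is an arbitrary stand-in for $p_x$), the Monte Carlo integral is just the sample mean:

    import numpy as np

    rng = np.random.default_rng(0)
    n = 10_000
    sample = rng.normal(loc=2.0, scale=1.0, size=n)  # draws x_1..x_n from p_x

    # Monte Carlo estimate of E{X}: the empirical mean (1/n) sum_l x_l
    estimate = sample.mean()
    print(estimate)  # close to the true mean 2.0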


• Suppose that R is classification performance (risk).

• We want to optimize the true risk, the one computed on all possible, infinite data:

$$R(w) = \int R(y(x), w)\, p(x)\, dx$$

• This is a function of w (the weights identify one specific neural net)

• It is also a function of the data distribution p(x)

(the performance is estimated on the data)


• When training a neural network we don't have p(x), but only the training set $\{x_1, \dots, x_n\}$

• From the training set we have the empirical distribution

$$P_x(\xi) = \frac{1}{n} \sum_{l=1}^{n} \delta(\xi - x_l)$$

• so we can compute a Monte Carlo estimate of the risk

$$\hat{R}(w, X) = \frac{1}{n_p} \sum_{l=1}^{n_p} R(y(x_l), w)$$

this is the empirical risk.
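A sketch of the same idea in code (hypothetical names: `predict` stands for the trained network $y(\cdot)$, `loss` for the pointwise risk $R$):

    import numpy as np

    def empirical_risk(predict, loss, X, T):
        """Monte Carlo estimate of the risk: average loss over the n_p training patterns."""
        return np.mean([loss(predict(x), t) for x, t in zip(X, T)])

    # example with a squared-error pointwise loss
    squared_error = lambda y, t: (t - y) ** 2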


Training by epoch

• Optimize using the whole training set to estimate the cost

• It means computing R (and the ∆W)

• on the basis of a Monte Carlo estimate of risk

• Finds the optimal value of an approximate (empirical) cost function


Stochastic approximation

• A special kind of stochastic optimization

• R is estimated at each input pattern using that pattern alone

• Extremely unreliable estimation – but it converges in probability!

• Robbins and Monro, 1951; Kiefer and Wolfowitz, 1952


• Convergence in probability:

$$\lim_{n \to \infty} \Pr\left( |\hat{R}_n - R| \ge \varepsilon \right) = 0$$

• $\hat{R}_n$ is the estimate of R on a training set of size n


Stochastic approximation

• Given:

- A function R whose gradient ∇R we want to set to zero, or minimize (but which we cannot compute analytically)

- A sequence $G_1, G_2, \dots, G_l, \dots$ of random samples of ∇R, affected by random noise

- A decreasing sequence $\eta_1, \eta_2, \dots, \eta_l, \dots$ of step size coefficients

• Basic iteration:

$$w(l+1) = w(l) - \eta_l\, G_l$$
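A minimal sketch of this iteration, assuming noisy gradient samples of the simple quadratic cost $R(w) = \frac{1}{2}\|w\|^2$ (whose exact gradient is $w$), with decreasing step sizes $\eta_l = \eta_0 / l$:

    import numpy as np

    rng = np.random.default_rng(0)
    w = np.array([5.0, -3.0])      # initial weights
    eta0 = 0.5

    for l in range(1, 10_001):
        G = w + rng.normal(scale=1.0, size=2)  # noisy sample G_l of grad R = w
        eta = eta0 / l                         # decreasing step sizes eta_l
        w = w - eta * G                        # w(l+1) = w(l) - eta_l G_l

    print(w)   # approaches the minimizer [0, 0]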


Stochastic approximation: The intuition

• Each sample gives a noisy (stochastic) estimate of the gradient

• ⇒ ∇R + noise

• By averaging over time, noise cancels out

• Random variations also make it possible to escape local minima


Results on convergence of stochastic approximation

• If R is twice differentiable and convex, then stochastic approximation converges with a rate of $O(1/l)$

• A condition of convergence (not an optimal rate of convergence):

$$0 < \sum_l \eta_l^2 = A < \infty$$

• Usually the hypotheses are not met (complex cost landscape) and we don't have guarantees.


Training by pattern

• is computing R (and the ∆W)

• on the basis of an estimate of the risk on a single point

• An extreme Monte Carlo estimate, on a training set of only one observation

• Finds the approximate optimal value of an approximate cost function


Implementation of training

• By epoch: estimation loop, then update

• By pattern: estimation + update loop

• By pattern on a training set: pick l at random

• Learning rate η: by pattern, keep it low; by epoch, make it adaptive (see the sketch below)
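A sketch of the two loops (the helper `grad(w, x)`, returning a single-pattern gradient estimate, is hypothetical):

    import numpy as np

    def train_by_epoch(w, X, grad, eta, epochs):
        for _ in range(epochs):
            delta = sum(grad(w, x) for x in X) / len(X)  # estimation loop...
            w = w - eta * delta                          # ...then one update
        return w

    def train_by_pattern(w, X, grad, eta, epochs):
        rng = np.random.default_rng(0)
        for _ in range(epochs):
            for l in rng.permutation(len(X)):            # l = random
                w = w - eta * grad(w, X[l])              # estimate + update per pattern
        return w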


Multi-layer neural networks


Connectionism and Parallel Distributed Processing

David Rumelhart, James McClelland, Geoffrey Hinton


What is connectionism?

• Connectionism is an approach to cognitive science that characterizes learning and memory through the discrete interactions between nodes of neural networks

• Representation of concepts and rules is not concentrated in symbols with a lot of meaning, but in sub-symbolic "neural encodings" (neuron activations) which have a meaning only if taken collectively, as patterns

• Neural networks are distributed and massively parallel

• They rely on spontaneously-generated internal representations


Network topologies

Most general: feedback.

Units may be visible or hidden.


Network topologies

A special type of feedback is given by lateral connections.


Network topologies

Less general: a topology where cycles are forbidden: feedforward. Visible units may be input or output.


Network topologies

Least general: multi-layer


Why multi-layer?

Linear separability

Feature discovery

Hierarchies of abstractions


Example: Parity

Problem: Given any input string of d bits, tell whether the number of bits set (= 1) is even.

This generalizes XOR: it is not linearly separable.


Example: Parity

The solution requires d hidden units
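One classical construction is sketched below (threshold units; an illustration, not the only possible solution): hidden unit j fires when at least j input bits are set, and alternating ±1 output weights make active units cancel in pairs, leaving 1 exactly when the count is odd:

    import numpy as np

    def parity_net(bits):
        """d-bit parity with d threshold hidden units: 1 if the count of set bits is odd."""
        d = len(bits)
        s = np.sum(bits)
        h = np.array([1.0 if s >= j else 0.0 for j in range(1, d + 1)])  # unit j fires iff >= j bits set
        out_w = np.array([(-1.0) ** (j + 1) for j in range(1, d + 1)])   # weights +1, -1, +1, ...
        return int(out_w @ h > 0.5)        # active units cancel in pairs; 1 left over iff odd

    assert parity_net([1, 0, 1, 1]) == 1   # three bits set: odd
    assert parity_net([1, 0, 1, 0]) == 0   # two bits set: even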


Universal approximation theorem

G. Cybenko, 1989

A feed-forward network with a single hidden layer containing a finite number of neurons (with a sigmoidal activation) can approximate any continuous function on compact subsets of $R^d$ to arbitrary accuracy.


How do we train a multi-layer neural network?

1 With a suitable algorithm

2 With a sequence of independent trainings


• As we have seen, learning (e.g., learning to recognize) can be cast as the problem of optimizing a suitable cost function (risk)

• But most optimization methods rely on the necessary minimum condition ∇E = 0 or on the direction of the gradient ∇E

→ requirement: E must be at least differentiable (even better if also convex, but that's not always possible)

• Even if E is differentiable, for hidden units we cannot compute an error term like $(t - a)^2$ (MSE)

→ requirement: we need a way to do this


A differentiable activation function

• Let's write the discriminant function for a problem with two Gaussian, spherical, equal-variance classes.

• Translation of the origin, rotation of axes...

• 1-dimensional symmetrical problem in x with only two parameters:

$$p(x|\omega_1) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left[ -\frac{(x-\mu)^2}{2\sigma^2} \right] \qquad p(x|\omega_2) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left[ -\frac{(x+\mu)^2}{2\sigma^2} \right]$$


By Bayes' theorem:

$$P(\omega_1|x) = \frac{p(x|\omega_1)\, P(\omega_1)}{p(x|\omega_1)P(\omega_1) + p(x|\omega_2)P(\omega_2)}$$

$$P(\omega_2|x) = \frac{p(x|\omega_2)\, P(\omega_2)}{p(x|\omega_1)P(\omega_1) + p(x|\omega_2)P(\omega_2)}$$


2-class discriminant function (assuming equal priors):

$$g(x) = P(\omega_1|x) - P(\omega_2|x) = \frac{\exp\left[-\frac{(x-\mu)^2}{2\sigma^2}\right] - \exp\left[-\frac{(x+\mu)^2}{2\sigma^2}\right]}{\exp\left[-\frac{(x-\mu)^2}{2\sigma^2}\right] + \exp\left[-\frac{(x+\mu)^2}{2\sigma^2}\right]}$$

after removing the common factors $1/(\sqrt{2\pi}\,\sigma)$.


Expanding the squares, $-(x \mp \mu)^2 / (2\sigma^2) = -\frac{x^2 + \mu^2}{2\sigma^2} \pm \frac{x\mu}{\sigma^2}$:

$$g(x) = \frac{\exp\left[-\frac{x^2+\mu^2}{2\sigma^2}\right] \exp\left[\frac{x\mu}{\sigma^2}\right] - \exp\left[-\frac{x^2+\mu^2}{2\sigma^2}\right] \exp\left[-\frac{x\mu}{\sigma^2}\right]}{\exp\left[-\frac{x^2+\mu^2}{2\sigma^2}\right] \exp\left[\frac{x\mu}{\sigma^2}\right] + \exp\left[-\frac{x^2+\mu^2}{2\sigma^2}\right] \exp\left[-\frac{x\mu}{\sigma^2}\right]}$$

The common positive factor $\exp\left[-\frac{x^2+\mu^2}{2\sigma^2}\right]$ cancels out:

$$g(x) = \frac{e^{x\mu/\sigma^2} - e^{-x\mu/\sigma^2}}{e^{x\mu/\sigma^2} + e^{-x\mu/\sigma^2}}$$


• We replace x with the score $r = x \cdot w'$

• We can absorb the factor $\mu/\sigma^2$ into the norm of $w'$:

$$w = \frac{\mu}{\sigma^2}\, w'$$

• We obtain

$$g(r) = \frac{e^r - e^{-r}}{e^r + e^{-r}}, \qquad r = x \cdot w$$

$g(r)$ is the hyperbolic tangent activation, $\tanh(r)$

• logistic or sigmoid activation:

$$\sigma(r) = \frac{1}{1 + e^{-r}} = \frac{\tanh(r/2) + 1}{2}$$


[Figure: the SIGMOID and TANH activations, a as a function of r over the range −10 ≤ r ≤ 10]


[Figure: the HEAVISIDE and SIGN threshold activations, a as a function of r over the range −10 ≤ r ≤ 10]


• The sigmoid is the solution of the logistic equation

$$y' = y(1 - y)$$

• Therefore, by definition,

$$\frac{\partial \sigma(r)}{\partial r} = \sigma(r)\,(1 - \sigma(r))$$

• Also,

$$\frac{\partial \tanh(r)}{\partial r} = 1 - \tanh^2(r)$$
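A small sketch of these activations and their derivatives, with a finite-difference check of the two identities:

    import numpy as np

    def sigmoid(r):
        return 1.0 / (1.0 + np.exp(-r))

    def d_sigmoid(r):
        s = sigmoid(r)
        return s * (1.0 - s)          # sigma'(r) = sigma(r)(1 - sigma(r))

    def d_tanh(r):
        return 1.0 - np.tanh(r) ** 2  # tanh'(r) = 1 - tanh(r)^2

    r, eps = 0.7, 1e-6
    assert abs((sigmoid(r + eps) - sigmoid(r - eps)) / (2 * eps) - d_sigmoid(r)) < 1e-8
    assert abs((np.tanh(r + eps) - np.tanh(r - eps)) / (2 * eps) - d_tanh(r)) < 1e-8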


The error back-propagation algorithm

• Discovered by Amari/Werbos/Parker/Rumelhart/Hinton/Williams from 1974 to 1986

• The name appears in Rosenblatt's book "Principles of Neurodynamics" in 1962

• A clever application of the chain rule of differential calculus

• We can perform gradient descent in a distributed way, and without computing derivatives analytically

• The responsibility for errors is back-propagated from the outputs back inside the network, and distributed among the hidden layers.


The chain rule

$$\frac{d f(g(x))}{dx} = \left. \frac{d f(y)}{dy} \right|_{y=g(x)} \frac{d g(x)}{dx}$$

Where is the "chain"?

$$\frac{d f(g(h(x)))}{dx} = \frac{d f(g)}{dg} \cdot \frac{d g(h)}{dh} \cdot \frac{d h(x)}{dx}$$

which, for instance, can be used to prove that

$$\frac{\partial \sigma(r)}{\partial w_i} = \frac{d \sigma(r)}{dr} \frac{\partial r}{\partial w_i} = \sigma'(r)\, x_i = \sigma(r)\,(1 - \sigma(r))\, x_i \qquad (1)$$
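A quick numerical check of Eq. (1), on arbitrary example values: compare the analytic gradient on one weight with a central finite difference:

    import numpy as np

    sigmoid = lambda r: 1.0 / (1.0 + np.exp(-r))

    x = np.array([0.5, -1.2, 2.0])
    w = np.array([0.3, 0.8, -0.5])
    i, eps = 1, 1e-6

    r = x @ w
    analytic = sigmoid(r) * (1.0 - sigmoid(r)) * x[i]   # Eq. (1)

    w_plus, w_minus = w.copy(), w.copy()
    w_plus[i] += eps
    w_minus[i] -= eps
    numeric = (sigmoid(x @ w_plus) - sigmoid(x @ w_minus)) / (2 * eps)

    assert abs(analytic - numeric) < 1e-8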


Notation:

n_p    number of patterns in the training set
n_i    number of input units
n_h    number of hidden units
n_o    number of output units
n_w    total number of weights, n_w = (n_i + 1) n_h + (n_h + 1) n_o
i      index for input components
j      index for hidden units
k      index for output units
x_i    i-th component of the input pattern
r_j    net stimulus of the j-th hidden unit
r_k    net stimulus of the k-th output unit
sh_j   j-th hidden unit activation value
so_k   k-th component of the output
tg_k   k-th component of the target
wh_ji  weight to the j-th hidden unit from the i-th input unit   [(n_i + 1) × n_h weights]
wo_kj  weight to the k-th output unit from the j-th hidden unit  [(n_h + 1) × n_o weights]


Loss function: $\lambda(so_k, tg_k) = (tg_k - so_k)^2$

1 in general there may be several output units;

2 the overall cost function is not quadratic (a paraboloid) because the network is non-linear

Non-convex cost function


Expected cost

$$E = \int \frac{1}{2} \frac{1}{n_o} \sum_{k=1}^{n_o} \left( so_k(x) - tg_k(x) \right)^2 p(x)\, dx \qquad (2)$$

E is known only through its estimate on the training set (here by epoch):

$$E = \frac{1}{n_p} \sum_{l=1}^{n_p} \frac{1}{2} \frac{1}{n_o} \sum_{k=1}^{n_o} \left( so_k(x_l) - tg_k(x_l) \right)^2 \qquad (3)$$


Summation and differentiation are both linear, and can therefore be exchanged freely.

$$E = \frac{1}{2} \frac{1}{n_o} \sum_{k=1}^{n_o} (so_k - tg_k)^2 \qquad (4)$$

We only consider one pattern:

• For online training (= by pattern), we apply each ∆w immediately, as we did with the perceptron and Adaline

• For training by epoch, we sum the ∆w over all patterns and apply them only at the end of each pass (a training epoch).

• For training by batch, we sum several ∆w and apply them after some fraction of a complete pass.


The operation of the multilayer perceptron is divided into two steps:

• activation forward-propagation

• error back-propagation.


Forward propagation

[Figure: activations flow forward, from the input layer to the output layer]


$$\forall j \quad r_j = \sum_{i=0}^{n_i} wh_{ji}\, x_i \;\Rightarrow\; sh_j = \sigma(r_j) \qquad (5)$$

$$\forall k \quad r_k = \sum_{j=0}^{n_h} wo_{kj}\, sh_j \;\Rightarrow\; so_k = \sigma(r_k) \qquad (6)$$
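A sketch of the forward pass in NumPy (index 0 plays the role of the bias input, as the sums starting at i = 0 and j = 0 suggest; `Wh` and `Wo` are the two weight matrices):

    import numpy as np

    def forward(x, Wh, Wo):
        """Forward propagation; Wh is (n_h, n_i+1), Wo is (n_o, n_h+1)."""
        sigmoid = lambda r: 1.0 / (1.0 + np.exp(-r))
        x1 = np.concatenate(([1.0], x))     # x_0 = 1: bias input
        rh = Wh @ x1                        # Eq. (5): net stimuli of hidden units
        sh = sigmoid(rh)
        sh1 = np.concatenate(([1.0], sh))   # sh_0 = 1: bias for the output layer
        rk = Wo @ sh1                       # Eq. (6): net stimuli of output units
        return sh1, sigmoid(rk)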


Error back-propagation

[Figure: error signals flow backward, from the output layer to the input layer]


Error back-propagation and update

We start from the computation of the partial derivatives, i.e., the gradient of the error.

$w$ generically denotes any of the weights of the network.

We need all the components of the gradient ∇E. These are

$$\frac{\partial E}{\partial w}$$

for all possible $w$.


$$\frac{\partial E}{\partial w} = \frac{1}{2} \frac{1}{n_o} \sum_{k=1}^{n_o} \frac{\partial (so_k - tg_k)^2}{\partial w} = \frac{1}{n_o} \sum_{k=1}^{n_o} (so_k - tg_k) \frac{\partial so_k}{\partial w} \qquad (7)$$

Depending on whether $w$ is a $wo$ or a $wh$, we will have different expansions of the above expression.


Hidden-to-output weights $wo_{kj}$

$$\frac{\partial E}{\partial wo_{kj}} = \frac{1}{n_o} \sum_{k'=1}^{n_o} (so_{k'} - tg_{k'}) \frac{\partial so_{k'}}{\partial r_{k'}} \frac{\partial r_{k'}}{\partial wo_{kj}} \qquad (8)$$

We can drop all the terms not depending on $wo_{kj}$, i.e., those with $k' \neq k$:

$$\frac{\partial E}{\partial wo_{kj}} = \frac{1}{n_o} (so_k - tg_k) \frac{\partial so_k}{\partial r_k} \frac{\partial r_k}{\partial wo_{kj}} \qquad (9)$$

We plug in quantities known from the forward pass:

$$\frac{\partial E}{\partial wo_{kj}} = \frac{1}{n_o} (so_k - tg_k)\, \sigma'(r_k)\, sh_j \qquad (10)$$


If we define

$$\delta_k = (so_k - tg_k)\, \sigma'(r_k) \qquad (11)$$

we have a generalization of the "delta" term which we have seen in the delta rule by Widrow and Hoff.

Generalized delta rule for the hidden-to-output weights:

$$\Delta wo_{kj} = -\eta\, \delta_k\, sh_j \qquad (12)$$


Problem with the input-to-hidden weights: not all the terms are readily available. We use the chain rule again to find another formulation for $\partial E / \partial wh_{ji}$.


$$\frac{\partial E}{\partial wh_{ji}} = \frac{1}{2} \frac{1}{n_o} \sum_{k=1}^{n_o} \frac{\partial (so_k - tg_k)^2}{\partial wh_{ji}} \qquad (13)$$

$$= \frac{1}{n_o} \sum_{k=1}^{n_o} (so_k - tg_k) \frac{\partial so_k}{\partial r_k} \frac{\partial r_k}{\partial sh_j} \frac{\partial sh_j}{\partial wh_{ji}} \qquad (14)$$


Now the quantities appearing in the last equation are available, again from either the forward pass or theory:

• $(so_k - tg_k) \frac{\partial so_k}{\partial r_k} = \delta_k$

• $\frac{\partial r_k}{\partial sh_j} = wo_{kj}$

• $\frac{\partial sh_j}{\partial wh_{ji}} = \frac{\partial sh_j}{\partial r_j} \frac{\partial r_j}{\partial wh_{ji}} = \sigma'(r_j)\, x_i$


$$\frac{\partial E}{\partial wh_{ji}} = \frac{1}{n_o} \sum_{k=1}^{n_o} (so_k - tg_k) \frac{\partial so_k}{\partial r_k} \frac{\partial r_k}{\partial sh_j} \frac{\partial sh_j}{\partial wh_{ji}} \qquad (15)$$

$$= \frac{1}{n_o} \sum_{k=1}^{n_o} \left[ \delta_k\, wo_{kj} \right] \left[ \sigma'(r_j)\, x_i \right] \qquad (16)$$

Note that the summation here does not disappear.


We can further manipulate the expression, by first isolating the terms which do not depend on the summation index:

$$= \left[ \frac{1}{n_o} \sum_{k=1}^{n_o} \delta_k\, wo_{kj} \right] \sigma'(r_j)\, x_i \qquad (17)$$

and then identifying the generalized delta for the input-to-hidden weights:

$$\delta_j = \sigma'(r_j)\, \frac{1}{n_o} \sum_{k=1}^{n_o} \delta_k\, wo_{kj} \qquad (18)$$


Generalized delta rule

for the input-to-hidden weights:

$$\Delta wh_{ji} = -\eta\, \delta_j\, x_i \qquad (19)$$

amazingly similar in form to that for the hidden-to-output weights
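A sketch of one complete by-pattern step, combining the forward pass with Eqs. (11), (12), (18) and (19); sigmoid activations are assumed, so that $\sigma'(r) = s(1-s)$, and as in Eq. (12) the $1/n_o$ factor of Eq. (10) is folded into $\eta$:

    import numpy as np

    def backprop_step(x, tg, Wh, Wo, eta):
        """One by-pattern update of Wh (n_h, n_i+1) and Wo (n_o, n_h+1)."""
        sigmoid = lambda r: 1.0 / (1.0 + np.exp(-r))
        x1 = np.concatenate(([1.0], x))
        sh1 = np.concatenate(([1.0], sigmoid(Wh @ x1)))   # forward pass, with bias
        so = sigmoid(Wo @ sh1)

        n_o = len(so)
        delta_k = (so - tg) * so * (1.0 - so)             # Eq. (11)
        sh = sh1[1:]
        delta_j = sh * (1.0 - sh) * (Wo[:, 1:].T @ delta_k) / n_o  # Eq. (18)

        Wo -= eta * np.outer(delta_k, sh1)                # Eq. (12)
        Wh -= eta * np.outer(delta_j, x1)                 # Eq. (19)
        return Wh, Wo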


Important property of multi-layer networks

The layered network is the simplest possible connectivity that has the universal approximation property.

It should be large enough, or deep enough.


Generalization and overfitting

The number of weights needs to be high.

We must take care of controlling overfitting.


Overfitting

Is the situation where

• the empirical risk $\hat{R}$ is low

• but $|\hat{R} - R|$ is high

Symptom: while training we are happy, but then tests fail!

No generalization, due to too much specialization (learning the training set, not the classification rule).


Multi-layer perceptrons: not a good model for the brain?

There is some evidence that the brain uses sparse (localized) rather than dense (distributed) representations.

Probably both.


Deep neural networks


David Hubel and Torsten Wiesel


Hubel and Wiesel placed electrodes in animals' brains (visual cortex). They discovered the columnar organization of neurons.


Each layer in a cortical column extracts features from the input it receives from the previous layer.

These features are more and more abstract:

Edges – simple shapes – composite shapes – eyes, mouths, noses... grandmother (the "grandmother cell" hypothesis)


Learning features in neural networks

Internal representations in hidden layers.

Hierarchy requires many layers (deep networks).


Learning: Limits of multi-layer networks

Error back-propagation does not work well with very deep structures.

Vanishing gradient phenomenon: at each layer, the back-propagated components of the gradient become exponentially smaller.

To avoid the problem: use shallow networks (theoretically sufficient).


Example of a shallow architecture

Support vector machines


Representational advantage of depth

In the '80s and early '90s some works proved that certain logical functions, which can be implemented with a depth of k layers, require exponentially more units if reduced to k − 1 layers.

In the 2010s: dependent inputs (variables) need very deep networks.


How can we avoid training the whole network all at once?


Multi-level hierarchies of networks

Cascades of unsupervised layers, trained one after the other, plus a final classification layer.

The whole structure is finally trained with error back-propagation.


The idea is not new: Neocognitron

K. Fukushima, 1987


Unsupervised learning principles


Information Bottleneck


Techniques using the "information bottleneck" principle

Using statistics and entropy

• Coding theory

• Stochastic complexity and minimum description length

Using errors

• Autoencoders

• PCA

• Rate-distortion theory


Autoencoders

An autoencoder is a special case of a multi-layer perceptron characterized by two aspects:

1 Structure: number of units in the input layer = number of units in the output layer > number of hidden units

2 Learned task: an autoencoder is trained to approximate the identity function (= replicate its input at the output)

An autoencoder is not a classifier.
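A structural sketch (reusing the hypothetical `backprop_step` above): an autoencoder is trained like any MLP, except that the hidden layer is narrower and the target is the input itself:

    import numpy as np

    rng = np.random.default_rng(0)
    n_i, n_h = 8, 3                      # bottleneck: n_h < n_i = n_o
    Wh = rng.normal(scale=0.1, size=(n_h, n_i + 1))
    Wo = rng.normal(scale=0.1, size=(n_i, n_h + 1))

    X = rng.random((100, n_i))           # unlabeled data
    for _ in range(50):
        for x in X:
            Wh, Wo = backprop_step(x, tg=x, Wh=Wh, Wo=Wo, eta=0.5)  # target = input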


What is interesting is not the output value (it is just an approximation of the input) but the pattern present on the hidden layer.

Since we don't use any target (the target coincides with the input), the autoencoder task is unsupervised. It is sometimes termed "self-supervised".


Learned features from a set of images


Recognizing handwritten digits


Features for recognizing ’0’ from ’8’


Features for recognizing ’1’ from ’8’


An example of an autoencoder for learning features from symbolic data

Task: diagnose Lyme disease from patient records

Problem: many features (observed signs and symptoms) are binary and very sparse


An example of an autoencoder for learning features from symbolic data

Learning the features


An example of an autoencoder for learning features from symbolic data

Using the learned features


Principal component analysis

Is an instance of factor analysis: discover the few unobservable factors that give rise to observable (measurable) variables.


Example of a factor analysis problem: discover the abilities underlying performance in school tests.

Observed variables:
• Marks in algebra test
• Marks in geometry test
• Marks in literature test
• Marks in foreign language test
• Marks in music test
• Marks in essay

Hidden factors:
• Linguistic ability
• Spatial ability
• Symbolic processing ability


Principal Component Analysis or PCA

is a linear solution to the factor analysis problem.

Linear: factors are linear combinations of patterns:

$$v = \lambda_1 x_1 + \lambda_2 x_2 + \dots + \lambda_d x_d$$


PCA works on the covariance matrix of the data.

Covariance between input $x_i$ and input $x_j$:

$$\sigma_{i,j} = \sigma_{j,i} = E\{(x_i - \bar{x}_i)(x_j - \bar{x}_j)\}$$

with $E\{\}$ the expectation (or the mean over the training set) and $\bar{x}_i$ the mean of the i-th input.

$$\Sigma = \begin{pmatrix} \sigma_{1,1} & \sigma_{1,2} & \dots & \sigma_{1,d} \\ \sigma_{2,1} & \sigma_{2,2} & \dots & \sigma_{2,d} \\ \vdots & & \ddots & \vdots \\ \sigma_{d,1} & \sigma_{d,2} & \dots & \sigma_{d,d} \end{pmatrix}$$


Note: if $X$ is the training set as a matrix and all inputs have zero mean, i.e., $X \leftarrow X - \bar{X}$, then

$$\Sigma = X^T X$$

(up to a factor $1/n$). In Matlab: X = X - repmat(mean(X), size(X,1), 1)


Principal components

The "factors" in PCA are called principal components, and are given by the eigenvectors of $\Sigma$:

$$v_1, \dots, v_d$$

If we project pattern $x = [x_1, x_2, \dots, x_d]$ onto the component $v_i = [v_{i1}, v_{i2}, \dots, v_{id}]$, we obtain the value of the i-th factor, or component, or feature, for pattern x:

$$a_i = x \cdot v_i = \sum_{j=1}^{d} x_j v_{ij}$$

OK, components; but why "principal"?


Property

1 Eigenvectors of $\Sigma$ can be ordered by the corresponding eigenvalues, from largest to smallest

2 Eigenvectors are thus ordered by variance (or energy, or level of activity), from largest to smallest

3 Projection of the training set X onto the first r (principal) components gives the best rank-r approximation to X itself, when measured by mean square error

PCA is a form of lossy compression. The principal components are features useful to represent the data in a synthetic way (information bottleneck).
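A sketch of PCA in NumPy: eigendecomposition of the covariance matrix, components ordered by eigenvalue, projection, and the rank-r reconstruction:

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.random((200, 5))                 # training set, one pattern per row

    Xc = X - X.mean(axis=0)                  # zero-mean inputs
    Sigma = Xc.T @ Xc / len(Xc)              # covariance matrix

    eigvals, eigvecs = np.linalg.eigh(Sigma) # eigh: Sigma is symmetric
    order = np.argsort(eigvals)[::-1]        # order by eigenvalue, largest first
    V = eigvecs[:, order]                    # columns = principal components v_1..v_d

    r = 2
    A = Xc @ V[:, :r]                        # projections a_i onto the first r components
    X_hat = A @ V[:, :r].T + X.mean(axis=0)  # best rank-r approximation (in MSE)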


It has been proved that an autoencoder with linear activations learns the principal components.

This is because the objective is the mean squared reconstruction error of a lower-rank representation, the same as in PCA.


Oja’s neuron

A single-unit model with linear (identity) activation:

$$a = x \cdot w$$

Learning rule:

$$w \leftarrow w + \eta\, a\, (x - a\, w)$$
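A sketch of the rule on synthetic 2-d data: with a small η and a few passes, w approaches the principal eigenvector of the data covariance (up to sign):

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.multivariate_normal([0, 0], [[3.0, 1.0], [1.0, 1.0]], size=2000)

    w = rng.normal(size=2)
    eta = 0.01
    for _ in range(20):                    # a few passes over the data
        for x in X:
            a = x @ w                      # linear activation a = x . w
            w = w + eta * a * (x - a * w)  # Oja's learning rule

    print(w / np.linalg.norm(w))           # approx. the principal eigenvector (up to sign)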


It can be proven that, for small η, Oja's learning rule is a first-order Taylor approximation of the Rayleigh quotient iteration method for finding the principal eigenvector.

At convergence, w is the principal component (the principal eigenvector) of Σ.


Oja’s neuron is a neural principal component analyzer

Advantages over using explicit eigensolvers (e.g., the LAPACK eigensolvers, or Matlab's eig function):

1 Distributed

2 Online (big data!)

Disadvantages:

1 Stochastic (convergence in probability)

2 Slower, because of the requirement of a small η


Restricted Boltzmann Machines

A generative model, invented by G. Hinton. Started in the Eighties (Boltzmann machines), then developed in the following decades.


Boltzmann Machines:

• binary-valued units

• bi-directional connections

• symmetric weights (equal in the two directions)

• general topology (feedback possible)


The restricted version has the limitation that its topology must be a bipartite graph.

This makes it more tractable.


Energy

• $v = [v_i]$ and $h = [h_j]$: visible and hidden unit activation values, respectively

• $w_{i,j}$: weight between $v_i$ and $h_j$

• $a_i$ and $b_j$: biases of visible and hidden units, respectively

Then we can define an "energy":

$$E(v, h) = -\sum_i a_i v_i - \sum_j b_j h_j - \sum_i \sum_j v_i\, w_{i,j}\, h_j$$


Probability of states

The probability of any possible network state is

$$P(v, h) = \frac{1}{Z}\, e^{-E(v,h)}$$

with $Z$ the partition function (a normalizer).


Probability of states

Since intra-layer connections are absent, the probability of activation of one unit does not depend on that of the other units in the same layer, but only on the units in the other layer:

$$P(v_i = 1 \mid h) = \frac{1}{1 + e^{-(a_i + \sum_j w_{i,j} h_j)}}$$

$$P(h_j = 1 \mid v) = \frac{1}{1 + e^{-(b_j + \sum_i w_{i,j} v_i)}}$$


Training an RBM

Algorithm called contrastive divergence. It uses random sampling from the probabilities (computed as above):

• Apply one input v

• Compute the probability P(h|v); sample from it to generate a hidden configuration h

• Compute a positive update step $\Delta w^+ = v h^T$ (outer product)

• Generate one possible input v′ from the hidden configuration

• Compute the probability P(h′|v′) and sample h′

• Compute a negative update step $\Delta w^- = v' h'^T$

• Apply the update: $w \leftarrow w + \eta\, (\Delta w^+ - \Delta w^-)$

This does not optimize any explicit objective function!
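A sketch of one CD-1 update for a binary RBM (W is the n_v × n_h weight matrix, a and b the visible and hidden biases; using the probabilities rather than a sample for the negative hidden term is a common variant):

    import numpy as np

    rng = np.random.default_rng(0)
    sigmoid = lambda r: 1.0 / (1.0 + np.exp(-r))

    def cd1_step(v, W, a, b, eta):
        """One contrastive-divergence update for a binary RBM."""
        p_h = sigmoid(b + v @ W)                      # P(h_j = 1 | v)
        h = (rng.random(p_h.shape) < p_h) * 1.0       # sample hidden configuration
        dw_pos = np.outer(v, h)                       # positive step v h^T

        p_v = sigmoid(a + W @ h)                      # P(v_i = 1 | h)
        v_neg = (rng.random(p_v.shape) < p_v) * 1.0   # generate a possible input v'
        p_h_neg = sigmoid(b + v_neg @ W)              # P(h' | v')
        dw_neg = np.outer(v_neg, p_h_neg)             # negative step v' h'^T

        return W + eta * (dw_pos - dw_neg)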


Training RBMs of large size is not simple. There are tricks to make the task easier.

Example: weight sharing and convolutional neural networks. These help with data having correlated inputs, as in images, video, speech, and general time series.


Deep Belief Networks

• A DBN is a sequence of RBMs

• Each RBM can be trained independently of the following ones

• A greedy strategy

• The last layer can be a classifier


Deep networks can be built out of RBMs, but also out of autoencoders.

Autoencoders, however, are more sensitive to random noise.


Neural networks: why bother?

Deep learning has achieved success in very complex tasks and won many competitions. Example: extracting words from audio and turning them into automatic subtitles (cf. YouTube).


THE END
