



November 16, 2010 Neural Networks Lecture 17: Self-Organizing Maps

Page 1: About Assignment #3

Two approaches to backpropagation learning:

1. “Per-pattern” learning:

Update weights after every exemplar presentation.

2. “Per-epoch” (batch-mode) learning:

Update weights after every epoch. During the epoch, compute the sum of the required changes for each weight across all exemplars. After the epoch, update each weight using the respective sum.
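To make the distinction concrete, here is a minimal Python sketch of the two update modes. It is only an illustration: backprop_gradients(weights, exemplar) is a hypothetical helper assumed to return, for one exemplar, the error gradient for every weight.

import random

def train_per_pattern(weights, exemplars, eta):
    """Per-pattern learning: update the weights after every exemplar presentation."""
    random.shuffle(exemplars)                  # present the exemplars in random order
    for exemplar in exemplars:
        grads = backprop_gradients(weights, exemplar)   # hypothetical helper
        for key in weights:
            weights[key] -= eta * grads[key]   # immediate update

def train_per_epoch(weights, exemplars, eta):
    """Per-epoch (batch-mode) learning: accumulate changes, update once per epoch."""
    total = {key: 0.0 for key in weights}
    for exemplar in exemplars:
        grads = backprop_gradients(weights, exemplar)   # hypothetical helper
        for key in weights:
            total[key] += grads[key]           # sum the required changes per weight
    for key in weights:
        weights[key] -= eta * total[key]       # single update after the epoch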

Page 2: About Assignment #3

Per-pattern learning often approaches near-optimal network error quickly, but may then take longer to reach the error minimum.

During per-pattern learning, it is important to present the exemplars in random order.

Reducing the learning rate between epochs usually leads to better results.
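One simple way to implement such a decay in Python is to shrink the learning rate by a constant factor after every epoch; the values below are arbitrary illustrative choices, not prescribed by the assignment.

eta0, decay = 0.5, 0.95
for epoch in range(100):
    eta = eta0 * decay ** epoch
    # ... run one epoch of per-pattern or per-epoch training with the current eta ...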

Page 3: About Assignment #3

Per-epoch learning involves less frequent weight updates, which makes the initial approach to the error minimum rather slow.

However, per-epoch learning computes the actual network error and its gradient for each weight, so that the network can make more informed decisions about weight updates.

Two of the most effective algorithms that exploit this information are Quickprop and Rprop.

Page 4: The Quickprop Learning Algorithm

The assumption underlying Quickprop is that the network error as a function of each individual weight can be approximated by a paraboloid.

Based on this assumption, whenever we find that the gradient for a given weight switched its sign between successive epochs, we should fit a paraboloid through these data points and use its minimum as the next weight value.

Page 5: The Quickprop Learning Algorithm

Illustration (sorry for the crummy paraboloid):

[Figure: the error E plotted against a single weight w. The assumed error function (a paraboloid) passes through the two points (w(t-1), E(t-1)) and (w(t), E(t)) with slopes E'(t-1) and E'(t); its minimum determines the next weight value w(t+1).]

Page 6: The Quickprop Learning Algorithm

Newton’s method: assume that, for each weight, E(w) = a·w² + b·w + c.

The gradients at the current and the previous epoch are then:

E'(t) = ∂E(t)/∂w = 2a·w(t) + b

E'(t-1) = ∂E(t-1)/∂w = 2a·w(t-1) + b

Solving these two equations for the coefficients of the parabola gives:

a = (E'(t) - E'(t-1)) / (2·(w(t) - w(t-1)))

b = E'(t) - (E'(t) - E'(t-1))·w(t) / (w(t) - w(t-1))

Page 7: The Quickprop Learning Algorithm

For the minimum of E we must have:

∂E(t+1)/∂w = 2a·w(t+1) + b = 0

w(t+1) = -b / (2a)

Substituting a and b from the previous slide:

w(t+1) = ((E'(t) - E'(t-1))·w(t) - E'(t)·(w(t) - w(t-1))) / (E'(t) - E'(t-1))

w(t+1) = w(t) + E'(t)·(w(t) - w(t-1)) / (E'(t-1) - E'(t))
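In code, the resulting update for a single weight might look like the following Python sketch; the fallback branch and its learning rate 0.1 are illustrative additions for the degenerate case discussed on the next slide, not part of the derivation above.

def quickprop_step(w_t, w_prev, dE_t, dE_prev, eps=1e-12):
    """One Quickprop step: jump to the minimum of the fitted parabola."""
    denom = dE_prev - dE_t
    if abs(denom) < eps:             # parabola degenerates; see the caveat on the next slide
        return w_t - 0.1 * dE_t      # fall back to a plain gradient step (illustrative)
    return w_t + (w_t - w_prev) * dE_t / denom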

Page 8: The Quickprop Learning Algorithm

Notice that this method cannot be applied if the error gradient has neither decreased in magnitude nor changed its sign at the preceding time step.

In that case, we would ascend in the error function or make an infinitely large weight modification.

In most cases, Quickprop converges several times faster than standard backpropagation learning.

Page 9: Resilient Backpropagation (Rprop)

The Rprop algorithm takes a very different approach to improving backpropagation as compared to Quickprop.

Instead of making more use of gradient information for better weight updates, Rprop only uses the sign of the gradient, because its size can be a poor and noisy estimator of the required weight updates.

Furthermore, Rprop assumes that different weights need different step sizes for updates, which vary throughout the learning process.

Page 10: Resilient Backpropagation (Rprop)

The basic idea is that if the error gradient for a given weight w_ij had the same sign in two consecutive epochs, we increase its step size Δ_ij, because the weight’s optimal value may be far away.

If, on the other hand, the sign switched, we decrease the step size.

Weights are always changed by adding or subtracting the current step size, regardless of the absolute value of the gradient.

This way we do not “get stuck” with extreme weights that are hard to change because of the shallow slope in the sigmoid function.

Page 11: Resilient Backpropagation (Rprop)

Formally, the step-size update rules are:

Δ_ij(t) = η+ · Δ_ij(t-1),  if ∂E(t-1)/∂w_ij · ∂E(t)/∂w_ij > 0

Δ_ij(t) = η− · Δ_ij(t-1),  if ∂E(t-1)/∂w_ij · ∂E(t)/∂w_ij < 0

Δ_ij(t) = Δ_ij(t-1),  otherwise

Empirically, the best results were obtained with initial step sizes of 0.1, η+ = 1.2, η− = 0.5, Δmax = 50, and Δmin = 10^-6 (the step sizes are kept within the range [Δmin, Δmax]).

Page 12: Resilient Backpropagation (Rprop)

Weight updates are then performed as follows:

Δw_ij(t) = -Δ_ij(t),  if ∂E(t)/∂w_ij > 0

Δw_ij(t) = +Δ_ij(t),  if ∂E(t)/∂w_ij < 0

Δw_ij(t) = 0,  otherwise

It is important to remember that, like in Quickprop, in Rprop the gradient needs to be computed across all samples (per-epoch learning).
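Putting the two rules together, a per-epoch Rprop update for a whole weight matrix might look like the following NumPy sketch. It assumes grad is the error gradient accumulated over all exemplars, that step and prev_grad are carried over from the previous epoch, and it follows the parameter values quoted above; it is an illustration, not a reference implementation.

import numpy as np

def rprop_update(W, grad, prev_grad, step,
                 eta_plus=1.2, eta_minus=0.5, step_min=1e-6, step_max=50.0):
    """One per-epoch Rprop update, following the step-size and weight-update rules above."""
    sign_change = grad * prev_grad
    step = np.where(sign_change > 0, np.minimum(step * eta_plus, step_max), step)
    step = np.where(sign_change < 0, np.maximum(step * eta_minus, step_min), step)
    W = W - np.sign(grad) * step       # add or subtract the current step size only
    return W, step, grad               # grad becomes prev_grad for the next epoch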

Page 13: Resilient Backpropagation (Rprop)

The performance of Rprop is comparable to Quickprop; it also considerably accelerates backpropagation learning.

Compared to both the standard backpropagation algorithm and Quickprop, Rprop has one advantage:

Rprop does not require the user to estimate or empirically determine a step size parameter and its change over time.

Rprop will determine appropriate step size values by itself and can thus be applied “as is” to a variety of problems without significant loss of efficiency.

Page 14: The Counterpropagation Network

Let us look at the CPN structure again.

How can this network determine its hidden-layer winner unit?

[Figure: CPN with input layer (X1, X2), hidden layer (H1, H2, H3), and output layer (Y1, Y2); input-to-hidden weights w^H and hidden-to-output weights w^O are shown, plus additional connections within the hidden layer.]

Page 15: The Solution: Maxnet

A maxnet is a recurrent, one-layer network that uses competition to determine which of its nodes has the greatest initial input value.

All pairs of nodes have inhibitory connections with the same weight -ε, where typically ε ≤ 1/(# nodes).

In addition, each node has a self-excitatory connection to itself, whose weight θ is typically 1.

The nodes update their net input and their output by the following equations:

net_i = Σ_j w_ij · x_j  (with w_ii = θ and w_ij = -ε for i ≠ j)

f(net) = max(0, net)
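A minimal NumPy sketch of this update loop, returning the index of the winning node, could look as follows; the stopping criterion (at most one active node) anticipates the behavior described on the next slide.

import numpy as np

def maxnet(x, theta=1.0, epsilon=0.2, max_iters=100):
    """Iterate the maxnet until (at most) one node remains active; return its index."""
    a = np.array(x, dtype=float)
    for _ in range(max_iters):
        net = theta * a - epsilon * (a.sum() - a)   # self-excitation minus mutual inhibition
        a = np.maximum(net, 0.0)                    # f(net) = max(0, net)
        if np.count_nonzero(a) <= 1:
            break
    return int(np.argmax(a))

With the values from the example two slides ahead, maxnet([0.5, 0.9, 1.0, 0.9, 0.9]) returns 2, i.e., the third neuron wins after three iterations.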

Page 16: Maxnet

All nodes update their output simultaneously.

With each iteration, the neurons’ activations will decrease until only one neuron remains active.

This is the “winner” neuron that had the greatest initial input.

Maxnet is a biologically plausible implementation of a maximum-finding function.

In parallel hardware, it can be more efficient than a corresponding serial function.

We can add maxnet connections to the hidden layer of a CPN to find the winner neuron.

Page 17: Maxnet Example

Example of a maxnet with five neurons, θ = 1, and ε = 0.2:

Activations over time (each column is one neuron):

initial:       0.5   0.9   1     0.9   0.9
after step 1:  0     0.24  0.36  0.24  0.24
after step 2:  0     0.07  0.22  0.07  0.07
after step 3:  0     0     0.17  0     0      <- Winner!

Page 18: Self-Organizing Maps (Kohonen Maps)

As you may remember, the counterpropagation network employs a combination of supervised and unsupervised learning.

We will now study Self-Organizing Maps (SOMs) as examples of completely unsupervised learning (Kohonen, 1980).

This type of artificial neural network is particularly similar to biological systems (as far as we understand them).

Page 19: Self-Organizing Maps (Kohonen Maps)

In the human cortex, multi-dimensional sensory input spaces (e.g., visual input, tactile input) are represented by two-dimensional maps.

The projection from sensory inputs onto such maps is topology conserving.

This means that neighboring areas in these maps represent neighboring areas in the sensory input space.

For example, neighboring areas in the sensory cortex are responsible for the arm and hand regions.

Page 20: Self-Organizing Maps (Kohonen Maps)

Such topology-conserving mapping can be achieved by SOMs:

• Two layers: input layer and output (map) layer

• Input and output layers are completely connected.

• Output neurons are interconnected within a defined neighborhood.

• A topology (neighborhood relation) is defined on the output layer.

Page 21: Self-Organizing Maps (Kohonen Maps)

Network structure:

[Figure: input vector x = (x1, x2, ..., xn) is fully connected to the output (map) neurons O1, O2, O3, ..., Om, which produce the output vector o.]

Page 22: Self-Organizing Maps (Kohonen Maps)

Common output-layer structures:

• One-dimensional (completely interconnected)

• Two-dimensional (connections omitted, only neighborhood relations shown [green])

[Figure: both layouts highlight the neighborhood of neuron i.]

Page 23: Self-Organizing Maps (Kohonen Maps)

A neighborhood function φ(i, k) indicates how closely neurons i and k in the output layer are connected to each other.

Usually, a Gaussian function on the distance between the two neurons in the layer is used, for example:

φ(i, k) = exp( -||p_i - p_k||² / (2σ²) )

where p_i and p_k are the positions of neurons i and k in the output layer, and σ controls the neighborhood width.

Page 24: Unsupervised Learning in SOMs

For n-dimensional input space and m output neurons:

(1) Choose a random weight vector w_i for each neuron i, i = 1, ..., m

(2) Choose a random input x

(3) Determine the winner neuron k: ||w_k – x|| = min_i ||w_i – x|| (Euclidean distance)

(4) Update the weight vectors of all neurons i in the neighborhood of neuron k: w_i := w_i + η·φ(i, k)·(x – w_i)  (w_i is shifted towards x)

(5) If the convergence criterion is met, STOP. Otherwise, narrow the neighborhood function, decrease the learning parameter η, and go to (2).
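As a concrete illustration, here is a minimal NumPy sketch of this training loop for a one-dimensional map. It assumes data is a NumPy array of shape (num_samples, n); the map layout, the decay factors, and the parameter values are illustrative choices, not part of the algorithm above.

import numpy as np

def train_som(data, m, epochs=100, eta=0.5, sigma=2.0, seed=0):
    """Train a one-dimensional SOM with m map neurons on n-dimensional data."""
    rng = np.random.default_rng(seed)
    n = data.shape[1]
    w = rng.random((m, n))                         # (1) random weight vectors
    pos = np.arange(m, dtype=float)                # neuron positions on the map
    for _ in range(epochs):
        for x in data[rng.permutation(len(data))]:             # (2) random input
            k = np.argmin(np.linalg.norm(w - x, axis=1))       # (3) winner neuron
            phi = np.exp(-(pos - pos[k]) ** 2 / (2 * sigma ** 2))   # Gaussian neighborhood
            w += eta * phi[:, None] * (x - w)      # (4) shift weights towards x
        eta *= 0.99                                # (5) decrease the learning parameter
        sigma *= 0.99                              #     and narrow the neighborhood
    return w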