
Training Data


Page 1: Training Data

Training Data

Concept Map: Practical Design Issues

Topology, Initial Weights, Learning Algorithm

• Fast Learning
• Network Size: Occam's Razor; Network Growing; Network Pruning (Brain Damage, Weight Decay)
• Generalization: Cross-validation & Early stopping; Noise; Weight sharing; Small size; Increase Training Data

Page 2: Training Data

Fast Learning — Concept Map

• Training Data: Normalize, Scale, Present at Random
• Cost Function
• Activation Function: Adaptive slope
• Architecture: Modular, Committee
• BP variants: No weight learning for correctly classified patterns; η (step size); Chen & Mars; Momentum; Fahlman's; Other
• Minimization Method: Conjugate Gradient

Page 3: Training Data

1. Practical Issues

Performance = f (training data, topology, initial weights, learning algorithm, …)
 = Training Error, Net Size, Generalization.

(1) How to prepare training data and test data?

- The training set must contain enough information to learn the task.
- Eliminate redundancy, e.g. by data clustering.
- Training set size: N > W / ε
 (N = number of training samples, W = number of weights,
 ε = classification error permitted on test data ≈ generalization error)

Chapter 4. Designing & Training MLPs

Page 4: Training Data

Ex. Modes of Preparing Training Data for Robot Control

The importance of the training data for tracking performance cannot be overemphasized. Three modes of training data selection are considered here. In the regular mode, the training data are obtained by tessellating the robot's workspace and taking the grid points, as shown on the next page. For better generalization, however, a sufficient amount of random training data might be obtained by observing the light positions in response to uniformly random Cartesian commands to the robot; this is the random mode. The best generalization power is achieved by the semi-random mode, which evenly tessellates the workspace into many cubes and chooses a randomly selected training point within each cube. This mode is essentially a blend of the regular and random modes.
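The three acquisition modes can be sketched as follows — a minimal NumPy illustration for a cube-shaped workspace normalized to [0, 1]³; the function names and the grid resolution are mine, not from the notes:

```python
import numpy as np

def regular_mode(n_per_axis, lo=0.0, hi=1.0):
    """Regular mode: tessellate the workspace and take the grid points."""
    axis = np.linspace(lo, hi, n_per_axis)
    gx, gy, gz = np.meshgrid(axis, axis, axis)
    return np.stack([gx.ravel(), gy.ravel(), gz.ravel()], axis=1)

def random_mode(n, lo=0.0, hi=1.0, rng=None):
    """Random mode: points drawn uniformly over the whole workspace."""
    rng = np.random.default_rng() if rng is None else rng
    return rng.uniform(lo, hi, size=(n, 3))

def semi_random_mode(n_per_axis, lo=0.0, hi=1.0, rng=None):
    """Semi-random mode: one uniformly random point inside each cube
    of the tessellation (a blend of the two modes above)."""
    rng = np.random.default_rng() if rng is None else rng
    cell = (hi - lo) / n_per_axis
    idx = np.indices((n_per_axis,) * 3).reshape(3, -1).T  # cube indices
    corners = lo + idx * cell                             # lower corner of each cube
    return corners + rng.uniform(0.0, cell, size=corners.shape)
```

All three return an (N, 3) array of Cartesian training commands; only the spatial distribution differs.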

Page 5: Training Data

[Figure: Training Data Acquisition modes — regular mode, random mode, semi-random mode.]

Page 6: Training Data

Fig. 10. Comparison of training errors and generalization errors for random and semi-random training methods. [Two plots of RMS error (mm, 0-50) vs. iteration (0-400), each with Random and Semi-Random curves: (a) Training error, (b) Test error.]

Page 7: Training Data

(2) Optimal Implementation

A. Network Size

Occam's Razor:

Any learning machine should be sufficiently large to solve a given problem, but not larger. A scientific model should favor simplicity, i.e. shave off the fat in the model. [Occam = 14th-century English friar]

Page 8: Training Data

a. Network Growing: start with a few hidden nodes and add more. (Ref. Kim, "Modified Error BP Adding Neurons to Hidden Layer," J. of KIEE 92/4.) If E > θ1 and ΔE < θ2 (the error is still large but no longer decreasing), add a hidden node. Use the current weights for the existing connections, and small random values for the newly added weights, as the initial weights for the next round of learning.

b. Network Pruning
① Remove unimportant connections. After this "brain damage," retrain the network. Improves generalization.
② Weight decay: after each epoch, shrink every weight toward zero, w' = (1 − ε) w.

c. Size reduction by dimensionality reduction or sparse connectivity in the input layer [e.g. use 4 random connections instead of 8].

[Figure: error E vs. number of epochs.]
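A minimal sketch of the pruning and weight-decay updates described above. Magnitude-based pruning is a simplification of "brain damage" (which properly ranks connections by saliency); `eta`, `eps`, and `threshold` are illustrative values, not from the notes:

```python
import numpy as np

def weight_decay_step(w, grad, eta=0.1, eps=1e-3):
    """One epoch-level update: gradient step, then shrink every
    weight toward zero: w' = (1 - eps) * w."""
    w = w - eta * grad
    return (1.0 - eps) * w

def prune_small_weights(w, threshold=1e-2):
    """Simplest form of pruning: zero out connections whose weight
    magnitude stays below the threshold. The network is then
    retrained with the surviving connections."""
    mask = np.abs(w) >= threshold
    return w * mask, mask
```

Weight decay drives unneeded weights toward zero, so the two mechanisms combine naturally: decay during training, prune afterwards, retrain.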

Page 9: Training Data
Page 10: Training Data

B. Generalization: Train (memorize), then apply to an actual problem (generalize).

[Figure: decision boundaries over training (X) and test (O) samples.
T: Training Data
X: Test Data
R: NN with Good Generalization
R': NN with Poor Generalization
Good: fits train (X) and test (O). Poor: fits train (X) but not test (O) — overfitting, i.e. the boundary bends around noisy training samples when the net has too many weights for the available data.]

Page 11: Training Data

[Figure: the Training Set is split into a Learning Subset and a Validation Subset, kept separate from the Test Set. Plot of mean-square error vs. number of epochs: the training-sample error keeps decreasing, while the validation-sample error turns upward at the early-stopping point.]

For good generalization, train with the Learning Subset and check on the Validation Subset [about 10% of the data, checked every 5-10 iterations]; determine the best structure from the validation error. Then train further with the full Training Set and evaluate on the Test Set. The statistics of the training (and validation) data must be similar to those of the test (actual problem) data.

Tradeoff between training error and generalization!

Stopping Criterion — Classification: stop upon no error.
Function Approximation: check E and ΔE.
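The validation-based stopping procedure can be sketched as a generic loop. The `step`/`val_error` callbacks and the `patience` parameter are my own framing of the slide's recipe, not names from the notes:

```python
def train_with_early_stopping(step, val_error, max_epochs=1000, patience=10):
    """Generic early-stopping loop. `step()` performs one epoch of
    training on the learning subset; `val_error()` returns the current
    mean-square error on the validation subset. Training stops when
    the validation error has not improved for `patience` checks."""
    best, best_epoch, waited = float("inf"), 0, 0
    for epoch in range(max_epochs):
        step()
        e = val_error()
        if e < best:
            best, best_epoch, waited = e, epoch, 0  # new best: reset patience
        else:
            waited += 1
            if waited >= patience:
                break  # validation error turned upward: stop
    return best, best_epoch
```

In practice one would also snapshot the weights at each new best, and restore them at the stopping point.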

Page 12: Training Data

An Example showing how to prepare the various data sets to learn an unknown function from data samples

Page 13: Training Data

Other measures to improve generalization:

• Add noise (1-5%) to the training data or weights.
• Hard (soft) weight sharing (using equal values for groups of weights) can improve generalization.
• For fixed training data, the smaller the net, the better the generalization.
• Increase the training set to improve generalization.
• For insufficient training data, use the leave-one(-or-some)-out method: select an example, train the net without it, and evaluate on the held-out example.
• If the net still does not generalize well, retrain with the new problem data.
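The noise-injection measure from the list above, sketched for inputs. The 1-5% figure is interpreted here as a fraction of each feature's standard deviation, which is one reasonable reading of the slide:

```python
import numpy as np

def jitter(X, noise_pct=0.03, rng=None):
    """Add zero-mean Gaussian noise, scaled to a small percentage of
    each feature's standard deviation, to the training inputs. Each
    epoch then sees slightly different data, acting as a regularizer."""
    rng = np.random.default_rng(0) if rng is None else rng
    sigma = noise_pct * X.std(axis=0, keepdims=True)
    return X + rng.normal(0.0, 1.0, X.shape) * sigma
```

The same idea applies to weights (perturb before each batch) or to desired outputs.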

C. Speeding Up [Accelerating] Convergence

- Ref. the book by Hertz et al.; AI Expert Magazine 91/7

To speed up the calculation itself: reduce the number of floating-point operations by using fixed-point arithmetic, and use a piecewise-linear approximation for the sigmoid.
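A piecewise-linear sigmoid approximation in the spirit of the last remark — a crude three-segment version for illustration; hardware implementations typically use more segments and fixed-point arithmetic:

```python
def pl_sigmoid(s):
    """Three-segment piecewise-linear stand-in for the logistic
    sigmoid: clamp to 0/1 outside |s| > 4, interpolate linearly in
    between. Continuous at the break points (0 at s=-4, 1 at s=4)."""
    if s <= -4.0:
        return 0.0
    if s >= 4.0:
        return 1.0
    return 0.5 + s / 8.0
```

It avoids the exponential entirely, at the cost of a cruder derivative; more segments narrow the gap to the true sigmoid.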

Page 14: Training Data

Students' Questions from 2005

What will happen if more than 5-10% validation data are used?

Consider two industrial assembly robots for precision jobs, made by the same company to an identical spec. If the same NN is used for both, the robots will still act differently. Do we need better generalization methods to compensate for this difference?

A large N may bring in more noisy data. However, wouldn't a large N offset the problem by yielding more reliability? How big an influence would noise have upon misguiding the learning?

I wonder what measures can prevent local-minimum traps.

Page 15: Training Data

Is there any mathematical validation for the existence of a stopping point on the validation-error curve?

The number of hidden nodes is adjusted by a human. An NN is supposed to self-learn, so there should be a way to adjust the number of hidden nodes automatically.

Page 16: Training Data
Page 17: Training Data

① Normalize inputs, scale outputs: zero mean, decorrelate (PCA), and equalize the covariance.
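Step ① can be sketched with an eigendecomposition of the input covariance — a standard PCA-whitening recipe, not code from the notes:

```python
import numpy as np

def whiten(X, eps=1e-8):
    """Remove the mean, decorrelate with PCA, and equalize the
    covariance (unit variance along every principal axis)."""
    Xc = X - X.mean(axis=0)                 # zero mean
    cov = np.cov(Xc, rowvar=False)
    evals, evecs = np.linalg.eigh(cov)      # PCA: eigendecomposition
    Z = Xc @ evecs                          # decorrelate
    return Z / np.sqrt(evals + eps)         # equalize covariance
```

After this transform the input covariance is (approximately) the identity, which tends to make the error surface rounder and BP faster.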

Page 18: Training Data

② Start with small uniform random initial weights [for tanh units]:
w(0) ∈ [−r, r], with r shrinking as the fan-in grows, e.g. r = 2.4 / fan-in.

③ Present training patterns in random (shuffled) order (or mix the different classes).

④ Alternative cost or activation functions.

Ex. Cost: E = Σk (dk − yk)² vs. E = Σk |dk − yk|^r.

Activation: f(s) = 1.716 tanh(2s/3), used with ±1 as targets (this f satisfies f(±1) = ±1, with maximum curvature near s = ±1), or with targets pulled slightly inside the saturation values.
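The fan-in-scaled initialization of step ② as a small helper. The 2.4/fan-in range is the rule quoted in the notes; applying it uniformly to a whole weight matrix is my assumption:

```python
import numpy as np

def init_weights(fan_in, fan_out, rng=None):
    """Small uniform random initial weights for tanh units, with the
    range shrinking as fan-in grows: w(0) in [-r, r], r = 2.4/fan_in."""
    rng = np.random.default_rng() if rng is None else rng
    r = 2.4 / fan_in
    return rng.uniform(-r, r, size=(fan_in, fan_out))
```

The scaling keeps each unit's net input s = Σ w x of order one, so the tanh units start in their linear (non-saturated) region.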

Page 19: Training Data

⑤ Fahlman's bias to ensure nonzero updates: replace f' by (f' + 0.1) in the delta term, (tk − yk)(f' + 0.1) — for output units only, or for all units. A related trick, for output units only: drop f' altogether.

⑥ Chen & Mars differential step size: use different step sizes for the inner and outer layers, e.g. η_outer = 0.1 η_inner.

⑦ Omit redundant learning ("Accelerating BP Algorithm through Omitting Redundant Learning," J. of KIEE 92/9): if Ep < θ, do not update the weights on the p-th training pattern — no BP for that pattern. Cf. Principe's book recommends a value around 2-5 for this threshold; best to try different values.

Page 20: Training Data

⑧ Ahalt — modular net: split the task over several MLPs (e.g. MLP1 → y1, MLP2 → y2, both fed the same input x), and vary η per module.

⑨ Ahalt — adapt the slope (sharpness) parameter: with f(s) = 1 / (1 + e^(−λs)), update λ by gradient descent on J (using ∂J/∂λ), just as the weights are updated with ∂J/∂w.

⑩ Plaut rule: scale each learning rate by the fan-in of its destination unit, η_pq ∝ 1 / fan-in.

Page 21: Training Data

⑪ Jacobs — Learning Rate Adaptation
[Ref. Neural Networks, Vol. 1, No. 4, 1988]

Reason for slow convergence: a fixed, global η is either too small on plateaus or too large in ravines. Two remedies:

a. Momentum:
Δw(t) = −η ∂J/∂w(t) + α Δw(t−1), 0 ≤ α < 1,
which unrolls to Δw(t) = −η Σ_{i≥0} α^i ∂J/∂w(t−i).
On a plateau the gradient is nearly constant, so Δw → −(η / (1−α)) ∂J/∂w, where η / (1−α) is the effective learning rate.

[Figure: trajectory on the error surface without momentum (zig-zag) vs. with momentum (smoothed).]
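The momentum update and its plateau behavior can be checked numerically — a sketch with illustrative η and α values:

```python
def momentum_step(w, grad, velocity, eta=0.1, alpha=0.9):
    """One momentum update: Δw(t) = -η ∇J + α Δw(t-1). On a plateau,
    where the gradient is roughly constant, the step settles to
    -η/(1-α) ∇J, i.e. an effective learning rate of η/(1-α)."""
    velocity = -eta * grad + alpha * velocity
    return w + velocity, velocity
```

With η = 0.1 and α = 0.9 the effective plateau rate is 0.1 / (1 − 0.9) = 1.0, a tenfold speed-up without destabilizing the ravine directions (where gradients alternate and the momentum terms cancel).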

Page 22: Training Data

b. Delta-bar-delta rule: give every weight its own learning rate,
w_i(t+1) = w_i(t) − η_i(t) ∂J/∂w_i(t),
with δ_i(t) = ∂J/∂w_i(t) and the running average
δ̄_i(t) = (1 − θ) δ_i(t) + θ δ̄_i(t−1).

Adapt each rate by
Δη_i(t) = κ if δ̄_i(t−1) δ_i(t) > 0
Δη_i(t) = −φ η_i(t) if δ̄_i(t−1) δ_i(t) < 0
Δη_i(t) = 0 otherwise.

For actual parameters to be used, consult Jacobs' paper and also "Getting a fast break with Backprop," Tveter, AI Expert Magazine (excerpt in the PDF files that I provided).
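Jacobs' per-weight rule above can be implemented compactly. The κ, φ, θ values here are illustrative, not Jacobs' recommendations (consult his paper as noted):

```python
import numpy as np

class DeltaBarDelta:
    """Delta-bar-delta: each weight gets its own learning rate,
    increased additively (kappa) when the current gradient agrees
    in sign with the exponential average of past gradients, and
    decreased multiplicatively (phi) when they disagree."""
    def __init__(self, n, eta0=0.1, kappa=0.01, phi=0.1, theta=0.7):
        self.eta = np.full(n, eta0)
        self.bar = np.zeros(n)  # delta-bar: averaged past gradient
        self.kappa, self.phi, self.theta = kappa, phi, theta

    def step(self, w, grad):
        agree = self.bar * grad
        self.eta = np.where(agree > 0, self.eta + self.kappa, self.eta)
        self.eta = np.where(agree < 0, self.eta * (1 - self.phi), self.eta)
        self.bar = (1 - self.theta) * grad + self.theta * self.bar
        return w - self.eta * grad
```

The additive increase / multiplicative decrease asymmetry lets rates grow cautiously but collapse quickly when the error surface curves.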

Page 23: Training Data

Students’ Questions from 2005

Is there any way to design a spherical error surface for faster convergence ?

Momentum provides inertia to jump over a small peak.

Parameter optimization techniques seem to be a good help in NN design.

I am afraid that optimizing even the sigmoid slope and the learning rate may expedite overfitting.

In what aspect is it more manageable to remove the mean, decorrelate, etc. ?

How does using a bigger learning rate for the output layer help learning ?

Does the solution always converge if we use the gradient descent ?

Page 24: Training Data

Are there any shortcomings in using fast learning algorithms?

In Ahalt's modular net, is it faster than an MLP for a single output only, or for all the outputs?

Various fast learning methods have been proposed. Which is the best one? Is it problem-dependent?

The Jacobs method cannot find the global minimum for an error surface like the one sketched on the slide.

Page 25: Training Data

⑫ Conjugate Gradient: Fletcher & Reeves

Update the weights along a search direction s(n) from w(n):
w(n+1) = w(n) + η(n) s(n),
with η(n) found by a line search: min_η E[w(n) + η s(n)].

• If η is fixed and s(n) = −g(n) ≡ −∇E[w(n)]: Gradient Descent.
• If s(n+1) = −g(n+1) and η is set by the line search min_η E[w(n) − η g(n)]: Steepest Descent. Setting the derivative with respect to η to zero at the minimum gives ∇E[w(n+1)]ᵀ g(n) = 0, i.e. g(n+1) ⊥ g(n) — successive directions are orthogonal, so the path toward w(n+1) zig-zags.

Page 26: Training Data

[Figure: trajectories of Gradient Descent, Steepest Descent, and Conjugate Gradient on the contours of E[w(n)]. Steepest Descent = Gradient Descent + line search: consecutive steps w(n) → w(n+1) → w(n+2) along −∇E are orthogonal. Conjugate Gradient = Steepest Descent + a momentum-like reuse of the old direction: s(n+1) = −∇E[w(n+1)] + β s(n).]

Page 27: Training Data

1) Line search along s(n); then set the new direction
s(n+1) = −g(n+1) + β s(n).

2) Choose β such that s(n+1) and s(n) are conjugate:
s(n+1)ᵀ H s(n) = 0, where H is the Hessian, H_ij = ∂²E / ∂w_i ∂w_j
(for the quadratic model E(w) ≈ E(w₀) + ∇E(w₀)ᵀ(w − w₀) + ½ (w − w₀)ᵀ H (w − w₀)).

Polak-Ribiere rule:
β = (g(n+1) − g(n))ᵀ g(n+1) / ‖g(n)‖².

From the line search, min_η E[w(n) + η s(n)] implies g(n+1)ᵀ s(n) = 0, and with this choice of β the conjugacy condition s(n+1)ᵀ H s(n) = 0 holds.

Page 28: Training Data

Flowchart:

START → Initialize: s(0) = −g(0) = −∇E[w(0)].
Repeat:
1. Line search: η* = argmin_η E(w(n) + η s(n)).
2. w(n+1) = w(n) + η* s(n).
3. g(n+1) = ∇E[w(n+1)].
4. s(n+1) = −g(n+1) + β s(n).
If ‖∇E(w(n))‖ < θ₁: End.
If ‖∇E(w(n))‖ > θ₂ or n > n_max: restart the direction from the current gradient.

Page 29: Training Data

Comparison of SD and CG

Steepest Descent: orthogonal, zig-zagging steps. Conjugate Gradient: each step takes a line search; for N-variable quadratic functions it converges in at most N steps.

Recommended: Steepest Descent + n steps of Conjugate Gradient + Steepest Descent + n steps of Conjugate Gradient + …
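On a quadratic error surface the whole procedure, including the exact line search, has a closed form, which makes the at-most-N-steps claim easy to verify. A sketch using the Polak-Ribiere β from the earlier slide:

```python
import numpy as np

def conjugate_gradient(A, b, w, n_steps):
    """Minimize the quadratic E(w) = 0.5 w^T A w - b^T w (A symmetric
    positive definite) with Polak-Ribiere conjugate gradient. On a
    quadratic the exact line search has the closed form
    eta = -g^T s / (s^T A s)."""
    g = A @ w - b          # gradient of E at w
    s = -g                 # first direction: steepest descent
    for _ in range(n_steps):
        eta = -(g @ s) / (s @ (A @ s))        # exact line search
        w = w + eta * s
        g_new = A @ w - b
        beta = (g_new - g) @ g_new / (g @ g)  # Polak-Ribiere rule
        s = -g_new + beta * s
        g = g_new
    return w
```

For a general (non-quadratic) error surface, the line search becomes an inner 1-D minimization and periodic restarts from the gradient are used, as in the flowchart.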

Page 30: Training Data

X. Swarm Intelligence

What is "swarm intelligence" and why is it interesting?

Two kinds of swarm intelligence: particle swarm optimization, ant colony optimization

Some applications

Discussion

Page 31: Training Data

What is “Swarm intelligence”?

“Swarm Intelligence is a property of systems of non-intelligent agents exhibiting collectively intelligent behavior.”

Characteristics of a swarm: distributed, with no central control or data source; no (explicit) model of the environment; perception of the environment; ability to change the environment.

I can’t do…

We can do…

Page 32: Training Data

A group of friends, each with a metal detector, is on a treasure-hunting mission. Each can communicate the detector signal and current position to the n nearest neighbors. If your neighbor is closer to the treasure than you are, you can move toward that neighbor, thereby improving your own chance of finding the treasure. The treasure may also be found more quickly than if you were searching alone.

Individuals in a swarm interact to solve a global objective more efficiently than a single individual could. A swarm is defined as a structured collection of interacting organisms [ants, bees, wasps, termites, fish in schools and birds in flocks] or agents. Within the swarm, individuals are simple in structure, but their collective behavior can be quite complex. Hence, the global behavior of a swarm emerges in a nonlinear manner from the behavior of the individuals in that swarm.

The interaction among individuals plays a vital role in shaping the swarm's behavior. Interaction aids in refining experiential knowledge about the environment, and enhances the progress of the swarm toward optimality. The interaction is determined genetically or through social interaction.

Applications: function optimization, optimal route finding, scheduling, image and data analysis.

Page 33: Training Data

Why is it interesting? The robust nature of animal problem-solving: simple creatures exhibit complex behavior, modified by a dynamic environment (e.g. ants, bees, birds, fish, etc.).

Page 34: Training Data

Two kinds of swarm intelligence:

• Particle swarm optimization — proposed in 1995 by J. Kennedy and R. C. Eberhart, based on the behavior of bird flocks and fish schools.

• Ant colony optimization — defined in 1999 by Dorigo, Di Caro and Gambardella, based on the behavior of ant colonies.

Page 35: Training Data

1. Particle Swarm Optimization

A population-based method with three main principles:
• a particle has a movement (velocity);
• the particle wants to go back to the best position it previously visited;
• the particle tries to move toward the position of the best-positioned particles.

Page 36: Training Data

Four types of neighborhood:
• star (global): all particles are neighbors of all particles;
• ring (circle): each particle has a fixed number of neighbors K (usually 2);
• wheel: only one particle is connected to all particles and acts as a "hub";
• random: N random connections are made between the particles.

Page 37: Training Data
Page 38: Training Data

Algorithm:

1. Initialization: xid(0) = random value, vid(0) = 0.
2. Calculate performance F(xid(t)) (F: performance).
3. Update the best particle: if F(xid(t)) is better than pbest, set pbest = F(xid(t)) and pid = xid(t); same for the gbest (see next slide).
4. Move each particle.
5. Repeat from step 2 until the system converges.

Page 39: Training Data

Particle Dynamics

For convergence: c1 + c2 < 4 [Kennedy 1998].
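A global-best ("star" neighborhood) PSO sketch consistent with the dynamics above. The inertia weight and the specific coefficient values are common later refinements, not from these slides; note that c1 + c2 < 4 holds:

```python
import numpy as np

def pso(f, dim, n_particles=30, iters=300, c1=1.49445, c2=1.49445,
        w_inertia=0.729, lo=-5.0, hi=5.0, seed=0):
    """Minimize f: each particle is pulled toward its own best
    position (pbest) and the swarm's best (gbest)."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(lo, hi, (n_particles, dim))
    v = np.zeros((n_particles, dim))
    pbest = x.copy()
    pbest_val = np.apply_along_axis(f, 1, x)
    g = pbest[np.argmin(pbest_val)].copy()
    for _ in range(iters):
        r1 = rng.random((n_particles, dim))
        r2 = rng.random((n_particles, dim))
        # velocity: inertia + pull to personal best + pull to global best
        v = w_inertia * v + c1 * r1 * (pbest - x) + c2 * r2 * (g - x)
        x = x + v
        vals = np.apply_along_axis(f, 1, x)
        better = vals < pbest_val
        pbest[better], pbest_val[better] = x[better], vals[better]
        g = pbest[np.argmin(pbest_val)].copy()
    return g, f(g)
```

On a simple sphere function the swarm collapses onto the global minimum within a few hundred iterations.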

Page 40: Training Data
Page 41: Training Data
Page 42: Training Data

Examples

http://uk.geocities.com/markcsinclair/pso.html

http://www.engr.iupui.edu/~shi/PSO/AppletGUI.html

Page 43: Training Data

⑬ Fuzzy control of learning rate and slope (Principe, Chap. 4.16)

⑭ Local Minimum Problem
• Restart with different initial weights, learning rates, and numbers of hidden nodes.
• Add (and anneal) a little noise (zero-mean white Gaussian) to the weights or the training data [desired output, or input (for better generalization)].
• Use {Simulated Annealing} or {Genetic Algorithm optimization, then BP}.

⑮ Design aided by a Graphic User Interface — an NN oscilloscope:
look at internal weights / node activities with color coding.

Page 44: Training Data

Students’ Questions from 2005

When the learning rate is optimized and initialized, there must be a rough boundary for it. Is there just an empirical way to find it?

In Conjugate Gradient, s(n) = -g(n+1) …

Learning rate annealing just keeps on decreasing η as n grows, without looking at where on the error surface the current weights are. Is this OK?

Conjugate Gradient is similar to momentum in that the old search direction is utilized in determining the new search direction. It is also similar to the delta-bar-delta rule in using the past trend.

Is CG always faster converging than the SD ?

Do different initial values of the weights affect the output results? How can we choose them?