© 2012 Primož Potočnik NEURAL NETWORKS (0) Organization of the Study #1
NEURAL NETWORKS
Lecturer: Primož Potočnik
University of Ljubljana
Faculty of Mechanical Engineering
Laboratory of Synergetics
www.neural.si
+386-1-4771-167
TABLE OF CONTENTS
0. Organization of the Study
1. Introduction to Neural Networks
2. Neuron Model – Network Architectures – Learning
3. Perceptrons and linear filters
4. Backpropagation
5. Dynamic Networks
6. Radial Basis Function Networks
7. Self-Organizing Maps
8. Practical Considerations
0. Organization of the Study
0.1 Objectives of the study
0.2 Teaching methods
0.3 Assessment
0.4 Lecture plan
0.5 Books
0.6 SLO books
0.7 E-Books
0.8 Online resources
0.9 Simulations
0.10 Homeworks
1. Objectives of the study
• Objectives – Introduce the principles and methods of neural networks (NN)
– Present the principal NN models
– Demonstrate the process of applying NN
• Learning outcomes – Understand the concept of nonparametric modelling by NN
– Explain the most common NN architectures
• Feedforward networks
• Dynamic networks
• Radial Basis Function Networks
• Self-organized networks
– Develop the ability to construct NN for solving real-world problems
• Design proper NN architecture
• Achieve good training and generalization performance
• Implement neural network solution
2. Teaching methods
• Teaching methods: 1. Lectures: 4 hours weekly, classical & practical (MATLAB)
• Tuesday 9:15 - 10:45
• Friday 9:15 - 10:45
2. Homework (home projects)
3. Consultations with the lecturer
• Organization of the study – Nov – Dec: lectures
– Jan: homework presentations
– Jan: exam
• Location – Institute for Sustainable Innovative Technologies,
(Pot za Brdom 104, Ljubljana)
3. Assessment
• ECTS credits: – EURHEO (II): 6 ECTS
• Final mark: – Homework 50% final mark
– Written exam 50% final mark
• Important dates – Homework presentations: Tue, 8 Jan 2013
Fri, 11 Jan 2013
– Written exam: Fri, 18 Jan 2013
4. Lecture plan (1/5)
1. Introduction to Neural Networks 1.1 What is a neural network?
1.2 Biological neural networks
1.3 Human nervous system
1.4 Artificial neural networks
1.5 Benefits of neural networks
1.6 Brief history of neural networks
1.7 Applications of neural networks
2. Neuron Model, Network Architectures and Learning 2.1 Neuron model
2.2 Activation functions
2.3 Network architectures
2.4 Learning algorithms
2.5 Learning paradigms
2.6 Learning tasks
2.7 Knowledge representation
2.8 Neural networks vs. statistical methods
4. Lecture plan (2/5)
3. Perceptrons and Linear Filters 3.1 Perceptron neuron
3.2 Perceptron learning rule
3.3 Adaline
3.4 LMS learning rule
3.5 Adaptive filtering
3.6 XOR problem
4. Backpropagation 4.1 Multilayer feedforward networks
4.2 Backpropagation algorithm
4.3 Working with backpropagation
4.4 Advanced algorithms
4.5 Performance of multilayer perceptrons
4. Lecture plan (3/5)
5. Dynamic Networks 5.1 Historical dynamic networks
5.2 Focused time-delay neural network
5.3 Distributed time-delay neural network
5.4 NARX network
5.5 Layer recurrent network
5.6 Computational power of dynamic networks
5.7 Learning algorithms
5.8 System identification
5.9 Model reference adaptive control
4. Lecture plan (4/5)
6. Radial Basis Function Networks 6.1 RBFN structure
6.2 Exact interpolation
6.3 Commonly used radial basis functions
6.4 Radial Basis Function Networks
6.5 RBFN training
6.6 RBFN for pattern recognition
6.7 Comparison with multilayer perceptron
6.8 RBFN in Matlab notation
6.9 Probabilistic networks
6.10 Generalized regression networks
4. Lecture plan (5/5)
7. Self-Organizing Maps 7.1 Self-organization
7.2 Self-organizing maps
7.3 SOM algorithm
7.4 Properties of the feature map
7.5 Learning vector quantization
8. Practical considerations 8.1 Designing the training data
8.2 Preparing data
8.3 Selection of inputs
8.4 Data encoding
8.5 Principal component analysis
8.6 Invariances and prior knowledge
8.7 Generalization
5. Books
1. Neural Networks and Learning Machines, 3/E Simon Haykin (Pearson Education, 2009)
2. Neural Networks: A Comprehensive Foundation, 2/E Simon Haykin (Pearson Education, 1999)
3. Neural Networks for Pattern Recognition Chris M. Bishop (Oxford University Press, 1995)
4. Practical Neural Network Recipes in C++ Timothy Masters (Academic Press, 1993)
5. Advanced Algorithms for Neural Networks Timothy Masters (John Wiley and Sons, 1995)
6. Signal and Image Processing with Neural Networks Timothy Masters (John Wiley and Sons, 1994)
6. SLO Books
1. Nevronske mreže
Andrej Dobnikar, (Didakta 1990)
2. Modeliranje dinamičnih sistemov z umetnimi nevronskimi mrežami
in sorodnimi metodami
Juš Kocijan, (Založba Univerze v Novi Gorici, 2007)
7. E-Books (1/2)
List of links at www.neural.si
– An Introduction to Neural Networks
Ben Krose & Patrick van der Smagt, 1996
– Neural Networks - Methodology and Applications
Gerard Dreyfus, 2005
– Metaheuristic Procedures for Training Neural Networks
Enrique Alba & Rafael Marti (Eds.), 2006
– FPGA Implementations of Neural Networks
Amos R. Omondi & Jagath C. Rajapakse (Eds.), 2006
– Trends in Neural Computation
Ke Chen & Lipo Wang (Eds.), 2007
Recommended as an easy introduction: An Introduction to Neural Networks (Krose & van der Smagt)
7. E-Books (2/2)
– Neural Preprocessing and Control of Reactive Walking Machines
Poramate Manoonpong, 2007
– Artificial Neural Networks for the Modelling and Fault Diagnosis of
Technical Processes
Krzysztof Patan, 2008
– Speech, Audio, Image and Biomedical Signal Processing using
Neural Networks [only two chapters],
Bhanu Prasad & S.R. Mahadeva Prasanna (Eds.), 2008
– MATLAB Neural Networks Toolbox 7
User's Guide, 2010
8. Online resources
List of links at www.neural.si
• Neural FAQ – by Warren Sarle, 2002
• How to measure importance of inputs – by Warren Sarle, 2000
• MATLAB Neural Networks Toolbox (User's Guide) – latest version
• Artificial Neural Networks on Wikipedia.org
• Neural Networks – online book by StatSoft
• Radial Basis Function Networks – by Mark Orr
• Principal components analysis on Wikipedia.org
• libsvm – Support Vector Machines library
9. Simulations
• Recommended computing platform – MATLAB R2010b (or later) & Neural Network Toolbox 7
http://www.mathworks.com/products/neuralnet/
Acceptable older MATLAB release:
– MATLAB 7.5 & Neural Network Toolbox 5.1 (Release 2007b)
• Introduction to Matlab – Get familiar with MATLAB M-file programming
– Online documentation: Getting Started with MATLAB
• Freeware computing platform – Stuttgart Neural Network Simulator
http://www.ra.cs.uni-tuebingen.de/SNNS/
10. Homeworks
• EURHEO students (II) 1. Practice-oriented projects
2. Based on UC Irvine Machine Learning Repository data
http://archive.ics.uci.edu/ml/
3. Select data set and discuss with lecturer
4. Formulate problem
5. Develop your solution (concept & Matlab code)
6. Describe solution in a short report
7. Submit results (report & Matlab source code)
8. Present results and demonstrate solution
• Presentation (~10 min)
• Demonstration (~20 min)
Video links
• Robots with Biological Brains: Issues and Consequences
Kevin Warwick, University of Reading
http://videolectures.net/icannga2011_warwick_rbbi/
• Computational Neurogenetic Modelling: Methods, Systems,
Applications
Nikola Kasabov, University of Auckland
http://videolectures.net/icannga2011_kasabov_cnm/
1. Introduction to Neural Networks
1.1 What is a neural network?
1.2 Biological neural networks
1.3 Human nervous system
1.4 Artificial neural networks
1.5 Benefits of neural networks
1.6 Brief history of neural networks
1.7 Applications of neural networks
1.8 List of symbols
1.1 What is a neural network? (1/2)
• Neural network – Network of biological neurons
– Biological neural networks are made up
of real biological neurons that are
connected or functionally-related in the
peripheral nervous system or the central
nervous system
• Artificial neurons – Simple mathematical approximations of
biological neurons
What is a neural network? (2/2)
• Artificial neural networks – Networks of artificial neurons
– Very crude approximations of small parts of biological brain
– Implemented as software or hardware
– By “Neural Networks” we usually mean Artificial Neural Networks
– Also called neurocomputers, connectionist networks, parallel distributed processors, ...
Neural network definitions
• Haykin (1999) – A neural network is a massively parallel distributed processor that has a natural
propensity for storing experiential knowledge and making it available for use. It
resembles the brain in two respects:
– Knowledge is acquired by the network through a learning process.
– Interneuron connection strengths known as synaptic weights are used to store
the knowledge.
• Zurada (1992) – Artificial neural systems, or neural networks, are physical cellular systems which
can acquire, store, and utilize experiential knowledge.
• Pinkus (1999) – The question 'What is a neural network?' is ill-posed.
1.2 Biological neural networks
Cortical neurons (nerve cells) growing
in culture
Neurons have a large cell body with
several long processes extending
from it, usually one thick axon and
several thinner dendrites
Dendrites receive information from
other neurons
Axon carries nerve impulses away from
the neuron. Its branching ends make
contacts with other neurons and with
muscles or glands
This complex network forms the
nervous system, which relays
information through the body
Biological neuron
Interaction of neurons
• Action potentials arriving at the synapses stimulate currents in its dendrites
• These currents depolarize
the membrane at its axon, provoking an action potential
• Action potential propagates
down the axon to its synaptic knobs, releasing neurotransmitter and stimulating the post-synaptic neuron (lower left)
Synapses
• Elementary structural and functional units that mediate the interaction between neurons
• Chemical synapse:
pre-synaptic electrical signal → chemical neurotransmitter → post-synaptic electrical signal
Action potential
• Spikes or action potentials – Neurons encode their outputs as a series of voltage pulses
– The axon is very long, with high resistance and high capacitance
– Frequency modulation → improved signal-to-noise ratio
1.3 Human nervous system
• Human nervous system can be represented by three stages:
• Receptors
– collect information from environment (photons on retina, tactile info, ...)
• Effectors – generate interactions with the environment (muscle activation, ...)
• Flow of information – feedforward & feedback
Stimulus → Receptors → Neural net (Brain) → Effectors → Response
Human brain
Human activity is regulated by
a nervous system:
• Central nervous system
– Brain
– Spinal cord
• Peripheral nervous system
≈ 10^10 neurons in the brain
≈ 10^4 synapses per neuron
≈ 1 ms processing speed of a neuron
Slow rate of operation, but an extreme number of processing units and interconnections → massive parallelism
Structural organization of brain
Molecules & ions ................ transmitters
Synapses ............................ fundamental organizational level
Neural microcircuits .......... assemblies of synapses organized into patterns of connectivity to produce desired functions
Dendritic trees .................... subunits of individual neurons
Neurons ............................... basic processing unit, size ≈ 100 μm
Local circuits ....................... localized regions in the brain, size ≈ 1 mm
Interregional circuits .......... pathways, topographic maps
Central nervous system ..... final level of complexity
1.4 Artificial neural networks
• Neuron model
• Network of neurons
What can NNs do?
• In principle – NN can compute any computable function (everything a normal digital computer
can do)
• In practice – NN are especially useful for classification and function approximation problems
which are tolerant of some imprecision
– Almost any finite-dimensional vector function on a compact set can be
approximated to arbitrary precision by feedforward NN
– NNs need a lot of training data
– Hard rules (such as those used in expert systems) are difficult to incorporate
• Problems difficult for NN – Predicting random or pseudo-random numbers
– Factoring large integers
– Determining whether a large integer is prime or composite
– Decrypting anything encrypted by a good algorithm
1.5 Benefits of neural networks (1/3)
1. Ability to learn from examples • Train neural network on training data
• Neural network will generalize on new data
• Noise tolerant
• Many learning paradigms
• Supervised (with a teacher)
• Unsupervised (no teacher, self-organized)
• Reinforcement learning
2. Adaptivity • Neural networks have natural capability to adapt to the changing environment
• Train neural network, then retrain
• Continuous adaptation in nonstationary environment
Benefits of neural networks (2/3)
3. Nonlinearity • Artificial neuron can be linear or nonlinear
• Network of nonlinear neurons has nonlinearity distributed throughout the network
• Important for modelling inherently nonlinear signals
4. Fault tolerance • Capable of robust computation
• Graceful degradation rather than catastrophic failure
Benefits of neural networks (3/3)
5. Massively parallel distributed structure • Well suited for VLSI implementation
• Very fast hardware operation
6. Neurobiological analogy • NN design is motivated by analogy with brain
• NN are research tool for neurobiologists
• Neurobiology inspires further development of artificial NN
7. Uniformity of analysis & design • Neurons represent building blocks of all neural networks
• Similar NN architecture for various tasks: pattern recognition, regression,
time series forecasting, control applications, ...
www.stanford.edu/group/brainsinsilicon/
1.6 Brief history of neural networks (1/2)
-1940 von Helmholtz, Mach, Pavlov, etc. – General theories of learning, vision, conditioning
– No specific mathematical models of neuron operation
1943 McCulloch and Pitts – Proposed the neuron model
1949 Hebb – Published his book The Organization of Behavior
– Introduced Hebbian learning rule
1958 Rosenblatt, Widrow and Hoff – Perceptron, ADALINE
– First practical networks and learning rules
1969 Minsky and Papert – Published book Perceptrons, generalised the limitations of single layer
perceptrons to multilayered systems
– Neural Network field went into hibernation
Brief history of neural networks (2/2)
1974 Werbos
– Developed back-propagation learning method in his PhD thesis
– Several years passed before this approach was popularized
1982 Hopfield
– Published a series of papers on Hopfield networks
1982 Kohonen – Developed the Self-Organising Maps
1980s Rumelhart and McClelland – Backpropagation rediscovered, re-emergence of neural networks field
– Books, conferences, courses, funding in USA, Europe, Japan
1990s Radial Basis Function Networks were developed
2000s The power of Ensembles of Neural Networks and
Support Vector Machines becomes apparent
Current NN research
Topics for the 2013 International Joint Conference on NN
– Neural network theory and models
– Computational neuroscience
– Cognitive models
– Brain-machine interfaces
– Embodied robotics
– Evolutionary neural systems
– Self-monitoring neural systems
– Learning neural networks
– Neurodynamics
– Neuroinformatics
– Neuroengineering
– Neural hardware
– Neural network applications
– Pattern recognition
– Machine vision
– Collective intelligence
– Hybrid systems
– Self-aware systems
– Data mining
– Sensor networks
– Agent-based systems
– Computational biology
– Bioinformatics
– Artificial life
1.7 Applications of neural networks (1/3)
• Aerospace – High performance aircraft autopilots, flight path simulations, aircraft control
systems, autopilot enhancements, aircraft component simulations, aircraft component fault detectors
• Automotive – Automobile automatic guidance systems, warranty activity analyzers
• Banking – Check and other document readers, credit application evaluators
• Defense – Weapon steering, target tracking, object discrimination, facial recognition, new
kinds of sensors, sonar, radar and image signal processing including data compression, feature extraction and noise suppression, signal/image identification
• Electronics – Code sequence prediction, integrated circuit chip layout, process control, chip
failure analysis, machine vision, voice synthesis, nonlinear modeling
Applications of neural networks (2/3)
• Financial – Real estate appraisal, loan advisor, corporate bond rating, credit line use
analysis, portfolio trading program, corporate financial analysis, currency price
prediction
• Manufacturing – Manufacturing process control, product design and analysis, process and
machine diagnosis, real-time particle identification, visual quality inspection
systems, welding quality analysis, paper quality prediction, computer chip quality
analysis, analysis of grinding operations, chemical product design analysis,
machine maintenance analysis, project planning and management, dynamic
modelling of chemical process systems
• Medical – Breast cancer cell analysis, EEG and ECG analysis, prothesis design,
optimization of transplant times, hospital expense reduction, hospital quality
improvement, emergency room test advisement
Applications of neural networks (3/3)
• Robotics – Trajectory control, forklift robot, manipulator controllers, vision systems
• Speech – Speech recognition, speech compression, vowel classification, text to speech
synthesis
• Securities – Market analysis, automatic bond rating, stock trading advisory systems
• Telecommunications – Image and data compression, automated information services, real-time
translation of spoken language, customer payment processing systems
• Transportation – Truck brake diagnosis systems, vehicle scheduling, routing systems
1.8 List of symbols
THIS PRESENTATION | MATLAB
n – iteration, time step
t – time
x – input .................................. p
y – network output ................... a
d – desired (target) output ....... t
f – activation function
v – induced local field .............. n
w – synaptic weight
b – bias
e – error
2. Neuron Model – Network Architectures –
Learning
2.1 Neuron model
2.2 Activation functions
2.3 Network architectures
2.4 Learning algorithms
2.5 Learning paradigms
2.6 Learning tasks
2.7 Knowledge representation
2.8 Neural networks vs. statistical methods
2.1 Neuron model
• Neuron – information processing unit that is fundamental to the operation of a neural
network
• Single input neuron – scalar input x
– synaptic weight w
– bias b
– adder or linear combiner Σ
– activation potential v
– activation function f
– neuron output y
• Adjustable parameters – synaptic weight w
– bias b
y = f(wx + b)
Neuron with vector input
• Input vector x = [x1, x2, ... xR ], R = number of elements in input vector
• Weight vector w = [w1, w2, ... wR ]
• Activation potential v = wx + b
(scalar product of the weight vector and the input vector, plus bias)
• Neuron output
y = f(wx + b) = f(w1x1 + w2x2 + ... + wRxR + b)
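As a check on the formula above, a vector-input neuron can be sketched in a few lines. The course itself uses MATLAB; this stand-alone Python sketch is only illustrative, with a logistic sigmoid chosen as the activation function f:

```python
import math

def neuron(x, w, b, f=lambda v: 1.0 / (1.0 + math.exp(-v))):
    """Single neuron with vector input: y = f(w.x + b)."""
    v = sum(wi * xi for wi, xi in zip(w, x)) + b  # activation potential
    return f(v)

# Example with R = 3 inputs: v = 0.2 + 0.2 - 0.1 + 0.3 = 0.6
y = neuron(x=[1.0, 0.5, -1.0], w=[0.2, 0.4, 0.1], b=0.3)
```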
2.2 Activation functions (1/2)
• Activation function defines the output of a neuron
• Types of activation functions
Threshold function: y(v) = 1 if v ≥ 0; y(v) = 0 if v < 0
Linear function: y(v) = v
Sigmoid function: y(v) = 1 / (1 + exp(−v))
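The three activation functions can be written directly from their definitions (an illustrative Python sketch; the course material uses MATLAB):

```python
import math

def threshold(v):
    """Threshold (Heaviside) function: 1 if v >= 0, else 0."""
    return 1.0 if v >= 0 else 0.0

def linear(v):
    """Linear (identity) function: y = v."""
    return v

def sigmoid(v):
    """Logistic sigmoid: y = 1 / (1 + exp(-v)), smooth and bounded in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-v))
```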
Activation functions (2/2)
McCulloch-Pitts Neuron (1943)
• Vector input, threshold activation function
• Extremely simplified model of real biological neurons – Missing features: non-binary outputs, non-linear summation, smooth thresholding,
stochasticity, temporal information processing
• Nevertheless, computationally very powerful – a network of McCulloch-Pitts neurons is capable of universal computation
y = f(v) = f(wx + b)
y = 1 if wx + b ≥ 0
y = 0 if wx + b < 0
The output is binary, depending on whether
the input meets a specified threshold
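A McCulloch-Pitts neuron with a threshold activation can already realize simple logic gates. A minimal sketch (the weights and biases below are illustrative choices, not taken from the slides):

```python
def mp_neuron(x, w, b):
    """McCulloch-Pitts neuron: binary output, fires (1) iff w.x + b >= 0."""
    v = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if v >= 0 else 0

# AND gate: fires only when both inputs are 1 (needs v = 1 + 1 - 2 >= 0)
AND = lambda x1, x2: mp_neuron([x1, x2], w=[1, 1], b=-2)
# OR gate: fires when at least one input is 1
OR = lambda x1, x2: mp_neuron([x1, x2], w=[1, 1], b=-1)
```

A single such neuron cannot realize XOR, which is the classic limitation revisited in Chapter 3.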
Matlab notation
• Presentation of more complex neurons and networks – Input vector p is represented by the solid dark vertical bar [R x 1]
– Weight vector is shown as single-row, R-column matrix W [1 x R]
– p and W multiply into scalar Wp
Matlab Demos
• nnd2n1 – One input neuron
• nnd2n2 – Two input neuron
2.3 Network architectures
About network architectures – Two or more of the neurons can be combined in a layer
– Neural network can contain one or more layers
– Strong link between network architecture and learning algorithm
1. Single-layer feedforward networks • Input layer of source nodes projects onto an output layer of neurons
• “Single-layer” refers to the output layer (the only computational layer)
2. Multi-layer feedforward networks • One or more hidden layers
• Can extract higher-order statistics
3. Recurrent networks • Contains at least one feedback loop
• Powerful temporal learning capabilities
Single-layer feedforward networks
Multi-layer feedforward networks (1/2)
Multi-layer feedforward networks (2/2)
• Data flow strictly feedforward: input → output
• No feedback → static network, easy learning
Recurrent networks (1/2)
• Also called “Dynamic networks”
• Output depends on – current input to the network (as in static networks)
– and also on current or previous inputs, outputs, or states of the network
• Simple recurrent network
Recurrent networks (2/2)
• Layered Recurrent Dynamic Network – example
2.4 Learning algorithms
• Important ability of neural networks – To learn from its environment
– To improve its performance through learning
• Learning process 1. Neural network is stimulated by an environment
2. Neural network undergoes changes in its free parameters as a result of this stimulation
3. Neural network responds in a new way to the environment because of its changed internal structure
• Learning algorithm Prescribed set of defined rules for the solution of a learning problem
1. Error correction learning
2. Memory-based learning
3. Hebbian learning
4. Competitive learning
Error-correction learning (1/2)
1. Neural network is driven by input x(t) and responds with output y(t)
2. Network output y(t) is compared with target output d(t)
Error signal = difference between the target output and the network output:
e(t) = d(t) − y(t)
Error-correction learning (2/2)
• Error signal → control mechanism to correct synaptic weights
• Corrective adjustments designed to make network output y(t)
closer to target d(t)
• Learning achieved by minimizing instantaneous error energy
• Delta learning rule (Widrow-Hoff rule) – Adjustment to a synaptic weight of a neuron is proportional to the product of the error signal
and the input signal of the synapse
• Comments – Error signal must be directly measurable
– Key parameter: learning rate η
– Closed-loop feedback system → stability determined by the learning rate η
Instantaneous error energy: E(t) = ½ e²(t)
Delta rule: Δw(t) = η e(t) x(t)
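The delta rule can be exercised on a toy problem: a linear neuron (f(v) = v) adapting its weight and bias to fit the target mapping d = 2x + 1. A hedged Python sketch; the learning rate, data range, and iteration count are arbitrary illustrative choices:

```python
import random

random.seed(0)
w, b, eta = 0.0, 0.0, 0.1    # initial weight, bias, learning rate

for _ in range(1000):
    x = random.uniform(-1.0, 1.0)
    d = 2.0 * x + 1.0        # desired (target) output
    y = w * x + b            # linear neuron output: f(v) = v
    e = d - y                # error signal e(t) = d(t) - y(t)
    w += eta * e * x         # delta rule: delta_w = eta * e * x
    b += eta * e             # bias treated as a weight on a constant input of 1
```

After training, (w, b) approaches (2, 1): the per-sample updates minimize the instantaneous error energy ½e².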
Memory-based learning
• All (or most) past experiences are stored in a memory
of input-output pairs (inputs and target classes)
• Two essential ingredients of memory-based learning 1. Define local neighborhood of a new input xnew
2. Apply learning rule to adapt stored examples in the local neighborhood of xnew
• Examples of memory-based learning – Nearest neighbor rule
• Local neighborhood defined by the nearest training example (Euclidean distance)
– K-nearest neighbor classifier
• Local neighborhood defined by the k nearest training examples → robust against outliers
– Radial basis function network
• Selecting the centers of basis functions
Stored set of input-output examples: {(xi, yi)}, i = 1, ..., N
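The nearest-neighbor rule is easy to sketch: classify a new input by the class of the closest stored example under Euclidean distance. The stored examples below are illustrative, not from the slides:

```python
import math

def nearest_neighbor(x_new, memory):
    """memory: list of (x_i, y_i) pairs; returns the class of the nearest x_i."""
    def dist(a, b):
        # Euclidean distance between two input vectors
        return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))
    _, y = min(memory, key=lambda pair: dist(pair[0], x_new))
    return y

# Toy memory of labeled examples
memory = [([0.0, 0.0], "A"), ([1.0, 1.0], "B"), ([0.9, 0.8], "B")]
```

The k-nearest-neighbor variant replaces `min` with a majority vote over the k closest examples, which is what makes it robust to outliers.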
Hebbian learning
• The oldest and most famous learning rule (Hebb, 1949) – Formulated as associative learning in a neurobiological context
“When an axon of a cell A is near enough to excite a cell B and repeatedly or
persistently takes part in firing it, some growth process or metabolic changes take place
in one or both cells such that A’s efficiency as one of the cells firing B, is increased.”
– Strong physiological evidence for Hebbian learning in hippocampus,
important for long term memory and spatial navigation
• Hebbian learning (Hebbian synapse) – Time dependent, highly local, and strongly interactive mechanism to increase
synaptic efficiency as a function of the correlation between the presynaptic and
postsynaptic activities.
1. If two neurons on either side of a synapse are activated simultaneously, then the
strength of that synapse is selectively increased
2. If two neurons on either side of a synapse are activated asynchronously, then that
synapse is selectively weakened or eliminated
– Simplest form of Hebbian learning
Δw(t) = η y(t) x(t)
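The simplest Hebbian update Δw = η y x can be sketched directly; repeated co-activation of an input and the output strengthens the corresponding weight (all values illustrative). Note that in this raw form the weights grow without bound unless some normalization is added:

```python
def hebbian_update(w, x, y, eta=0.1):
    """Simplest Hebbian rule: weight change proportional to pre * post activity."""
    return [wi + eta * y * xi for wi, xi in zip(w, x)]

w = [0.0, 0.0]
# Present a pattern where the first input and the output are both active
for _ in range(5):
    w = hebbian_update(w, x=[1.0, 0.0], y=1.0)
# Only the co-active synapse is strengthened; the inactive one is unchanged
```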
Competitive learning
• Competitive learning network architecture 1. Set of inputs, connected to a layer of outputs
2. Each output neuron receives excitation from all inputs
3. Output neurons of a neural network compete to
become active by exchanging lateral inhibitory connections
4. Only a single neuron is active at any time
• Competitive learning rule – Neuron with the largest induced local field becomes a winning neuron
– Winning neuron shifts its synaptic weights toward the input
→ Individual neurons specialize on ensembles of similar patterns and become feature detectors for different classes of input patterns
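A minimal winner-take-all sketch of the competitive rule above: the neuron with the largest induced local field wins and shifts its weights toward the input. The data, learning rate, and two-neuron layout are illustrative assumptions:

```python
def competitive_step(weights, x, eta=0.5):
    """One competitive learning step; weights is a list of weight vectors."""
    # Winner = output neuron with the largest induced local field w.x
    win = max(range(len(weights)),
              key=lambda i: sum(wi * xi for wi, xi in zip(weights[i], x)))
    # Only the winner shifts its weights toward the input
    weights[win] = [wi + eta * (xi - wi) for wi, xi in zip(weights[win], x)]
    return win

weights = [[1.0, 0.0], [0.0, 1.0]]   # two output neurons
win = competitive_step(weights, x=[0.9, 0.1])
# Neuron 0 wins (0.9 > 0.1) and moves toward the input; neuron 1 is untouched
```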
2.5 Learning paradigms
• Learning algorithm – Prescribed set of defined rules for the solution of a learning problem
• Learning paradigm – Manner in which a neural network relates to its environment
1. Supervised learning
2. Unsupervised learning
3. Reinforcement learning
1. Error correction learning 2. Memory-based learning 3. Hebbian learning 4. Competitive learning
Supervised learning
• Learning with a teacher – Teacher has a knowledge of the environment
– Knowledge is represented by a set of input-output examples
• Learning algorithm – Error-correction learning
– Memory-based learning
[Diagram: the environment provides input to both the teacher and the learning system; the teacher's target response (optimal action) is compared with the learning system's actual response, and the difference forms the error signal]
Unsupervised learning
• Unsupervised or self-organized learning – No external teacher to oversee the learning process
– Only a set of input examples is available, no output examples
– Unsupervised NNs usually perform some kind of data compression, such as
dimensionality reduction or clustering
• Learning algorithms – Hebbian learning
– Competitive learning
Reinforcement learning
– No teacher, environment only offers primary reinforcement signal
– System learns under delayed reinforcement
• Temporal sequence of inputs which result in the generation of a reinforcement signal
– Goal is to minimize the expectation of the cumulative cost of actions taken over
a sequence of steps
– RL is realized through two neural networks:
Critic and Learning system
– Critic network converts primary reinforcement
signal (obtained directly from environment)
into a higher quality heuristic reinforcement signal
which solves temporal credit assignment problem
[Diagram: Environment → Critic → Learning system; the Critic converts the primary reinforcement from the environment into a heuristic reinforcement, and the Learning system acts back on the environment]
2.6 Learning tasks (1/7)
1. Pattern Association – Associative memory is brain-like distributed memory that learns by association
– Two phases in the operation of associative memory
1. Storage
2. Recall
– Autoassociation
• Neural network stores a set of patterns by repeatedly presenting them to the network
• Then, when presented with a distorted pattern, the neural network is able to recall the original
pattern
• Unsupervised learning algorithms
– Heteroassociation
• Set of input patterns is paired with arbitrary set of output patterns
• Supervised learning algorithms
2. Pattern Recognition – In pattern recognition, input signals are assigned to categories (classes)
– Two phases of pattern recognition
1. Learning (supervised)
2. Classification
– Statistical nature of pattern recognition
• Patterns are represented in multidimensional
decision space
• Decision space is divided by separate
regions for each class
• Decision boundaries are determined by a
learning process
• Support-Vector-Machine example
Learning tasks (2/7)
3. Function Approximation – Arbitrary nonlinear input-output mapping
y = f(x)
can be approximated by a neural network, given a set of labeled examples
{xi, yi}, i=1,..,N
– The task is to approximate the mapping f(x) by a neural network F(x)
so that f(x) and F(x) are close enough
||F(x) – f(x)|| < ε for all x, (ε is a small positive number)
– Neural network mapping F(x) can be realized by supervised learning
(error-correction learning algorithm)
– Important function approximation tasks
• System identification
• Inverse system
Learning tasks (3/7)
Learning tasks (4/7)
• System identification
• Inverse system
[Diagram (system identification): Environment → Unknown system and Neural network in parallel; error signal = unknown system response − network output]
[Diagram (inverse system): Environment → System → Neural network; error signal = inputs from the environment − network output]
4. Control • Neural networks can be used to control a plant (a process)
• Brain is the best example of a parallel distributed generalized controller
• Operates thousands of actuators (muscles)
• Can handle nonlinearity and noise
• Can handle invariances
• Can optimize over long-range planning horizon
– Feedback control system (Model reference control)
• NN controller has to supply inputs that will drive a plant according to a reference
Learning tasks (5/7)
– Model predictive control
• NN model provides multi-step ahead predictions for optimizer
5. Filtering • Filter – device or algorithm used to extract information about a prescribed
quantity of interest from a noisy data set
• Filters can be used for three basic information processing tasks:
1. Filtering
• Extraction of information at discrete time n by using measured data up to and including
time n
• Examples: Cocktail party problem, Blind source separation
2. Smoothing
• Differs from filtering in:
a) Data need not be available at time n
b) Data measured later than n can be used to obtain this information
3. Prediction
• Deriving information about the quantity in the future at time n+h, h>0, by using data
measured up to and including time n
• Example: Forecasting of energy consumption, stock market prediction
Learning tasks (6/7)
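The three time-index conventions above can be illustrated on a toy series (a sketch; the series, the window sizes, and the naive persistence forecast are arbitrary choices for the demo):

```python
import numpy as np

# Toy noisy series; we estimate the quantity of interest around time n = 5.
x = np.array([1.0, 1.2, 0.9, 1.1, 1.0, 1.3, 0.8, 1.1, 1.0, 1.2])
n, h = 5, 2

# Filtering: only data up to and including time n may be used (causal average).
filtered = x[: n + 1].mean()

# Smoothing: data measured later than n may also be used (centered window).
smoothed = x[n - 1 : n + 2].mean()

# Prediction: estimate the value at time n + h using data up to n
# (naive persistence forecast shown here).
predicted = x[n]
```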
6. Beamforming – Spatial form of filtering, used to distinguish between the spatial properties of a
target signal and background noise
– Device is called a beamformer
– Beamforming is used in the human auditory response and by echo-locating bats
→ the task is suitable for neural network application
– Common beamforming tasks: radar and sonar systems
• Task is to detect a target in the presence of receiver noise and interfering signals
• Target signal originates from an unknown direction
• No a priori information available on interfering signals
– Neural beamformer, neuro-beamformer, attentional neurocomputers
Learning tasks (7/7)
Adaptation
Learning has spatio-temporal nature – Space and time are fundamental dimensions of learning (control, beamforming)
1. Stationary environment – Learning under the supervision of a teacher, weights then frozen
– Neural network then relies on memory to exploit past experiences
2. Nonstationary environment – Statistical properties of environment change with time
– Neural network should continuously adapt its weights in real-time
– Adaptive system → continuous learning
3. Pseudostationary environment – Changes are slow over a short temporal window
• Speech – stationary in interval 10-30 ms
• Ocean radar – stationary in interval of several seconds
2.7 Knowledge representation
• What is knowledge? – Stored information or models used by a person or machine to interpret, predict,
and appropriately respond to the outside world (Fischler & Firschein, 1987)
• Knowledge representation – Good solution depends on a good representation of knowledge
– Knowledge of the world consists of:
1. Prior information – facts about what is and what has been known
2. Observations of the world – measurements, obtained through sensors designed
to probe the environment
Observations can be:
1. Labeled – input signals are paired with desired response
2. Unlabeled – input signals only
Knowledge representation in NN
• Design of neural networks based directly on real-life data – Examples to train the neural network are taken from observations
• Examples to train neural network can be – Positive examples ... input and correct target output
• e.g. sonar data + echos from submarines
– Negative examples ... input and false output
• e.g. sonar data + echos from marine life
• Knowledge representation in neural networks – Defined by the values of free parameters (synaptic weights and biases)
– Knowledge is embedded in the design of a neural network
– Interpretation problem – neural networks suffer from inability to explain how a
result (decision / prediction / classification) was obtained
• Serious limitation for safety-critical applications (medical diagnosis, air traffic)
• Explanation capability by integration of NN and other artificial intelligence methods
Knowledge representation rules for NN
Rule 1 Similar inputs from similar classes should produce similar representations inside the network, and should be classified to the same category
Rule 2 Items to be categorized as separate classes should be given widely different representations in the network
Rule 3 If a particular feature is important, then there should be a large number of neurons involved in the representation of that item in the network
Rule 4 Prior information and invariances should be built into the design of a neural network, thereby simplifying the network design by not having to learn them
Prior information and invariances (Rule 4)
• Application of Rule 4 results in neural networks with
specialized structure – Biological visual and auditory networks are highly specialized
– Specialized network has smaller number of parameters
• needs less training data
• faster learning
• faster network throughput
• cheaper because of its smaller size
• How to build prior information into neural network
design – Currently no well-defined rules, but useful ad-hoc procedures
– We may use a combination of two techniques
1. Receptive fields – restricting the network architecture by using local connections
2. Weight-sharing – several neurons share the same synaptic weights
How to build invariances into NN
Character recognition example – Transformations: the pattern recognition system should be invariant to them
Techniques 1. Invariance by neural network structure
2. Invariance by training
3. Invariant feature space
[Figure: transformed characters – Original, Size, Rotation, Shift, Incomplete image]
Invariant feature space
• Neural net classifier with invariant feature extractor
• Features – Characterize the essential information content of the input data
– Should be invariant to transformations of the input
• Benefits 1. Dimensionality reduction – number of features is small compared to the original
input space
2. Relaxed design requirements for a neural network
3. Invariances for all objects can be assured (for known transformations)
Prior knowledge is required!
[Diagram: Input → Invariant feature extractor → Neural network classifier → Class estimate]
Example 2A (1/4)
Invariant character recognition
• Problem: distinguishing handwritten characters ‘a’ and ‘b’
• Classifier design
• Image representation – Grid of pixels (typically 256x256) with gray level [0..1] (typically 8-bit coding)
[Diagram: Image → Invariant feature extractor → Neural network classifier → Class estimate: ‘A’ or ‘B’]
Example 2A (2/4)
Problems with image representation
1. Invariance problem (various transformations)
2. High dimensionality problem – Image size 256×256 → 65536 inputs
Curse of dimensionality – increasing input dimensionality leads to sparse data,
and this provides a very poor representation of the mapping
→ problems with correct classification and generalization
Possible solution – Combining inputs into features
Goal is to obtain just a few features instead of 65536 inputs
Ideas for feature extraction (for character recognition):
F1 = character height / character width
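A sketch of extracting such a feature from a binary character image (measuring the bounding box of the 'on' pixels is an assumption made here for illustration):

```python
import numpy as np

def f1_feature(img):
    """F1 = character height / character width, measured from the
    bounding box of the 'on' pixels of a binary character image."""
    rows = np.where(img.any(axis=1))[0]   # rows containing character pixels
    cols = np.where(img.any(axis=0))[0]   # columns containing character pixels
    height = rows[-1] - rows[0] + 1
    width = cols[-1] - cols[0] + 1
    return height / width

# A tall, narrow stroke (as in 'b') gives a large F1
img = np.zeros((8, 8))
img[1:7, 3:5] = 1.0       # 6 rows high, 2 columns wide
print(f1_feature(img))    # -> 3.0
```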
Example 2A (3/4)
Feature extraction
• Extracted feature: F1 = character height / character width
• Distribution of F1 for various samples from class ‘A’ and class ‘B’
• Overlapping distributions: need for additional features – F1, F2, F3, ...
[Figure: histograms of F1 for samples from class ‘A’ and class ‘B’, with a decision threshold on F1 separating the two classes]
Example 2A (4/4)
Classification in multi feature space
• Classification in the space of two features (F1, F2)
• Neural network can be used for classification in the
feature space (F1, F2) – 2 inputs instead of 65536 original inputs
– Improved generalization and classification ability
[Figure: feature space (F1, F2) with samples from class ‘A’ and class ‘B’ separated by a decision boundary]
Generalization and model complexity
• What is the optimal decision boundary?
– Best generalization is achieved by a model whose complexity is
neither too small nor too large
– Occam’s razor principle: we should prefer simpler models to more
complex models
– Tradeoff: modeling simplicity vs. modeling capacity
[Figure: three decision boundaries – a linear classifier is insufficient (false classifications); an optimal classifier?; an over-fitted classifier (correct classification but poor generalization)]
2.8 Neural networks vs. stat. methods (1/3)
• Considerable overlap between neural nets and statistics – Statistical inference means learning to generalize from noisy data
– Feedforward nets are a subset of the class of nonlinear regression and discrimination models
– Application of statistical theory to neural networks: Bishop (1995), Ripley (1996)
• Most NN that can learn to generalize effectively from
noisy data are similar or identical to statistical methods – Single-layered feedforward nets are basically generalized linear models
– Two-layer feedforward nets are closely related to projection pursuit regression
– Probabilistic neural nets are identical to kernel discriminant analysis
– Kohonen nets for adaptive vector quantization are similar to k-means cluster analysis
– Kohonen self-organizing maps are discrete approximations to principal curves and surfaces
– Hebbian learning is closely related to principal component analysis
• Some neural network areas have no relation to statistics – Reinforcement learning
– Stopped training (similar to shrinkage estimation, but the method is quite different)
Neural networks vs. statistical methods (2/3)
• Many statistical methods can be used for flexible nonlinear modeling
• Polynomial regression, Fourier series regression
• K-nearest neighbor regression and discriminant analysis
• Kernel regression and discriminant analysis
• Wavelet smoothing, Local polynomial smoothing
• Smoothing splines, B-splines
• Tree-based models (CART, AID, etc.)
• Multivariate adaptive regression splines (MARS)
• Projection pursuit regression, various Bayesian methods
• Why use neural nets rather than statistical methods?
– Multilayer perceptron (MLP) tends to be useful in similar situations as projection pursuit regression, i.e.:
• the number of inputs is fairly large,
• many of the inputs are relevant, but
• most of the predictive information lies in a low-dimensional subspace
– Some advantages of MLPs over projection pursuit regression • computing predicted values from MLPs is simpler and faster
• MLPs are better at learning moderately pathological functions than are many other methods with stronger smoothness assumptions
Neural networks vs. statistical methods (3/3)
Neural Network Jargon ... Statistical Jargon
– Generalizing from noisy data ... Statistical inference
– Neuron, unit, node ... A simple linear or nonlinear computing element that accepts one or more inputs and computes a function thereof
– Neural networks ... A class of flexible nonlinear regression and discriminant models, data reduction models, and nonlinear dynamical systems
– Architecture ... Model
– Training, Learning, Adaptation ... Estimation, Model fitting, Optimization
– Classification ... Discriminant analysis
– Mapping, Function approximation ... Regression
– Competitive learning ... Cluster analysis
– Hebbian learning ... Principal components
– Training set ... Sample, Construction sample
– Input ... Independent variables, Predictors, Regressors, Explanatory variables, Carriers
– Output ... Predicted values
– Generalization ... Interpolation, Extrapolation, Prediction
– Prediction ... Forecasting
MATLAB example
• nn02_neuron_output
MATLAB example
• nn02_custom_nn
MATLAB example
• nnstart
3. Perceptrons and Linear Filters
3.1 Perceptron neuron
3.2 Perceptron learning rule
3.3 Perceptron network
3.4 Adaline
3.5 LMS learning rule
3.6 Adaline network
3.7 ADALINE vs. Perceptron
3.8 Adaptive filtering
3.9 XOR problem
Introduction
• Pioneering neural network contributions – McCulloch & Pitts (1943) – the idea of neural networks as computing machines
– Rosenblatt (1958) – proposed perceptron as the first supervised learning model
– Widrow and Hoff (1961) – least-mean-square learning as an important
generalization of perceptron learning
• Perceptron – Layer of McCulloch–Pitts neurons with adjustable synaptic weights
– Simplest form of a neural network for classification of linearly separable patterns
– Perceptron convergence theorem for two linearly separable classes
• Adaline – Similar to perceptron, trained with LMS learning
– Used for linear adaptive filters
3.1 Perceptron neuron
• Perceptron neuron (McCulloch–Pitts neuron):
hard-limit (threshold) activation function
y(v) = 1 if v ≥ 0
y(v) = 0 if v < 0
• Perceptron output: 0 or 1 → useful for classification
If y = 0 → pattern belongs to class A
If y = 1 → pattern belongs to class B
[Diagram: input vector x, bias 1, induced local field v, hard-limit output y]
Linear discriminant function
• Perceptron with two inputs
– Separation between the two classes is a straight line, given by
– Geometric representation
– Perceptron represents linear discriminant function
y = f(wx + b) = f(w1 x1 + w2 x2 + b)
w1 x1 + w2 x2 + b = 0
x2 = −(w1/w2) x1 − b/w2
[Diagram: two-input perceptron and the separating line in the (x1, x2) plane]
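The discriminant can be checked numerically in a short sketch (the weights and test points here are arbitrary choices for illustration):

```python
import numpy as np

def perceptron(x, w, b):
    """Two-input perceptron: y = 1 if w1*x1 + w2*x2 + b >= 0, else 0."""
    return 1 if np.dot(w, x) + b >= 0 else 0

# Weights chosen so the boundary is the line x1 + x2 - 1 = 0,
# i.e. x2 = -(w1/w2)*x1 - b/w2 = -x1 + 1
w, b = np.array([1.0, 1.0]), -1.0

print(perceptron([0.0, 0.0], w, b))   # below the line -> class 0
print(perceptron([1.0, 1.0], w, b))   # above the line -> class 1
```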
Matlab Demos (Perceptron)
• nnd2n2 – Two input perceptron
• nnd4db – Decision boundaries
How to train a perceptron?
• How to train weights and bias? – Perceptron learning rule
– Least-mean-square learning rule or “delta rule”
• Both are iterative learning procedures 1. A learning sample is presented to the network
2. For each network parameter, the new value is computed by adding a correction
• Formulation of the learning problem – How do we compute Δw(n) and Δb(n) in order to classify the learning patterns
correctly?
w_j(n+1) = w_j(n) + Δw_j(n)
b(n+1) = b(n) + Δb(n)
3.2 Perceptron learning rule
• A set of learning samples (inputs and target classes)
• Objective: Reduce error e between target class d and neuron response y
(error-correction learning)
e = d - y
• Learning procedure 1. Start with random weights for the connections
2. Present an input vector xi from the set of training samples
3. If perceptron response is wrong: y≠d, e≠0, modify all connections w
4. Go back to 2
{(x_i, d_i)}, i = 1, …, N,  d_i ∈ {0, 1}
Three conditions for a neuron
• After the presentation of input x, the neuron can be in
three conditions:
– CASE 1:
If neuron output is correct, weights w are not altered
– CASE 2:
Neuron output is 0 instead of 1 (y=0, d=1, e=d-y=1)
Input x is added to weight vector w
• This makes the weight vector point closer to the input vector, increasing the chance that
the input vector will be classified as 1 in the future.
– CASE 3:
Neuron output is 1 instead of 0 (y=1, d=0, e=d-y=-1)
Input x is subtracted from weight vector w
• This makes the weight vector point farther away from the input vector, increasing the
chance that the input vector will be classified as a 0 in the future.
Three conditions rewritten
• Three conditions for a neuron rewritten – CASE 1: e = 0 → Δw = 0
– CASE 2: e = 1 → Δw = x
– CASE 3: e = −1 → Δw = −x
• Three conditions in a single expression Δw = (d-y)x = ex
• Similar for the bias Δb = (d-y)(1) = e
• Perceptron learning rule
w_j(n+1) = w_j(n) + e(n) x_j(n)
b(n+1) = b(n) + e(n)
Convergence theorem
• For the perceptron learning rule there exists a
convergence theorem:
Theorem 1 If there exists a set of connection weights w which is able to perform the
transformation d = y(x), the perceptron learning rule will converge to some solution
in a finite number of steps for any initial choice of the weights.
• Comments – Theorem is only valid for linearly separable classes
– Outliers can cause long training times
– If classes are linearly separable, the perceptron offers a powerful pattern recognition
tool
Perceptron learning rule summary
1. Start with random weights for the connections w
2. Select an input vector x from the set of training samples
3. If perceptron response is wrong: y≠d, modify all
connections according to learning rule:
4. Go back to 2 (until all input vectors are correctly classified)
Δw = e x
Δb = e
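The whole procedure fits in a few lines (a sketch; the logical-AND data set and the epoch cap are illustrative choices, not course code):

```python
import numpy as np

def train_perceptron(X, d, epochs=20):
    """Perceptron learning rule: on error e = d - y,
    update w <- w + e*x and b <- b + e."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        errors = 0
        for x, t in zip(X, d):
            y = 1 if np.dot(w, x) + b >= 0 else 0   # hard-limit output
            e = t - y
            if e != 0:
                w += e * x
                b += e
                errors += 1
        if errors == 0:    # stop once all samples are classified correctly
            break
    return w, b

# Linearly separable example: logical AND
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
d = np.array([0, 0, 0, 1])
w, b = train_perceptron(X, d)
```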
Matlab demo (Perceptron learning rule)
• nnd4pr – Two input perceptron
MATLAB example
nn03_perceptron
• Classification of linearly separable data with a perceptron
Matlab demo (Presence of an outlier)
• demop4 – Slow learning with the presence of an outlier
Matlab demo (Linearly non-separable classes)
• demop6 – Perceptron attempts to classify linearly non-
separable classes
Matlab demo (Classification application)
• nnd3pc – Perceptron classification fruit example
3.3 Perceptron network
• Single layer of perceptron neurons
• Classification in more than two linearly separable
classes
MATLAB example
nn03_perceptron_network
• Classification of 4-class problem with a 2-neuron perceptron
3.4 Adaline
• ADALINE = Adaptive Linear Element
• Widrow and Hoff, 1961:
LMS learning (Least mean square) or Delta rule
• Important generalization of perceptron learning rule
• Main difference with perceptron activation function – Perceptron: Threshold activation function
– ADALINE: Linear activation function
• Both Perceptron and ADALINE can only solve linearly
separable problems
Linear neuron
• Basic ADALINE element
y(v) = v (linear transfer function)
y = wx + b
[Diagram: input vector x, bias 1, induced local field v, linear output y]
Simple ADALINE
• Simple ADALINE with two inputs
• Like a perceptron, ADALINE has a decision boundary – defined by network inputs for which network output is zero
– see Perceptron decision boundary
• ADALINE can be used to
classify objects into categories
y = f(wx + b) = w1 x1 + w2 x2 + b
Decision boundary: w1 x1 + w2 x2 + b = 0
3.5 LMS learning rule
• LMS = Least-Mean-Square learning rule
• A set of learning samples (inputs and target classes)
• Objective: reduce error e between target class d and
neuron response y (error-correction learning)
e = d – y
• Goal is to minimize the average sum of squared errors
{(x_i, d_i)}, i = 1, …, N
mse = (1/N) Σ_{n=1}^{N} e(n)² = (1/N) Σ_{n=1}^{N} (d(n) − y(n))²
LMS algorithm (1/3)
• LMS algorithm is based on approximate steepest decent
procedure – Widrow & Hoff introduced the idea to estimate mean-square-error
– by using square-error at each iteration
– and change the network weights proportional to the negative derivative of error
– with some learning constant η
e²(n) = (d(n) − y(n))²
Δw_j(n) = −η ∂e²(n)/∂w_j
mse = (1/N) Σ_{n=1}^{N} (d(n) − y(n))²
LMS algorithm (2/3)
– Now we expand the expression for weight change
Δw_j(n) = −η ∂e²(n)/∂w_j = −2η e(n) ∂e(n)/∂w_j
– Expanding the neuron activation y(n)
y(n) = Wx(n) = w1 x1(n) + … + w_j x_j(n) + … + w_R x_R(n)
so that ∂e(n)/∂w_j = ∂(d(n) − y(n))/∂w_j = −x_j(n)
– and using the cosmetic correction, we finally obtain the weight change at step n
Δw_j(n) = 2η e(n) x_j(n)
LMS algorithm (3/3)
• Final form of LMS learning rule
– Learning is regulated by a learning rate η
– Stable learning → learning rate η must be less than the reciprocal of the largest
eigenvalue of the correlation matrix xᵀx of input vectors
• Limitations – Linear network can only learn linear input-output mappings
– Proper selection of learning rate η
w_j(n+1) = w_j(n) + 2η e(n) x_j(n)
b(n+1) = b(n) + 2η e(n)
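A sketch of the LMS rule on a noiseless linear target (the data set, learning rate, and epoch count are illustrative assumptions):

```python
import numpy as np

def train_adaline(X, d, eta=0.05, epochs=100):
    """LMS (Widrow-Hoff) rule with linear output y = w.x + b:
    w <- w + 2*eta*e*x,  b <- b + 2*eta*e,  e = d - y."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        for x, t in zip(X, d):
            e = t - (np.dot(w, x) + b)
            w += 2 * eta * e * x
            b += 2 * eta * e
    return w, b

# Learn the linear mapping d = 2*x1 - x2 + 0.5
rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=(50, 2))
d = 2 * X[:, 0] - X[:, 1] + 0.5
w, b = train_adaline(X, d)
print(w, b)   # close to [2, -1] and 0.5
```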
Matlab demo (LMS learning)
• pp02 – Gradient descent learning by the LMS learning rule
3.6 Adaline network
• ADALINE network = MADALINE
(single layer of ADALINE neurons)
3.7 ADALINE vs. Perceptron
• Architectures
• Learning rules
ADALINE – LMS learning:
w_j(n+1) = w_j(n) + 2η e(n) x_j(n)
b(n+1) = b(n) + 2η e(n)
PERCEPTRON – Perceptron learning:
w_j(n+1) = w_j(n) + e(n) x_j(n)
b(n+1) = b(n) + e(n)
[Diagram: ADALINE with linear activation vs. Perceptron with hard-limit activation]
ADALINE and Perceptron summary
• Single layer networks can be built based on ADALINE or
Perceptron neurons
• Both architectures are suitable to learn only linear input-
output relationships
• Perceptron with threshold activation function is suitable
for classification problems
• ADALINE with linear output is more suitable for
regression & filtering
• ADALINE is suitable for continuous learning
3.8 Adaptive filtering
• ADALINE is one of the most widely used neural
networks in practical applications
• Adaptive filtering is one of its major application areas
• We introduce the new element:
Tapped delay line – Input signal enters from the left and passes through
N-1 delays
– Output of the tapped delay line (TDL) is an N-dimensional
vector, composed from current and past inputs
Adaptive filter
• Adaptive filter = ADALINE combined with TDL
a(k) = Wp + b = Σ_{i=1}^{N} w_i p(k − i + 1) + b
Simple adaptive filter example
• Adaptive filter with three delayed inputs
a(t) = w1 p(t) + w2 p(t−1) + w3 p(t−2) + b
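A sketch of this three-tap filter adapting online with the LMS rule to predict a series one step ahead (the sinusoidal signal and learning rate are illustrative assumptions):

```python
import numpy as np

def adaline_filter_predict(p, eta=0.05, taps=3):
    """a(t) = w1*p(t) + w2*p(t-1) + w3*p(t-2) + b, trained online
    with the LMS rule against the next sample p(t+1)."""
    w, b = np.zeros(taps), 0.0
    preds = np.zeros_like(p)
    for t in range(taps - 1, len(p) - 1):
        x = p[t - taps + 1 : t + 1][::-1]   # [p(t), p(t-1), p(t-2)]
        y = np.dot(w, x) + b                # filter output a(t)
        e = p[t + 1] - y                    # prediction error
        w += 2 * eta * e * x                # LMS weight update
        b += 2 * eta * e
        preds[t + 1] = y
    return w, b, preds

t = np.arange(2000)
p = np.sin(0.1 * t)
w, b, preds = adaline_filter_predict(p)
# late predictions track the series far better than the early ones
```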
Adaptive filter for prediction
• Adaptive filter can be used to predict the next value of a
time series p(t+1)
[Diagram: time series … p(t−2), p(t−1), p(t), p(t+1); during learning the filter adapts on known samples, and in operation it predicts p(t+1) from samples up to the present time]
Noise cancellation example
• Adaptive filter can be used to cancel engine noise in
pilot’s voice in an airplane
– The goal is to obtain a signal that contains
the pilot’s voice, but not the engine noise.
– Linear neural net is adaptively trained to
predict the combined pilot/engine signal m
from an engine signal n. Only engine noise
n is available to the network, so it only
learns to predict the engine’s contribution to
the pilot/engine signal m.
– The network error e becomes equal to the
pilot’s voice. The linear adaptive network
adaptively learns to cancel the engine noise.
– Such adaptive noise canceling generally
does a better job than a classical filter,
because the noise here is subtracted from
rather than filtered out of the signal m.
Single-layer adaptive filter network
• If more than one output neuron is required, a tapped
delay line can be connected to a layer of neurons
Matlab demos (ADALINE)
• nnd10eeg – ADALINE for noise filtering of EEG signals
• nnd10nc – Adaptive noise cancelation
MATLAB example
nn_03_adaline
• ADALINE time series prediction with adaptive linear filter
3.9 XOR problem
• Single layer perceptron cannot represent XOR function – One of Minsky and Papert’s most discouraging results
– Example: perceptron with two inputs
– Only AND and OR functions can be represented by Perceptron
Discriminant function: x2 = −(w1/w2) x1 − b/w2
[Diagram: two-input perceptron; no single line can separate the XOR classes]
XOR solution
• Extending single-layer perceptron to multi-layer
perceptron by introducing hidden units
• XOR problem can be solved but we no longer have a
learning rule to train the network
• Multilayer perceptrons can do everything → how to train
them?
For example, with hard-limit neurons:
hidden neuron 1: w1,1 = 1, w1,2 = 1, b1 = −0.5
hidden neuron 2: w2,1 = 1, w2,2 = 1, b2 = −1.5
output neuron: w1 = 1, w2 = −2, b = −0.5
[Diagram: two-input, two-hidden-unit perceptron implementing XOR]
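One classical weight assignment can be checked directly in a short sketch (the OR/AND decomposition of the hidden units is an assumption of this illustration, not necessarily the slide's exact network):

```python
def xor_net(x1, x2):
    # hidden unit 1: OR detector,  step(x1 + x2 - 0.5)
    h1 = 1 if x1 + x2 - 0.5 >= 0 else 0
    # hidden unit 2: AND detector, step(x1 + x2 - 1.5)
    h2 = 1 if x1 + x2 - 1.5 >= 0 else 0
    # output: step(h1 - 2*h2 - 0.5) = "OR but not AND" = XOR
    return 1 if h1 - 2 * h2 - 0.5 >= 0 else 0

for a in (0, 1):
    for b in (0, 1):
        print(a, b, "->", xor_net(a, b))   # outputs 0, 1, 1, 0
```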
Homework
• Create a two-layer perceptron to solve XOR problem – Create a custom network
– Demonstrate solution
4. Backpropagation
4.1 Multilayer feedforward networks
4.2 Backpropagation algorithm
4.3 Working with backpropagation
4.4 Advanced algorithms
4.5 Performance of multilayer perceptrons
© 2012 Primož Potočnik NEURAL NETWORKS (4) Backpropagation #140
Introduction
• Single-layer networks have severe restrictions – Only linearly separable tasks can be solved
• Minsky and Papert (1969) – Showed the power of a two-layer feed-forward network
– But didn’t find a solution for how to train the network
• Werbos (1974) – Parker (1985), LeCun (1985), Rumelhart (1986)
– Solved the problem of training multi-layer networks by back-propagating the
output errors through hidden layers of the network
• Backpropagation learning rule
4.1 Multilayer feedforward networks
• Important class of neural networks – Input layer (only distributing inputs, without processing)
– One or more hidden layers
– Output layer
• Commonly referred to as multilayer perceptron
Properties of multilayer perceptrons
1. Neurons include nonlinear activation function – Without nonlinearity, the capacity of the network is reduced to that of a single
layer perceptron
– Nonlinearity must be smooth (differentiable everywhere), not hard-limiting as in
the original perceptron
– Often, logistic function is used:
2. One or more layers of hidden neurons – Enable learning of complex tasks by extracting features from the input patterns
3. Massive connectivity – Neurons in successive layers are fully interconnected
y = 1 / (1 + exp(-v))   ... logistic activation function
© 2012 Primož Potočnik NEURAL NETWORKS (4) Backpropagation #143
Matlab demo
• nnd11nf – Response of the feedforward network with
one hidden layer
© 2012 Primož Potočnik NEURAL NETWORKS (4) Backpropagation #144
About backpropagation
• Multilayer perceptrons can be trained by
backpropagation learning rule – Based on error-correction learning rule
– Generalization of LMS learning rule (used to train ADALINE)
• Backpropagation consists of two passes through the
network
1. Forward pass – Input is applied to the network and propagated to the output
– Synaptic weights stay frozen
– Based on the desired response, error signal is calculated
2. Backward pass – Error signal is propagated backwards from output to input
– Synaptic weights are adjusted according to the error gradient
© 2012 Primož Potočnik NEURAL NETWORKS (4) Backpropagation #145
4.2 Backpropagation algorithm (1/9)
• A set of learning samples (inputs and target outputs)
• Error signal at output layer, neuron j, learning iteration n
• Instantaneous error energy of output layer with R neurons
• Average error energy over all learning set
Learning set: {x(n), d(n)}, n = 1, ..., N   (inputs x ∈ R^M, targets d ∈ R^R)

Error signal: e_j(n) = d_j(n) - y_j(n)

Instantaneous error energy: E(n) = (1/2) Σ_{j=1..R} e_j^2(n)

Average error energy: E_av = (1/N) Σ_{n=1..N} E(n)
© 2012 Primož Potočnik NEURAL NETWORKS (4) Backpropagation #146
Backpropagation algorithm (2/9)
• Average error energy represents a cost function as a
measure of learning performance
• is a function of free network parameters – synaptic weights
– bias levels
• Learning objective is to minimize average error energy
by minimizing free network parameters
• We use an approximation: pattern-by-pattern learning
instead of epoch learning – Parameter adjustments are made for each pattern presented to the network
– Minimizing instantaneous error energy at each step instead of average error energy
© 2012 Primož Potočnik NEURAL NETWORKS (4) Backpropagation #147
Backpropagation algorithm (3/9)
• Similarly to the LMS algorithm, backpropagation applies a
correction of weights proportional to the partial derivative
• Expressing this gradient by the chain rule
Δw_ji(n) = -η ∂E(n)/∂w_ji(n)

Expressing this gradient by the chain rule:

∂E(n)/∂w_ji(n) = [∂E(n)/∂e_j(n)] · [∂e_j(n)/∂y_j(n)] · [∂y_j(n)/∂v_j(n)] · [∂v_j(n)/∂w_ji(n)]

instantaneous error energy → output error → network output → induced local field → synaptic weight

with E(n) = (1/2) Σ_j e_j^2(n) and e_j = d_j - y_j
[Diagram: y_i --w_ji--> v_j → y_j]
© 2012 Primož Potočnik NEURAL NETWORKS (4) Backpropagation #148
Backpropagation algorithm (4/9)
1. Gradient on output error:        ∂E(n)/∂e_j(n) = e_j(n)
2. Gradient on network output:      ∂e_j(n)/∂y_j(n) = -1,   since e_j(n) = d_j(n) - y_j(n)
3. Gradient on induced local field: ∂y_j(n)/∂v_j(n) = f'(v_j(n))
4. Gradient on synaptic weight:     ∂v_j(n)/∂w_ji(n) = y_i(n),   since v_j(n) = Σ_{i=0..R} w_ji(n) y_i(n)
[Diagram: y_i --w_ji--> v_j → y_j]
© 2012 Primož Potočnik NEURAL NETWORKS (4) Backpropagation #149
Backpropagation algorithm (5/9)
• Putting gradients together
• Correction of synaptic weight is defined by delta rule
∂E(n)/∂w_ji(n) = -e_j(n) f'(v_j(n)) y_i(n)

Δw_ji(n) = -η ∂E(n)/∂w_ji(n) = η δ_j(n) y_i(n)

local gradient: δ_j(n) = e_j(n) f'(v_j(n));   learning rate: η
[Diagram: y_i --w_ji--> v_j → y_j]
© 2012 Primož Potočnik NEURAL NETWORKS (4) Backpropagation #150
Backpropagation algorithm (6/9)
CASE 1 Neuron j is an output node – Output error ej(n) is available
– Computation of local gradient is straightforward
CASE 2 Neuron j is a hidden node – Hidden error is not available Credit assignment problem
– Local gradient solved by backpropagating errors through the network
CASE 1: δ_j(n) = e_j(n) f'(v_j(n))

CASE 2: δ_j(n) = -[∂E(n)/∂y_j(n)] f'(v_j(n))
where ∂E(n)/∂y_j(n), the derivative of the output error energy E on the hidden layer output y_j, is still to be determined.

For the logistic function
f(v_j(n)) = 1 / (1 + exp(-a v_j(n)))
the derivative is
f'(v_j(n)) = a exp(-a v_j(n)) / [1 + exp(-a v_j(n))]^2
© 2012 Primož Potočnik NEURAL NETWORKS (4) Backpropagation #151
Backpropagation algorithm (7/9)
CASE 2 Neuron j is a hidden node ... – Instantaneous error energy of the output layer with R neurons
– Expressing the gradient of output error energy E on hidden layer output yj
E(n) = (1/2) Σ_{k=1..R} e_k^2(n)

∂E(n)/∂y_j(n) = Σ_k e_k(n) ∂e_k(n)/∂y_j(n) = -Σ_k e_k(n) f'(v_k(n)) w_kj(n)

using e_k(n) = d_k(n) - y_k(n) = d_k(n) - f(v_k(n)) and v_k(n) = Σ_{j=0..M} w_kj(n) y_j(n)
[Diagram: y_j --w_kj--> v_k → y_k]
© 2012 Primož Potočnik NEURAL NETWORKS (4) Backpropagation #152
Backpropagation algorithm (8/9)
CASE 2 Neuron j is a hidden node ... – Finally, combining ansatz for hidden layer local gradient
– and gradient of output error energy on hidden layer output
– gives final result for hidden layer local gradient
ansatz:   δ_j(n) = -[∂E(n)/∂y_j(n)] f'(v_j(n))

gradient: ∂E(n)/∂y_j(n) = -Σ_k δ_k(n) w_kj(n)

final:    δ_j(n) = f'(v_j(n)) Σ_k δ_k(n) w_kj(n)
© 2012 Primož Potočnik NEURAL NETWORKS (4) Backpropagation #153
Backpropagation algorithm (9/9)
• Backpropagation summary
1. Local gradient of an output node: δ_k(n) = e_k(n) f'(v_k(n))
2. Local gradient of a hidden node:  δ_j(n) = f'(v_j(n)) Σ_k δ_k(n) w_kj(n)

Generalized delta rule:
Δw_ji(n) = η · δ_j(n) · y_i(n)
(weight correction = learning rate × local gradient × input of neuron j)
© 2012 Primož Potočnik NEURAL NETWORKS (4) Backpropagation #154
Two passes of computation
1. Forward pass Input is applied to the network and propagated to the output
Inputs Hidden layer output Output layer output Output error
2. Backward pass – Recursive computing of local gradients
Output local gradients Hidden layer local gradients
– Synaptic weights are adjusted according to local gradients
Forward pass:
y_j = f(Σ_i w_ji x_i(n)),   y_k = f(Σ_j w_kj y_j),   e_k(n) = d_k(n) - y_k(n)

Backward pass:
δ_k(n) = e_k(n) f'(v_k(n)),   δ_j(n) = f'(v_j(n)) Σ_k δ_k(n) w_kj(n)
Δw_kj(n) = η δ_k(n) y_j(n),   Δw_ji(n) = η δ_j(n) x_i(n)
© 2012 Primož Potočnik NEURAL NETWORKS (4) Backpropagation #155
Summary of backpropagation algorithm
1. Initialization – Pick weights and biases from a uniform distribution with zero mean and a
variance chosen so that the induced local fields fall between the linear and saturated parts of the logistic function
2. Presentation of training samples – For each sample from the epoch, perform forward pass and backward pass
3. Forward pass – Propagate training sample from network input to the output
– Calculate the error signal
4. Backward pass – Recursive computation of local gradients from output layer toward input layer
– Adaptation of synaptic weights according to generalized delta rule
5. Iteration – Iterate steps 2-4 until stopping criterion is met
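The five steps above can be sketched in a few lines of NumPy (the network size, learning rate, number of epochs and the XOR task are illustrative choices, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
logistic = lambda v: 1.0 / (1.0 + np.exp(-v))

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], float)
D = np.array([0.0, 1.0, 1.0, 0.0])                     # XOR targets

# 1. Initialization: small zero-mean uniform weights
W1 = rng.uniform(-0.5, 0.5, (3, 2)); b1 = rng.uniform(-0.5, 0.5, 3)
W2 = rng.uniform(-0.5, 0.5, 3);      b2 = rng.uniform(-0.5, 0.5)
eta = 0.5

def forward(x):
    y1 = logistic(W1 @ x + b1)
    return y1, logistic(W2 @ y1 + b2)

mse_start = np.mean([(d - forward(x)[1]) ** 2 for x, d in zip(X, D)])

for epoch in range(5000):                              # 5. iterate
    for x, d in zip(X, D):                             # 2. sequential mode
        y1, y2 = forward(x)                            # 3. forward pass
        e = d - y2
        delta2 = e * y2 * (1 - y2)                     # 4. backward pass
        delta1 = y1 * (1 - y1) * W2 * delta2           #    (logistic f' = y(1-y))
        W2 += eta * delta2 * y1;         b2 += eta * delta2
        W1 += eta * np.outer(delta1, x); b1 += eta * delta1

mse_end = np.mean([(d - forward(x)[1]) ** 2 for x, d in zip(X, D)])
print(mse_start, '->', mse_end)                        # error energy decreases
```

Pattern-by-pattern updates as in step 2; a stopping criterion on the error would normally replace the fixed epoch count.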
© 2012 Primož Potočnik NEURAL NETWORKS (4) Backpropagation #156
Matlab demo
• nnd11bc – Backpropagation calculation
© 2012 Primož Potočnik NEURAL NETWORKS (4) Backpropagation #157
Matlab demo
• nnd12sd1 – Steepest descent
© 2012 Primož Potočnik NEURAL NETWORKS (4) Backpropagation #158
• Using backpropagation learning for ADALINE – No hidden layers, one output neuron
– Linear activation function
• Backpropagation rule
• Original Delta rule
• Backpropagation is a generalization of a Delta rule
Backpropagation for ADALINE

Linear activation: f(v(n)) = v(n), so f'(v(n)) = 1

Backpropagation rule: Δw_i(n) = η δ(n) y_i(n), with δ(n) = e(n) f'(v(n)) = e(n) and y_i = x_i,
giving Δw_i(n) = η e(n) x_i(n)

Original delta rule: Δw_i(n) = η e(n) x_i(n)
[Diagram: inputs x_1, ..., x_R → v → y]
© 2012 Primož Potočnik NEURAL NETWORKS (4) Backpropagation #159
4.3 Working with backpropagation
• Efficient application of backpropagation requires some
“fine-tuning”
• Various parameters, functions and methods should be
selected – Training mode (sequential / batch)
– Activation function
– Learning rate
– Momentum
– Stopping criterion
– Heuristics for efficient backpropagation
– Methods for improving generalization
© 2012 Primož Potočnik NEURAL NETWORKS (4) Backpropagation #160
Sequential and batch training
• Learning results from many presentations of training
examples – Epoch = presentation of the entire training set
• Batch training – Weight updating after the presentation of a complete epoch
• Sequential training – Weight updating after the presentation of each training example
– Stochastic nature of learning, faster convergence
– Important practical reasons for sequential learning:
• Algorithm is easy to implement
• Provides effective solution to large and difficult problems
– Therefore sequential training is preferred training mode
– Good practice is random order of presentation of training examples
© 2012 Primož Potočnik NEURAL NETWORKS (4) Backpropagation #161
Activation function
• Derivative of activation function is required for
computation of local gradients – Only requirement for activation function: differentiability
– Commonly used: logistic function
– Derivative of logistic function
The derivative f'(v_j(n)) of the logistic function
f(v_j(n)) = 1 / (1 + exp(-a v_j(n))),   a > 0
is
f'(v_j(n)) = a exp(-a v_j(n)) / [1 + exp(-a v_j(n))]^2
and, with y_j(n) = f(v_j(n)), it simplifies to
f'(v_j(n)) = a y_j(n) [1 - y_j(n)]
Local gradient can be calculated
without explicit knowledge of the
activation function
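A quick numerical check of this identity (the gain a and the test points are arbitrary choices):

```python
import numpy as np

# Verify f'(v) = a * y * (1 - y) for the logistic function via central differences.
a = 1.0
f = lambda v: 1.0 / (1.0 + np.exp(-a * v))

v = np.linspace(-4, 4, 9)
y = f(v)
h = 1e-6
numeric = (f(v + h) - f(v - h)) / (2 * h)   # central finite difference
analytic = a * y * (1 - y)                  # derivative from the output alone
print(float(np.max(np.abs(numeric - analytic))))
```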
© 2012 Primož Potočnik NEURAL NETWORKS (4) Backpropagation #162
Other activation functions
• Using sin() activation functions
– Equivalent to traditional Fourier analysis
– Network with sin() activation functions can be trained by backpropagation
– Example: Approximating periodic function by
f(x) = Σ_{k=1..∞} c_k sin(kx + a_k)

[Figure: approximation with 8 sigmoid hidden neurons vs. 4 sin hidden neurons]
© 2012 Primož Potočnik NEURAL NETWORKS (4) Backpropagation #163
Learning rate
• Learning procedure requires – Change in the weight space to be proportional to error gradient
– True gradient descent requires infinitesimal steps
• Learning in practice – Factor of proportionality is learning rate η
– Choose a learning rate as large as possible without leading to oscillations
Δw_ji(n) = η δ_j(n) y_i(n)

[Figure: learning trajectories for η = 0.010, 0.035, 0.040]
© 2012 Primož Potočnik NEURAL NETWORKS (4) Backpropagation #164
Stopping criteria
• Generally, backpropagation cannot be shown to converge – No well defined criteria for stopping its operation
• Possible stopping criteria
1. Gradient vector
– Euclidean norm of the gradient vector reaches a sufficiently small gradient
2. Output error
– Output error is small enough
– Rate of change in the average squared error per epoch is sufficiently small
3. Generalization performance
– Generalization performance has peaked or is adequate
4. Max number of iterations
– We are out of time ...
© 2012 Primož Potočnik NEURAL NETWORKS (4) Backpropagation #165
Heuristics for efficient backpropagation (1/3)
1. Maximizing information content General rule: every training example presented to the backpropagation algorithm should
be chosen on the basis that its information content is the largest possible for the task at
hand
Simple technique: randomize the order in which examples are presented from one epoch
to the next
2. Activation function – Faster learning with antisymmetric sigmoid activation functions
– Popular choice is:
f(v) = a tanh(bv),   a = 1.72,   b = 0.67
properties: f(1) = 1, f(-1) = -1
effective gain: f'(0) = ab ≈ 1.15
second derivative is maximal near v = 1
© 2012 Primož Potočnik NEURAL NETWORKS (4) Backpropagation #166
Heuristics for efficient backpropagation (2/3)
3. Target values – Must be in the range of the activation function
– Offset is recommended, otherwise learning is driven into saturation
• Example: max(target) = 0.9 max(f)
4. Preprocessing inputs a) Normalizing mean to zero
b) Decorrelating input variables (by using principal component analysis)
c) Scaling input variables (variances should be approx. equal)
Original a) Zero mean b) Decorrelated c) Equalized variance
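The three preprocessing steps can be sketched as follows (the synthetic correlated data set is made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
# Correlated 2-D input data (made-up mixing matrix)
X = rng.normal(size=(500, 2)) @ np.array([[2.0, 0.0], [1.5, 0.5]])

Xc = X - X.mean(axis=0)                  # (a) normalize mean to zero
cov = np.cov(Xc, rowvar=False)
eigval, eigvec = np.linalg.eigh(cov)
Xd = Xc @ eigvec                         # (b) decorrelate via PCA rotation
Xe = Xd / Xd.std(axis=0)                 # (c) equalize variances

print(np.round(np.cov(Xe, rowvar=False), 2))  # covariance close to identity
```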
© 2012 Primož Potočnik NEURAL NETWORKS (4) Backpropagation #167
Heuristics for efficient backpropagation (3/3)
5. Initialization – Choice of initial weights is important for a successful network design
• Large initial values → saturation
• Small initial values → slow learning due to operation only in the saddle point near the origin
– Good choice lies between these extreme values
• Standard deviation of induced local fields should lie between the linear and saturated
parts of its sigmoid function
• tanh activation function example (a = 1.72, b = 0.67):
synaptic weights should be chosen from a uniform distribution with zero mean and
standard deviation σ = m^(-1/2), where m is the number of synaptic weights of the neuron
6. Learning from hints – Prior information about the unknown mapping can be included into the learning
process
• Initialization
• Possible invariance properties, symmetries, ...
• Choice of activation functions
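A sketch of this initialization heuristic in NumPy (the fan-in m = 100 is an arbitrary example; a zero-mean uniform distribution on [-√3·σ, √3·σ] has standard deviation σ):

```python
import numpy as np

rng = np.random.default_rng(0)
m = 100                                  # number of synaptic weights (illustrative)
sigma = m ** -0.5                        # target standard deviation m^(-1/2)
limit = np.sqrt(3.0) * sigma             # uniform on [-limit, limit] has std sigma
w = rng.uniform(-limit, limit, size=200_000)
print(round(float(w.std()), 3), round(sigma, 3))   # empirical vs. target std
```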
© 2012 Primož Potočnik NEURAL NETWORKS (4) Backpropagation #168
Generalization
• Neural network is able to generalize: – Input-output mapping computed by the network is correct for test data
• Test data were not used during training
• Test data are from the same population as training data
– Correct response even if input is slightly different than the training examples
[Figure: overfitting vs. good generalization]
© 2012 Primož Potočnik NEURAL NETWORKS (4) Backpropagation #169
Improving generalization
• Methods to improve generalization 1. Keeping the network small
2. Early stopping
3. Regularization
• Early stopping – Available data are divided into three sets:
1. Training set – used to train the network
2. Validation set – used for early stopping,
when the error starts to increase
3. Test set – used for final estimation of
network performance and for comparison
of various models
Early stopping
© 2012 Primož Potočnik NEURAL NETWORKS (4) Backpropagation #170
Regularization
• Improving generalization by regularization – Modifying performance function
– with mean sum of squares of network weights and biases
– thus obtaining new performance function
– Using this performance function, network will have smaller weights and biases,
and this forces the network response to be smoother and less likely to overfit
mse = (1/N) Σ_{n=1..N} (d(n) - y(n))^2

msw = (1/M) Σ_{m=1..M} w_m^2

msreg = γ · msw + (1 - γ) · mse
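A small numerical illustration of the regularized performance function (the targets, outputs, weights and γ = 0.5 are made-up values):

```python
import numpy as np

d = np.array([0.0, 1.0, 1.0, 0.0])   # targets (made up)
y = np.array([0.1, 0.8, 0.9, 0.2])   # network outputs (made up)
w = np.array([0.5, -1.2, 0.8])       # network weights and biases (made up)
gamma = 0.5                          # performance ratio (illustrative)

mse = np.mean((d - y) ** 2)          # mean squared error
msw = np.mean(w ** 2)                # mean squared weights
msreg = gamma * msw + (1 - gamma) * mse
print(round(float(msreg), 4))
```

Minimizing msreg penalizes large weights, which keeps the network response smooth.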
© 2012 Primož Potočnik NEURAL NETWORKS (4) Backpropagation #171
Deficiencies of backpropagation
Some properties of backpropagation do not guarantee the algorithm to be universally useful:
1. Long training process – Possibly due to non-optimum learning rate
(advanced algorithms address this problem)
2. Network paralysis – Combination of sigmoidal activation and very large weights can decrease
gradients almost to zero, so training is almost stopped
3. Local minima – Error surface of a complex network can be very complex, with many hills and
valleys
– Gradient methods can get trapped in local minima
– Solutions: probabilistic learning methods (simulated annealing, ...)
© 2012 Primož Potočnik NEURAL NETWORKS (4) Backpropagation #172
4.4 Advanced algorithms
• Basic backpropagation is slow • Adjusts the weights in the steepest descent direction (negative of the gradient) in which
the performance function is decreasing most rapidly
• It turns out that, although the function decreases most rapidly along the negative of the gradient, this does not necessarily produce the fastest convergence
1. Advanced algorithms based on heuristics – Developed from an analysis of the performance of the standard steepest descent
algorithm
• Momentum technique
• Variable learning rate backpropagation
• Resilient backpropagation
2. Numerical optimization techniques – Application of standard numerical optimization techniques to network training
• Quasi-Newton algorithms
• Conjugate Gradient algorithms
• Levenberg-Marquardt
© 2012 Primož Potočnik NEURAL NETWORKS (4) Backpropagation #173
Momentum
• A simple method of increasing learning rate yet avoiding
the danger of instability
• Modified delta rule by adding momentum term
– Momentum constant
– Accelerates backpropagation in steady downhill directions
Δw_ji(n) = η δ_j(n) y_i(n) + α Δw_ji(n-1)
momentum constant: 0 ≤ α < 1

[Figure: small learning rate; large learning rate (oscillations); learning with momentum]
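A sketch of the momentum-augmented update on a one-dimensional quadratic error surface E(w) = w² (η and α are illustrative values):

```python
# Gradient descent with a momentum term; grad E = 2w plays the role of the
# backpropagated gradient. The momentum term accumulates steps in steady
# downhill directions.
eta, alpha = 0.1, 0.9
w, dw = 5.0, 0.0                      # start far from the minimum at w = 0
for n in range(200):
    grad = 2 * w                      # dE/dw
    dw = -eta * grad + alpha * dw     # delta rule + momentum term
    w += dw
print(abs(w))                          # close to the minimum
```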
© 2012 Primož Potočnik NEURAL NETWORKS (4) Backpropagation #174
Variable learning rate η(t)
• Another method of manipulating learning rate and
momentum to accelerate backpropagation
1. If error decreases after weight update:
• weight update is accepted
• learning rate is increased ............................................. η(t+1) = ση(t), σ >1
• if momentum has been previously reset to 0, it is set to its original value
2. If error increases less than ζ after weight update:
• weight update is accepted
• learning rate is not changed ......................................... η(t+1) = η(t),
• if momentum has been previously reset to 0, it is set to its original value
3. If error increases more than ζ after weight update:
• weight update is discarded
• learning rate is decreased ............................................ η(t+1) = ρη(t), 0<ρ<1
• momentum is reset to 0
Possible parameter values: ζ = 4%, ρ = 0.7, σ = 1.05
© 2012 Primož Potočnik NEURAL NETWORKS (4) Backpropagation #175
Resilient backpropagation
• Slope of sigmoid functions approaches zero as the input gets large – This causes a problem when you use steepest descent to train a network
– Gradient can have a very small magnitude, so changes in weights are small, even though the weights are far from their optimal values
• Resilient backpropagation – Eliminates these harmful effects of the magnitudes of the partial derivatives
– Only sign of the derivative is used to determine the direction of weight update, size of the weight change is determined by a separate update value
– Resilient backpropagation rules:
1. Update value for each weight and bias is increased by a factor δinc if derivative of the performance function with respect to that weight has the same sign for two successive iterations
2. Update value is decreased by a factor δdec if derivative with respect to that weight changes sign from the previous iteration
3. If derivative is zero, then the update value remains the same
4. If weights are oscillating, the weight change is reduced
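A simplified sketch of these sign-based rules on a quadratic bowl (the update factors δinc = 1.2 and δdec = 0.5 are typical choices; production Rprop implementations also bound the step sizes):

```python
import numpy as np

d_inc, d_dec = 1.2, 0.5            # update factors (illustrative)
w = np.array([4.0, -3.0])          # weights; minimum of E at (0, 0)
step = np.array([0.1, 0.1])        # per-weight update values
prev_grad = np.zeros(2)

for n in range(100):
    grad = 2 * w                   # gradient of E(w) = w1^2 + w2^2
    same = grad * prev_grad        # >0: same sign, <0: sign changed
    step = np.where(same > 0, step * d_inc,          # rule 1: grow
           np.where(same < 0, step * d_dec, step))   # rule 2: shrink
    w = w - np.sign(grad) * step   # only the SIGN of the gradient is used
    prev_grad = grad

print(np.abs(w).max(), step)       # w ends up oscillating near the minimum
```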
© 2012 Primož Potočnik NEURAL NETWORKS (4) Backpropagation #176
Numerical optimization (1/3)
• Supervised learning as an optimization problem – Error surface of a multilayer perceptron, expressed by instantaneous error
energy E(n), is a highly nonlinear function of synaptic weight vector w(n)
E(n) = E(w(n))

[Figure: error surface E(w1, w2) over the weight plane]
© 2012 Primož Potočnik NEURAL NETWORKS (4) Backpropagation #177
Numerical optimization (2/3)
• Expanding the error energy by a Taylor series
E(n) = E(w(n))

E(w(n) + Δw(n)) ≈ E(w(n)) + g^T(n) Δw(n) + (1/2) Δw^T(n) H(n) Δw(n)

local gradient:  g(n) = ∂E(w)/∂w    evaluated at w = w(n)
Hessian matrix:  H(n) = ∂²E(w)/∂w²  evaluated at w = w(n)
© 2012 Primož Potočnik NEURAL NETWORKS (4) Backpropagation #178
Numerical optimization (3/3)
• Steepest descent method (backpropagation) – Weight adjustment proportional to the gradient
– Simple implementation, but slow convergence
• Significant improvement by using higher order information – Adding a momentum term: a crude approximation to using second-order information
about the error surface
– Quadratic approximation of the error surface: the essence of Newton’s method
– H-1 is the inverse of Hessian matrix
gradient descent:  Δw(n) = -η g(n)
Newton’s method:   Δw(n) = -H^(-1)(n) g(n)
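On a quadratic error surface Newton's method reaches the minimum in a single step; a small NumPy check (the Hessian and minimizer are made-up values):

```python
import numpy as np

H = np.array([[4.0, 1.0], [1.0, 2.0]])   # Hessian (made up, positive definite)
w_min = np.array([1.0, -2.0])            # minimizer of E(w) = 0.5 (w-w_min)' H (w-w_min)

w = np.array([5.0, 5.0])                 # starting point
g = H @ (w - w_min)                      # gradient at w
w_newton = w - np.linalg.solve(H, g)     # Newton step: -H^{-1} g
print(w_newton)                           # equals w_min (up to round-off)
```

Gradient descent from the same starting point would need many small steps to get equally close.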
© 2012 Primož Potočnik NEURAL NETWORKS (4) Backpropagation #179
Quasi-Newton algorithms
• Problems with the calculation of Hessian matrix – Inverse Hessian H-1 is required, which is computationally expensive
– Hessian has to be nonsingular which is not guaranteed
– Hessian for neural network can be rank deficient
– No convergence guarantee for non-quadratic error surface
• Quasi-Newton method – Only requires calculation of the gradient vector g(n)
– The method estimates the inverse Hessian directly without matrix inversion
– Quasi-Newton variants:
• Davidon-Fletcher-Powell algorithm
• Broyden-Fletcher-Goldfarb-Shanno algorithm ... best form of Quasi-Newton algorithm!
• Application for neural networks – The method is fast for small neural networks
© 2012 Primož Potočnik NEURAL NETWORKS (4) Backpropagation #180
Conjugate gradient algorithms
• Conjugate gradient algorithms – Second order methods, avoid computational problems with the inverse Hessian
– Search is performed along conjugate directions, which produces generally faster
convergence than steepest descent directions
1. In most of the conjugate gradient algorithms, the step size is adjusted at each iteration
2. A search is made along the conjugate gradient direction to determine the step size that
minimizes the performance function along that line
– Many variants of conjugate gradient algorithms
• Fletcher-Reeves Update
• Polak-Ribiére Update
• Powell-Beale Restarts
• Scaled Conjugate Gradient
• Application for neural networks – Perhaps the only method suitable for large scale problems (hundreds or
thousands of adjustable parameters) well suited for multilayer perceptrons
gradient descent
conjugate gradient
© 2012 Primož Potočnik NEURAL NETWORKS (4) Backpropagation #181
Levenberg-Marquardt algorithm
• Levenberg-Marquardt algorithm (LM) – Like the quasi-Newton methods, LM algorithm was designed to approach
second-order training speed without having to compute the Hessian matrix
– When the performance function has the form of a sum of squares (typical in
neural network training), then the Hessian matrix H can be approximated by
Jacobian matrix J
– where Jacobian matrix contains first derivatives of the network errors with
respect to the weights
– Jacobian can be computed through a standard backpropagation technique that is
much less complex than computing the Hessian matrix
• Application for neural networks – Algorithm appears to be the fastest method for training moderate-sized
feedforward neural networks (up to several hundred weights)
H ≈ J^T J
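One Levenberg-Marquardt step sketched in NumPy for a sum-of-squares cost with a linear residual model (the Jacobian, true weights and damping term μ are made-up; for a linear residual the step is nearly exact):

```python
import numpy as np

rng = np.random.default_rng(0)
J = rng.normal(size=(20, 3))             # Jacobian de/dw (made-up linear model)
w_true = np.array([1.0, -2.0, 0.5])
w = np.zeros(3)                          # current weights
e = J @ (w - w_true)                     # residuals of the linear model
mu = 1e-3                                # LM damping term (illustrative)

H = J.T @ J                              # Hessian approximation from the Jacobian
dw = -np.linalg.solve(H + mu * np.eye(3), J.T @ e)   # damped Gauss-Newton step
w_new = w + dw
print(np.linalg.norm(J @ (w_new - w_true)))          # residual norm after one step
```

Large μ makes the step behave like small-step gradient descent; small μ approaches Newton's method.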
© 2012 Primož Potočnik NEURAL NETWORKS (4) Backpropagation #182
Advanced algorithms summary
• Practical hints (Matlab related) – Variable learning rate algorithm is usually much slower than the other
methods
– Resilient backpropagation method is very well suited to pattern
recognition problems
– Function approximation problems, networks with up to a few hundred
weights: Levenberg-Marquardt algorithm will have the fastest
convergence and very accurate training
– Conjugate gradient algorithms perform well over a wide variety of
problems, particularly for networks with a large number of weights
(modest memory requirements)
Training algorithms in MATLAB
© 2012 Primož Potočnik NEURAL NETWORKS (4) Backpropagation #183
© 2012 Primož Potočnik NEURAL NETWORKS (4) Backpropagation #184
4.5 Performance of multilayer perceptrons
• Approximation error is influenced by
– Learning algorithm used ... (discussed in the last section)
• This determines how well the error on the training set is minimized
– Number and distribution of learning samples
• This determines how well the training samples represent the actual function
– Number of hidden units
• This determines the expressive power of the network. For smooth functions
only a few hidden units are needed; for wildly fluctuating functions
more hidden units will be needed
© 2012 Primož Potočnik NEURAL NETWORKS (4) Backpropagation #185
Number of learning samples
• Function approximation example y=f(x)
– Learning set with 4 samples has small training error but gives very poor
generalization
– Learning set with 20 samples has higher training error but generalizes well
– Low training error is no guarantee for a good network performance!
4 learning samples 20 learning samples
© 2012 Primož Potočnik NEURAL NETWORKS (4) Backpropagation #186
Number of hidden units
• Function approximation example y=f(x)
– A large number of hidden units leads to a small training error but not necessarily
to a small test error
– Adding hidden units always leads to a reduction of the training error
– However, adding hidden units will first lead to a reduction of test error but then to
an increase of test error ... (peaking effect; early stopping can be applied)
5 hidden units 20 hidden units
© 2012 Primož Potočnik NEURAL NETWORKS (4) Backpropagation #187
Size effect summary
[Figure: error rate vs. number of training samples (left) and vs. number of hidden units (right); training-set error keeps decreasing, while test-set error reaches a minimum at the optimal number of hidden neurons]
© 2012 Primož Potočnik NEURAL NETWORKS (4) Backpropagation #188
Matlab demo
• nnd11fa – Function approximation, variable number of
hidden units
© 2012 Primož Potočnik NEURAL NETWORKS (4) Backpropagation #189
Matlab demo
• nnd11gn – Generalization, variable number of hidden
units
© 2012 Primož Potočnik NEURAL NETWORKS (4) Backpropagation #190
© 2012 Primož Potočnik NEURAL NETWORKS (5) Dynamic networks #191
5. Dynamic Networks
5.1 Historical dynamic networks
5.2 Focused time-delay neural network
5.3 Distributed time-delay neural network
5.4 Layer recurrent network
5.5 NARX network
5.6 Computational power of dynamic networks
5.7 Learning algorithms
5.8 System identification
5.9 Model reference adaptive control
© 2012 Primož Potočnik NEURAL NETWORKS (5) Dynamic networks #192
Introduction
• Time – An essential ingredient of the learning process
– Important for many practical tasks: speech, vision, signal processing, control
• Many applications require temporal processing – Time series prediction
– Noise cancelation
– Adaptive control
– System identification
– ...
– Linear systems: well-developed theories exist
– Nonlinear systems: neural networks have the potential to solve such problems
© 2012 Primož Potočnik NEURAL NETWORKS (5) Dynamic networks #193
Introduction
• How can we build time into the operation of neural
networks? – Extending static neural networks into dynamic neural networks
networks become responsive to the temporal structure of input signals
– Networks become dynamic by adding
TEMPORAL MEMORY and/or FEEDBACK
Feedback loop
© 2012 Primož Potočnik NEURAL NETWORKS (5) Dynamic networks #194
Static / dynamic networks
Neural network categories
1. Static networks → structural pattern recognition – Feedforward networks
– No feedback elements, no delays
– Output is calculated directly from the input through feedforward connections
2. Dynamic networks → temporal pattern recognition – Output depends on
• current input to the network
• also on previous inputs
• previous network output
• previous network states
– Dynamic networks can be divided into two categories
1. Networks that have only feedforward connections
2. Networks with feedback or recurrent connections
A need for short-term memory and feedback
© 2012 Primož Potočnik NEURAL NETWORKS (5) Dynamic networks #195
Memory
• Memory – Long-term memory
• Acquired through supervised learning and stored into synaptic weights
– Short-term memory
• Temporal memory, useful to capture the temporal dimension
• Implemented as time delays at various parts of the network
Long-term memory
© 2012 Primož Potočnik NEURAL NETWORKS (5) Dynamic networks #196
Tapped delay line
• The simplest form of short-term memory – Already mentioned with linear adaptive filters
– Most commonly used for dynamic networks
– Tapped delay line (TDL) consists of N unit delay operators
– Output of TDL is an N+1 dimensional vector, composed from current and past
inputs
TDL(x(n)) = [x(n), x(n-1), ..., x(n-N)]
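The TDL output can be sketched directly from the definition:

```python
import numpy as np

def tdl(x, n, N):
    """Tapped delay line: the N+1 dimensional vector of current and past inputs."""
    return np.array([x[n - k] for k in range(N + 1)])

x = np.arange(10.0)            # x(n) = n, for illustration
print(tdl(x, n=5, N=3))        # → [5. 4. 3. 2.]
```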
© 2012 Primož Potočnik NEURAL NETWORKS (5) Dynamic networks #197
5.1 Historical dynamic networks
Hopfield (1982)
Jordan (1986)
Elman (1990)
© 2012 Primož Potočnik NEURAL NETWORKS (5) Dynamic networks #198
Hopfield network
• Hopfield network (Hopfield, 1982) – Network consists of N interconnected neurons which update their activation
values asynchronously and independently of other neurons
– All neurons are both input and output neurons
– Activation values are binary (-1, +1)
– Multiple-loop feedback system
interesting to study stability of the system
– Primary applications
• Associative memory
• Solving optimization problems
– MATLAB example: demohop1.m
© 2012 Primož Potočnik NEURAL NETWORKS (5) Dynamic networks #199
Jordan network
• Jordan network (Jordan, 1986) – Network outputs are fed back as extra inputs (state units)
– Each state unit is fed with one network output
– The connections from output to state units are fixed (+1)
– Learning takes place only in the
connections between input to hidden
units as well as hidden to output units
– Standard backpropagation learning rule
can be applied to train the network
© 2012 Primož Potočnik NEURAL NETWORKS (5) Dynamic networks #200
context
units
Elman network
• Elman network (Elman, 1990) – Similar to Jordan network, with the following differences:
1. Hidden units are fed back (instead of output units)
2. Context units have no self-connections
© 2012 Primož Potočnik NEURAL NETWORKS (5) Dynamic networks #201
5.2 Focused time-delay neural network
• The most straightforward dynamic network
feedforward network + tapped delay line at input – Temporal dynamics only at the input layer of a static network
– Nonlinear extension of linear adaptive filters
– Backpropagation training can be used
– The structure is suitable for time-series prediction
© 2012 Primož Potočnik NEURAL NETWORKS (5) Dynamic networks #202
• Input delays = [0 6 12] Inputs {x(t), x(t-6), x(t-12)}
• Prediction horizon = 1 Output x(t+1)
[Figure: tapped delay line with input delays [12 6 0] and prediction horizon 1, separating the known world from the unknown world]
TDL & prediction horizon
© 2012 Primož Potočnik NEURAL NETWORKS (5) Dynamic networks #203
Online prediction application
Past Now Future
© 2012 Primož Potočnik NEURAL NETWORKS (5) Dynamic networks #204
MATLAB example (1/3)
• Application of focused time-delay neural network for
prediction of the chaotic Mackey-Glass time series
• Objective – Design Focused time-delay neural network for recursive one-step-ahead predictor
– Fixed network parameters
• Number of hidden layers: 1
• Hidden layer activation func.: Logistic
• Output layer activation func.: Linear
– Variable network parameters
• Input delays = ?
• Hidden layer neurons = ?
dy(t)/dt = -b y(t) + c y(t-τ) / (1 + y(t-τ)^10),   b = 0.1,   c = 0.2,   τ = 17
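A simple Euler integration of this delay equation (the step size dt and the constant initial history are illustrative choices):

```python
import numpy as np

# Euler integration of the Mackey-Glass delay differential equation
#   dy/dt = -b*y(t) + c*y(t-tau) / (1 + y(t-tau)**10)
b, c, tau, dt = 0.1, 0.2, 17.0, 0.1
lag = int(tau / dt)            # delay expressed in integration steps
steps = 5000

y = np.empty(steps + lag)
y[:lag + 1] = 1.2              # constant initial history (made-up value)
for n in range(lag, steps + lag - 1):
    y_tau = y[n - lag]         # delayed value y(t - tau)
    y[n + 1] = y[n] + dt * (-b * y[n] + c * y_tau / (1 + y_tau ** 10))

print(y[-5:])                   # bounded, irregularly oscillating trajectory
```

The resulting series is the usual benchmark input for the one-step-ahead predictor described above.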
© 2012 Primož Potočnik NEURAL NETWORKS (5) Dynamic networks #205
MATLAB example (2/3)
• Samples – 500 training samples
– 500 validation samples, recursive prediction
• Results
(A)
© 2012 Primož Potočnik NEURAL NETWORKS (5) Dynamic networks #206
MATLAB example (3/3)
(B)
(C)
© 2012 Primož Potočnik NEURAL NETWORKS (5) Dynamic networks #207
5.3 Distributed time-delay neural network
• Tapped delay lines distributed throughout the network – Distributed temporal dynamics: the ability to handle non-stationary environments
– Backpropagation training cannot be used any more
the need for temporal backpropagation
– Possible applications:
phoneme recognition, recognition of various frequency contents in signals
© 2012 Primož Potočnik NEURAL NETWORKS (5) Dynamic networks #208
Temporal backpropagation
• Backpropagation algorithm – Suitable for static networks and focused time-delay neural networks
• Temporal backpropagation – Supervised learning algorithm
– Extension of backpropagation
– Required for distributed time delay neural networks
– Computationally demanding
• Which form of backpropagation to use? – Based on the nature of the temporal processing task
1. STATIONARY ENVIRONMENT
Standard backpropagation + Focused time-delay neural networks
2. NON-STATIONARY ENVIRONMENT
Temporal backpropagation + Distributed time delay neural networks
© 2012 Primož Potočnik NEURAL NETWORKS (5) Dynamic networks #209
Example (1/2)
• Wan (1994): Time series prediction by using a
connectionist network with internal delay lines – Winner of the “Santa Fe Institute Time-Series Competition”, USA (1992)
– Task: Nonlinear prediction of a nonstationary time series exhibiting chaotic
pulsations of NH3 laser
© 2012 Primož Potočnik NEURAL NETWORKS (5) Dynamic networks #210
Example (2/2)
• Prediction results
© 2012 Primož Potočnik NEURAL NETWORKS (5) Dynamic networks #211
5.4 Layer recurrent network
• Layer recurrent network = Recurrent multilayer perceptron – One or more hidden layers
– Each computation layer has feedback link
© 2012 Primož Potočnik NEURAL NETWORKS (5) Dynamic networks #212
Layer recurrent network structure
• Feedback loop with single delay for hidden layer – Can be trained by backpropagation
Elman (1990)
© 2012 Primož Potočnik NEURAL NETWORKS (5) Dynamic networks #213
Example (1/3)
• Phoneme detection problem – Recognition of various frequency components
• Layer recurrent network – 1 hidden layer
– 8 neurons
– 5 delays
© 2012 Primož Potočnik NEURAL NETWORKS (5) Dynamic networks #214
Example (2/3)
• Network training – Successful recognition of two “phonemes”
© 2012 Primož Potočnik NEURAL NETWORKS (5) Dynamic networks #215
Example (3/3)
• Network testing – Unreliable generalization, works only on trained “phonemes”
© 2012 Primož Potočnik NEURAL NETWORKS (5) Dynamic networks #216
5.5 NARX network
• Networks discussed so far – Focused or distributed time delays
– Feedback only localized to specific network layers
• NARX = Nonlinear AutoRegressive Network with
EXogenous Inputs – Recurrent network with global feedback
– Feedback over several layers of the network
– Based on linear ARX model
– Defining equation for NARX model
• Output y is a nonlinear function of past outputs and past inputs
• Nonlinear function f can be implemented by a neural network
© 2012 Primož Potočnik NEURAL NETWORKS (5) Dynamic networks #217
NARX structure
• NARX network with global feedback
• Possible application areas • Nonlinear prediction and modelling
• Adaptive equalization of communication channels
• Speech processing
• Automobile engine diagnostics
© 2012 Primož Potočnik NEURAL NETWORKS (5) Dynamic networks #218
NARX training considerations
– NARX output is an estimate of the output of some nonlinear dynamic system
– Output is fed back to the input of the feedforward neural network → parallel architecture
– True output is available during training → possible to create a series-parallel architecture
• True output is used instead of feeding back the estimated output
– Advantages of series-parallel architecture for training
1. Training input to the feedforward network is more accurate → improved training accuracy
2. Resulting network is purely feedforward → static backpropagation can be used
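The series-parallel idea can be sketched as follows, with a linear toy model standing in for the feedforward network (Python/NumPy; system, signals, and coefficients are invented for illustration): training on true past outputs is a static problem, while application feeds back the estimates.

```python
import numpy as np

# Series-parallel vs. parallel NARX, sketched with a linear toy "network"
# (a real NARX would replace the linear fit with a feedforward neural net).
rng = np.random.default_rng(0)
u = rng.standard_normal(200)
y = np.zeros(200)
for n in range(1, 200):                 # true system: y(n) = 0.5 y(n-1) + u(n-1)
    y[n] = 0.5 * y[n - 1] + u[n - 1]

# Series-parallel training: TRUE past outputs as inputs -> static least squares
X = np.column_stack([y[:-1], u[:-1]])   # regressors [y(n-1), u(n-1)]
theta, *_ = np.linalg.lstsq(X, y[1:], rcond=None)

# Parallel (recursive) application: feed back the ESTIMATED output
y_hat = np.zeros(200)
for n in range(1, 200):
    y_hat[n] = theta[0] * y_hat[n - 1] + theta[1] * u[n - 1]
```

With noise-free data the fit is exact, so the parallel recursion reproduces the true series; with noise or model mismatch, recursive prediction errors can accumulate, which is why training in the series-parallel configuration is preferred.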
© 2012 Primož Potočnik NEURAL NETWORKS (5) Dynamic networks #219
Example (1/5)
• Problem: Magnetic levitation
• Objective – to control the position of a magnet suspended above an
electromagnet, where the magnet can only move in the
vertical direction
• Equation of motion
y(t) = distance of the magnet above the electromagnet
i(t) = current flowing in the electromagnet
M = mass of the magnet
g = gravitational constant
β = viscous friction coefficient (determined by the material in which the magnet moves)
α = field strength constant, determined by the number of turns of wire on the
electromagnet and the strength of the magnet
© 2012 Primož Potočnik NEURAL NETWORKS (5) Dynamic networks #220
Example (2/5)
• Data – Sampling interval: 0.01 sec
– Input: current i(t)
– Output: magnet position y(t)
• NARX network structure – 3 hidden neurons
– 5 input delays
– 5 global feedback delays
© 2012 Primož Potočnik NEURAL NETWORKS (5) Dynamic networks #221
Example (3/5)
• Series-parallel training results for NARX network
© 2012 Primož Potočnik NEURAL NETWORKS (5) Dynamic networks #222
Example (4/5)
• Parallel recursive prediction (1000 steps)
© 2012 Primož Potočnik NEURAL NETWORKS (5) Dynamic networks #223
Example (5/5)
• Possible learning results: unstable learning, local minima
Case A: OK Case B: Unstable Case C: Local minimum
© 2012 Primož Potočnik NEURAL NETWORKS (5) Dynamic networks #224
5.6 Computational power of dynamic networks
• Fully and partially connected recurrent networks
© 2012 Primož Potočnik NEURAL NETWORKS (5) Dynamic networks #225
Theorems
Theorem I (Siegelmann & Sontag, 1991) – All Turing machines may be simulated by fully connected recurrent networks built
on neurons with sigmoid activation functions.
(A Turing machine is a theoretical abstraction that is functionally as powerful as
any computer, see http://aturingmachine.com )
Theorem II (Siegelmann et al., 1997) – NARX networks with one layer of hidden neurons with BOSS* activation
functions and a linear output neuron can simulate fully connected recurrent
networks with BOSS* activation functions, except for a linear slowdown
Corollary to Theorem I and II – NARX networks with one hidden layer of neurons with BOSS* activation
functions and a linear output neuron are Turing equivalent.
* BOSS = bounded, one-sided saturated function
© 2012 Primož Potočnik NEURAL NETWORKS (5) Dynamic networks #226
5.7 Learning algorithms
Two modes of training for recurrent networks
1. Epochwise training – For a given epoch, the recurrent network starts running from some initial state
until it reaches a new state, at which point the training is stopped and the
network is reset to initial state for the next epoch
– METHOD: Backpropagation through time
2. Continuous training – Suitable if no reset states are available or online learning is required
– Network learns while it is performing signal processing
– The learning process never stops
– METHOD: Real-time recurrent learning
© 2012 Primož Potočnik NEURAL NETWORKS (5) Dynamic networks #227
Backpropagation through time
• Extension of the standard backpropagation algorithm – Derived by unfolding the temporal operation of the network into a layered
feedforward network
– The topology grows by one layer at every time step
– EXAMPLE: unfolding the temporal operation of a 2-neuron recurrent network
© 2012 Primož Potočnik NEURAL NETWORKS (5) Dynamic networks #228
Backpropagation through time example (1/2)
• Nguyen (1989): The truck backer-upper
© 2012 Primož Potočnik NEURAL NETWORKS (5) Dynamic networks #229
... example (2/2)
Training
Generalization
© 2012 Primož Potočnik NEURAL NETWORKS (5) Dynamic networks #230
5.8 System Identification
• System identification = experimental approach to
modeling the process with unknown parameters – STEP 1: Experimental planning
– STEP 2: Selection of a model structure
– STEP 3: Parameter estimation
– STEP 4: Model validation
• Unknown nonlinear dynamical process → dynamic
neural networks can be used as identification model. Two basic identification approaches:
1. System identification using state-space model
2. System identification using input-output model
© 2012 Primož Potočnik NEURAL NETWORKS (5) Dynamic networks #231
System identification using state-space model
• “State” – A vital role in the mathematical formulation of a dynamical system
– State of a dynamical system defined as a “set of quantities that summarizes all
the information about the past behavior of the system that is needed to uniquely
describe its future behavior, except for the purely external effects arising from the
applied input.”
• Plant description by a state-space model
– State: x(n+1) = f( x(n), u(n) )
– Output: y(n) = h( x(n) )
– f, h : unknown nonlinear vector functions
– Two dynamic neural networks can be used to approximate f and h
© 2012 Primož Potočnik NEURAL NETWORKS (5) Dynamic networks #232
State-space solution to the identification problem
– Both networks are trained by gradient descent, minimizing error signals e_I and e_II
Neural network (II) – Identification of plant output
– Actual state x(n) is used as input rather
than the predicted output
Neural network (I) – Identification of plant state
– State must be physically accessible!
© 2012 Primož Potočnik NEURAL NETWORKS (5) Dynamic networks #233
System identification using input-output model
• If system state is not accessible →
identification by input-output model
• Plant description by an input-output model
– f is unknown nonlinear vector function
– Input-output formulation is equivalent to NARX formulation
– NARX neural network can be used to approximate f
– q past inputs and outputs should be available
ŷ(n+1) = f( y(n), …, y(n−q+1), u(n), …, u(n−q+1) )
© 2012 Primož Potočnik NEURAL NETWORKS (5) Dynamic networks #234
Input-output solution to the identification problem
– NARX neural network can be used
as a dynamic identification model
– Series-parallel learning
• system output is used as feedback,
not the predicted output
– Parallel architecture for application
© 2012 Primož Potočnik NEURAL NETWORKS (5) Dynamic networks #235
5.9 Model-reference adaptive control
• Dynamic networks are important for feedback control
systems. MULTIPLE PROBLEMS:
– Nonlinear coupling of plant state with control signals
– Presence of unmeasured or random disturbances
– Possibility of a nonunique plant inverse
– Presence of unobservable plant states
• MRAC = Model reference adaptive control – Well suited for the use of neural networks
– Possible control methods:
• Direct MRAC
• Indirect MRAC
© 2012 Primož Potočnik NEURAL NETWORKS (5) Dynamic networks #236
MRAC using direct control
• Unknown plant dynamics → adaptive learning
• Controller + plant = closed loop feedback system – Controller and plant form an externally recurrent network
– How to get plant gradients? → indirect control
– Controller: u(n) = f_c( x_c(n), y_p(n), r(n), w )
– Reference model: d(n+1) = g( x_r(n), r(n) )
© 2012 Primož Potočnik NEURAL NETWORKS (5) Dynamic networks #237
MRAC using indirect control
• Two step procedure to train the controller 1. Identification of the plant (identification model)
2. Using plant model to obtain dynamic
derivatives to train the controller
– Controller and plant model form an
externally recurrent network
© 2012 Primož Potočnik NEURAL NETWORKS (5) Dynamic networks #238
Summary
Layer recurrent network
Focused time-delay neural network Distributed time-delay neural network
NARX network
© 2012 Primož Potočnik NEURAL NETWORKS (6) Radial Basis Function Networks #239
6. Radial Basis Function Networks
6.1 RBFN structure
6.2 Exact interpolation
6.3 Radial basis functions
6.4 Radial basis function networks
6.5 RBFN training
6.6 RBFN for classification
6.7 Comparison with multilayer perceptron
6.8 Probabilistic networks
6.9 Generalized regression networks
© 2012 Primož Potočnik NEURAL NETWORKS (6) Radial Basis Function Networks #240
Introduction
• RBFN = Radial Basis Function Network
• New class of neural networks – Multilayer perceptrons: output is a nonlinear function of the scalar product of
input vector and weight vector
– RBFN: activation of a hidden unit is determined by the distance between input
vector and prototype vector
• RBFN theory forms a link between – Function approximation
– Regularization
– Noisy interpolation
– Density estimation
– Optimal classification theory
© 2012 Primož Potočnik NEURAL NETWORKS (6) Radial Basis Function Networks #241
6.1 RBFN structure
Feedforward network with two computation layers
1. Hidden layer implements a set of radial basis functions (e.g. Gaussian functions)
2. Output layer implements linear summation functions (as in MLP)
© 2012 Primož Potočnik NEURAL NETWORKS (6) Radial Basis Function Networks #242
RBFN properties
• Two-stage training procedures
1. Training of hidden layer weights
2. Training of output layer weights
Training/learning is very fast
• RBFN provides excellent interpolation
© 2012 Primož Potočnik NEURAL NETWORKS (6) Radial Basis Function Networks #243
6.2 Exact interpolation (1/3)
• Exact interpolation task = mapping of every input vector exactly
onto the corresponding output vector in the multi-dimensional space
• The goal is to find a function that will map input vectors x into target
vectors t
• Radial basis function approach (Powell, 1987) introduces a set of N
basis functions, one for each data point xp, in the form
• Basis functions Φ are nonlinear, and depend on the distance
measure between input x and stored prototype xp
Φ( ‖x − xᵖ‖ ),   where ‖x − xᵖ‖² = (x₁ − x₁ᵖ)² + … + (x_M − x_Mᵖ)²
© 2012 Primož Potočnik NEURAL NETWORKS (6) Radial Basis Function Networks #244
Exact interpolation (2/3)
• Output is a linear combination of basis functions
• Goal is to find the weights wp such that the function goes through all
data points
• We introduce the matrix formulation
• Provided that the inverse of Φ exists, the weights are obtained by any
standard matrix inversion technique
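A minimal Python/NumPy sketch of exact interpolation with Gaussian basis functions (data points and width σ are invented for illustration):

```python
import numpy as np

# Exact interpolation: one Gaussian basis function per data point,
# weights from solving Phi w = t.
x = np.array([0.0, 1.0, 2.0, 3.0])    # training inputs x^p
t = np.array([0.0, 1.0, 0.0, -1.0])   # target values t^p
sigma = 1.0                           # hand-picked width (assumption)

Phi = np.exp(-((x[:, None] - x[None, :]) ** 2) / (2 * sigma ** 2))
w = np.linalg.solve(Phi, t)           # Phi is non-singular for distinct points

def h(xq):
    """Interpolating function h(x) = sum_p w_p Phi(|x - x^p|)."""
    return np.exp(-((xq - x) ** 2) / (2 * sigma ** 2)) @ w
```

Evaluating h at each training input returns the corresponding target exactly, which is the defining property of exact interpolation.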
© 2012 Primož Potočnik NEURAL NETWORKS (6) Radial Basis Function Networks #245
Exact interpolation (3/3)
• For a large class of functions Φ, the matrix Φ is indeed non-singular,
provided that the data points are distinct
• The solution represents a continuous differentiable surface
that passes exactly through each data point
• Both theoretical and empirical studies confirm (in the context of
exact interpolation) that many properties of the interpolating function
are relatively insensitive to the precise form of the basis functions
• Various forms of basis functions can be used
Φₚ(x) = Φ(r),   r = ‖x − xᵖ‖
© 2012 Primož Potočnik NEURAL NETWORKS (6) Radial Basis Function Networks #246
6.3 Radial basis functions (1/2)
1. Gaussian
2. Multi-Quadratic
3. Generalized Multi-Quadratic
4. Inverse Multi-Quadratic
© 2012 Primož Potočnik NEURAL NETWORKS (6) Radial Basis Function Networks #247
Radial basis functions (2/2)
5. Generalized Inverse Multi-Quadratic
6. Thin Plate Spline
7. Cubic
8. Linear
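The eight basis functions listed above are commonly given as follows (the original formulas appeared only in the slide figures; these are the standard textbook forms, with r the distance to the centre and σ, β shape parameters):

```latex
\begin{aligned}
&\Phi(r)=\exp\!\left(-\tfrac{r^2}{2\sigma^2}\right) &&\text{(Gaussian)}\\
&\Phi(r)=\sqrt{r^2+\sigma^2} &&\text{(multi-quadratic)}\\
&\Phi(r)=\left(r^2+\sigma^2\right)^{\beta},\ 0<\beta<1 &&\text{(generalized multi-quadratic)}\\
&\Phi(r)=\frac{1}{\sqrt{r^2+\sigma^2}} &&\text{(inverse multi-quadratic)}\\
&\Phi(r)=\frac{1}{\left(r^2+\sigma^2\right)^{\beta}},\ \beta>0 &&\text{(generalized inverse multi-quadratic)}\\
&\Phi(r)=r^2\ln r &&\text{(thin plate spline)}\\
&\Phi(r)=r^3 &&\text{(cubic)}\\
&\Phi(r)=r &&\text{(linear)}
\end{aligned}
```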
© 2012 Primož Potočnik NEURAL NETWORKS (6) Radial Basis Function Networks #248
r = ‖x − xᵖ‖ = √( (x₁ − x₁ᵖ)² + … + (x_M − x_Mᵖ)² )
© 2012 Primož Potočnik NEURAL NETWORKS (6) Radial Basis Function Networks #249
Properties of radial basis functions
• Gaussian and Inverse Multi-Quadric basis functions are localised
• The localised property is not strictly necessary → all the other functions
(Multi-Quadratic, Cubic, Linear, ...) are not localised
• Note that even the linear function Φ(r) = r is still non-linear
in the components of x → in one dimension, this leads to a
piecewise-linear interpolating function which performs the simplest
form of exact interpolation
• For neural network mappings, there are good reasons for preferring
localised basis functions → we will focus on Gaussian basis functions
© 2012 Primož Potočnik NEURAL NETWORKS (6) Radial Basis Function Networks #250
Exact interpolation example (1/2)
Interpolation problem – We would like to find a function
which fits all data points
Solution approach – Superposition of Gaussian
radial basis functions
© 2012 Primož Potočnik NEURAL NETWORKS (6) Radial Basis Function Networks #251
Exact interpolation example (2/2)
σ = 0.02
σ = 1
σ = 20
© 2012 Primož Potočnik NEURAL NETWORKS (6) Radial Basis Function Networks #252
6.4 Radial basis function networks
• Exact interpolation model using RB functions can already be
described as a radial basis function network
• N training inputs directly determine hidden layer prototypes (centers
of hidden layer neurons)
• Training inputs and outputs also directly determine output weights
© 2012 Primož Potočnik NEURAL NETWORKS (6) Radial Basis Function Networks #253
Problems with exact interpolation
1. Exact interpolation of noisy data yields a highly oscillatory function
→ such interpolating functions are generally undesirable
2. The number of basis functions is equal to the number of data patterns
→ exact RBF networks are not computationally efficient
© 2012 Primož Potočnik NEURAL NETWORKS (6) Radial Basis Function Networks #254
RBF neural network model
Introduced by Moody & Darken (1989) by several
modifications of exact interpolation procedure
– Number M of basis functions (hidden units) need not equal the number N
of training data points. In general it is better to have M much less than N.
– Centers of basis functions do not need to be defined as the training data
input vectors. They can instead be determined by a training algorithm.
– Basis functions need not all have the same width parameter σ. These
can also be determined by a training algorithm.
– We can introduce bias parameters into the linear sum of activations at the
output layer. These will compensate for the difference between the
average value over the data set of the basis function activations and the
corresponding average value of the targets.
© 2012 Primož Potočnik NEURAL NETWORKS (6) Radial Basis Function Networks #255
Improved RBFN
• Including the proposed changes + expanding to the multidimensional output
• Which can be simplified by introducing an extra basis function Φ0 = 1
• For the case of a Gaussian RBF
© 2012 Primož Potočnik NEURAL NETWORKS (6) Radial Basis Function Networks #256
RBFN in Matlab notation
[Figure: RBF neuron and RBF network in Matlab notation — center and width of a single neuron; centers, widths, biases, and output weights of the network]
© 2012 Primož Potočnik NEURAL NETWORKS (6) Radial Basis Function Networks #257
Computational power of RBFN
• Hartman et al. (1990) – Formal proof of universal approximation property for networks with Gaussian
basis functions in which the widths are treated as adjustable parameters
• Girosi & Poggio (1990) – Showed that RBF networks possess the best approximation property which
states: in the set of approximating functions there is one function which has
minimum approximating error for any given function to be approximated.
This property is not shared by multilayer perceptrons!
• As with the corresponding proofs for MLPs, RBFN proofs rely on the
availability of an arbitrarily large number of hidden units (i.e. basis functions)
• However, proofs provide a theoretical foundation on which practical
applications can be based with confidence
© 2012 Primož Potočnik NEURAL NETWORKS (6) Radial Basis Function Networks #258
6.5 RBFN training
• Key aspect of RBFN:
different roles of first and second computational layer
• Training process can be divided into two stages
1. Hidden layer training
2. Output layer training
• Hidden layer can be trained by unsupervised methods (random
selection, clustering, ...)
• Output layer has linear activation → output weights are determined
analytically by solving a set of linear equations
• Gradient descent learning is not needed for RBFN, therefore
training is very fast!
© 2012 Primož Potočnik NEURAL NETWORKS (6) Radial Basis Function Networks #259
Hidden layer training
• One major advantage of RBF networks is the possibility of choosing
suitable hidden unit (basis function) parameters without having to
perform a full non-linear optimization of the whole network
• Methods for unsupervised selection of basis function centers
– Fixed centres selected at random
– Orthogonal least squares
– K-means clustering
• Problems with unsupervised methods
– Selection of number of centers M
– Selection of center widths σ
• It is also possible to perform a full supervised non-linear optimization
of the network instead
© 2012 Primož Potočnik NEURAL NETWORKS (6) Radial Basis Function Networks #260
Fixed centres selected at random
• Simplest and quickest approach to setting RBFN parameters – Centers fixed at M points selected randomly from the N data points
– Widths fixed to be equal at an appropriate size for the distribution of data points
• Specifically, we can use RBFs centred at {μj} defined by
• Widths σj are all related in the same way to the maximum or
average distance between the chosen centres μj
– Common choices are
– which ensure that the individual RBFs are neither too wide, nor too narrow, for the given
training data
– For large training sets, this approach gives reasonable results
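A sketch of this procedure in Python/NumPy, using the common width heuristic σ = d_max / √(2M), where d_max is the maximum distance between the chosen centres (data, M, and the specific heuristic shown are illustrative assumptions):

```python
import numpy as np

# Fixed centres selected at random: pick M of the N data points as centres,
# then set one shared width from the spread of the chosen centres.
rng = np.random.default_rng(1)
X = rng.uniform(0, 10, size=(100, 2))    # N = 100 data points in 2-D
M = 8
centres = X[rng.choice(len(X), size=M, replace=False)]

d_max = max(np.linalg.norm(a - b) for a in centres for b in centres)
sigma = d_max / np.sqrt(2 * M)           # neither too wide nor too narrow
```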
© 2012 Primož Potočnik NEURAL NETWORKS (6) Radial Basis Function Networks #261
Orthogonal least squares
• A more principled approach to selecting a sub-set of data points as
the basis function centres is based on the technique of orthogonal
least squares
1. Sequential addition of new basis functions, each centred on one of the data
points
2. At each stage, we try out each potential Lth basis function by using the N–L
other data points to determine the network's output weights
3. The potential Lth basis function which gives the smallest output error is used,
and we move on to choose the (L+1)th basis function to add
• To get good generalization we generally use cross-validation to
stop the process when an appropriate number of data points have
been selected as centers
© 2012 Primož Potočnik NEURAL NETWORKS (6) Radial Basis Function Networks #262
K-means clustering
• A potentially even better approach is to use clustering techniques to
find a set of centres which more accurately reflects the distribution of
the data points
• K-Means Clustering Algorithm – Select the number of centres (K) in advance
– Apply a simple re-estimation procedure to partition the data points {xp} into K disjoint subsets Sj containing Nj data points to minimize the sum squared clustering function
– where μj is the mean/centroid of the data points in set Sj given by
• Once the basis centres have been determined in this way, the
widths can then be set according to the variances of the points in the
corresponding cluster
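The re-estimation procedure can be sketched as follows (Python/NumPy; the deterministic initialization and toy data are illustrative — random selection of initial centres is the usual choice):

```python
import numpy as np

def kmeans(X, K, iters=50):
    """Plain K-means: assign each point to its nearest centre, then
    re-estimate each centre as the mean (centroid) of its cluster S_j."""
    # spread initial centres over the data set (illustrative choice)
    mu = X[np.linspace(0, len(X) - 1, K).astype(int)].copy()
    for _ in range(iters):
        labels = np.argmin(((X[:, None, :] - mu[None, :, :]) ** 2).sum(-1), axis=1)
        for j in range(K):
            if np.any(labels == j):
                mu[j] = X[labels == j].mean(axis=0)
    return mu, labels

# two well-separated blobs -> centres converge near (0,0) and (10,10)
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 0.5, (50, 2)),
               rng.normal(10, 0.5, (50, 2))])
mu, labels = kmeans(X, K=2)
```

The resulting centres become the RBF prototypes; the per-cluster variances then give the widths, as described above.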
© 2012 Primož Potočnik NEURAL NETWORKS (6) Radial Basis Function Networks #263
K-means clustering example
© 2012 Primož Potočnik NEURAL NETWORKS (6) Radial Basis Function Networks #264
Output layer training
• After training the hidden layer neurons (selection of centers and
widths), output layer training essentially means optimization of a
single layer linear network
• As with MLPs, a sum-squared output error can be defined
• At the minimum of E, gradients with respect to weights wki are zero
© 2012 Primož Potočnik NEURAL NETWORKS (6) Radial Basis Function Networks #265
Computing the output weights
• Equations for the weights are most conveniently written in matrix
form by defining matrices
• which gives the normal equations (ΦᵀΦ) W = Φᵀ T
• and the formal solution for the weights is W = (ΦᵀΦ)⁻¹ Φᵀ T = Φ† T
• where Φ† is the standard pseudo-inverse of Φ
• Network weights can be computed by fast linear matrix inversion
techniques
– In practice, singular value decomposition (SVD) is often used to avoid possible
ill-conditioning of Φ, i.e. ΦTΦ being singular or near singular
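The pseudo-inverse solution can be sketched as follows (Python/NumPy; the matrices are invented for illustration — NumPy's `pinv` uses SVD internally, matching the ill-conditioning remark above):

```python
import numpy as np

# Output-layer weights via the pseudo-inverse: W = pinv(Phi) @ T.
rng = np.random.default_rng(3)
Phi = rng.standard_normal((50, 6))   # 50 patterns x 6 basis activations
W_true = rng.standard_normal((6, 2)) # 2 linear output units
T = Phi @ W_true                     # noise-free targets for the sketch

W = np.linalg.pinv(Phi) @ T          # SVD-based, robust to near-singular Phi
```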
© 2012 Primož Potočnik NEURAL NETWORKS (6) Radial Basis Function Networks #266
Supervised RBFN training
• Supervised training of basis function parameters can give good
results, but the computational costs are usually enormous
• Obvious approach is to perform gradient descent on a sum squared
output error function as in MLP backpropagation learning. Error
function would be
• Supervised RBFN training would iteratively update the weights
(basis function parameters) using gradients
© 2012 Primož Potočnik NEURAL NETWORKS (6) Radial Basis Function Networks #267
Supervised RBFN training
• By using the Gaussian basis functions
derivatives of error function
become very complex and therefore computationally inefficient
• Additionally, we get all the problems of choosing the learning rates,
avoiding local minima ... that we had for training MLPs by
backpropagation
• And there is a tendency for the basis function widths to grow large,
leaving non-localised basis functions
© 2012 Primož Potočnik NEURAL NETWORKS (6) Radial Basis Function Networks #268
Regularization theory for RBFN
• Alternative approach to prevent overfitting in RBFN
• Based on the theory of regularization, which is a method of
controlling the smoothness of mapping functions
• We can have one basis function for each training data point as in the
case of exact interpolation, but add an extra term to the error
measure which penalizes mappings which are not smooth
© 2012 Primož Potočnik NEURAL NETWORKS (6) Radial Basis Function Networks #269
Regularization term in error measure
• In the regularization approach, the error measure is modified with an additional regularization term that is composed of – a differential operator P, and
– regularization parameter λ
• Regularization parameter λ determines the relative importance of smoothness compared with error
• Differential operator P can have many possible forms, but the general idea is that mapping functions which have large curvature should yield large regularization term and hence contribute a large penalty in the total error function
© 2012 Primož Potočnik NEURAL NETWORKS (6) Radial Basis Function Networks #270
RBFN training summary
Option 1) Exact interpolation model + Regularization
Option 2) Supervised RBFN training
Option 3) Two-stage hybrid training 3a) Hidden layer training
Fixed centres selected at random
Orthogonal least squares
K-means clustering
3b) Output layer training
Linear matrix operation
Where to start? → Two-stage hybrid training with K-means clustering and a linear
matrix operation for the output layer
© 2012 Primož Potočnik NEURAL NETWORKS (6) Radial Basis Function Networks #271
6.6 RBFN for classification
• Key insight into RBFN can be obtained by using such networks for
classification problems
• Suppose we have data set with three classes
MLP RBFN
• Multilayer perceptron can separate classes by using hidden units to
form hyperplanes in the input space
• Alternative approach is to model the separate class distributions by
localised radial basis functions
© 2012 Primož Potočnik NEURAL NETWORKS (6) Radial Basis Function Networks #272
Implementing RBFN for classification
• Define an output function yk(x) for each class k with appropriate targets
• RBFN is trained with input patterns x and corresponding target classes t
• Underlying justification for using RBFN for classification is found in Cover’s theorem which states
A complex pattern classification problem cast in a high dimensional space non-linearly is more likely to be linearly separable than in a low dimensional space.
Once we have linearly separable patterns, the classification problem can be solved by a linear layer
© 2012 Primož Potočnik NEURAL NETWORKS (6) Radial Basis Function Networks #273
6.7 Comparison with multilayer perceptron
• Similarities between RBF networks and MLPs 1. They are both non-linear feed-forward networks
2. They are both universal approximators for arbitrary nonlinear functional mappings
3. They can be used in similar application areas
There always exists an RBF network capable of accurately mimicking a specified
MLP, or vice versa.
MLP RBFN
© 2012 Primož Potočnik NEURAL NETWORKS (6) Radial Basis Function Networks #274
RBFN / MLP differences
MLP
1. Can have any number of hidden layers
2. Computation nodes (processing units) in different layers share a common neuronal model, though not necessarily the same activation function
3. Argument of each hidden unit activation function is the inner product of the input and the weights
4. Usually trained with a single global supervised algorithm
5. Construct global approximations to non-linear input-output mappings with distributed hidden representations
6. Require a smaller number of parameters
RBFN
1. Single hidden layer
2. Hidden nodes (basis functions) operate very differently, and have a different purpose compared to the output nodes
3. Argument of each hidden unit
activation function is the distance between the input and the “weights” (RBF centres)
4. Usually trained one layer at a time with the first layer unsupervised
5. Use localised non-linearities (Gaussians) at the hidden layer to construct local approximations
6. Fast training
© 2012 Primož Potočnik NEURAL NETWORKS (6) Radial Basis Function Networks #275
6.8 Probabilistic networks
• Probabilistic neural networks (PNN) can be used for classification problems
• First layer computes distances from the input vector to the training input vectors (prototypes) and produces a vector whose elements indicate how close the input is to a training input
• Second layer sums these contributions for each class of inputs to produce as its net output a vector of probabilities
• Finally, a competitive output layer picks the maximum of these probabilities, and produces “1” for that class and “0” for the other classes
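The three layers can be sketched in one function (Python/NumPy; training patterns and the spread value are invented for illustration):

```python
import numpy as np

def pnn_classify(x, X_train, y_train, spread=0.5):
    """PNN sketch: layer 1 = Gaussian kernels on distances to all stored
    prototypes, layer 2 = per-class sums, layer 3 = competitive winner."""
    d2 = ((X_train - x) ** 2).sum(axis=1)
    k = np.exp(-d2 / (2 * spread ** 2))          # radial basis layer
    classes = np.unique(y_train)
    sums = np.array([k[y_train == c].sum() for c in classes])  # class sums
    return classes[np.argmax(sums)]              # competitive output layer

X_train = np.array([[0.0, 0.0], [0.2, 0.1], [3.0, 3.0], [3.1, 2.9]])
y_train = np.array([0, 0, 1, 1])
```

Note how the whole training set is stored as prototypes, which is exactly the memory/speed drawback discussed below.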
© 2012 Primož Potočnik NEURAL NETWORKS (6) Radial Basis Function Networks #276
PNN example 1
Three training patterns
Classifying new sample
PNN division of the input space
© 2012 Primož Potočnik NEURAL NETWORKS (6) Radial Basis Function Networks #277
PNN example 2 (1/4)
© 2012 Primož Potočnik NEURAL NETWORKS (6) Radial Basis Function Networks #278
PNN example 2 (2/4)
© 2012 Primož Potočnik NEURAL NETWORKS (6) Radial Basis Function Networks #279
PNN example 2 (3/4)
© 2012 Primož Potočnik NEURAL NETWORKS (6) Radial Basis Function Networks #280
PNN example 2 (4/4)
© 2012 Primož Potočnik NEURAL NETWORKS (6) Radial Basis Function Networks #281
PNN considerations
• Probabilistic neural networks are specialized to classification (less
general than RBFN or MLP)
• PNN are sensitive to the selection of the spread parameter → spread
can be optimized by the leave-one-out cross-validation technique 1. Leave one training sample out, train PNN and test on the omitted sample
2. Repeat procedure for all samples and save results
3. Find optimal spread that yields minimal average classification error
• Benefits – Little or no training required (except spread optimization)
– Besides classifications, PNN also provides Bayesian posterior probabilities → solid theoretical
foundation to support confidence estimates for the network's decisions
– Robust against outliers → outliers have no real effect on decisions
• Drawbacks – PNN performance depends strongly on a thoroughly representative training set
– Entire training set must be stored → large memory and poor execution speed
© 2012 Primož Potočnik NEURAL NETWORKS (6) Radial Basis Function Networks #282
6.9 Generalized regression networks
• GRNN can be well explained by reviewing the regression problem:
How to use measured values (independent variables) to predict the
value of a dependent variable?
Linear regression is OK Linear regression fails
© 2012 Primož Potočnik NEURAL NETWORKS (6) Radial Basis Function Networks #283
Simple linear regression
• Simple linear regression is expressed with
• Given the training data, the slope a and bias b are computed as – Compute sum of squares
– Compute slope and bias
• Resulting linear equation will minimize mean squared error of predicted values y in the training set
y = a x + b
SS_x = Σᵢ (xᵢ − x̄)² ,   SS_xy = Σᵢ (xᵢ − x̄)(yᵢ − ȳ)
a = SS_xy / SS_x ,   b = ȳ − a x̄
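The slope and bias formulas (a = SS_xy / SS_x, b = ȳ − a·x̄) can be checked numerically (Python/NumPy; the data points are invented for illustration):

```python
import numpy as np

# Simple linear regression from sums of squares.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0                      # exact line y = 2x + 1

SS_x = ((x - x.mean()) ** 2).sum()
SS_xy = ((x - x.mean()) * (y - y.mean())).sum()
a = SS_xy / SS_x                       # slope
b = y.mean() - a * x.mean()            # bias
```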
© 2012 Primož Potočnik NEURAL NETWORKS (6) Radial Basis Function Networks #284
Multiple regression
• Several independent variables x1, x2, x3, ...
• Matrix notation
• Pack training data into matrices
• Parameter can be expressed as
• Final solution is usually obtained numerically by singular value
decomposition method (SVD)
y = a₁x₁ + a₂x₂ + a₃x₃ + b
x = [ x₁, x₂, x₃, 1 ] ,   a = [ a₁, a₂, a₃, b ]ᵀ
Y = X a
a = (Xᵀ X)⁻¹ Xᵀ Y
© 2012 Primož Potočnik NEURAL NETWORKS (6) Radial Basis Function Networks #285
General regression neural network
• Best predictor for dependent variable y is defined by its conditional
expectation, given the independent variable x
• Joint density function fxy(x,y) is not known but can be approximated by a
Parzen estimator
• By using the Parzen approximator with Gaussian kernels, we obtain the
equation for the GRNN predictor
© 2012 Primož Potočnik NEURAL NETWORKS (6) Radial Basis Function Networks #286
GRNN properties
• GRNN closely resembles RBFN with a normalization term in the
denominator → it is sometimes called “Normalized RBFN”
• GRNN also resembles PNN but is used for regression (function
approximation), not for classification
• Width parameter spread must be selected, as in all RBF networks
• First layer has Gaussian kernels located at each training case and
computes distances from the input vector to the training input vectors
(prototypes)
• Second layer is a special linear layer with normalization operator
• Normalization makes GRNN a very robust predictor
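The normalized prediction can be sketched as follows (Python/NumPy; training cases and the spread value are invented for illustration):

```python
import numpy as np

def grnn_predict(x, X_train, y_train, spread=0.3):
    """GRNN / normalized-RBF sketch: a Gaussian kernel at every training
    case; prediction = kernel-weighted average of the training targets."""
    d2 = ((X_train - x) ** 2).sum(axis=1)
    k = np.exp(-d2 / (2 * spread ** 2))
    return (k * y_train).sum() / k.sum()   # normalization in the denominator

X_train = np.array([[0.0], [1.0], [2.0], [3.0]])
y_train = np.array([0.0, 1.0, 4.0, 9.0])   # samples of y = x^2
```

Because the kernel weights are normalized, the prediction is always a convex combination of stored targets, which is what makes GRNN such a robust predictor.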
© 2012 Primož Potočnik NEURAL NETWORKS (6) Radial Basis Function Networks #287
GRNN architecture
Standard radial basis layer → normalization → linear layer
© 2012 Primož Potočnik NEURAL NETWORKS (6) Radial Basis Function Networks #288
RBFN vs. GRNN example (1/3)
© 2012 Primož Potočnik NEURAL NETWORKS (6) Radial Basis Function Networks #289
RBFN vs. GRNN example (2/3)
© 2012 Primož Potočnik NEURAL NETWORKS (6) Radial Basis Function Networks #290
RBFN vs. GRNN example (3/3)
© 2012 Primož Potočnik NEURAL NETWORKS (6) Radial Basis Function Networks #291
Summary
RBFN PNN
GRNN
© 2012 Primož Potočnik NEURAL NETWORKS (6) Radial Basis Function Networks #292
© 2012 Primož Potočnik NEURAL NETWORKS (7) Self-Organizing Maps #293
7. Self-Organizing Maps
7.1 Self-organization
7.2 Self-organizing maps
7.3 SOM algorithm
7.4 Properties of the feature map
7.5 SOM discussion & examples
© 2012 Primož Potočnik NEURAL NETWORKS (7) Self-Organizing Maps #294
Introduction
1. We discussed so far a number of networks which were trained to
perform a mapping
INPUTS → OUTPUTS
which corresponds to supervised learning paradigm
2. However, problems exist where target outputs are not available →
the only information is provided by a set of input patterns
INPUTS → ??
which corresponds to unsupervised learning paradigm
148
© 2012 Primož Potočnik NEURAL NETWORKS (7) Self-Organizing Maps #295
Examples of problems
• Clustering – Input data are grouped into clusters → for any input, the neural net should return a
corresponding cluster label
• Vector quantization – Continuous space has to be discretised → the neural net has to find an optimal
discretisation of the input space
• Dimensionality reduction – Input data are grouped in a subspace with lower dimensionality than the original
data → the neural net has to learn an optimal mapping such that most of the
variance in the input data is preserved in the output data
• Feature extraction – The system has to extract features from the input signal → this often means a
dimensionality reduction as described above
© 2012 Primož Potočnik NEURAL NETWORKS (7) Self-Organizing Maps #296
7.1 Self-organization
• What is self-organization? – System structure appears without explicit pressure or involvement from outside
the system
– Constraints on form (i.e. organization) of interest to us are internal to the system,
resulting from the interactions among the components
– The organization can evolve in either time or space, maintain a stable form or
show transient phenomena
149
© 2012 Primož Potočnik NEURAL NETWORKS (7) Self-Organizing Maps #297
Self-organization properties
• Typical features include (in rough order of generality) – Autonomy (absence of external control)
– Dynamic operation (evolution in time)
– Fluctuations (noise / searches through options)
– Symmetry breaking (loss of freedom)
– Global order (emergence from local interactions)
– Dissipation (energy usage / far-from-equilibrium)
– Instability (self-reinforcing choices / nonlinearity)
– Multiple equilibria (many possible attractors)
– Criticality (threshold effects / phase changes)
– Redundancy (insensitivity to damage)
– Self-maintenance (repair / reproduction metabolisms)
– Adaptation (functionality / tracking of external variations)
– Complexity (multiple concurrent values or objectives)
– Hierarchies (multiple nested self-organized levels)
John Conway’s Game of Life
• John Conway (1970); the game was popularized by Martin Gardner’s column in Scientific American
• Game of Life:
– infinite two-dimensional grid of square cells,
– each cell is in one of two possible states, alive or dead,
– every cell interacts with its eight neighbours
– RULES:
1. Alive cell with fewer than 2 or more than 3 neighbours dies (loneliness / overcrowding)
2. Dead cell with exactly 3 neighbours turns alive (reproduction)
© 2012 Primož Potočnik NEURAL NETWORKS (7) Self-Organizing Maps #298
Glider Gun creating gliders
150
© 2012 Primož Potočnik NEURAL NETWORKS (7) Self-Organizing Maps #299
Self-organization in neural networks
• Self-organizing networks are based on competitive learning – output neurons of the network compete to be activated and only one neuron can
become a winning neuron
• Self-organizing maps (SOM) – learn to recognize groups of similar input vectors in such a way that neurons
physically near each other in the neuron layer respond to similar input vectors
• Learning vector quantization (LVQ) – a method for training competitive layers in a supervised manner
– learns to classify input vectors into target classes chosen by the user
© 2012 Primož Potočnik NEURAL NETWORKS (7) Self-Organizing Maps #300
Neurobiological motivation
• Neurobiological studies indicate that different sensory inputs (tactile, visual, auditory, etc.) are mapped onto different areas of the cerebral cortex in an ordered fashion
• This form of a map, known as a topographic map, has two important properties:
1. At each stage of representation, or processing, each piece of incoming information is kept in its proper context / neighbourhood
2. Neurons dealing with closely related pieces of information are kept close together so that they can interact via short synaptic connections
• Our interest is in building artificial topographic maps that learn through self-organization in a neurobiologically inspired manner
• We shall follow the principle of topographic map formation: The spatial location of an output neuron in a topographic map corresponds to a particular domain or feature drawn from the input space
151
© 2012 Primož Potočnik NEURAL NETWORKS (7) Self-Organizing Maps #301
7.2 Self-organizing maps (SOM)
• Neurons are placed at the nodes of a lattice, usually 1D or 2D
• Neurons are trained by self-organized competitive learning rule
• Neurons become selectively tuned to various input patterns or
classes of input patterns
• Locations of neurons become ordered in such a way that a meaningful
topographic map of input patterns is created
• The process of ordering is automatic (self-organized) without
guidance from outside
• Self-organizing maps are inherently nonlinear → a nonlinear
generalization of principal component analysis (PCA)
© 2012 Primož Potočnik NEURAL NETWORKS (7) Self-Organizing Maps #302
Organization of a self-organizing map
• Points x from the input space are mapped to points
I(x) in the output space (self-organizing map)
• Each point I in the output space will map to a corresponding point
w(I) in the input space
152
© 2012 Primož Potočnik NEURAL NETWORKS (7) Self-Organizing Maps #303
Kohonen network
• Kohonen (1982) : Self-organized formation of topologically correct
feature maps. Biological Cybernetics
• Kohonen network or Self-Organizing Map (SOM) has a single
computational layer arranged in rows and columns
– 1D, 2D, 3D
• Each neuron is fully connected to all source nodes in the input layer
© 2012 Primož Potočnik NEURAL NETWORKS (7) Self-Organizing Maps #304
SOM architecture
Calculating the distance
between inputs and
neurons
dist
Competitive layer
selection of a winning
neuron and its
neighborhood
dist, linkdist, mandist,
boxdist
Topologies:
1D, 2D, 3D
153
© 2012 Primož Potočnik NEURAL NETWORKS (7) Self-Organizing Maps #305
7.3 SOM algorithm
1. Initialization Define SOM topology, then initialize weights with small random values
2. Competition For each input pattern, neurons compute their values of a distance function which provides the basis for competition. A neuron with the smallest distance to the input pattern is declared the winner.
3. Cooperation Winning neuron determines the topological neighbourhood of excited neurons, thereby providing the basis for cooperation among neighbouring neurons
4. Adaptation Excited neurons decrease their distance to the input pattern through adjustment of synaptic weights → the response of the winning neuron to the subsequent application of a similar input pattern is enhanced
© 2012 Primož Potočnik NEURAL NETWORKS (7) Self-Organizing Maps #306
Competition - Cooperation - Adaptation
• We have m-dimensional input space
• Synaptic weight vector of each neuron in the network has the same dimension as input space
• The best match of the input vector x with the synaptic weight vectors wj can be found by comparing the Euclidean distance between input vector x and each neuron j
• Neuron whose weight vector comes closest to the input vector (i.e. is most similar to it) is declared the winning neuron
• In this way the continuous input space can be mapped to the discrete output space of neurons by a simple process of competition between the neurons
x = [x1, x2, …, xm]
wj = [wj1, wj2, …, wjm],  j = 1, …, K
d(x) = ||x − wj||
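The competition step, declaring the neuron with the smallest Euclidean distance to the input as the winner, can be sketched as follows (an illustrative Python snippet; the course examples use MATLAB's `dist`):

```python
import numpy as np

def winner(x, W):
    """Competition: return the index of the neuron whose weight vector
    (row of W) is closest to the input vector x in Euclidean distance."""
    return int(np.argmin(np.linalg.norm(W - x, axis=1)))

W = np.array([[0.0, 0.0], [1.0, 1.0], [0.0, 1.0]])  # K = 3 neurons, m = 2 inputs
i = winner(np.array([0.9, 0.8]), W)                  # neuron 1 is closest
```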
154
© 2012 Primož Potočnik NEURAL NETWORKS (7) Self-Organizing Maps #307
Competition - Cooperation - Adaptation
• Winning neuron locates the center of a topological neighborhood of
cooperating neurons
• Neurobiological studies confirm that there is lateral interaction
within a set of excited neurons
– When one neuron fires, its closest neighbours tend to get excited more than
those further away
– Topological neighbourhood decays with distance
• We define a similar neurobiologically correct topological
neighbourhood for the neurons in SOM and assume two
requirements:
1. Topological neighborhood is symmetric around the winning neuron
2. Amplitude of the topological neighborhood decreases monotonically with
increasing lateral distance
(decaying to zero in the limit d → ∞, which is necessary for convergence)
© 2012 Primož Potočnik NEURAL NETWORKS (7) Self-Organizing Maps #308
Competition - Cooperation - Adaptation
• A typical choice of a topological neighbourhood function that covers
both requirements is the Gaussian function
Gaussian function is translation invariant
(independent of the location of the winning neuron)
h_j,i(x) = exp( −d²_j,i / 2σ² )
where σ is the effective width of the topological neighborhood
155
© 2012 Primož Potočnik NEURAL NETWORKS (7) Self-Organizing Maps #309
Competition - Cooperation - Adaptation
• For cooperation to be effective, topological neighborhood must
depend on lateral distance between winning neuron and its
neighbors in the output space and NOT on a distance measure in the
original input space
Winning neuron
Neighbours
Distance:
dist
linkdist
mandist
boxdist
© 2012 Primož Potočnik NEURAL NETWORKS (7) Self-Organizing Maps #310
Competition - Cooperation - Adaptation
• Another special feature of the SOM algorithm is that size of the
topological neighborhood shrinks with time
• Shrinking requirement is fulfilled by decreasing the width σ of the
Gaussian neighborhood function with time. Popular choice is
exponential temporal decay
• Consequently, topological neighborhood function assumes time-
varying form
• Time increases → width decreases → neighborhood shrinks
σ(n) = σ0 exp(−n/τ1),  n = 1, 2, …
h_j,i(x)(n) = exp( −d²_j,i / 2σ²(n) ),  n = 1, 2, …
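The exponentially decaying width and the resulting time-varying neighborhood can be sketched in Python (illustrative; the initial width and time constant are arbitrary assumptions):

```python
import numpy as np

def sigma(n, sigma0=5.0, tau1=1000.0):
    """Shrinking neighborhood width: sigma(n) = sigma0 * exp(-n / tau1)."""
    return sigma0 * np.exp(-n / tau1)

def neighborhood(d2, n, sigma0=5.0, tau1=1000.0):
    """Time-varying Gaussian neighborhood: h(n) = exp(-d^2 / (2*sigma(n)^2)).
    d2 is the squared lateral distance from the winning neuron."""
    return np.exp(-d2 / (2.0 * sigma(n, sigma0, tau1) ** 2))
```

As time n grows, sigma(n) decreases, so the same lateral distance receives a smaller neighborhood value: the neighborhood shrinks.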
156
© 2012 Primož Potočnik NEURAL NETWORKS (7) Self-Organizing Maps #311
Competition - Cooperation - Adaptation
• Time increases → width decreases → neighborhood shrinks
© 2012 Primož Potočnik NEURAL NETWORKS (7) Self-Organizing Maps #312
Competition - Cooperation - Adaptation
• Clearly, SOM must involve some kind of adaptation or learning by
which the outputs become self-organised and the feature map
between inputs and outputs is formed
• Meaning of the topographic neighbourhood is that not only the
winning neuron gets its weights updated, but its neighbours will
have their weights updated as well
• Learning rule for adaptation
the rule is applied to all neurons inside the topological
neighbourhood of the winning neuron i
• Adaptation moves the synaptic weights wj of the chosen neurons
toward the input vector x
wj(n+1) = wj(n) + η(n) h_j,i(x)(n) [ x − wj(n) ]
157
© 2012 Primož Potočnik NEURAL NETWORKS (7) Self-Organizing Maps #313
Competition - Cooperation - Adaptation
• Adaptation algorithm leads to a topological ordering of the feature
map – neurons that are neighbours in the lattice will tend to have
similar weight vectors
• Learning parameter η(n) should decrease with time for proper
convergence; a popular choice is exponential decay
η(n) = η0 exp(−n/τ2),  n = 1, 2, …
• Thus, the SOM algorithm requires the choice of several parameters: σ0, τ1, η0, τ2
Even if not optimal, the selection of parameters usually leads to the
formation of the feature map in a self-organized manner
© 2012 Primož Potočnik NEURAL NETWORKS (7) Self-Organizing Maps #314
Competition - Cooperation - Adaptation
Adaptation process can be decomposed in two phases
1. Self-organizing or ordering phase
Topological ordering of weight vectors
typically about 1000 iterations of the SOM algorithm
needs proper choice of neighbourhood function and learning rate
2. Convergence phase
Feature map fine-tuning
provides statistical quantification of the input space
typically the number of iterations is at least 500 times the number of neurons
Result of SOM algorithm
Starting from the initial state of complete disorder, SOM algorithm
gradually leads to an organized representation of activation
patterns drawn from the input space
– However, it is possible to end up in a metastable state in which the feature map
has a topological defect
158
© 2012 Primož Potočnik NEURAL NETWORKS (7) Self-Organizing Maps #315
SOM algorithm essentials
Essential characteristics of the SOM algorithm:
• Continuous input space of activation patterns that are generated
according to a certain probability distribution
• Discrete output space in a form of a lattice of neurons
• Shrinking neighborhood function h that is defined around a
winning neuron
• Decreasing learning rate that is exponentially decreasing with time
© 2012 Primož Potočnik NEURAL NETWORKS (7) Self-Organizing Maps #316
SOM algorithm summary
1. Initialization Choose random values for the initial weight vectors wj
2. Sampling Draw a sample training input vector x from the input space
3. Competition Find the winning neuron with weight vector closest to input vector
4. Cooperation Select neurons in the topological neighbourhood of the winning neuron
5. Adaptation Adjust synaptic weights of the selected neurons
6. Iteration Continue with step 2 until the feature map stops changing
wj(n+1) = wj(n) + η(n) h_j,i(x)(n) [ x − wj(n) ]
159
© 2012 Primož Potočnik NEURAL NETWORKS (7) Self-Organizing Maps #317
Visualizing the SOM algorithm (1/2)
Step 1
Suppose we have four data points (x) in our continuous 2D input space, and want to map this onto four points in a discrete 1D output space (o). The output nodes map to points in the input space (o). Random initial weights start the circles at random positions in the centre of the input space.
Step 2
We randomly pick one of the data points for training. The closest output point represents the winning neuron. That winning neuron is moved towards the data point by a certain amount, and the two neighbouring neurons move by smaller amounts.
© 2012 Primož Potočnik NEURAL NETWORKS (7) Self-Organizing Maps #318
Visualizing the SOM algorithm (2/2)
Step 3
Next we randomly pick another data point for training. The closest output point gives the new winning neuron. The winning neuron moves towards the data point by a certain amount, and the one neighbouring neuron moves by a smaller amount.
Step 4
We carry on randomly picking data points for training. Each winning neuron moves towards the data point by a certain amount, and its neighbouring neuron(s) move by smaller amounts. Eventually the whole output grid unravels itself to represent the input space.
160
© 2012 Primož Potočnik NEURAL NETWORKS (7) Self-Organizing Maps #319
Example: 1D Lattice driven by 2D distribution
2D input data distribution Initial condition of 1D lattice
End of ordering phase End of convergence phase
© 2012 Primož Potočnik NEURAL NETWORKS (7) Self-Organizing Maps #320
Parameters for 1D example
(a) Exponential decay of
neighborhood width
σ(n)
(b) Exponential decay of
learning rate η(n)
(c) Initial neighborhood
function (spanning
over 100 neurons)
(d) Final neighborhood
function at the end of
the ordering phase
161
© 2012 Primož Potočnik NEURAL NETWORKS (7) Self-Organizing Maps #321
Example: 2D Lattice driven by 2D distribution
Initial condition of 2D lattice
End of ordering phase End of convergence phase
2D input data distribution
© 2012 Primož Potočnik NEURAL NETWORKS (7) Self-Organizing Maps #322
Matlab examples
• nnd14fm1 – 1D feature map
• nnd14fm2 – 2D feature map
162
© 2012 Primož Potočnik NEURAL NETWORKS (7) Self-Organizing Maps #323
7.4 Properties of the feature map
Property 1: Approximation of the input space
Property 2: Topological ordering
Property 3: Density matching
Property 4: Feature selection
© 2012 Primož Potočnik NEURAL NETWORKS (7) Self-Organizing Maps #324
Property 1: Approximation of the input space
The feature map Φ represented by the set of weight vectors {wi} in the output space, provides a good approximation to the input space
The goal of SOM can be formulated as storing a large set of input vectors {x} by a smaller set of prototypes {wi} that provide a good approximation to the original input space.
Goodness of the approximation is given by the total squared distance which we wish to minimize
If we work through gradient descent style mathematics we do end up with the SOM weight update algorithm, which confirms that it is generating a good approximation to the input space
163
© 2012 Primož Potočnik NEURAL NETWORKS (7) Self-Organizing Maps #325
Property 2: Topological ordering
The feature map Φ computed by the SOM algorithm is topologically
ordered in the sense that the spatial location of a neuron in the output
lattice corresponds to a particular domain or feature of input patterns
The topological ordering property is a direct consequence of the weight
update equation:
– Not only the winning neuron but also the neurons in the topological
neighbourhood are updated
– Consequently the whole output space becomes appropriately ordered
Visualise the feature map Φ as an elastic net ...
© 2012 Primož Potočnik NEURAL NETWORKS (7) Self-Organizing Maps #326
Property 3: Density matching
The feature map Φ reflects variations in the statistics of the input distribution: Regions in the input space from which the sample training vectors x are drawn with high probability of occurrence are mapped onto larger domains of the output space, and therefore with better resolution than regions of input space from which training vectors are drawn with low probability.
We can relate the input vector probability distribution p(x) to the magnification factor m(x) of the feature map. Generally, for two-dimensional feature maps the relation cannot be expressed as a simple function, but in one dimension we can show that m(x) ∝ p^(2/3)(x).
So the SOM algorithm doesn’t match the input density exactly, because of the power of 2/3 rather than 1.
As a general rule, the feature map tends to over-represent regions with low input density and to under-represent regions with high input density.
164
© 2012 Primož Potočnik NEURAL NETWORKS (7) Self-Organizing Maps #327
Property 4: Feature selection
Given data from an input space with a non-linear distribution, the self-organizing map is able to select a set of best features for approximating the underlying distribution
This property is a natural culmination of properties 1,2,3
Principal Component Analysis (PCA) is able to compute the input dimensions which carry the most variance in the training data. It does this by computing the eigenvector associated with the largest eigenvalue of the correlation matrix.
PCA is fine if the data really does form a line or plane in input space, but if the data forms a curved line or surface, linear PCA is no good, but a SOM will overcome the approximation problem by virtue of its topological ordering property.
The SOM provides a discrete approximation of finding so-called principal curves or principal surfaces, and may therefore be viewed as a non-linear generalization of PCA
© 2012 Primož Potočnik NEURAL NETWORKS (7) Self-Organizing Maps #328
• SOM is a neural network built around 1D or 2D lattice of neurons for
capturing important features contained in input data
• SOM provides a structural representation of input data by neurons’
weight vectors as prototypes
• SOM is neurobiologically inspired and incorporates self-organizing
mechanisms
– Competition
– Cooperation
– Adaptation
• SOM is simple to implement yet mathematically difficult to analyze
7.5 SOM discussion & examples
165
© 2012 Primož Potočnik NEURAL NETWORKS (7) Self-Organizing Maps #329
4 clusters, 1D SOM
© 2012 Primož Potočnik NEURAL NETWORKS (7) Self-Organizing Maps #330
4 clusters, 2D SOM
166
© 2012 Primož Potočnik NEURAL NETWORKS (7) Self-Organizing Maps #331
Uniform distribution, square
[Figure: uniform distribution of 1000 points in a square; SOM weight vectors after training with 50 neurons (1D) and with a 10x10 grid (2D)]
© 2012 Primož Potočnik NEURAL NETWORKS (7) Self-Organizing Maps #332
Uniform distribution, circle
[Figure: uniform distribution of 1000 points in a circle; SOM weight vectors after training with 50 neurons (1D) and with a 10x10 grid (2D)]
167
© 2012 Primož Potočnik NEURAL NETWORKS (7) Self-Organizing Maps #333
Gaussian distribution, square
[Figure: Gaussian distribution of 1000 points in a square; SOM weight vectors after training with 50 neurons (1D) and with a 10x10 grid (2D)]
© 2012 Primož Potočnik NEURAL NETWORKS (7) Self-Organizing Maps #334
Complex distribution
[Figure: complex distribution in a square; SOM weight vectors after training with 50 neurons (1D) and with a 10x10 grid (2D)]
168
© 2012 Primož Potočnik NEURAL NETWORKS (7) Self-Organizing Maps #335
4 clusters, 2D SOM
[Figure: SOM weight vectors after training on the 4-cluster data]
4 classes with uniform distribution
1000 points in each class Net – 8x8 neurons
© 2012 Primož Potočnik NEURAL NETWORKS (7) Self-Organizing Maps #336
169
© 2012 Primož Potočnik NEURAL NETWORKS (8) Practical considerations #337
8. Practical Considerations
8.1 Designing the training data
8.2 Preparing data
8.3 Selection of inputs
8.4 Data encoding
8.5 Principal component analysis
8.6 Invariances and prior knowledge
8.7 Generalization
8.8 General guidelines
© 2012 Primož Potočnik NEURAL NETWORKS (8) Practical considerations #338
Introduction
• A neural network could, in principle, map raw input data into required
outputs → in practice, this will generally give poor results
• For most applications, some data manipulations are recommended:
Preparing data
– designing the training data
– handling missing and extreme data
– incorporating invariances and prior knowledge
Preparing inputs
– pre-processing, rescaling, normalizing, standardizing, detrending
– dimensionality reduction: principal component analysis
– feature selection, feature extraction
Preparing outputs
– encoding of classes, post-processing, rescaling, standardizing
170
© 2012 Primož Potočnik NEURAL NETWORKS (8) Practical considerations #339
8.1 Designing the training data
• Good training data are required to train a NN – Neural nets are not good at extrapolation
• Training data must be representative for the problem considered
• For pattern recognition – Every class must be represented
– Within each class, statistical variation must be adequately represented
– Potato chips factory example:
• NN must be trained on 1) normal chips, 2) burned chips, 3) uncooked chips, ...
• Large training set prevents overfitting – Overfitting = perfect fit to a small number of training data
– Three-layer feedforward network example:
• With 25 inputs and 10 hidden neurons → over 260 free parameters
• Apply at least 500-1000 training samples (preferably more) for proper training
© 2012 Primož Potočnik NEURAL NETWORKS (8) Practical considerations #340
8.2 Preparing data
• Some data transformations are usually necessary to achieve good
neural network results
• Rescaling – Add/subtract a constant and then multiply/divide by a constant
– Example: convert a temperature from Celsius to Fahrenheit
• Standardizing – Subtracting a measure of location and dividing by a measure of scale
– Example: subtracting a mean and dividing by standard deviation, thereby obtaining a
"standard normal" random variable with mean 0 and standard deviation 1
• Normalizing – Dividing a vector by its norm
– Example: make the Euclidean length of the vector equal to one.
– In the NN literature, "normalizing" often refers to rescaling into [0,1] range
• Which operations should be applied to data? It depends!
171
© 2012 Primož Potočnik NEURAL NETWORKS (8) Practical considerations #341
Rescaling
• Rescaling inputs – The often recommended rescaling of inputs to the interval [0,1] is a misconception; there
is in fact no such requirement.
– Interval [0,1] is usually a bad choice, rescaling to [-1,1] interval is better
– Standardizing inputs is better than rescaling ...
• Rescaling outputs 1. For bounded activation functions (range [0,1] or [-1,1] ) the target values must lie
within that range
The alternative is to use an activation function suited to the distribution of the
targets, for example linear activation function.
2. It is essential to rescale the multidimensional targets so that their variability
reflects their importance, or at least is not in inverse relation to their importance.
If the targets are of equal importance, they should typically be rescaled or
standardized to the same range or the same standard deviation.
© 2012 Primož Potočnik NEURAL NETWORKS (8) Practical considerations #342
Standardizing
• Standardizing usually refers to transforming data to zero mean and standard deviation one
– Statistics (mean, std) are computed from training data, not from validation data
– Validation data must be standardized using the statistics computed from training data
• Standardizing inputs – Often very beneficial for MLP and RBFN networks
RBFN – inputs are combined via a distance function (Euclidean), therefore it is important to standardize them into a similar range
MLP – standardizing enables utilization of the steep parts of transfer functions → faster learning and avoidance of saturation
• Standardizing outputs – Typically more a convenience for getting good initial weights than a necessity
– Important for the equal relevance of targets
– Note: use rescaling for bounded activation functions, not standardizing!
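The rule that validation data must use the training statistics can be shown in a short sketch (illustrative Python; the toy matrices are assumptions):

```python
import numpy as np

X_train = np.array([[1.0, 200.0], [2.0, 400.0], [3.0, 600.0]])
X_valid = np.array([[2.0, 300.0]])

# Statistics (mean, std) are computed from the TRAINING data only
mu = X_train.mean(axis=0)
sd = X_train.std(axis=0)

Z_train = (X_train - mu) / sd   # zero mean, unit std per input variable
Z_valid = (X_valid - mu) / sd   # validation data standardized with train statistics
```

Note how both input columns end up on the same scale even though the raw ranges differ by two orders of magnitude.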
172
© 2012 Primož Potočnik NEURAL NETWORKS (8) Practical considerations #343
Time series transformations
• Detrending – Removing linear trend from the time series
– After neural network application, original trend is added to the results
Careful: it is too easy to create a trend where none belongs!
• Removing seasonal components – Yearly, monthly, weekly, daily, hourly cycles can be removed before the
application of neural networks
– Decomposition methods
• Differencing – Working with differences between successive samples can sometimes bring
good results
– Example: daily stock-market values convey one sort of information, the change
from one day to the next conveys entirely different information
– Differencing can be applied at inputs and outputs; a powerful option is to apply raw
and differenced inputs!
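A minimal sketch of the differencing idea, combining raw and differenced values as network inputs (illustrative Python; the price series is an assumption):

```python
import numpy as np

prices = np.array([100.0, 102.0, 101.0, 105.0])  # e.g. daily stock-market values
diffs = np.diff(prices)                           # day-to-day changes

# Raw level and its change carry different information;
# offering both as inputs is often a powerful option
inputs = np.column_stack([prices[1:], diffs])
```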
© 2012 Primož Potočnik NEURAL NETWORKS (8) Practical considerations #344
Why detrending is dangerous?
173
© 2012 Primož Potočnik NEURAL NETWORKS (8) Practical considerations #345
Example of time series preparation
1. Original data x – nonstationary mean
– nonstationary variance
2. Log(x) – stationary variance
3. Differencing – stationary mean
– stationary variance
© 2012 Primož Potočnik NEURAL NETWORKS (8) Practical considerations #346
Time series decomposition
• Time series can often be decomposed into components – Trend (T)
– Seasonal cycle (S)
– Residual (E)
• Decomposition can be additive or multiplicative – Additive: Y = T + S + E
– Multiplicative: Y = T * S * E
• Methods – X-12-ARIMA (U.S. Census Bureau, Statistical Research Divison )
– STL (Seasonal Trend Decomposition based on Loess)
174
© 2012 Primož Potočnik NEURAL NETWORKS (8) Practical considerations #347
Example of STL decomposition
Original data
Trend
Weekly cycle
Residual
Daily energy consumption
© 2012 Primož Potočnik NEURAL NETWORKS (8) Practical considerations #348
Missing data and outliers
• Handling missing data is difficult – If not many data are missing, discard missing samples
– Substituting the missing data with mean values
– Input vector with a missing single variable:
• Find similar input vectors (without missing variable) based on a distance measure
• Take the missing value as an average of the variable contained in the similar input vectors
• Outliers can appear due to – Natural variation of the variable's distribution
– Noise in data acquisition chain
– Defects
• Careful examination of the experiment is required to confirm validity of outliers – If outliers have some significance, keep them in the training data
• Some abnormality is normal! – Do not reject a point unless it is really wild
175
© 2012 Primož Potočnik NEURAL NETWORKS (8) Practical considerations #349
8.3 Selection of inputs
Importance of inputs → which inputs should be selected for best
results (classification or prediction)
Several aspects of importance:
Predictive importance
Increase in generalization error when an input is omitted from a network
Causal importance
How much the outputs change if inputs are changed (also called sensitivity)
Marginal importance
Considers inputs in isolation
Easy to compute without even training a neural net ... (Pearson correlation, rank correlation, mutual information, ...)
Marginal importance is of little practical use other than for a preliminary
description of the data
© 2012 Primož Potočnik NEURAL NETWORKS (8) Practical considerations #350
How to measure importance of inputs
• How to measure importance of inputs: very difficult! – Comparing weights in linear models can be misleading
– Comparing standardized weights in linear models can be misleading
– Comparing changes in the error function in linear models can be misleading
– Statistical p-values can be misleading
– Comparing weights in MLPs can be misleading
– Sums of products of weights in MLPs can be misleading
– Partial derivatives can be misleading
– Average partial derivatives over the input space can be misleading
– Average absolute partial derivative can be misleading
ftp://ftp.sas.com/pub/neural/importance.html
176
© 2012 Primož Potočnik NEURAL NETWORKS (8) Practical considerations #351
Methods of input selection
• Practical approach: Selection of inputs based on cross-validation
• General framework 1. Select a subset of inputs
2. Train and validate the network based on the selected subset of inputs
3. Based on the validation result, decide upon further inclusion/rejection of inputs
4. Continue iterating until good results are obtained
• Direct search methods – Exhaustive search
– Forward selection
– Backward elimination
– Selection by genetic algorithms, ...
• Pruning methods – Removing irrelevant inputs during the neural network construction
© 2012 Primož Potočnik NEURAL NETWORKS (8) Practical considerations #352
8.4 Data encoding (1/3)
• Numeric variables – No need for encoding → check the need for rescaling or standardizing
• Ordinal variables – Discrete data with natural ordering (e.g. 'small', 'medium', 'big')
– Ordinal variables can often be represented by a single variable
• Small | 1 |
• Medium | 2 |
• Big | 3 |
– Thermometer coding (using ‘dummy’ variables)
• Small | 0 0 1 |
• Medium | 0 1 1 |
• Big | 1 1 1 |
– Improved thermometer coding → faster learning
• Small | -1 -1 1 |
• Medium | -1 1 1 |
• Big | 1 1 1 |
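The thermometer codes tabulated above can be generated with a small helper (an illustrative Python sketch; the function name is my own):

```python
import numpy as np

levels = ['small', 'medium', 'big']

def thermometer(value, levels, improved=False):
    """Thermometer coding of an ordinal variable with dummy variables.
    improved=True replaces 0 with -1, which often speeds up learning."""
    k = levels.index(value) + 1          # number of 'hot' units for this level
    n = len(levels)
    low = -1.0 if improved else 0.0
    code = np.full(n, low)
    code[n - k:] = 1.0                   # fill 1s from the right, as in the table
    return code
```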
177
© 2012 Primož Potočnik NEURAL NETWORKS (8) Practical considerations #353
Data encoding (2/3)
• Categorical variables – Discrete data without ordering (e.g. 'apple', 'banana', 'orange' )
– 1-of-C coding
• Red | 0 0 1 |
• Green | 0 1 0 |
• Blue | 1 0 0 |
– 1-of-(C-1) ... if the network has bias
• Red | 0 0 |
• Green | 0 1 |
• Blue | 1 0 |
– 1-of-C coding with a softmax activation function
will produce valid posterior probability estimates
– It is very important NOT to use a single variable for an unordered
categorical target
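For unordered categorical variables, the 1-of-C coding above amounts to one dummy variable per category (illustrative Python sketch; the fruit categories follow the slide's example):

```python
import numpy as np

categories = ['apple', 'banana', 'orange']

def one_of_c(value, categories):
    """1-of-C coding: one dummy variable per category, exactly one set to 1."""
    code = np.zeros(len(categories))
    code[categories.index(value)] = 1.0
    return code
```

Exactly one unit is active per pattern, which is why a single numeric variable must NOT be used for an unordered categorical target.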
© 2012 Primož Potočnik NEURAL NETWORKS (8) Practical considerations #354
• Circular discontinuity – How to encode variables that are fundamentally circular? ...
• e.g. angle 0..360°
– Day of the week (Mon=1, ..., Sun=7) → we have a discontinuity when passing
from 7 to 1, although Sunday and Monday are very close
– Solutions
1. Discretizing and using any of the categorical coding (1-of-C)
2. Encoding with two dummy variables (sin,cos)
Data encoding (3/3)
[Figure: circular encoding with two dummy variables, sin and cos]
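The sin/cos solution can be sketched as follows (illustrative Python; the day-of-week mapping follows the slide's Mon=1 ... Sun=7 convention):

```python
import numpy as np

def encode_circular(value, period):
    """Encode a circular variable with two dummy variables (sin, cos),
    so values adjacent on the circle stay close in the encoded space."""
    angle = 2 * np.pi * (value - 1) / period
    return np.array([np.sin(angle), np.cos(angle)])

mon = encode_circular(1, 7)   # Monday
sun = encode_circular(7, 7)   # Sunday, adjacent to Monday on the circle
```

Unlike the raw 1..7 coding, Sunday and Monday end up closer to each other than to mid-week days.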
178
© 2012 Primož Potočnik NEURAL NETWORKS (8) Practical considerations #355
8.5 Principal component analysis
• In some situations, dimension of the input vector is large, but the components of the vectors are highly correlated
• It is useful in this situation to reduce the dimension of the input vectors → feature extraction
• An effective procedure for performing this operation is principal component analysis (PCA)
• PCA is a vector space transform used to reduce multidimensional data to lower dimensions for analysis
• PCA method generates a new set of variables, called principal components
Calculation of principal components
• Input matrix X is represented as a linear combination of principal components
• Projection vectors αp are eigenvectors of the covariance matrix XX^T
• Each principal component zp is obtained as the product of the input matrix with a projection vector
• Each principal component is a linear combination of the original variables
• All principal components are orthogonal to each other, so there is no redundant information
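The calculation above can be sketched with NumPy (an illustration, not course code; the rows-as-samples layout and the synthetic correlated data are assumptions):

```python
import numpy as np

# Illustrative sketch: principal components from the eigenvectors of
# the covariance matrix of mean-centered data (rows = samples).
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
X = np.column_stack([x1, 0.95 * x1 + 0.05 * rng.normal(size=200)])  # correlated

Xc = X - X.mean(axis=0)                 # center the data
C = Xc.T @ Xc / (len(Xc) - 1)           # covariance matrix
eigvals, eigvecs = np.linalg.eigh(C)    # eigenvectors = projection vectors
order = np.argsort(eigvals)[::-1]       # sort by decreasing variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

Z = Xc @ eigvecs                        # principal components z_p
print(eigvals / eigvals.sum())          # share of variance per component
```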
PCA example
• Original data: x1, x2
• Principal components: z1, z2
• Variability: z1 → 95%, z2 → 5%
• Benefit – Dimensionality reduction by using only the first principal component (z1) instead of the original 2D data (x1, x2)
Properties of principal components
• Principal components form an orthogonal basis for the data
• 1st principal component → the variance of this variable is the maximum among all possible choices of the first axis
• 2nd principal component → perpendicular to the 1st principal component; the variance of this variable is the maximum among all possible choices of this second axis
• Often, the first few principal components explain the majority of the total variance → these few new variables can be taken as low-dimensional input to the neural network instead of the high-dimensional original data
How to use PCA for neural networks
1. Load original data X
2. Compute principal components z
3. Plot variance explained
4. Decide how much variance to keep ... 90%, 95%?
5. Keep only the selected principal components, discard the rest → data dimensionality is reduced
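The five steps above can be sketched as one function (illustrative NumPy; the `reduce_dimension` name and the 95% default threshold are assumptions):

```python
import numpy as np

# Illustrative sketch of the five steps: center, compute components,
# measure explained variance, choose a cut-off, keep only those columns.
def reduce_dimension(X, keep_variance=0.95):
    Xc = X - X.mean(axis=0)                          # 1. load + center data
    eigvals, eigvecs = np.linalg.eigh(np.cov(Xc.T))  # 2. principal components
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    explained = np.cumsum(eigvals) / eigvals.sum()   # 3. variance explained
    n = int(np.searchsorted(explained, keep_variance)) + 1  # 4. how much to keep
    return Xc @ eigvecs[:, :n]                       # 5. keep n components only

rng = np.random.default_rng(1)
base = rng.normal(size=(300, 2))
X = base @ rng.normal(size=(2, 10))   # 10-D data with intrinsic dimension 2
print(reduce_dimension(X).shape)      # expect far fewer than 10 columns
```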
Intrinsic dimensionality
• Suppose we apply PCA to d-dimensional data and discover that the first n eigenvalues have significantly larger values than the rest (n < d)
• Consequently, the data can be represented with high accuracy by the first n principal components → the effective dimensionality is only n
• Generally, data in d dimensions have intrinsic dimensionality n if the data lie entirely within an n-dimensional subspace
Neural nets for dimensionality reduction
• Multilayer feedforward neural networks can be used to perform
nonlinear dimensionality reduction
• Auto-associative multilayer perceptron with extra hidden layers
can perform a general nonlinear dimensionality reduction
[Figure: auto-associative network for nonlinear dimensionality reduction; number of neurons 1024-300-50-300-1024, 32 x 32 pixel input reconstructed as 32 x 32 pixel output]
8.6 Invariances and prior knowledge
• In many practical situations we have, in addition to the
data itself, also a priori knowledge – General information about the form of the mapping
– Prior probabilities of class membership
– Information about constraints
– Knowledge about invariances
• How to build invariances into neural networks?
1. Invariance by neural network structure
• Shared weights, Higher-order neural networks
2. Invariance by training
• Include a large number of translated inputs to train NN
3. Invariant feature space
• Extract features that are invariant for the problem considered
• Review of the Lecture NN-02 feature extraction ...
Handwritten character recognition problem
• Recognize handwritten characters ‘a’ and ‘b’
• Image representation – Grid of pixels (typically 256x256) → 65536 inputs
– Gray level [0..1] (typically 8-bit coding)
• Extraction of the features:
F1 = character height / character width
F2 = closed area / character height
• Solving two problems
1. Invariance problem (translations)
2. Curse of dimensionality problem
8.7 Generalization
• The goal of network training is not to fit the data exactly but to build a statistical model of the process that generates the data
• A well-trained network is able to generalize → it makes good predictions on new inputs
– Here it is assumed that the test data are drawn from the same population used to generate the training data
• A neural network designed to generalize well will produce a correct input-output mapping even if new inputs differ slightly from the samples used to train the network
• Overfitting problem → the neural net learns the complete training set but not the underlying function
Generalization in classification
• The task of our network is to learn a classification decision boundary
• If we know that training data contains noise, we don’t necessarily want the training data to be classified totally accurately, as that is likely to reduce the generalization ability
[Figure: classification decision boundaries, good generalization vs. overfitting]
Generalization in function approximation
• Function approximation based on noisy data samples
• We can expect the neural network output to give a better representation of the underlying function if its output curve does not pass through all the data points
• Again, allowing a larger error on the training data is likely to lead to better generalization
[Figure: function approximation, good generalization vs. overfitting]
Overfitting, underfitting
• Overfitting – Neural network perfectly learns the training data but gives poor results on test data
• Underfitting – Neural network is unable to properly learn the data due to an insufficient number of neurons or due to extreme regularization
– Such a network also generalizes poorly
[Figure: underfitting vs. overfitting]
Improving generalization
How to prevent underfitting
1. Provide enough hidden units – to represent the required mappings
2. Train the network for long enough – so that the sum-squared-error cost function is sufficiently minimised
How to prevent overfitting
3. Design the training data properly – use a large training set
4. Cross-validation – check generalization ability on test data
5. Early stopping – stop before the NN has had time to learn the training data too well
6. Restrict the number of adjustable parameters of the network
a) Reduce the number of hidden units, or
b) Force connections to share the same weight values
7. Add a regularization term to the error function to encourage smoother network mappings
8. Add noise to the training patterns to smear out the data points
Cross-validation
• Cross-validation is used to estimate generalization error
based on “resampling”
• Available data are randomly partitioned into a
– Training set, and
– Test set
• Training set is further partitioned into
– Estimation subset, used to train the model
– Validation subset, used to validate the model
• Training set is used to build and validate various candidate models and to choose the “best” one
• Generalization performance of the selected model is tested on
the test set which is different from the validation subset
Variants of cross-validation
• If only a small set of data exists ...
• Multifold cross-validation – Divide the available N samples into K subsets
– Model is trained on all subsets except one
– Validation error is measured on the subset left out
– Procedure is repeated K times
– Model performance is obtained by averaging the K trials
• Leave-one-out cross-validation – Extreme form of cross-validation
– N-1 samples are used for training
– Model is validated on the sample left out
– Procedure is repeated N times
– Result is averaged over the N trials
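The multifold procedure above can be sketched generically (illustrative Python; `train` and `validate` are placeholder callables standing in for an actual network fit and error measure):

```python
# Illustrative sketch of K-fold (multifold) cross-validation.
def k_fold_errors(samples, K, train, validate):
    folds = [samples[i::K] for i in range(K)]        # K roughly equal subsets
    errors = []
    for k in range(K):
        held_out = folds[k]                          # validation subset
        rest = [s for i, f in enumerate(folds) if i != k for s in f]
        model = train(rest)                          # train on K-1 subsets
        errors.append(validate(model, held_out))     # error on the left-out fold
    return sum(errors) / K                           # average over K trials

# Toy usage: the 'model' is just the training mean, error is squared deviation.
data = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
mean = lambda xs: sum(xs) / len(xs)
err = lambda m, xs: mean([(x - m) ** 2 for x in xs])
print(k_fold_errors(data, K=3, train=mean, validate=err))
```

Setting K = N (folds of one sample each) gives leave-one-out cross-validation.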
Early stopping
• Neural networks are often set up with more than enough parameters which
can cause over-fitting
• For the iterative gradient descent based network training procedures
(backpropagation, conjugate gradients, ...), the training set error will
naturally decrease with increasing numbers of epochs of training
• The error on the unseen validation and testing data sets, however, will start
off decreasing as the under-fitting is reduced, but then it will eventually
begin to increase again as over-fitting occurs
• The natural solution to get the best
generalization, i.e. the lowest error
on the test set, is to use the
procedure of early stopping
Early stopping procedure
• How to perform learning with early stopping?
1. Divide the training data into estimation and validation subsets
2. Use a large number of hidden units
3. Use very small random initial values
4. Use a slow learning rate
5. Compute the validation error rate periodically during training
6. Stop training when the validation error rate starts increasing
• Since the validation error is not a good estimate of the generalization error, a third, independent test set must be used to estimate generalization performance
• Available data are divided as in cross-validation
– Training set
• Estimation subset
• Validation subset
– Test set
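The stopping rule (steps 5 and 6) can be sketched as a training loop (illustrative Python; the patience counter and the toy validation curve are assumptions, not from the slides):

```python
# Illustrative sketch of early stopping: track the validation error each
# epoch and stop once it has stopped improving for `patience` epochs.
def train_with_early_stopping(epochs, validation_error, patience=3):
    best_epoch, best_error, bad_epochs = 0, float("inf"), 0
    for epoch in range(epochs):
        err = validation_error(epoch)     # computed periodically (step 5)
        if err < best_error:
            best_error, best_epoch, bad_epochs = err, epoch, 0
        else:
            bad_epochs += 1               # validation error is rising
            if bad_epochs >= patience:    # stop training (step 6)
                break
    return best_epoch                     # restore weights from this epoch

# Toy validation curve: falls until epoch 20, then rises as over-fitting sets in.
print(train_with_early_stopping(100, lambda e: (e - 20) ** 2))
```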
Practical considerations of early stopping
• One potential problem: validation error may go up and down numerous times during training → the safest approach is generally to train to convergence, saving the weights at each epoch, and then go back to the weights at the epoch with the lowest validation error
• Early stopping resembles regularization with weight decay which
indicates that it will work best if the training starts with very small
random initial weights
• General practical problems
– How to best split available training data into training and validation subsets?
– What fraction of the patterns should be in the validation set?
– Should the data be split randomly, or by some systematic algorithm?
Such issues are problem dependent ...
Default MATLAB data division (train, validation, test): 70%, 15%, 15%
Weight restriction and weight sharing
• Perhaps the most obvious way to prevent over-fitting in neural
networks is to restrict the number of free parameters
• The simplest solution is to restrict the number of hidden units, as this
will automatically reduce the number of weights. Optimal number for
a given problem can be determined by cross-validation.
• Alternative solution is to have many weights in the network, but
constrain certain groups of them to be equal
a) If there are symmetries in the problem, we can enforce hard weight sharing by
building them into the network in advance
b) In other problems we can use soft weight sharing, where sets of weights are encouraged to have similar values by the learning algorithm → one way to implement soft weight sharing is to add an appropriate term to the error function → regularization
Regularization
• The regularization technique encourages smoother network mappings by adding a penalty term Ω to the standard (sum-squared-error) cost function:
E = Esse + λ Ω
• The regularization parameter λ controls the trade-off between reducing the error Esse and increasing the smoothing
• This modifies the gradient descent weight updates accordingly
• The resulting neural network mapping is a compromise between fitting the data and minimizing the regularizer Ω
Regularization by weight decay
• One of the simplest forms of regularizer is called weight decay; it consists of the sum of squares of the network weights: Ω = ½ Σ wi²
• In conventional curve fitting this regularizer is known as ridge regression. We can see why it is called weight decay when we observe the extra term (proportional to −λw) in the weight updates → in each epoch the weights decay in proportion to their size
• Empirically, this leads to significant improvements in generalization. Weight decay keeps the weights small and hence the mappings are smooth
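The decay effect can be seen numerically on a single weight (a minimal sketch under assumed toy values; the quadratic error and the chosen learning rate are illustrations, not from the slides):

```python
# Illustrative sketch: gradient descent on one weight, with and without
# the weight-decay term lam * w added to the error gradient.
def train_weight(w, grad, lr=0.1, lam=0.0, steps=200):
    for _ in range(steps):
        w -= lr * (grad(w) + lam * w)   # extra term: weight decays with its size
    return w

grad = lambda w: w - 5.0                # gradient of 0.5*(w - 5)^2, minimum at 5
plain = train_weight(0.0, grad)         # converges toward 5
decayed = train_weight(0.0, grad, lam=0.5)  # pulled toward a smaller weight
print(plain, decayed)
```

The decayed run settles below the unregularized optimum, which is exactly the bias toward small weights (and hence smoother mappings) described above.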
Training with noise / Jittering
• Adding noise (jitter) to the inputs during training was also found
empirically to improve network generalization
• Noise will ‘smear out’ each data point and make it difficult for the
network to fit the individual data points precisely, and consequently
reduce over-fitting
• Jittering is accomplished by generating new training inputs from the original inputs plus small amounts of noise. Adding jitter to the targets will not change the optimal weights; it will just slow down training.
• Jittering is also closely related to regularization methods such as
weight decay and ridge regression
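Jittering the inputs can be sketched as a simple data-augmentation step (illustrative Python; the `jitter` helper, the Gaussian noise, and the parameter values are assumptions):

```python
import random

# Illustrative sketch: enlarge the training set by adding small Gaussian
# noise to copies of the original inputs; the targets are reused unchanged.
def jitter(inputs, targets, copies=5, sigma=0.05, seed=0):
    rng = random.Random(seed)
    new_x, new_y = list(inputs), list(targets)
    for _ in range(copies):
        for x, y in zip(inputs, targets):
            new_x.append(x + rng.gauss(0.0, sigma))  # smear the input
            new_y.append(y)                          # target stays the same
    return new_x, new_y

xs, ys = jitter([0.1, 0.5, 0.9], [0, 1, 0])
print(len(xs))  # original 3 points plus 5 jittered copies of each
```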
Generalization summary
Preventing underfitting
1. Provide enough hidden units
2. Train the network for long enough
Preventing overfitting
3. Design the training data properly
4. Cross-validation
5. Early stopping
6. Restrict the number of adjustable parameters
7. Regularization
8. Jittering
8.8 General guidelines
General guidelines for designing successful neural
network solutions:
1. Understand and specify your problem
2. Acquire and analyze data, define inputs and outputs, remove outliers, apply
preprocessing methods (rescale, standardize, normalize), properly encode
outputs, ...
3. Acquire prior knowledge and apply it in terms of feature selection, feature
extraction, selection of neural network type, neural network complexity, etc.
4. Start with simple neural network architectures – few layers, few neurons
5. Train the network and make sure it performs well on its training data.
If this doesn’t work, increase the complexity of the network.
6. Test its generalization by checking its performance on new test data.
If this doesn’t work, check your data, check partitioning of data into train/test
sets, check and modify network architecture, ...