Multilayer Perceptrons - Sahand University of Technologyfa.ee.sut.ac.ir/Downloads/AcademicStaff/1/Courses/63/Lect4-2.pdf · Typically, binary target values are used: d kj = 1 when

Afshin EbrahimiAfshin EbrahimiFaculty of Electrical EngineeringSahand University of Technology

Multilayer PerceptronsMultilayer Perceptrons

In theory we need M outputs for an M class classification problem In theory, we need M outputs for an M‐class classification problem to represent all possible classification decisions.

Let xj denote the jth m‐dimensional prototype to be classified by a jmultilayer perceptron (MLP) network.

Let us denote by Ck the kth class. Denote the kth output of the network by Denote the kth output of the network by

yk,j = Fk(xj), k = 1, . . . ,Mcorresponding to the prototype xj .

f The function Fk(.) is the corresponding input‐output mapping learned by the network.

We can present these M mappings conveniently in vector formWe can present these M mappings conveniently in vector formyj = F(xj)

Where yj = [y1,j , y2,j , . . . , yM,j ]T ,F( ) [F ( ) F ( ) F ( )]TF(xj) = [F1(xj), F2(xj), . . . , FM(xj)]T .

2009 Afshin Ebrahimi, Sahand University of Technology 2

Basic question: what should be the optimum decision rule for classifying the M outputs of a MLP network after training?fy g p f f g

The continuous vector‐valued function y = F(x) minimizes the empirical risk functional

Here dj is again the desired (target) output pattern for the prototype xj .

N is the total number of training vectors (prototypes) N is the total number of training vectors (prototypes). The risk R is in essence similar to the average squared error was used as a cost function in deriving the back‐propagation

avav

algorithm in Section 4.3.


Typically, binary target values are used:dkj = 1 when xj belongs to class Ck,j jdkj = 0 when xj does not belong to class Ck.

Thus the class Ck is represented by the M‐dimensional target vectorvector

[0, . . . , 0, 1, 0, . . . , 0]T

This is the kth unit vector; only the kth element 1 is nonzero.k b k f h h In Haykin’s book, justifications are given showing that a MLP

classifier approximates the a posteriori class probabilities. A posteriori probability for the class Cj is the probability that a A posteriori probability for the class Cj is the probability that a

vector x with an unknown class actually belongs to the jth class.


Prerequisites for this result:‐ The logistic nonlinearity is used.‐ The size of the training set is large enough.‐ Back‐propagation learning does not get stuck at a local minimum.

Hence an appropriate decision rule is the (approximate) Bayes rule Hence an appropriate decision rule is the (approximate) Bayes rule generated by the a posteriori class probability estimates:

Classify x to the class Ck if

If the underlying posterior class distributions are distinct, a unique largest output value exists with probability 1.largest output value exists with probability 1.

Some less important comments on the derived approximate Bayes rule have been presented at the end of the section 4.7.


In back‐propagation learning, as many training examples as possible are typically used.

It is hoped that the network so designed generalizes well. A network generalizes well when its input‐output mapping is

(almost) correct also for test data(almost) correct also for test data. The test data is not used in creating or training the network. Assumption: test data comes from the same population

d b h d(distribution) as the training data. Training of a neural network may be viewed as a curve fitting

(nonlinear mapping) problem.(nonlinear mapping) problem. The network can simply be considered as a nonlinear input‐output

mapping.


G li ti b t di d i t f th li Generalization can be studied in terms of the nonlinear interpolation ability of the network.

MLP networks with continuous activation functions perform puseful interpolation because they have continuous outputs.

Figure 4.19a shows an example of good generalization for a data vector not used in trainingvector not used in training.

Using too many training examples may lead to a poor generalization ability.y

This is called overfitting or overtraining. The network then learns even unwanted noise present in the

training datatraining data. More generally, it learns “features” which are present in the training

set but actually not in the underlying function to be modeled.


S b f l See Figure 4.19b for an example. Basic reason for overfitting: there are more hidden neurons than

necessary in the network.ecessa y t e et o k Similar phenomena appear in other modeling problems if the

chosen model is too complicated, containing too many free tparameters.

For example least‐squares fitting, autoregressive modeling, etc. Occam’s razor principle in model selection: select the simplest p p p

model which describes the data adequately. In the neural network area, this implies choosing the smoothest

function that approximates the input output mapping for a given function that approximates the input‐output mapping for a given error criterion.

Such a choice generally demands the fewest computational resources.




Three factors affect generalization:Three factors affect generalization:1. The size and representativeness of the training set.2. The architecture of the neural network. Th h i l l it f th bl t h d3. The physical complexity of the problem at hand.

Only the first two factors can be controlled. The issue of generalization may be viewed from two different g y

perpectives: –The architecture of the network is fixed. Determine the size of the

training set for a good generalizationtraining set for a good generalization. –The size of the training set is fixed. Determine the best architecture of

network for achieving a good generalization.f g g g Here we focus on the first viewpoint, hoping that the fixed

architecture matches the complexity of the problem. Distribution free worst case formulas are available for estimating the Distribution‐free, worst‐case formulas are available for estimating the

size of sufficient training set for a good generalization performance.2009 Afshin Ebrahimi, Sahand University of Technology 11

See section 2 14; skipped in this course See section 2.14; skipped in this course. However, these formulas give often poor results. A practical condition for a good generalization: The size N of the training set must satisfy the condition

Here W is the total number of free parameters (weights and biases) in the network.

f f f denotes the fraction of classification errors permitted on test data (as in pattern classification).

O(.) is the order of quantity enclosed within it.

( ) q y Example: If an error of 10% is permitted, the number of training examples

should be about 10 times the number of free parameters. Justifications for this empirical rule are presented in the next section Justifications for this empirical rule are presented in the next section.


A MLP network trained with back‐propagation is a practical tool for performing a general nonlinear input‐output mapping.

Let m0 be the number of input nodes (neurons), and M = mL the number of output nodes.

The input‐output mapping of the MLP network is from m ‐ The input output mapping of the MLP network is from m0dimensional input space to M‐dimensional output space.

If the activation function is infinitely continuously differentiable, h lthe mapping is also.

A fundamental question: What is the minimum number of hidden layers in a MLP network providing an approximate realization of any layers in a MLP network providing an approximate realization of any continuous mapping?


h l h f l The universal approximation theorem for a nonlinear input‐output mapping provides the answer.

The theorem is presented in Haykin’s book, pp. 230‐231.e t eo e s p ese ted ay s boo , pp 30 3 Its essential contents are as follows: Let be a nonconstant, bounded, and monotonically

i i ti f ti(.)

increasing continuous function. Let Im0 denote the m0‐dimensional unit hypercube [0, 1]m0 , and

C(Im0 ) the space of continuous functions on Im0 .( m0 ) p m0 For any given function f C(Im0 ) and , there exist an integer

M and sets of real constants , bi and wij , where i = 1, . . . ,m1 and j 1 m so that

0i

= 1, . . . ,m0 so that:

is an approximate realization of the function f(.).


That is, for all x1, x2, . . . , xm0 that lie in the input space.

Th i l i ti th i di tl li bl t ltil The universal approximation theorem is directly applicable to multilayer preceptrons.

The logistic function is a nonconstant, bounded, )]exp(1/[1)( vv and monotonically increasing function.

Furthermore, Eq. (4.86) represent the output of a MLP network described as follows:

1. The network has m0 input nodes with inputs x1, x2, . . . , xm0 , and a single hidden layer consisting of m1 neurons.

2. Hidden neuron i has synaptic weights wi1, . . . ,wim0 , and bias bi.2. Hidden neuron i has synaptic weights wi1, . . . ,wim0 , and bias bi. 3. The network output is a linear combination of the outputs of the

hidden neurons, with defining the weights of the output layer. The universal approximation theorem is an existence theorem

11,..., m The universal approximation theorem is an existence theorem.


In effect, the theorem states that a MLP network with a single hidden layer is sufficient for uniform approximation g y ppwith accuracy .

However, the theorem does not say that a single hidden l i i l i h

layer is optimal with respect to:

‐ learning time‐ ease of implementationease of implementation‐ generalization ability (most important property).Bounds on Approximation Errorspp This theoretical treatment is not essential in this course. A result worth mentioning: the size of the hidden layer m1

t b l f tti d i timust be large for getting a good approximation.


This part is skipped, too, though it contains some interesting results.

One important matter: multilayer perceptron are more effective than for example polynomials or trigonometric f i i i ifunctions in approximation.

That is, the number of terms required for sufficient approximation grows slower with the dimension of the approximation grows slower with the dimension of the problem.

The reason is basically that nonlinearities are used in an yefficient way in MLP networks.


The universal approximation theorem is important from a theoretical viewpoint.

It gives a rigorous mathematical foundation for using multilayer perceptrons in approximating nonlinear mappings.

However the theorem is not constructive However, the theorem is not constructive. It does not actually tell how to specify a MLP network with the

stated approximation properties.d h h l Some assumptions made in the theorem are unrealistic in most

practical applications:‐The continuous function to be approximated is given.The continuous function to be approximated is given.‐A hidden layer of unlimited size is available. A problem with MLP’s using a single hidden layer: the hidden

t d t i t t l b llneurons tend to interact globally.


In complex situations improving the approximation at one point In complex situations, improving the approximation at one point typically worsens it at some other point.

With two hidden layers, the approximation (nonlinear mapping) b blprocess becomes more manageable.

One can proceed as follows:1. Local features are extracted in the first hidden layer.f f y‐The input space is divided into regions by some neurons.‐Other neurons in the first hidden layer learn the local features

h t i i th icharacterizing these regions.2. Global features corresponding to each region are extracted in the

second hidden layer.y‐Neurons in this layer combine the outputs of the neurons describing

certain region. This procedure somehow corresponds to piecewise polynomial This procedure somehow corresponds to piecewise polynomial

(spline) approximation in curve fitting.2009 Afshin Ebrahimi, Sahand University of Technology 19

It is hoped that a MLP network trained with back‐propagation learns enough from the past to generalize to the future.

How the network parameterization should be chosen for a How the network parameterization should be chosen for a specific data set?

This is a model selection problem: choose the best one of a set of candidate model structures (parameterizations).

A useful statistical technique for model selection: cross‐validation. The available data set is first randomly partitioned into a training The available data set is first randomly partitioned into a training

set and a test set. The training set is further partitioned into two disjoint subsets:

i i b d l h d l– Estimation subset, used to select the model.–Validation subset, used to test or validate the model. The motivation is to validate the model on a data set different The motivation is to validate the model on a data set different

from the one used for parameter estimation.2009 Afshin Ebrahimi, Sahand University of Technology 20

In this way the training set can be used to assess the performance of In this way, the training set can be used to assess the performance of various model.

The “best” of the candidate models is then chosen. This procedure ensures that a model which might in the worst case

end up with overfitting the validation subset is not chosen. The use of cross‐validation is appealing when one should design a The use of cross validation is appealing when one should design a

large network with good generalization ability. For MLP networks, cross‐validation can be used to determine:

f‐ the optimal number of hidden neurons.‐when it is best to stop training.

These matters are described in the next subsections:These matters are described in the next subsections: ‐Model Selection ‐ Early Stopping Method of Training

V i t f C V lid ti ‐Variants of Cross‐Validation They are skipped in this course.


Back‐propagation (BP) is the most popular algorithm for supervised training of multilayer perceptrons (and neural networks in general).

It is basically a gradient (derivative) technique, not an optimization methodmethod.

Back‐propagation has two distinct properties:– It is simple to compute locally.

f h i di d i i h– It performs stochastic gradient descent in weight space. These two properties are responsible for the advantages and

disadvantages of back‐propagation learning.disadvantages of back propagation learning.


Connectionism BP is an example of connectionist paradigms. They use local computations only for processing information in a neural network.

l d f l f Locality constraint: a computing neuron needs information only from neurons connected physically to it.

Local computations are preferred in the design of artificial neural networks for th i i l three principal reasons:1. Biological neural networks use local computations.2. Local computations are fault‐tolerant against hardware errors.3 Local computations favor the use of computationally efficient parallel 3. Local computations favor the use of computationally efficient parallel architectures.

BP has been realized using VLSI architectures and parallel computers (point 3). Also point 2 holds on certain conditions Also point 2 holds on certain conditions. However, back‐propagation learning is not biologically plausible (point 1) for

several reasons given in Haykin’s book, p. 227. Anyway, BP learning is important from an engineering point of viewAnyway, BP learning is important from an engineering point of view.


Feature Detection The hidden neurons of a multilayer perceptron trained by BP act

as feature detectors (Section 4.9). This property can be exploited by using a MLP network as a

replicator or identity mapreplicator or identity map. Figure 4.23 shows how this can be accomplished in the case of a

single hidden layer.l iStructural constraints:

‐The input and output layers have the same size, m.‐The size of the hidden layer, M, is smaller than m.The size of the hidden layer, M, is smaller than m.‐ The network is fully connected.‐ The desired response is the same as the input vector x.

Th t l t t ˆ i th ti t f The actual output ˆx is the estimate of x.


The network is trained otherwise normally by using as an error vector e = x − ˆx.

Actually this is a form of unsupervised learning (there is no teacher).Th li k f d i i i The replicator network performs data compression in its hidden layer.

After learning the network provides a coding/decoding After learning, the network provides a coding/decoding system for the input vectors x.

See Figures 4.23 (a), (b), and (c).g



Function Approximation A multilayer perceptron trained with the back‐propagation

algorithm is a nested sigmoidal scheme. Its output vector y can be written in the form

W1, . . . ,WL are the weight matrices of the L layers of the network, l d bincluding bias terms.

fi(.) are vector‐valued functions; their each component is the common sigmoid activation function .(.)common sigmoid activation function .

W denotes the set of all the weight matrices. The dimensions of the weight matrices Wi and vectors fi are

ll diff t i diff t l i L

(.)

generally different in different layers i = 1, . . . ,L.


However, they must fit each other. The mapping F(x,W) is an universal approximator. Multilayer perceptrons can approximate not only smooth Multilayer perceptrons can approximate not only smooth,

continuously differentiable functions, but also piecewise differentiable functions.

Computational Efficiency The computational complexity of an algorithm is usually measured

by counting the number of multiplications, additions and storage by counting the number of multiplications, additions and storage required per iteration.

The back‐propagation algorithm is computationally efficient.Thi th t it t ti l l it i l i l This means that its computational complexity is polynomial as a function of adjustable parameters.

If a MLP network contains a total of W weights, the computational g , pcomplexity of the BP algorithm is linear in W.


Sensitivity Analysis The sensitivity analysis of the input‐output mapping provided by The sensitivity analysis of the input output mapping provided by

BP can be carried out efficiently. A more detailed discussion of this property is skipped in our

course.Robustness The back‐propagation algorithm is locally robust on certain The back propagation algorithm is locally robust on certain

conditions. This means that disturbances with small energy can only give rise

t ll ti ti to small estimation errors.


Convergence The back‐propagation algorithm uses an instantaneous estimate

of the gradient of the error surface. The direction of the instantaneous gradient fluctuates from

iteration to iterationiteration to iteration. Such stochastic approximation algorithms converge slowly. Some fundamental causes of the slow convergence:

f h f f l fl l h d 1. If the error surface is fairly flat along a weight dimension, many iterations may be required to reduce the error significantly.

2. The direction of the negative instantaneous gradient vector may 2. The direction of the negative instantaneous gradient vector may point away from the minimum.

‐This leads to a correction in a wrong direction for that iteration.


The slow convergence of the back‐propagation algorithm may make it computationally excruciating.

Basic reason: large‐scale neural network training problems are inherently very difficult and ill‐conditioned.

No supervised learning strategy is alone feasible. p g gy

Suitable preprocessing may be necessary in practice.

For example Principal Component Analysis or whitening For example Principal Component Analysis or whitening.

See Haykin’s Chapter 8, and a problem in exercises.


Local Minima

The error surface has local minima in addition to the global The error surface has local minima in addition to the global minimum.

It is clearly undesirable if the learning process terminates at a It is clearly undesirable if the learning process terminates at a local minimum.

E i ll if thi i l t d f b th l b l i i ‐ Especially if this is located far above the global minimum.

Basic reason for the existence of local minima: nonlinearities.

If linear activation functions were used in BP, no local minima exist, but the network can then learn only linear mappings.


Scaling Scaling problem: How well the network behaves as the g p

computational task increases in size and complexity? One can typically consider:

Th i i d f i i ‐The time required for training. ‐The best attainable generalization performance. There exists many possible ways to measure the complexity There exists many possible ways to measure the complexity

or size of a computational task. Most useful measure: predicate order.p Predicate is a binary function having only two values 0

and 1, or FALSE and TRUE.)(X


An empirical study: how well a MLP network trained with back‐propagation learns the parity function

numberoddanisXifX ,1)(

biXifX 0)(

The order of the parity function is equal to the number of inputs.d h h d l h f

numberevenanisXifX ,0)(

It turned out that the time BP required to learn the parity function scales exponentially with the number of inputs.

An effective method of alleviating the scaling problem:An effective method of alleviating the scaling problem: Incorporate prior knowledge into the design of the network.


The supervised training of a multilayer perceptron (MLP) is now viewed as a numerical optimization problem.

The error surface of a MLP is a highly nonlinear function of the weight vector w.

Let E (w) denote the cost function averaged over the training Let Eav(w) denote the cost function, averaged over the training sample set.

Recall now from Section 3.3 the Taylor series expansion of a scalar f f h f hfunction E(w) of (the components of) the vector w:

Here is a (small) correction or update term.w


g is the gradient vector of E(w), and H its Hessian matrix, both evaluated at the point w.

When applied to Eav(w) with dependences on n included, the Taylor series expansion becomes

h l l d d f d b l h The local gradient g(n) is defined by evaluating the quantity at the point w = w(n).

The local Hessian matrix H(n) is similarly defined by

l d l h i i ( )evaluated also at the operating point w = w(n).


Because the ensemble averaged cost function Eav(w) is used here, a batch mode of learning is presumed.

In the steepest descent method, the adjustment applied to the weight vector w(n) is defined by the negative gradient vector

)(nw

where is the learning‐rate parameter.l f h h b k l h h b h

An example of this is the back‐propagation algorithm in the batch

mode. In effect, the steepest descent method uses a linear In effect, the steepest descent method uses a linear

approximation of the cost function around the operating point There the gradient vector is the only source of local information

b t th f

)(nw

about the error surface.


Advantage: simplicity of implementation. Drawback: convergence can be very slow in large‐scale problems. Inclusion of the momentum term is a crude attempt to use some

second‐order information about the error surface. It helps somewhat but makes the training process more delicate It helps somewhat, but makes the training process more delicate

to manage. Reason: the designer must “tune” one more parameter.

f l h d f For improving significantly the convergence speed of MLP training, one must use higher‐order information.

This can be done by using a quadratic approximation of the error This can be done by using a quadratic approximation of the error surface around the current point w(n).

It is easy to see that the optimum value of the update of th i ht t ( ) i

)(nwthe weight vector w(n) is


Here it is assumed that the inverse H−1(n) of the Hessian matrix H(n) exists.

The above formula is the essence of Newton’s method (Section 3.3)

However the practical application of Newton’s method to However, the practical application of Newton s method to supervised training of MLP’s is handicapped by the following factors:

l l h h h b–One must calculate the inverse Hessian H−1(n), which can be computationally expensive.

–The Hessian matrix H(n) must be nonsingular so that H−1(n) exists. The Hessian matrix H(n) must be nonsingular so that H (n) exists. This condition does not necessarily always hold.

–When the cost function Eav(w) is nonquadratic, there is no t f th f N t ’ th dguarantee for the convergence of Newton’s method.


Some of these difficulties can be overcome by using a quasi‐Newton method.

This requires an estimate of the gradient vector g only. However, even the quasi‐Newton methods are computationally

too expensive except for the training of very small‐scale neural too expensive except for the training of very small scale neural networks.

They are described at the end of Section 4.18; skipped in our course.

Another class of second‐order optimization methods: conjugate‐gradient methods.gradient methods.

Somewhat intermediate between the steepest descent and Newton’s method.


They need not the Hessian matrix, avoiding the difficulties associated with its evaluation, storage, and inversion.

Essential idea of conjugate‐gradient methods: a more sophisticated j g g pupdate direction than the gradient is used.

Conjugate‐gradient methods are applicable also to large‐scale problems involving hundreds or thousands of adjustable parametersinvolving hundreds or thousands of adjustable parameters.

They are well suited for the training of a MLP network, too. Conjugate‐gradient methods are discussed on pp. 236‐243. A summary of a nonlinear conjugate gradient algorithm for training a A summary of a nonlinear conjugate‐gradient algorithm for training a

MLP is presented in Table 4.8 on page 243. A detailed discussion is skipped in this course due to the lack of time.

N t Th G N t th d di d i S ti i il bl i Note: The Gauss‐Newton method discussed in Section 3.3 is available in MATLAB for training MLP networks.

The stabilized Gauss‐Newton formula (3.23) is called there Levenberg‐Marquardt’s algorithm.


Documents

Multilayer Perceptrons - Sahand University of Technologyfa.ee.sut.ac.ir/Downloads/AcademicStaff/1/Courses/63/Lect4-2.pdf · Typically, binary target values are used: d kj = 1 when